{ "cells": [ { "cell_type": "markdown", "id": "e765555c", "metadata": {}, "source": [ "\n", "# Retrieval-Augmented **Fine-Tuning** (RAFT) Dataset Generation with **Local Ollama**\n", "\n", "This notebook builds a **supervised fine-tuning dataset** (JSONL) for *retrieval-augmented* tasks, by:\n", "1. **Ingesting** your local corpus (Markdown, text, HTML; PDFs optional with extra deps).\n", "2. **Chunking** and **embedding** documents using Ollama's local **embedding model** (e.g., `nomic-embed-text`, `mxbai-embed-large`).\n", "3. Building a **lightweight vector index** (FAISS).\n", "4. **Sampling contexts** and using a local **generation model** via Ollama (e.g., `llama3.1`, `qwen2`, `phi3`) to synthesize **grounded Q&A** or instruction–response pairs.\n", "5. Emitting a **RAFT-style JSONL** for supervised training (e.g., `input`, `output`, `meta` with source citations).\n", "\n", "> **Requirements**\n", ">\n", "> - Local [Ollama](https://ollama.com/) running at `http://localhost:11434`\n", "> - At least one **embedding** model pulled (e.g., `ollama pull nomic-embed-text`)\n", "> - At least one **generation** model pulled (e.g., `ollama pull llama3.1`)\n", ">\n", "> You can adapt the prompts and schema for your specific downstream trainer (Llama.cpp, vLLM, Axolotl, mlx, etc.).\n" ] }, { "cell_type": "markdown", "id": "554f4b94", "metadata": {}, "source": [ "\n", "## 0) Setup\n", "\n", "Install Python dependencies. If you're offline, pre-install or remove what you don't need.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "e4e8a7e0", "metadata": {}, "outputs": [], "source": [ "\n", "# If needed, uncomment:\n", "# %pip install --quiet requests faiss-cpu rich markdownify python-frontmatter pypdf regex\n", "# Optional extras:\n", "# %pip install --quiet tiktoken beautifulsoup4 lxml\n" ] }, { "cell_type": "markdown", "id": "a7a796e8", "metadata": {}, "source": [ "\n", "## 1) Configuration\n", "\n", "Set paths, models, and chunking/index params.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "562f06f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
{\n", " 'DATA_DIR': '/home/marvin/repo/jupyter-ai-test/raft/corpus',\n", " 'OUTPUT_DIR': '/home/marvin/repo/jupyter-ai-test/raft/outputs',\n", " 'OLLAMA_URL': 'http://localhost:11434',\n", " 'EMBED_MODEL': 'nomic-embed-text',\n", " 'GEN_MODEL': 'llama3.1:8b',\n", " 'CHUNK_SIZE': 1200,\n", " 'CHUNK_OVERLAP': 200,\n", " 'TOP_K': 4,\n", " 'SAMPLES_PER_DOC': 4\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'DATA_DIR'\u001b[0m: \u001b[32m'/home/marvin/repo/jupyter-ai-test/raft/corpus'\u001b[0m,\n", " \u001b[32m'OUTPUT_DIR'\u001b[0m: \u001b[32m'/home/marvin/repo/jupyter-ai-test/raft/outputs'\u001b[0m,\n", " \u001b[32m'OLLAMA_URL'\u001b[0m: \u001b[32m'http://localhost:11434'\u001b[0m,\n", " \u001b[32m'EMBED_MODEL'\u001b[0m: \u001b[32m'nomic-embed-text'\u001b[0m,\n", " \u001b[32m'GEN_MODEL'\u001b[0m: \u001b[32m'llama3.1:8b'\u001b[0m,\n", " \u001b[32m'CHUNK_SIZE'\u001b[0m: \u001b[1;36m1200\u001b[0m,\n", " \u001b[32m'CHUNK_OVERLAP'\u001b[0m: \u001b[1;36m200\u001b[0m,\n", " \u001b[32m'TOP_K'\u001b[0m: \u001b[1;36m4\u001b[0m,\n", " \u001b[32m'SAMPLES_PER_DOC'\u001b[0m: \u001b[1;36m4\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "from dataclasses import dataclass, asdict\n", "from pathlib import Path\n", "from typing import List, Dict, Any, Optional, Tuple\n", "import os, re, json, uuid, math, glob, random, time\n", "import hashlib\n", "import requests\n", "from rich import print\n", "import regex\n", "import numpy as np\n", "\n", "# ---- Core config ----\n", "DATA_DIR = Path(\"./corpus\") # Put your source docs here\n", "OUTPUT_DIR = Path(\"./outputs\") # Where artifacts are saved\n", "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "# Ollama endpoints & models\n", "OLLAMA_URL = os.environ.get(\"OLLAMA_URL\", \"http://localhost:11434\")\n", "EMBED_MODEL = os.environ.get(\"EMBED_MODEL\", \"nomic-embed-text\")\n", "GEN_MODEL = os.environ.get(\"GEN_MODEL\", \"llama3.1:8b\")\n", "\n", "# Chunking\n", "CHUNK_SIZE = 1200 # characters\n", "CHUNK_OVERLAP = 200 # characters\n", "MIN_CHARS = 200 # minimum viable chunk length\n", "\n", "# Index\n", "USE_FAISS = True\n", "TOP_K = 4\n", "\n", "# RAFT generation\n", "SEED = 7\n", "SAMPLES_PER_DOC = 4\n", "MAX_TOKENS_GEN = 512 # Generation max tokens (approx; Ollama supports 'num_predict')\n", "TEMPERATURE = 0.6\n", "\n", "random.seed(SEED)\n", "np.random.seed(SEED)\n", "\n", "print({\n", " \"DATA_DIR\": str(DATA_DIR.resolve()),\n", " \"OUTPUT_DIR\": str(OUTPUT_DIR.resolve()),\n", " \"OLLAMA_URL\": OLLAMA_URL,\n", " \"EMBED_MODEL\": EMBED_MODEL,\n", " \"GEN_MODEL\": GEN_MODEL,\n", " \"CHUNK_SIZE\": CHUNK_SIZE,\n", " \"CHUNK_OVERLAP\": CHUNK_OVERLAP,\n", " \"TOP_K\": TOP_K,\n", " \"SAMPLES_PER_DOC\": SAMPLES_PER_DOC\n", "})\n" ] }, { "cell_type": "markdown", "id": "5f932e07", "metadata": {}, "source": [ "\n", "## 2) Load & Normalize Documents\n", "\n", "Basic loaders for `.md`, `.txt`, `.html`. PDF support is optional (requires `pypdf`). You can extend as needed.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "b4a4b5ee", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Loaded 265 documents\n", "\n" ], "text/plain": [ "\u001b[32mLoaded \u001b[0m\u001b[1;32m265\u001b[0m\u001b[32m documents\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "265" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "from bs4 import BeautifulSoup # if you didn't install bs4, comment HTML support below\n", "try:\n", " import frontmatter\n", "except Exception:\n", " frontmatter = None\n", "\n", "def read_text_file(p: Path) -> str:\n", " return p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n", "\n", "def read_markdown(p: Path) -> str:\n", " text = p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n", " # Optional: strip YAML frontmatter\n", " if frontmatter:\n", " try:\n", " fm = frontmatter.loads(text)\n", " return fm.content\n", " except Exception:\n", " return text\n", " return text\n", "\n", "def read_html(p: Path) -> str:\n", " html = p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n", " soup = BeautifulSoup(html, \"lxml\")\n", " # Remove script/style\n", " for tag in soup([\"script\", \"style\", \"noscript\"]):\n", " tag.decompose()\n", " text = soup.get_text(\" \", strip=True)\n", " return text\n", "\n", "def read_pdf(p: Path) -> str:\n", " try:\n", " from pypdf import PdfReader\n", " except Exception as e:\n", " print(\"[yellow]Install pypdf to enable PDF parsing: %pip install pypdf[/yellow]\")\n", " raise e\n", " reader = PdfReader(str(p))\n", " parts = []\n", " for page in reader.pages:\n", " try:\n", " parts.append(page.extract_text() or \"\")\n", " except Exception:\n", " parts.append(\"\")\n", " return \"\\n\".join(parts)\n", "\n", "SUPPORTED_EXTS = {\".txt\": read_text_file, \".md\": read_markdown, \".markdown\": read_markdown,\n", " \".html\": read_html, \".htm\": read_html, \".pdf\": read_pdf}\n", "\n", "def load_corpus(data_dir: Path) -> Dict[str, str]:\n", " docs = {}\n", " for p in data_dir.rglob(\"*\"):\n", " if not p.is_file():\n", " continue\n", " fn = p.suffix.lower()\n", " if fn in SUPPORTED_EXTS:\n", " try:\n", " docs[str(p)] = SUPPORTED_EXTS[fn](p)\n", " except Exception as e:\n", " print(f\"[red]Failed to read {p}: {e}[/red]\")\n", " print(f\"[green]Loaded {len(docs)} documents[/green]\")\n", " return docs\n", "\n", "docs = load_corpus(DATA_DIR)\n", "len(docs)\n" ] }, { "cell_type": "markdown", "id": "0ffb354e", "metadata": {}, "source": [ "\n", "## 3) Chunking\n", "\n", "Simple character-based chunker with overlap. Swap in a token-based chunker if you prefer.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "a1abf43d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Total chunks: 7373\n", "\n" ], "text/plain": [ "\u001b[32mTotal chunks: \u001b[0m\u001b[1;32m7373\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "@dataclass\n", "class Chunk:\n", " id: str\n", " doc_path: str\n", " start: int\n", " end: int\n", " text: str\n", " sha1: str\n", "\n", "def chunk_text(text: str, doc_path: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[Chunk]:\n", " chunks: List[Chunk] = []\n", " i = 0\n", " n = len(text)\n", " while i < n:\n", " j = min(i + chunk_size, n)\n", " piece = text[i:j].strip()\n", " if len(piece) >= MIN_CHARS:\n", " sha1 = hashlib.sha1(piece.encode(\"utf-8\")).hexdigest()\n", " chunks.append(Chunk(\n", " id=str(uuid.uuid4()),\n", " doc_path=doc_path,\n", " start=i, end=j,\n", " text=piece,\n", " sha1=sha1\n", " ))\n", " if j == n:\n", " break\n", " i = j - overlap\n", " if i < 0:\n", " i = 0\n", " if i >= n:\n", " break\n", " return chunks\n", "\n", "all_chunks: List[Chunk] = []\n", "for path, text in docs.items():\n", " all_chunks.extend(chunk_text(text, path))\n", "\n", "print(f\"[green]Total chunks: {len(all_chunks)}[/green]\")\n" ] }, { "cell_type": "markdown", "id": "396ab28c", "metadata": {}, "source": [ "\n", "## 4) Embeddings via Ollama\n", "\n", "Uses Ollama's `POST /api/embeddings` endpoint with your selected embedding model. \n", "Make sure you've pulled it locally: `ollama pull nomic-embed-text` (or your chosen model).\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "037fc70d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7373, 768)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "EMBED_ENDPOINT = f\"{OLLAMA_URL}/api/embeddings\"\n", "\n", "def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 32) -> np.ndarray:\n", " vectors = []\n", " for i in range(0, len(texts), batch_size):\n", " batch = texts[i:i+batch_size]\n", " # Ollama supports a single prompt or list? We'll call one by one to be safe with large content.\n", " for t in batch:\n", " r = requests.post(EMBED_ENDPOINT, json={\"model\": model, \"prompt\": t})\n", " r.raise_for_status()\n", " data = r.json()\n", " vec = np.array(data[\"embedding\"], dtype=np.float32)\n", " vectors.append(vec)\n", " return np.vstack(vectors) if vectors else np.zeros((0, 768), dtype=np.float32)\n", "\n", "chunk_texts = [c.text for c in all_chunks]\n", "emb_matrix = embed_texts(chunk_texts, model=EMBED_MODEL, batch_size=8)\n", "emb_matrix.shape\n" ] }, { "cell_type": "markdown", "id": "5b4af05b", "metadata": {}, "source": [ "\n", "## 5) Build Vector Index (FAISS)\n", "\n", "We normalize vectors and use inner product (equivalent to cosine on normalized vectors).\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "350bd2a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FAISS index built: 7373 vectors\n", "\n" ], "text/plain": [ "\u001b[32mFAISS index built:\u001b[0m \u001b[1;36m7373\u001b[0m vectors\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "def normalize_rows(x: np.ndarray) -> np.ndarray:\n", " norms = np.linalg.norm(x, axis=1, keepdims=True) + 1e-12\n", " return x / norms\n", "\n", "if USE_FAISS:\n", " import faiss\n", " xb = normalize_rows(emb_matrix).astype(np.float32)\n", " d = xb.shape[1]\n", " index = faiss.IndexFlatIP(d)\n", " index.add(xb)\n", " print(\"[green]FAISS index built:[/green]\", index.ntotal, \"vectors\")\n", "else:\n", " index = None\n", " xb = normalize_rows(emb_matrix).astype(np.float32)\n" ] }, { "cell_type": "markdown", "id": "7f5d26cb", "metadata": {}, "source": [ "\n", "## 6) Retrieval Helper\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "198ddb95", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[(883, 0.4678046703338623), (5593, 0.45775240659713745), (4318, 0.45724332332611084)]\n", "\n" ], "text/plain": [ "\u001b[1m[\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m883\u001b[0m, \u001b[1;36m0.4678046703338623\u001b[0m\u001b[1m)\u001b[0m, \u001b[1m(\u001b[0m\u001b[1;36m5593\u001b[0m, \u001b[1;36m0.45775240659713745\u001b[0m\u001b[1m)\u001b[0m, \u001b[1m(\u001b[0m\u001b[1;36m4318\u001b[0m, \u001b[1;36m0.45724332332611084\u001b[0m\u001b[1m)\u001b[0m\u001b[1m]\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "def search(query: str, top_k: int = TOP_K) -> List[Tuple[int, float]]:\n", " # Embed the query\n", " qv = embed_texts([query], model=EMBED_MODEL, batch_size=1)\n", " qv = normalize_rows(qv).astype(np.float32)\n", " if USE_FAISS and index is not None:\n", " D, I = index.search(qv, top_k)\n", " hits = list(zip(I[0].tolist(), D[0].tolist()))\n", " else:\n", " sims = (xb @ qv.T).ravel()\n", " I = np.argsort(-sims)[:top_k]\n", " hits = [(int(i), float(sims[i])) for i in I]\n", " return hits\n", "\n", "# quick smoke test (no error means it's wired up)\n", "print(search(\"What does this corpus talk about?\", 3))\n" ] }, { "cell_type": "markdown", "id": "0046d618", "metadata": {}, "source": [ "\n", "## 7) Synthesize Grounded Q&A / Instructions with Ollama\n", "\n", "We sample chunks, retrieve neighbors for richer context, and prompt a local LLM to create **high-quality** pairs.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "1d12cf4f", "metadata": {}, "outputs": [], "source": [ "\n", "GEN_ENDPOINT = f\"{OLLAMA_URL}/api/generate\"\n", "\n", "SYSTEM_PROMPT = (\n", " \"You are a careful dataset writer. Given only the provided CONTEXT, craft high-quality, factual \"\n", " \"question–answer pairs for supervised fine-tuning. Answers must be grounded strictly in the context. \"\n", " \"If the context lacks the answer, say 'INSUFFICIENT_CONTEXT'. Focus on clarity, specificity, and avoid hallucinations.\"\n", ")\n", "\n", "USER_PROMPT_TEMPLATE = (\n", " \"CONTEXT:\\n\\n{context}\\n\\n\"\n", " \"Task: Produce {n} diverse Q&A pairs about the content above. \"\n", " \"Use JSON lines (one JSON object per line) with keys: 'input' (question/instruction), 'output' (concise grounded answer), \"\n", " \"'meta' (object with 'source_path', 'chunk_ids', and optional 'citations': list of quotes). 
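Any quotes placed in 'citations' must be copied verbatim from the context. 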
\"\n", " \"Do NOT include markdown; output JSON objects only.\"\n", ")\n", "\n", "def ollama_generate(prompt: str, model: str = GEN_MODEL, temperature: float = TEMPERATURE, num_predict: int = MAX_TOKENS_GEN) -> str:\n", " payload = {\n", " \"model\": model,\n", " \"prompt\": prompt,\n", " \"system\": SYSTEM_PROMPT,\n", " \"options\": {\n", " \"temperature\": temperature,\n", " \"num_predict\": num_predict\n", " },\n", " \"stream\": False\n", " }\n", " r = requests.post(GEN_ENDPOINT, json=payload)\n", " r.raise_for_status()\n", " data = r.json()\n", " return data.get(\"response\", \"\")\n", "\n", "def build_context(primary_idx: int, k: int = TOP_K) -> Tuple[str, List[str]]:\n", " primary_chunk = all_chunks[primary_idx]\n", " query = primary_chunk.text[:400] # use the start of the chunk as a pseudo-query\n", " hits = search(query, k)\n", " pieces, ids = [], []\n", " for i, score in hits:\n", " ch = all_chunks[i]\n", " ids.append(ch.id)\n", " pieces.append(f\"[{Path(ch.doc_path).name}::{ch.start}-{ch.end}]\\n{ch.text}\")\n", " return \"\\n\\n---\\n\\n\".join(pieces), ids\n", "\n", "def parse_llm_jsonl(text: str) -> List[Dict[str, Any]]:\n", " rows = []\n", " for line in text.splitlines():\n", " line = line.strip()\n", " if not line:\n", " continue\n", " # be forgiving for trailing commas etc.\n", " try:\n", " obj = json.loads(line)\n", " if isinstance(obj, dict):\n", " rows.append(obj)\n", " except Exception:\n", " # try to salvage with regex for JSON-ish\n", " try:\n", " fixed = regex.sub(r\",\\s*}\", \"}\", line)\n", " fixed = regex.sub(r\",\\s*]\", \"]\", fixed)\n", " obj = json.loads(fixed)\n", " if isinstance(obj, dict):\n", " rows.append(obj)\n", " except Exception:\n", " pass\n", " return rows\n" ] }, { "cell_type": "markdown", "id": "ab8c47f2", "metadata": {}, "source": [ "\n", "## 8) Generate the RAFT Dataset\n", "\n", "This step iterates over documents, samples chunks, retrieves neighbors, and asks the model to produce JSONL rows.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "74493d61", "metadata": {}, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[13]\u001b[39m\u001b[32m, line 43\u001b[39m\n\u001b[32m 40\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m[green]Wrote \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mtotal_target\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m rows -> \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mout_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m[/green]\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 41\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m out_path\n\u001b[32m---> \u001b[39m\u001b[32m43\u001b[39m OUT_JSONL = \u001b[43msynthesize_dataset\u001b[49m\u001b[43m(\u001b[49m\u001b[43msamples_per_doc\u001b[49m\u001b[43m=\u001b[49m\u001b[43mSAMPLES_PER_DOC\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 44\u001b[39m OUT_JSONL\n", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[13]\u001b[39m\u001b[32m, line 20\u001b[39m, in \u001b[36msynthesize_dataset\u001b[39m\u001b[34m(samples_per_doc, out_path)\u001b[39m\n\u001b[32m 18\u001b[39m ctx, ids = build_context(pi, k=TOP_K)\n\u001b[32m 19\u001b[39m user = USER_PROMPT_TEMPLATE.format(context=ctx, n=\u001b[32m3\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m20\u001b[39m raw = 
\u001b[43mollama_generate\u001b[49m\u001b[43m(\u001b[49m\u001b[43muser\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m=\u001b[49m\u001b[43mGEN_MODEL\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtemperature\u001b[49m\u001b[43m=\u001b[49m\u001b[43mTEMPERATURE\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mnum_predict\u001b[49m\u001b[43m=\u001b[49m\u001b[43mMAX_TOKENS_GEN\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 21\u001b[39m rows = parse_llm_jsonl(raw)\n\u001b[32m 22\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m r \u001b[38;5;129;01min\u001b[39;00m rows:\n\u001b[32m 23\u001b[39m \u001b[38;5;66;03m# enforce schema & enrich meta\u001b[39;00m\n", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 28\u001b[39m, in \u001b[36mollama_generate\u001b[39m\u001b[34m(prompt, model, temperature, num_predict)\u001b[39m\n\u001b[32m 17\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mollama_generate\u001b[39m(prompt: \u001b[38;5;28mstr\u001b[39m, model: \u001b[38;5;28mstr\u001b[39m = GEN_MODEL, temperature: \u001b[38;5;28mfloat\u001b[39m = TEMPERATURE, num_predict: \u001b[38;5;28mint\u001b[39m = MAX_TOKENS_GEN) -> \u001b[38;5;28mstr\u001b[39m:\n\u001b[32m 18\u001b[39m payload = {\n\u001b[32m 19\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mmodel\u001b[39m\u001b[33m\"\u001b[39m: model,\n\u001b[32m 20\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mprompt\u001b[39m\u001b[33m\"\u001b[39m: prompt,\n\u001b[32m (...)\u001b[39m\u001b[32m 26\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mstream\u001b[39m\u001b[33m\"\u001b[39m: \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m 27\u001b[39m }\n\u001b[32m---> \u001b[39m\u001b[32m28\u001b[39m r = \u001b[43mrequests\u001b[49m\u001b[43m.\u001b[49m\u001b[43mpost\u001b[49m\u001b[43m(\u001b[49m\u001b[43mGEN_ENDPOINT\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mjson\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpayload\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 29\u001b[39m r.raise_for_status()\n\u001b[32m 30\u001b[39m data = r.json()\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/api.py:115\u001b[39m, in \u001b[36mpost\u001b[39m\u001b[34m(url, data, json, **kwargs)\u001b[39m\n\u001b[32m 103\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mpost\u001b[39m(url, data=\u001b[38;5;28;01mNone\u001b[39;00m, json=\u001b[38;5;28;01mNone\u001b[39;00m, **kwargs):\n\u001b[32m 104\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"Sends a POST request.\u001b[39;00m\n\u001b[32m 105\u001b[39m \n\u001b[32m 106\u001b[39m \u001b[33;03m :param url: URL for the new :class:`Request` object.\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 112\u001b[39m \u001b[33;03m :rtype: requests.Response\u001b[39;00m\n\u001b[32m 113\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m115\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mpost\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mjson\u001b[49m\u001b[43m=\u001b[49m\u001b[43mjson\u001b[49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/api.py:59\u001b[39m, in \u001b[36mrequest\u001b[39m\u001b[34m(method, url, **kwargs)\u001b[39m\n\u001b[32m 55\u001b[39m \u001b[38;5;66;03m# By using the 'with' statement we are sure the session is closed, thus we\u001b[39;00m\n\u001b[32m 56\u001b[39m \u001b[38;5;66;03m# avoid leaving sockets open which can trigger a ResourceWarning in some\u001b[39;00m\n\u001b[32m 57\u001b[39m \u001b[38;5;66;03m# cases, and look like a memory leak in others.\u001b[39;00m\n\u001b[32m 58\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m sessions.Session() \u001b[38;5;28;01mas\u001b[39;00m session:\n\u001b[32m---> \u001b[39m\u001b[32m59\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msession\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m=\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/sessions.py:589\u001b[39m, in \u001b[36mSession.request\u001b[39m\u001b[34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[39m\n\u001b[32m 584\u001b[39m send_kwargs = {\n\u001b[32m 585\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mtimeout\u001b[39m\u001b[33m\"\u001b[39m: timeout,\n\u001b[32m 586\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mallow_redirects\u001b[39m\u001b[33m\"\u001b[39m: allow_redirects,\n\u001b[32m 587\u001b[39m }\n\u001b[32m 588\u001b[39m send_kwargs.update(settings)\n\u001b[32m--> \u001b[39m\u001b[32m589\u001b[39m resp = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprep\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43msend_kwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 591\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m resp\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/sessions.py:703\u001b[39m, in \u001b[36mSession.send\u001b[39m\u001b[34m(self, request, **kwargs)\u001b[39m\n\u001b[32m 700\u001b[39m start = preferred_clock()\n\u001b[32m 702\u001b[39m \u001b[38;5;66;03m# Send the request\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m703\u001b[39m r = \u001b[43madapter\u001b[49m\u001b[43m.\u001b[49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 705\u001b[39m \u001b[38;5;66;03m# Total elapsed time of the request (approximately)\u001b[39;00m\n\u001b[32m 706\u001b[39m elapsed = preferred_clock() - start\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/adapters.py:667\u001b[39m, in \u001b[36mHTTPAdapter.send\u001b[39m\u001b[34m(self, request, stream, timeout, verify, cert, proxies)\u001b[39m\n\u001b[32m 664\u001b[39m timeout = TimeoutSauce(connect=timeout, read=timeout)\n\u001b[32m 666\u001b[39m 
\u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m667\u001b[39m resp = \u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43murlopen\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 668\u001b[39m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 669\u001b[39m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m=\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 670\u001b[39m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 671\u001b[39m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 672\u001b[39m \u001b[43m \u001b[49m\u001b[43mredirect\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 673\u001b[39m \u001b[43m \u001b[49m\u001b[43massert_same_host\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 674\u001b[39m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 675\u001b[39m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 676\u001b[39m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmax_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 677\u001b[39m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 678\u001b[39m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 679\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 681\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m (ProtocolError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m 682\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m(err, request=request)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connectionpool.py:787\u001b[39m, in \u001b[36mHTTPConnectionPool.urlopen\u001b[39m\u001b[34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)\u001b[39m\n\u001b[32m 784\u001b[39m response_conn = conn \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m release_conn \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 786\u001b[39m \u001b[38;5;66;03m# Make the request on the HTTPConnection object\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m787\u001b[39m response = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_make_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 788\u001b[39m \u001b[43m \u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 789\u001b[39m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 790\u001b[39m \u001b[43m 
\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 791\u001b[39m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout_obj\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 792\u001b[39m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 793\u001b[39m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m=\u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 794\u001b[39m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 795\u001b[39m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[43m=\u001b[49m\u001b[43mretries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 796\u001b[39m \u001b[43m \u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[43m=\u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 797\u001b[39m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 798\u001b[39m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 799\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mresponse_kw\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 800\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 802\u001b[39m \u001b[38;5;66;03m# Everything went great!\u001b[39;00m\n\u001b[32m 803\u001b[39m clean_exit = \u001b[38;5;28;01mTrue\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connectionpool.py:534\u001b[39m, in \u001b[36mHTTPConnectionPool._make_request\u001b[39m\u001b[34m(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)\u001b[39m\n\u001b[32m 532\u001b[39m \u001b[38;5;66;03m# Receive the response from the server\u001b[39;00m\n\u001b[32m 533\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m534\u001b[39m response = \u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 535\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m (BaseSSLError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m 536\u001b[39m \u001b[38;5;28mself\u001b[39m._raise_timeout(err=e, url=url, timeout_value=read_timeout)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connection.py:516\u001b[39m, in \u001b[36mHTTPConnection.getresponse\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 513\u001b[39m _shutdown = \u001b[38;5;28mgetattr\u001b[39m(\u001b[38;5;28mself\u001b[39m.sock, \u001b[33m\"\u001b[39m\u001b[33mshutdown\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m)\n\u001b[32m 515\u001b[39m \u001b[38;5;66;03m# Get the response from http.client.HTTPConnection\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m516\u001b[39m httplib_response = \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 518\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 519\u001b[39m assert_header_parsing(httplib_response.msg)\n", "\u001b[36mFile 
\u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:1430\u001b[39m, in \u001b[36mHTTPConnection.getresponse\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1428\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1429\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1430\u001b[39m \u001b[43mresponse\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbegin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1431\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m:\n\u001b[32m 1432\u001b[39m \u001b[38;5;28mself\u001b[39m.close()\n", "\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:331\u001b[39m, in \u001b[36mHTTPResponse.begin\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 329\u001b[39m \u001b[38;5;66;03m# read until we get a non-100 response\u001b[39;00m\n\u001b[32m 330\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m331\u001b[39m version, status, reason = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_read_status\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 332\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m status != CONTINUE:\n\u001b[32m 333\u001b[39m \u001b[38;5;28;01mbreak\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:292\u001b[39m, in \u001b[36mHTTPResponse._read_status\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 291\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_read_status\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m292\u001b[39m line = \u001b[38;5;28mstr\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mfp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mreadline\u001b[49m\u001b[43m(\u001b[49m\u001b[43m_MAXLINE\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\u001b[43m)\u001b[49m, \u001b[33m\"\u001b[39m\u001b[33miso-8859-1\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 293\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(line) > _MAXLINE:\n\u001b[32m 294\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m LineTooLong(\u001b[33m\"\u001b[39m\u001b[33mstatus line\u001b[39m\u001b[33m\"\u001b[39m)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/socket.py:720\u001b[39m, in \u001b[36mSocketIO.readinto\u001b[39m\u001b[34m(self, b)\u001b[39m\n\u001b[32m 718\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m 719\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m720\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_sock\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrecv_into\u001b[49m\u001b[43m(\u001b[49m\u001b[43mb\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 721\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m timeout:\n\u001b[32m 722\u001b[39m \u001b[38;5;28mself\u001b[39m._timeout_occurred = \u001b[38;5;28;01mTrue\u001b[39;00m\n", "\u001b[31mKeyboardInterrupt\u001b[39m: " ] } ], "source": [ "\n", "import datetime\n", "import time\n", "\n", "\n", "def synthesize_dataset(samples_per_doc: int = SAMPLES_PER_DOC, out_path: Path = OUTPUT_DIR / \"raft_dataset.jsonl\") -> Path:\n", " rng = random.Random(SEED)\n", " doc_to_chunk_idx = {}\n", " for i, ch in 
enumerate(all_chunks):\n", "        doc_to_chunk_idx.setdefault(ch.doc_path, []).append(i)\n", "\n", "    total_target = 0\n", "    total_docs = len(doc_to_chunk_idx)\n", "    with out_path.open(\"w\", encoding=\"utf-8\") as f:\n", "        for doc_num, (doc_path, idxs) in enumerate(doc_to_chunk_idx.items(), start=1):\n", "            print(f\"[blue]Synthesizing for: {doc_path} ({len(idxs)} chunks)[/blue]\")\n", "            percent = doc_num / total_docs * 100\n", "            print(f\"[cyan]Progress: {doc_num}/{total_docs} ({percent:.1f}%) completed[/cyan]\")\n", "            if not idxs:\n", "                continue\n", "            chosen = rng.sample(idxs, min(samples_per_doc, len(idxs)))\n", "            for pi in chosen:\n", "                ctx, ids = build_context(pi, k=TOP_K)\n", "                user = USER_PROMPT_TEMPLATE.format(context=ctx, n=3)\n", "                raw = ollama_generate(user, model=GEN_MODEL, temperature=TEMPERATURE, num_predict=MAX_TOKENS_GEN)\n", "                rows = parse_llm_jsonl(raw)\n", "                for r in rows:\n", "                    # Enforce the input/output/meta schema and enrich the metadata\n", "                    inp = r.get(\"input\") or r.get(\"question\") or r.get(\"query\")\n", "                    out = r.get(\"output\") or r.get(\"answer\") or r.get(\"response\")\n", "                    meta = r.get(\"meta\") or {}\n", "                    if not isinstance(meta, dict):\n", "                        meta = {}\n", "                    meta.update({\n", "                        \"source_path\": str(doc_path),\n", "                        \"chunk_ids\": ids,\n", "                        \"generated_at\": datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),\n", "                        \"model\": GEN_MODEL,\n", "                        \"embed_model\": EMBED_MODEL\n", "                    })\n", "                    if inp and out:\n", "                        obj = {\"input\": inp, \"output\": out, \"meta\": meta}\n", "                        f.write(json.dumps(obj, ensure_ascii=False) + \"\\n\")\n", "                        total_target += 1\n", "    print(f\"[green]Wrote {total_target} rows -> {out_path}[/green]\")\n", "    return out_path\n", "\n", "OUT_JSONL = synthesize_dataset(samples_per_doc=SAMPLES_PER_DOC)\n", "OUT_JSONL\n" ] },
{ "cell_type": "markdown", "id": "2b7d110c", "metadata": {}, "source": [ "\n", "## 9) Preview Samples\n" ] }, { "cell_type": "code", "execution_count": null, "id": "73dbd2e8", "metadata": {}, "outputs": [], "source": [ "\n", "from itertools import islice\n", "\n", "def head_jsonl(p: Path, n: int = 5):\n", "    with p.open(\"r\", encoding=\"utf-8\") as f:\n", "        for line in islice(f, n):\n", "            print(line.rstrip())\n", "\n", "head_jsonl(OUT_JSONL, 5)\n" ] },
{ "cell_type": "markdown", "id": "769567bf", "metadata": {}, "source": [ "\n", "## 10) Optional: Spot-Check Generation Quality\n", "\n", "Run a quick spot-check by re-asking questions from the generated dataset against freshly retrieved context and reviewing how well the answers stay grounded.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "18c7d3d7", "metadata": {}, "outputs": [], "source": [ "\n", "EVAL_QUESTIONS = []\n", "\n", "# Collect the first few questions (up to 5) from the generated dataset\n", "with (OUTPUT_DIR / \"raft_dataset.jsonl\").open(\"r\", encoding=\"utf-8\") as f:\n", "    for line in f:\n", "        try:\n", "            obj = json.loads(line)\n", "            EVAL_QUESTIONS.append(obj[\"input\"])\n", "        except Exception:\n", "            pass\n", "        if len(EVAL_QUESTIONS) >= 5:\n", "            break\n", "\n", "def rag_answer(q: str, k: int = TOP_K) -> str:\n", "    hits = search(q, k)\n", "    ctx = \"\\n\\n\".join([all_chunks[i].text for i,_ in hits])\n", "    user = f\"Answer the question using ONLY this context. 
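Keep the answer concise and quote the supporting passage when possible. 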
If missing, say INSUFFICIENT_CONTEXT.\\n\\nCONTEXT:\\n{ctx}\\n\\nQUESTION: {q}\"\n", "    return ollama_generate(user, model=GEN_MODEL, temperature=0.2, num_predict=256)\n", "\n", "for q in EVAL_QUESTIONS:\n", "    print(\"\\n[bold]Q:[/bold]\", q)\n", "    ans = rag_answer(q).strip()\n", "    print(\"[bold]A:[/bold]\", ans[:500] + (\" ...\" if len(ans) > 500 else \"\"))\n" ] },
{ "cell_type": "markdown", "id": "065a54cf", "metadata": {}, "source": [ "\n", "## 11) Artifacts\n", "\n", "- `outputs/raft_dataset.jsonl` — your RAFT dataset (input/output/meta per line)\n", "- `corpus/` — your source documents (you provide)\n", "- You can also persist `emb_matrix.npy` and a FAISS index for reuse.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "300fe34e", "metadata": {}, "outputs": [], "source": [ "\n", "# Optionally persist embeddings and index for later reuse\n", "np.save(OUTPUT_DIR / \"emb_matrix.npy\", emb_matrix)\n", "\n", "if USE_FAISS:\n", "    import faiss\n", "    faiss.write_index(index, str(OUTPUT_DIR / \"faiss.index\"))\n", "    print(\"[green]Saved FAISS index and embeddings.[/green]\")\n", "else:\n", "    print(\"[yellow]FAISS disabled; only saved embeddings.[/yellow]\")\n" ] },
{ "cell_type": "markdown", "id": "7bc8f310", "metadata": {}, "source": [ "\n", "## 12) Troubleshooting\n", "\n", "- **Connection error to Ollama**: ensure `ollama serve` is running and models are pulled (`ollama pull nomic-embed-text`, `ollama pull llama3.1`).\n", "- **Empty dataset**: your corpus may be too small or the parser skipped files. Check `corpus/` content and chunk parameters.\n", "- **Slow or stalled synthesis**: each sampled chunk triggers one generation request, so large corpora take a while. Lower `SAMPLES_PER_DOC` or `MAX_TOKENS_GEN`, use a smaller `GEN_MODEL`, or pass a `timeout` to `requests.post` in `ollama_generate` so a hung request raises instead of blocking indefinitely.\n", "- **Hallucinations**: tighten the system prompt, lower temperature, or increase `TOP_K` and chunk size.\n", "- **JSON parsing issues**: the notebook tries to be forgiving; you can harden `parse_llm_jsonl` per your needs.\n", "- **PDFs**: `pip install pypdf` and re-run the corpus loading cell.\n" ] } ], "metadata": { "kernelspec": { "display_name": ".env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }