{ "cells": [ { "cell_type": "markdown", "id": "e765555c", "metadata": {}, "source": [ "\n", "# Retrieval-Augmented **Fine-Tuning** (RAFT) Dataset Generation with **Local Ollama**\n", "\n", "This notebook builds a **supervised fine-tuning dataset** (JSONL) for *retrieval-augmented* tasks, by:\n", "1. **Ingesting** your local corpus (Markdown, text, HTML; PDFs optional with extra deps).\n", "2. **Chunking** and **embedding** documents using Ollama's local **embedding model** (e.g., `nomic-embed-text`, `mxbai-embed-large`).\n", "3. Building a **lightweight vector index** (FAISS).\n", "4. **Sampling contexts** and using a local **generation model** via Ollama (e.g., `llama3.1`, `qwen2`, `phi3`) to synthesize **grounded Q&A** or instruction–response pairs.\n", "5. Emitting a **RAFT-style JSONL** for supervised training (e.g., `input`, `output`, `meta` with source citations).\n", "\n", "> **Requirements**\n", ">\n", "> - Local [Ollama](https://ollama.com/) running at `http://localhost:11434`\n", "> - At least one **embedding** model pulled (e.g., `ollama pull nomic-embed-text`)\n", "> - At least one **generation** model pulled (e.g., `ollama pull llama3.1`)\n", ">\n", "> You can adapt the prompts and schema for your specific downstream trainer (Llama.cpp, vLLM, Axolotl, mlx, etc.).\n" ] }, { "cell_type": "markdown", "id": "554f4b94", "metadata": {}, "source": [ "\n", "## 0) Setup\n", "\n", "Install Python dependencies. If you're offline, pre-install or remove what you don't need.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "e4e8a7e0", "metadata": {}, "outputs": [], "source": [ "\n", "# If needed, uncomment:\n", "# %pip install --quiet requests faiss-cpu rich markdownify python-frontmatter pypdf regex\n", "# Optional extras:\n", "# %pip install --quiet tiktoken beautifulsoup4 lxml\n" ] }, { "cell_type": "markdown", "id": "a7a796e8", "metadata": {}, "source": [ "\n", "## 1) Configuration\n", "\n", "Set paths, models, and chunking/index params.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "562f06f9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
{\n", " 'DATA_DIR': '/home/marvin/repo/jupyter-ai-test/raft/corpus',\n", " 'OUTPUT_DIR': '/home/marvin/repo/jupyter-ai-test/raft/outputs',\n", " 'OLLAMA_URL': 'http://localhost:11434',\n", " 'EMBED_MODEL': 'nomic-embed-text',\n", " 'GEN_MODEL': 'llama3.1:8b',\n", " 'CHUNK_SIZE': 1200,\n", " 'CHUNK_OVERLAP': 200,\n", " 'TOP_K': 4,\n", " 'SAMPLES_PER_DOC': 4\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'DATA_DIR'\u001b[0m: \u001b[32m'/home/marvin/repo/jupyter-ai-test/raft/corpus'\u001b[0m,\n", " \u001b[32m'OUTPUT_DIR'\u001b[0m: \u001b[32m'/home/marvin/repo/jupyter-ai-test/raft/outputs'\u001b[0m,\n", " \u001b[32m'OLLAMA_URL'\u001b[0m: \u001b[32m'http://localhost:11434'\u001b[0m,\n", " \u001b[32m'EMBED_MODEL'\u001b[0m: \u001b[32m'nomic-embed-text'\u001b[0m,\n", " \u001b[32m'GEN_MODEL'\u001b[0m: \u001b[32m'llama3.1:8b'\u001b[0m,\n", " \u001b[32m'CHUNK_SIZE'\u001b[0m: \u001b[1;36m1200\u001b[0m,\n", " \u001b[32m'CHUNK_OVERLAP'\u001b[0m: \u001b[1;36m200\u001b[0m,\n", " \u001b[32m'TOP_K'\u001b[0m: \u001b[1;36m4\u001b[0m,\n", " \u001b[32m'SAMPLES_PER_DOC'\u001b[0m: \u001b[1;36m4\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "from dataclasses import dataclass, asdict\n", "from pathlib import Path\n", "from typing import List, Dict, Any, Optional, Tuple\n", "import os, re, json, uuid, math, glob, random, time\n", "import hashlib\n", "import requests\n", "from rich import print\n", "import regex\n", "import numpy as np\n", "\n", "# ---- Core config ----\n", "DATA_DIR = Path(\"./corpus\") # Put your source docs here\n", "OUTPUT_DIR = Path(\"./outputs\") # Where artifacts are saved\n", "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "# Ollama endpoints & models\n", "OLLAMA_URL = os.environ.get(\"OLLAMA_URL\", \"http://localhost:11434\")\n", "EMBED_MODEL = os.environ.get(\"EMBED_MODEL\", \"nomic-embed-text\")\n", "GEN_MODEL = os.environ.get(\"GEN_MODEL\", \"llama3.1:8b\")\n", "\n", "# Chunking\n", "CHUNK_SIZE = 1200 # characters\n", "CHUNK_OVERLAP = 200 # characters\n", "MIN_CHARS = 200 # minimum viable chunk length\n", "\n", "# Index\n", "USE_FAISS = True\n", "TOP_K = 4\n", "\n", "# RAFT generation\n", "SEED = 7\n", "SAMPLES_PER_DOC = 4\n", "MAX_TOKENS_GEN = 512 # Generation max tokens (approx; Ollama supports 'num_predict')\n", "TEMPERATURE = 0.6\n", "\n", "random.seed(SEED)\n", "np.random.seed(SEED)\n", "\n", "print({\n", " \"DATA_DIR\": str(DATA_DIR.resolve()),\n", " \"OUTPUT_DIR\": str(OUTPUT_DIR.resolve()),\n", " \"OLLAMA_URL\": OLLAMA_URL,\n", " \"EMBED_MODEL\": EMBED_MODEL,\n", " \"GEN_MODEL\": GEN_MODEL,\n", " \"CHUNK_SIZE\": CHUNK_SIZE,\n", " \"CHUNK_OVERLAP\": CHUNK_OVERLAP,\n", " \"TOP_K\": TOP_K,\n", " \"SAMPLES_PER_DOC\": SAMPLES_PER_DOC\n", "})\n" ] }, { "cell_type": "markdown", "id": "5f932e07", "metadata": {}, "source": [ "\n", "## 2) Load & Normalize Documents\n", "\n", "Basic loaders for `.md`, `.txt`, `.html`. PDF support is optional (requires `pypdf`). You can extend as needed.\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "b4a4b5ee", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Loaded 265 documents\n", "\n" ], "text/plain": [ "\u001b[32mLoaded \u001b[0m\u001b[1;32m265\u001b[0m\u001b[32m documents\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "265" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "from bs4 import BeautifulSoup # if you didn't install bs4, comment HTML support below\n", "try:\n", " import frontmatter\n", "except Exception:\n", " frontmatter = None\n", "\n", "def read_text_file(p: Path) -> str:\n", " return p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n", "\n", "def read_markdown(p: Path) -> str:\n", " text = p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n", " # Optional: strip YAML frontmatter\n", " if frontmatter:\n", " try:\n", " fm = frontmatter.loads(text)\n", " return fm.content\n", " except Exception:\n", " return text\n", " return text\n", "\n", "def read_html(p: Path) -> str:\n", " html = p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n", " soup = BeautifulSoup(html, \"lxml\")\n", " # Remove script/style\n", " for tag in soup([\"script\", \"style\", \"noscript\"]):\n", " tag.decompose()\n", " text = soup.get_text(\" \", strip=True)\n", " return text\n", "\n", "def read_pdf(p: Path) -> str:\n", " try:\n", " from pypdf import PdfReader\n", " except Exception as e:\n", " print(\"[yellow]Install pypdf to enable PDF parsing: %pip install pypdf[/yellow]\")\n", " raise e\n", " reader = PdfReader(str(p))\n", " parts = []\n", " for page in reader.pages:\n", " try:\n", " parts.append(page.extract_text() or \"\")\n", " except Exception:\n", " parts.append(\"\")\n", " return \"\\n\".join(parts)\n", "\n", "SUPPORTED_EXTS = {\".txt\": read_text_file, \".md\": read_markdown, \".markdown\": read_markdown,\n", " \".html\": read_html, \".htm\": read_html, \".pdf\": read_pdf}\n", "\n", "def load_corpus(data_dir: Path) -> Dict[str, str]:\n", " docs = {}\n", " for p in data_dir.rglob(\"*\"):\n", " if not p.is_file():\n", " continue\n", " fn = p.suffix.lower()\n", " if fn in SUPPORTED_EXTS:\n", " try:\n", " docs[str(p)] = SUPPORTED_EXTS[fn](p)\n", " except Exception as e:\n", " print(f\"[red]Failed to read {p}: {e}[/red]\")\n", " print(f\"[green]Loaded {len(docs)} documents[/green]\")\n", " return docs\n", "\n", "docs = load_corpus(DATA_DIR)\n", "len(docs)\n" ] }, { "cell_type": "markdown", "id": "0ffb354e", "metadata": {}, "source": [ "\n", "## 3) Chunking\n", "\n", "Simple character-based chunker with overlap. Swap in a token-based chunker if you prefer.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "a1abf43d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Total chunks: 7373\n", "\n" ], "text/plain": [ "\u001b[32mTotal chunks: \u001b[0m\u001b[1;32m7373\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "@dataclass\n", "class Chunk:\n", " id: str\n", " doc_path: str\n", " start: int\n", " end: int\n", " text: str\n", " sha1: str\n", "\n", "def chunk_text(text: str, doc_path: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[Chunk]:\n", " chunks: List[Chunk] = []\n", " i = 0\n", " n = len(text)\n", " while i < n:\n", " j = min(i + chunk_size, n)\n", " piece = text[i:j].strip()\n", " if len(piece) >= MIN_CHARS:\n", " sha1 = hashlib.sha1(piece.encode(\"utf-8\")).hexdigest()\n", " chunks.append(Chunk(\n", " id=str(uuid.uuid4()),\n", " doc_path=doc_path,\n", " start=i, end=j,\n", " text=piece,\n", " sha1=sha1\n", " ))\n", " if j == n:\n", " break\n", " i = j - overlap\n", " if i < 0:\n", " i = 0\n", " if i >= n:\n", " break\n", " return chunks\n", "\n", "all_chunks: List[Chunk] = []\n", "for path, text in docs.items():\n", " all_chunks.extend(chunk_text(text, path))\n", "\n", "print(f\"[green]Total chunks: {len(all_chunks)}[/green]\")\n" ] }, { "cell_type": "markdown", "id": "396ab28c", "metadata": {}, "source": [ "\n", "## 4) Embeddings via Ollama\n", "\n", "Uses Ollama's `POST /api/embeddings` endpoint with your selected embedding model. \n", "Make sure you've pulled it locally: `ollama pull nomic-embed-text` (or your chosen model).\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "037fc70d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7373, 768)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "EMBED_ENDPOINT = f\"{OLLAMA_URL}/api/embeddings\"\n", "\n", "def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 32) -> np.ndarray:\n", " vectors = []\n", " for i in range(0, len(texts), batch_size):\n", " batch = texts[i:i+batch_size]\n", " # Ollama supports a single prompt or list? We'll call one by one to be safe with large content.\n", " for t in batch:\n", " r = requests.post(EMBED_ENDPOINT, json={\"model\": model, \"prompt\": t})\n", " r.raise_for_status()\n", " data = r.json()\n", " vec = np.array(data[\"embedding\"], dtype=np.float32)\n", " vectors.append(vec)\n", " return np.vstack(vectors) if vectors else np.zeros((0, 768), dtype=np.float32)\n", "\n", "chunk_texts = [c.text for c in all_chunks]\n", "emb_matrix = embed_texts(chunk_texts, model=EMBED_MODEL, batch_size=8)\n", "emb_matrix.shape\n" ] }, { "cell_type": "markdown", "id": "5b4af05b", "metadata": {}, "source": [ "\n", "## 5) Build Vector Index (FAISS)\n", "\n", "We normalize vectors and use inner product (equivalent to cosine on normalized vectors).\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "350bd2a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
FAISS index built: 7373 vectors\n", "\n" ], "text/plain": [ "\u001b[32mFAISS index built:\u001b[0m \u001b[1;36m7373\u001b[0m vectors\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "def normalize_rows(x: np.ndarray) -> np.ndarray:\n", " norms = np.linalg.norm(x, axis=1, keepdims=True) + 1e-12\n", " return x / norms\n", "\n", "if USE_FAISS:\n", " import faiss\n", " xb = normalize_rows(emb_matrix).astype(np.float32)\n", " d = xb.shape[1]\n", " index = faiss.IndexFlatIP(d)\n", " index.add(xb)\n", " print(\"[green]FAISS index built:[/green]\", index.ntotal, \"vectors\")\n", "else:\n", " index = None\n", " xb = normalize_rows(emb_matrix).astype(np.float32)\n" ] }, { "cell_type": "markdown", "id": "7f5d26cb", "metadata": {}, "source": [ "\n", "## 6) Retrieval Helper\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "198ddb95", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[(883, 0.4678046703338623), (5593, 0.45775240659713745), (4318, 0.45724332332611084)]\n", "\n" ], "text/plain": [ "\u001b[1m[\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m883\u001b[0m, \u001b[1;36m0.4678046703338623\u001b[0m\u001b[1m)\u001b[0m, \u001b[1m(\u001b[0m\u001b[1;36m5593\u001b[0m, \u001b[1;36m0.45775240659713745\u001b[0m\u001b[1m)\u001b[0m, \u001b[1m(\u001b[0m\u001b[1;36m4318\u001b[0m, \u001b[1;36m0.45724332332611084\u001b[0m\u001b[1m)\u001b[0m\u001b[1m]\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "\n", "def search(query: str, top_k: int = TOP_K) -> List[Tuple[int, float]]:\n", " # Embed the query\n", " qv = embed_texts([query], model=EMBED_MODEL, batch_size=1)\n", " qv = normalize_rows(qv).astype(np.float32)\n", " if USE_FAISS and index is not None:\n", " D, I = index.search(qv, top_k)\n", " hits = list(zip(I[0].tolist(), D[0].tolist()))\n", " else:\n", " sims = (xb @ qv.T).ravel()\n", " I = np.argsort(-sims)[:top_k]\n", " hits = [(int(i), float(sims[i])) for i in I]\n", " return hits\n", "\n", "# quick smoke test (no error means it's wired up)\n", "print(search(\"What does this corpus talk about?\", 3))\n" ] }, { "cell_type": "markdown", "id": "0046d618", "metadata": {}, "source": [ "\n", "## 7) Synthesize Grounded Q&A / Instructions with Ollama\n", "\n", "We sample chunks, retrieve neighbors for richer context, and prompt a local LLM to create **high-quality** pairs.\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "1d12cf4f", "metadata": {}, "outputs": [], "source": [ "\n", "GEN_ENDPOINT = f\"{OLLAMA_URL}/api/generate\"\n", "\n", "SYSTEM_PROMPT = (\n", " \"You are a careful dataset writer. Given only the provided CONTEXT, craft high-quality, factual \"\n", " \"question–answer pairs for supervised fine-tuning. Answers must be grounded strictly in the context. \"\n", " \"If the context lacks the answer, say 'INSUFFICIENT_CONTEXT'. Focus on clarity, specificity, and avoid hallucinations.\"\n", ")\n", "\n", "USER_PROMPT_TEMPLATE = (\n", " \"CONTEXT:\\n\\n{context}\\n\\n\"\n", " \"Task: Produce {n} diverse Q&A pairs about the content above. \"\n", " \"Use JSON lines (one JSON object per line) with keys: 'input' (question/instruction), 'output' (concise grounded answer), \"\n", " \"'meta' (object with 'source_path', 'chunk_ids', and optional 'citations': list of quotes). 
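Any quotes placed in 'citations' must be copied verbatim from the context. 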
\"\n", " \"Do NOT include markdown; output JSON objects only.\"\n", ")\n", "\n", "def ollama_generate(prompt: str, model: str = GEN_MODEL, temperature: float = TEMPERATURE, num_predict: int = MAX_TOKENS_GEN) -> str:\n", " payload = {\n", " \"model\": model,\n", " \"prompt\": prompt,\n", " \"system\": SYSTEM_PROMPT,\n", " \"options\": {\n", " \"temperature\": temperature,\n", " \"num_predict\": num_predict\n", " },\n", " \"stream\": False\n", " }\n", " r = requests.post(GEN_ENDPOINT, json=payload)\n", " r.raise_for_status()\n", " data = r.json()\n", " return data.get(\"response\", \"\")\n", "\n", "def build_context(primary_idx: int, k: int = TOP_K) -> Tuple[str, List[str]]:\n", " primary_chunk = all_chunks[primary_idx]\n", " query = primary_chunk.text[:400] # use the start of the chunk as a pseudo-query\n", " hits = search(query, k)\n", " pieces, ids = [], []\n", " for i, score in hits:\n", " ch = all_chunks[i]\n", " ids.append(ch.id)\n", " pieces.append(f\"[{Path(ch.doc_path).name}::{ch.start}-{ch.end}]\\n{ch.text}\")\n", " return \"\\n\\n---\\n\\n\".join(pieces), ids\n", "\n", "def parse_llm_jsonl(text: str) -> List[Dict[str, Any]]:\n", " rows = []\n", " for line in text.splitlines():\n", " line = line.strip()\n", " if not line:\n", " continue\n", " # be forgiving for trailing commas etc.\n", " try:\n", " obj = json.loads(line)\n", " if isinstance(obj, dict):\n", " rows.append(obj)\n", " except Exception:\n", " # try to salvage with regex for JSON-ish\n", " try:\n", " fixed = regex.sub(r\",\\s*}\", \"}\", line)\n", " fixed = regex.sub(r\",\\s*]\", \"]\", fixed)\n", " obj = json.loads(fixed)\n", " if isinstance(obj, dict):\n", " rows.append(obj)\n", " except Exception:\n", " pass\n", " return rows\n" ] }, { "cell_type": "markdown", "id": "ab8c47f2", "metadata": {}, "source": [ "\n", "## 8) Generate the RAFT Dataset\n", "\n", "This step iterates over documents, samples chunks, retrieves neighbors, and asks the model to produce JSONL rows.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "74493d61", "metadata": {}, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[13]\u001b[39m\u001b[32m, line 43\u001b[39m\n\u001b[32m 40\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m[green]Wrote \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mtotal_target\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m rows -> \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mout_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m[/green]\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 41\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m out_path\n\u001b[32m---> \u001b[39m\u001b[32m43\u001b[39m OUT_JSONL = \u001b[43msynthesize_dataset\u001b[49m\u001b[43m(\u001b[49m\u001b[43msamples_per_doc\u001b[49m\u001b[43m=\u001b[49m\u001b[43mSAMPLES_PER_DOC\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 44\u001b[39m OUT_JSONL\n", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[13]\u001b[39m\u001b[32m, line 20\u001b[39m, in \u001b[36msynthesize_dataset\u001b[39m\u001b[34m(samples_per_doc, out_path)\u001b[39m\n\u001b[32m 18\u001b[39m ctx, ids = build_context(pi, k=TOP_K)\n\u001b[32m 19\u001b[39m user = USER_PROMPT_TEMPLATE.format(context=ctx, n=\u001b[32m3\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m20\u001b[39m raw = 
\u001b[43mollama_generate\u001b[49m\u001b[43m(\u001b[49m\u001b[43muser\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m=\u001b[49m\u001b[43mGEN_MODEL\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtemperature\u001b[49m\u001b[43m=\u001b[49m\u001b[43mTEMPERATURE\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mnum_predict\u001b[49m\u001b[43m=\u001b[49m\u001b[43mMAX_TOKENS_GEN\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 21\u001b[39m rows = parse_llm_jsonl(raw)\n\u001b[32m 22\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m r \u001b[38;5;129;01min\u001b[39;00m rows:\n\u001b[32m 23\u001b[39m \u001b[38;5;66;03m# enforce schema & enrich meta\u001b[39;00m\n", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 28\u001b[39m, in \u001b[36mollama_generate\u001b[39m\u001b[34m(prompt, model, temperature, num_predict)\u001b[39m\n\u001b[32m 17\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mollama_generate\u001b[39m(prompt: \u001b[38;5;28mstr\u001b[39m, model: \u001b[38;5;28mstr\u001b[39m = GEN_MODEL, temperature: \u001b[38;5;28mfloat\u001b[39m = TEMPERATURE, num_predict: \u001b[38;5;28mint\u001b[39m = MAX_TOKENS_GEN) -> \u001b[38;5;28mstr\u001b[39m:\n\u001b[32m 18\u001b[39m payload = {\n\u001b[32m 19\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mmodel\u001b[39m\u001b[33m\"\u001b[39m: model,\n\u001b[32m 20\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mprompt\u001b[39m\u001b[33m\"\u001b[39m: prompt,\n\u001b[32m (...)\u001b[39m\u001b[32m 26\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mstream\u001b[39m\u001b[33m\"\u001b[39m: \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m 27\u001b[39m }\n\u001b[32m---> \u001b[39m\u001b[32m28\u001b[39m r = \u001b[43mrequests\u001b[49m\u001b[43m.\u001b[49m\u001b[43mpost\u001b[49m\u001b[43m(\u001b[49m\u001b[43mGEN_ENDPOINT\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mjson\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpayload\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 29\u001b[39m r.raise_for_status()\n\u001b[32m 30\u001b[39m data = r.json()\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/api.py:115\u001b[39m, in \u001b[36mpost\u001b[39m\u001b[34m(url, data, json, **kwargs)\u001b[39m\n\u001b[32m 103\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mpost\u001b[39m(url, data=\u001b[38;5;28;01mNone\u001b[39;00m, json=\u001b[38;5;28;01mNone\u001b[39;00m, **kwargs):\n\u001b[32m 104\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"Sends a POST request.\u001b[39;00m\n\u001b[32m 105\u001b[39m \n\u001b[32m 106\u001b[39m \u001b[33;03m :param url: URL for the new :class:`Request` object.\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 112\u001b[39m \u001b[33;03m :rtype: requests.Response\u001b[39;00m\n\u001b[32m 113\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m115\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mpost\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mjson\u001b[49m\u001b[43m=\u001b[49m\u001b[43mjson\u001b[49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/api.py:59\u001b[39m, in \u001b[36mrequest\u001b[39m\u001b[34m(method, url, **kwargs)\u001b[39m\n\u001b[32m 55\u001b[39m \u001b[38;5;66;03m# By using the 'with' statement we are sure the session is closed, thus we\u001b[39;00m\n\u001b[32m 56\u001b[39m \u001b[38;5;66;03m# avoid leaving sockets open which can trigger a ResourceWarning in some\u001b[39;00m\n\u001b[32m 57\u001b[39m \u001b[38;5;66;03m# cases, and look like a memory leak in others.\u001b[39;00m\n\u001b[32m 58\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m sessions.Session() \u001b[38;5;28;01mas\u001b[39;00m session:\n\u001b[32m---> \u001b[39m\u001b[32m59\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msession\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m=\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/sessions.py:589\u001b[39m, in \u001b[36mSession.request\u001b[39m\u001b[34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[39m\n\u001b[32m 584\u001b[39m send_kwargs = {\n\u001b[32m 585\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mtimeout\u001b[39m\u001b[33m\"\u001b[39m: timeout,\n\u001b[32m 586\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mallow_redirects\u001b[39m\u001b[33m\"\u001b[39m: allow_redirects,\n\u001b[32m 587\u001b[39m }\n\u001b[32m 588\u001b[39m send_kwargs.update(settings)\n\u001b[32m--> \u001b[39m\u001b[32m589\u001b[39m resp = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprep\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43msend_kwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 591\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m resp\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/sessions.py:703\u001b[39m, in \u001b[36mSession.send\u001b[39m\u001b[34m(self, request, **kwargs)\u001b[39m\n\u001b[32m 700\u001b[39m start = preferred_clock()\n\u001b[32m 702\u001b[39m \u001b[38;5;66;03m# Send the request\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m703\u001b[39m r = \u001b[43madapter\u001b[49m\u001b[43m.\u001b[49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 705\u001b[39m \u001b[38;5;66;03m# Total elapsed time of the request (approximately)\u001b[39;00m\n\u001b[32m 706\u001b[39m elapsed = preferred_clock() - start\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/adapters.py:667\u001b[39m, in \u001b[36mHTTPAdapter.send\u001b[39m\u001b[34m(self, request, stream, timeout, verify, cert, proxies)\u001b[39m\n\u001b[32m 664\u001b[39m timeout = TimeoutSauce(connect=timeout, read=timeout)\n\u001b[32m 666\u001b[39m 
\u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m667\u001b[39m resp = \u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43murlopen\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 668\u001b[39m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 669\u001b[39m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m=\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 670\u001b[39m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 671\u001b[39m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 672\u001b[39m \u001b[43m \u001b[49m\u001b[43mredirect\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 673\u001b[39m \u001b[43m \u001b[49m\u001b[43massert_same_host\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 674\u001b[39m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 675\u001b[39m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 676\u001b[39m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmax_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 677\u001b[39m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 678\u001b[39m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 679\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 681\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m (ProtocolError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m 682\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m(err, request=request)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connectionpool.py:787\u001b[39m, in \u001b[36mHTTPConnectionPool.urlopen\u001b[39m\u001b[34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)\u001b[39m\n\u001b[32m 784\u001b[39m response_conn = conn \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m release_conn \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 786\u001b[39m \u001b[38;5;66;03m# Make the request on the HTTPConnection object\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m787\u001b[39m response = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_make_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 788\u001b[39m \u001b[43m \u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 789\u001b[39m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 790\u001b[39m \u001b[43m 
\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 791\u001b[39m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout_obj\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 792\u001b[39m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 793\u001b[39m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m=\u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 794\u001b[39m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 795\u001b[39m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[43m=\u001b[49m\u001b[43mretries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 796\u001b[39m \u001b[43m \u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[43m=\u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 797\u001b[39m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 798\u001b[39m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 799\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mresponse_kw\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 800\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 802\u001b[39m \u001b[38;5;66;03m# Everything went great!\u001b[39;00m\n\u001b[32m 803\u001b[39m clean_exit = \u001b[38;5;28;01mTrue\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connectionpool.py:534\u001b[39m, in \u001b[36mHTTPConnectionPool._make_request\u001b[39m\u001b[34m(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)\u001b[39m\n\u001b[32m 532\u001b[39m \u001b[38;5;66;03m# Receive the response from the server\u001b[39;00m\n\u001b[32m 533\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m534\u001b[39m response = \u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 535\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m (BaseSSLError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m 536\u001b[39m \u001b[38;5;28mself\u001b[39m._raise_timeout(err=e, url=url, timeout_value=read_timeout)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connection.py:516\u001b[39m, in \u001b[36mHTTPConnection.getresponse\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 513\u001b[39m _shutdown = \u001b[38;5;28mgetattr\u001b[39m(\u001b[38;5;28mself\u001b[39m.sock, \u001b[33m\"\u001b[39m\u001b[33mshutdown\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m)\n\u001b[32m 515\u001b[39m \u001b[38;5;66;03m# Get the response from http.client.HTTPConnection\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m516\u001b[39m httplib_response = \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 518\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 519\u001b[39m assert_header_parsing(httplib_response.msg)\n", "\u001b[36mFile 
\u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:1430\u001b[39m, in \u001b[36mHTTPConnection.getresponse\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1428\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1429\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1430\u001b[39m \u001b[43mresponse\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbegin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1431\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m:\n\u001b[32m 1432\u001b[39m \u001b[38;5;28mself\u001b[39m.close()\n", "\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:331\u001b[39m, in \u001b[36mHTTPResponse.begin\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 329\u001b[39m \u001b[38;5;66;03m# read until we get a non-100 response\u001b[39;00m\n\u001b[32m 330\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m331\u001b[39m version, status, reason = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_read_status\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 332\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m status != CONTINUE:\n\u001b[32m 333\u001b[39m \u001b[38;5;28;01mbreak\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:292\u001b[39m, in \u001b[36mHTTPResponse._read_status\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 291\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_read_status\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m292\u001b[39m line = \u001b[38;5;28mstr\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mfp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mreadline\u001b[49m\u001b[43m(\u001b[49m\u001b[43m_MAXLINE\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\u001b[43m)\u001b[49m, \u001b[33m\"\u001b[39m\u001b[33miso-8859-1\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 293\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(line) > _MAXLINE:\n\u001b[32m 294\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m LineTooLong(\u001b[33m\"\u001b[39m\u001b[33mstatus line\u001b[39m\u001b[33m\"\u001b[39m)\n", "\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/socket.py:720\u001b[39m, in \u001b[36mSocketIO.readinto\u001b[39m\u001b[34m(self, b)\u001b[39m\n\u001b[32m 718\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m 719\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m720\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_sock\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrecv_into\u001b[49m\u001b[43m(\u001b[49m\u001b[43mb\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 721\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m timeout:\n\u001b[32m 722\u001b[39m \u001b[38;5;28mself\u001b[39m._timeout_occurred = \u001b[38;5;28;01mTrue\u001b[39;00m\n", "\u001b[31mKeyboardInterrupt\u001b[39m: " ] } ], "source": [ "\n", "import datetime\n", "import time\n", "\n", "\n", "def synthesize_dataset(samples_per_doc: int = SAMPLES_PER_DOC, out_path: Path = OUTPUT_DIR / \"raft_dataset.jsonl\") -> Path:\n", " rng = random.Random(SEED)\n", " doc_to_chunk_idx = {}\n", " for i, ch in 
enumerate(all_chunks):\n", "        doc_to_chunk_idx.setdefault(ch.doc_path, []).append(i)\n", "\n", "    total_target = 0\n", "    total_docs = len(doc_to_chunk_idx)\n", "    with out_path.open(\"w\", encoding=\"utf-8\") as f:\n", "        for doc_num, (doc_path, idxs) in enumerate(doc_to_chunk_idx.items(), start=1):\n", "            print(f\"[blue]Synthesizing for: {doc_path} ({len(idxs)} chunks)[/blue]\")\n", "            percent = doc_num / total_docs * 100\n", "            print(f\"[cyan]Progress: {doc_num}/{total_docs} ({percent:.1f}%) completed[/cyan]\")\n", "            if not idxs:\n", "                continue\n", "            chosen = rng.sample(idxs, min(samples_per_doc, len(idxs)))\n", "            for pi in chosen:\n", "                ctx, ids = build_context(pi, k=TOP_K)\n", "                user = USER_PROMPT_TEMPLATE.format(context=ctx, n=3)\n", "                raw = ollama_generate(user, model=GEN_MODEL, temperature=TEMPERATURE, num_predict=MAX_TOKENS_GEN)\n", "                rows = parse_llm_jsonl(raw)\n", "                for r in rows:\n", "                    # Enforce the input/output/meta schema and enrich the metadata\n", "                    inp = r.get(\"input\") or r.get(\"question\") or r.get(\"query\")\n", "                    out = r.get(\"output\") or r.get(\"answer\") or r.get(\"response\")\n", "                    meta = r.get(\"meta\") or {}\n", "                    if not isinstance(meta, dict):\n", "                        meta = {}\n", "                    meta.update({\n", "                        \"source_path\": str(doc_path),\n", "                        \"chunk_ids\": ids,\n", "                        \"generated_at\": datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),\n", "                        \"model\": GEN_MODEL,\n", "                        \"embed_model\": EMBED_MODEL\n", "                    })\n", "                    if inp and out:\n", "                        obj = {\"input\": inp, \"output\": out, \"meta\": meta}\n", "                        f.write(json.dumps(obj, ensure_ascii=False) + \"\\n\")\n", "                        total_target += 1\n", "    print(f\"[green]Wrote {total_target} rows -> {out_path}[/green]\")\n", "    return out_path\n", "\n", "OUT_JSONL = synthesize_dataset(samples_per_doc=SAMPLES_PER_DOC)\n", "OUT_JSONL\n" ] },
{ "cell_type": "markdown", "id": "2b7d110c", "metadata": {}, "source": [ "\n", "## 9) Preview Samples\n" ] }, { "cell_type": "code", "execution_count": null, "id": "73dbd2e8", "metadata": {}, "outputs": [], "source": [ "\n", "from itertools import islice\n", "\n", "def head_jsonl(p: Path, n: int = 5):\n", "    with p.open(\"r\", encoding=\"utf-8\") as f:\n", "        for line in islice(f, n):\n", "            print(line.rstrip())\n", "\n", "head_jsonl(OUT_JSONL, 5)\n" ] },
{ "cell_type": "markdown", "id": "769567bf", "metadata": {}, "source": [ "\n", "## 10) Optional: Spot-Check Generation Quality\n", "\n", "Run a quick spot-check by re-asking questions from the generated dataset against freshly retrieved context and reviewing how well the answers stay grounded.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "18c7d3d7", "metadata": {}, "outputs": [], "source": [ "\n", "EVAL_QUESTIONS = []\n", "\n", "# Collect the first few questions (up to 5) from the generated dataset\n", "with (OUTPUT_DIR / \"raft_dataset.jsonl\").open(\"r\", encoding=\"utf-8\") as f:\n", "    for line in f:\n", "        try:\n", "            obj = json.loads(line)\n", "            EVAL_QUESTIONS.append(obj[\"input\"])\n", "        except Exception:\n", "            pass\n", "        if len(EVAL_QUESTIONS) >= 5:\n", "            break\n", "\n", "def rag_answer(q: str, k: int = TOP_K) -> str:\n", "    hits = search(q, k)\n", "    ctx = \"\\n\\n\".join([all_chunks[i].text for i,_ in hits])\n", "    user = f\"Answer the question using ONLY this context. 
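Keep the answer concise and quote the supporting passage when possible. 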
If missing, say INSUFFICIENT_CONTEXT.\\n\\nCONTEXT:\\n{ctx}\\n\\nQUESTION: {q}\"\n", "    return ollama_generate(user, model=GEN_MODEL, temperature=0.2, num_predict=256)\n", "\n", "for q in EVAL_QUESTIONS:\n", "    print(\"\\n[bold]Q:[/bold]\", q)\n", "    ans = rag_answer(q).strip()\n", "    print(\"[bold]A:[/bold]\", ans[:500] + (\" ...\" if len(ans) > 500 else \"\"))\n" ] },
{ "cell_type": "markdown", "id": "065a54cf", "metadata": {}, "source": [ "\n", "## 11) Artifacts\n", "\n", "- `outputs/raft_dataset.jsonl` — your RAFT dataset (input/output/meta per line)\n", "- `corpus/` — your source documents (you provide)\n", "- You can also persist `emb_matrix.npy` and a FAISS index for reuse.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "300fe34e", "metadata": {}, "outputs": [], "source": [ "\n", "# Optionally persist embeddings and index for later reuse\n", "np.save(OUTPUT_DIR / \"emb_matrix.npy\", emb_matrix)\n", "\n", "if USE_FAISS:\n", "    import faiss\n", "    faiss.write_index(index, str(OUTPUT_DIR / \"faiss.index\"))\n", "    print(\"[green]Saved FAISS index and embeddings.[/green]\")\n", "else:\n", "    print(\"[yellow]FAISS disabled; only saved embeddings.[/yellow]\")\n" ] },
{ "cell_type": "markdown", "id": "7bc8f310", "metadata": {}, "source": [ "\n", "## 12) Troubleshooting\n", "\n", "- **Connection error to Ollama**: ensure `ollama serve` is running and models are pulled (`ollama pull nomic-embed-text`, `ollama pull llama3.1`).\n", "- **Empty dataset**: your corpus may be too small or the parser skipped files. Check `corpus/` content and chunk parameters.\n", "- **Slow or stalled synthesis**: each sampled chunk triggers one generation request, so large corpora take a while. Lower `SAMPLES_PER_DOC` or `MAX_TOKENS_GEN`, use a smaller `GEN_MODEL`, or pass a `timeout` to `requests.post` in `ollama_generate` so a hung request raises instead of blocking indefinitely.\n", "- **Hallucinations**: tighten the system prompt, lower temperature, or increase `TOP_K` and chunk size.\n", "- **JSON parsing issues**: the notebook tries to be forgiving; you can harden `parse_llm_jsonl` per your needs.\n", "- **PDFs**: `pip install pypdf` and re-run the corpus loading cell.\n" ] } ], "metadata": { "kernelspec": { "display_name": ".env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 5 }