{
"cells": [
{
"cell_type": "markdown",
"id": "e765555c",
"metadata": {},
"source": [
"\n",
"# Retrieval-Augmented **Fine-Tuning** (RAFT) Dataset Generation with **Local Ollama**\n",
"\n",
"This notebook builds a **supervised fine-tuning dataset** (JSONL) for *retrieval-augmented* tasks by:\n",
"1. **Ingesting** your local corpus (Markdown, text, HTML; PDFs optional with extra deps).\n",
"2. **Chunking** and **embedding** documents using Ollama's local **embedding model** (e.g., `nomic-embed-text`, `mxbai-embed-large`).\n",
"3. Building a **lightweight vector index** (FAISS).\n",
"4. **Sampling contexts** and using a local **generation model** via Ollama (e.g., `llama3.1`, `qwen2`, `phi3`) to synthesize **grounded Q&A** or instruction–response pairs.\n",
"5. Emitting a **RAFT-style JSONL** for supervised training (`input`, `output`, `meta` with source citations); an illustrative row follows below.\n",
"\n",
"> **Requirements**\n",
">\n",
"> - Local [Ollama](https://ollama.com/) running at `http://localhost:11434`\n",
"> - At least one **embedding** model pulled (e.g., `ollama pull nomic-embed-text`)\n",
"> - At least one **generation** model pulled (e.g., `ollama pull llama3.1`)\n",
">\n",
"> You can adapt the prompts and schema for your specific downstream trainer (llama.cpp, vLLM, Axolotl, mlx, etc.).\n"
]
},
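{
"cell_type": "code",
"execution_count": null,
"id": "9a1c2b3d",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: the shape of one emitted RAFT row (step 8 writes these to JSONL).\n",
"# All values here are hypothetical placeholders, not real corpus content.\n",
"example_row = {\n",
"    \"input\": \"What does the corpus say about X?\",\n",
"    \"output\": \"A concise answer grounded in the retrieved context.\",\n",
"    \"meta\": {\n",
"        \"source_path\": \"corpus/example.md\",\n",
"        \"chunk_ids\": [\"<uuid4>\"],\n",
"        \"citations\": [\"a short supporting quote\"]\n",
"    }\n",
"}\n"
]
},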
{
"cell_type": "markdown",
"id": "554f4b94",
"metadata": {},
"source": [
"\n",
"## 0) Setup\n",
"\n",
"Install Python dependencies. If you're offline, pre-install these or remove the ones you don't need.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e4e8a7e0",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# If needed, uncomment:\n",
"# %pip install --quiet requests faiss-cpu rich markdownify python-frontmatter pypdf regex\n",
"# Optional extras:\n",
"# %pip install --quiet tiktoken beautifulsoup4 lxml\n"
]
},
{
"cell_type": "markdown",
"id": "a7a796e8",
"metadata": {},
"source": [
"\n",
"## 1) Configuration\n",
"\n",
"Set paths, models, and chunking/index parameters.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "562f06f9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'DATA_DIR'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'/home/marvin/repo/jupyter-ai-test/raft/corpus'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'OUTPUT_DIR'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'/home/marvin/repo/jupyter-ai-test/raft/outputs'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'OLLAMA_URL'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'http://localhost:11434'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'EMBED_MODEL'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'nomic-embed-text'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'GEN_MODEL'</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'llama3.1:8b'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'CHUNK_SIZE'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1200</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'CHUNK_OVERLAP'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">200</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'TOP_K'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">'SAMPLES_PER_DOC'</span>: <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span>\n",
"<span style=\"font-weight: bold\">}</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m{\u001b[0m\n",
" \u001b[32m'DATA_DIR'\u001b[0m: \u001b[32m'/home/marvin/repo/jupyter-ai-test/raft/corpus'\u001b[0m,\n",
" \u001b[32m'OUTPUT_DIR'\u001b[0m: \u001b[32m'/home/marvin/repo/jupyter-ai-test/raft/outputs'\u001b[0m,\n",
" \u001b[32m'OLLAMA_URL'\u001b[0m: \u001b[32m'http://localhost:11434'\u001b[0m,\n",
" \u001b[32m'EMBED_MODEL'\u001b[0m: \u001b[32m'nomic-embed-text'\u001b[0m,\n",
" \u001b[32m'GEN_MODEL'\u001b[0m: \u001b[32m'llama3.1:8b'\u001b[0m,\n",
" \u001b[32m'CHUNK_SIZE'\u001b[0m: \u001b[1;36m1200\u001b[0m,\n",
" \u001b[32m'CHUNK_OVERLAP'\u001b[0m: \u001b[1;36m200\u001b[0m,\n",
" \u001b[32m'TOP_K'\u001b[0m: \u001b[1;36m4\u001b[0m,\n",
" \u001b[32m'SAMPLES_PER_DOC'\u001b[0m: \u001b[1;36m4\u001b[0m\n",
"\u001b[1m}\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"from dataclasses import dataclass, asdict\n",
"from pathlib import Path\n",
"from typing import List, Dict, Any, Optional, Tuple\n",
"import os, re, json, uuid, math, glob, random, time\n",
"import hashlib\n",
"import requests\n",
"from rich import print\n",
"import regex\n",
"import numpy as np\n",
"\n",
"# ---- Core config ----\n",
"DATA_DIR = Path(\"./corpus\")     # Put your source docs here\n",
"OUTPUT_DIR = Path(\"./outputs\")  # Where artifacts are saved\n",
"OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"# Ollama endpoints & models\n",
"OLLAMA_URL = os.environ.get(\"OLLAMA_URL\", \"http://localhost:11434\")\n",
"EMBED_MODEL = os.environ.get(\"EMBED_MODEL\", \"nomic-embed-text\")\n",
"GEN_MODEL = os.environ.get(\"GEN_MODEL\", \"llama3.1:8b\")\n",
"\n",
"# Chunking\n",
"CHUNK_SIZE = 1200    # characters\n",
"CHUNK_OVERLAP = 200  # characters\n",
"MIN_CHARS = 200      # minimum viable chunk length\n",
"\n",
"# Index\n",
"USE_FAISS = True\n",
"TOP_K = 4\n",
"\n",
"# RAFT generation\n",
"SEED = 7\n",
"SAMPLES_PER_DOC = 4\n",
"MAX_TOKENS_GEN = 512  # generation cap, passed to Ollama as 'num_predict'\n",
"TEMPERATURE = 0.6\n",
"\n",
"random.seed(SEED)\n",
"np.random.seed(SEED)\n",
"\n",
"print({\n",
"    \"DATA_DIR\": str(DATA_DIR.resolve()),\n",
"    \"OUTPUT_DIR\": str(OUTPUT_DIR.resolve()),\n",
"    \"OLLAMA_URL\": OLLAMA_URL,\n",
"    \"EMBED_MODEL\": EMBED_MODEL,\n",
"    \"GEN_MODEL\": GEN_MODEL,\n",
"    \"CHUNK_SIZE\": CHUNK_SIZE,\n",
"    \"CHUNK_OVERLAP\": CHUNK_OVERLAP,\n",
"    \"TOP_K\": TOP_K,\n",
"    \"SAMPLES_PER_DOC\": SAMPLES_PER_DOC\n",
"})\n"
]
},
{
"cell_type": "markdown",
"id": "5f932e07",
"metadata": {},
"source": [
"\n",
"## 2) Load & Normalize Documents\n",
"\n",
"Basic loaders for `.md`, `.txt`, and `.html`. PDF support is optional (requires `pypdf`). You can extend these as needed.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b4a4b5ee",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008000; text-decoration-color: #008000\">Loaded </span><span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">265</span><span style=\"color: #008000; text-decoration-color: #008000\"> documents</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[32mLoaded \u001b[0m\u001b[1;32m265\u001b[0m\u001b[32m documents\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"265"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"from bs4 import BeautifulSoup  # if you didn't install bs4, comment out HTML support below\n",
"try:\n",
"    import frontmatter\n",
"except Exception:\n",
"    frontmatter = None\n",
"\n",
"def read_text_file(p: Path) -> str:\n",
"    return p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n",
"\n",
"def read_markdown(p: Path) -> str:\n",
"    text = p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n",
"    # Optional: strip YAML frontmatter\n",
"    if frontmatter:\n",
"        try:\n",
"            fm = frontmatter.loads(text)\n",
"            return fm.content\n",
"        except Exception:\n",
"            return text\n",
"    return text\n",
"\n",
"def read_html(p: Path) -> str:\n",
"    html = p.read_text(encoding=\"utf-8\", errors=\"ignore\")\n",
"    soup = BeautifulSoup(html, \"lxml\")\n",
"    # Remove script/style\n",
"    for tag in soup([\"script\", \"style\", \"noscript\"]):\n",
"        tag.decompose()\n",
"    text = soup.get_text(\" \", strip=True)\n",
"    return text\n",
"\n",
"def read_pdf(p: Path) -> str:\n",
"    try:\n",
"        from pypdf import PdfReader\n",
"    except Exception as e:\n",
"        print(\"[yellow]Install pypdf to enable PDF parsing: %pip install pypdf[/yellow]\")\n",
"        raise e\n",
"    reader = PdfReader(str(p))\n",
"    parts = []\n",
"    for page in reader.pages:\n",
"        try:\n",
"            parts.append(page.extract_text() or \"\")\n",
"        except Exception:\n",
"            parts.append(\"\")\n",
"    return \"\\n\".join(parts)\n",
"\n",
"SUPPORTED_EXTS = {\".txt\": read_text_file, \".md\": read_markdown, \".markdown\": read_markdown,\n",
"                  \".html\": read_html, \".htm\": read_html, \".pdf\": read_pdf}\n",
"\n",
"def load_corpus(data_dir: Path) -> Dict[str, str]:\n",
"    docs = {}\n",
"    for p in data_dir.rglob(\"*\"):\n",
"        if not p.is_file():\n",
"            continue\n",
"        ext = p.suffix.lower()\n",
"        if ext in SUPPORTED_EXTS:\n",
"            try:\n",
"                docs[str(p)] = SUPPORTED_EXTS[ext](p)\n",
"            except Exception as e:\n",
"                print(f\"[red]Failed to read {p}: {e}[/red]\")\n",
"    print(f\"[green]Loaded {len(docs)} documents[/green]\")\n",
"    return docs\n",
"\n",
"docs = load_corpus(DATA_DIR)\n",
"len(docs)\n"
]
},
{
"cell_type": "markdown",
"id": "0ffb354e",
"metadata": {},
"source": [
"\n",
"## 3) Chunking\n",
"\n",
"Simple character-based chunker with overlap. Swap in a token-based chunker if you prefer (a sketch follows the next cell).\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a1abf43d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008000; text-decoration-color: #008000\">Total chunks: </span><span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">7373</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[32mTotal chunks: \u001b[0m\u001b[1;32m7373\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"@dataclass\n",
"class Chunk:\n",
"    id: str\n",
"    doc_path: str\n",
"    start: int\n",
"    end: int\n",
"    text: str\n",
"    sha1: str\n",
"\n",
"def chunk_text(text: str, doc_path: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[Chunk]:\n",
"    chunks: List[Chunk] = []\n",
"    i = 0\n",
"    n = len(text)\n",
"    while i < n:\n",
"        j = min(i + chunk_size, n)\n",
"        piece = text[i:j].strip()\n",
"        if len(piece) >= MIN_CHARS:\n",
"            sha1 = hashlib.sha1(piece.encode(\"utf-8\")).hexdigest()\n",
"            chunks.append(Chunk(\n",
"                id=str(uuid.uuid4()),\n",
"                doc_path=doc_path,\n",
"                start=i, end=j,\n",
"                text=piece,\n",
"                sha1=sha1\n",
"            ))\n",
"        if j == n:\n",
"            break\n",
"        # Advance with overlap; progress is guaranteed while chunk_size > overlap.\n",
"        i = j - overlap\n",
"        if i < 0:\n",
"            i = 0\n",
"        if i >= n:\n",
"            break\n",
"    return chunks\n",
"\n",
"all_chunks: List[Chunk] = []\n",
"for path, text in docs.items():\n",
"    all_chunks.extend(chunk_text(text, path))\n",
"\n",
"print(f\"[green]Total chunks: {len(all_chunks)}[/green]\")\n"
]
},
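{
"cell_type": "code",
"execution_count": null,
"id": "7c0a11e5",
"metadata": {},
"outputs": [],
"source": [
"# Optional token-based chunker, a minimal sketch assuming the `tiktoken` extra is installed.\n",
"# It mirrors chunk_text() but counts GPT-style BPE tokens instead of characters; token\n",
"# counts are only approximate for Ollama models, which ship their own tokenizers.\n",
"# Note: start/end here are token offsets, not character offsets.\n",
"def chunk_text_tokens(text: str, doc_path: str, chunk_tokens: int = 300, overlap_tokens: int = 50) -> List[Chunk]:\n",
"    import tiktoken\n",
"    enc = tiktoken.get_encoding(\"cl100k_base\")\n",
"    toks = enc.encode(text)\n",
"    chunks: List[Chunk] = []\n",
"    i = 0\n",
"    while i < len(toks):\n",
"        j = min(i + chunk_tokens, len(toks))\n",
"        piece = enc.decode(toks[i:j]).strip()\n",
"        if len(piece) >= MIN_CHARS:\n",
"            chunks.append(Chunk(id=str(uuid.uuid4()), doc_path=doc_path,\n",
"                                start=i, end=j, text=piece,\n",
"                                sha1=hashlib.sha1(piece.encode(\"utf-8\")).hexdigest()))\n",
"        if j == len(toks):\n",
"            break\n",
"        i = j - overlap_tokens\n",
"    return chunks\n",
"\n",
"# Tip: the sha1 field also enables dedup, e.g. keeping only the first chunk per digest.\n"
]
},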
{
"cell_type": "markdown",
"id": "396ab28c",
"metadata": {},
"source": [
"\n",
"## 4) Embeddings via Ollama\n",
"\n",
"Uses Ollama's legacy `POST /api/embeddings` endpoint (one prompt per request) with your selected embedding model; newer Ollama versions also expose a batched `/api/embed` endpoint, sketched after the next cell. \n",
"Make sure you've pulled the model locally: `ollama pull nomic-embed-text` (or your chosen model).\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "037fc70d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7373, 768)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"EMBED_ENDPOINT = f\"{OLLAMA_URL}/api/embeddings\"\n",
"\n",
"def embed_texts(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 32) -> np.ndarray:\n",
"    vectors = []\n",
"    for i in range(0, len(texts), batch_size):\n",
"        batch = texts[i:i+batch_size]\n",
"        # The legacy /api/embeddings endpoint embeds one prompt per request,\n",
"        # so each batch is sent item by item.\n",
"        for t in batch:\n",
"            # Generous read timeout so a hung server doesn't block the notebook forever.\n",
"            r = requests.post(EMBED_ENDPOINT, json={\"model\": model, \"prompt\": t}, timeout=300)\n",
"            r.raise_for_status()\n",
"            data = r.json()\n",
"            vec = np.array(data[\"embedding\"], dtype=np.float32)\n",
"            vectors.append(vec)\n",
"    # 768 is the dimension of nomic-embed-text; only used for the empty-corpus case.\n",
"    return np.vstack(vectors) if vectors else np.zeros((0, 768), dtype=np.float32)\n",
"\n",
"chunk_texts = [c.text for c in all_chunks]\n",
"emb_matrix = embed_texts(chunk_texts, model=EMBED_MODEL, batch_size=8)\n",
"emb_matrix.shape\n"
]
},
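{
"cell_type": "code",
"execution_count": null,
"id": "b33fe4d2",
"metadata": {},
"outputs": [],
"source": [
"# Optional batched variant, a minimal sketch assuming a newer Ollama server that\n",
"# exposes POST /api/embed (request field 'input', response field 'embeddings').\n",
"# If your server predates that endpoint, stick with embed_texts() above.\n",
"def embed_texts_batched(texts: List[str], model: str = EMBED_MODEL, batch_size: int = 32) -> np.ndarray:\n",
"    vectors = []\n",
"    for i in range(0, len(texts), batch_size):\n",
"        r = requests.post(f\"{OLLAMA_URL}/api/embed\",\n",
"                          json={\"model\": model, \"input\": texts[i:i+batch_size]}, timeout=300)\n",
"        r.raise_for_status()\n",
"        vectors.extend(np.array(v, dtype=np.float32) for v in r.json()[\"embeddings\"])\n",
"    return np.vstack(vectors) if vectors else np.zeros((0, 768), dtype=np.float32)\n"
]
},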
{
"cell_type": "markdown",
"id": "5b4af05b",
"metadata": {},
"source": [
"\n",
"## 5) Build Vector Index (FAISS)\n",
"\n",
"We normalize vectors and use inner product (equivalent to cosine on normalized vectors).\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "350bd2a4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008000; text-decoration-color: #008000\">FAISS index built:</span> <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">7373</span> vectors\n",
"</pre>\n"
],
"text/plain": [
"\u001b[32mFAISS index built:\u001b[0m \u001b[1;36m7373\u001b[0m vectors\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"def normalize_rows(x: np.ndarray) -> np.ndarray:\n",
"    norms = np.linalg.norm(x, axis=1, keepdims=True) + 1e-12\n",
"    return x / norms\n",
"\n",
"if USE_FAISS:\n",
"    import faiss\n",
"    xb = normalize_rows(emb_matrix).astype(np.float32)\n",
"    d = xb.shape[1]\n",
"    index = faiss.IndexFlatIP(d)\n",
"    index.add(xb)\n",
"    print(\"[green]FAISS index built:[/green]\", index.ntotal, \"vectors\")\n",
"else:\n",
"    index = None\n",
"    xb = normalize_rows(emb_matrix).astype(np.float32)\n"
]
},
{
"cell_type": "markdown",
"id": "7f5d26cb",
"metadata": {},
"source": [
"\n",
"## 6) Retrieval Helper\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "198ddb95",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">[(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">883</span>, <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.4678046703338623</span><span style=\"font-weight: bold\">)</span>, <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">5593</span>, <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.45775240659713745</span><span style=\"font-weight: bold\">)</span>, <span style=\"font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4318</span>, <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.45724332332611084</span><span style=\"font-weight: bold\">)]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m[\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m883\u001b[0m, \u001b[1;36m0.4678046703338623\u001b[0m\u001b[1m)\u001b[0m, \u001b[1m(\u001b[0m\u001b[1;36m5593\u001b[0m, \u001b[1;36m0.45775240659713745\u001b[0m\u001b[1m)\u001b[0m, \u001b[1m(\u001b[0m\u001b[1;36m4318\u001b[0m, \u001b[1;36m0.45724332332611084\u001b[0m\u001b[1m)\u001b[0m\u001b[1m]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"def search(query: str, top_k: int = TOP_K) -> List[Tuple[int, float]]:\n",
"    # Embed the query\n",
"    qv = embed_texts([query], model=EMBED_MODEL, batch_size=1)\n",
"    qv = normalize_rows(qv).astype(np.float32)\n",
"    if USE_FAISS and index is not None:\n",
"        D, I = index.search(qv, top_k)\n",
"        hits = list(zip(I[0].tolist(), D[0].tolist()))\n",
"    else:\n",
"        sims = (xb @ qv.T).ravel()\n",
"        I = np.argsort(-sims)[:top_k]\n",
"        hits = [(int(i), float(sims[i])) for i in I]\n",
"    return hits\n",
"\n",
"# quick smoke test (no error means it's wired up)\n",
"print(search(\"What does this corpus talk about?\", 3))\n"
]
},
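{
"cell_type": "code",
"execution_count": null,
"id": "c4d5e6f7",
"metadata": {},
"outputs": [],
"source": [
"# Usage example (sketch): show which documents the top hits come from.\n",
"# The query string is arbitrary; swap in anything relevant to your corpus.\n",
"for idx, score in search(\"embedding models\", TOP_K):\n",
"    ch = all_chunks[idx]\n",
"    print(f\"{score:.3f}  {Path(ch.doc_path).name}  [{ch.start}:{ch.end}]\")\n"
]
},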
{
"cell_type": "markdown",
"id": "0046d618",
"metadata": {},
"source": [
"\n",
"## 7) Synthesize Grounded Q&A / Instructions with Ollama\n",
"\n",
"We sample chunks, retrieve neighbors for richer context, and prompt a local LLM to create **high-quality** pairs.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "1d12cf4f",
"metadata": {},
"outputs": [],
"source": [
"\n",
"GEN_ENDPOINT = f\"{OLLAMA_URL}/api/generate\"\n",
"\n",
"SYSTEM_PROMPT = (\n",
"    \"You are a careful dataset writer. Given only the provided CONTEXT, craft high-quality, factual \"\n",
"    \"question–answer pairs for supervised fine-tuning. Answers must be grounded strictly in the context. \"\n",
"    \"If the context lacks the answer, say 'INSUFFICIENT_CONTEXT'. Focus on clarity and specificity, and avoid hallucinations.\"\n",
")\n",
"\n",
"USER_PROMPT_TEMPLATE = (\n",
"    \"CONTEXT:\\n\\n{context}\\n\\n\"\n",
"    \"Task: Produce {n} diverse Q&A pairs about the content above. \"\n",
"    \"Use JSON lines (one JSON object per line) with keys: 'input' (question/instruction), 'output' (concise grounded answer), \"\n",
"    \"'meta' (object with 'source_path', 'chunk_ids', and optional 'citations': list of quotes). \"\n",
"    \"Do NOT include markdown; output JSON objects only.\"\n",
")\n",
"\n",
"def ollama_generate(prompt: str, model: str = GEN_MODEL, temperature: float = TEMPERATURE, num_predict: int = MAX_TOKENS_GEN) -> str:\n",
"    payload = {\n",
"        \"model\": model,\n",
"        \"prompt\": prompt,\n",
"        \"system\": SYSTEM_PROMPT,\n",
"        \"options\": {\n",
"            \"temperature\": temperature,\n",
"            \"num_predict\": num_predict\n",
"        },\n",
"        \"stream\": False\n",
"    }\n",
"    # Generous read timeout so a stalled generation doesn't block the notebook forever.\n",
"    r = requests.post(GEN_ENDPOINT, json=payload, timeout=600)\n",
"    r.raise_for_status()\n",
"    data = r.json()\n",
"    return data.get(\"response\", \"\")\n",
"\n",
"def build_context(primary_idx: int, k: int = TOP_K) -> Tuple[str, List[str]]:\n",
"    primary_chunk = all_chunks[primary_idx]\n",
"    query = primary_chunk.text[:400]  # use the start of the chunk as a pseudo-query\n",
"    hits = search(query, k)\n",
"    pieces, ids = [], []\n",
"    for i, score in hits:\n",
"        ch = all_chunks[i]\n",
"        ids.append(ch.id)\n",
"        pieces.append(f\"[{Path(ch.doc_path).name}::{ch.start}-{ch.end}]\\n{ch.text}\")\n",
"    return \"\\n\\n---\\n\\n\".join(pieces), ids\n",
"\n",
"def parse_llm_jsonl(text: str) -> List[Dict[str, Any]]:\n",
"    rows = []\n",
"    for line in text.splitlines():\n",
"        line = line.strip()\n",
"        if not line:\n",
"            continue\n",
"        # be forgiving about trailing commas etc.\n",
"        try:\n",
"            obj = json.loads(line)\n",
"            if isinstance(obj, dict):\n",
"                rows.append(obj)\n",
"        except Exception:\n",
"            # try to salvage JSON-ish lines by stripping trailing commas\n",
"            try:\n",
"                fixed = regex.sub(r\",\\s*}\", \"}\", line)\n",
"                fixed = regex.sub(r\",\\s*]\", \"]\", fixed)\n",
"                obj = json.loads(fixed)\n",
"                if isinstance(obj, dict):\n",
"                    rows.append(obj)\n",
"            except Exception:\n",
"                pass\n",
"    return rows\n"
]
},
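{
"cell_type": "code",
"execution_count": null,
"id": "e8f9a0b1",
"metadata": {},
"outputs": [],
"source": [
"# Optional hardening, a minimal sketch: retry transient Ollama/network failures\n",
"# with exponential backoff. Attempts and backoff base are arbitrary defaults.\n",
"def ollama_generate_with_retry(prompt: str, attempts: int = 3, backoff: float = 2.0, **kwargs) -> str:\n",
"    for attempt in range(attempts):\n",
"        try:\n",
"            return ollama_generate(prompt, **kwargs)\n",
"        except requests.RequestException as e:\n",
"            if attempt == attempts - 1:\n",
"                raise\n",
"            wait = backoff ** attempt\n",
"            print(f\"[yellow]Ollama call failed ({e}); retrying in {wait:.0f}s[/yellow]\")\n",
"            time.sleep(wait)\n",
"    return \"\"\n"
]
},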
{
"cell_type": "markdown",
"id": "ab8c47f2",
"metadata": {},
"source": [
"\n",
"## 8) Generate the RAFT Dataset\n",
"\n",
"This step iterates over documents, samples chunks, retrieves neighbors, and asks the model to produce JSONL rows.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "74493d61",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #000080; text-decoration-color: #000080\">Synthesizing for: corpus/</span><span style=\"color: #000080; text-decoration-color: #000080\">topic</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080\">2__part</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080\">066__n</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">60.</span><span style=\"color: #000080; text-decoration-color: #000080\">txt </span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">(</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">24</span><span style=\"color: #000080; text-decoration-color: #000080\"> chunks</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">)</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[34mSynthesizing for: corpus/\u001b[0m\u001b[34mtopic\u001b[0m\u001b[34m=\u001b[0m\u001b[34m2__part\u001b[0m\u001b[34m=\u001b[0m\u001b[34m066__n\u001b[0m\u001b[34m=\u001b[0m\u001b[1;34m60\u001b[0m\u001b[1;34m.\u001b[0m\u001b[34mtxt \u001b[0m\u001b[1;34m(\u001b[0m\u001b[1;34m24\u001b[0m\u001b[34m chunks\u001b[0m\u001b[1;34m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080\">Progress: </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"color: #008080; text-decoration-color: #008080\">/</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">265</span><span style=\"color: #008080; text-decoration-color: #008080\"> </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.4</span><span style=\"color: #008080; text-decoration-color: #008080\">%</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">)</span><span style=\"color: #008080; text-decoration-color: #008080\"> completed</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[36mProgress: \u001b[0m\u001b[1;36m1\u001b[0m\u001b[36m/\u001b[0m\u001b[1;36m265\u001b[0m\u001b[36m \u001b[0m\u001b[1;36m(\u001b[0m\u001b[1;36m0.4\u001b[0m\u001b[36m%\u001b[0m\u001b[1;36m)\u001b[0m\u001b[36m completed\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #000080; text-decoration-color: #000080\">Synthesizing for: corpus/</span><span style=\"color: #000080; text-decoration-color: #000080\">topic</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080\">5__part</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080\">018__n</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">60.</span><span style=\"color: #000080; text-decoration-color: #000080\">txt </span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">(</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">52</span><span style=\"color: #000080; text-decoration-color: #000080\"> chunks</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">)</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[34mSynthesizing for: corpus/\u001b[0m\u001b[34mtopic\u001b[0m\u001b[34m=\u001b[0m\u001b[34m5__part\u001b[0m\u001b[34m=\u001b[0m\u001b[34m018__n\u001b[0m\u001b[34m=\u001b[0m\u001b[1;34m60\u001b[0m\u001b[1;34m.\u001b[0m\u001b[34mtxt \u001b[0m\u001b[1;34m(\u001b[0m\u001b[1;34m52\u001b[0m\u001b[34m chunks\u001b[0m\u001b[1;34m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080\">Progress: </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span><span style=\"color: #008080; text-decoration-color: #008080\">/</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">265</span><span style=\"color: #008080; text-decoration-color: #008080\"> </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0.8</span><span style=\"color: #008080; text-decoration-color: #008080\">%</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">)</span><span style=\"color: #008080; text-decoration-color: #008080\"> completed</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[36mProgress: \u001b[0m\u001b[1;36m2\u001b[0m\u001b[36m/\u001b[0m\u001b[1;36m265\u001b[0m\u001b[36m \u001b[0m\u001b[1;36m(\u001b[0m\u001b[1;36m0.8\u001b[0m\u001b[36m%\u001b[0m\u001b[1;36m)\u001b[0m\u001b[36m completed\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #000080; text-decoration-color: #000080\">Synthesizing for: corpus/</span><span style=\"color: #000080; text-decoration-color: #000080\">topic</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080\">7__part</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080\">005__n</span><span style=\"color: #000080; text-decoration-color: #000080\">=</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">60.</span><span style=\"color: #000080; text-decoration-color: #000080\">txt </span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">(</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">49</span><span style=\"color: #000080; text-decoration-color: #000080\"> chunks</span><span style=\"color: #000080; text-decoration-color: #000080; font-weight: bold\">)</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[34mSynthesizing for: corpus/\u001b[0m\u001b[34mtopic\u001b[0m\u001b[34m=\u001b[0m\u001b[34m7__part\u001b[0m\u001b[34m=\u001b[0m\u001b[34m005__n\u001b[0m\u001b[34m=\u001b[0m\u001b[1;34m60\u001b[0m\u001b[1;34m.\u001b[0m\u001b[34mtxt \u001b[0m\u001b[1;34m(\u001b[0m\u001b[1;34m49\u001b[0m\u001b[34m chunks\u001b[0m\u001b[1;34m)\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080\">Progress: </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span><span style=\"color: #008080; text-decoration-color: #008080\">/</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">265</span><span style=\"color: #008080; text-decoration-color: #008080\"> </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">(</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1.1</span><span style=\"color: #008080; text-decoration-color: #008080\">%</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">)</span><span style=\"color: #008080; text-decoration-color: #008080\"> completed</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[36mProgress: \u001b[0m\u001b[1;36m3\u001b[0m\u001b[36m/\u001b[0m\u001b[1;36m265\u001b[0m\u001b[36m \u001b[0m\u001b[1;36m(\u001b[0m\u001b[1;36m1.1\u001b[0m\u001b[36m%\u001b[0m\u001b[1;36m)\u001b[0m\u001b[36m completed\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"ename": "KeyboardInterrupt",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[14]\u001b[39m\u001b[32m, line 48\u001b[39m\n\u001b[32m 45\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m[green]Wrote \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mtotal_target\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m rows -> \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mout_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m[/green]\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 46\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m out_path\n\u001b[32m---> \u001b[39m\u001b[32m48\u001b[39m OUT_JSONL = \u001b[43msynthesize_dataset\u001b[49m\u001b[43m(\u001b[49m\u001b[43msamples_per_doc\u001b[49m\u001b[43m=\u001b[49m\u001b[43mSAMPLES_PER_DOC\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 49\u001b[39m OUT_JSONL\n",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[14]\u001b[39m\u001b[32m, line 25\u001b[39m, in \u001b[36msynthesize_dataset\u001b[39m\u001b[34m(samples_per_doc, out_path)\u001b[39m\n\u001b[32m 23\u001b[39m ctx, ids = build_context(pi, k=TOP_K)\n\u001b[32m 24\u001b[39m user = USER_PROMPT_TEMPLATE.format(context=ctx, n=\u001b[32m3\u001b[39m)\n\u001b[32m---> \u001b[39m\u001b[32m25\u001b[39m raw = \u001b[43mollama_generate\u001b[49m\u001b[43m(\u001b[49m\u001b[43muser\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m=\u001b[49m\u001b[43mGEN_MODEL\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtemperature\u001b[49m\u001b[43m=\u001b[49m\u001b[43mTEMPERATURE\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mnum_predict\u001b[49m\u001b[43m=\u001b[49m\u001b[43mMAX_TOKENS_GEN\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 26\u001b[39m rows = parse_llm_jsonl(raw)\n\u001b[32m 27\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m r \u001b[38;5;129;01min\u001b[39;00m rows:\n\u001b[32m 28\u001b[39m \u001b[38;5;66;03m# enforce schema & enrich meta\u001b[39;00m\n",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 28\u001b[39m, in \u001b[36mollama_generate\u001b[39m\u001b[34m(prompt, model, temperature, num_predict)\u001b[39m\n\u001b[32m 17\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mollama_generate\u001b[39m(prompt: \u001b[38;5;28mstr\u001b[39m, model: \u001b[38;5;28mstr\u001b[39m = GEN_MODEL, temperature: \u001b[38;5;28mfloat\u001b[39m = TEMPERATURE, num_predict: \u001b[38;5;28mint\u001b[39m = MAX_TOKENS_GEN) -> \u001b[38;5;28mstr\u001b[39m:\n\u001b[32m 18\u001b[39m payload = {\n\u001b[32m 19\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mmodel\u001b[39m\u001b[33m\"\u001b[39m: model,\n\u001b[32m 20\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mprompt\u001b[39m\u001b[33m\"\u001b[39m: prompt,\n\u001b[32m (...)\u001b[39m\u001b[32m 26\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mstream\u001b[39m\u001b[33m\"\u001b[39m: \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m 27\u001b[39m }\n\u001b[32m---> \u001b[39m\u001b[32m28\u001b[39m r = \u001b[43mrequests\u001b[49m\u001b[43m.\u001b[49m\u001b[43mpost\u001b[49m\u001b[43m(\u001b[49m\u001b[43mGEN_ENDPOINT\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mjson\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpayload\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 29\u001b[39m r.raise_for_status()\n\u001b[32m 30\u001b[39m data = r.json()\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/api.py:115\u001b[39m, in \u001b[36mpost\u001b[39m\u001b[34m(url, data, json, **kwargs)\u001b[39m\n\u001b[32m 103\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mpost\u001b[39m(url, data=\u001b[38;5;28;01mNone\u001b[39;00m, json=\u001b[38;5;28;01mNone\u001b[39;00m, **kwargs):\n\u001b[32m 104\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"Sends a POST request.\u001b[39;00m\n\u001b[32m 105\u001b[39m \n\u001b[32m 106\u001b[39m \u001b[33;03m :param url: URL for the new :class:`Request` object.\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 112\u001b[39m \u001b[33;03m :rtype: requests.Response\u001b[39;00m\n\u001b[32m 113\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m115\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mpost\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mjson\u001b[49m\u001b[43m=\u001b[49m\u001b[43mjson\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/api.py:59\u001b[39m, in \u001b[36mrequest\u001b[39m\u001b[34m(method, url, **kwargs)\u001b[39m\n\u001b[32m 55\u001b[39m \u001b[38;5;66;03m# By using the 'with' statement we are sure the session is closed, thus we\u001b[39;00m\n\u001b[32m 56\u001b[39m \u001b[38;5;66;03m# avoid leaving sockets open which can trigger a ResourceWarning in some\u001b[39;00m\n\u001b[32m 57\u001b[39m \u001b[38;5;66;03m# cases, and look like a memory leak in others.\u001b[39;00m\n\u001b[32m 58\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m sessions.Session() \u001b[38;5;28;01mas\u001b[39;00m session:\n\u001b[32m---> \u001b[39m\u001b[32m59\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msession\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m=\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/sessions.py:589\u001b[39m, in \u001b[36mSession.request\u001b[39m\u001b[34m(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)\u001b[39m\n\u001b[32m 584\u001b[39m send_kwargs = {\n\u001b[32m 585\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mtimeout\u001b[39m\u001b[33m\"\u001b[39m: timeout,\n\u001b[32m 586\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mallow_redirects\u001b[39m\u001b[33m\"\u001b[39m: allow_redirects,\n\u001b[32m 587\u001b[39m }\n\u001b[32m 588\u001b[39m send_kwargs.update(settings)\n\u001b[32m--> \u001b[39m\u001b[32m589\u001b[39m resp = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprep\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43msend_kwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 591\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m resp\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/sessions.py:703\u001b[39m, in \u001b[36mSession.send\u001b[39m\u001b[34m(self, request, **kwargs)\u001b[39m\n\u001b[32m 700\u001b[39m start = preferred_clock()\n\u001b[32m 702\u001b[39m \u001b[38;5;66;03m# Send the request\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m703\u001b[39m r = \u001b[43madapter\u001b[49m\u001b[43m.\u001b[49m\u001b[43msend\u001b[49m\u001b[43m(\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 705\u001b[39m \u001b[38;5;66;03m# Total elapsed time of the request (approximately)\u001b[39;00m\n\u001b[32m 706\u001b[39m elapsed = preferred_clock() - start\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/requests/adapters.py:667\u001b[39m, in \u001b[36mHTTPAdapter.send\u001b[39m\u001b[34m(self, request, stream, timeout, verify, cert, proxies)\u001b[39m\n\u001b[32m 664\u001b[39m timeout = TimeoutSauce(connect=timeout, read=timeout)\n\u001b[32m 666\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m667\u001b[39m resp = \u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43murlopen\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 668\u001b[39m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 669\u001b[39m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m=\u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 670\u001b[39m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 671\u001b[39m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrequest\u001b[49m\u001b[43m.\u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 672\u001b[39m \u001b[43m \u001b[49m\u001b[43mredirect\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 673\u001b[39m \u001b[43m \u001b[49m\u001b[43massert_same_host\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 674\u001b[39m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 675\u001b[39m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 676\u001b[39m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmax_retries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 677\u001b[39m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 678\u001b[39m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 679\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 681\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m (ProtocolError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[32m 682\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m(err, request=request)\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connectionpool.py:787\u001b[39m, in \u001b[36mHTTPConnectionPool.urlopen\u001b[39m\u001b[34m(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, preload_content, decode_content, **response_kw)\u001b[39m\n\u001b[32m 784\u001b[39m response_conn = conn \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m release_conn \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 786\u001b[39m \u001b[38;5;66;03m# Make the request on the HTTPConnection object\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m787\u001b[39m response = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_make_request\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 788\u001b[39m \u001b[43m \u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 789\u001b[39m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 790\u001b[39m \u001b[43m \u001b[49m\u001b[43murl\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 791\u001b[39m \u001b[43m \u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout_obj\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 792\u001b[39m \u001b[43m \u001b[49m\u001b[43mbody\u001b[49m\u001b[43m=\u001b[49m\u001b[43mbody\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 793\u001b[39m \u001b[43m \u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m=\u001b[49m\u001b[43mheaders\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 794\u001b[39m \u001b[43m \u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m=\u001b[49m\u001b[43mchunked\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 795\u001b[39m \u001b[43m \u001b[49m\u001b[43mretries\u001b[49m\u001b[43m=\u001b[49m\u001b[43mretries\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 796\u001b[39m \u001b[43m \u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[43m=\u001b[49m\u001b[43mresponse_conn\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 797\u001b[39m \u001b[43m \u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpreload_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 798\u001b[39m \u001b[43m \u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdecode_content\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 799\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mresponse_kw\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 800\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 802\u001b[39m \u001b[38;5;66;03m# Everything went great!\u001b[39;00m\n\u001b[32m 803\u001b[39m clean_exit = \u001b[38;5;28;01mTrue\u001b[39;00m\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connectionpool.py:534\u001b[39m, in \u001b[36mHTTPConnectionPool._make_request\u001b[39m\u001b[34m(self, conn, method, url, body, headers, retries, timeout, chunked, response_conn, preload_content, decode_content, enforce_content_length)\u001b[39m\n\u001b[32m 532\u001b[39m \u001b[38;5;66;03m# Receive the response from the server\u001b[39;00m\n\u001b[32m 533\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m534\u001b[39m response = \u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 535\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m (BaseSSLError, \u001b[38;5;167;01mOSError\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m e:\n\u001b[32m 536\u001b[39m \u001b[38;5;28mself\u001b[39m._raise_timeout(err=e, url=url, timeout_value=read_timeout)\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/repo/jupyter-ai-test/.env/lib/python3.12/site-packages/urllib3/connection.py:516\u001b[39m, in \u001b[36mHTTPConnection.getresponse\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 513\u001b[39m _shutdown = \u001b[38;5;28mgetattr\u001b[39m(\u001b[38;5;28mself\u001b[39m.sock, \u001b[33m\"\u001b[39m\u001b[33mshutdown\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m)\n\u001b[32m 515\u001b[39m \u001b[38;5;66;03m# Get the response from http.client.HTTPConnection\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m516\u001b[39m httplib_response = \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgetresponse\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 518\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 519\u001b[39m assert_header_parsing(httplib_response.msg)\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:1430\u001b[39m, in \u001b[36mHTTPConnection.getresponse\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1428\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1429\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1430\u001b[39m \u001b[43mresponse\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbegin\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1431\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mConnectionError\u001b[39;00m:\n\u001b[32m 1432\u001b[39m \u001b[38;5;28mself\u001b[39m.close()\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:331\u001b[39m, in \u001b[36mHTTPResponse.begin\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 329\u001b[39m \u001b[38;5;66;03m# read until we get a non-100 response\u001b[39;00m\n\u001b[32m 330\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m331\u001b[39m version, status, reason = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_read_status\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 332\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m status != CONTINUE:\n\u001b[32m 333\u001b[39m \u001b[38;5;28;01mbreak\u001b[39;00m\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/http/client.py:292\u001b[39m, in \u001b[36mHTTPResponse._read_status\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 291\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_read_status\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m292\u001b[39m line = \u001b[38;5;28mstr\u001b[39m(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mfp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mreadline\u001b[49m\u001b[43m(\u001b[49m\u001b[43m_MAXLINE\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\u001b[43m)\u001b[49m, \u001b[33m\"\u001b[39m\u001b[33miso-8859-1\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 293\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(line) > _MAXLINE:\n\u001b[32m 294\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m LineTooLong(\u001b[33m\"\u001b[39m\u001b[33mstatus line\u001b[39m\u001b[33m\"\u001b[39m)\n",
"\u001b[36mFile \u001b[39m\u001b[32m~/.pyenv/versions/3.12.12/lib/python3.12/socket.py:720\u001b[39m, in \u001b[36mSocketIO.readinto\u001b[39m\u001b[34m(self, b)\u001b[39m\n\u001b[32m 718\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m:\n\u001b[32m 719\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m720\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_sock\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrecv_into\u001b[49m\u001b[43m(\u001b[49m\u001b[43mb\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 721\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m timeout:\n\u001b[32m 722\u001b[39m \u001b[38;5;28mself\u001b[39m._timeout_occurred = \u001b[38;5;28;01mTrue\u001b[39;00m\n",
"\u001b[31mKeyboardInterrupt\u001b[39m: "
]
}
],
"source": [
"\n",
"import datetime\n",
"\n",
"def synthesize_dataset(samples_per_doc: int = SAMPLES_PER_DOC, out_path: Path = OUTPUT_DIR / \"raft_dataset.jsonl\") -> Path:\n",
"    rng = random.Random(SEED)\n",
"    doc_to_chunk_idx = {}\n",
"    for i, ch in enumerate(all_chunks):\n",
"        doc_to_chunk_idx.setdefault(ch.doc_path, []).append(i)\n",
"\n",
"    total_target = 0\n",
"    total_docs = len(doc_to_chunk_idx)\n",
"    with out_path.open(\"w\", encoding=\"utf-8\") as f:\n",
"        for doc_idx, (doc_path, idxs) in enumerate(doc_to_chunk_idx.items(), start=1):\n",
"            print(f\"[blue]Synthesizing for: {doc_path} ({len(idxs)} chunks)[/blue]\")\n",
"            percent = doc_idx / total_docs * 100\n",
"            print(f\"[cyan]Progress: {doc_idx}/{total_docs} ({percent:.1f}%) completed[/cyan]\")\n",
"            if not idxs:\n",
"                continue\n",
"            chosen = rng.sample(idxs, min(samples_per_doc, len(idxs)))\n",
"            for pi in chosen:\n",
"                ctx, ids = build_context(pi, k=TOP_K)\n",
"                user = USER_PROMPT_TEMPLATE.format(context=ctx, n=3)\n",
"                raw = ollama_generate(user, model=GEN_MODEL, temperature=TEMPERATURE, num_predict=MAX_TOKENS_GEN)\n",
"                rows = parse_llm_jsonl(raw)\n",
"                for r in rows:\n",
"                    # enforce schema & enrich meta\n",
"                    inp = r.get(\"input\") or r.get(\"question\") or r.get(\"query\")\n",
"                    out = r.get(\"output\") or r.get(\"answer\") or r.get(\"response\")\n",
"                    meta = r.get(\"meta\") or {}\n",
"                    if not isinstance(meta, dict):\n",
"                        meta = {}\n",
"                    meta.update({\n",
"                        \"source_path\": str(doc_path),\n",
"                        \"chunk_ids\": ids,\n",
"                        \"generated_at\": datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),\n",
"                        \"model\": GEN_MODEL,\n",
"                        \"embed_model\": EMBED_MODEL\n",
"                    })\n",
"                    if inp and out:\n",
"                        obj = {\"input\": inp, \"output\": out, \"meta\": meta}\n",
"                        f.write(json.dumps(obj, ensure_ascii=False) + \"\\n\")\n",
"                        total_target += 1\n",
"    print(f\"[green]Wrote {total_target} rows -> {out_path}[/green]\")\n",
"    return out_path\n",
"\n",
"OUT_JSONL = synthesize_dataset(samples_per_doc=SAMPLES_PER_DOC)\n",
"OUT_JSONL\n"
]
},
{
"cell_type": "markdown",
"id": "2b7d110c",
"metadata": {},
"source": [
"\n",
"## 9) Preview Samples\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73dbd2e8",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from itertools import islice\n",
"\n",
"def head_jsonl(p: Path, n: int = 5):\n",
"    with p.open(\"r\", encoding=\"utf-8\") as f:\n",
"        for line in islice(f, n):\n",
"            print(line.rstrip())\n",
"\n",
"head_jsonl(OUT_JSONL, 5)\n"
]
},
{
"cell_type": "markdown",
"id": "769567bf",
"metadata": {},
"source": [
"\n",
"## 10) Optional: Spot-Check Generation Quality\n",
"\n",
"Run a tiny evaluation by asking the model with and without retrieval and comparing the answers (a no-retrieval baseline follows the next cell).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18c7d3d7",
"metadata": {},
"outputs": [],
"source": [
"\n",
"EVAL_QUESTIONS = []\n",
"\n",
"# Collect inputs from the dataset (first N)\n",
"with (OUTPUT_DIR / \"raft_dataset.jsonl\").open(\"r\", encoding=\"utf-8\") as f:\n",
"    for i, line in enumerate(f):\n",
"        try:\n",
"            obj = json.loads(line)\n",
"            EVAL_QUESTIONS.append(obj[\"input\"])\n",
"        except Exception:\n",
"            pass\n",
"        if len(EVAL_QUESTIONS) >= 5:\n",
"            break\n",
"\n",
"def rag_answer(q: str, k: int = TOP_K) -> str:\n",
"    hits = search(q, k)\n",
"    ctx = \"\\n\\n\".join([all_chunks[i].text for i, _ in hits])\n",
"    user = f\"Answer the question using ONLY this context. If missing, say INSUFFICIENT_CONTEXT.\\n\\nCONTEXT:\\n{ctx}\\n\\nQUESTION: {q}\"\n",
"    return ollama_generate(user, model=GEN_MODEL, temperature=0.2, num_predict=256)\n",
"\n",
"for q in EVAL_QUESTIONS:\n",
"    print(\"\\n[bold]Q:[/bold]\", q)\n",
"    ans = rag_answer(q)\n",
"    print(\"[bold]A:[/bold]\", ans.strip()[:500], \"...\")\n"
]
},
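{
"cell_type": "code",
"execution_count": null,
"id": "ab5a713e",
"metadata": {},
"outputs": [],
"source": [
"# A minimal no-retrieval baseline to complete the with/without comparison above.\n",
"# Sketch only: the model answers from its own weights, so differences vs. rag_answer()\n",
"# indicate how much the retrieved context actually contributes.\n",
"def plain_answer(q: str) -> str:\n",
"    return ollama_generate(f\"QUESTION: {q}\\nAnswer concisely.\", model=GEN_MODEL, temperature=0.2, num_predict=256)\n",
"\n",
"for q in EVAL_QUESTIONS[:2]:\n",
"    print(\"\\n[bold]Q:[/bold]\", q)\n",
"    print(\"[bold]No-RAG:[/bold]\", plain_answer(q).strip()[:300])\n",
"    print(\"[bold]RAG:[/bold]\", rag_answer(q).strip()[:300])\n"
]
},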
{
"cell_type": "markdown",
"id": "065a54cf",
"metadata": {},
"source": [
"\n",
"## 11) Artifacts\n",
"\n",
"- `outputs/raft_dataset.jsonl` — your RAFT dataset (input/output/meta per line)\n",
"- `corpus/` — your source documents (you provide)\n",
"- You can also persist `emb_matrix.npy` and a FAISS index for reuse (reloading is sketched after the next cell).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "300fe34e",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Optionally persist embeddings and index for later reuse\n",
"np.save(OUTPUT_DIR / \"emb_matrix.npy\", emb_matrix)\n",
"\n",
"if USE_FAISS:\n",
"    import faiss\n",
"    faiss.write_index(index, str(OUTPUT_DIR / \"faiss.index\"))\n",
"    print(\"[green]Saved FAISS index and embeddings.[/green]\")\n",
"else:\n",
"    print(\"[yellow]FAISS disabled; only saved embeddings.[/yellow]\")\n"
]
},
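{
"cell_type": "code",
"execution_count": null,
"id": "0badf00d",
"metadata": {},
"outputs": [],
"source": [
"# Reloading the persisted artifacts in a later session (sketch, using the paths saved above).\n",
"# Note: all_chunks is not persisted here; re-run ingestion/chunking (or save it yourself,\n",
"# e.g. as JSON) so index positions can be mapped back to chunk text.\n",
"emb_matrix = np.load(OUTPUT_DIR / \"emb_matrix.npy\")\n",
"xb = normalize_rows(emb_matrix).astype(np.float32)\n",
"if USE_FAISS:\n",
"    import faiss\n",
"    index = faiss.read_index(str(OUTPUT_DIR / \"faiss.index\"))\n",
"    print(\"[green]Reloaded FAISS index:[/green]\", index.ntotal, \"vectors\")\n"
]
},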
{
"cell_type": "markdown",
"id": "7bc8f310",
"metadata": {},
"source": [
"\n",
"## 12) Troubleshooting\n",
"\n",
"- **Connection error to Ollama**: ensure `ollama serve` is running and the models are pulled (`ollama pull nomic-embed-text`, `ollama pull llama3.1`).\n",
"- **Empty dataset**: your corpus may be too small, or the parser skipped files. Check the `corpus/` content and the chunk parameters.\n",
"- **Hallucinations**: tighten the system prompt, lower the temperature, or increase `TOP_K` and the chunk size.\n",
"- **JSON parsing issues**: the notebook tries to be forgiving; you can harden `parse_llm_jsonl` to suit your needs.\n",
"- **PDFs**: `pip install pypdf` and try again.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}