# SP-Distill Runtime — Operational Deployment

Inference/deploy layer for the bounded-memory chat + RAG system built on top of `ap32_large.pt`. Drop-in scripts that embody the validated operation. For training scripts and the original handoff, see the repo root and `HANDOFF.md`.

## What this gives you

- **Bounded-memory multi-turn chat** running a frozen `DeepSeek-R1-Distill-Qwen-1.5B` with SP-evict: the LLM only ever attends to ≈ **675 KV tokens** regardless of how long the conversation runs.
- **External knowledge** via a **4-tier retrieval pipeline** (cache → web search → Qwen self-verification → honest refusal), with **sentence-level BGE reranking** for focused context injection.
- **Hallucination guard**: BGE pre-gate + strict Qwen verifier + date guard + self-refusal post-gate + cutoff-aware system prompt.
- **Multiple web backends** (Wikipedia, Tavily, Brave, headless-Chromium Bing scrape), interchangeable via a single interface.
- Pure CPU runnable; same code runs on GPU.

## Files

| file | what it is |
|---|---|
| `spchat.py` | `SPChat` class. Bounded-memory chat (KEEP-CoT mode). |
| `rag.py` | `BGERetriever`. Sentence embedding + cosine top-k. |
| `refusal.py` | `detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware). |
| `cache.py` | `QueryCache`. BGE-indexed query→chunks cache (Tier 1). |
| `web_search.py` | `WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`. |
| `verifier.py` | `QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself. |
| `retrieval_pipeline.py` | `RetrievalPipeline`. 4-tier orchestrator + date guard + **sentence-level rerank**. |
| `demo_chat.py` | 4-turn chat demo. |
| `demo_rag.py` | Plain RAG demo (no gates). |
| `demo_rag_gated.py` | RAG + hallucination guard (approach E). |
| `demo_full_pipeline.py` | **Full integrated demo** (4-tier + chat + multi-turn). |
| `try_postgate_prompts.py` | Validation harness: system-prompt strictness experiment. |
| **`test_pipeline.py`** | **Smoke tests for the runtime modules** (run after install). |
| `requirements.txt` | Pinned-loose deps (includes Playwright). |

## Install

```bash
pip install -r runtime/requirements.txt

# only needed for PlaywrightBingSearch (keyless web scrape):
python -m playwright install chromium

# pull the trained checkpoint
git lfs pull --include="checkpoints/ap32_large.pt"
```

## Run

```bash
# fast module smoke tests (no LLM load)
python runtime/test_pipeline.py --fast

# full smoke (loads LLM + runs one pipeline query, ~5min on CPU)
python runtime/test_pipeline.py

# demos
python runtime/demo_chat.py            # multi-turn chat only
python runtime/demo_rag.py             # plain RAG over a fixed corpus
python runtime/demo_rag_gated.py       # RAG + hallucination guard
python runtime/demo_full_pipeline.py   # 4-tier pipeline end-to-end (Bing+Wikipedia merge)
```

## The 4-tier retrieval pipeline (`retrieval_pipeline.py`)

```
user query
   │
   ├─ looks_like_question? ─── no ─── bypass (casual chat)
   │
   ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
   │
   ├─ Web search (Composite: Bing scrape + Wikipedia by default)
   │
   ├─ Date guard: drop chunks whose year doesn't match query year
   │
   ├─ Sentence-split + BGE rerank   ← concentrates signal at sentence level
   │
   ├─ top_sim ≥ 0.80          → Tier 2 fast (trust BGE, no LLM verify)
   ├─ top_sim ∈ [0.55,0.80)   → Tier 3 (Qwen strict verifier per sentence)
   │     ├─ verified ≥ 1  → use verified sentences
   │     └─ all rejected  → Tier 4
   ├─ top_sim < 0.55          → Tier 4
   │
   └─ Tier 4: honest refusal — "I don't have reliable info on that"
```

**Successful retrievals update the cache** (Tier 1) for next time. This local cache is the device-side counterpart of the planned central-server PageRank curation layer.

## Sentence-level BGE rerank (this is the latest win)

Instead of comparing the whole retrieved document against the query, the pipeline:

1. Splits each retrieved doc into sentences (`split_sentences`)
2. Encodes each sentence with BGE
3. Ranks sentences against the query
4. Injects the **top-ranked sentences** (not the whole docs) into the chat

Why it matters: a single answer-bearing sentence ("Thimphu is the capital of Bhutan.") scores higher than a long paragraph of mixed content, AND uses far fewer tokens in the LLM's raw window. In testing, this halved gen length on the Bhutan query (645 → 321 tokens) while giving a cleaner one-sentence final answer.

## Web search backends (`web_search.py`)

All implement `search(query: str) -> List[str]`.

| backend | API key? | free tier | notes |
|---|---|---|---|
| `WikipediaSearch` | no | unlimited | encyclopedic, robust, default |
| `TavilySearch` | yes (`TAVILY_API_KEY`) | 1000 q/mo | AI-agent-tuned, clean text |
| `BraveSearch` | yes (`BRAVE_API_KEY`) | 2000 q/mo | general web, snippet-level |
| `PlaywrightBingSearch` | no | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium |
| `CompositeSearch` | depends on chain | — | `mode="fallback"` or `mode="merge"` |

Recommended for production: `Composite([Tavily, Wikipedia], "fallback")` if you have a Tavily key (best content), or `Composite([PlaywrightBing, Wikipedia], "merge")` if you want zero keys.
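The difference between the two `CompositeSearch` modes can be pictured with a small sketch. This is not the code in `web_search.py`: `SketchComposite` and `StubSearch` are hypothetical names, and dropping exact duplicates in merge mode is an assumption. Only the `search(query: str) -> List[str]` interface and the two mode names come from this README.

```python
from typing import List, Protocol


class SearchBackend(Protocol):
    """The shared interface: every backend maps a query to text chunks."""

    def search(self, query: str) -> List[str]: ...


class StubSearch:
    """Hypothetical in-memory backend standing in for a real one."""

    def __init__(self, results: List[str], fail: bool = False):
        self.results = results
        self.fail = fail

    def search(self, query: str) -> List[str]:
        if self.fail:
            raise RuntimeError("backend unavailable")
        return list(self.results)


class SketchComposite:
    """Illustrative composite; not the repo's CompositeSearch."""

    def __init__(self, backends: List[SearchBackend], mode: str = "fallback"):
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode}")
        self.backends = backends
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                chunks = backend.search(query)
            except Exception:
                continue  # a failing backend is skipped in both modes
            if self.mode == "fallback" and chunks:
                return chunks  # first non-empty result wins
            for chunk in chunks:
                if chunk not in merged:  # assumed: drop exact duplicates
                    merged.append(chunk)
        return merged


down = StubSearch([], fail=True)
empty = StubSearch([])
wiki = StubSearch(["Thimphu is the capital of Bhutan."])
bing = StubSearch(["Thimphu is the capital of Bhutan.", "Bhutan is in South Asia."])

fallback_hits = SketchComposite([down, empty, wiki], mode="fallback").search("capital of Bhutan")
merge_hits = SketchComposite([bing, wiki], mode="merge").search("capital of Bhutan")
print(fallback_hits)
print(merge_hits)
```

In this sketch, fallback mode costs one successful request in the common case, while merge mode always queries every backend and hands the larger pool to the reranker.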
## Operational settings (validated on CPU; same numbers apply to GPU)

| setting | value | why |
|---|---|---|
| `rw` (raw window) | 512 | Drift-free operating point, near-oracle deep KL |
| `drop_per_chunk` | 64 | Survivor set bounded ⇒ O(N) total compute |
| `chunk_size` | 64 | Matches training, eviction cadence |
| **chat** decode `rp`/`nr` | **1.15 / 4** | Prevents greedy loops in conversational gen |
| **RAG/math** decode `rp`/`nr` | **1.0 / 0** | Do **not** penalize repeated digits/names (faithful copying) |
| EOS between turns | required | Without it the model loops the previous response |
| CoT handling | **KEEP** (do not strip) | Strip mode breaks R1 distill's chat-template expectation |
| `max_resp` per turn | 600–800 | Lets `</think>` close |
| Tier 1 cache threshold | 0.85 | High bar so only near-identical queries reuse cache |
| Tier 2 fast threshold | 0.80 | Trust BGE without LLM verify |
| Tier 3 floor threshold | 0.55 | Below this → straight to Tier 4 |
| System prompt | `refusal.RECOMMENDED_SYSTEM` (cutoff-aware) | Maximizes self-refusal capture |
| Date guard | always on | Year mismatch (query has YYYY, chunk does not) → drop the chunk |

## Memory bound (any N)

Regardless of total conversation length:

- LLM KV at any chunk ≤ `system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens`
- SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
- Total compute **O(N)**, per-token amortized **O(1)**

## Fallback architecture (summary)

| layer | decision axis | example threshold |
|---|---|---|
| backend chain (`CompositeSearch`) | exception / empty → next | — |
| Tier 1 cache | BGE sim ≥ 0.85 | accept |
| Date guard | year(query) == year(chunk) | hard drop |
| Tier 2 fast | BGE sim ≥ 0.80 | accept |
| Tier 3 verify | sim 0.55–0.80 + Qwen YES | accept |
| Tier 4 | nothing passed | honest refusal |

Each level falls through to the next using **similarity score + Qwen verifier + date integrity** as its decision axis.
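The Tier 2 / Tier 3 / Tier 4 fall-through can be sketched as a single routing function. This is an illustration, not the logic in `retrieval_pipeline.py`: `route`, its arguments, and the choice to pass only the top `top_n` sentences to the verifier are assumptions. The 0.80 and 0.55 thresholds are the ones listed in the settings table, and the sketch starts after the cache lookup and date guard have already run.

```python
from typing import Callable, List, Tuple

TIER2_FAST = 0.80   # at or above: trust BGE, skip the verifier
TIER3_FLOOR = 0.55  # below: refuse without calling the verifier


def route(
    scored: List[Tuple[str, float]],
    verify: Callable[[str], bool],
    top_n: int = 3,
) -> Tuple[int, List[str]]:
    """Return (tier, sentences) for BGE-scored sentences (hypothetical helper)."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
    if not ranked:
        return 4, []  # nothing retrieved at all
    top_sim = ranked[0][1]
    if top_sim >= TIER2_FAST:
        return 2, [sentence for sentence, _ in ranked]
    if top_sim >= TIER3_FLOOR:
        # assumed: each candidate gets its own strict YES/NO check
        verified = [sentence for sentence, _ in ranked if verify(sentence)]
        if verified:
            return 3, verified
    return 4, []  # below the floor, or every candidate rejected


def always_yes(sentence: str) -> bool:
    return True


def always_no(sentence: str) -> bool:
    return False


tier_fast = route([("Thimphu is the capital of Bhutan.", 0.91)], always_no)
tier_verified = route([("Bhutan's capital is Thimphu.", 0.70)], always_yes)
tier_rejected = route([("Bhutan's capital is Thimphu.", 0.70)], always_no)
tier_low = route([("Unrelated text.", 0.30)], always_yes)
print(tier_fast, tier_verified, tier_rejected, tier_low)
```

Note that the verifier is never consulted in the first and last cases, which is what makes the fast tier cheap and the low-similarity refusal immediate.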
## Minimum API

```python
from runtime.spchat import SPChat
from runtime.rag import BGERetriever
from runtime.cache import QueryCache
from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
from runtime.verifier import QwenVerifier
from runtime.retrieval_pipeline import RetrievalPipeline
from runtime.refusal import RECOMMENDED_SYSTEM

# Build the components once per process.
chat = SPChat()
retriever = BGERetriever()
cache = QueryCache(retriever, sim_threshold=0.85)
web = CompositeSearch(
    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
    mode="merge",
)
verifier = QwenVerifier(chat)
pipeline = RetrievalPipeline(
    retriever, cache, web, verifier,
    tier2_fast_threshold=0.80,
    tier3_min_threshold=0.55,
    tier3_top_n=3,
)

state = chat.start_session(RECOMMENDED_SYSTEM)

user_msg = "What is the capital of Bhutan?"  # example query
res = pipeline.get(user_msg)
if res.tier == 4:
    # Nothing passed the gates: refuse instead of generating.
    reply = "I don't have reliable information on that."
else:
    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
    # RAG decode settings: no repetition penalty, so names and digits copy faithfully.
    reply = chat.turn(state, prompt, rp=1.0, nr=0)
```

## Background

See `HANDOFF.md` for the research-side story (why bounded memory works, what the eviction lever buys you, how the SP-key vision evolved into BGE-keyed RAG with central PageRank curation as the next milestone).