| # SP-Distill Runtime: Operational Deployment |
|
|
| Inference/deploy layer for the bounded-memory chat + RAG system built on |
| top of `ap32_large.pt`. Drop-in scripts that implement the validated operating configuration. |
| For training scripts and the original handoff, see the repo root and `HANDOFF.md`. |
|
|
| ## What this gives you |
|
|
| - **Bounded-memory multi-turn chat** running a frozen |
| `DeepSeek-R1-Distill-Qwen-1.5B` with SP-evict: the LLM only ever attends |
| to ≈ **675 KV tokens** regardless of how long the conversation runs. |
| - **External knowledge** via a **4-tier retrieval pipeline** |
| (cache → web search → Qwen self-verification → honest refusal), |
| with **sentence-level BGE reranking** for focused context injection. |
| - **Hallucination guard**: BGE pre-gate + strict Qwen verifier + date guard |
| + self-refusal post-gate + cutoff-aware system prompt. |
| - **Multiple web backends** (Wikipedia, Tavily, Brave, headless-Chromium |
| Bing scrape), interchangeable via a single interface. |
| - Runs on CPU alone; the same code runs on GPU. |
|
|
| ## Files |
|
|
| | file | what it is | |
| |---|---| |
| | `spchat.py` | `SPChat` class. Bounded-memory chat (KEEP-CoT mode). | |
| | `rag.py` | `BGERetriever`. Sentence embedding + cosine top-k. | |
| | `refusal.py` | `detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware). | |
| `cache.py` | `QueryCache`. BGE-indexed query→chunks cache (Tier 1). |
| | `web_search.py` | `WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`. | |
| | `verifier.py` | `QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself. | |
| | `retrieval_pipeline.py` | `RetrievalPipeline`. 4-tier orchestrator + date guard + **sentence-level rerank**. | |
| | `demo_chat.py` | 4-turn chat demo. | |
| | `demo_rag.py` | Plain RAG demo (no gates). | |
| | `demo_rag_gated.py` | RAG + hallucination guard (approach E). | |
| | `demo_full_pipeline.py` | **Full integrated demo** (4-tier + chat + multi-turn). | |
| | `try_postgate_prompts.py` | Validation harness: system-prompt strictness experiment. | |
| | **`test_pipeline.py`** | **Smoke tests for the runtime modules** (run after install). | |
| `requirements.txt` | Loosely pinned deps (includes Playwright). |
| |
| ## Install |
| |
| ```bash |
| pip install -r runtime/requirements.txt |
| |
| # only needed for PlaywrightBingSearch (keyless web scrape): |
| python -m playwright install chromium |
| |
| # pull the trained checkpoint |
| git lfs pull --include="checkpoints/ap32_large.pt" |
| ``` |
| |
| ## Run |
| |
| ```bash |
| # fast module smoke tests (no LLM load) |
| python runtime/test_pipeline.py --fast |
| |
| # full smoke (loads LLM + runs one pipeline query, ~5min on CPU) |
| python runtime/test_pipeline.py |
| |
| # demos |
| python runtime/demo_chat.py # multi-turn chat only |
| python runtime/demo_rag.py # plain RAG over a fixed corpus |
| python runtime/demo_rag_gated.py # RAG + hallucination guard |
| python runtime/demo_full_pipeline.py # 4-tier pipeline end-to-end (Bing+Wikipedia merge) |
| ``` |
| |
| ## The 4-tier retrieval pipeline (`retrieval_pipeline.py`) |
| |
| ``` |
| user query |
|   │ |
|   ├─ looks_like_question? ─── no ──→ bypass (casual chat) |
|   │ |
|   ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query) |
|   │ |
|   ├─ Web search (Composite: Bing scrape + Wikipedia by default) |
|   │ |
|   ├─ Date guard: drop chunks whose year doesn't match query year |
|   │ |
|   ├─ Sentence-split + BGE rerank ← concentrates signal at sentence level |
|   │ |
|   ├─ top_sim ≥ 0.80 → Tier 2 fast (trust BGE, no LLM verify) |
|   ├─ top_sim ∈ [0.55,0.80) → Tier 3 (Qwen strict verifier per sentence) |
|   │     ├─ verified ≥ 1 → use verified sentences |
|   │     └─ all rejected → Tier 4 |
|   ├─ top_sim < 0.55 → Tier 4 |
|   │ |
|   └─ Tier 4: honest refusal → "I don't have reliable info on that" |
| ``` |
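
The threshold branching in the diagram above can be sketched as a small routing function. This is a minimal illustration, not the code in `retrieval_pipeline.py`: the name `route`, the `(sentence, score)` input shape, and the choice to pass the top `top_n` sentences to both Tier 2 and Tier 3 are assumptions; only the 0.80 and 0.55 thresholds come from the diagram.

```python
from typing import Callable, List, Tuple

TIER2_FAST = 0.80  # at or above: trust the similarity score alone
TIER3_MIN = 0.55   # below: go straight to refusal


def route(scored: List[Tuple[str, float]],
          verify: Callable[[str], bool],
          top_n: int = 3) -> Tuple[int, List[str]]:
    """Return (tier, sentences) for sentences already scored against the query.

    `verify` stands in for the strict YES/NO verifier (hypothetical signature).
    """
    if not scored:
        return 4, []
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    top_sim = ranked[0][1]
    candidates = [sentence for sentence, _ in ranked[:top_n]]
    if top_sim >= TIER2_FAST:
        return 2, candidates              # Tier 2: no LLM verification
    if top_sim >= TIER3_MIN:
        verified = [s for s in candidates if verify(s)]
        if verified:
            return 3, verified            # Tier 3: at least one sentence verified
    return 4, []                          # Tier 4: honest refusal
```

A caller would map tier 4 to the refusal message and otherwise inject the returned sentences as context.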
| |
| **Successful retrievals update the cache** (Tier 1) for next time. This local |
| cache is the device-side counterpart of the planned central-server PageRank |
| curation layer. |
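
As a rough illustration of the Tier 1 idea (reuse stored chunks when a new query is close enough to a past one), here is a minimal sketch. `SketchCache`, `toy_embed`, and the `lookup`/`store` method names are hypothetical and do not describe the real `QueryCache` interface; the bag-of-words embedding is a stand-in for BGE so the example runs without model weights. Only the 0.85 threshold comes from this README.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Bag-of-words stand-in for a BGE embedding (illustration only)."""
    vec: Vector = {}
    for word in text.lower().split():
        word = word.strip(".,?!\"'")
        if word:
            vec[word] = vec.get(word, 0.0) + 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SketchCache:
    """Maps past queries to their chunks, keyed by embedding similarity."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 sim_threshold: float = 0.85) -> None:
        self.embed = embed
        self.sim_threshold = sim_threshold
        self.entries: List[Tuple[Vector, List[str]]] = []

    def lookup(self, query: str) -> Optional[List[str]]:
        """Return chunks of the most similar past query, or None on a miss."""
        q = self.embed(query)
        best_sim, best_chunks = 0.0, None
        for vec, chunks in self.entries:
            sim = cosine(q, vec)
            if sim > best_sim:
                best_sim, best_chunks = sim, chunks
        return best_chunks if best_sim >= self.sim_threshold else None

    def store(self, query: str, chunks: List[str]) -> None:
        """Record a successful retrieval for later reuse."""
        self.entries.append((self.embed(query), list(chunks)))
```

The high threshold means only near-identical queries hit the cache; anything else falls through to web search.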
|
|
| ## Sentence-level BGE rerank (this is the latest win) |
|
|
| Instead of comparing the whole retrieved document against the query, the |
| pipeline: |
| 1. Splits each retrieved doc into sentences (`split_sentences`) |
| 2. Encodes each sentence with BGE |
| 3. Ranks sentences against the query |
| 4. Injects the **top-ranked sentences** (not the whole docs) into the chat |
|
|
| Why it matters: a single answer-bearing sentence ("Thimphu is the capital |
| of Bhutan.") scores higher than a long paragraph of mixed content, and uses |
| far fewer tokens in the LLM's raw window. In testing, this halved generation length |
| on the Bhutan query (645 → 321 tokens) while giving a cleaner one-sentence |
| final answer. |
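
The four steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions: `split_sentences_sketch`, `embed_sketch`, and `rerank_sentences` are hypothetical names, the splitter is a naive regex rather than the pipeline's `split_sentences`, and word counts stand in for BGE embeddings so the example needs no model download.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sentences_sketch(doc: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]


def embed_sketch(text: str) -> Counter:
    """Word counts; a stand-in for a BGE sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def rerank_sentences(query: str, docs: List[str],
                     top_k: int = 3) -> List[Tuple[str, float]]:
    """Split docs into sentences, score each against the query, keep top_k."""
    q = embed_sketch(query)
    sentences = [s for doc in docs for s in split_sentences_sketch(doc)]
    scored = [(s, cosine(q, embed_sketch(s))) for s in sentences]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Even with this toy embedding, the answer-bearing sentence outranks the document it came from, because the surrounding sentences dilute the whole-document score.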
|
|
| ## Web search backends (`web_search.py`) |
| |
| All implement `search(query: str) -> List[str]`. |
| |
| | backend | API key? | free tier | notes | |
| |---|---|---|---| |
| | `WikipediaSearch` | no | unlimited | encyclopedic, robust, default | |
| | `TavilySearch` | yes (`TAVILY_API_KEY`) | 1000 q/mo | AI-agent-tuned, clean text | |
| | `BraveSearch` | yes (`BRAVE_API_KEY`) | 2000 q/mo | general web, snippet-level | |
| | `PlaywrightBingSearch` | no | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium | |
| `CompositeSearch` | depends on chain | n/a | `mode="fallback"` or `mode="merge"` |
| |
| Recommended for production: a `CompositeSearch` chain of `TavilySearch` then |
| `WikipediaSearch` in `"fallback"` mode if you have a Tavily key (best content), or |
| `PlaywrightBingSearch` plus `WikipediaSearch` in `"merge"` mode if you want zero keys. |
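
To make the two modes concrete, here is a minimal sketch of a composite over any objects exposing `search(query) -> List[str]`. `CompositeSketch` and `SearchBackend` are hypothetical names, not the classes in `web_search.py`. The fallback rule (exception or empty result moves to the next backend) follows this README; the order-preserving de-duplication in merge mode is an assumption.

```python
from typing import List, Protocol, Sequence


class SearchBackend(Protocol):
    """Anything exposing search(query) -> List[str]."""

    def search(self, query: str) -> List[str]: ...


class CompositeSketch:
    def __init__(self, backends: Sequence[SearchBackend],
                 mode: str = "fallback") -> None:
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.backends = list(backends)
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                results = backend.search(query)
            except Exception:
                results = []  # a failing backend is treated like an empty one
            if self.mode == "fallback" and results:
                return list(results)  # first backend with results wins
            for chunk in results:
                if chunk not in merged:  # assumed: drop exact duplicates
                    merged.append(chunk)
        return merged
```

In fallback mode later backends are only contacted when earlier ones fail, which saves quota on keyed APIs; merge mode always queries every backend.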
| |
| ## Operational settings (validated on CPU; same numbers apply to GPU) |
| |
| | setting | value | why | |
| |---|---|---| |
| | `rw` (raw window) | 512 | Drift-free operating point, near-oracle deep KL | |
| `drop_per_chunk` | 64 | Survivor set bounded → O(N) total compute |
| | `chunk_size` | 64 | Matches training, eviction cadence | |
| | **chat** decode `rp`/`nr` | **1.15 / 4** | Prevents greedy loops in conversational gen | |
| | **RAG/math** decode `rp`/`nr` | **1.0 / 0** | Do **not** penalize repeated digits/names (faithful copying) | |
| | EOS between turns | required | Without it the model loops the previous response | |
| | CoT handling | **KEEP** (do not strip) | Strip mode breaks R1 distill's chat-template expectation | |
| `max_resp` per turn | 600–800 | Lets `<think>` close |
| | Tier 1 cache threshold | 0.85 | High bar so only near-identical queries reuse cache | |
| | Tier 2 fast threshold | 0.80 | Trust BGE without LLM verify | |
| Tier 3 floor threshold | 0.55 | Below this → straight to Tier 4 |
| | System prompt | `refusal.RECOMMENDED_SYSTEM` (cutoff-aware) | Maximizes self-refusal capture | |
| Date guard | always on | Year mismatch (query has YYYY, chunk does not) → drop the chunk |
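
The date-guard rule in the last row can be sketched as a small filter. This is an assumption-laden illustration, not the code in `retrieval_pipeline.py`: the function name `date_guard`, the regex restricting years to 1900–2099, and the pass-through when the query has no year are all choices made for the example.

```python
import re
from typing import List

# Four-digit years from 1900 to 2099 (assumed range for this sketch).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that mention none of the years named in the query."""
    query_years = set(YEAR_RE.findall(query))
    if not query_years:
        return list(chunks)  # no year in the query: nothing to check
    return [c for c in chunks if query_years & set(YEAR_RE.findall(c))]
```

The guard is a hard drop applied before any similarity scoring, so a chunk about the wrong year never reaches the reranker or the verifier.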
|
|
| ## Memory bound (any N) |
|
|
| Regardless of total conversation length: |
|
|
| - LLM KV at any chunk ≤ `system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens` |
| - SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction) |
| - Total compute **O(N)**, per-token amortized **O(1)** |
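
The bound above is simple arithmetic, and a toy counter shows why it does not grow with conversation length. In this sketch the 67-token system prompt is inferred from 675 − (32 + 512 + 64) rather than stated anywhere in this README, and `simulate_raw_window` only counts tokens: it does not model which tokens the real eviction scores lowest.

```python
def kv_upper_bound(system_prompt_tokens: int, sp_tokens: int = 32,
                   raw_window: int = 512, chunk_size: int = 64) -> int:
    """Worst-case KV length while a chunk is being processed."""
    return system_prompt_tokens + sp_tokens + raw_window + chunk_size


def simulate_raw_window(total_tokens: int, raw_window: int = 512,
                        chunk_size: int = 64, drop_per_chunk: int = 64) -> int:
    """Return the peak number of raw tokens resident over a long stream."""
    resident = 0
    peak = 0
    for _ in range(0, total_tokens, chunk_size):
        resident += chunk_size            # a new chunk enters the window
        peak = max(peak, resident)
        if resident > raw_window:
            resident -= drop_per_chunk    # evict to get back under the window
    return peak


# Assumed system prompt length of 67 tokens reproduces the quoted figure.
print(kv_upper_bound(67))
# The peak is the same for a 100k-token and a 1M-token stream.
print(simulate_raw_window(100_000), simulate_raw_window(1_000_000))
```

Because each 64-token chunk triggers a 64-token eviction once the window is full, the resident count oscillates between `raw_window` and `raw_window + chunk_size` no matter how large N gets.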
|
|
| ## Fallback architecture (summary) |
|
|
| | layer | decision axis | example threshold | |
| |---|---|---| |
| backend chain (`CompositeSearch`) | exception / empty → next | n/a |
| Tier 1 cache | BGE sim ≥ 0.85 | accept |
| Date guard | year(query) == year(chunk) | hard drop |
| Tier 2 fast | BGE sim ≥ 0.80 | accept |
| Tier 3 verify | sim 0.55–0.80 + Qwen YES | accept |
| | Tier 4 | nothing passed | honest refusal | |
|
|
| Each layer hands off to the next when its own check fails. The checks are |
| **similarity score, Qwen verifier, and date integrity**. |
|
|
| ## Minimum API |
|
|
| ```python |
| from runtime.spchat import SPChat |
| from runtime.rag import BGERetriever |
| from runtime.cache import QueryCache |
| from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch |
| from runtime.verifier import QwenVerifier |
| from runtime.retrieval_pipeline import RetrievalPipeline |
| from runtime.refusal import RECOMMENDED_SYSTEM |
| |
| chat = SPChat() |
| retriever = BGERetriever() |
| cache = QueryCache(retriever, sim_threshold=0.85) |
| web = CompositeSearch( |
|     [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)], |
|     mode="merge", |
| ) |
| verifier = QwenVerifier(chat) |
| pipeline = RetrievalPipeline( |
|     retriever, cache, web, verifier, |
|     tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3, |
| ) |
| |
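| state = chat.start_session(RECOMMENDED_SYSTEM) |
| user_msg = "What is the capital of Bhutan?"  # example query |
| res = pipeline.get(user_msg) |
| if res.tier == 4:  # nothing passed the gates: refuse honestly |
|     reply = "I don't have reliable information on that." |
| else: |
|     # RAG decode settings (rp=1.0, nr=0) so names and digits copy faithfully |
|     prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg |
|     reply = chat.turn(state, prompt, rp=1.0, nr=0) |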
| ``` |
|
|
| ## Background |
|
|
| See `HANDOFF.md` for the research-side story (why bounded memory works, what |
| the eviction lever buys you, how the SP-key vision evolved into BGE-keyed |
| RAG with central PageRank curation as the next milestone). |
|
|