SP-Distill Runtime — Operational Deployment

Inference/deploy layer for the bounded-memory chat + RAG system built on top of ap32_large.pt. The drop-in scripts here implement the validated operating configuration. For training scripts and the original handoff, see the repo root and HANDOFF.md.

What this gives you

Bounded-memory multi-turn chat running a frozen DeepSeek-R1-Distill-Qwen-1.5B with SP-evict: the LLM only ever attends to ≈ 675 KV tokens regardless of how long the conversation runs.
External knowledge via a 4-tier retrieval pipeline (cache → web search → Qwen self-verification → honest refusal), with sentence-level BGE reranking for focused context injection.
Hallucination guard: BGE pre-gate + strict Qwen verifier + date guard + self-refusal post-gate + cutoff-aware system prompt.
Multiple web backends (Wikipedia, Tavily, Brave, headless-Chromium Bing scrape), interchangeable via a single interface.
Runs on CPU alone; the same code runs on GPU.

Files

file	what it is
`spchat.py`	`SPChat` class. Bounded-memory chat (KEEP-CoT mode).
`rag.py`	`BGERetriever`. Sentence embedding + cosine top-k.
`refusal.py`	`detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware).
`cache.py`	`QueryCache`. BGE-indexed query→chunks cache (Tier 1).
`web_search.py`	`WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`.
`verifier.py`	`QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself.
`retrieval_pipeline.py`	`RetrievalPipeline`. 4-tier orchestrator + date guard + sentence-level rerank.
`demo_chat.py`	4-turn chat demo.
`demo_rag.py`	Plain RAG demo (no gates).
`demo_rag_gated.py`	RAG + hallucination guard (approach E).
`demo_full_pipeline.py`	Full integrated demo (4-tier + chat + multi-turn).
`try_postgate_prompts.py`	Validation harness: system-prompt strictness experiment.
`test_pipeline.py`	Smoke tests for the runtime modules (run after install).
`requirements.txt`	Pinned-loose deps (includes Playwright).

Install

pip install -r runtime/requirements.txt

# only needed for PlaywrightBingSearch (keyless web scrape):
python -m playwright install chromium

# pull the trained checkpoint
git lfs pull --include="checkpoints/ap32_large.pt"

Run

# fast module smoke tests (no LLM load)
python runtime/test_pipeline.py --fast

# full smoke test (loads the LLM and runs one pipeline query, ~5 min on CPU)
python runtime/test_pipeline.py

# demos
python runtime/demo_chat.py             # multi-turn chat only
python runtime/demo_rag.py              # plain RAG over a fixed corpus
python runtime/demo_rag_gated.py        # RAG + hallucination guard
python runtime/demo_full_pipeline.py    # 4-tier pipeline end-to-end (Bing+Wikipedia merge)

The 4-tier retrieval pipeline (`retrieval_pipeline.py`)

user query
   │
   ├─ looks_like_question? ─── no ─── bypass (casual chat)
   │
   ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
   │
   ├─ Web search (Composite: Bing scrape + Wikipedia by default)
   │
   ├─ Date guard: if the query names a year, drop chunks that don't mention it
   │
   ├─ Sentence-split + BGE rerank   ← concentrates signal at sentence level
   │
   ├─ top_sim ≥ 0.80      → Tier 2 fast (trust BGE, no LLM verify)
   ├─ top_sim ∈ [0.55,0.80) → Tier 3 (Qwen strict verifier per sentence)
   │   ├─ verified ≥ 1     → use verified sentences
   │   └─ all rejected     → Tier 4
   ├─ top_sim < 0.55      → Tier 4
   │
   └─ Tier 4: honest refusal — "I don't have reliable info on that"
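
The threshold routing in the diagram above can be sketched as a small pure function. This is a minimal illustration, not the code in `retrieval_pipeline.py`: the name `route_tier` and its parameters are hypothetical, and it covers only the cache-miss path (Tier 1 is assumed to have been checked already).

```python
from typing import Optional

# Thresholds taken from the diagram above.
TIER2_FAST = 0.80
TIER3_FLOOR = 0.55


def route_tier(top_sim: float, n_verified: Optional[int] = None) -> int:
    """Return the tier (2, 3, or 4) for a query that missed the cache.

    top_sim:    best BGE similarity among the reranked sentences.
    n_verified: how many sentences the verifier accepted; only
                consulted when top_sim falls in the Tier 3 band.
    """
    if top_sim >= TIER2_FAST:
        return 2  # trust BGE, skip the LLM verifier
    if top_sim >= TIER3_FLOOR:
        # Tier 3 only succeeds if at least one sentence was verified.
        return 3 if (n_verified or 0) >= 1 else 4
    return 4  # below the floor: honest refusal
```

For example, `route_tier(0.70, n_verified=0)` falls through to Tier 4 because every candidate sentence was rejected.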

Successful retrievals update the cache (Tier 1) for next time. This local cache is the device-side counterpart of the planned central-server PageRank curation layer.
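
As a rough sketch of how a similarity-keyed cache like Tier 1 can behave, the snippet below stores (query embedding, chunks) pairs and returns the chunks of the closest past query only when the cosine similarity clears the threshold. The class `SketchQueryCache`, its `put`/`get` methods, and the injected `embed` callable are all hypothetical; the real `QueryCache` takes a `BGERetriever` and may differ in every detail.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SketchQueryCache:
    """Hypothetical stand-in for a query -> chunks cache keyed by embeddings."""

    def __init__(self, embed: Callable[[str], Vector], sim_threshold: float = 0.85):
        self.embed = embed                  # any text -> vector function
        self.sim_threshold = sim_threshold  # 0.85 matches the Tier 1 setting
        self._entries: List[Tuple[Vector, List[str]]] = []

    def put(self, query: str, chunks: List[str]) -> None:
        """Remember the chunks that answered this query."""
        self._entries.append((self.embed(query), list(chunks)))

    def get(self, query: str) -> Optional[List[str]]:
        """Return cached chunks for the most similar past query, or None."""
        qv = self.embed(query)
        best_sim, best_chunks = 0.0, None
        for vec, chunks in self._entries:
            sim = cosine(qv, vec)
            if sim > best_sim:
                best_sim, best_chunks = sim, chunks
        return best_chunks if best_sim >= self.sim_threshold else None
```

The linear scan is fine for a small on-device cache; a larger cache would likely want a vector index instead.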

Sentence-level BGE rerank (this is the latest win)

Instead of comparing the whole retrieved document against the query, the pipeline:

1. Splits each retrieved doc into sentences (`split_sentences`)
2. Encodes each sentence with BGE
3. Ranks sentences against the query
4. Injects the top-ranked sentences (not the whole docs) into the chat

Why it matters: a single answer-bearing sentence ("Thimphu is the capital of Bhutan.") scores higher than a long paragraph of mixed content, and it uses far fewer tokens in the LLM's raw window. In testing, this halved the generation length on the Bhutan query (645 → 321 tokens) while giving a cleaner one-sentence final answer.
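
The split, encode, rank, and select steps can be sketched end to end without loading a model. In the snippet below, `toy_embed` is a bag-of-words stand-in for the BGE encoder (an assumption made so the example runs anywhere), the regex-based `split_sentences` is a naive guess at what the real helper does, and `rerank_sentences` is a hypothetical name.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_embed(text: str) -> Counter:
    """Stand-in for BGE: a lowercase bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank_sentences(query: str, docs: List[str], top_n: int = 3) -> List[Tuple[float, str]]:
    """Score every sentence of every doc against the query; keep the best top_n."""
    qv = toy_embed(query)
    scored = [
        (cosine(qv, toy_embed(sentence)), sentence)
        for doc in docs
        for sentence in split_sentences(doc)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

With a document such as "Bhutan is a landlocked country in South Asia. Thimphu is the capital of Bhutan. Its economy relies on hydropower." and the query "What is the capital of Bhutan?", the middle sentence ranks first, so only that sentence would need to be injected into the prompt.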

Web search backends (`web_search.py`)

All backends implement `search(query: str) -> List[str]`.

backend	API key?	free tier	notes
`WikipediaSearch`	no	unlimited	encyclopedic, robust, default
`TavilySearch`	yes (`TAVILY_API_KEY`)	1000 q/mo	AI-agent-tuned, clean text
`BraveSearch`	yes (`BRAVE_API_KEY`)	2000 q/mo	general web, snippet-level
`PlaywrightBingSearch`	no	unlimited (subject to Bing rate limits)	scrapes Bing via headless Chromium
`CompositeSearch`	depends on chain	—	`mode="fallback"` or `mode="merge"`

Recommended for production: a `CompositeSearch` over `TavilySearch` and `WikipediaSearch` with `mode="fallback"` if you have a Tavily key (best content), or a `CompositeSearch` over `PlaywrightBingSearch` and `WikipediaSearch` with `mode="merge"` if you want zero keys.
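
To illustrate how a single `search` interface makes backends interchangeable, here is a minimal sketch of a composite with the two modes. All class names below (`SearchBackend`, `StaticBackend`, `FailingBackend`, `SketchComposite`) are hypothetical. The fallback behaviour follows the "exception / empty → next" rule from this README; the merge behaviour (concatenate in chain order, drop exact duplicates) is an assumption about what `mode="merge"` does.

```python
from typing import List, Protocol, Sequence


class SearchBackend(Protocol):
    """The shared interface: one method, query in, text chunks out."""

    def search(self, query: str) -> List[str]: ...


class StaticBackend:
    """Hypothetical stub that always returns the same chunks."""

    def __init__(self, chunks: Sequence[str]) -> None:
        self.chunks = list(chunks)

    def search(self, query: str) -> List[str]:
        return list(self.chunks)


class FailingBackend:
    """Hypothetical stub that simulates a backend outage."""

    def search(self, query: str) -> List[str]:
        raise RuntimeError("backend unavailable")


class SketchComposite:
    """Chains backends in 'fallback' or 'merge' mode."""

    def __init__(self, backends: Sequence[SearchBackend], mode: str = "fallback") -> None:
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.backends = list(backends)
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                results = backend.search(query)
            except Exception:
                continue  # a failing backend is skipped, not fatal
            if self.mode == "fallback":
                if results:
                    return results  # first non-empty answer wins
            else:
                for chunk in results:
                    if chunk not in merged:  # assumed: exact-duplicate removal
                        merged.append(chunk)
        return merged  # empty in fallback mode if every backend came up empty
```

Because the composite itself exposes `search`, it can be nested or swapped in anywhere a single backend is expected.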

Operational settings (validated on CPU; same numbers apply to GPU)

setting	value	why
`rw` (raw window)	512	Drift-free operating point, near-oracle deep KL
`drop_per_chunk`	64	Survivor set bounded ⇒ O(N) total compute
`chunk_size`	64	Matches training, eviction cadence
chat decode `rp`/`nr`	1.15 / 4	Prevents greedy loops in conversational gen
RAG/math decode `rp`/`nr`	1.0 / 0	Do not penalize repeated digits/names (faithful copying)
EOS between turns	required	Without it the model loops the previous response
CoT handling	KEEP (do not strip)	Strip mode breaks R1 distill's chat-template expectation
`max_resp` per turn	600–800	Lets `<think>` close
Tier 1 cache threshold	0.85	High bar so only near-identical queries reuse cache
Tier 2 fast threshold	0.80	Trust BGE without LLM verify
Tier 3 floor threshold	0.55	Below this → straight to Tier 4
System prompt	`refusal.RECOMMENDED_SYSTEM` (cutoff-aware)	Maximizes self-refusal capture
Date guard	always on	Year mismatch (query has YYYY, chunk does not) → drop the chunk
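
The date guard row can be read as a simple filter. The sketch below is one possible interpretation, not the pipeline's actual code: the function name `date_guard`, the restriction to years 1900–2099, and the rule "keep a chunk if it mentions at least one of the query's years" are all assumptions.

```python
import re
from typing import List

# Assumed year pattern: four digits starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that do not mention a year named in the query.

    If the query contains no year, every chunk passes unchanged.
    """
    query_years = set(YEAR_RE.findall(query))
    if not query_years:
        return list(chunks)
    return [c for c in chunks if query_years & set(YEAR_RE.findall(c))]
```

A query such as "Who won the 2022 World Cup?" would keep a chunk mentioning 2022 and drop one that mentions only 2018 or no year at all.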

Memory bound (any N)

Regardless of total conversation length:

LLM KV at any chunk ≤ system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens
SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
Total compute O(N), per-token amortized O(1)
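
The arithmetic behind the bound is easy to check. In the sketch below the constants come from the settings above; the function names are hypothetical, and the system prompt length of about 67 tokens is inferred from 675 − (32 + 512 + 64), not stated in this README. The cost function counts (query, key) pairs as a rough proxy for attention work, which is itself a simplification.

```python
import math

SP_TOKENS = 32     # SP slots
RAW_WINDOW = 512   # rw
CHUNK_SIZE = 64    # chunk_size


def kv_upper_bound(system_prompt_tokens: int) -> int:
    """Largest number of KV tokens the LLM attends to while processing one chunk."""
    return system_prompt_tokens + SP_TOKENS + RAW_WINDOW + CHUNK_SIZE


def total_attention_cost(n_tokens: int, system_prompt_tokens: int) -> int:
    """Upper bound on (query, key) pairs over a conversation of n_tokens.

    Each chunk of CHUNK_SIZE query tokens sees at most kv_upper_bound keys,
    so the total grows linearly with the number of chunks.
    """
    n_chunks = math.ceil(n_tokens / CHUNK_SIZE)
    return n_chunks * CHUNK_SIZE * kv_upper_bound(system_prompt_tokens)
```

Under these assumptions `kv_upper_bound(67)` gives 675, and doubling `n_tokens` doubles `total_attention_cost`, which is the O(N) total and O(1) amortized per-token behaviour claimed above.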

Fallback architecture (summary)

layer	decision axis	example threshold
backend chain (`CompositeSearch`)	exception / empty → next	—
Tier 1 cache	BGE sim ≥ 0.85	accept
Date guard	year(query) == year(chunk)	hard drop
Tier 2 fast	BGE sim ≥ 0.80	accept
Tier 3 verify	sim 0.55–0.80 + Qwen YES	accept
Tier 4	nothing passed	honest refusal

Each layer falls through to the next. The decision axes are the BGE similarity score, the Qwen verifier's verdict, and date integrity.

Minimum API

from runtime.spchat import SPChat
from runtime.rag import BGERetriever
from runtime.cache import QueryCache
from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
from runtime.verifier import QwenVerifier
from runtime.retrieval_pipeline import RetrievalPipeline
from runtime.refusal import RECOMMENDED_SYSTEM

chat       = SPChat()
retriever  = BGERetriever()
cache      = QueryCache(retriever, sim_threshold=0.85)
web        = CompositeSearch(
    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
    mode="merge",
)
verifier   = QwenVerifier(chat)
pipeline   = RetrievalPipeline(
    retriever, cache, web, verifier,
    tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3,
)

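user_msg = "What is the capital of Bhutan?"   # example query; replace with real user input

state = chat.start_session(RECOMMENDED_SYSTEM)
res   = pipeline.get(user_msg)                # result exposes .tier and .chunks
if res.tier == 4:
    # Nothing passed the gates: refuse without calling the LLM.
    reply = "I don't have reliable information on that."
else:
    # Inject the retrieved chunks as context, then decode with RAG settings.
    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
    reply = chat.turn(state, prompt, rp=1.0, nr=0)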

Background

See HANDOFF.md for the research-side story (why bounded memory works, what the eviction lever buys you, how the SP-key vision evolved into BGE-keyed RAG with central PageRank curation as the next milestone).