SP-Distill Runtime: Operational Deployment
Inference/deploy layer for the bounded-memory chat + RAG system built on
top of ap32_large.pt. Drop-in scripts that embody the validated operation.
For training scripts and the original handoff, see the repo root and HANDOFF.md.
What this gives you
- Bounded-memory multi-turn chat running a frozen `DeepSeek-R1-Distill-Qwen-1.5B` with SP-evict: the LLM only ever attends to ≈ 675 KV tokens regardless of how long the conversation runs.
- External knowledge via a 4-tier retrieval pipeline (cache → web search → Qwen self-verification → honest refusal), with sentence-level BGE reranking for focused context injection.
- Hallucination guard: BGE pre-gate + strict Qwen verifier + date guard + self-refusal post-gate + cutoff-aware system prompt.
- Multiple web backends (Wikipedia, Tavily, Brave, headless-Chromium Bing scrape), interchangeable via a single interface.
- Pure CPU runnable; same code runs on GPU.
Files
| file | what it is |
|---|---|
| `spchat.py` | `SPChat` class. Bounded-memory chat (KEEP-CoT mode). |
| `rag.py` | `BGERetriever`. Sentence embedding + cosine top-k. |
| `refusal.py` | `detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware). |
| `cache.py` | `QueryCache`. BGE-indexed query→chunks cache (Tier 1). |
| `web_search.py` | `WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`. |
| `verifier.py` | `QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself. |
| `retrieval_pipeline.py` | `RetrievalPipeline`. 4-tier orchestrator + date guard + sentence-level rerank. |
| `demo_chat.py` | 4-turn chat demo. |
| `demo_rag.py` | Plain RAG demo (no gates). |
| `demo_rag_gated.py` | RAG + hallucination guard (approach E). |
| `demo_full_pipeline.py` | Full integrated demo (4-tier + chat + multi-turn). |
| `try_postgate_prompts.py` | Validation harness: system-prompt strictness experiment. |
| `test_pipeline.py` | Smoke tests for the runtime modules (run after install). |
| `requirements.txt` | Pinned-loose deps (includes Playwright). |
Install
pip install -r runtime/requirements.txt
# only needed for PlaywrightBingSearch (keyless web scrape):
python -m playwright install chromium
# pull the trained checkpoint
git lfs pull --include="checkpoints/ap32_large.pt"
Run
# fast module smoke tests (no LLM load)
python runtime/test_pipeline.py --fast
# full smoke (loads LLM + runs one pipeline query, ~5min on CPU)
python runtime/test_pipeline.py
# demos
python runtime/demo_chat.py # multi-turn chat only
python runtime/demo_rag.py # plain RAG over a fixed corpus
python runtime/demo_rag_gated.py # RAG + hallucination guard
python runtime/demo_full_pipeline.py # 4-tier pipeline end-to-end (Bing+Wikipedia merge)
The 4-tier retrieval pipeline (retrieval_pipeline.py)
user query
  │
  ├─ looks_like_question? ── no ──→ bypass (casual chat)
  │
  ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
  │
  ├─ Web search (Composite: Bing scrape + Wikipedia by default)
  │
  ├─ Date guard: drop chunks whose year doesn't match query year
  │
  ├─ Sentence-split + BGE rerank (concentrates signal at sentence level)
  │
  ├─ top_sim ≥ 0.80 → Tier 2 fast (trust BGE, no LLM verify)
  ├─ top_sim ∈ [0.55, 0.80) → Tier 3 (Qwen strict verifier per sentence)
  │    ├─ verified ≥ 1 → use verified sentences
  │    └─ all rejected → Tier 4
  ├─ top_sim < 0.55 → Tier 4
  │
  └─ Tier 4: honest refusal → "I don't have reliable info on that"
Successful retrievals update the cache (Tier 1) for next time. This local cache is the device-side counterpart of the planned central-server PageRank curation layer.
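The threshold routing in the diagram can be sketched in a few lines. This is a minimal illustration, not the code in `retrieval_pipeline.py`: the names `Routed` and `route` are hypothetical, the verifier is passed in as a plain callable, and returning the top-n sentences on the Tier 2 path is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TIER2_FAST = 0.80  # at or above: trust BGE, skip the LLM verifier
TIER3_MIN = 0.55   # below: go straight to refusal


@dataclass
class Routed:
    tier: int             # 2, 3 or 4
    sentences: List[str]  # empty on Tier 4


def route(scored: List[Tuple[str, float]],
          verify: Callable[[str], bool],
          top_n: int = 3) -> Routed:
    """Pick a tier from (sentence, similarity) pairs, given in any order."""
    if not scored:
        return Routed(4, [])
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)[:top_n]
    top_sim = ranked[0][1]
    if top_sim >= TIER2_FAST:
        # Tier 2: accept the top-ranked sentences without verification.
        return Routed(2, [s for s, _ in ranked])
    if top_sim >= TIER3_MIN:
        # Tier 3: keep only sentences the verifier accepts.
        kept = [s for s, _ in ranked if verify(s)]
        return Routed(3, kept) if kept else Routed(4, [])
    # Tier 4: nothing is similar enough to the query.
    return Routed(4, [])
```

Passing the verifier as a callable keeps the routing testable without loading the LLM; in the real pipeline that role is played by `QwenVerifier`.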
Sentence-level BGE rerank (this is the latest win)
Instead of comparing the whole retrieved document against the query, the pipeline:
- Splits each retrieved doc into sentences (`split_sentences`)
- Encodes each sentence with BGE
- Ranks sentences against the query
- Injects the top-ranked sentences (not the whole docs) into the chat
Why it matters: a single answer-bearing sentence ("Thimphu is the capital of Bhutan.") scores higher than a long paragraph of mixed content, and it uses far fewer tokens in the LLM's raw window. In testing, this halved generation length on the Bhutan query (645 → 321 tokens) while giving a cleaner one-sentence final answer.
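The effect is easy to reproduce with a toy encoder. The sketch below is an illustration under stated assumptions, not the pipeline's code: `toy_embed` is a bag-of-words stand-in for the BGE encoder, the `split_sentences` shown here is a naive regex splitter that only shares its name with the real helper, and `rerank_sentences` is a hypothetical function.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in for a BGE encoder: L2-normalized bag of lowercase words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def rerank_sentences(query: str, docs: List[str],
                     top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every sentence of every doc against the query, best first."""
    q = toy_embed(query)
    scored = [(s, cosine(q, toy_embed(s)))
              for d in docs for s in split_sentences(d)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]


doc = ("Bhutan is a landlocked country in the Eastern Himalayas. "
       "Thimphu is the capital of Bhutan. "
       "The economy relies on hydropower and tourism.")
top = rerank_sentences("What is the capital of Bhutan?", [doc])
print(top[0][0])
```

Even with this crude encoder, the answer-bearing sentence outscores the document taken as a whole, which is the signal-concentration effect described above.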
Web search backends (web_search.py)
All implement search(query: str) -> List[str].
| backend | API key? | free tier | notes |
|---|---|---|---|
| `WikipediaSearch` | no | unlimited | encyclopedic, robust, default |
| `TavilySearch` | yes (`TAVILY_API_KEY`) | 1000 q/mo | AI-agent-tuned, clean text |
| `BraveSearch` | yes (`BRAVE_API_KEY`) | 2000 q/mo | general web, snippet-level |
| `PlaywrightBingSearch` | no | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium |
| `CompositeSearch` | depends on chain | n/a | `mode="fallback"` or `mode="merge"` |
Recommended for production: `CompositeSearch` over `[TavilySearch, WikipediaSearch]` with `mode="fallback"` if you have a Tavily key (best content), or over `[PlaywrightBingSearch, WikipediaSearch]` with `mode="merge"` if you want zero keys.
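To make the two modes concrete, here is a minimal sketch of how a composite backend could behave. It is not the implementation in `web_search.py`: `CompositeSketch` and `SearchBackend` are hypothetical names, and both the de-duplication in merge mode and the skipping of failed backends in either mode are assumptions.

```python
from typing import List, Protocol, Sequence


class SearchBackend(Protocol):
    """Anything exposing the shared search(query) -> List[str] interface."""

    def search(self, query: str) -> List[str]: ...


class CompositeSketch:
    def __init__(self, backends: Sequence[SearchBackend],
                 mode: str = "fallback") -> None:
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.backends = list(backends)
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                chunks = backend.search(query)
            except Exception:
                continue  # a failing backend falls through to the next one
            if self.mode == "fallback" and chunks:
                return chunks  # first non-empty result wins
            for chunk in chunks:
                if chunk not in merged:  # keep order, drop exact duplicates
                    merged.append(chunk)
        return merged
```

In fallback mode the chain stops at the first backend that returns anything, which saves API quota; merge mode queries every backend and leaves the ranking to the sentence-level rerank.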
Operational settings (validated on CPU; same numbers apply to GPU)
| setting | value | why |
|---|---|---|
| `rw` (raw window) | 512 | Drift-free operating point, near-oracle deep KL |
| `drop_per_chunk` | 64 | Survivor set bounded → O(N) total compute |
| `chunk_size` | 64 | Matches training, eviction cadence |
| chat decode `rp`/`nr` | 1.15 / 4 | Prevents greedy loops in conversational gen |
| RAG/math decode `rp`/`nr` | 1.0 / 0 | Do not penalize repeated digits/names (faithful copying) |
| EOS between turns | required | Without it the model loops the previous response |
| CoT handling | KEEP (do not strip) | Strip mode breaks R1 distill's chat-template expectation |
| `max_resp` per turn | 600–800 | Lets `<think>` close |
| Tier 1 cache threshold | 0.85 | High bar so only near-identical queries reuse cache |
| Tier 2 fast threshold | 0.80 | Trust BGE without LLM verify |
| Tier 3 floor threshold | 0.55 | Below this → straight to Tier 4 |
| System prompt | `refusal.RECOMMENDED_SYSTEM` (cutoff-aware) | Maximizes self-refusal capture |
| Date guard | always on | Year mismatch (query has YYYY, chunk does not) → drop the chunk |
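The date guard rule in the last row can be sketched as a small filter. This is an illustration rather than the code in `retrieval_pipeline.py`: `date_guard` and `YEAR_RE` are hypothetical names, and limiting "year" to four-digit values from 1900 to 2099 is an assumption.

```python
import re
from typing import List

# Assumption: a "year" is a standalone four-digit number in 1900-2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that lack every year mentioned in the query."""
    wanted = set(YEAR_RE.findall(query))
    if not wanted:
        return list(chunks)  # no year in the query: nothing to enforce
    return [c for c in chunks if wanted & set(YEAR_RE.findall(c))]


chunks = [
    "Argentina won the 2022 World Cup.",
    "France won the 2018 World Cup.",
    "The World Cup is held every four years.",
]
print(date_guard("Who won the World Cup in 2022?", chunks))
```

The guard is a hard drop applied before any similarity scoring, so a chunk about the wrong year cannot reach the fast path however well it matches otherwise.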
Memory bound (any N)
Regardless of total conversation length:
- LLM KV at any chunk ≤ `system_prompt + 32 SP + raw 512 + chunk 64` ≈ 675 tokens
- SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
- Total compute O(N), per-token amortized O(1)
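The bound is plain addition, and a short calculation shows why total compute stays linear. The sketch below is a worked example, not runtime code: the function names are hypothetical, and the 67-token system prompt is an assumption inferred from the ≈ 675 total (675 − 32 − 512 − 64).

```python
SP_TOKENS = 32    # SP slots
RAW_WINDOW = 512  # rw
CHUNK_SIZE = 64   # chunk_size


def kv_bound(system_prompt_tokens: int) -> int:
    """Upper bound on KV entries the LLM attends to at any chunk."""
    return system_prompt_tokens + SP_TOKENS + RAW_WINDOW + CHUNK_SIZE


def attention_upper_bound(n_tokens: int, system_prompt_tokens: int) -> int:
    """Each of the N tokens attends to at most kv_bound keys, so cost is O(N)."""
    return n_tokens * kv_bound(system_prompt_tokens)


# Assumption: a system prompt of about 67 tokens.
print(kv_bound(67))
print(attention_upper_bound(10_000, 67) / 10_000)  # per-token cost is constant
```

Because `kv_bound` does not depend on conversation length, doubling N doubles the total work and leaves the per-token cost unchanged.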
Fallback architecture (summary)
| layer | decision axis | outcome |
|---|---|---|
| backend chain (`CompositeSearch`) | exception / empty result | try next backend |
| Tier 1 cache | BGE sim ≥ 0.85 | accept |
| Date guard | year(query) == year(chunk) | hard drop on mismatch |
| Tier 2 fast | BGE sim ≥ 0.80 | accept |
| Tier 3 verify | sim 0.55–0.80 + Qwen YES | accept |
| Tier 4 | nothing passed | honest refusal |
Each level falls through to the next using similarity score + Qwen verifier + date integrity as its decision axis.
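As an illustration of the first accept rule in the table, a similarity-keyed cache could look like the following. This is a sketch under assumptions, not the code in `cache.py`: `QueryCacheSketch`, `put`, and `get` are hypothetical names, and the encoder is injected as a plain function so the example runs without a BGE model.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class QueryCacheSketch:
    def __init__(self, embed: Callable[[str], Vector],
                 sim_threshold: float = 0.85) -> None:
        self.embed = embed
        self.sim_threshold = sim_threshold
        self._entries: List[Tuple[Vector, List[str]]] = []

    def put(self, query: str, chunks: List[str]) -> None:
        """Store the chunks that answered this query."""
        self._entries.append((self.embed(query), list(chunks)))

    def get(self, query: str) -> Optional[List[str]]:
        """Return cached chunks of the most similar past query, or None."""
        if not self._entries:
            return None
        q = self.embed(query)
        best_vec, best_chunks = max(self._entries,
                                    key=lambda e: cosine(q, e[0]))
        if cosine(q, best_vec) >= self.sim_threshold:
            return best_chunks
        return None  # miss: fall through to web search
```

A miss returns `None` rather than raising, so the caller can fall through to the next layer exactly as the table describes.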
Minimum API
from runtime.spchat import SPChat
from runtime.rag import BGERetriever
from runtime.cache import QueryCache
from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
from runtime.verifier import QwenVerifier
from runtime.retrieval_pipeline import RetrievalPipeline
from runtime.refusal import RECOMMENDED_SYSTEM
chat = SPChat()
retriever = BGERetriever()
cache = QueryCache(retriever, sim_threshold=0.85)
web = CompositeSearch(
    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
    mode="merge",
)
verifier = QwenVerifier(chat)
pipeline = RetrievalPipeline(
    retriever, cache, web, verifier,
    tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3,
)

state = chat.start_session(RECOMMENDED_SYSTEM)
user_msg = "What is the capital of Bhutan?"  # example query; replace with real user input
res = pipeline.get(user_msg)
if res.tier == 4:
    # Tier 4: nothing passed the gates, so refuse instead of guessing
    reply = "I don't have reliable information on that."
else:
    # RAG decode settings (rp=1.0, nr=0): do not penalize repeated digits/names
    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
    reply = chat.turn(state, prompt, rp=1.0, nr=0)
Background
See HANDOFF.md for the research-side story (why bounded memory works, what
the eviction lever buys you, how the SP-key vision evolved into BGE-keyed
RAG with central PageRank curation as the next milestone).