
SP-Distill Runtime: Operational Deployment

Inference and deployment layer for the bounded-memory chat + RAG system built on top of ap32_large.pt. These are drop-in scripts that implement the validated operating configuration. For training scripts and the original handoff, see the repo root and HANDOFF.md.

What this gives you

  • Bounded-memory multi-turn chat running a frozen DeepSeek-R1-Distill-Qwen-1.5B with SP-evict: the LLM only ever attends to ≈ 675 KV tokens regardless of how long the conversation runs.
  • External knowledge via a 4-tier retrieval pipeline (cache → web search → Qwen self-verification → honest refusal), with sentence-level BGE reranking for focused context injection.
  • Hallucination guard: BGE pre-gate + strict Qwen verifier + date guard + self-refusal post-gate + cutoff-aware system prompt.
  • Multiple web backends (Wikipedia, Tavily, Brave, headless-Chromium Bing scrape), interchangeable via a single interface.
  • Runs on CPU alone; the same code runs on GPU.

Files

| file | what it is |
| --- | --- |
| spchat.py | SPChat class. Bounded-memory chat (KEEP-CoT mode). |
| rag.py | BGERetriever. Sentence embedding + cosine top-k. |
| refusal.py | detect_refusal() + RECOMMENDED_SYSTEM (V2 cutoff-aware). |
| cache.py | QueryCache. BGE-indexed query→chunks cache (Tier 1). |
| web_search.py | WikipediaSearch / TavilySearch / BraveSearch / PlaywrightBingSearch / CompositeSearch. |
| verifier.py | QwenVerifier. Strict YES/NO chunk relevance, uses Qwen itself. |
| retrieval_pipeline.py | RetrievalPipeline. 4-tier orchestrator + date guard + sentence-level rerank. |
| demo_chat.py | 4-turn chat demo. |
| demo_rag.py | Plain RAG demo (no gates). |
| demo_rag_gated.py | RAG + hallucination guard (approach E). |
| demo_full_pipeline.py | Full integrated demo (4-tier + chat + multi-turn). |
| try_postgate_prompts.py | Validation harness: system-prompt strictness experiment. |
| test_pipeline.py | Smoke tests for the runtime modules (run after install). |
| requirements.txt | Pinned-loose deps (includes Playwright). |

Install

pip install -r runtime/requirements.txt

# only needed for PlaywrightBingSearch (keyless web scrape):
python -m playwright install chromium

# pull the trained checkpoint
git lfs pull --include="checkpoints/ap32_large.pt"

Run

# fast module smoke tests (no LLM load)
python runtime/test_pipeline.py --fast

# full smoke (loads LLM + runs one pipeline query, ~5min on CPU)
python runtime/test_pipeline.py

# demos
python runtime/demo_chat.py             # multi-turn chat only
python runtime/demo_rag.py              # plain RAG over a fixed corpus
python runtime/demo_rag_gated.py        # RAG + hallucination guard
python runtime/demo_full_pipeline.py    # 4-tier pipeline end-to-end (Bing+Wikipedia merge)

The 4-tier retrieval pipeline (retrieval_pipeline.py)

user query
   │
   ├─ looks_like_question? ─── no ─── bypass (casual chat)
   │
   ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
   │
   ├─ Web search (Composite: Bing scrape + Wikipedia by default)
   │
   ├─ Date guard: drop chunks whose year doesn't match query year
   │
   ├─ Sentence-split + BGE rerank   ← concentrates signal at sentence level
   │
   ├─ top_sim ≥ 0.80          → Tier 2 fast (trust BGE, no LLM verify)
   ├─ top_sim ∈ [0.55,0.80)   → Tier 3 (Qwen strict verifier per sentence)
   │   ├─ verified ≥ 1        → use verified sentences
   │   └─ all rejected        → Tier 4
   ├─ top_sim < 0.55          → Tier 4
   │
   └─ Tier 4: honest refusal ("I don't have reliable info on that")

Successful retrievals update the cache (Tier 1) for next time. This local cache is the device-side counterpart of the planned central-server PageRank curation layer.
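
The Tier 1 lookup can be pictured as a nearest-neighbour search over past queries. The sketch below is a minimal illustration and not the actual QueryCache implementation: the class name `TinyQueryCache`, its `put`/`get` methods, and the `toy_embed` function are assumptions made for this example. Only the 0.85 threshold comes from this README; in the real runtime the embedding would come from the BGE retriever.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TinyQueryCache:
    """Hypothetical stand-in for QueryCache: maps past queries to chunks."""

    def __init__(self, embed: Callable[[str], Vector], sim_threshold: float = 0.85):
        self.embed = embed
        self.sim_threshold = sim_threshold
        self._entries: List[Tuple[Vector, List[str]]] = []

    def put(self, query: str, chunks: List[str]) -> None:
        """Store the chunks that answered `query`, keyed by its embedding."""
        self._entries.append((self.embed(query), list(chunks)))

    def get(self, query: str) -> Optional[List[str]]:
        """Return cached chunks if some past query is similar enough, else None."""
        q = self.embed(query)
        best_sim, best_chunks = 0.0, None
        for vec, chunks in self._entries:
            sim = cosine(q, vec)
            if sim > best_sim:
                best_sim, best_chunks = sim, chunks
        return best_chunks if best_sim >= self.sim_threshold else None


def toy_embed(text: str) -> List[float]:
    """Toy letter-count embedding, used only so the example runs without BGE."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts
```

A miss returns `None`, which is the signal to fall through to web search; a successful retrieval would then call `put` so the next near-identical query is served locally.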

Sentence-level BGE rerank (this is the latest win)

Instead of comparing the whole retrieved document against the query, the pipeline:

  1. Splits each retrieved doc into sentences (split_sentences)
  2. Encodes each sentence with BGE
  3. Ranks sentences against the query
  4. Injects the top-ranked sentences (not the whole docs) into the chat

Why it matters: a single answer-bearing sentence ("Thimphu is the capital of Bhutan.") scores higher than a long paragraph of mixed content and uses far fewer tokens in the LLM's raw window. In testing, this roughly halved generation length on the Bhutan query (645 → 321 tokens) while giving a cleaner one-sentence final answer.
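
The four steps above can be sketched as follows. This is an illustrative approximation, not the code in retrieval_pipeline.py: the regex-based `split_sentences`, the `rerank_sentences` helper, and the fixed-vocabulary `toy_embed` are all assumptions for the example. In the real pipeline the `embed` callable would be the BGE encoder.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rerank_sentences(
    query: str,
    docs: List[str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every sentence of every doc against the query; keep the top_k."""
    q = embed(query)
    scored = [
        (cosine(q, embed(sent)), sent)
        for doc in docs
        for sent in split_sentences(doc)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy embedding over a tiny fixed vocabulary so the example runs without BGE.
TOY_VOCAB = ["capital", "bhutan", "thimphu", "country", "himalayas", "mountain"]


def toy_embed(text: str) -> List[float]:
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in TOY_VOCAB]


doc = (
    "Bhutan is a landlocked country in the Himalayas. "
    "Thimphu is the capital of Bhutan. "
    "The country is known for its mountain monasteries."
)
best = rerank_sentences("What is the capital of Bhutan?", [doc], toy_embed, top_k=1)
```

Here `best` holds only the answer-bearing sentence, so a single short sentence is injected into the chat rather than the full paragraph.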

Web search backends (web_search.py)

All implement search(query: str) -> List[str].

| backend | API key? | free tier | notes |
| --- | --- | --- | --- |
| WikipediaSearch | no | unlimited | encyclopedic, robust, default |
| TavilySearch | yes (TAVILY_API_KEY) | 1000 q/mo | AI-agent-tuned, clean text |
| BraveSearch | yes (BRAVE_API_KEY) | 2000 q/mo | general web, snippet-level |
| PlaywrightBingSearch | no | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium |
| CompositeSearch | depends on chain | - | mode="fallback" or mode="merge" |

Recommended for production: a CompositeSearch over [TavilySearch, WikipediaSearch] with mode="fallback" if you have a Tavily key (best content), or over [PlaywrightBingSearch, WikipediaSearch] with mode="merge" if you want zero keys.
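
To make the two composite modes concrete, here is a minimal sketch of how a chain of backends sharing the `search(query) -> List[str]` interface might be combined. `CompositeSketch`, `StaticBackend`, and `FailingBackend` are hypothetical names for this example, and the de-duplication in merge mode is an assumption; the real CompositeSearch may behave differently in its details.

```python
from typing import List, Protocol


class SearchBackend(Protocol):
    """The shared interface: every backend exposes search(query) -> List[str]."""

    def search(self, query: str) -> List[str]: ...


class CompositeSketch:
    """Hypothetical composite: 'fallback' stops at the first backend with
    results; 'merge' concatenates results from all backends."""

    def __init__(self, backends: List[SearchBackend], mode: str = "fallback"):
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.backends = list(backends)
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                results = backend.search(query)
            except Exception:
                results = []  # a failing backend is treated like an empty one
            if self.mode == "fallback":
                if results:
                    return results
                continue
            for chunk in results:
                if chunk not in merged:  # assumption: drop exact duplicates
                    merged.append(chunk)
        return merged


class StaticBackend:
    """Stub backend that always returns the same results (illustration only)."""

    def __init__(self, results: List[str]):
        self.results = results

    def search(self, query: str) -> List[str]:
        return list(self.results)


class FailingBackend:
    """Stub backend that simulates a network or rate-limit failure."""

    def search(self, query: str) -> List[str]:
        raise RuntimeError("backend unavailable")
```

In fallback mode the order of the chain encodes preference, which is why the keyed, higher-quality backend goes first; in merge mode every backend contributes and the reranker downstream decides what is relevant.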

Operational settings (validated on CPU; same numbers apply to GPU)

| setting | value | why |
| --- | --- | --- |
| rw (raw window) | 512 | Drift-free operating point, near-oracle deep KL |
| drop_per_chunk | 64 | Survivor set bounded ⇒ O(N) total compute |
| chunk_size | 64 | Matches training, eviction cadence |
| chat decode rp/nr | 1.15 / 4 | Prevents greedy loops in conversational gen |
| RAG/math decode rp/nr | 1.0 / 0 | Do not penalize repeated digits/names (faithful copying) |
| EOS between turns | required | Without it the model loops the previous response |
| CoT handling | KEEP (do not strip) | Strip mode breaks R1 distill's chat-template expectation |
| max_resp per turn | 600–800 | Lets <think> close |
| Tier 1 cache threshold | 0.85 | High bar so only near-identical queries reuse cache |
| Tier 2 fast threshold | 0.80 | Trust BGE without LLM verify |
| Tier 3 floor threshold | 0.55 | Below this → straight to Tier 4 |
| System prompt | refusal.RECOMMENDED_SYSTEM (cutoff-aware) | Maximizes self-refusal capture |
| Date guard | always on | Year mismatch (query has YYYY, chunk does not) → drop the chunk |
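
The date guard rule can be illustrated with a few lines of Python. This is a sketch under assumptions, not the code in retrieval_pipeline.py: the year regex (1900–2099), the `date_guard` function name, and the choice to pass every chunk through when the query names no year are all made up for the example.

```python
import re
from typing import List, Set

# Assumption: a "year" is any standalone 4-digit number from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_years(text: str) -> Set[str]:
    """Return the set of 4-digit years mentioned in the text."""
    return set(YEAR_RE.findall(text))


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that do not mention any year named in the query.

    If the query names no year, nothing is filtered (an assumption).
    """
    wanted = extract_years(query)
    if not wanted:
        return list(chunks)
    return [chunk for chunk in chunks if wanted & extract_years(chunk)]


chunks = [
    "The 2022 final was held in Qatar.",
    "The 2018 final was held in Moscow.",
    "Finals are held every four years.",
]
kept = date_guard("Who won the 2022 World Cup?", chunks)
```

With the query above, only the first chunk survives: the second names a different year and the third names none, so both are dropped before reranking.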

Memory bound (any N)

Regardless of total conversation length:

  • LLM KV at any chunk ≤ system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens
  • SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
  • Total compute O(N), per-token amortized O(1)
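
The arithmetic behind the bound is simple enough to check directly. In the sketch below, the function names are hypothetical; the three fixed terms (32 + 512 + 64 = 608) come from the list above, and the system-prompt length of about 67 tokens is inferred from the ≈ 675 total rather than stated anywhere in this README.

```python
def kv_budget(
    system_prompt_tokens: int,
    sp_tokens: int = 32,
    raw_window: int = 512,
    chunk_size: int = 64,
) -> int:
    """Upper bound on KV entries the LLM attends to at any chunk."""
    return system_prompt_tokens + sp_tokens + raw_window + chunk_size


def attention_cost_upper_bound(total_tokens: int, budget: int) -> int:
    """Each token attends to at most `budget` KV entries, so total attention
    work is at most total_tokens * budget, i.e. linear in total_tokens."""
    return total_tokens * budget


budget = kv_budget(system_prompt_tokens=67)  # 67 is inferred, not documented

# The budget does not depend on conversation length, so doubling the number
# of tokens doubles the bound on total work rather than quadrupling it.
short = attention_cost_upper_bound(10_000, budget)
long = attention_cost_upper_bound(20_000, budget)
```

Because the per-token bound is a constant, the amortized cost per token stays O(1) no matter how long the session runs.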

Fallback architecture (summary)

| layer | decision axis | example threshold |
| --- | --- | --- |
| backend chain (CompositeSearch) | exception / empty → next | - |
| Tier 1 cache | BGE sim | ≥ 0.85 accept |
| Date guard | year(query) == year(chunk) | hard drop |
| Tier 2 fast | BGE sim | ≥ 0.80 accept |
| Tier 3 verify | sim 0.55–0.80 + Qwen YES | accept |
| Tier 4 | nothing passed | honest refusal |

Each level falls through to the next using similarity score + Qwen verifier + date integrity as its decision axis.
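
The fall-through after reranking can be summarized as a small routing function. This is a hedged sketch rather than the RetrievalPipeline source: the `route` function, its `top_n` parameter, and the `verify` callable (a stand-in for the Qwen YES/NO verifier) are assumptions for illustration. What Tier 2 injects is also assumed here to be the top-ranked sentences. Only the 0.80 and 0.55 thresholds come from this README.

```python
from typing import Callable, List, Tuple

Scored = List[Tuple[float, str]]  # (similarity, sentence), sorted descending


def route(
    query: str,
    scored: Scored,
    verify: Callable[[str, str], bool],
    tier2_fast: float = 0.80,
    tier3_min: float = 0.55,
    top_n: int = 3,
) -> Tuple[int, List[str]]:
    """Return (tier, sentences to inject). Tier 4 means honest refusal."""
    if not scored:
        return 4, []
    top_sim = scored[0][0]
    if top_sim >= tier2_fast:
        # Tier 2: similarity is high enough to trust without an LLM check.
        return 2, [sent for _, sent in scored[:top_n]]
    if top_sim >= tier3_min:
        # Tier 3: ask the verifier about each candidate sentence.
        verified = [sent for _, sent in scored[:top_n] if verify(query, sent)]
        if verified:
            return 3, verified
    # Below the floor, or every candidate rejected: refuse.
    return 4, []


def always_no(query: str, sentence: str) -> bool:
    """Stub verifier that rejects everything (illustration only)."""
    return False


tier, sentences = route("q", [(0.70, "a"), (0.60, "b")], always_no)
```

With a verifier that rejects every candidate, the mid-similarity case ends in Tier 4, which matches the "all rejected" branch of the diagram.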

Minimum API

from runtime.spchat import SPChat
from runtime.rag import BGERetriever
from runtime.cache import QueryCache
from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
from runtime.verifier import QwenVerifier
from runtime.retrieval_pipeline import RetrievalPipeline
from runtime.refusal import RECOMMENDED_SYSTEM

chat       = SPChat()
retriever  = BGERetriever()
cache      = QueryCache(retriever, sim_threshold=0.85)
web        = CompositeSearch(
    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
    mode="merge",
)
verifier   = QwenVerifier(chat)
pipeline   = RetrievalPipeline(
    retriever, cache, web, verifier,
    tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3,
)

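user_msg = "What is the capital of Bhutan?"  # example query; replace with real user input

state = chat.start_session(RECOMMENDED_SYSTEM)
res   = pipeline.get(user_msg)
if res.tier == 4:
    # Tier 4: nothing passed the gates, so refuse instead of guessing
    reply = "I don't have reliable information on that."
else:
    # RAG decode settings (rp=1.0, nr=0): repeated digits and names are not penalized
    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
    reply = chat.turn(state, prompt, rp=1.0, nr=0)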

Background

See HANDOFF.md for the research-side story (why bounded memory works, what the eviction lever buys you, how the SP-key vision evolved into BGE-keyed RAG with central PageRank curation as the next milestone).