# SP-Distill Runtime: Operational Deployment
Inference/deploy layer for the bounded-memory chat + RAG system built on
top of `ap32_large.pt`. Drop-in scripts that implement the validated operating configuration.
For training scripts and the original handoff, see the repo root and `HANDOFF.md`.
## What this gives you
- **Bounded-memory multi-turn chat** running a frozen
`DeepSeek-R1-Distill-Qwen-1.5B` with SP-evict: the LLM only ever attends
to ≈ **675 KV tokens** regardless of how long the conversation runs.
- **External knowledge** via a **4-tier retrieval pipeline**
(cache → web search → Qwen self-verification → honest refusal),
with **sentence-level BGE reranking** for focused context injection.
- **Hallucination guard**: BGE pre-gate + strict Qwen verifier + date guard
+ self-refusal post-gate + cutoff-aware system prompt.
- **Multiple web backends** (Wikipedia, Tavily, Brave, headless-Chromium
Bing scrape), interchangeable via a single interface.
- Pure CPU runnable; same code runs on GPU.
## Files
| file | what it is |
|---|---|
| `spchat.py` | `SPChat` class. Bounded-memory chat (KEEP-CoT mode). |
| `rag.py` | `BGERetriever`. Sentence embedding + cosine top-k. |
| `refusal.py` | `detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware). |
| `cache.py` | `QueryCache`. BGE-indexed query→chunks cache (Tier 1). |
| `web_search.py` | `WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`. |
| `verifier.py` | `QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself. |
| `retrieval_pipeline.py` | `RetrievalPipeline`. 4-tier orchestrator + date guard + **sentence-level rerank**. |
| `demo_chat.py` | 4-turn chat demo. |
| `demo_rag.py` | Plain RAG demo (no gates). |
| `demo_rag_gated.py` | RAG + hallucination guard (approach E). |
| `demo_full_pipeline.py` | **Full integrated demo** (4-tier + chat + multi-turn). |
| `try_postgate_prompts.py` | Validation harness: system-prompt strictness experiment. |
| **`test_pipeline.py`** | **Smoke tests for the runtime modules** (run after install). |
| `requirements.txt` | Loosely pinned dependencies (includes Playwright). |
## Install
```bash
pip install -r runtime/requirements.txt
# only needed for PlaywrightBingSearch (keyless web scrape):
python -m playwright install chromium
# pull the trained checkpoint
git lfs pull --include="checkpoints/ap32_large.pt"
```
## Run
```bash
# fast module smoke tests (no LLM load)
python runtime/test_pipeline.py --fast
# full smoke (loads LLM + runs one pipeline query, ~5min on CPU)
python runtime/test_pipeline.py
# demos
python runtime/demo_chat.py # multi-turn chat only
python runtime/demo_rag.py # plain RAG over a fixed corpus
python runtime/demo_rag_gated.py # RAG + hallucination guard
python runtime/demo_full_pipeline.py # 4-tier pipeline end-to-end (Bing+Wikipedia merge)
```
## The 4-tier retrieval pipeline (`retrieval_pipeline.py`)
```
user query
   │
   ├─ looks_like_question? ─── no ─── bypass (casual chat)
   │
   ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
   │
   ├─ Web search (Composite: Bing scrape + Wikipedia by default)
   │
   ├─ Date guard: drop chunks whose year doesn't match query year
   │
   ├─ Sentence-split + BGE rerank   ← concentrates signal at sentence level
   │
   ├─ top_sim ≥ 0.80           → Tier 2 fast (trust BGE, no LLM verify)
   ├─ top_sim ∈ [0.55,0.80)    → Tier 3 (Qwen strict verifier per sentence)
   │     ├─ verified ≥ 1   → use verified sentences
   │     └─ all rejected   → Tier 4
   ├─ top_sim < 0.55           → Tier 4
   │
   └─ Tier 4: honest refusal ("I don't have reliable info on that")
```
**Successful retrievals update the cache** (Tier 1) for next time. This local
cache is the device-side counterpart of the planned central-server PageRank
curation layer.
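The Tier 1 lookup-and-update cycle can be sketched roughly as follows. This is a minimal illustration, not the actual `QueryCache` implementation: `TinyQueryCache`, `toy_embed`, and `cosine` are hypothetical names, and a bag-of-words vector stands in for the BGE embedding so the example runs without any model.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for BGE (assumption): a bag-of-words count vector."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class TinyQueryCache:
    """Hypothetical query -> chunks cache gated by a similarity threshold."""

    def __init__(self, embed=toy_embed, sim_threshold: float = 0.85):
        self.embed = embed
        self.sim_threshold = sim_threshold
        self.entries: List[Tuple[Vector, List[str]]] = []

    def lookup(self, query: str) -> Optional[List[str]]:
        """Return cached chunks of the most similar past query, or None on a miss."""
        q = self.embed(query)
        best_sim, best_chunks = 0.0, None
        for vec, chunks in self.entries:
            sim = cosine(q, vec)
            if sim > best_sim:
                best_sim, best_chunks = sim, chunks
        return best_chunks if best_sim >= self.sim_threshold else None

    def store(self, query: str, chunks: List[str]) -> None:
        """Record a successful retrieval so the next similar query hits Tier 1."""
        self.entries.append((self.embed(query), list(chunks)))
```

With the high 0.85 bar, only near-identical queries reuse a cached result; everything else falls through to web search.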
## Sentence-level BGE rerank (this is the latest win)
Instead of comparing the whole retrieved document against the query, the
pipeline:
1. Splits each retrieved doc into sentences (`split_sentences`)
2. Encodes each sentence with BGE
3. Ranks sentences against the query
4. Injects the **top-ranked sentences** (not the whole docs) into the chat
Why it matters: a single answer-bearing sentence ("Thimphu is the capital
of Bhutan.") scores higher than a long paragraph of mixed content and uses
far fewer tokens in the LLM's raw window. In testing, this halved generation
length on the Bhutan query (645 → 321 tokens) while giving a cleaner
one-sentence final answer.
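The four steps above can be sketched as below. This is an illustration under stated assumptions: `split_sentences_sketch` is a naive regex splitter (the real `split_sentences` may behave differently), and `toy_embed` is a bag-of-words stand-in for BGE so the example runs without a model. All function names here are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sentences_sketch(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_embed(text: str) -> Counter:
    """Stand-in for BGE (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank_sentences(query: str, docs: List[str], top_n: int = 3) -> List[Tuple[float, str]]:
    """Split docs into sentences, score each against the query, keep the top_n."""
    q = toy_embed(query)
    sentences = [s for d in docs for s in split_sentences_sketch(d)]
    scored = [(cosine(q, toy_embed(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


docs = [
    "Bhutan is a landlocked country in the Himalayas. "
    "Thimphu is the capital of Bhutan. "
    "Its economy relies on hydropower and tourism."
]
top = rerank_sentences("What is the capital of Bhutan?", docs, top_n=1)
print(top[0][1])  # the answer-bearing sentence ranks first
```

Only the selected sentences, not the full documents, would then be joined into the context that is injected into the chat.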
## Web search backends (`web_search.py`)
All implement `search(query: str) -> List[str]`.
| backend | API key? | free tier | notes |
|---|---|---|---|
| `WikipediaSearch` | no | unlimited | encyclopedic, robust, default |
| `TavilySearch` | yes (`TAVILY_API_KEY`) | 1000 q/mo | AI-agent-tuned, clean text |
| `BraveSearch` | yes (`BRAVE_API_KEY`) | 2000 q/mo | general web, snippet-level |
| `PlaywrightBingSearch` | no | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium |
| `CompositeSearch` | depends on chain | n/a | `mode="fallback"` or `mode="merge"` |
Recommended for production: a `CompositeSearch` over `TavilySearch` then
`WikipediaSearch` in `"fallback"` mode if you have a Tavily key (best content),
or over `PlaywrightBingSearch` and `WikipediaSearch` in `"merge"` mode if you
want zero keys.
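The difference between the two composite modes can be illustrated with a small sketch. It assumes each backend is reduced to a plain `search(query) -> List[str]` callable; `composite_search` is a hypothetical function, and the de-duplication in merge mode is an assumption rather than documented behavior of `CompositeSearch`.

```python
from typing import Callable, List, Sequence

SearchFn = Callable[[str], List[str]]


def composite_search(backends: Sequence[SearchFn], query: str, mode: str = "fallback") -> List[str]:
    """Combine several search backends.

    fallback: return the first non-empty result; an exception or an empty
              result moves on to the next backend.
    merge:    concatenate results from every backend, skipping duplicates
              (the de-duplication is an assumption of this sketch).
    """
    if mode not in ("fallback", "merge"):
        raise ValueError(f"unknown mode: {mode!r}")
    merged: List[str] = []
    for search in backends:
        try:
            results = search(query)
        except Exception:
            results = []  # a failing backend is treated like an empty one
        if mode == "fallback":
            if results:
                return results
        else:
            for chunk in results:
                if chunk not in merged:
                    merged.append(chunk)
    return merged
```

Fallback mode keeps latency low when the first backend succeeds; merge mode queries every backend and leaves the selection to the downstream rerank.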
## Operational settings (validated on CPU; same numbers apply to GPU)
| setting | value | why |
|---|---|---|
| `rw` (raw window) | 512 | Drift-free operating point, near-oracle deep KL |
| `drop_per_chunk` | 64 | Survivor set bounded ⇒ O(N) total compute |
| `chunk_size` | 64 | Matches training, eviction cadence |
| **chat** decode `rp`/`nr` | **1.15 / 4** | Prevents greedy loops in conversational gen |
| **RAG/math** decode `rp`/`nr` | **1.0 / 0** | Do **not** penalize repeated digits/names (faithful copying) |
| EOS between turns | required | Without it the model loops the previous response |
| CoT handling | **KEEP** (do not strip) | Strip mode breaks R1 distill's chat-template expectation |
| `max_resp` per turn | 600–800 | Lets `<think>` close |
| Tier 1 cache threshold | 0.85 | High bar so only near-identical queries reuse cache |
| Tier 2 fast threshold | 0.80 | Trust BGE without LLM verify |
| Tier 3 floor threshold | 0.55 | Below this → straight to Tier 4 |
| System prompt | `refusal.RECOMMENDED_SYSTEM` (cutoff-aware) | Maximizes self-refusal capture |
| Date guard | always on | Year mismatch (query has YYYY, chunk does not) → drop the chunk |
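As an illustration of the date-guard row, here is a minimal sketch of one way such a filter could work. The function name `date_guard`, the regex, and the restriction to years 1900 to 2099 are assumptions of this example; the real guard in `retrieval_pipeline.py` may differ.

```python
import re
from typing import List

# Four-digit years from 1900 to 2099 (assumed range for this sketch).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that do not mention any year named in the query.

    If the query names no year, every chunk passes unchanged.
    """
    query_years = set(YEAR_RE.findall(query))
    if not query_years:
        return list(chunks)
    return [c for c in chunks if query_years & set(YEAR_RE.findall(c))]


chunks = [
    "The 2022 final was held in Qatar.",
    "The 2018 final was held in Moscow.",
    "The final is played every four years.",
]
kept = date_guard("Who won the 2022 World Cup?", chunks)
print(kept)  # only the chunk that mentions 2022 survives
```

Note that under this rule a chunk with no year at all is also dropped when the query names one, which matches the "query has YYYY, chunk does not" case in the table.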
## Memory bound (any N)
Regardless of total conversation length:
- LLM KV at any chunk ≤ `system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens`
- SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
- Total compute **O(N)**, per-token amortized **O(1)**
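The arithmetic behind the bound is easy to check. The fixed terms sum to 32 + 512 + 64 = 608 tokens, so the ≈ 675 figure implies a system prompt of roughly 67 tokens; that prompt length is inferred here from the stated total, not measured. `kv_upper_bound` is a hypothetical helper for illustration.

```python
def kv_upper_bound(system_prompt_tokens: int,
                   sp_tokens: int = 32,
                   raw_window: int = 512,
                   chunk_size: int = 64) -> int:
    """Upper bound on KV tokens attended at any chunk, per the formula above."""
    return system_prompt_tokens + sp_tokens + raw_window + chunk_size


fixed_part = kv_upper_bound(0)       # 32 + 512 + 64
with_prompt = kv_upper_bound(67)     # assumed ~67-token system prompt
print(fixed_part, with_prompt)
```

None of the terms depends on the number of turns, which is why the bound holds for any conversation length N.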
## Fallback architecture (summary)
| layer | decision axis | example threshold |
|---|---|---|
| backend chain (`CompositeSearch`) | exception / empty → next | n/a |
| Tier 1 cache | BGE sim β‰₯ 0.85 | accept |
| Date guard | year(query) == year(chunk) | hard drop |
| Tier 2 fast | BGE sim β‰₯ 0.80 | accept |
| Tier 3 verify | sim 0.55–0.80 + Qwen YES | accept |
| Tier 4 | nothing passed | honest refusal |
Each level falls through to the next using **similarity score + Qwen
verifier + date integrity** as its decision axis.
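The threshold routing for Tiers 2 to 4 can be sketched as a single function. This is an illustration only: `route_tiers` is a hypothetical name, `verify` stands in for the Qwen YES/NO verifier, and returning the top `top_n` sentences unfiltered on the Tier 2 path is an assumption of this sketch.

```python
from typing import Callable, List, Tuple


def route_tiers(scored: List[Tuple[float, str]],
                verify: Callable[[str], bool],
                fast_threshold: float = 0.80,
                floor_threshold: float = 0.55,
                top_n: int = 3) -> Tuple[int, List[str]]:
    """Pick a tier from reranked (similarity, sentence) pairs, sorted descending.

    Returns (tier, sentences_to_inject). Tier 4 means honest refusal.
    """
    if not scored:
        return 4, []
    top_sim = scored[0][0]
    candidates = [sentence for _, sentence in scored[:top_n]]
    if top_sim >= fast_threshold:
        # Tier 2 fast: trust the similarity score, skip the LLM verifier.
        return 2, candidates
    if top_sim >= floor_threshold:
        # Tier 3: keep only sentences the verifier accepts.
        verified = [s for s in candidates if verify(s)]
        if verified:
            return 3, verified
    # Below the floor, or everything rejected: refuse.
    return 4, []
```

For example, a top similarity of 0.7 with one verifier-approved sentence yields Tier 3 with just that sentence, while the same score with no approvals falls through to Tier 4.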
## Minimum API
```python
from runtime.spchat import SPChat
from runtime.rag import BGERetriever
from runtime.cache import QueryCache
from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
from runtime.verifier import QwenVerifier
from runtime.retrieval_pipeline import RetrievalPipeline
from runtime.refusal import RECOMMENDED_SYSTEM
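
# Build the components: chat model, embedder, Tier 1 cache, web backends, verifier
chat = SPChat()
retriever = BGERetriever()
cache = QueryCache(retriever, sim_threshold=0.85)
web = CompositeSearch(
    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
    mode="merge",
)
verifier = QwenVerifier(chat)
pipeline = RetrievalPipeline(
    retriever, cache, web, verifier,
    tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3,
)

# Placeholder user message; replace with real input
user_msg = "What is the capital of Bhutan?"

state = chat.start_session(RECOMMENDED_SYSTEM)
res = pipeline.get(user_msg)
if res.tier == 4:
    # Tier 4: nothing passed the gates, so refuse instead of guessing
    reply = "I don't have reliable information on that."
else:
    # Inject the retrieved chunks as context; RAG decode settings (rp=1.0, nr=0)
    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
    reply = chat.turn(state, prompt, rp=1.0, nr=0)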
```
## Background
See `HANDOFF.md` for the research-side story (why bounded memory works, what
the eviction lever buys you, how the SP-key vision evolved into BGE-keyed
RAG with central PageRank curation as the next milestone).