	# SP-Distill Runtime — Operational Deployment

	Inference and deployment layer for the bounded-memory chat + RAG system built
	on top of `ap32_large.pt`. These drop-in scripts implement the validated
	operating configuration. For training scripts and the original handoff, see
	the repo root and `HANDOFF.md`.

	## What this gives you

	- Bounded-memory multi-turn chat running a frozen
	`DeepSeek-R1-Distill-Qwen-1.5B` with SP-evict: the LLM only ever attends
	to ≈ 675 KV tokens regardless of how long the conversation runs.
	- External knowledge via a 4-tier retrieval pipeline
	(cache → web search → Qwen self-verification → honest refusal),
	with sentence-level BGE reranking for focused context injection.
	- Hallucination guard: BGE pre-gate + strict Qwen verifier + date guard
	+ self-refusal post-gate + cutoff-aware system prompt.
	- Multiple web backends (Wikipedia, Tavily, Brave, headless-Chromium
	Bing scrape), interchangeable via a single interface.
	- Runs on CPU alone; the same code runs on GPU.

	## Files

	| file | what it is |
	|---|---|
	| `spchat.py` | `SPChat` class. Bounded-memory chat (KEEP-CoT mode). |
	| `rag.py` | `BGERetriever`. Sentence embedding + cosine top-k. |
	| `refusal.py` | `detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware). |
	| `cache.py` | `QueryCache`. BGE-indexed query→chunks cache (Tier 1). |
	| `web_search.py` | `WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`. |
	| `verifier.py` | `QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself. |
	| `retrieval_pipeline.py` | `RetrievalPipeline`. 4-tier orchestrator + date guard + sentence-level rerank. |
	| `demo_chat.py` | 4-turn chat demo. |
	| `demo_rag.py` | Plain RAG demo (no gates). |
	| `demo_rag_gated.py` | RAG + hallucination guard (approach E). |
	| `demo_full_pipeline.py` | Full integrated demo (4-tier + chat + multi-turn). |
	| `try_postgate_prompts.py` | Validation harness: system-prompt strictness experiment. |
	| `test_pipeline.py` | Smoke tests for the runtime modules (run after install). |
	| `requirements.txt` | Loosely pinned deps (includes Playwright). |

	## Install

	```bash
	pip install -r runtime/requirements.txt

	# only needed for PlaywrightBingSearch (keyless web scrape):
	python -m playwright install chromium

	# pull the trained checkpoint
	git lfs pull --include="checkpoints/ap32_large.pt"
	```

	## Run

	```bash
	# fast module smoke tests (no LLM load)
	python runtime/test_pipeline.py --fast

	# full smoke (loads LLM + runs one pipeline query, ~5min on CPU)
	python runtime/test_pipeline.py

	# demos
	python runtime/demo_chat.py            # multi-turn chat only
	python runtime/demo_rag.py             # plain RAG over a fixed corpus
	python runtime/demo_rag_gated.py       # RAG + hallucination guard
	python runtime/demo_full_pipeline.py   # 4-tier pipeline end-to-end (Bing+Wikipedia merge)
	```

	## The 4-tier retrieval pipeline (`retrieval_pipeline.py`)

	```
	user query
	│
	├─ looks_like_question? ─── no ─── bypass (casual chat)
	│
	├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
	│
	├─ Web search (Composite: Bing scrape + Wikipedia by default)
	│
	├─ Date guard: drop chunks whose year doesn't match query year
	│
	├─ Sentence-split + BGE rerank ← concentrates signal at sentence level
	│
	├─ top_sim ≥ 0.80          → Tier 2 fast (trust BGE, no LLM verify)
	├─ top_sim ∈ [0.55, 0.80)  → Tier 3 (Qwen strict verifier per sentence)
	│     ├─ verified ≥ 1  → use verified sentences
	│     └─ all rejected  → Tier 4
	├─ top_sim < 0.55          → Tier 4
	│
	└─ Tier 4: honest refusal — "I don't have reliable info on that"
	```
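The threshold routing in the diagram can be sketched as a small pure function. This is a minimal sketch, not the code in `retrieval_pipeline.py`: the name `route_tier` and the `verify` callback are hypothetical, and which sentences are kept in each tier is an assumption. Only the 0.80 and 0.55 thresholds come from the diagram.

```python
from typing import Callable, List, Tuple

TIER2_FAST = 0.80  # at or above: trust BGE, skip the LLM verifier
TIER3_MIN = 0.55   # below: go straight to refusal


def route_tier(
    scored: List[Tuple[str, float]],
    verify: Callable[[str], bool],
) -> Tuple[int, List[str]]:
    """Return (tier, sentences) for reranked (sentence, similarity) pairs.

    `scored` must be sorted by similarity, highest first. `verify` stands in
    for the Qwen YES/NO verifier (hypothetical callback).
    """
    if not scored:
        return 4, []
    top_sim = scored[0][1]
    if top_sim >= TIER2_FAST:
        # Tier 2: keep every sentence that clears the fast threshold (assumed policy).
        return 2, [s for s, sim in scored if sim >= TIER2_FAST]
    if top_sim >= TIER3_MIN:
        # Tier 3: ask the verifier about each candidate in the middle band.
        kept = [s for s, sim in scored if sim >= TIER3_MIN and verify(s)]
        if kept:
            return 3, kept
    # Tier 4: nothing trustworthy survived, so the caller should refuse.
    return 4, []
```

Keeping the routing free of model calls (the verifier is injected) makes the thresholds easy to unit-test without loading the LLM.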

	Successful retrievals update the cache (Tier 1) for next time. This local
	cache is the device-side counterpart of the planned central-server PageRank
	curation layer.
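The Tier 1 lookup and the update on success can be illustrated with a toy cache. This is a sketch under stated assumptions, not the `QueryCache` in `cache.py`: `TinyQueryCache`, `lookup`, `store`, and `toy_embed` are hypothetical names, and the fixed-vocabulary embedding merely stands in for BGE. Only the 0.85 threshold is taken from this README.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TinyQueryCache:
    """Hypothetical stand-in for QueryCache: past query embedding -> chunks."""

    def __init__(self, embed: Callable[[str], Vector], sim_threshold: float = 0.85):
        self.embed = embed
        self.sim_threshold = sim_threshold
        self.entries: List[Tuple[Vector, List[str]]] = []

    def lookup(self, query: str) -> Optional[List[str]]:
        """Return chunks of the most similar past query, or None on a miss."""
        q = self.embed(query)
        best_sim, best_chunks = 0.0, None
        for vec, chunks in self.entries:
            sim = cosine(q, vec)
            if sim > best_sim:
                best_sim, best_chunks = sim, chunks
        return best_chunks if best_sim >= self.sim_threshold else None

    def store(self, query: str, chunks: List[str]) -> None:
        """Record a successful retrieval so a similar query hits Tier 1 later."""
        self.entries.append((self.embed(query), list(chunks)))


def toy_embed(text: str) -> List[float]:
    """Word counts over a tiny fixed vocabulary (placeholder for BGE)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("capital", "bhutan", "weather", "paris")]


cache = TinyQueryCache(toy_embed)
miss_before = cache.lookup("capital of bhutan")
cache.store("capital of bhutan", ["Thimphu is the capital of Bhutan."])
hit_after = cache.lookup("bhutan capital")
```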

	## Sentence-level BGE rerank (this is the latest win)

	Instead of comparing the whole retrieved document against the query, the
	pipeline:
	1. Splits each retrieved doc into sentences (`split_sentences`)
	2. Encodes each sentence with BGE
	3. Ranks sentences against the query
	4. Injects the top-ranked sentences (not the whole docs) into the chat

	Why it matters: a single answer-bearing sentence ("Thimphu is the capital
	of Bhutan.") scores higher against the query than a long paragraph of mixed
	content, and it uses far fewer tokens in the LLM's raw window. In testing,
	this roughly halved generation length on the Bhutan query (645 → 321 tokens)
	and produced a cleaner one-sentence final answer.
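The four steps above can be sketched end to end with a toy embedding. This is an illustrative sketch only: the bag-of-words `toy_embed` is a stand-in for BGE, the regex-based `split_sentences` is a guess at what the real helper of that name does, and `rerank_sentences` is a hypothetical name.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive split on ., ! or ? followed by whitespace (assumed behavior)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts; a placeholder for a BGE sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank_sentences(
    query: str,
    docs: List[str],
    top_n: int = 3,
    embed: Callable[[str], Counter] = toy_embed,
) -> List[Tuple[str, float]]:
    """Score every sentence of every doc against the query, best first."""
    q = embed(query)
    scored = [
        (sent, cosine(q, embed(sent)))
        for doc in docs
        for sent in split_sentences(doc)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


doc = (
    "Bhutan is a landlocked country in the Himalayas. "
    "Thimphu is the capital of Bhutan. "
    "Its economy relies on hydropower and tourism."
)
top = rerank_sentences("What is the capital of Bhutan?", [doc], top_n=1)
```

With the toy embedding, the answer-bearing sentence ranks first because it shares the most terms with the query; a real BGE encoder would rank on semantic similarity rather than word overlap.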

	## Web search backends (`web_search.py`)

	All implement `search(query: str) -> List[str]`.

	| backend | API key? | free tier | notes |
	|---|---|---|---|
	| `WikipediaSearch` | no | unlimited | encyclopedic, robust, default |
	| `TavilySearch` | yes (`TAVILY_API_KEY`) | 1000 q/mo | AI-agent-tuned, clean text |
	| `BraveSearch` | yes (`BRAVE_API_KEY`) | 2000 q/mo | general web, snippet-level |
	| `PlaywrightBingSearch` | no | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium |
	| `CompositeSearch` | depends on chain | n/a | `mode="fallback"` or `mode="merge"` |

	Recommended for production: a `CompositeSearch` of `TavilySearch` then
	`WikipediaSearch` in `mode="fallback"` if you have a Tavily key (best
	content), or `PlaywrightBingSearch` plus `WikipediaSearch` in
	`mode="merge"` if you want zero keys.
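The difference between the two modes can be shown with stub backends. This is a hypothetical sketch, not the `CompositeSearch` in `web_search.py`: `TinyComposite`, `Failing`, and `Fixed` are illustrative names, and de-duplicating merged results is an assumption. What it shares with the real chain is the `search(query) -> List[str]` interface and the rule that an exception or an empty result moves on to the next backend.

```python
from typing import List, Protocol


class SearchBackend(Protocol):
    def search(self, query: str) -> List[str]: ...


class TinyComposite:
    """Hypothetical fallback/merge chain over search backends."""

    def __init__(self, backends: List[SearchBackend], mode: str = "fallback"):
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.backends = backends
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                results = backend.search(query)
            except Exception:
                continue  # a failing backend never breaks the chain
            if self.mode == "fallback" and results:
                return results  # first non-empty answer wins
            for r in results:
                if r not in merged:  # assumed: drop exact duplicates
                    merged.append(r)
        return merged


class Failing:
    """Stub backend that always raises, e.g. when rate limited."""

    def search(self, query: str) -> List[str]:
        raise RuntimeError("rate limited")


class Fixed:
    """Stub backend that returns a canned result list."""

    def __init__(self, results: List[str]):
        self.results = results

    def search(self, query: str) -> List[str]:
        return list(self.results)


a, b = Fixed(["x", "y"]), Fixed(["y", "z"])
fallback_hits = TinyComposite([Failing(), a, b], "fallback").search("q")
merged_hits = TinyComposite([Failing(), a, b], "merge").search("q")
```

Fallback mode saves requests when the first backend is good enough; merge mode costs one call per backend but gives the reranker more candidate sentences.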

	## Operational settings (validated on CPU; same numbers apply to GPU)

	| setting | value | why |
	|---|---|---|
	| `rw` (raw window) | 512 | Drift-free operating point, near-oracle deep KL |
	| `drop_per_chunk` | 64 | Survivor set bounded ⇒ O(N) total compute |
	| `chunk_size` | 64 | Matches training, eviction cadence |
	| chat decode `rp`/`nr` | 1.15 / 4 | Prevents greedy loops in conversational gen |
	| RAG/math decode `rp`/`nr` | 1.0 / 0 | Do not penalize repeated digits/names (faithful copying) |
	| EOS between turns | required | Without it the model loops the previous response |
	| CoT handling | KEEP (do not strip) | Strip mode breaks R1 distill's chat-template expectation |
	| `max_resp` per turn | 600–800 | Lets `<think>` close |
	| Tier 1 cache threshold | 0.85 | High bar so only near-identical queries reuse cache |
	| Tier 2 fast threshold | 0.80 | Trust BGE without LLM verify |
	| Tier 3 floor threshold | 0.55 | Below this → straight to Tier 4 |
	| System prompt | `refusal.RECOMMENDED_SYSTEM` (cutoff-aware) | Maximizes self-refusal capture |
	| Date guard | always on | If the query names a year (YYYY) and a chunk does not contain it → drop the chunk |
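The date guard in the last row can be sketched with a regular expression. This is a minimal sketch, assuming that a "year" means a four-digit number from 1900 to 2099 and that a chunk survives if it mentions at least one of the query's years; `date_guard` is a hypothetical name and the real check in `retrieval_pipeline.py` may differ.

```python
import re
from typing import List

# Assumption: only 19xx and 20xx count as years.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that mention none of the years named in the query."""
    query_years = set(YEAR_RE.findall(query))
    if not query_years:
        return list(chunks)  # no year in the query: nothing to enforce
    return [c for c in chunks if query_years & set(YEAR_RE.findall(c))]


chunks = [
    "The 2023 report lists the new sites.",
    "The 2019 report lists the old sites.",
    "Reports are published every year.",
]
kept = date_guard("What does the 2023 report list?", chunks)
```

Note that the undated third chunk is dropped as well: under this policy, a chunk has to confirm the query's year, not merely avoid contradicting it.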

	## Memory bound (any N)

	Regardless of total conversation length:

	- LLM KV at any chunk ≤ `system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens`
	- SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
	- Total compute O(N), per-token amortized O(1)
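The bound is plain addition: the fixed parts sum to 32 + 512 + 64 = 608 tokens, so the quoted ≈ 675 implies a system prompt of roughly 67 tokens. The sketch below checks that arithmetic and simulates a simplified rolling window to show the peak does not grow with N. It only counts tokens and does not model which tokens SP-evict keeps; `kv_upper_bound` and `peak_kv` are hypothetical helpers.

```python
SP_TOKENS = 32    # SP slots
RAW_WINDOW = 512  # rw
CHUNK_SIZE = 64   # chunk_size


def kv_upper_bound(system_prompt_tokens: int) -> int:
    """Worst-case number of KV tokens the LLM attends to at any chunk."""
    return system_prompt_tokens + SP_TOKENS + RAW_WINDOW + CHUNK_SIZE


def peak_kv(total_tokens: int, system_prompt_tokens: int) -> int:
    """Peak live KV while streaming `total_tokens` chunk by chunk.

    Simplified model: after each chunk, the raw window is trimmed back
    to RAW_WINDOW tokens (the overflow is evicted).
    """
    raw = 0
    peak = 0
    for start in range(0, total_tokens, CHUNK_SIZE):
        chunk = min(CHUNK_SIZE, total_tokens - start)
        live = system_prompt_tokens + SP_TOKENS + raw + chunk
        peak = max(peak, live)
        raw = min(raw + chunk, RAW_WINDOW)
    return peak


fixed_part = kv_upper_bound(0)            # 608
implied_system_prompt = 675 - fixed_part  # 67
```

In this model, a 100,000-token conversation and a 1,000,000-token conversation reach the same peak, which is the point of the bound.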

	## Fallback architecture (summary)

	| layer | decision rule | outcome |
	|---|---|---|
	| backend chain (`CompositeSearch`) | backend raises or returns nothing | try next backend |
	| Tier 1 cache | BGE sim ≥ 0.85 | accept |
	| Date guard | year(query) ≠ year(chunk) | hard drop |
	| Tier 2 fast | BGE sim ≥ 0.80 | accept |
	| Tier 3 verify | sim in [0.55, 0.80) and Qwen says YES | accept |
	| Tier 4 | nothing passed | honest refusal |

	Each layer falls through to the next when its check fails. The checks draw
	on three signals: **similarity score, Qwen verifier, and date integrity**.
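The Tier 3 check can be sketched as a strict parser around a model call. This is a hypothetical illustration, not `QwenVerifier`: the prompt wording, `is_strict_yes`, `verify_sentences`, and the `ask` callback are all assumptions. Because CoT is kept, the sketch reads only the text after the last `</think>` tag, which is also an assumption about the reply format.

```python
from typing import Callable, List


def is_strict_yes(reply: str) -> bool:
    """Accept only replies whose first word is YES; anything else counts as NO."""
    answer = reply.rsplit("</think>", 1)[-1]  # ignore any reasoning block
    words = answer.strip().split()
    return bool(words) and words[0].strip(".,!:;\"'").upper() == "YES"


def verify_sentences(
    question: str,
    sentences: List[str],
    ask: Callable[[str], str],
) -> List[str]:
    """Keep the sentences the verifier model explicitly approves.

    `ask` sends a prompt to the LLM and returns its raw reply (hypothetical).
    """
    kept = []
    for sentence in sentences:
        prompt = (
            f"Question: {question}\n"
            f"Sentence: {sentence}\n"
            "Does the sentence answer the question? Reply YES or NO."
        )
        if is_strict_yes(ask(prompt)):
            kept.append(sentence)
    return kept
```

Treating every ambiguous reply as NO biases the pipeline toward Tier 4 refusal rather than toward injecting an unverified sentence.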

	## Minimum API

	```python
	from runtime.spchat import SPChat
	from runtime.rag import BGERetriever
	from runtime.cache import QueryCache
	from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
	from runtime.verifier import QwenVerifier
	from runtime.retrieval_pipeline import RetrievalPipeline
	from runtime.refusal import RECOMMENDED_SYSTEM

	chat = SPChat()
	retriever = BGERetriever()
	cache = QueryCache(retriever, sim_threshold=0.85)
	web = CompositeSearch(
	    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
	    mode="merge",
	)
	verifier = QwenVerifier(chat)
	pipeline = RetrievalPipeline(
	    retriever, cache, web, verifier,
	    tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3,
	)

	user_msg = "What is the capital of Bhutan?"  # example query

	state = chat.start_session(RECOMMENDED_SYSTEM)
	res = pipeline.get(user_msg)
	if res.tier == 4:
	    # Tier 4: nothing passed the gates, so refuse instead of guessing.
	    reply = "I don't have reliable information on that."
	else:
	    # Inject only the retrieved chunks, then decode without repetition penalties.
	    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
	    reply = chat.turn(state, prompt, rp=1.0, nr=0)
	```

	## Background

	See `HANDOFF.md` for the research-side story (why bounded memory works, what
	the eviction lever buys you, how the SP-key vision evolved into BGE-keyed
	RAG with central PageRank curation as the next milestone).