# SP-Distill Runtime — Operational Deployment

Inference and deployment layer for the bounded-memory chat + RAG system built
on top of `ap32_large.pt`. These are drop-in scripts that implement the
validated operating configuration described below.
For training scripts and the original handoff, see the repo root and `HANDOFF.md`.

## What this gives you

- **Bounded-memory multi-turn chat** running a frozen
  `DeepSeek-R1-Distill-Qwen-1.5B` with SP-evict: the LLM only ever attends
  to ≈ **675 KV tokens** regardless of how long the conversation runs.
- **External knowledge** via a **4-tier retrieval pipeline**
  (cache → fast BGE accept of web results → Qwen self-verification → honest refusal),
  with **sentence-level BGE reranking** for focused context injection.
- **Hallucination guard**: BGE pre-gate + strict Qwen verifier + date guard
  + self-refusal post-gate + cutoff-aware system prompt.
- **Multiple web backends** (Wikipedia, Tavily, Brave, headless-Chromium
  Bing scrape), interchangeable via a single interface.
- Runs on CPU alone; the same code runs on GPU.

## Files

| file | what it is |
|---|---|
| `spchat.py`               | `SPChat` class. Bounded-memory chat (KEEP-CoT mode). |
| `rag.py`                  | `BGERetriever`. Sentence embedding + cosine top-k. |
| `refusal.py`              | `detect_refusal()` + `RECOMMENDED_SYSTEM` (V2 cutoff-aware). |
| `cache.py`                | `QueryCache`. BGE-indexed query→chunks cache (Tier 1). |
| `web_search.py`           | `WikipediaSearch` / `TavilySearch` / `BraveSearch` / `PlaywrightBingSearch` / `CompositeSearch`. |
| `verifier.py`             | `QwenVerifier`. Strict YES/NO chunk relevance, uses Qwen itself. |
| `retrieval_pipeline.py`   | `RetrievalPipeline`. 4-tier orchestrator + date guard + **sentence-level rerank**. |
| `demo_chat.py`            | 4-turn chat demo. |
| `demo_rag.py`             | Plain RAG demo (no gates). |
| `demo_rag_gated.py`       | RAG + hallucination guard (approach E). |
| `demo_full_pipeline.py`   | **Full integrated demo** (4-tier + chat + multi-turn). |
| `try_postgate_prompts.py` | Validation harness: system-prompt strictness experiment. |
| **`test_pipeline.py`**    | **Smoke tests for the runtime modules** (run after install). |
| `requirements.txt`        | Loosely pinned dependencies (includes Playwright). |

## Install

```bash
pip install -r runtime/requirements.txt

# only needed for PlaywrightBingSearch (keyless web scrape):
python -m playwright install chromium

# pull the trained checkpoint
git lfs pull --include="checkpoints/ap32_large.pt"
```

## Run

```bash
# fast module smoke tests (no LLM load)
python runtime/test_pipeline.py --fast

# full smoke (loads LLM + runs one pipeline query, ~5min on CPU)
python runtime/test_pipeline.py

# demos
python runtime/demo_chat.py             # multi-turn chat only
python runtime/demo_rag.py              # plain RAG over a fixed corpus
python runtime/demo_rag_gated.py        # RAG + hallucination guard
python runtime/demo_full_pipeline.py    # 4-tier pipeline end-to-end (Bing+Wikipedia merge)
```

## The 4-tier retrieval pipeline (`retrieval_pipeline.py`)

```
user query
   │
   ├─ looks_like_question? ─── no ─── bypass (casual chat)
   │
   ├─ Tier 1: local cache lookup (BGE sim ≥ 0.85 to a past query)
   │
   ├─ Web search (Composite: Bing scrape + Wikipedia by default)
   │
   ├─ Date guard: drop chunks whose year doesn't match query year
   │
   ├─ Sentence-split + BGE rerank   ← concentrates signal at sentence level
   │
   ├─ top_sim ≥ 0.80      → Tier 2 fast (trust BGE, no LLM verify)
   ├─ top_sim ∈ [0.55,0.80) → Tier 3 (Qwen strict verifier per sentence)
   │   ├─ verified ≥ 1     → use verified sentences
   │   └─ all rejected     → Tier 4
   ├─ top_sim < 0.55      → Tier 4
   │
   └─ Tier 4: honest refusal — "I don't have reliable info on that"
```
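
The routing in the diagram can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in `retrieval_pipeline.py`: `route_tier`, `verify`, and `top_n` are hypothetical names, and the choice to pass every top-ranked sentence through on Tier 2 is a guess. Only the two thresholds come from this document.

```python
from typing import Callable, List, Tuple

TIER2_FAST = 0.80   # documented Tier 2 fast threshold
TIER3_FLOOR = 0.55  # documented Tier 3 floor


def route_tier(
    scored: List[Tuple[str, float]],
    verify: Callable[[str], bool],
    top_n: int = 3,
) -> Tuple[int, List[str]]:
    """Map reranked (sentence, similarity) pairs to (tier, sentences to inject).

    `verify` stands in for the strict YES/NO verifier; `top_n` is an assumed
    cap on how many sentences are considered.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
    if not ranked:
        return 4, []
    top_sim = ranked[0][1]
    if top_sim >= TIER2_FAST:
        # Tier 2: trust the similarity score, no LLM verification.
        return 2, [sentence for sentence, _ in ranked]
    if top_sim >= TIER3_FLOOR:
        # Tier 3: keep only candidates above the floor that the verifier accepts.
        verified = [s for s, sim in ranked if sim >= TIER3_FLOOR and verify(s)]
        if verified:
            return 3, verified
    # Tier 4: nothing trustworthy survived, so the caller should refuse.
    return 4, []
```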

**Successful retrievals update the cache** (Tier 1) for next time. This local
cache is the device-side counterpart of the planned central-server PageRank
curation layer.
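
As a rough picture of how a similarity-keyed cache behaves, here is a small sketch. It is not the `QueryCache` implementation: `TinyQueryCache`, `lookup`, `store`, and `toy_embed` are hypothetical, and the bag-of-words embedder only stands in for BGE so the example runs without a model download. The 0.85 threshold is the documented Tier 1 value.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text: str) -> List[float]:
    """Stand-in for a BGE embedding: word counts over a tiny fixed vocabulary."""
    vocab = ["capital", "bhutan", "france", "weather"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]


class TinyQueryCache:
    """Maps past queries to their chunks; hits require similarity >= threshold."""

    def __init__(self, embed: Callable[[str], Vector], sim_threshold: float = 0.85):
        self.embed = embed
        self.sim_threshold = sim_threshold
        self._entries: List[Tuple[Vector, List[str]]] = []

    def lookup(self, query: str) -> Optional[List[str]]:
        q = self.embed(query)
        best_sim, best_chunks = 0.0, None
        for vec, chunks in self._entries:
            sim = cosine(q, vec)
            if sim > best_sim:
                best_sim, best_chunks = sim, chunks
        return best_chunks if best_sim >= self.sim_threshold else None

    def store(self, query: str, chunks: List[str]) -> None:
        # Called after a successful retrieval so the next similar query is a hit.
        self._entries.append((self.embed(query), list(chunks)))
```

With this toy embedder, storing chunks under "capital of Bhutan" makes "What is the capital of Bhutan?" a cache hit, while "capital of France" scores 0.5 and misses.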

## Sentence-level BGE rerank (this is the latest win)

Instead of comparing the whole retrieved document against the query, the
pipeline:
1. Splits each retrieved doc into sentences (`split_sentences`)
2. Encodes each sentence with BGE
3. Ranks sentences against the query
4. Injects the **top-ranked sentences** (not the whole docs) into the chat

Why it matters: a single answer-bearing sentence ("Thimphu is the capital
of Bhutan.") scores higher than a long paragraph of mixed content and uses
far fewer tokens in the LLM's raw window. In testing, this halved generation
length on the Bhutan query (645 → 321 tokens) while giving a cleaner
one-sentence final answer.
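
The four steps above can be sketched end to end. This is an illustration only: the `split_sentences` shown here is a naive regex splitter whose behavior is assumed, not copied from the repo, and `toy_embed` is a bag-of-words stand-in for BGE so the example runs without a model. `rerank_sentences` and `top_k` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(text: str) -> Counter:
    """Stand-in for a BGE embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rerank_sentences(query: str, docs: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Score every sentence of every doc against the query, best first."""
    q = toy_embed(query)
    scored = [
        (cosine(q, toy_embed(sentence)), sentence)
        for doc in docs
        for sentence in split_sentences(doc)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Given a three-sentence document about Bhutan and the query "What is the capital of Bhutan?", the sentence naming Thimphu ranks first, so only that sentence needs to be injected into the chat.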

## Web search backends (`web_search.py`)

All backends implement `search(query: str) -> List[str]`.

| backend | API key? | free tier | notes |
|---|---|---|---|
| `WikipediaSearch`        | no  | unlimited | encyclopedic, robust, default |
| `TavilySearch`           | yes (`TAVILY_API_KEY`) | 1000 q/mo | AI-agent-tuned, clean text |
| `BraveSearch`            | yes (`BRAVE_API_KEY`)  | 2000 q/mo | general web, snippet-level |
| `PlaywrightBingSearch`   | no  | unlimited (subject to Bing rate limits) | scrapes Bing via headless Chromium |
| `CompositeSearch`        | depends on chain | — | `mode="fallback"` or `mode="merge"` |

Recommended for production: `CompositeSearch` over `TavilySearch` then
`WikipediaSearch` in `mode="fallback"` if you have a Tavily key (best
content), or `CompositeSearch` over `PlaywrightBingSearch` and
`WikipediaSearch` in `mode="merge"` if you want zero keys.
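
To make the two modes concrete, here is a sketch of how a composite over the shared `search` interface could behave. It is an assumption about the semantics, not the repo's `CompositeSearch`: `TinyComposite`, `StaticBackend`, and `BrokenBackend` are hypothetical, and de-duplicating merged results is a design guess.

```python
from typing import List, Protocol


class SearchBackend(Protocol):
    def search(self, query: str) -> List[str]: ...


class TinyComposite:
    """fallback: first non-empty result wins. merge: concatenate, de-duplicated."""

    def __init__(self, backends: List[SearchBackend], mode: str = "fallback"):
        if mode not in ("fallback", "merge"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.backends = backends
        self.mode = mode

    def search(self, query: str) -> List[str]:
        merged: List[str] = []
        for backend in self.backends:
            try:
                results = backend.search(query)
            except Exception:
                continue  # a failing backend falls through to the next one
            if self.mode == "fallback" and results:
                return results
            for chunk in results:
                if chunk not in merged:
                    merged.append(chunk)
        return merged


class StaticBackend:
    """Test double that always returns the same chunks."""

    def __init__(self, results: List[str]):
        self.results = results

    def search(self, query: str) -> List[str]:
        return list(self.results)


class BrokenBackend:
    """Test double that simulates an unreachable backend."""

    def search(self, query: str) -> List[str]:
        raise RuntimeError("backend down")
```

In fallback mode a broken or empty backend is skipped and the first backend with results answers alone; in merge mode every healthy backend contributes.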

## Operational settings (validated on CPU; same numbers apply to GPU)

| setting                  | value | why                                                              |
|---|---|---|
| `rw` (raw window)        | 512   | Drift-free operating point, near-oracle deep KL                  |
| `drop_per_chunk`         | 64    | Survivor set bounded ⇒ O(N) total compute                        |
| `chunk_size`             | 64    | Matches training, eviction cadence                                |
| **chat** decode `rp`/`nr`     | **1.15 / 4** | Prevents greedy loops in conversational gen                |
| **RAG/math** decode `rp`/`nr` | **1.0 / 0**  | Do **not** penalize repeated digits/names (faithful copying) |
| EOS between turns        | required | Without it the model loops the previous response                 |
| CoT handling             | **KEEP** (do not strip) | Strip mode breaks R1 distill's chat-template expectation |
| `max_resp` per turn      | 600–800 | Lets `<think>` close                                             |
| Tier 1 cache threshold   | 0.85  | High bar so only near-identical queries reuse cache              |
| Tier 2 fast threshold    | 0.80  | Trust BGE without LLM verify                                     |
| Tier 3 floor threshold   | 0.55  | Below this → straight to Tier 4                                  |
| System prompt            | `refusal.RECOMMENDED_SYSTEM` (cutoff-aware) | Maximizes self-refusal capture |
| Date guard               | always on | Year mismatch (query has YYYY, chunk does not) → drop the chunk |

## Memory bound (any N)

Regardless of total conversation length:

- LLM KV at any chunk ≤ `system_prompt + 32 SP + raw 512 + chunk 64 ≈ 675 tokens`
- SP survivor set ≤ 64 raw tokens (cumulative bottom-64 eviction)
- Total compute **O(N)**, per-token amortized **O(1)**
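
The bound is simple arithmetic over the operational settings. In the sketch below the three constants are the documented values; the roughly 67-token system prompt is inferred by subtracting them from the ≈ 675 figure and is not stated anywhere in this document.

```python
N_SP = 32         # the "32 SP" term in the bound
RAW_WINDOW = 512  # rw
CHUNK_SIZE = 64   # chunk_size


def kv_upper_bound(system_prompt_tokens: int) -> int:
    """Upper bound on KV tokens the LLM attends to at any chunk."""
    return system_prompt_tokens + N_SP + RAW_WINDOW + CHUNK_SIZE


FIXED_PART = kv_upper_bound(0)           # 32 + 512 + 64 = 608
IMPLIED_SYSTEM_PROMPT = 675 - FIXED_PART  # 67 tokens, inferred rather than documented
```

Conversation length does not appear in the formula, which is why the bound holds for any N.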

## Fallback architecture (summary)

| layer | decision axis | example threshold |
|---|---|---|
| backend chain (`CompositeSearch`) | exception / empty → next | — |
| Tier 1 cache | BGE sim ≥ 0.85 | accept |
| Date guard | year(query) == year(chunk) | hard drop |
| Tier 2 fast | BGE sim ≥ 0.80 | accept |
| Tier 3 verify | sim 0.55–0.80 + Qwen YES | accept |
| Tier 4 | nothing passed | honest refusal |

Each layer falls through to the next when its check fails. The checks are
**similarity score, the Qwen verifier, and date integrity**.
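
The date-guard row can be illustrated with a small regex filter. This is a sketch of one plausible reading (keep a chunk only if it mentions a year that appears in the query), not the logic in `retrieval_pipeline.py`. `date_guard` and `YEAR_RE` are hypothetical names, and limiting years to 1900 through 2099 is an assumption.

```python
import re
from typing import List

# Assumption: a "year" is a standalone four-digit number from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def date_guard(query: str, chunks: List[str]) -> List[str]:
    """Drop chunks that do not mention any year named in the query."""
    query_years = set(YEAR_RE.findall(query))
    if not query_years:
        return list(chunks)  # no year in the query, so nothing to check
    return [c for c in chunks if query_years & set(YEAR_RE.findall(c))]
```

For the query "Who won the 2022 World Cup?", a chunk about the 2018 final and a chunk with no year at all are both dropped, while a chunk mentioning 2022 survives.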

## Minimum API

```python
from runtime.spchat import SPChat
from runtime.rag import BGERetriever
from runtime.cache import QueryCache
from runtime.web_search import PlaywrightBingSearch, WikipediaSearch, CompositeSearch
from runtime.verifier import QwenVerifier
from runtime.retrieval_pipeline import RetrievalPipeline
from runtime.refusal import RECOMMENDED_SYSTEM

chat       = SPChat()
retriever  = BGERetriever()
cache      = QueryCache(retriever, sim_threshold=0.85)
web        = CompositeSearch(
    [PlaywrightBingSearch(n_results=3), WikipediaSearch(n_results=3)],
    mode="merge",
)
verifier   = QwenVerifier(chat)
pipeline   = RetrievalPipeline(
    retriever, cache, web, verifier,
    tier2_fast_threshold=0.80, tier3_min_threshold=0.55, tier3_top_n=3,
)

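user_msg = "What is the capital of Bhutan?"  # example input; replace with the real user message

state = chat.start_session(RECOMMENDED_SYSTEM)
res   = pipeline.get(user_msg)
if res.tier == 4:
    # Tier 4: nothing passed the gates, so refuse rather than guess.
    reply = "I don't have reliable information on that."
else:
    # Inject the retrieved chunks; RAG decode settings (rp=1.0, nr=0) keep copying faithful.
    prompt = "Context:\n" + "\n---\n".join(res.chunks) + "\n\nQuestion: " + user_msg
    reply = chat.turn(state, prompt, rp=1.0, nr=0)
```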
```
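
If the prompt layout and decode settings are needed in more than one place, they could be factored into small helpers. The sketch below is a suggestion, not part of the runtime: `build_prompt`, `decode_settings`, `CHAT_DECODE`, and `RAG_DECODE` are hypothetical names. The prompt layout mirrors the snippet above and the two setting pairs are the documented chat and RAG/math values.

```python
from typing import Dict, List

CHAT_DECODE = {"rp": 1.15, "nr": 4}  # documented chat setting
RAG_DECODE = {"rp": 1.0, "nr": 0}    # documented RAG/math setting


def build_prompt(chunks: List[str], user_msg: str) -> str:
    """Context layout from the snippet above; with no chunks, pass the message through."""
    if not chunks:
        return user_msg
    return "Context:\n" + "\n---\n".join(chunks) + "\n\nQuestion: " + user_msg


def decode_settings(has_context: bool) -> Dict[str, float]:
    """Pick RAG settings when context is injected, chat settings otherwise."""
    return dict(RAG_DECODE if has_context else CHAT_DECODE)
```

Assuming `chat.turn` accepts `rp` and `nr` as keywords, as it does in the snippet above, a call would look like `chat.turn(state, build_prompt(res.chunks, user_msg), **decode_settings(bool(res.chunks)))`.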

## Background

See `HANDOFF.md` for the research-side story (why bounded memory works, what
the eviction lever buys you, how the SP-key vision evolved into BGE-keyed
RAG with central PageRank curation as the next milestone).