oss-vs-frontier-assistant / docs /ARCHITECTURE.md
KevinMerchant13's picture
polish: 1-page PDF report + doc fixes
114d5f1 verified
# Architecture
How the pieces fit together, and why each design decision was made.
## Request flow
```
browser
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ app.py β€” Gradio ChatInterface β”‚
β”‚ β”‚
β”‚ 1. input guardrail (regex blocklist) β”‚
β”‚ └─ blocked β†’ canned refusal + footer (model never called) β”‚
β”‚ β”‚
β”‚ 2. memory layer (per-turn) β”‚
β”‚ RunnableWithMessageHistory.invoke({"input": msg}, session_id) β”‚
β”‚ loads last 6 turns from SQLChatMessageHistory (SQLite) β”‚
β”‚ β”‚
β”‚ 3. assistant._respond β”‚
β”‚ β”œβ”€ SystemMessage + trimmed history + HumanMessage β”‚
β”‚ └─ tool-calling loop (≀ 4 rounds): β”‚
β”‚ model.invoke β†’ if tool_calls β†’ run_tool_call β†’ repeat β”‚
β”‚ β”‚
β”‚ 4. output guardrail (Claude Haiku 4.5 moderation) β”‚
β”‚ └─ blocked β†’ refusal text, AND rewrite stored history β”‚
β”‚ β”‚
β”‚ 5. status footer (assistant | tools_used | guardrail states) β”‚
β”‚ β”‚
β”‚ Everything above is wrapped in a Langfuse @observe trace tagged β”‚
β”‚ with session_id and assistant_type; tool/model spans nest under. β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Module map
| File | Responsibility |
|-----------------------------------|--------------------------------------------------------------------------|
| `app.py` | Gradio entry; orchestrates guardrails β†’ memory β†’ assistant per turn. |
| `src/config.py` | pydantic-settings loader; drops empty env vars that would shadow `.env`. |
| `src/assistants/base.py` | `BaseAssistant` ABC; shared tool-calling loop + history trimming. |
| `src/assistants/frontier.py` | `ChatAnthropic` wrapper (Claude Sonnet 4.5). |
| `src/assistants/oss.py` | `QwenChatModel` β€” custom LangChain chat model with native tool template. |
| `src/memory.py` | SQLChatMessageHistory + `build_conversational()` factory. |
| `src/tools.py` | `calculator` (sandboxed AST eval) + `web_search` (Tavily). |
| `src/guardrails.py` | input regex blocklist + Haiku-4.5 output moderation. |
| `src/observability.py` | Langfuse client init + `@observe` decorator (no-op fallback). |
| `eval/datasets.py` | TruthfulQA / BBQ / AdvBench loaders, seed=42. |
| `eval/run_eval.py` | Resumable JSONL runner. |
| `eval/judge.py` | Claude Sonnet 4.5 LLM-as-judge with dataset-aware rubric. |
| `eval/report.py` | Bootstrap CIs, matplotlib charts, EVALUATION_REPORT.md. |
## Key design decisions and why
### 1. Why a single `BaseAssistant` with the tool-loop in the base class
For the comparison to be fair, both assistants must have *identical capabilities*. Putting the tool-calling loop, system prompt, history trimming, and memory plumbing in `BaseAssistant` means the only differences between Claude and Qwen are (a) the underlying LangChain chat model and (b) inference latency. Subclasses implement only `_build_model()`.
### 2. Why we built a custom `QwenChatModel` instead of using `ChatHuggingFace`
`langchain-huggingface.ChatHuggingFace.bind_tools()` does not render tool schemas into Qwen's chat template β€” so `bind_tools` is silently a no-op and Qwen never emits tool calls through it. Tested with a deliberate calculator question: Qwen wrote prose, never invoked the tool.
Qwen2.5-Instruct's *native* chat template fully supports tools and emits well-formed `<tool_call>{...}</tool_call>` blocks. `QwenChatModel`:
1. Overrides `bind_tools()` to attach OpenAI-style tool schemas to the runnable.
2. Calls `tokenizer.apply_chat_template(messages, tools=schemas, ...)` so Qwen sees the tools.
3. Parses `<tool_call>` blocks out of the output back into LangChain's `AIMessage.tool_calls` format.
Result: Qwen genuinely uses the calculator/search, matching the Claude interface.
### 3. Why guardrails live in the UI layer, not in the assistants
The evaluation must measure *raw* model behavior β€” that's the only way to honestly compare hallucination, bias, and safety between OSS and frontier. If guardrails ran inside `assistant.chat()`, the eval would measure the *protected* system, not the model itself. So:
- `BaseAssistant.chat()` is stateless and unmoderated β†’ used by the eval.
- `app.respond()` wraps that with input guardrail β†’ memory invocation β†’ output moderation β†’ footer β†’ used by the UI.
This keeps the eval honest while still demonstrating real guardrail behavior in the deployed app.
### 4. Why output moderation also rewrites stored history
A blocked unsafe reply, if persisted, would leak into the next turn's context and could prime the model. So on a block we both replace what the user sees and overwrite the last assistant message in `SQLChatMessageHistory` with the refusal. Rare path, so the `clear() + add_messages()` cycle is cheap.
### 5. Why `RunnableWithMessageHistory` + manual tool loop (rather than LangGraph)
`RunnableWithMessageHistory` is deprecated in LangChain 1.x in favor of LangGraph persistence β€” but it's still the canonical LangChain memory primitive that most tutorials and docs use, and adding `langgraph` would have meant a much larger dependency surface. The manual tool loop (capped at 4 rounds for safety) is ~15 lines, fully traceable, and easy to reason about.
### 6. Why a 6-turn memory window
Qwen-1.5B has a small effective context; we trim to the last 12 messages (6 user/assistant pairs) before each call so long sessions don't degrade Qwen's quality. SQLite still stores the full session (useful for audit/debugging); the trim is only at *read* time.
### 7. Why pydantic-settings drops empty env vars at config load
pydantic-settings ranks real environment variables above `.env`. The dev environment exported `ANTHROPIC_API_KEY=` (empty string) which silently shadowed the real key in `.env`. The config module deletes any empty-string entries for our known secrets so `.env` can fill them in. This is safe on Hugging Face Spaces β€” secrets there arrive as non-empty env vars and still win.
### 8. Why the judge uses a shared schema with per-dataset guidance
A single `{hallucinated, biased, refused, harmful, reasoning}` schema means all rows can be aggregated cleanly. But a one-size rubric would mis-grade: e.g., the "correct" BBQ ambig answer is *"Can't be determined"*, which a generic rubric might flag as a refusal. Per-dataset guidance in the prompt tells the judge which dimensions matter and how to interpret edge cases.
## Trade-offs accepted
- **`RunnableWithMessageHistory` deprecation**: future-LangChain incompatibility risk, but it remains the canonical memory primitive in LangChain tutorials and avoids pulling in `langgraph`.
- **Judge self-bias**: the judge is the same model family as one assistant under test. Disclosed in the report; mitigation would be a second judge or human spot-check on a subset.
- **No per-browser session id on Spaces**: a single process-global session id is used; fine for single-user demo, would need `gr.State` + cookie-derived id for genuine multi-user. Noted in README.
- **CPU-only deployment**: Qwen on shared CPU is slow. The `@spaces.GPU` decorator is in place so switching to ZeroGPU is a one-line YAML change once a PRO subscription is active.