File size: 8,651 Bytes
35c0d38 3683c14 35c0d38 3683c14 114d5f1 3683c14 114d5f1 3683c14 114d5f1 3683c14 114d5f1 3683c14 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 | # Architecture
How the pieces fit together, and why each design decision was made.
## Request flow
```
browser
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β app.py β Gradio ChatInterface β
β β
β 1. input guardrail (regex blocklist) β
β ββ blocked β canned refusal + footer (model never called) β
β β
β 2. memory layer (per-turn) β
β RunnableWithMessageHistory.invoke({"input": msg}, session_id) β
β loads last 6 turns from SQLChatMessageHistory (SQLite) β
β β
β 3. assistant._respond β
β ββ SystemMessage + trimmed history + HumanMessage β
β ββ tool-calling loop (β€ 4 rounds): β
β model.invoke β if tool_calls β run_tool_call β repeat β
β β
β 4. output guardrail (Claude Haiku 4.5 moderation) β
β ββ blocked β refusal text, AND rewrite stored history β
β β
β 5. status footer (assistant | tools_used | guardrail states) β
β β
β Everything above is wrapped in a Langfuse @observe trace tagged β
β with session_id and assistant_type; tool/model spans nest under. β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
```
## Module map
| File | Responsibility |
|-----------------------------------|--------------------------------------------------------------------------|
| `app.py` | Gradio entry; orchestrates guardrails β memory β assistant per turn. |
| `src/config.py` | pydantic-settings loader; drops empty env vars that would shadow `.env`. |
| `src/assistants/base.py` | `BaseAssistant` ABC; shared tool-calling loop + history trimming. |
| `src/assistants/frontier.py` | `ChatAnthropic` wrapper (Claude Sonnet 4.5). |
| `src/assistants/oss.py` | `QwenChatModel` β custom LangChain chat model with native tool template. |
| `src/memory.py` | SQLChatMessageHistory + `build_conversational()` factory. |
| `src/tools.py` | `calculator` (sandboxed AST eval) + `web_search` (Tavily). |
| `src/guardrails.py` | input regex blocklist + Haiku-4.5 output moderation. |
| `src/observability.py` | Langfuse client init + `@observe` decorator (no-op fallback). |
| `eval/datasets.py` | TruthfulQA / BBQ / AdvBench loaders, seed=42. |
| `eval/run_eval.py` | Resumable JSONL runner. |
| `eval/judge.py` | Claude Sonnet 4.5 LLM-as-judge with dataset-aware rubric. |
| `eval/report.py` | Bootstrap CIs, matplotlib charts, EVALUATION_REPORT.md. |
## Key design decisions and why
### 1. Why a single `BaseAssistant` with the tool-loop in the base class
For the comparison to be fair, both assistants must have *identical capabilities*. Putting the tool-calling loop, system prompt, history trimming, and memory plumbing in `BaseAssistant` means the only differences between Claude and Qwen are (a) the underlying LangChain chat model and (b) inference latency. Subclasses implement only `_build_model()`.
### 2. Why we built a custom `QwenChatModel` instead of using `ChatHuggingFace`
`langchain-huggingface.ChatHuggingFace.bind_tools()` does not render tool schemas into Qwen's chat template β so `bind_tools` is silently a no-op and Qwen never emits tool calls through it. Tested with a deliberate calculator question: Qwen wrote prose, never invoked the tool.
Qwen2.5-Instruct's *native* chat template fully supports tools and emits well-formed `<tool_call>{...}</tool_call>` blocks. `QwenChatModel`:
1. Overrides `bind_tools()` to attach OpenAI-style tool schemas to the runnable.
2. Calls `tokenizer.apply_chat_template(messages, tools=schemas, ...)` so Qwen sees the tools.
3. Parses `<tool_call>` blocks out of the output back into LangChain's `AIMessage.tool_calls` format.
Result: Qwen genuinely uses the calculator/search, matching the Claude interface.
### 3. Why guardrails live in the UI layer, not in the assistants
The evaluation must measure *raw* model behavior β that's the only way to honestly compare hallucination, bias, and safety between OSS and frontier. If guardrails ran inside `assistant.chat()`, the eval would measure the *protected* system, not the model itself. So:
- `BaseAssistant.chat()` is stateless and unmoderated β used by the eval.
- `app.respond()` wraps that with input guardrail β memory invocation β output moderation β footer β used by the UI.
This keeps the eval honest while still demonstrating real guardrail behavior in the deployed app.
### 4. Why output moderation also rewrites stored history
A blocked unsafe reply, if persisted, would leak into the next turn's context and could prime the model. So on a block we both replace what the user sees and overwrite the last assistant message in `SQLChatMessageHistory` with the refusal. Rare path, so the `clear() + add_messages()` cycle is cheap.
### 5. Why `RunnableWithMessageHistory` + manual tool loop (rather than LangGraph)
`RunnableWithMessageHistory` is deprecated in LangChain 1.x in favor of LangGraph persistence β but it's still the canonical LangChain memory primitive that most tutorials and docs use, and adding `langgraph` would have meant a much larger dependency surface. The manual tool loop (capped at 4 rounds for safety) is ~15 lines, fully traceable, and easy to reason about.
### 6. Why a 6-turn memory window
Qwen-1.5B has a small effective context; we trim to the last 12 messages (6 user/assistant pairs) before each call so long sessions don't degrade Qwen's quality. SQLite still stores the full session (useful for audit/debugging); the trim is only at *read* time.
### 7. Why pydantic-settings drops empty env vars at config load
pydantic-settings ranks real environment variables above `.env`. The dev environment exported `ANTHROPIC_API_KEY=` (empty string) which silently shadowed the real key in `.env`. The config module deletes any empty-string entries for our known secrets so `.env` can fill them in. This is safe on Hugging Face Spaces β secrets there arrive as non-empty env vars and still win.
### 8. Why the judge uses a shared schema with per-dataset guidance
A single `{hallucinated, biased, refused, harmful, reasoning}` schema means all rows can be aggregated cleanly. But a one-size rubric would mis-grade: e.g., the "correct" BBQ ambig answer is *"Can't be determined"*, which a generic rubric might flag as a refusal. Per-dataset guidance in the prompt tells the judge which dimensions matter and how to interpret edge cases.
## Trade-offs accepted
- **`RunnableWithMessageHistory` deprecation**: future-LangChain incompatibility risk, but it remains the canonical memory primitive in LangChain tutorials and avoids pulling in `langgraph`.
- **Judge self-bias**: the judge is the same model family as one assistant under test. Disclosed in the report; mitigation would be a second judge or human spot-check on a subset.
- **No per-browser session id on Spaces**: a single process-global session id is used; fine for single-user demo, would need `gr.State` + cookie-derived id for genuine multi-user. Noted in README.
- **CPU-only deployment**: Qwen on shared CPU is slow. The `@spaces.GPU` decorator is in place so switching to ZeroGPU is a one-line YAML change once a PRO subscription is active.
|