Architecture
How the pieces fit together, and why each design decision was made.
Request flow
browser
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β app.py β Gradio ChatInterface β
β β
β 1. input guardrail (regex blocklist) β
β ββ blocked β canned refusal + footer (model never called) β
β β
β 2. memory layer (per-turn) β
β RunnableWithMessageHistory.invoke({"input": msg}, session_id) β
β loads last 6 turns from SQLChatMessageHistory (SQLite) β
β β
β 3. assistant._respond β
β ββ SystemMessage + trimmed history + HumanMessage β
β ββ tool-calling loop (β€ 4 rounds): β
β model.invoke β if tool_calls β run_tool_call β repeat β
β β
β 4. output guardrail (Claude Haiku 4.5 moderation) β
β ββ blocked β refusal text, AND rewrite stored history β
β β
β 5. status footer (assistant | tools_used | guardrail states) β
β β
β Everything above is wrapped in a Langfuse @observe trace tagged β
β with session_id and assistant_type; tool/model spans nest under. β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Module map
| File | Responsibility |
|---|---|
app.py |
Gradio entry; orchestrates guardrails β memory β assistant per turn. |
src/config.py |
pydantic-settings loader; drops empty env vars that would shadow .env. |
src/assistants/base.py |
BaseAssistant ABC; shared tool-calling loop + history trimming. |
src/assistants/frontier.py |
ChatAnthropic wrapper (Claude Sonnet 4.5). |
src/assistants/oss.py |
QwenChatModel β custom LangChain chat model with native tool template. |
src/memory.py |
SQLChatMessageHistory + build_conversational() factory. |
src/tools.py |
calculator (sandboxed AST eval) + web_search (Tavily). |
src/guardrails.py |
input regex blocklist + Haiku-4.5 output moderation. |
src/observability.py |
Langfuse client init + @observe decorator (no-op fallback). |
eval/datasets.py |
TruthfulQA / BBQ / AdvBench loaders, seed=42. |
eval/run_eval.py |
Resumable JSONL runner. |
eval/judge.py |
Claude Sonnet 4.5 LLM-as-judge with dataset-aware rubric. |
eval/report.py |
Bootstrap CIs, matplotlib charts, EVALUATION_REPORT.md. |
Key design decisions and why
1. Why a single BaseAssistant with the tool-loop in the base class
For the comparison to be fair, both assistants must have identical capabilities. Putting the tool-calling loop, system prompt, history trimming, and memory plumbing in BaseAssistant means the only differences between Claude and Qwen are (a) the underlying LangChain chat model and (b) inference latency. Subclasses implement only _build_model().
2. Why we built a custom QwenChatModel instead of using ChatHuggingFace
langchain-huggingface.ChatHuggingFace.bind_tools() does not render tool schemas into Qwen's chat template β so bind_tools is silently a no-op and Qwen never emits tool calls through it. Tested with a deliberate calculator question: Qwen wrote prose, never invoked the tool.
Qwen2.5-Instruct's native chat template fully supports tools and emits well-formed <tool_call>{...}</tool_call> blocks. QwenChatModel:
- Overrides
bind_tools()to attach OpenAI-style tool schemas to the runnable. - Calls
tokenizer.apply_chat_template(messages, tools=schemas, ...)so Qwen sees the tools. - Parses
<tool_call>blocks out of the output back into LangChain'sAIMessage.tool_callsformat.
Result: Qwen genuinely uses the calculator/search, matching the Claude interface.
3. Why guardrails live in the UI layer, not in the assistants
The evaluation must measure raw model behavior β that's the only way to honestly compare hallucination, bias, and safety between OSS and frontier. If guardrails ran inside assistant.chat(), the eval would measure the protected system, not the model itself. So:
BaseAssistant.chat()is stateless and unmoderated β used by the eval.app.respond()wraps that with input guardrail β memory invocation β output moderation β footer β used by the UI.
This keeps the eval honest while still demonstrating real guardrail behavior in the deployed app.
4. Why output moderation also rewrites stored history
A blocked unsafe reply, if persisted, would leak into the next turn's context and could prime the model. So on a block we both replace what the user sees and overwrite the last assistant message in SQLChatMessageHistory with the refusal. Rare path, so the clear() + add_messages() cycle is cheap.
5. Why RunnableWithMessageHistory + manual tool loop (rather than LangGraph)
RunnableWithMessageHistory is deprecated in LangChain 1.x in favor of LangGraph persistence β but it's still the canonical LangChain memory primitive that most tutorials and docs use, and adding langgraph would have meant a much larger dependency surface. The manual tool loop (capped at 4 rounds for safety) is ~15 lines, fully traceable, and easy to reason about.
6. Why a 6-turn memory window
Qwen-1.5B has a small effective context; we trim to the last 12 messages (6 user/assistant pairs) before each call so long sessions don't degrade Qwen's quality. SQLite still stores the full session (useful for audit/debugging); the trim is only at read time.
7. Why pydantic-settings drops empty env vars at config load
pydantic-settings ranks real environment variables above .env. The dev environment exported ANTHROPIC_API_KEY= (empty string) which silently shadowed the real key in .env. The config module deletes any empty-string entries for our known secrets so .env can fill them in. This is safe on Hugging Face Spaces β secrets there arrive as non-empty env vars and still win.
8. Why the judge uses a shared schema with per-dataset guidance
A single {hallucinated, biased, refused, harmful, reasoning} schema means all rows can be aggregated cleanly. But a one-size rubric would mis-grade: e.g., the "correct" BBQ ambig answer is "Can't be determined", which a generic rubric might flag as a refusal. Per-dataset guidance in the prompt tells the judge which dimensions matter and how to interpret edge cases.
Trade-offs accepted
RunnableWithMessageHistorydeprecation: future-LangChain incompatibility risk, but it remains the canonical memory primitive in LangChain tutorials and avoids pulling inlanggraph.- Judge self-bias: the judge is the same model family as one assistant under test. Disclosed in the report; mitigation would be a second judge or human spot-check on a subset.
- No per-browser session id on Spaces: a single process-global session id is used; fine for single-user demo, would need
gr.State+ cookie-derived id for genuine multi-user. Noted in README. - CPU-only deployment: Qwen on shared CPU is slow. The
@spaces.GPUdecorator is in place so switching to ZeroGPU is a one-line YAML change once a PRO subscription is active.