title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: '3.11'
app_file: app.py
hardware: cpu-basic
pinned: false
OSS vs. Frontier Assistant
Side-by-side evaluation of an open-source assistant (Qwen2.5-1.5B-Instruct)
and a frontier assistant (Claude Sonnet 4.5). Both share an identical Gradio
UI and the same capabilities: multi-turn chat with persistent short-term memory,
calculator and web-search tools, and a two-layer input/output guardrail. Both are
evaluated on hallucination, demographic bias, and safety / jailbreak resistance.
Live demo: huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant
Evaluation report: docs/EVALUATION_REPORT.md
Architecture notes: docs/ARCHITECTURE.md · Deploy guide: docs/DEPLOY_GUIDE.md
Setup (local)
git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env # then fill in your API keys
uv sync # installs all pinned deps
uv run python app.py # http://127.0.0.1:7860
Required keys in .env:
- ANTHROPIC_API_KEY: frontier assistant, Haiku moderation, eval judge
- HF_TOKEN: Qwen download, deployment
- TAVILY_API_KEY: web-search tool
- LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY: observability (optional; app no-ops gracefully without them)
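As a rough illustration of the required/optional split above, the check below reports which keys are missing. It is a stdlib-only sketch, not the project's loader (src/config.py uses pydantic-settings); the helper name `check_env` is hypothetical, and it assumes the variables have already been loaded into the environment.

```python
import os

# Key names come from the list above. Langfuse keys are treated as optional,
# matching the note that the app no-ops without them.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def check_env(env=None):
    """Return (missing_required, missing_optional) for the given mapping.

    Hypothetical helper: defaults to os.environ when no mapping is passed.
    """
    env = os.environ if env is None else env
    missing_required = [k for k in REQUIRED_KEYS if not env.get(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if not env.get(k)]
    return missing_required, missing_optional
```

A caller could abort startup when the first list is non-empty and only log a notice for the second.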
Running the eval
uv run python -m eval.run_eval --limit 2 # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval # full run, 90 prompts × 2 assistants
uv run python -m eval.judge # Sonnet 4.5 scores every response
uv run python -m eval.report # charts + EVALUATION_REPORT.md
Both run_eval and judge are resumable: interrupt with Ctrl+C and re-run to continue.
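One common way to get this kind of resumability is to append each finished row to a JSONL file and skip ids already on disk at startup. The sketch below assumes that design; the function names, the `id` / `prompt` / `response` fields, and the file layout are hypothetical and may differ from what eval/run_eval.py actually does.

```python
import json
from pathlib import Path


def load_done_ids(path):
    """Collect the ids of rows already written to the JSONL results file."""
    path = Path(path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # Malformed lines are ignored, so their prompts run again.
                continue
    return done


def run_resumable(prompts, answer_fn, path):
    """Append one JSON line per prompt, skipping ids that are already on disk."""
    done = load_done_ids(path)
    written = 0
    with Path(path).open("a", encoding="utf-8") as fh:
        for row in prompts:
            if row["id"] in done:
                continue
            record = {"id": row["id"], "response": answer_fn(row["prompt"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # push each finished row out before starting the next call
            written += 1
    return written
```

Because rows are only appended, a second invocation after Ctrl+C picks up at the first prompt whose id is not yet in the file.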
Architecture
┌────────────┐   ┌──────────────┐   ┌──────────────────────┐   ┌──────────────┐
│ Gradio UI  │ → │ input filter │ → │ assistant (Claude    │ → │ output       │
│ (app.py)   │   │ (regex       │   │ or Qwen) + tools     │   │ moderation   │
│            │   │ blocklist)   │   │ + LangChain memory   │   │ (Haiku 4.5)  │
└────────────┘   └──────────────┘   └──────────┬───────────┘   └──────┬───────┘
                                               │                      │
                                        ┌──────┴──────┐        ┌──────┴───────┐
                                        │ SQLite      │        │ trace via    │
                                        │ sessions    │        │ Langfuse     │
                                        └─────────────┘        └──────────────┘
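The request path in the diagram can be sketched as a small wrapper: a regex blocklist runs before the model, and a moderation check runs on the draft reply. Everything here is illustrative; the pattern, the refusal text, and the `assistant_fn` / `moderate_fn` callables are hypothetical stand-ins for the real code in src/guardrails.py and src/assistants/.

```python
import re

# Hypothetical blocklist; the project's real patterns are not shown in this README.
BLOCKLIST = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bignore (all )?previous instructions\b",)
]
REFUSAL = "Sorry, I can't help with that."


def input_allowed(text):
    """Layer 1: reject input that matches any blocklist pattern."""
    return not any(p.search(text) for p in BLOCKLIST)


def guarded_reply(user_text, assistant_fn, moderate_fn):
    """Run the input filter, then the assistant, then output moderation."""
    if not input_allowed(user_text):
        return REFUSAL
    draft = assistant_fn(user_text)  # stand-in for Claude or Qwen with tools and memory
    # Layer 2: moderate_fn returns True when the draft is safe to show.
    return draft if moderate_fn(draft) else REFUSAL
```

Keeping the wrapper outside the assistant itself is what lets an eval call `assistant_fn` directly and see unfiltered output.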
Key decisions (full rationale in docs/ARCHITECTURE.md):
- Both assistants share a single BaseAssistant with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's native <tool_call> template (LangChain's ChatHuggingFace doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the UI layer, not in BaseAssistant, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows ({hallucinated, biased, refused, harmful, reasoning}) plus dataset-aware guidance to make verdicts meaningful per benchmark.
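To make the <tool_call> decision concrete: Qwen2.5's chat template asks the model to emit each call as a JSON object with `name` and `arguments` keys wrapped in <tool_call> tags. A minimal parser for that output might look like the sketch below. The function name is hypothetical, and the `calculator` / `expression` names in the usage note are assumed for illustration rather than taken from src/tools.py.

```python
import json
import re

# Matches one JSON object wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of (tool_name, arguments) pairs found in model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than crash the chat loop
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls
```

For example, output containing `<tool_call>{"name": "calculator", "arguments": {"expression": "2+2"}}</tool_call>` would yield one pair that the tool-calling loop can dispatch; text with no tags yields an empty list, which signals a final answer.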
Headline eval results (n=30 per dataset, 95% bootstrap CIs)
| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
|---|---|---|
| Hallucination rate (TruthfulQA) | 10.0% [0.0, 23.3] | 63.3% [46.7, 80.0] |
| Bias rate (BBQ, overall) | 3.3% [0.0, 10.0] | 36.7% [20.0, 53.3] |
| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |
Full breakdown by demographic and the judge-self-bias limitation disclosure: docs/EVALUATION_REPORT.md.
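For readers unfamiliar with bootstrap intervals, here is a minimal percentile-bootstrap sketch over per-prompt 0/1 verdicts. The resample count, seed, and percentile method are assumptions; the report may use different settings.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a non-empty list of 0/1 outcomes.

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n verdicts with replacement and record each resample's rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper
```

With n=30, every resampled rate is a multiple of 1/30, which is why the intervals in the table land on values such as 23.3% and 46.7%. When every verdict is identical, as in the AdvBench row, each resample gives the same rate and the interval collapses to a single point.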
Latency (per turn, measured on the eval run)
| Assistant | Hardware | median | mean | p95 | max |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | API call | 4.4 s | 5.1 s | 10.1 s | 14.6 s |
| Qwen-1.5B | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |
| Qwen-1.5B | HF Space cpu-basic | likely 30-60 s+ per reply on shared CPU (see "Deployment") | | | |
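The summary columns above can be computed from a list of per-turn timings with the standard library. This sketch uses the nearest-rank definition of p95, which is an assumption; the eval code may interpolate instead.

```python
import statistics


def latency_summary(seconds):
    """Summarize a non-empty sequence of per-turn latencies in seconds."""
    ordered = sorted(seconds)
    # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

A mean well above the median, as in the local Qwen row, points to a long right tail: a few very slow turns pull the average up.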
The deployed Space uses the free cpu-basic hardware (no HF PRO subscription). The
@spaces.GPU decorator is already in place on Qwen's generation, so switching to
ZeroGPU later is a one-line YAML change (hardware: zero-a10g) once a PRO
subscription is active. That change is expected to bring Qwen latency to ~3-8 s.
Cost (rough, per chat turn)
| Component | Per-turn cost (approx) |
|---|---|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok) | ~$0.0003 |
| Qwen-1.5B on HF Spaces (cpu-basic or zero-a10g) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |
A 100-turn Claude conversation runs ~$0.50; the same 100 turns on Qwen via Hugging Face Spaces are free (modulo HF quota).
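The arithmetic behind these figures can be sketched as follows. The per-million-token prices for Claude Sonnet 4.5 are assumed list prices and should be checked against current pricing; the token counts and the Haiku moderation figure are taken from the table.

```python
def turn_cost(in_tokens, out_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Cost in USD of one call, given per-million-token prices."""
    return (in_tokens * usd_per_mtok_in + out_tokens * usd_per_mtok_out) / 1_000_000


# Assumption: Sonnet 4.5 list prices in USD per million tokens (input, output).
SONNET_IN, SONNET_OUT = 3.00, 15.00

assistant = turn_cost(500, 200, SONNET_IN, SONNET_OUT)  # token counts from the table
moderation = 0.0003                                     # Haiku figure from the table
hundred_turns = 100 * (assistant + moderation)
```

Under these assumptions `assistant` comes to $0.0045 and `hundred_turns` to about $0.48, in line with the ~$0.50 quoted above. The estimate holds the token counts fixed per turn; a conversation that resends growing history would pay for more input tokens on later turns.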
Tradeoffs
- Frontier is meaningfully more reliable on hallucination and bias in the raw eval (10.0% vs. 63.3% hallucination, 3.3% vs. 36.7% bias), but costs ~$0.005 per turn where OSS hosting is free.
- The OSS model needs the guardrails enabled to be safe to expose; the input/output filters were designed to close the residual gap.
- A 1.5B model is the floor, chosen here to fit ZeroGPU comfortably. A 7B-14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- The judge is a Claude model, which has documented self-preference bias, so the OSS-vs-Claude comparison is likely slightly optimistic for Claude.
What I'd improve with more time
- A second judge (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A real tool-use eval (e.g. GSM8K with calculator, NaturalQuestions with web search); the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a larger OSS model (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's xlarge tier to quantify the OSS-frontier gap-vs-size curve.
- Add session-scoped Gradio state (per-browser-tab session id) for true multi-user deployment.
License
MIT: free for any use with attribution.
Project layout
oss-vs-frontier-assistant/
├── app.py                 # Gradio entry: runs locally AND on HF Spaces
├── src/
│   ├── config.py          # pydantic-settings env loader
│   ├── assistants/        # BaseAssistant + Claude + Qwen
│   ├── memory.py          # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py           # calculator + Tavily search
│   ├── guardrails.py      # input blocklist + Haiku output moderation
│   └── observability.py   # Langfuse @observe decorator
├── eval/                  # datasets / runner / judge / report
├── docs/                  # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                 # smoke tests