title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: '3.11'
app_file: app.py
hardware: cpu-basic
pinned: false

OSS vs. Frontier Assistant

Side-by-side evaluation of an open-source assistant (Qwen2.5-1.5B-Instruct) and a frontier assistant (Claude Sonnet 4.5). Both share an identical Gradio UI and the same capabilities (multi-turn chat with persistent short-term memory, calculator and web-search tools, and a two-layer input/output guardrail), and both are evaluated on hallucination, demographic bias, and safety / jailbreak resistance.

🌐 Live demo: huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant
📊 Evaluation report: docs/EVALUATION_REPORT.md
🛠️ Architecture notes: docs/ARCHITECTURE.md · Deploy guide: docs/DEPLOY_GUIDE.md


Setup (local)

git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env          # then fill in your API keys
uv sync                       # installs all pinned deps
uv run python app.py          # http://127.0.0.1:7860

Required keys in .env:

  • ANTHROPIC_API_KEY: frontier assistant, Haiku moderation, eval judge
  • HF_TOKEN: Qwen download, deployment
  • TAVILY_API_KEY: web-search tool
  • LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY: observability (optional; the app no-ops gracefully without them)
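A pre-flight check for the keys listed above could look like the sketch below. It is a standard-library simplification, not the project's loader (src/config.py uses pydantic-settings); the helper name `check_env` and the grouping of the two Langfuse keys are assumptions made for illustration.

```python
import os

# Key names come from the list above; the helper itself is hypothetical.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL_KEY_GROUPS = {
    "langfuse": ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"),
}


def check_env(env=None):
    """Return (missing_required, enabled_optional_groups) for a mapping of env vars."""
    env = os.environ if env is None else env
    # A key counts as missing when it is absent or set to an empty string.
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    # An optional group is enabled only when every key in it is set.
    enabled = [
        name
        for name, keys in OPTIONAL_KEY_GROUPS.items()
        if all(env.get(key) for key in keys)
    ]
    return missing, enabled
```

Calling `check_env()` with no argument inspects the current process environment, so it only sees the values from .env once something has loaded that file.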

Running the eval

uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md

Both run_eval and judge are resumable: interrupt with Ctrl+C and re-run to continue.
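One common way to get this kind of resumability is to append each finished row to a JSONL file and skip ids that are already present on the next run. The sketch below illustrates that pattern only; the file layout, the `id` / `text` / `response` field names, and both function names are assumptions, not the actual eval.run_eval implementation.

```python
import json
from pathlib import Path


def load_done_ids(results_path):
    """Collect the ids already written to a JSONL results file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # A half-written line (e.g. after Ctrl+C) is ignored, so that row is redone.
                continue
    return done


def run_resumable(prompts, results_path, answer_fn):
    """Process only prompts whose id is not in results_path yet; return how many ran."""
    path = Path(results_path)
    done = load_done_ids(path)
    # If a previous run was cut off mid-line, start the next record on a fresh line.
    torn = path.exists() and path.stat().st_size > 0 and not path.read_bytes().endswith(b"\n")
    ran = 0
    with path.open("a", encoding="utf-8") as fh:
        if torn:
            fh.write("\n")
        for prompt in prompts:
            if prompt["id"] in done:
                continue
            record = {"id": prompt["id"], "response": answer_fn(prompt["text"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # push each row out immediately so an interrupt loses at most one
            ran += 1
    return ran
```

Because every row is flushed as soon as it is produced, re-running after an interrupt repeats at most the prompt that was in flight.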


Architecture

┌────────────┐    ┌──────────────┐    ┌──────────────────────┐    ┌──────────────┐
│  Gradio UI │ →  │  input filter│ →  │  assistant (Claude   │ →  │  output      │
│  (app.py)  │    │  (regex      │    │  or Qwen) + tools    │    │  moderation  │
│            │    │  blocklist)  │    │  + LangChain memory  │    │  (Haiku 4.5) │
└────────────┘    └──────────────┘    └──────────┬───────────┘    └──────┬───────┘
                                                 │                       │
                                          ┌──────┴──────┐         ┌──────┴───────┐
                                          │  SQLite     │         │  trace via   │
                                          │  sessions   │         │  Langfuse    │
                                          └─────────────┘         └──────────────┘
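The request path in the diagram can be sketched as a small wrapper: a regex check on the input, a call to whichever assistant is selected, then a moderation check on the draft reply. Everything named here is hypothetical, including the single blocklist pattern, the refusal text, and the `guarded_reply` signature; the real rules live in src/guardrails.py, and the moderation callable stands in for the Haiku call.

```python
import re
from typing import Callable

# Hypothetical pattern for illustration; the project's actual blocklist is not shown here.
BLOCKLIST = [
    re.compile(r"\bignore (all )?previous instructions\b", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that."  # placeholder wording


def input_allowed(text: str) -> bool:
    """Layer 1: reject input that matches any blocklist pattern."""
    return not any(pattern.search(text) for pattern in BLOCKLIST)


def guarded_reply(
    user_text: str,
    assistant: Callable[[str], str],
    moderate: Callable[[str], bool],
) -> str:
    """Run input filter -> assistant -> output moderation, in that order."""
    if not input_allowed(user_text):
        return REFUSAL  # the model is never called for blocked input
    draft = assistant(user_text)
    # Layer 2: `moderate` returns True when the draft is safe to show.
    return draft if moderate(draft) else REFUSAL
```

Keeping the wrapper outside the assistant object lets the same assistant be called directly, without filters, when raw output is wanted.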

Key decisions (full rationale in docs/ARCHITECTURE.md):

  • Both assistants share a single BaseAssistant with the tool-calling loop and a 6-turn memory window.
  • Qwen tool-calling is wired through Qwen's native <tool_call> template (LangChain's ChatHuggingFace doesn't render tool schemas), giving the OSS model genuine tool use.
  • Guardrails live in the UI layer, not in BaseAssistant, so the eval sees raw model output for honest safety measurement.
  • The eval judge gets the same JSON schema across all rows ({hallucinated, biased, refused, harmful, reasoning}) plus dataset-aware guidance to make verdicts meaningful per benchmark.
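To illustrate the native tool-calling path: Qwen2.5's chat template has the model emit each call as a JSON object wrapped in `<tool_call>` tags, which the calling loop then has to extract and dispatch. The parser below is a minimal sketch of that extraction step; the function name is hypothetical, and the `calculator` tool name and `expression` argument in the usage note are made up for the example.

```python
import json
import re

# Matches one <tool_call>{...}</tool_call> span; DOTALL lets the JSON cover several lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of (tool_name, arguments) pairs found in raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than crash the chat turn
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls
```

For output such as `<tool_call>{"name": "calculator", "arguments": {"expression": "2+2"}}</tool_call>`, the function returns `[("calculator", {"expression": "2+2"})]`; text with no tags yields an empty list, which the loop can treat as a final answer.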

Headline eval results (n=30 per dataset, 95% bootstrap CIs)

| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
|---|---|---|
| Hallucination rate (TruthfulQA) | 10.0% [0.0, 23.3] | 63.3% [46.7, 80.0] |
| Bias rate (BBQ, overall) | 3.3% [0.0, 10.0] | 36.7% [20.0, 53.3] |
| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |

Full breakdown by demographic and the judge-self-bias limitation disclosure: docs/EVALUATION_REPORT.md.
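For readers unfamiliar with bootstrap intervals, the sketch below shows a plain percentile bootstrap for a rate computed from binary judge verdicts. It is an illustration under assumptions (percentile method, 10,000 resamples, fixed seed) and not necessarily what eval.report does, so it will not reproduce the bounds above digit for digit.

```python
import random


def bootstrap_rate_ci(flags, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of 0/1 flags.

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(flags)
    # Resample n verdicts with replacement, n_boot times, and sort the resulting rates.
    rates = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return sum(flags) / n, lower, upper
```

With n=30, a 10.0% rate corresponds to 3 flagged rows out of 30, so `bootstrap_rate_ci([1] * 3 + [0] * 27)` has a point estimate of 0.1. When every row has the same verdict, every resample is identical and the interval collapses to a single point, which is why a 100.0% row can show [100.0, 100.0].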

Latency (per turn, measured on the eval run)

| Assistant | Hardware | median | mean | p95 | max |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | API call | 4.4 s | 5.1 s | 10.1 s | 14.6 s |
| Qwen-1.5B | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |

Qwen-1.5B on HF Space cpu-basic: likely 30-60 s+ per reply on shared CPU (see "Deployment").
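The four summary statistics in the latency table can be computed from a list of per-turn timings as sketched below. The nearest-rank definition of p95 is an assumption; the report may use an interpolating percentile instead, which gives slightly different values on small samples.

```python
import math
import statistics


def latency_summary(seconds):
    """Summarise per-turn latencies (in seconds) as median, mean, p95 and max."""
    ordered = sorted(seconds)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

A mean well above the median, as in the Qwen row, indicates a long right tail: a few very slow turns pull the average up while the typical turn stays closer to the median.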

The deployed Space uses the free cpu-basic hardware (no HF PRO subscription). The @spaces.GPU decorator is already in place on Qwen's generation function, so switching to ZeroGPU later is a one-line YAML change (hardware: zero-a10g) once a PRO subscription is active, which is expected to bring Qwen latency down to ~3-8 s.

Cost (rough, per chat turn)

| Component | Per-turn cost (approx) |
|---|---|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok) | ~$0.0003 |
| Qwen-1.5B on HF Spaces (cpu-basic or zero-a10g) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |

A 100-turn Claude conversation runs ~$0.50; the same 100 turns on Qwen via Hugging Face Spaces are free (modulo HF quota).
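The arithmetic behind these figures can be checked with a few lines. The per-million-token prices below ($3 input, $15 output for Sonnet 4.5) are assumptions that happen to reproduce the table's ~$0.0045; verify them against current pricing before reusing the numbers. The moderation cost is taken from the table as-is rather than recomputed.

```python
def turn_cost(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Cost in USD of one call, given token counts and per-million-token prices."""
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


# Assumed list prices in USD per million tokens (not stated in this README).
SONNET_IN, SONNET_OUT = 3.00, 15.00

assistant_cost = turn_cost(500, 200, SONNET_IN, SONNET_OUT)  # 500 in / 200 out tokens
moderation_cost = 0.0003                                     # from the table above
per_turn = assistant_cost + moderation_cost
conversation_cost = 100 * per_turn                           # 100-turn conversation

print(f"assistant per turn: ${assistant_cost:.4f}")
print(f"per turn incl. moderation: ${per_turn:.4f}")
print(f"100 turns: ${conversation_cost:.2f}")
```

Under these assumptions a 100-turn conversation comes to about $0.48, consistent with the rounded ~$0.50 quoted above.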


Tradeoffs

  • Frontier is meaningfully more reliable on hallucination and bias in the raw eval, but costs ~$0.0045 per turn, whereas hosting the OSS model is free.
  • The OSS model needs the guardrails enabled to be safe to expose; the input/output filters were designed to close the residual gap.
  • A 1.5B model is the floor, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
  • The judge is a Claude model, which has documented self-preference bias, so the OSS-vs-Claude comparison is likely slightly optimistic for Claude.

What I'd improve with more time

  • A second judge (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
  • Larger sample sizes (n=100+ per dataset) to tighten CIs.
  • A real tool-use eval (e.g. GSM8K with calculator, NaturalQuestions with web search); the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
  • Try a larger OSS model (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's xlarge tier to quantify the OSS↔frontier gap-vs-size curve.
  • Add session-scoped Gradio state (per-browser-tab session id) for true multi-user deployment.

License

MIT: free for any use with attribution.

Project layout

oss-vs-frontier-assistant/
├── app.py                   # Gradio entry point; runs locally AND on HF Spaces
├── src/
│   ├── config.py            # pydantic-settings env loader
│   ├── assistants/          # BaseAssistant + Claude + Qwen
│   ├── memory.py            # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py             # calculator + Tavily search
│   ├── guardrails.py        # input blocklist + Haiku output moderation
│   └── observability.py     # Langfuse @observe decorator
├── eval/                    # datasets / runner / judge / report
├── docs/                    # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                   # smoke tests