title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: '3.11'
app_file: app.py
hardware: cpu-basic
pinned: false

OSS vs. Frontier Assistant

Side-by-side evaluation of an open-source assistant (Qwen2.5-1.5B-Instruct) and a frontier assistant (Claude Sonnet 4.5). Both share an identical Gradio UI and the same capabilities: multi-turn chat with persistent short-term memory, calculator and web-search tools, and a two-layer input/output guardrail. Both are evaluated on hallucination, demographic bias, and safety / jailbreak resistance.

🌐 Live demo: huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant
📊 Evaluation report: docs/EVALUATION_REPORT.md
🛠️ Architecture notes: docs/ARCHITECTURE.md · Deploy guide: docs/DEPLOY_GUIDE.md

Setup (local)

git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env          # then fill in your API keys
uv sync                       # installs all pinned deps
uv run python app.py          # http://127.0.0.1:7860

Required keys in .env:

ANTHROPIC_API_KEY — frontier assistant, Haiku moderation, eval judge
HF_TOKEN — Qwen download, deployment
TAVILY_API_KEY — web-search tool
LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY — observability (optional; app no-ops gracefully without them)
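
As a rough illustration of the required/optional split above, a startup check could look like the following. This is a stdlib-only sketch: `missing_required_keys` and `langfuse_enabled` are hypothetical helper names, and the project's actual loader (src/config.py, built on pydantic-settings) may behave differently.

```python
import os
from typing import Mapping

# Key names come from the list above; the helper names are illustrative only.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
LANGFUSE_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_required_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required keys that are unset or blank, in declaration order."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def langfuse_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Tracing is considered enabled only when both Langfuse keys are set."""
    return all(env.get(key, "").strip() for key in LANGFUSE_KEYS)


if __name__ == "__main__":
    missing = missing_required_keys()
    if missing:
        raise SystemExit(f"Missing keys in .env: {', '.join(missing)}")
    print("Langfuse tracing:", "on" if langfuse_enabled() else "off (no-op)")
```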

Running the eval

uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md

Both run_eval and judge are resumable — interrupt with Ctrl+C and re-run to continue.
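
One common way to get this kind of resumability is to append each result to a JSONL file and skip IDs that are already present on restart. The sketch below assumes that design; the file layout, the `id` / `prompt` / `response` fields, and the function names are hypothetical and not taken from eval/run_eval.py.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def completed_ids(results_path: Path) -> set[str]:
    """Collect the IDs of rows written by a previous (possibly interrupted) run."""
    if not results_path.exists():
        return set()
    done: set[str] = set()
    with results_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                continue  # skip malformed rows; their prompts are simply re-run
    return done


def run_resumable(
    prompts: Iterable[dict],
    answer: Callable[[str], str],
    results_path: Path,
) -> int:
    """Answer every prompt not yet in results_path; return the number newly processed."""
    done = completed_ids(results_path)
    processed = 0
    with results_path.open("a", encoding="utf-8") as f:
        for item in prompts:
            if item["id"] in done:
                continue
            row = {"id": item["id"], "response": answer(item["prompt"])}
            f.write(json.dumps(row) + "\n")
            f.flush()  # push each row out of Python's buffer before the next call
            processed += 1
    return processed
```

Writing and flushing one row at a time means an interrupt costs at most the prompt that was in flight.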

Architecture

┌────────────┐    ┌──────────────┐    ┌──────────────────────┐    ┌──────────────┐
│  Gradio UI │ →  │  input filter│ →  │  assistant (Claude   │ →  │  output      │
│  (app.py)  │    │  (regex      │    │  or Qwen) + tools    │    │  moderation  │
│            │    │  blocklist)  │    │  + LangChain memory  │    │  (Haiku 4.5) │
└────────────┘    └──────────────┘    └──────────┬───────────┘    └──────┬───────┘
                                                 │                       │
                                          ┌──────┴──────┐         ┌──────┴───────┐
                                          │  SQLite     │         │  trace via   │
                                          │  sessions   │         │  Langfuse    │
                                          └─────────────┘         └──────────────┘
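
The flow in the diagram can be sketched as a thin wrapper around the model call. This is a simplified illustration only: the blocklist pattern, the refusal strings, and the `moderate` callable (standing in for the Haiku 4.5 moderation call) are placeholders, not the contents of src/guardrails.py.

```python
import re
from typing import Callable

# Placeholder pattern; the real blocklist is not shown in this README.
BLOCKLIST = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\bignore (all )?previous instructions\b",)
]

INPUT_REFUSAL = "Sorry, I can't help with that request."
OUTPUT_REFUSAL = "The response was withheld by the output filter."


def input_allowed(message: str) -> bool:
    """Layer 1: cheap regex check before any model is called."""
    return not any(pattern.search(message) for pattern in BLOCKLIST)


def guarded_reply(
    message: str,
    assistant: Callable[[str], str],
    moderate: Callable[[str], bool],
) -> str:
    """Run input filter -> assistant -> output moderation, in that order."""
    if not input_allowed(message):
        return INPUT_REFUSAL
    draft = assistant(message)
    # Layer 2: `moderate` returns True when the draft is safe to show.
    return draft if moderate(draft) else OUTPUT_REFUSAL
```

Because the wrapper sits outside the assistant, an eval harness can call `assistant` directly and observe unfiltered model output.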

Key decisions (full rationale in docs/ARCHITECTURE.md):

Both assistants share a single BaseAssistant with the tool-calling loop and a 6-turn memory window.
Qwen tool-calling is wired through Qwen's native <tool_call> template (LangChain's ChatHuggingFace doesn't render tool schemas), giving the OSS model genuine tool use.
Guardrails live in the UI layer, not in BaseAssistant, so the eval sees raw model output for honest safety measurement.
The eval judge returns the same JSON schema for every row ({hallucinated, biased, refused, harmful, reasoning}) and receives dataset-aware guidance so that verdicts are meaningful per benchmark.
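
To make the first two decisions more concrete, here is a minimal sketch of parsing Qwen-style `<tool_call>` blocks from generated text and of trimming history to a 6-turn window. It assumes each block wraps a JSON object with `name` and `arguments` keys, and that one turn is a user message plus an assistant reply. The function names and the `calculator` / `expression` payload in the usage example are hypothetical.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
MEMORY_TURNS = 6  # window size from the list above


def parse_tool_calls(text: str) -> list[dict]:
    """Extract well-formed {"name", "arguments"} payloads from model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON is skipped rather than raised
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


def trim_history(messages: list[dict], turns: int = MEMORY_TURNS) -> list[dict]:
    """Keep the last `turns` user/assistant pairs (assumes 2 messages per turn)."""
    if turns <= 0:
        return []
    return messages[-2 * turns:]


if __name__ == "__main__":
    sample = (
        "Let me compute that.\n"
        '<tool_call>\n{"name": "calculator", "arguments": {"expression": "2+2"}}\n</tool_call>'
    )
    print(parse_tool_calls(sample))
```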

Headline eval results (n=30 per dataset, 95% bootstrap CIs)

| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
|---|---|---|
| Hallucination rate (TruthfulQA) | 10.0% [0.0, 23.3] | 63.3% [46.7, 80.0] |
| Bias rate (BBQ, overall) | 3.3% [0.0, 10.0] | 36.7% [20.0, 53.3] |
| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |

Full breakdown by demographic and the judge-self-bias limitation disclosure: docs/EVALUATION_REPORT.md.
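
For readers unfamiliar with the intervals in the table, the sketch below shows a percentile bootstrap over per-prompt 0/1 verdicts. The percentile method, the resample count, and the function name are assumptions; the report's exact procedure is not stated in this README.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n verdicts with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


if __name__ == "__main__":
    # 3 hallucinations out of 30 prompts, i.e. a 10.0% point estimate.
    print(bootstrap_ci([1] * 3 + [0] * 27))
    # All 30 successes: every resample is also all successes.
    print(bootstrap_ci([1] * 30))
```

The second call shows why a 100% rate gets a degenerate [100.0, 100.0] interval: resampling data with no failures can never produce one, so the interval reflects the method and the sample size rather than certainty about the true rate.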

Latency (per turn, measured on the eval run)

| Assistant | Hardware | Median | Mean | p95 | Max |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | API call | 4.4 s | 5.1 s | 10.1 s | 14.6 s |
| Qwen-1.5B | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |
| Qwen-1.5B | HF Space `cpu-basic` | likely 30-60 s+ per reply on shared CPU (see "Deployment") | | | |
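
The summary columns in the latency table can be computed from per-turn timings along these lines. This is a sketch that assumes a nearest-rank definition of p95; the eval's actual percentile method is not stated here, and `latency_summary` is a hypothetical name.

```python
import statistics


def latency_summary(samples_s):
    """Summarise per-turn latencies (in seconds) from a non-empty sequence."""
    ordered = sorted(samples_s)
    n = len(ordered)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n)
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

With long-tailed timings such as the local Qwen run, the median and p95 are usually more informative than the mean, which a single slow turn can pull up.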

The deployed Space uses the free cpu-basic hardware (no HF PRO subscription). The @spaces.GPU decorator is already in place on Qwen's generation function, so switching to ZeroGPU later is a one-line YAML change (hardware: zero-a10g) once a PRO subscription is active. That change is expected to bring Qwen latency down to roughly 3-8 s per reply.

Cost (rough, per chat turn)

| Component | Per-turn cost (approx.) |
|---|---|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok) | ~$0.0003 |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |

A 100-turn Claude conversation runs ~$0.50; the same 100 turns on Qwen via Hugging Face Spaces are free (modulo HF quota).
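
The arithmetic behind these figures can be reproduced as follows. The token counts and the moderation cost come from the table above; the Sonnet 4.5 prices ($3 per million input tokens, $15 per million output tokens) are an assumption here and should be checked against current pricing. The sketch also assumes the per-turn token counts stay constant across the conversation.

```python
def per_turn_cost(in_tokens, out_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Cost in USD of one call, given token counts and per-million-token prices."""
    return (in_tokens * usd_per_mtok_in + out_tokens * usd_per_mtok_out) / 1_000_000


# Assumed list prices for Claude Sonnet 4.5 (USD per million tokens).
SONNET_IN, SONNET_OUT = 3.0, 15.0

sonnet_turn = per_turn_cost(500, 200, SONNET_IN, SONNET_OUT)  # 0.0045
moderation_turn = 0.0003  # Haiku moderation figure taken from the table above
hundred_turns = 100 * (sonnet_turn + moderation_turn)         # about 0.48

if __name__ == "__main__":
    print(f"per turn: ${sonnet_turn + moderation_turn:.4f}")
    print(f"100 turns: ${hundred_turns:.2f}")
```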

Tradeoffs

Frontier is meaningfully more reliable on hallucination and bias in the raw eval, but it costs roughly $0.005 per turn (assistant plus output moderation), whereas OSS hosting on HF Spaces is free.
The OSS model needs the guardrails enabled to be safe to expose — the input/output filters were designed to close the residual gap.
A 1.5B model is the floor, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
The judge is a Claude model, and LLM judges have a documented self-preference bias, so the OSS-vs-Claude comparison is probably somewhat optimistic for Claude.

What I'd improve with more time

A second judge (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
Larger sample sizes (n=100+ per dataset) to tighten CIs.
A real tool-use eval (e.g. GSM8K with calculator, NaturalQuestions with web search) — the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
Try a larger OSS model (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's xlarge tier to quantify the OSS↔frontier gap-vs-size curve.
Add session-scoped Gradio state (per-browser-tab session id) for true multi-user deployment.

License

MIT — free for any use with attribution.

Project layout

oss-vs-frontier-assistant/
├── app.py                   # Gradio entry — runs locally AND on HF Spaces
├── src/
│   ├── config.py            # pydantic-settings env loader
│   ├── assistants/          # BaseAssistant + Claude + Qwen
│   ├── memory.py            # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py             # calculator + Tavily search
│   ├── guardrails.py        # input blocklist + Haiku output moderation
│   └── observability.py     # Langfuse @observe decorator
├── eval/                    # datasets / runner / judge / report
├── docs/                    # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                   # smoke tests