---
title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: "3.11"
app_file: app.py
hardware: cpu-basic
pinned: false
---
# OSS vs. Frontier Assistant
Side-by-side evaluation of an **open-source assistant** (`Qwen2.5-1.5B-Instruct`)
and a **frontier assistant** (Claude Sonnet 4.5). Both share an identical Gradio
UI and capabilities (multi-turn chat with persistent short-term memory, a
calculator + web-search tool, and a two-layer input/output guardrail) and are
evaluated on hallucination, demographic bias, and safety / jailbreak resistance.
🌐 **Live demo:** [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)
πŸ“Š **Evaluation report:** [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)
πŸ› οΈ **Architecture notes:** [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) Β· **Deploy guide:** [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)
---
## Setup (local)
```bash
git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env # then fill in your API keys
uv sync # installs all pinned deps
uv run python app.py # http://127.0.0.1:7860
```
Required keys in `.env`:
- `ANTHROPIC_API_KEY`: frontier assistant, Haiku moderation, eval judge
- `HF_TOKEN`: Qwen download, deployment
- `TAVILY_API_KEY`: web-search tool
- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY`: observability (optional; the app no-ops gracefully without them)
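
A quick preflight check can catch a missing key before the app starts. The following is a minimal stdlib sketch, not the project's `src/config.py` (which uses pydantic-settings); the `missing_keys` helper is hypothetical and only the key names come from the list above.

```python
import os

# Key names are taken from the list above; the helper itself is hypothetical.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_keys(env=None, keys=REQUIRED_KEYS):
    """Return the names in `keys` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [k for k in keys if not env.get(k)]
```

Calling `missing_keys()` with no arguments inspects the real process environment, so it only sees values from `.env` if something has already loaded that file.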
## Running the eval
```bash
uv run python -m eval.run_eval --limit 2 # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval # full run, 90 prompts × 2 assistants
uv run python -m eval.judge # Sonnet 4.5 scores every response
uv run python -m eval.report # charts + EVALUATION_REPORT.md
```
Both `run_eval` and `judge` are **resumable**: interrupt with Ctrl+C and re-run to continue.
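
One common way to get this kind of resumability is to append each result to a JSONL file and skip ids that are already on disk. The sketch below illustrates that pattern only; the function name, the record fields, and the file layout are assumptions, not the actual `eval.run_eval` implementation.

```python
import json
from pathlib import Path


def run_resumable(prompts, answer_fn, out_path):
    """Append one JSON line per prompt, skipping ids already on disk.

    Returns the number of rows that were already complete.
    """
    out = Path(out_path)
    done = set()
    if out.exists():
        with out.open(encoding="utf-8") as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    with out.open("a", encoding="utf-8") as f:
        for row in prompts:
            if row["id"] in done:
                continue  # finished in an earlier run
            record = {"id": row["id"], "response": answer_fn(row["prompt"])}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push each row out so an interrupt loses little work
    return len(done)
```

Because every row is written as soon as it completes, re-running the same command after an interrupt only pays for the prompts that are still missing.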
---
## Architecture
```
┌────────────┐   ┌──────────────┐   ┌──────────────────────┐   ┌──────────────┐
│ Gradio UI  │ → │ input filter │ → │ assistant (Claude    │ → │ output       │
│ (app.py)   │   │ (regex       │   │ or Qwen) + tools     │   │ moderation   │
│            │   │ blocklist)   │   │ + LangChain memory   │   │ (Haiku 4.5)  │
└────────────┘   └──────────────┘   └──────────┬───────────┘   └──────┬───────┘
                                               │                      │
                                        ┌──────┴──────┐        ┌──────┴───────┐
                                        │ SQLite      │        │ trace via    │
                                        │ sessions    │        │ Langfuse     │
                                        └─────────────┘        └──────────────┘
```
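
The request path in the diagram (input filter, then assistant, then output moderation) can be sketched as a single function. This is an illustration under stated assumptions: the blocklist patterns, the refusal text, and the `assistant_fn` / `moderate_fn` callables are hypothetical stand-ins, and the project's real logic lives in `src/guardrails.py`.

```python
import re

# Hypothetical patterns; the project's real blocklist is in src/guardrails.py.
BLOCKLIST = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bsystem prompt\b", re.IGNORECASE),
]
REFUSAL = "Sorry, I can't help with that."


def guarded_reply(user_msg, assistant_fn, moderate_fn):
    """Run input filter -> assistant -> output moderation."""
    if any(p.search(user_msg) for p in BLOCKLIST):
        return REFUSAL  # blocked before the model is called
    reply = assistant_fn(user_msg)
    if not moderate_fn(reply):  # moderate_fn returns True when the reply is safe
        return REFUSAL
    return reply
```

Keeping the two checks outside the assistant callable mirrors the layering in the diagram: the model can be exercised on its own, and the filters wrap it only at the edge.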
**Key decisions** (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):
- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's *native* `<tool_call>` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the **UI layer**, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus *dataset-aware guidance* to make verdicts meaningful per benchmark.
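
To make the native tool-calling decision concrete: Qwen2.5's chat template has the model emit tool requests as JSON wrapped in `<tool_call>` tags, which the caller then has to parse. The sketch below shows one way to do that parsing; `parse_tool_calls` and the `calculator` tool name in the usage line are hypothetical, and the project's own loop in `BaseAssistant` may differ.

```python
import json
import re

# Matches one JSON object wrapped in Qwen-style <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract (name, arguments) pairs from <tool_call> blocks in model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than crash the loop
        if isinstance(payload, dict) and "name" in payload:
            calls.append((payload["name"], payload.get("arguments", {})))
    return calls
```

For example, `parse_tool_calls('<tool_call>\n{"name": "calculator", "arguments": {"expression": "2+2"}}\n</tool_call>')` returns `[("calculator", {"expression": "2+2"})]`, and plain text with no tags returns an empty list.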
---
## Headline eval results (n=30 per dataset, 95% bootstrap CIs)
| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
|-----------------------------------|--------------------------|---------------------------|
| Hallucination rate (TruthfulQA) | **10.0%** [0.0, 23.3] | 63.3% [46.7, 80.0] |
| Bias rate (BBQ, overall) | **3.3%** [0.0, 10.0] | 36.7% [20.0, 53.3] |
| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |
Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).
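
For readers unfamiliar with bootstrap intervals, the sketch below shows a percentile bootstrap over per-row 0/1 verdicts. It is an assumption-laden illustration: the resample count, the seed, and the percentile method are choices made here, and the report's own settings may differ.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Returns (observed_rate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n rows with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lo, hi
```

A sample with no variation, such as 30 refusals out of 30, resamples to the same mean every time, which is why a 100.0% rate comes with a degenerate [100.0, 100.0] interval rather than evidence of certainty.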
## Latency (per turn, measured on the eval run)
| Assistant         | Hardware               | median | mean   | p95    | max     |
|-------------------|------------------------|--------|--------|--------|---------|
| Claude Sonnet 4.5 | API call               | 4.4 s  | 5.1 s  | 10.1 s | 14.6 s  |
| Qwen-1.5B         | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |
| Qwen-1.5B         | HF Space `cpu-basic`   | n/a    | n/a    | n/a    | n/a     |

The deployed Space uses the **free `cpu-basic`** hardware (no HF PRO subscription), where
Qwen likely takes 30-60 s+ per reply on shared CPU (see "Deployment"). The
`@spaces.GPU` decorator is already in place on Qwen's generation, so switching to
**ZeroGPU** later is a one-line YAML change (`hardware: zero-a10g`) once a PRO
subscription is active, which is expected to bring Qwen latency to ~3-8 s.
## Cost (rough, per chat turn)
| Component | Per-turn cost (approx) |
|-------------------------------------|------------------------|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok) | ~$0.0003 |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |
A 100-turn Claude conversation runs **~$0.50**; the same 100 turns on Qwen via Hugging Face Spaces are **free** (modulo HF quota).
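
The per-turn figures can be reproduced with simple arithmetic. The sketch below assumes Sonnet list prices of $3 per million input tokens and $15 per million output tokens; those prices are an assumption added here (they are not stated in this README), and the Haiku moderation figure is taken from the table as given.

```python
# Assumed list prices in USD per million tokens; check current pricing before relying on them.
SONNET_IN_PER_M = 3.00
SONNET_OUT_PER_M = 15.00


def turn_cost(in_tok, out_tok, in_per_m, out_per_m):
    """Cost in USD of one call given token counts and per-million-token prices."""
    return (in_tok * in_per_m + out_tok * out_per_m) / 1_000_000


sonnet = turn_cost(500, 200, SONNET_IN_PER_M, SONNET_OUT_PER_M)
moderation = 0.0003  # Haiku moderation figure taken from the table above
conversation = 100 * (sonnet + moderation)
print(f"per turn: ${sonnet:.4f}, 100 turns: ${conversation:.2f}")
# prints: per turn: $0.0045, 100 turns: $0.48
```

Under these assumptions 100 turns come to about $0.48, consistent with the rounded ~$0.50 quoted above.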
---
## Tradeoffs
- **Frontier is meaningfully more reliable** on hallucination and bias in the raw eval, but costs ~$0.0045 per turn where OSS hosting is free.
- **The OSS model needs the guardrails enabled** to be safe to expose; the input/output filters were designed to close the residual gap.
- **A 1.5B model is the floor**, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- **The judge is a Claude model**, which has documented self-preference bias, so the OSS-vs-Claude comparison is therefore slightly optimistic for Claude.
## What I'd improve with more time
- A **second judge** (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A **real tool-use eval** (e.g. GSM8K with calculator, NaturalQuestions with web search); the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a **larger OSS model** (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS↔frontier gap-vs-size curve.
- Add **session-scoped Gradio state** (per-browser-tab session id) for true multi-user deployment.
## License
[MIT](LICENSE): free for any use with attribution.
## Project layout
```
oss-vs-frontier-assistant/
├── app.py                # Gradio entry; runs locally AND on HF Spaces
├── src/
│   ├── config.py         # pydantic-settings env loader
│   ├── assistants/       # BaseAssistant + Claude + Qwen
│   ├── memory.py         # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py          # calculator + Tavily search
│   ├── guardrails.py     # input blocklist + Haiku output moderation
│   └── observability.py  # Langfuse @observe decorator
├── eval/                 # datasets / runner / judge / report
├── docs/                 # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                # smoke tests
```