---
title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: "3.11"
app_file: app.py
hardware: cpu-basic
pinned: false
---
# OSS vs. Frontier Assistant
Side-by-side evaluation of an **open-source assistant** (`Qwen2.5-1.5B-Instruct`)
and a **frontier assistant** (Claude Sonnet 4.5). Both share an identical Gradio
UI and the same capabilities (multi-turn chat with persistent short-term memory,
calculator and web-search tools, and a two-layer input/output guardrail), and both
are evaluated on hallucination, demographic bias, and safety / jailbreak resistance.
**Live demo:** [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)
**Evaluation report:** [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)
🛠️ **Architecture notes:** [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) · **Deploy guide:** [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)
---
## Setup (local)
```bash
git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env # then fill in your API keys
uv sync # installs all pinned deps
uv run python app.py # http://127.0.0.1:7860
```
Required keys in `.env`:
- `ANTHROPIC_API_KEY`: frontier assistant, Haiku moderation, eval judge
- `HF_TOKEN`: Qwen download, deployment
- `TAVILY_API_KEY`: web-search tool
- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY`: observability (optional; the app no-ops gracefully without them)
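A quick way to confirm the `.env` values are visible before launching is a check like the one below. This is a stdlib-only sketch, not the project's loader (the repo's `src/config.py` uses pydantic-settings); the function name `check_env` and the returned dictionary shape are illustrative assumptions.

```python
import os
from typing import Mapping

# Key names taken from the list above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def check_env(env: Mapping[str, str] = os.environ) -> dict:
    """Report missing required keys and whether both Langfuse keys are set.

    Hypothetical helper for illustration only.
    """
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    # Tracing needs both Langfuse keys; otherwise the app is expected to no-op.
    tracing_enabled = all(env.get(key) for key in OPTIONAL_KEYS)
    return {"missing": missing, "tracing_enabled": tracing_enabled}


if __name__ == "__main__":
    print(check_env())
```

An empty `missing` list means the three required keys are present; `tracing_enabled` being false only affects observability.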
## Running the eval
```bash
uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md
```
Both `run_eval` and `judge` are **resumable**: interrupt with Ctrl+C and re-run to continue.
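One common way to get this kind of resumability is an append-only JSONL file keyed by prompt id. The sketch below shows that pattern under stated assumptions: the function name `run_resumable`, the `id` / `prompt` / `response` fields, and the JSONL layout are hypothetical and may differ from what `eval.run_eval` actually does.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_resumable(prompts: Iterable[dict], answer: Callable[[str], str], out_path: Path) -> int:
    """Append one JSON line per prompt, skipping ids already present in out_path.

    Returns the number of rows written during this call.
    """
    done = set()
    if out_path.exists():
        with out_path.open() as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}

    written = 0
    with out_path.open("a") as f:
        for row in prompts:
            if row["id"] in done:
                continue  # finished in an earlier run
            record = {"id": row["id"], "response": answer(row["prompt"])}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the row out before starting the next model call
            written += 1
    return written
```

Because each row is flushed as soon as it is produced, an interrupt loses at most the response that was in flight, and the next run picks up at the first missing id.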
---
## Architecture
```
┌────────────┐   ┌──────────────┐   ┌──────────────────────┐   ┌──────────────┐
│ Gradio UI  │ → │ input filter │ → │ assistant (Claude    │ → │ output       │
│ (app.py)   │   │ (regex       │   │ or Qwen) + tools     │   │ moderation   │
│            │   │ blocklist)   │   │ + LangChain memory   │   │ (Haiku 4.5)  │
└────────────┘   └──────────────┘   └──────────┬───────────┘   └──────┬───────┘
                                               │                      │
                                        ┌──────┴──────┐        ┌──────┴───────┐
                                        │ SQLite      │        │ trace via    │
                                        │ sessions    │        │ Langfuse     │
                                        └─────────────┘        └──────────────┘
```
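The request path in the diagram (input filter, then assistant, then output moderation) can be sketched as a single function. This is an illustration only: the blocklist patterns, the refusal text, and the `guarded_reply` name are hypothetical, and the moderation step is passed in as a plain callable rather than a real Haiku call.

```python
import re
from typing import Callable

# Hypothetical patterns; the project's actual blocklist is not shown in this README.
BLOCKLIST = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"ignore (all )?previous instructions", r"\bsystem prompt\b")
]
REFUSAL = "Sorry, I can't help with that request."


def guarded_reply(
    user_text: str,
    assistant: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> str:
    """Run one turn through both guardrail layers."""
    # Layer 1: cheap regex check, done before any model call is made.
    if any(pattern.search(user_text) for pattern in BLOCKLIST):
        return REFUSAL
    draft = assistant(user_text)
    # Layer 2: moderation of the model's draft (a stand-in for the Haiku check).
    if is_unsafe(draft):
        return REFUSAL
    return draft
```

Keeping both checks outside the assistant callable mirrors the README's choice to place guardrails in the UI layer, so an eval can call `assistant` directly and see unfiltered output.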
**Key decisions** (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):
- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's *native* `<tool_call>` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the **UI layer**, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus *dataset-aware guidance* to make verdicts meaningful per benchmark.
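Two of the decisions above lend themselves to a short sketch: extracting Qwen's `<tool_call>` blocks from raw model text, and trimming history to a 6-turn window. The helper names `parse_tool_calls` and `trim_history` are hypothetical, and treating one turn as a user message plus an assistant message is an assumption; the project's `BaseAssistant` may do both differently.

```python
import json
import re

# Qwen2.5's chat template wraps each tool call as JSON inside <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return the well-formed tool calls found in raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole turn
        if isinstance(payload, dict) and "name" in payload:
            calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls


def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep the last max_turns turns, assuming a turn is one user + one assistant message."""
    if max_turns <= 0:
        return []
    return messages[-2 * max_turns:]
```

Skipping malformed blocks is a deliberate choice for a small model: a 1.5B model may occasionally emit broken JSON, and dropping that call lets the loop fall back to a plain-text answer.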
---
## Headline eval results (n=30 per dataset, 95% bootstrap CIs)
| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
|-----------------------------------|--------------------------|---------------------------|
| Hallucination rate (TruthfulQA) | **10.0%** [0.0, 23.3] | 63.3% [46.7, 80.0] |
| Bias rate (BBQ, overall) | **3.3%** [0.0, 10.0] | 36.7% [20.0, 53.3] |
| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |
Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).
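For readers unfamiliar with bootstrap intervals, the sketch below shows a percentile bootstrap over per-prompt binary verdicts. It is an assumption that the report uses this variant; the function name `bootstrap_ci`, the resample count, and the seed are illustrative, so the bounds it produces need not match the table exactly.

```python
import random


def bootstrap_ci(
    outcomes: list[int],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for a rate.

    outcomes holds one 0/1 verdict per prompt, e.g. 1 = hallucinated.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the n verdicts with replacement and record each resample's rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


if __name__ == "__main__":
    # 3 positive verdicts out of 30 corresponds to the 10.0% row in the table.
    verdicts = [1] * 3 + [0] * 27
    print(bootstrap_ci(verdicts))
```

With only 30 prompts per dataset each resampled rate moves in steps of 1/30, which is why the intervals in the table are wide and land on multiples of about 3.3 points.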
## Latency (per turn, measured on the eval run)
| Assistant | Hardware | median | mean | p95 | max |
|-------------------|-------------|--------|--------|--------|--------|
| Claude Sonnet 4.5 | API call | 4.4 s | 5.1 s | 10.1 s | 14.6 s |
| Qwen-1.5B | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |
| Qwen-1.5B | HF Space `cpu-basic` | not measured; likely 30-60 s+ per reply on shared CPU (see [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)) | n/a | n/a | n/a |
The deployed Space uses the **free `cpu-basic`** hardware (no HF PRO subscription). The
`@spaces.GPU` decorator is already in place on Qwen's generation, so switching to
**ZeroGPU** later is a one-line YAML change (`hardware: zero-a10g`) once a PRO
subscription is active. That switch is expected to bring Qwen latency down to roughly 3-8 s.
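The summary columns in the latency table (median, mean, p95, max) can be reproduced from raw per-turn timings with a few lines of stdlib Python. This is a sketch: the helper name `latency_summary` is hypothetical, and the nearest-rank definition of p95 is an assumption, since the README does not say which percentile method the eval used.

```python
import statistics


def latency_summary(seconds: list[float]) -> dict[str, float]:
    """Summarize per-turn latencies given in seconds."""
    if not seconds:
        raise ValueError("need at least one latency sample")
    ordered = sorted(seconds)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

A mean well above the median, as in the Qwen row, points to a long right tail; reporting p95 and max alongside the median makes that visible.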
## Cost (rough, per chat turn)
| Component | Per-turn cost (approx) |
|-------------------------------------|------------------------|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok) | ~$0.0003 |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |
A 100-turn Claude conversation runs **~$0.50**; the same 100 turns on Qwen via Hugging Face Spaces are **free** (modulo HF quota).
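The arithmetic behind these figures is worked through below. The per-million-token prices are assumptions (USD 3 for input and USD 15 for output, which reproduce the Sonnet row of the table); check current pricing before relying on them. The moderation figure is taken directly from the table rather than recomputed.

```python
def turn_cost(tokens_in: int, tokens_out: int, usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost of one API call in USD, given token counts and per-million-token prices."""
    return (tokens_in * usd_per_mtok_in + tokens_out * usd_per_mtok_out) / 1_000_000


# Assumed prices in USD per million tokens (not stated in this README).
SONNET_IN, SONNET_OUT = 3.00, 15.00

assistant_cost = turn_cost(500, 200, SONNET_IN, SONNET_OUT)  # 0.0015 + 0.0030 = 0.0045
moderation_cost = 0.0003                                     # per-turn figure from the table
conversation_cost = 100 * (assistant_cost + moderation_cost) # 0.48, i.e. roughly $0.50

if __name__ == "__main__":
    print(f"per turn: ${assistant_cost + moderation_cost:.4f}")
    print(f"100 turns: ${conversation_cost:.2f}")
```

Output tokens dominate the assistant cost here: 200 output tokens cost twice as much as 500 input tokens under these prices.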
---
## Tradeoffs
- **Frontier is meaningfully more reliable** on hallucination and bias in the raw eval, but it costs roughly $0.0045 per turn (see the cost table) whereas OSS hosting is free.
- **The OSS model needs the guardrails enabled** to be safe to expose; the input/output filters were designed to close the residual gap.
- **A 1.5B model is the floor**, chosen here to fit ZeroGPU comfortably. A 7B to 14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- **The judge is a Claude model**, which has documented self-preference bias, so the OSS-vs-Claude comparison is likely slightly optimistic for Claude.
## What I'd improve with more time
- A **second judge** (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A **real tool-use eval** (e.g. GSM8K with calculator, NaturalQuestions with web search); the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a **larger OSS model** (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS-to-frontier gap as a function of model size.
- Add **session-scoped Gradio state** (per-browser-tab session id) for true multi-user deployment.
## License
[MIT](LICENSE): free for any use with attribution.
## Project layout
```
oss-vs-frontier-assistant/
├── app.py                 # Gradio entry: runs locally AND on HF Spaces
├── src/
│   ├── config.py          # pydantic-settings env loader
│   ├── assistants/        # BaseAssistant + Claude + Qwen
│   ├── memory.py          # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py           # calculator + Tavily search
│   ├── guardrails.py      # input blocklist + Haiku output moderation
│   └── observability.py   # Langfuse @observe decorator
├── eval/                  # datasets / runner / judge / report
├── docs/                  # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                 # smoke tests
```