---
title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: "3.11"
app_file: app.py
hardware: cpu-basic
pinned: false
---

# OSS vs. Frontier Assistant

Side-by-side evaluation of an **open-source assistant** (`Qwen2.5-1.5B-Instruct`)
and a **frontier assistant** (Claude Sonnet 4.5). Both share an identical Gradio
UI and the same capabilities: multi-turn chat with persistent short-term memory,
calculator and web-search tools, and a two-layer input/output guardrail. Both are
evaluated on hallucination, demographic bias, and safety / jailbreak resistance.

🌐 **Live demo:** [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)
📊 **Evaluation report:** [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)
🛠️ **Architecture notes:** [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) · **Deploy guide:** [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)

---

## Setup (local)

```bash
git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env          # then fill in your API keys
uv sync                       # installs all pinned deps
uv run python app.py          # http://127.0.0.1:7860
```

Required keys in `.env`:
- `ANTHROPIC_API_KEY` — frontier assistant, Haiku moderation, eval judge
- `HF_TOKEN` — Qwen download, deployment
- `TAVILY_API_KEY` — web-search tool
- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` — observability (optional; app no-ops gracefully without them)
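
A quick way to confirm the environment is complete before launching is a check like the one below. This is a stdlib-only sketch: the project's own loader lives in `src/config.py` and uses pydantic-settings, and the helper name `missing_keys` is hypothetical.

```python
import os

# Keys listed above; the Langfuse pair is optional.
REQUIRED = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_keys(env, names):
    """Return the names that are absent or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


def describe_env(env=os.environ):
    """Summarise which keys are missing; assumes .env was already exported."""
    return {
        "required_missing": missing_keys(env, REQUIRED),
        "optional_missing": missing_keys(env, OPTIONAL),
    }
```

Calling `describe_env()` after exporting the `.env` values should return an empty `required_missing` list; a non-empty `optional_missing` only means tracing is off.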

## Running the eval

```bash
uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md
```

Both `run_eval` and `judge` are **resumable** — interrupt with Ctrl+C and re-run to continue.
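
Resumability of this kind is commonly implemented by appending each finished row to a JSONL file and skipping ids already on disk at restart. The sketch below illustrates that pattern. It is an assumption about the approach, not the code in `eval/`, and the names `completed_ids`, `run_resumable`, and `answer_fn` are hypothetical.

```python
import json
from pathlib import Path


def completed_ids(path):
    """Collect the ids of rows already written to the JSONL results file."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}


def run_resumable(prompts, answer_fn, path):
    """Answer only prompts whose id is not yet on disk; return how many ran."""
    done = completed_ids(path)
    ran = 0
    with path.open("a", encoding="utf-8") as fh:
        for prompt in prompts:
            if prompt["id"] in done:
                continue  # finished in an earlier run
            row = {"id": prompt["id"], "response": answer_fn(prompt["text"])}
            fh.write(json.dumps(row) + "\n")
            fh.flush()  # persist each row so an interrupt loses little work
            ran += 1
    return ran
```

Because rows are appended one at a time, a second invocation with the same prompt list only processes what the first one did not reach.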

---

## Architecture

```
┌────────────┐    ┌──────────────┐    ┌──────────────────────┐    ┌──────────────┐
│  Gradio UI │ →  │  input filter│ →  │  assistant (Claude   │ →  │  output      │
│  (app.py)  │    │  (regex      │    │  or Qwen) + tools    │    │  moderation  │
│            │    │  blocklist)  │    │  + LangChain memory  │    │  (Haiku 4.5) │
└────────────┘    └──────────────┘    └──────────┬───────────┘    └──────┬───────┘
                                                 │                       │
                                          ┌──────┴──────┐         ┌──────┴───────┐
                                          │  SQLite     │         │  trace via   │
                                          │  sessions   │         │  Langfuse    │
                                          └─────────────┘         └──────────────┘
```
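
The flow in the diagram can be sketched roughly as follows. All names here (`blocked_by_input_filter`, `guarded_reply`, the `BLOCKLIST` pattern, and the injected `assistant` and `moderate` callables) are hypothetical stand-ins, not the project's actual API. The point is only that the guardrails wrap the assistant call rather than living inside it.

```python
import re

# Illustrative pattern only; the real blocklist lives in src/guardrails.py.
BLOCKLIST = [re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)]
REFUSAL = "Sorry, I can't help with that."


def blocked_by_input_filter(text):
    """Layer 1: cheap regex check before any model is called."""
    return any(pattern.search(text) for pattern in BLOCKLIST)


def guarded_reply(user_text, assistant, moderate):
    """Run input filter -> assistant -> output moderation, as in the diagram.

    `assistant` maps a prompt to a reply; `moderate` returns True when the
    reply is safe to show. Both are injected, so an eval harness can call
    `assistant` directly and see raw, unfiltered output.
    """
    if blocked_by_input_filter(user_text):
        return REFUSAL
    reply = assistant(user_text)
    return reply if moderate(reply) else REFUSAL
```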

**Key decisions** (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):
- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's *native* `<tool_call>` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the **UI layer**, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus *dataset-aware guidance* to make verdicts meaningful per benchmark.
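
Qwen2.5's chat template asks the model to emit tool invocations as JSON wrapped in `<tool_call>` tags, so the assistant side has to extract them from the generated text. Below is a minimal sketch of that parsing step, assuming a `{"name": ..., "arguments": ...}` payload. The function name `parse_tool_calls` is hypothetical and the project's own parser may differ.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(generated):
    """Extract (name, arguments) pairs from Qwen-style <tool_call> blocks."""
    calls = []
    for payload in TOOL_CALL_RE.findall(generated):
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # small models sometimes emit malformed JSON; skip it
        if isinstance(data, dict) and "name" in data:
            calls.append((data["name"], data.get("arguments", {})))
    return calls
```

Skipping malformed payloads instead of raising keeps the tool loop alive when a 1.5B model produces broken JSON; a stricter design could feed the parse error back to the model instead.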

---

## Headline eval results (n=30 per dataset, 95% bootstrap CIs)

| Metric                            | Claude (frontier)        | Qwen-1.5B (OSS)           |
|-----------------------------------|--------------------------|---------------------------|
| Hallucination rate (TruthfulQA)   | **10.0%** [0.0, 23.3]    | 63.3% [46.7, 80.0]        |
| Bias rate (BBQ, overall)          | **3.3%** [0.0, 10.0]     | 36.7% [20.0, 53.3]        |
| Jailbreak resistance (AdvBench)   | 100.0% [100.0, 100.0]    | 100.0% [100.0, 100.0]     |
| Refusal rate (overall)            | 33.3% [23.3, 42.2]       | 35.6% [25.6, 45.6]        |

Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).
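
For readers unfamiliar with the intervals in the table, a percentile bootstrap over per-row binary verdicts looks roughly like this. It is a sketch of the general technique, not necessarily the exact procedure used by `eval.report`; the function name, resample count, and seed are assumptions.

```python
import random


def bootstrap_ci(flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 verdicts."""
    rng = random.Random(seed)
    n = len(flags)
    # Resample n rows with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(flags) / n, lo, hi


# Example: 3 flagged rows out of 30 gives a 10% point estimate.
rate, lo, hi = bootstrap_ci([1] * 3 + [0] * 27)
print(f"{rate:.1%} [{lo:.1%}, {hi:.1%}]")
```

When every row has the same verdict, as in the AdvBench row, every resample has the same mean and the interval collapses to a single point.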

## Latency (per turn, measured on the eval run)

| Assistant         | Hardware    | median | mean   | p95    | max    |
|-------------------|-------------|--------|--------|--------|--------|
| Claude Sonnet 4.5 | API call    | 4.4 s  | 5.1 s  | 10.1 s | 14.6 s |
| Qwen-1.5B         | local CPU (dev laptop)  | 11.9 s | 16.8 s | 37.1 s | 102.2 s |
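| Qwen-1.5B         | HF Space `cpu-basic`    | not measured | not measured | not measured | not measured |

Qwen on the shared `cpu-basic` Space was not timed in the eval run; expect roughly 30-60 s or more per reply on shared CPU (see [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)).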

The deployed Space uses the **free `cpu-basic`** hardware (no HF PRO subscription). The
`@spaces.GPU` decorator is already in place on Qwen's generation, so switching to
**ZeroGPU** later is a one-line YAML change (`hardware: zero-a10g`) once a PRO
subscription is active — expected to bring Qwen latency to ~3-8 s.
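
The summary columns in the latency table can be derived from raw per-turn timings with the standard library. This is a sketch, assuming a plain list of seconds and a nearest-rank p95; the function name `latency_summary` is hypothetical.

```python
import math
import statistics


def latency_summary(seconds):
    """Return median, mean, nearest-rank p95, and max for per-turn latencies."""
    ordered = sorted(seconds)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-indexed
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

With long-tailed timings like Qwen's on CPU, the mean sits well above the median, which is why the table reports both alongside p95 and max.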

## Cost (rough, per chat turn)

| Component                           | Per-turn cost (approx) |
|-------------------------------------|------------------------|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok)  | ~$0.0003 |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |

A 100-turn Claude conversation runs **~$0.50**; the same 100 turns on Qwen via Hugging Face Spaces are **free** (modulo HF quota).
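
The per-turn figure can be reproduced with a few lines of arithmetic. The rates below ($3 per million input tokens and $15 per million output tokens for Sonnet) are assumptions that happen to match the table's ~$0.0045; check current pricing before relying on them.

```python
# Assumed list prices in USD per million tokens (not taken from this repo).
SONNET_INPUT_PER_MTOK = 3.00
SONNET_OUTPUT_PER_MTOK = 15.00


def turn_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Cost in USD of one call given token counts and per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


per_turn = turn_cost(500, 200, SONNET_INPUT_PER_MTOK, SONNET_OUTPUT_PER_MTOK)
print(f"per turn:  ${per_turn:.4f}")        # $0.0045
print(f"100 turns: ${100 * per_turn:.2f}")  # $0.45, before moderation
```

Adding the table's ~$0.0003 moderation cost per turn brings 100 turns to about $0.48, consistent with the ~$0.50 quoted above.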

---

## Tradeoffs

- **Frontier is meaningfully more reliable** on hallucination and bias in the raw eval, but it costs roughly $0.0045 per turn, whereas OSS hosting on HF Spaces is free.
- **The OSS model needs the guardrails enabled** to be safe to expose — the input/output filters were designed to close the residual gap.
- **A 1.5B model is the floor**, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- **The judge is a Claude model**, which has documented self-preference bias — the OSS-vs-Claude comparison is therefore slightly optimistic for Claude.

## What I'd improve with more time

- A **second judge** (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A **real tool-use eval** (e.g. GSM8K with calculator, NaturalQuestions with web search) — the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a **larger OSS model** (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS↔frontier gap-vs-size curve.
- Add **session-scoped Gradio state** (per-browser-tab session id) for true multi-user deployment.

## License

[MIT](LICENSE) — free for any use with attribution.

## Project layout

```
oss-vs-frontier-assistant/
├── app.py                   # Gradio entry — runs locally AND on HF Spaces
├── src/
│   ├── config.py            # pydantic-settings env loader
│   ├── assistants/          # BaseAssistant + Claude + Qwen
│   ├── memory.py            # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py             # calculator + Tavily search
│   ├── guardrails.py        # input blocklist + Haiku output moderation
│   └── observability.py     # Langfuse @observe decorator
├── eval/                    # datasets / runner / judge / report
├── docs/                    # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                   # smoke tests
```