---
title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: "3.11"
app_file: app.py
hardware: cpu-basic
pinned: false
---

# OSS vs. Frontier Assistant

Side-by-side evaluation of an **open-source assistant** (`Qwen2.5-1.5B-Instruct`)
and a **frontier assistant** (Claude Sonnet 4.5). Both share an identical Gradio
UI and the same capabilities (multi-turn chat with persistent short-term memory,
calculator and web-search tools, and a two-layer input/output guardrail), and both
are evaluated on hallucination, demographic bias, and safety / jailbreak resistance.

🌐 **Live demo:** [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)
πŸ“Š **Evaluation report:** [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)
πŸ› οΈ **Architecture notes:** [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) Β· **Deploy guide:** [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)

---

## Setup (local)

```bash
git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env          # then fill in your API keys
uv sync                       # installs all pinned deps
uv run python app.py          # http://127.0.0.1:7860
```

Required keys in `.env`:
- `ANTHROPIC_API_KEY`: frontier assistant, Haiku moderation, eval judge
- `HF_TOKEN`: Qwen download, deployment
- `TAVILY_API_KEY`: web-search tool
- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY`: observability (optional; the app no-ops gracefully without them)

## Running the eval

```bash
uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md
```

Both `run_eval` and `judge` are **resumable**: interrupt with Ctrl+C and re-run to continue.
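
One common way to get this kind of resumability is to append each result to a JSONL file and skip ids that are already present on restart. The sketch below assumes that approach; the file layout and the `load_done_ids`, `run_resumable`, and `answer_fn` names are hypothetical and are not taken from `eval/run_eval.py`.

```python
import json
from pathlib import Path


def load_done_ids(results_path: Path) -> set[str]:
    """Return the ids of prompts already written to the results file."""
    if not results_path.exists():
        return set()
    done = set()
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["id"])
    return done


def run_resumable(prompts, answer_fn, results_path: Path) -> int:
    """Answer every prompt not yet in results_path; return how many were run."""
    done = load_done_ids(results_path)
    ran = 0
    # Append mode keeps the rows written by earlier runs.
    with results_path.open("a", encoding="utf-8") as fh:
        for prompt in prompts:
            if prompt["id"] in done:
                continue  # already answered in a previous run
            row = {"id": prompt["id"], "response": answer_fn(prompt["text"])}
            fh.write(json.dumps(row) + "\n")
            fh.flush()  # write each row out before starting the next prompt
            ran += 1
    return ran
```

Calling `run_resumable` a second time with the same prompts and path does no further work, which is the behaviour the re-run relies on.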

---

## Architecture

```
┌────────────┐    ┌──────────────┐    ┌──────────────────────┐    ┌──────────────┐
│  Gradio UI │ →  │  input filter│ →  │  assistant (Claude   │ →  │  output      │
│  (app.py)  │    │  (regex      │    │  or Qwen) + tools    │    │  moderation  │
│            │    │  blocklist)  │    │  + LangChain memory  │    │  (Haiku 4.5) │
└────────────┘    └──────────────┘    └──────────┬───────────┘    └──────┬───────┘
                                                 │                       │
                                          ┌──────┴──────┐         ┌──────┴───────┐
                                          │  SQLite     │         │  trace via   │
                                          │  sessions   │         │  Langfuse    │
                                          └─────────────┘         └──────────────┘
```
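
The request path in the diagram can be sketched as a small function that wraps any assistant callable. This is a minimal illustration, not the code in `src/guardrails.py`: the blocklist patterns, the refusal text, and the `input_allowed` / `guarded_reply` names are all assumptions, and the output check is passed in as a plain callable where the app uses a Haiku call.

```python
import re
from typing import Callable

# Hypothetical patterns; the real blocklist lives in src/guardrails.py.
BLOCKLIST = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bsystem prompt\b", re.IGNORECASE),
]

# Hypothetical refusal message shown when either layer blocks a turn.
REFUSAL = "Sorry, I can't help with that request."


def input_allowed(user_text: str) -> bool:
    """Layer 1: cheap regex check that runs before any model call."""
    return not any(p.search(user_text) for p in BLOCKLIST)


def guarded_reply(
    user_text: str,
    assistant: Callable[[str], str],
    output_is_safe: Callable[[str], bool],
) -> str:
    """Run input filter -> assistant -> output moderation, in that order."""
    if not input_allowed(user_text):
        return REFUSAL
    draft = assistant(user_text)
    # Layer 2: moderation of the model's draft before it reaches the user.
    return draft if output_is_safe(draft) else REFUSAL
```

Because the wrapper only needs a callable, the eval can call the assistant directly and bypass both layers.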

**Key decisions** (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):
- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's *native* `<tool_call>` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the **UI layer**, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus *dataset-aware guidance* to make verdicts meaningful per benchmark.
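
To illustrate the native tool-calling point above: Qwen2.5's chat template has the model emit each call as a JSON object wrapped in `<tool_call>` tags, so the tool loop has to pull those objects out of the raw text. The parser below is a sketch under that assumption; `parse_tool_calls` is a hypothetical helper, and the `calculator` / `expression` names in the usage note are illustrative rather than the project's actual tool schema.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block and captures the JSON inside it.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(model_output: str) -> list[dict]:
    """Extract well-formed tool calls from raw model text, skipping bad JSON."""
    calls = []
    for match in TOOL_CALL_RE.finditer(model_output):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # small models sometimes emit malformed JSON; ignore it
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls
```

For example, text containing `<tool_call>{"name": "calculator", "arguments": {"expression": "2+2"}}</tool_call>` yields a single call named `calculator`, while plain prose yields an empty list and the loop can return the text as the final answer.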

---

## Headline eval results (n=30 per dataset, 95% bootstrap CIs)

| Metric                            | Claude (frontier)        | Qwen-1.5B (OSS)           |
|-----------------------------------|--------------------------|---------------------------|
| Hallucination rate (TruthfulQA)   | **10.0%** [0.0, 23.3]    | 63.3% [46.7, 80.0]        |
| Bias rate (BBQ, overall)          | **3.3%** [0.0, 10.0]     | 36.7% [20.0, 53.3]        |
| Jailbreak resistance (AdvBench)   | 100.0% [100.0, 100.0]    | 100.0% [100.0, 100.0]     |
| Refusal rate (overall)            | 33.3% [23.3, 42.2]       | 35.6% [25.6, 45.6]        |

Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).
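
For readers unfamiliar with bootstrap intervals, the sketch below shows one way to compute a 95% CI for a rate from 30 binary judge verdicts. It assumes a percentile bootstrap; this README does not say which variant `eval.report` uses, and `bootstrap_ci` is a hypothetical name.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n verdicts with replacement and record each resample's rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# 3 flagged responses out of 30, i.e. a 10% observed rate.
rate, lower, upper = bootstrap_ci([1] * 3 + [0] * 27)
```

With n=30 every resampled rate is a multiple of 1/30, which is why the intervals in the table are wide and land on values such as 23.3% and 46.7%.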

## Latency (per turn, measured on the eval run)

| Assistant         | Hardware               | median | mean   | p95    | max     |
|-------------------|------------------------|--------|--------|--------|---------|
| Claude Sonnet 4.5 | API call               | 4.4 s  | 5.1 s  | 10.1 s | 14.6 s  |
| Qwen-1.5B         | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |
| Qwen-1.5B         | HF Space `cpu-basic`   | likely 30-60 s+ per reply on shared CPU (see "Deployment") | | | |

The deployed Space uses the **free `cpu-basic`** hardware (no HF PRO subscription). The
`@spaces.GPU` decorator is already in place on Qwen's generation, so switching to
**ZeroGPU** later is a one-line YAML change (`hardware: zero-a10g`) once a PRO
subscription is active, which is expected to bring Qwen latency to ~3-8 s.

## Cost (rough, per chat turn)

| Component                           | Per-turn cost (approx) |
|-------------------------------------|------------------------|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok) | ~$0.0045 |
| Haiku 4.5 output moderation (~150 in / 50 out tok)  | ~$0.0003 |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`) | free (within HF Space quotas) |
| Tavily web search (when tool fires) | free tier ≤1k/mo |

A 100-turn Claude conversation runs **~$0.50**; the same 100 turns on Qwen via Hugging Face Spaces are **free** (modulo HF quota).
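
The ~$0.50 figure follows directly from the per-turn rows in the table. The snippet below just reproduces that arithmetic; the constant and function names are hypothetical, and Tavily is left out on the assumption that usage stays inside its free tier.

```python
# Approximate per-turn costs in USD, copied from the table above.
SONNET_PER_TURN = 0.0045
HAIKU_MODERATION_PER_TURN = 0.0003


def conversation_cost(turns: int, moderated: bool = True) -> float:
    """Estimated USD cost of a Claude conversation with the given turn count."""
    per_turn = SONNET_PER_TURN + (HAIKU_MODERATION_PER_TURN if moderated else 0.0)
    return turns * per_turn


print(f"${conversation_cost(100):.2f}")  # prints $0.48
```

Output moderation adds well under a tenth of the per-turn total, so the Sonnet call dominates the bill.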

---

## Tradeoffs

- **Frontier is meaningfully more reliable** on hallucination and bias in the raw eval, but costs roughly $0.005 per turn versus free OSS hosting.
- **The OSS model needs the guardrails enabled** to be safe to expose; the input/output filters were designed to close the residual gap.
- **A 1.5B model is the floor**, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- **The judge is a Claude model**, which has documented self-preference bias, so the OSS-vs-Claude comparison is likely slightly optimistic for Claude.

## What I'd improve with more time

- A **second judge** (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A **real tool-use eval** (e.g. GSM8K with calculator, NaturalQuestions with web search); the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a **larger OSS model** (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS↔frontier gap-vs-size curve.
- Add **session-scoped Gradio state** (per-browser-tab session id) for true multi-user deployment.

## License

[MIT](LICENSE): free for any use with attribution.

## Project layout

```
oss-vs-frontier-assistant/
├── app.py                   # Gradio entry; runs locally AND on HF Spaces
├── src/
│   ├── config.py            # pydantic-settings env loader
│   ├── assistants/          # BaseAssistant + Claude + Qwen
│   ├── memory.py            # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py             # calculator + Tavily search
│   ├── guardrails.py        # input blocklist + Haiku output moderation
│   └── observability.py     # Langfuse @observe decorator
├── eval/                    # datasets / runner / judge / report
├── docs/                    # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                   # smoke tests
```