	---
	title: OSS vs Frontier Assistant
	emoji: 🤖
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	sdk_version: 6.14.0
	python_version: "3.11"
	app_file: app.py
	hardware: cpu-basic
	pinned: false
	---

	# OSS vs. Frontier Assistant

	Side-by-side evaluation of an open-source assistant (`Qwen2.5-1.5B-Instruct`)
	and a frontier assistant (Claude Sonnet 4.5). Both share an identical Gradio
	UI and the same capabilities: multi-turn chat with persistent short-term
	memory, calculator and web-search tools, and a two-layer input/output
	guardrail. Both are evaluated on hallucination, demographic bias, and
	safety / jailbreak resistance.

	🌐 Live demo: [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)
	📊 Evaluation report: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)
	🛠️ Architecture notes: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) · Deploy guide: [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)

	---

	## Setup (local)

	```bash
	git clone <this-repo>
	cd oss-vs-frontier-assistant
	cp .env.example .env # then fill in your API keys
	uv sync # installs all pinned deps
	uv run python app.py # http://127.0.0.1:7860
	```

	Required keys in `.env`:
	- `ANTHROPIC_API_KEY` — frontier assistant, Haiku moderation, eval judge
	- `HF_TOKEN` — Qwen download, deployment
	- `TAVILY_API_KEY` — web-search tool
	- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` — observability (optional; app no-ops gracefully without them)
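
A quick way to check the keys listed above before launching is sketched below. This is a stdlib-only illustration, not the repo's `src/config.py` (which uses pydantic-settings); `missing_keys` and `observability_enabled` are hypothetical helper names.

```python
import os

# Key names are taken from the list above; the helpers are illustrative only.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_keys(env=None):
    """Return the required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def observability_enabled(env=None):
    """Tracing needs both Langfuse keys; otherwise the app should no-op."""
    env = os.environ if env is None else env
    return all(env.get(key, "").strip() for key in OPTIONAL_KEYS)
```

Calling `missing_keys()` after loading `.env` into the environment returns an empty list when the app is ready to start.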

	## Running the eval

	```bash
	uv run python -m eval.run_eval --limit 2 # smoke run (~2 min, ~$0.05)
	uv run python -m eval.run_eval # full run, 90 prompts × 2 assistants
	uv run python -m eval.judge # Sonnet 4.5 scores every response
	uv run python -m eval.report # charts + EVALUATION_REPORT.md
	```

	Both `run_eval` and `judge` are resumable — interrupt with Ctrl+C and re-run to continue.
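
One common way to get this kind of resumability is an append-only JSONL file keyed by prompt id. The sketch below assumes that approach; the README does not say how `eval.run_eval` actually checkpoints, and `load_done_ids`, `run_resumable`, and the row fields are hypothetical.

```python
import json
from pathlib import Path


def load_done_ids(path):
    """Collect ids already written; skip a half-written final line."""
    done = set()
    path = Path(path)
    if not path.exists():
        return done
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            done.add(json.loads(line)["id"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # a line truncated by Ctrl+C is redone on the next run
    return done


def run_resumable(prompts, answer_fn, out_path):
    """Answer only the prompts not yet in out_path; return how many were added."""
    out_path = Path(out_path)
    done = load_done_ids(out_path)
    written = 0
    with out_path.open("a", encoding="utf-8") as handle:
        for prompt in prompts:
            if prompt["id"] in done:
                continue
            row = {"id": prompt["id"], "response": answer_fn(prompt["text"])}
            handle.write(json.dumps(row) + "\n")
            handle.flush()  # persist each row so an interrupt loses at most one
            written += 1
    return written
```

Because each row is flushed as soon as it is produced, re-running the same command after an interrupt only pays for the prompts that are still missing.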

	---

	## Architecture

	```
	┌────────────┐   ┌──────────────┐   ┌──────────────────────┐   ┌──────────────┐
	│ Gradio UI  │ → │ input filter │ → │ assistant (Claude    │ → │ output       │
	│ (app.py)   │   │ (regex       │   │ or Qwen) + tools     │   │ moderation   │
	│            │   │ blocklist)   │   │ + LangChain memory   │   │ (Haiku 4.5)  │
	└────────────┘   └──────────────┘   └──────────┬───────────┘   └──────┬───────┘
	                                               │                      │
	                                        ┌──────┴──────┐        ┌──────┴───────┐
	                                        │ SQLite      │        │ trace via    │
	                                        │ sessions    │        │ Langfuse     │
	                                        └─────────────┘        └──────────────┘
	```

	Key decisions (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):
	- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
	- Qwen tool-calling is wired through Qwen's native `<tool_call>` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
	- Guardrails live in the UI layer, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
	- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus dataset-aware guidance to make verdicts meaningful per benchmark.
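
To illustrate the native tool-call wiring in the second bullet, here is a minimal sketch of parsing `<tool_call>` blocks out of raw model text. It assumes each tag pair wraps one JSON object with `name` and `arguments` keys, which is the shape Qwen2.5's chat template uses; `parse_tool_calls` is a hypothetical helper, not the function in this repo.

```python
import json
import re

# Non-greedy match so several tool calls in one reply are found separately.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in the model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ignore malformed JSON rather than crash the chat loop
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls
```

A tool-calling loop would run each parsed call, append the result to the conversation, and generate again until the model replies without a `<tool_call>` block.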

	---

	## Headline eval results (n=30 per dataset, 95% bootstrap CIs)

	| Metric                          | Claude (frontier)      | Qwen-1.5B (OSS)        |
	|---------------------------------|------------------------|------------------------|
	| Hallucination rate (TruthfulQA) | 10.0% [0.0, 23.3]      | 63.3% [46.7, 80.0]     |
	| Bias rate (BBQ, overall)        | 3.3% [0.0, 10.0]       | 36.7% [20.0, 53.3]     |
	| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0]  | 100.0% [100.0, 100.0]  |
	| Refusal rate (overall)          | 33.3% [23.3, 42.2]     | 35.6% [25.6, 45.6]     |

	Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).
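
For readers unfamiliar with bootstrap intervals, the sketch below shows a percentile bootstrap over binary judge verdicts. The README does not state which bootstrap variant, resample count, or seed `eval.report` uses, so `bootstrap_ci` and its defaults are assumptions.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes.

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the rate of each resample.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Example: 3 hallucinated answers out of 30, as in the Claude TruthfulQA row.
point, lower, upper = bootstrap_ci([1] * 3 + [0] * 27)
```

When every outcome is identical, each resample is identical too, so the interval collapses to a single point. That is why the jailbreak row reads 100.0% [100.0, 100.0]; with n=30 it should not be read as certainty.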

	## Latency (per turn, measured on the eval run)

	| Assistant         | Hardware               | median       | mean         | p95          | max          |
	|-------------------|------------------------|--------------|--------------|--------------|--------------|
	| Claude Sonnet 4.5 | API call               | 4.4 s        | 5.1 s        | 10.1 s       | 14.6 s       |
	| Qwen-1.5B         | local CPU (dev laptop) | 11.9 s       | 16.8 s       | 37.1 s       | 102.2 s      |
	| Qwen-1.5B         | HF Space `cpu-basic`   | not measured | not measured | not measured | not measured |

	Qwen on the HF Space `cpu-basic` tier was not part of the eval run; expect roughly 30-60 s or more per reply on shared CPU.
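
The summary columns in the latency table can be reproduced from raw per-turn timings along these lines. This is a sketch: `summarize` is a hypothetical helper, and the nearest-rank p95 is an assumption, since the README does not say which percentile method the eval uses.

```python
import statistics


def summarize(latencies_s):
    """Return median, mean, p95 (nearest rank), and max for latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

With only a few dozen samples per assistant, p95 and max are each driven by one or two slow turns, so they are noisier than the median.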

	The deployed Space uses the free `cpu-basic` hardware (no HF PRO subscription). The
	`@spaces.GPU` decorator is already in place on Qwen's generation, so switching to
	ZeroGPU later is a one-line YAML change (`hardware: zero-a10g`) once a PRO
	subscription is active — expected to bring Qwen latency to ~3-8 s.

	## Cost (rough, per chat turn)

	| Component                                            | Per-turn cost (approx.)       |
	|------------------------------------------------------|-------------------------------|
	| Claude Sonnet 4.5 assistant (~500 in / 200 out tok)  | ~$0.0045                      |
	| Haiku 4.5 output moderation (~150 in / 50 out tok)   | ~$0.0003                      |
	| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`)  | free (within HF Space quotas) |
	| Tavily web search (when tool fires)                  | free tier ≤1k/mo              |

	A 100-turn Claude conversation runs ~$0.50; the same 100 turns on Qwen via Hugging Face Spaces are free (modulo HF quota).
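
The arithmetic behind these figures can be sketched as follows. The per-million-token prices are assumptions (3 USD input and 15 USD output for Sonnet, which reproduces the ~$0.0045 row); check current pricing before relying on them. The moderation cost is taken directly from the table rather than recomputed.

```python
# Assumed list prices in USD per million tokens; not stated in this README.
SONNET_INPUT_PER_MTOK = 3.00
SONNET_OUTPUT_PER_MTOK = 15.00

# Taken from the cost table above.
MODERATION_PER_TURN = 0.0003


def turn_cost(tokens_in, tokens_out, price_in, price_out):
    """Cost in USD of one call, given per-million-token prices."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def conversation_cost(turns, tokens_in=500, tokens_out=200):
    """Cost in USD of a Claude conversation with output moderation on every turn."""
    assistant = turn_cost(
        tokens_in, tokens_out, SONNET_INPUT_PER_MTOK, SONNET_OUTPUT_PER_MTOK
    )
    return turns * (assistant + MODERATION_PER_TURN)
```

Under these assumptions `conversation_cost(100)` comes to about $0.48, in line with the ~$0.50 quoted above. The sketch holds token counts constant per turn; with a memory window the input side grows as history accumulates, so real conversations may cost more.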

	---

	## Tradeoffs

	- Frontier is meaningfully more reliable on hallucination (10.0% vs. 63.3%) and bias (3.3% vs. 36.7%) in the raw eval, but the Claude assistant costs about $0.0045 per turn, whereas OSS hosting on HF Spaces is free within quota.
	- The OSS model needs the guardrails enabled to be safe to expose — the input/output filters were designed to close the residual gap.
	- A 1.5B model is the floor, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
	- The judge is a Claude model, and LLM judges have a documented self-preference bias, so the OSS-vs-Claude comparison may be optimistic for Claude by an amount this eval does not measure.

	## What I'd improve with more time

	- A second judge (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
	- Larger sample sizes (n=100+ per dataset) to tighten CIs.
	- A real tool-use eval (e.g. GSM8K with calculator, NaturalQuestions with web search) — the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
	- Try a larger OSS model (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS↔frontier gap-vs-size curve.
	- Add session-scoped Gradio state (per-browser-tab session id) for true multi-user deployment.

	## License

	[MIT](LICENSE) — free for any use with attribution.

	## Project layout

	```
	oss-vs-frontier-assistant/
	├── app.py                  # Gradio entry; runs locally AND on HF Spaces
	├── src/
	│   ├── config.py           # pydantic-settings env loader
	│   ├── assistants/         # BaseAssistant + Claude + Qwen
	│   ├── memory.py           # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
	│   ├── tools.py            # calculator + Tavily search
	│   ├── guardrails.py       # input blocklist + Haiku output moderation
	│   └── observability.py    # Langfuse @observe decorator
	├── eval/                   # datasets / runner / judge / report
	├── docs/                   # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
	└── tests/                  # smoke tests
	```