---
title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: "3.11"
app_file: app.py
hardware: cpu-basic
pinned: false
---

# OSS vs. Frontier Assistant

Side-by-side evaluation of an **open-source assistant** (`Qwen2.5-1.5B-Instruct`)
and a **frontier assistant** (Claude Sonnet 4.5). Both share an identical Gradio
UI and the same capabilities (multi-turn chat with persistent short-term memory,
calculator and web-search tools, and a two-layer input/output guardrail) and are
evaluated on hallucination, demographic bias, and safety / jailbreak resistance.

- **Live demo:** [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)
- **Evaluation report:** [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)
- **Architecture notes:** [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) · **Deploy guide:** [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)

---

## Setup (local)

```bash
git clone <this-repo>
cd oss-vs-frontier-assistant
cp .env.example .env       # then fill in your API keys
uv sync                    # installs all pinned deps
uv run python app.py       # http://127.0.0.1:7860
```

Keys in `.env` (required unless marked optional):
- `ANTHROPIC_API_KEY`: frontier assistant, Haiku moderation, eval judge
- `HF_TOKEN`: Qwen download, deployment
- `TAVILY_API_KEY`: web-search tool
- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY`: observability (optional; the app no-ops gracefully without them)
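
A startup check along the following lines can make a missing key fail fast. This is a minimal standard-library sketch, not the project's `src/config.py` (which the layout describes as a pydantic-settings loader); `missing_keys`, `tracing_enabled`, and the two constants are hypothetical names.

```python
import os

# Keys the README lists as required; the Langfuse pair is optional.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "HF_TOKEN", "TAVILY_API_KEY")
OPTIONAL_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_keys(env=None, keys=REQUIRED_KEYS):
    """Return the keys that are absent or blank in the given mapping."""
    env = os.environ if env is None else env
    return [k for k in keys if not str(env.get(k, "")).strip()]


def tracing_enabled(env=None):
    """Observability is on only when both Langfuse keys are present."""
    return not missing_keys(env, OPTIONAL_KEYS)
```

Calling `missing_keys()` with no arguments inspects the real process environment; passing a dict keeps the check easy to test.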

## Running the eval

```bash
uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md
```

Both `run_eval` and `judge` are **resumable**: interrupt with Ctrl+C and re-run to continue.
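
One common way to get this kind of resumability is to append each finished row to a JSONL file and skip already-recorded ids on restart. The sketch below illustrates that pattern only; it is not taken from the `eval` package, and `completed_ids`, `run_resumable`, `answer_fn`, and the record fields are hypothetical.

```python
import json
from pathlib import Path


def completed_ids(path):
    """Collect the ids of rows already written to the JSONL results file."""
    path = Path(path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A half-written line (e.g. after Ctrl+C) is ignored,
                # so that prompt is simply run again.
                continue
    return done


def run_resumable(prompts, answer_fn, path):
    """Run answer_fn on every prompt not yet recorded; return how many ran."""
    path = Path(path)
    done = completed_ids(path)
    # If a previous run was cut off mid-line, start on a fresh line.
    torn = (
        path.exists()
        and path.stat().st_size > 0
        and not path.read_bytes().endswith(b"\n")
    )
    ran = 0
    with path.open("a", encoding="utf-8") as fh:
        if torn:
            fh.write("\n")
        for prompt in prompts:
            if prompt["id"] in done:
                continue
            record = {"id": prompt["id"], "response": answer_fn(prompt["text"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # hand each row to the OS before starting the next one
            ran += 1
    return ran
```

Because each row is written as soon as it finishes, an interrupted run loses at most the prompt that was in flight.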

---

## Architecture

```
┌────────────┐   ┌────────────────┐   ┌────────────────────────┐   ┌────────────────┐
│ Gradio UI  │ → │ input filter   │ → │ assistant (Claude      │ → │ output         │
│ (app.py)   │   │ (regex         │   │ or Qwen) + tools       │   │ moderation     │
│            │   │ blocklist)     │   │ + LangChain memory     │   │ (Haiku 4.5)    │
└────────────┘   └────────────────┘   └───────────┬────────────┘   └───────┬────────┘
                                                  │                        │
                                           ┌──────┴──────┐         ┌───────┴──────┐
                                           │ SQLite      │         │ trace via    │
                                           │ sessions    │         │ Langfuse     │
                                           └─────────────┘         └──────────────┘
```
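
The left-to-right flow in the diagram can be read as a small function pipeline. The sketch below is illustrative only: the blocklist patterns, the refusal strings, and the injected `assistant` and `moderate` callables are placeholders, not the project's `guardrails.py`. In the real app the output check is a Haiku 4.5 call, which is stubbed here.

```python
import re

# Placeholder patterns; the project's actual blocklist is not shown in this README.
BLOCKLIST = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"ignore (all|previous) instructions",
        r"reveal your system prompt",
    )
]

INPUT_REFUSAL = "Sorry, I can't help with that request."
OUTPUT_REFUSAL = "The response was withheld by output moderation."


def input_allowed(text):
    """Layer 1: cheap regex screen applied before the model is called."""
    return not any(pattern.search(text) for pattern in BLOCKLIST)


def guarded_turn(user_text, assistant, moderate):
    """Run one chat turn: input filter -> assistant -> output moderation.

    `assistant` maps a prompt to a reply; `moderate` returns True when a
    reply is safe to show. Both are injected, so the guardrails stay
    outside the assistant itself.
    """
    if not input_allowed(user_text):
        return INPUT_REFUSAL
    reply = assistant(user_text)
    return reply if moderate(reply) else OUTPUT_REFUSAL
```

Keeping the two checks in a wrapper like this is one way to let an eval call the assistant directly and see unfiltered output.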

**Key decisions** (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):
- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's *native* `<tool_call>` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the **UI layer**, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus *dataset-aware guidance* to make verdicts meaningful per benchmark.
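
To make the native `<tool_call>` decision concrete, here is a minimal sketch of parsing tool requests out of raw model text. It assumes the model wraps a JSON object with `name` and `arguments` fields in `<tool_call>` tags, as Qwen2.5's chat template does; `parse_tool_calls`, `run_tool_calls`, and the `tools` mapping are hypothetical names, not the project's code.

```python
import json
import re

# Matches one JSON object wrapped in <tool_call> ... </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract (name, arguments) pairs from <tool_call> blocks in model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed JSON from the model is skipped, not raised
        if isinstance(payload, dict) and "name" in payload:
            args = payload.get("arguments", {})
            calls.append((payload["name"], args if isinstance(args, dict) else {}))
    return calls


def run_tool_calls(text, tools):
    """Invoke each requested tool that exists in `tools`; return the results."""
    return [
        tools[name](**args)
        for name, args in parse_tool_calls(text)
        if name in tools
    ]
```

In a tool-calling loop, the results would be appended to the conversation and the model called again until it stops emitting `<tool_call>` blocks.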

---

## Headline eval results (n=30 per dataset, 95% bootstrap CIs)

| Metric                            | Claude (frontier)        | Qwen-1.5B (OSS)           |
|-----------------------------------|--------------------------|---------------------------|
| Hallucination rate (TruthfulQA)   | **10.0%** [0.0, 23.3]    | 63.3% [46.7, 80.0]        |
| Bias rate (BBQ, overall)          | **3.3%** [0.0, 10.0]     | 36.7% [20.0, 53.3]        |
| Jailbreak resistance (AdvBench)   | 100.0% [100.0, 100.0]    | 100.0% [100.0, 100.0]     |
| Refusal rate (overall)            | 33.3% [23.3, 42.2]       | 35.6% [25.6, 45.6]        |

Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).
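
For readers unfamiliar with bootstrap intervals, the sketch below shows a percentile bootstrap over per-row 0/1 judge flags. The README does not say which bootstrap variant or resample count the report uses, so the percentile method, `n_resamples`, and the function name are assumptions.

```python
import random


def bootstrap_rate_ci(flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of 0/1 flags, in percent.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(flags)
    # Resample n rows with replacement, many times, and record each rate.
    rates = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return 100 * sum(flags) / n, 100 * lower, 100 * upper
```

When every row carries the same flag, every resample is identical and the interval collapses to a single point, which is why a row such as `100.0% [100.0, 100.0]` reflects the sample at n=30 and not certainty about the true rate.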

## Latency (per turn, measured on the eval run)

| Assistant         | Hardware               | median | mean   | p95    | max     |
|-------------------|------------------------|--------|--------|--------|---------|
| Claude Sonnet 4.5 | API call               | 4.4 s  | 5.1 s  | 10.1 s | 14.6 s  |
| Qwen-1.5B         | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |

Qwen-1.5B on the HF Space's `cpu-basic` hardware is likely to take 30-60 s+ per reply on shared CPU (see the deployment note below).
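
The four summary columns can be computed from a list of per-turn timings as sketched below. The README does not state which percentile definition was used, so the nearest-rank p95 here is an assumption, and `latency_summary` is a hypothetical helper.

```python
import statistics


def latency_summary(seconds):
    """Return median, mean, p95 (nearest-rank) and max for per-turn latencies."""
    ordered = sorted(seconds)
    # Nearest-rank percentile: the value at position ceil(0.95 * n), 1-based.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

A mean well above the median, as in the Qwen row, usually points to a few very slow turns pulling the average up.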

The deployed Space uses the **free `cpu-basic`** hardware (no HF PRO subscription). The
`@spaces.GPU` decorator is already in place on Qwen's generation, so switching to
**ZeroGPU** later is a one-line YAML change (`hardware: zero-a10g`) once a PRO
subscription is active, which is expected to bring Qwen latency down to ~3-8 s.
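
The decorator pattern looks roughly like the sketch below. `spaces.GPU` is the decorator from Hugging Face's `spaces` package; the `ImportError` fallback and the `generate` body are assumptions added so the snippet also runs where `spaces` is not installed, and they are not taken from the project's code.

```python
try:
    import spaces  # available on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:

    def gpu(fn):
        """No-op stand-in so the same code runs without the `spaces` package."""
        return fn


@gpu
def generate(prompt: str) -> str:
    # Placeholder body; the real function would run Qwen's generation here.
    return f"(generated reply to: {prompt})"
```

With the decorator already on the generation function, the code stays the same across hardware tiers and only the Space configuration changes.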

## Cost (rough, per chat turn)

| Component                                             | Per-turn cost (approx)        |
|-------------------------------------------------------|-------------------------------|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok)   | ~$0.0045                      |
| Haiku 4.5 output moderation (~150 in / 50 out tok)    | ~$0.0003                      |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`)   | free (within HF Space quotas) |
| Tavily web search (when tool fires)                   | free tier (≤1k/mo)            |

A 100-turn Claude conversation runs **~$0.50**; the same 100 turns on Qwen via Hugging Face Spaces are **free** (modulo HF quota).
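
The arithmetic behind these figures can be reproduced as follows. The per-million-token prices ($3 input, $15 output) are an assumption chosen because they reproduce the ~$0.0045 row; check current pricing before relying on them. The moderation cost is taken directly from the table, and `turn_cost` is a hypothetical helper.

```python
def turn_cost(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Cost in USD of one call, given token counts and per-million-token prices."""
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


# Assumed prices for illustration: $3 / $15 per million input / output tokens.
claude_turn = turn_cost(500, 200, 3.0, 15.0)

# Output-moderation cost per turn, taken as given from the table.
moderation_turn = 0.0003

# 100 turns of assistant + moderation, which rounds to the ~$0.50 headline.
hundred_turns = 100 * (claude_turn + moderation_turn)
```

The 6-turn memory window keeps the input token count roughly bounded, which is why a flat per-turn estimate is a reasonable approximation for a long conversation.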

---

## Tradeoffs

- **Frontier is meaningfully more reliable** on hallucination and bias in the raw eval, but it costs about $0.0045 per turn versus free OSS hosting.
- **The OSS model needs the guardrails enabled** to be safe to expose: the input/output filters were designed to close the residual gap.
- **A 1.5B model is the floor**, chosen here to fit ZeroGPU comfortably. A 7B to 14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- **The judge is a Claude model**, which has documented self-preference bias, so the OSS-vs-Claude comparison is likely slightly optimistic for Claude.

## What I'd improve with more time

- A **second judge** (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A **real tool-use eval** (e.g. GSM8K with calculator, NaturalQuestions with web search): the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a **larger OSS model** (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS-frontier gap-vs-size curve.
- Add **session-scoped Gradio state** (per-browser-tab session id) for true multi-user deployment.

## License

[MIT](LICENSE): free for any use with attribution.

## Project layout

```
oss-vs-frontier-assistant/
├── app.py                  # Gradio entry; runs locally AND on HF Spaces
├── src/
│   ├── config.py           # pydantic-settings env loader
│   ├── assistants/         # BaseAssistant + Claude + Qwen
│   ├── memory.py           # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py            # calculator + Tavily search
│   ├── guardrails.py       # input blocklist + Haiku output moderation
│   └── observability.py    # Langfuse @observe decorator
├── eval/                   # datasets / runner / judge / report
├── docs/                   # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                  # smoke tests
```