---
title: OSS vs Frontier Assistant
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.14.0
python_version: "3.11"
app_file: app.py
hardware: cpu-basic
pinned: false
---

# OSS vs. Frontier Assistant

Side-by-side evaluation of an **open-source assistant** (`Qwen2.5-1.5B-Instruct`) and a **frontier assistant** (Claude Sonnet 4.5). Both share an identical Gradio UI and capabilities (multi-turn chat with persistent short-term memory, a calculator + web-search tool, and a two-layer input/output guardrail) and are evaluated on hallucination, demographic bias, and safety / jailbreak resistance.

🌐 **Live demo:** [huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant](https://huggingface.co/spaces/KevinMerchant13/oss-vs-frontier-assistant)

📊 **Evaluation report:** [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md)

🛠️ **Architecture notes:** [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) · **Deploy guide:** [`docs/DEPLOY_GUIDE.md`](docs/DEPLOY_GUIDE.md)

---

## Setup (local)

```bash
git clone
cd oss-vs-frontier-assistant
cp .env.example .env    # then fill in your API keys
uv sync                 # installs all pinned deps
uv run python app.py    # http://127.0.0.1:7860
```

Required keys in `.env`:

- `ANTHROPIC_API_KEY`: frontier assistant, Haiku moderation, eval judge
- `HF_TOKEN`: Qwen download, deployment
- `TAVILY_API_KEY`: web-search tool
- `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY`: observability (optional; app no-ops gracefully without them)

## Running the eval

```bash
uv run python -m eval.run_eval --limit 2   # smoke run (~2 min, ~$0.05)
uv run python -m eval.run_eval             # full run, 90 prompts × 2 assistants
uv run python -m eval.judge                # Sonnet 4.5 scores every response
uv run python -m eval.report               # charts + EVALUATION_REPORT.md
```

Both `run_eval` and `judge` are **resumable**: interrupt with Ctrl+C and re-run to continue.
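
Resumability of this kind is often implemented by appending each finished row to a JSONL file and skipping ids that are already present on restart. The following is a minimal sketch of that pattern, not the code in `eval/run_eval.py`; the function names, the `id` / `prompt` / `response` fields, and the output file are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def completed_ids(out_path: Path) -> set[str]:
    """Return the ids of rows already written to the JSONL output file."""
    done: set[str] = set()
    if not out_path.exists():
        return done
    with out_path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # Blank or half-written lines (e.g. from an interrupted run)
                # are skipped, so that prompt is simply processed again.
                continue
    return done


def _ends_with_newline(path: Path) -> bool:
    """True if the file is missing, empty, or ends with a newline."""
    if not path.exists() or path.stat().st_size == 0:
        return True
    with path.open("rb") as fh:
        fh.seek(-1, 2)  # last byte of the file
        return fh.read(1) == b"\n"


def run_resumable(
    prompts: Iterable[dict],
    answer: Callable[[str], str],
    out_path: Path,
) -> int:
    """Answer every prompt not yet in out_path; return how many were written."""
    done = completed_ids(out_path)
    needs_newline = not _ends_with_newline(out_path)
    written = 0
    with out_path.open("a", encoding="utf-8") as fh:
        if needs_newline:
            fh.write("\n")  # terminate a truncated final line before appending
        for row in prompts:
            if row["id"] in done:
                continue
            record = {"id": row["id"], "response": answer(row["prompt"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # keep finished rows on disk if the run is interrupted
            written += 1
    return written
```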

---

## Architecture

```
┌────────────┐   ┌──────────────┐   ┌──────────────────────┐   ┌──────────────┐
│ Gradio UI  │ → │ input filter │ → │ assistant (Claude    │ → │ output       │
│ (app.py)   │   │ (regex       │   │ or Qwen) + tools     │   │ moderation   │
│            │   │ blocklist)   │   │ + LangChain memory   │   │ (Haiku 4.5)  │
└────────────┘   └──────────────┘   └──────────┬───────────┘   └──────┬───────┘
                                               │                      │
                                        ┌──────┴──────┐        ┌──────┴───────┐
                                        │ SQLite      │        │ trace via    │
                                        │ sessions    │        │ Langfuse     │
                                        └─────────────┘        └──────────────┘
```

**Key decisions** (full rationale in [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)):

- Both assistants share a single `BaseAssistant` with the tool-calling loop and a 6-turn memory window.
- Qwen tool-calling is wired through Qwen's *native* `` template (LangChain's `ChatHuggingFace` doesn't render tool schemas), giving the OSS model genuine tool use.
- Guardrails live in the **UI layer**, not in `BaseAssistant`, so the eval sees raw model output for honest safety measurement.
- The eval judge gets the same JSON schema across all rows (`{hallucinated, biased, refused, harmful, reasoning}`) plus *dataset-aware guidance* to make verdicts meaningful per benchmark.
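
The guardrail placement described above can be pictured as a thin wrapper that only the UI path goes through, while the eval calls the model directly. This is a rough sketch under stated assumptions: `guarded_reply`, `input_allowed`, the blocklist pattern, and the refusal text are hypothetical and are not taken from `src/guardrails.py`; the moderation step is passed in as a plain callable so the example runs without any API key.

```python
import re
from typing import Callable

# Hypothetical pattern, for illustration only; the project's real blocklist
# is not shown in this README.
BLOCKLIST = [
    re.compile(r"\bignore (all )?previous instructions\b", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that."


def input_allowed(text: str) -> bool:
    """Layer 1: cheap regex check that runs before any model call."""
    return not any(pattern.search(text) for pattern in BLOCKLIST)


def guarded_reply(
    user_text: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """UI-layer path: input filter -> assistant -> output moderation."""
    if not input_allowed(user_text):
        return REFUSAL  # the model is never called for blocked input
    raw = generate(user_text)
    # Layer 2: `is_safe` stands in for the output-moderation call.
    return raw if is_safe(raw) else REFUSAL


def echo_model(text: str) -> str:
    """Stand-in for an assistant's generate method."""
    return f"echo: {text}"


# The UI would call guarded_reply(...); an eval runner would call the model
# directly and therefore see the unfiltered output.
ui_output = guarded_reply("hello", echo_model, is_safe=lambda reply: True)
eval_output = echo_model("hello")
```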

---

## Headline eval results (n=30 per dataset, 95% bootstrap CIs)

| Metric                            | Claude (frontier)        | Qwen-1.5B (OSS)           |
|-----------------------------------|--------------------------|---------------------------|
| Hallucination rate (TruthfulQA)   | **10.0%** [0.0, 23.3]    | 63.3% [46.7, 80.0]        |
| Bias rate (BBQ, overall)          | **3.3%** [0.0, 10.0]     | 36.7% [20.0, 53.3]        |
| Jailbreak resistance (AdvBench)   | 100.0% [100.0, 100.0]    | 100.0% [100.0, 100.0]     |
| Refusal rate (overall)            | 33.3% [23.3, 42.2]       | 35.6% [25.6, 45.6]        |

Full breakdown by demographic and the judge-self-bias limitation disclosure: [`docs/EVALUATION_REPORT.md`](docs/EVALUATION_REPORT.md).

## Latency (per turn, measured on the eval run)

| Assistant         | Hardware               | median | mean   | p95    | max     |
|-------------------|------------------------|--------|--------|--------|---------|
| Claude Sonnet 4.5 | API call               | 4.4 s  | 5.1 s  | 10.1 s | 14.6 s  |
| Qwen-1.5B         | local CPU (dev laptop) | 11.9 s | 16.8 s | 37.1 s | 102.2 s |

Qwen-1.5B on the HF Space (`cpu-basic`): likely 30-60 s+ per reply on shared CPU (see "Deployment").

The deployed Space uses the **free `cpu-basic`** hardware (no HF PRO subscription). The `@spaces.GPU` decorator is already in place on Qwen's generation, so switching to **ZeroGPU** later is a one-line YAML change (`hardware: zero-a10g`) once a PRO subscription is active, which is expected to bring Qwen latency to ~3-8 s.

## Cost (rough, per chat turn)

| Component                                            | Per-turn cost (approx)        |
|------------------------------------------------------|-------------------------------|
| Claude Sonnet 4.5 assistant (~500 in / 200 out tok)  | ~$0.0045                      |
| Haiku 4.5 output moderation (~150 in / 50 out tok)   | ~$0.0003                      |
| Qwen-1.5B on HF Spaces (`cpu-basic` or `zero-a10g`)  | free (within HF Space quotas) |
| Tavily web search (when tool fires)                  | free tier ≤1k/mo              |

A 100-turn Claude conversation runs **~$0.50**; the same 100 turns on Qwen via Hugging Face Spaces are **free** (modulo HF quota).
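
The per-turn figures above follow from simple token arithmetic. The sketch below reproduces them under two assumptions that are not stated in this README: Sonnet list prices of $3 per million input tokens and $15 per million output tokens (check current pricing before relying on them), and the moderation cost taken from the table as given rather than recomputed.

```python
# Assumed list prices in USD per million tokens (not stated in this README).
SONNET_INPUT_USD_PER_MTOK = 3.00
SONNET_OUTPUT_USD_PER_MTOK = 15.00

# Taken directly from the cost table above.
MODERATION_USD_PER_TURN = 0.0003


def turn_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """USD cost of one model call given per-million-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# ~500 input / 200 output tokens per assistant turn, as in the table.
assistant_cost = turn_cost(500, 200,
                           SONNET_INPUT_USD_PER_MTOK,
                           SONNET_OUTPUT_USD_PER_MTOK)

# 100 turns, each paying for the assistant call plus output moderation.
conversation_cost = 100 * (assistant_cost + MODERATION_USD_PER_TURN)

print(f"per turn:  ${assistant_cost:.4f}")     # per turn:  $0.0045
print(f"100 turns: ${conversation_cost:.2f}")  # 100 turns: $0.48
```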

---

## Tradeoffs

- **Frontier is meaningfully more reliable** on hallucination and bias in the raw eval, but ~6× the per-turn cost vs. free OSS hosting.
- **The OSS model needs the guardrails enabled** to be safe to expose; the input/output filters were designed to close the residual gap.
- **A 1.5B model is the floor**, chosen here to fit ZeroGPU comfortably. A 7B–14B Qwen (or Llama-3.1) would likely close most of the hallucination/bias gap but needs more GPU time.
- **The judge is a Claude model**, which has documented self-preference bias; the OSS-vs-Claude comparison is therefore slightly optimistic for Claude.

## What I'd improve with more time

- A **second judge** (GPT-4o or a human-annotated subset of ~30 rows) to calibrate the self-bias.
- Larger sample sizes (n=100+ per dataset) to tighten CIs.
- A **real tool-use eval** (e.g. GSM8K with calculator, NaturalQuestions with web search): the current eval is zero-shot Q&A and doesn't directly measure how much the tools help.
- Try a **larger OSS model** (Qwen2.5-7B or Llama-3.1-8B) on ZeroGPU's `xlarge` tier to quantify the OSS↔frontier gap-vs-size curve.
- Add **session-scoped Gradio state** (per-browser-tab session id) for true multi-user deployment.

## License

[MIT](LICENSE): free for any use with attribution.

## Project layout

```
oss-vs-frontier-assistant/
├── app.py                  # Gradio entry: runs locally AND on HF Spaces
├── src/
│   ├── config.py           # pydantic-settings env loader
│   ├── assistants/         # BaseAssistant + Claude + Qwen
│   ├── memory.py           # LangChain SQLChatMessageHistory + RunnableWithMessageHistory
│   ├── tools.py            # calculator + Tavily search
│   ├── guardrails.py       # input blocklist + Haiku output moderation
│   └── observability.py    # Langfuse @observe decorator
├── eval/                   # datasets / runner / judge / report
├── docs/                   # ARCHITECTURE, DEPLOY_GUIDE, EVALUATION_REPORT
└── tests/                  # smoke tests
```