# Evaluation Report: OSS vs. Frontier Assistant
Comparison of an open-source assistant (Qwen2.5-1.5B-Instruct) against a frontier assistant (Claude Sonnet 4.5) on hallucination, demographic bias, and safety / jailbreak resistance.
## Methodology
- **Datasets** (random seed 42, 30 prompts each):
  - TruthfulQA (generation split): hallucination.
  - BBQ (Elfsong/BBQ): bias; 10 ambiguous-context items each from Age, Gender_identity, Race_ethnicity.
  - AdvBench harmful_behaviors: safety / jailbreak.
- **Assistants** are run *statelessly* (no memory, no guardrails) so the eval measures raw model behavior, not the surrounding safety layers.
- **Judge**: Claude Sonnet 4.5 with a JSON rubric `{hallucinated, biased, refused, harmful, reasoning}` and dataset-specific guidance. Temperature 0.
- **Uncertainty**: 95% bootstrap CIs (1000 resamples) on every reported rate.
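
The uncertainty step can be sketched as a percentile bootstrap over per-item 0/1 judge flags. This is a minimal illustration, not the project's actual code: the function name `bootstrap_ci`, the percentile method, and reusing seed 42 for resampling are all assumptions.

```python
import random

def bootstrap_ci(flags, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of a list of 0/1 flags.

    Assumption: the report does not say which bootstrap variant was used;
    the percentile method is shown here because it is the simplest one.
    """
    rng = random.Random(seed)
    n = len(flags)
    # Resample n items with replacement and record each resampled rate.
    means = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)   # 25 for 1000 resamples at 95%
    hi_idx = n_resamples - 1 - lo_idx       # 974
    return sum(flags) / n, means[lo_idx], means[hi_idx]

# Hypothetical input: 19 of 30 answers flagged as hallucinated (63.3%).
flags = [1] * 19 + [0] * 11
rate, lo, hi = bootstrap_ci(flags)
print(f"{rate:.1%} [{lo:.1%}, {hi:.1%}]")
```

With 30 items, every resampled rate is a multiple of 1/30, which is why the reported bounds fall on values such as 23.3% and 46.7%.
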
## Headline metrics
| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
|---|---|---|
| Hallucination rate (TruthfulQA) | 10.0% [0.0, 23.3] | 63.3% [46.7, 80.0] |
| Bias rate (BBQ, overall) | 3.3% [0.0, 10.0] | 36.7% [20.0, 53.3] |
| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |
## Bias rate by demographic (BBQ)
| Demographic | Claude (frontier) | Qwen-1.5B (OSS) |
|---|---|---|
| Age | 10.0% [0.0, 30.0] | 60.0% [30.0, 90.0] |
| Gender identity | 0.0% [0.0, 0.0] | 20.0% [0.0, 40.0] |
| Race / ethnicity | 0.0% [0.0, 0.0] | 30.0% [0.0, 60.0] |
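
Each cell in this table rests on only 10 items, which is why the intervals are so wide. As a rough cross-check (a normal-approximation sketch, not the bootstrap used in the report; `normal_approx_ci` is a hypothetical helper), the expected width for a 60% rate at n=10 can be computed directly:

```python
import math

def normal_approx_ci(rate, n, z=1.96):
    """Approximate 95% CI for a proportion, clipped to [0, 1]."""
    half_width = z * math.sqrt(rate * (1 - rate) / n)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)

# Qwen on the Age subset: 60% biased out of 10 items.
lo, hi = normal_approx_ci(0.60, 10)
print(f"{lo:.1%} to {hi:.1%}")  # prints "29.6% to 90.4%"
```

This is close to the reported [30.0, 90.0]. Note also that an interval of [0.0, 0.0] arises whenever no item in the sample was flagged, because resampling an all-zero sample can only produce zero; it does not show that the true rate is zero.
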
## Charts



## Key findings
- Hallucination: Claude 10.0% [0.0, 23.3] vs. Qwen 63.3% [46.7, 80.0].
- Jailbreak resistance: Claude 100.0% [100.0, 100.0] vs. Qwen 100.0% [100.0, 100.0], so no difference was measured on this sample.
- Bias (BBQ, overall): Claude 3.3% [0.0, 10.0] vs. Qwen 36.7% [20.0, 53.3]. Differences by demographic are shown in the chart above; refer to the table for exact CIs.
## Recommendations
- For production deployments where safety and factual reliability matter, the frontier model's *raw* behavior is meaningfully stronger; the OSS model should only be used with the input/output guardrails enabled (they catch the residual gap on safety prompts in this project).
- The OSS model is dramatically cheaper at inference time but slower on CPU. A GPU (or hosted endpoint) closes the latency gap.
- For sensitive demographic queries, prefer answers that explicitly acknowledge uncertainty; both models still pick a side on a fraction of ambiguous BBQ items.
## Limitations
- **Small samples** (n=30 per dataset, n=10 per BBQ demographic). The 95% CIs are correspondingly wide, so read differences with care.
- **Judge self-bias**: the judge (Claude Sonnet 4.5) is the same model as the frontier assistant under test. LLM judges have a documented tendency to prefer outputs from their own family, so the Claude vs. Qwen comparison here is likely optimistic for Claude. A second judge (e.g. GPT-4o or human review) on a subset would calibrate this.
- **Categories covered**: BBQ subset is age / gender / race only. Other axes (disability, religion, SES, etc.) are not measured.
- **Tool use is not directly evaluated**; the prompts here are zero-shot questions, not tasks that demand tool calls.
- **The judge sees the dataset label**, which can prime its scoring. A blinded judge would be more robust.
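
One way to blind the judge is to build its prompt from the question and answer alone and to validate the returned rubric strictly. The sketch below is hypothetical: `build_blinded_prompt` and `parse_verdict` are illustrative names, the prompt wording is invented, and treating the four flags as JSON booleans is an assumption (the report only lists the rubric keys).

```python
import json

# Rubric keys as listed in the Methodology section.
RUBRIC_KEYS = ("hallucinated", "biased", "refused", "harmful", "reasoning")

def build_blinded_prompt(question, answer):
    """Judge prompt that names neither the dataset nor the assistant."""
    header = (
        "Score the answer below. Reply with a JSON object with keys "
        + ", ".join(RUBRIC_KEYS)
        + "."
    )
    return f"{header}\n\nQuestion: {question}\nAnswer: {answer}"

def parse_verdict(raw):
    """Parse the judge's reply and reject incomplete or mistyped rubrics."""
    verdict = json.loads(raw)
    missing = [key for key in RUBRIC_KEYS if key not in verdict]
    if missing:
        raise ValueError(f"missing rubric keys: {missing}")
    for key in RUBRIC_KEYS[:-1]:  # every key except "reasoning"
        if not isinstance(verdict[key], bool):  # assumption: boolean flags
            raise ValueError(f"{key} must be true or false")
    return verdict
```

Blinding has a cost: the dataset-specific guidance described in the Methodology could no longer be selected by dataset name, so a single generic rubric would have to cover all three datasets.
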