	# Evaluation Report: OSS vs. Frontier Assistant

	Comparison of an open-source assistant (Qwen2.5-1.5B-Instruct) against a frontier assistant (Claude Sonnet 4.5) on hallucination, demographic bias, and safety / jailbreak resistance.

	## Methodology

	- Datasets (random seed 42, 30 prompts each):
	  - TruthfulQA (generation split): hallucination.
	  - BBQ (Elfsong/BBQ): bias; 10 ambiguous-context items each from Age, Gender_identity, Race_ethnicity.
	  - AdvBench harmful_behaviors: safety / jailbreak.
	- Assistants are run statelessly (no memory, no guardrails) so the eval measures raw model behavior, not the surrounding safety layers.
	- Judge: Claude Sonnet 4.5 with a JSON rubric `{hallucinated, biased, refused, harmful, reasoning}` and dataset-specific guidance. Temperature 0.
	- Uncertainty: 95% bootstrap CIs (1000 resamples) on every reported rate.
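
A minimal sketch of how a judge reply could be checked against the rubric named above. The field names come from the rubric; treating the four flags as JSON booleans and `reasoning` as a string is an assumption, and `parse_verdict` is a hypothetical helper, not the project's actual code.

```python
import json

# Flag fields named in the report's rubric; boolean typing is an assumption.
RUBRIC_FLAGS = ("hallucinated", "biased", "refused", "harmful")


def parse_verdict(raw: str) -> dict:
    """Parse one judge reply and validate it against the rubric fields."""
    verdict = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(verdict, dict):
        raise ValueError("judge reply is not a JSON object")
    for key in RUBRIC_FLAGS:
        if not isinstance(verdict.get(key), bool):
            raise ValueError(f"field {key!r} must be a JSON boolean")
    if not isinstance(verdict.get("reasoning"), str):
        raise ValueError("field 'reasoning' must be a string")
    return verdict
```

Rejecting malformed replies up front keeps a truncated or off-schema judge output from being silently counted as a negative verdict.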

	## Headline metrics

	| Metric | Claude (frontier) | Qwen-1.5B (OSS) |
	|---|---|---|
	| Hallucination rate (TruthfulQA) | 10.0% [0.0, 23.3] | 63.3% [46.7, 80.0] |
	| Bias rate (BBQ, overall) | 3.3% [0.0, 10.0] | 36.7% [20.0, 53.3] |
	| Jailbreak resistance (AdvBench) | 100.0% [100.0, 100.0] | 100.0% [100.0, 100.0] |
	| Refusal rate (overall) | 33.3% [23.3, 42.2] | 35.6% [25.6, 45.6] |
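
The bracketed intervals are described in Methodology as 95% bootstrap CIs with 1000 resamples. A minimal sketch of a percentile bootstrap over per-prompt 0/1 outcomes follows. The percentile method, the resampling seed, and the name `bootstrap_ci` are assumptions; the report does not say which bootstrap variant was used.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)  # hypothetical seed for the resampling step
    n = len(outcomes)
    # Resample n outcomes with replacement and record each resample's rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[round((alpha / 2) * (n_resamples - 1))]
    hi = means[round((1 - alpha / 2) * (n_resamples - 1))]
    return sum(outcomes) / n, lo, hi


# 19 flagged answers out of 30 gives the 63.3% point estimate.
rate, lower, upper = bootstrap_ci([1] * 19 + [0] * 11)
```

When every outcome is identical, as in the jailbreak row, every resample has the same mean, so this method returns a zero-width interval such as [100.0, 100.0]. That reflects the absence of observed failures in 30 prompts, not certainty about the true rate.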

	## Bias rate by demographic (BBQ)

	| Demographic | Claude (frontier) | Qwen-1.5B (OSS) |
	|---|---|---|
	| Age | 10.0% [0.0, 30.0] | 60.0% [30.0, 90.0] |
	| Gender identity | 0.0% [0.0, 0.0] | 20.0% [0.0, 40.0] |
	| Race / ethnicity | 0.0% [0.0, 0.0] | 30.0% [0.0, 60.0] |
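
As a rough illustration of how the per-demographic rates could be tallied from judge verdicts, the sketch below groups records by category. The record keys `category` and `biased` and the function name `bias_rate_by_group` are hypothetical, not taken from the project.

```python
from collections import defaultdict


def bias_rate_by_group(records):
    """Return {category: fraction of items the judge marked as biased}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [biased, total]
    for rec in records:
        tally = counts[rec["category"]]
        tally[0] += int(rec["biased"])
        tally[1] += 1
    return {cat: biased / total for cat, (biased, total) in counts.items()}


# Illustrative input only: 6 of 10 Age items flagged gives a 60% rate.
example = [{"category": "Age", "biased": i < 6} for i in range(10)]
rates = bias_rate_by_group(example)
```

With 10 items per category, each rate moves in steps of 10 percentage points, which is why every value in the table is a multiple of 10.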

	## Charts

	![Hallucination rate](../results/charts/hallucination_rate.png)

	![Bias by demographic](../results/charts/bias_by_demographic.png)

	![Jailbreak resistance](../results/charts/jailbreak_resistance.png)

	## Key findings

	- Hallucination: Claude 10.0% [0.0, 23.3] vs. Qwen 63.3% [46.7, 80.0].
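	- Jailbreak resistance: Claude 100.0% [100.0, 100.0] vs. Qwen 100.0% [100.0, 100.0]. No jailbreak was observed for either model on the 30 AdvBench prompts; the zero-width intervals reflect the absence of observed failures, not certainty.
	- Bias (BBQ, overall): Claude 3.3% [0.0, 10.0] vs. Qwen 36.7% [20.0, 53.3]. The largest per-demographic gap is Age (10.0% vs. 60.0%), but with 10 items per category the per-group intervals touch or overlap in every category; see the demographic table for exact CIs.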

	## Recommendations

	- For production deployments where factual reliability and bias matter, the frontier model's raw behavior is meaningfully stronger (hallucination 10.0% vs. 63.3%, bias 3.3% vs. 36.7%). Both models resisted all 30 AdvBench prompts, so this eval shows no raw safety gap; the OSS model should still be used only with the project's input/output guardrails enabled, since 30 prompts cannot rule out such a gap.
	- The OSS model is dramatically cheaper at inference time but slower on CPU; a GPU (or hosted endpoint) should close most of the latency gap. Cost and latency figures are not reported here.
	- For sensitive demographic queries, prefer answers that explicitly acknowledge uncertainty; both models still pick a side on a fraction of ambiguous BBQ items.

	## Limitations

	- Small samples (n=30 per dataset, n=10 per BBQ demographic). The 95% CIs are correspondingly wide, so read differences with care.
	- Judge self-bias: the judge (Claude Sonnet 4.5) is the same model as one of the assistants under test. LLM judges have a documented tendency to prefer outputs from their own family, so the Claude vs. Qwen comparison here may be optimistic for Claude. A second judge (e.g. GPT-4o) or human review on a subset would calibrate this.
	- Categories covered: BBQ subset is age / gender / race only. Other axes (disability, religion, SES, etc.) are not measured.
	- Tool use isn't directly evaluated; the prompts here are zero-shot questions, not tasks that demand tool calls.
	- The judge sees the dataset label, which can prime its scoring. A blinded judge would be more robust.
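
One way to quantify the second-judge calibration suggested above is Cohen's kappa over the two judges' binary verdicts on the same subset. This is a sketch of that standard statistic, not something the report computes; `cohens_kappa` and the example labels are hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each judge's positive rate.
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return float("nan")  # both judges constant: kappa is undefined
    return (observed - expected) / (1 - expected)


# Illustrative labels only: identical verdicts give kappa = 1.0.
kappa = cohens_kappa([1, 0, 1, 0], [1, 0, 1, 0])
```

Kappa corrects raw agreement for chance, which matters here because some rates sit near 0% or 100%, where two judges would agree often even if they labeled at random.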
