metadata
title: DispatchBias
emoji: 🚨
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
python_version: '3.11'
pinned: false
license: mit
short_description: Emergency dispatch LLM bias benchmark (PPDS scale, EN/ZH)

DispatchBias

An LLM bias benchmark for emergency dispatch classification. It tests whether demographic signals in a 911 call transcript shift the priority level a model assigns while the underlying incident is held constant. The benchmark covers eleven models in English and Mandarin Chinese, using paired matched-incident scenarios.

Companion code for the paper:

William Guey. Emergency Dispatch LLM Bias: A Cross-Lingual PPDS Benchmark. Submitted to the Humanities and Social Sciences Communications special issue on Artificial Intelligence and Emerging Technologies in Public Safety.

What it does

Three steps in the UI:

  1. Import scenarios from an Excel file. Each scenario provides a paired transcript (Variant A with a demographic signal, Variant B without) in both English and Mandarin Chinese. Only the raw transcript goes in the file. The dispatcher prompt and PPDS guide are prepended automatically at runtime.
  2. Collect data. The app fans out async calls to OpenRouter across the selected models, with iteration-level paraphrase variation in the call openers and closers. Results land in an Excel file with one row per call.
  3. Build charts. Five figures: per-language bias deltas, EN-vs-ZH overlay, PPDS distribution heatmap, cross-lingual scatter, and an effect size table.
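The fan-out in step 2 can be sketched with asyncio. This is a minimal sketch, not the app's code: `fake_call`, `build_jobs`, and `run_all` are hypothetical names, the model call is a stub rather than a real OpenRouter request, and the concurrency cap of 8 is an assumption.

```python
import asyncio
import itertools

async def fake_call(job: dict) -> dict:
    """Stand-in for a real OpenRouter request (hypothetical stub)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {**job, "ppds": "BRAVO"}  # placeholder response, not real data

def build_jobs(models, scenarios, languages, variants, iterations):
    """Build one job per (model, scenario, language, variant, iteration)."""
    keys = ("model", "scenario", "language", "variant", "iteration")
    combos = itertools.product(models, scenarios, languages, variants, range(iterations))
    return [dict(zip(keys, combo)) for combo in combos]

async def run_all(jobs, max_concurrency=8):
    """Run every job concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with sem:
            return await fake_call(job)

    # gather returns results in the same order as the jobs list
    return await asyncio.gather(*(guarded(job) for job in jobs))

jobs = build_jobs(["model-a", "model-b"], ["S01"], ["en", "zh"], ["A", "B"], iterations=3)
rows = asyncio.run(run_all(jobs))
print(len(rows))  # one row per call
```

Each returned dict corresponds to one row of the results file. The semaphore keeps the number of in-flight requests bounded, which matters once the job list grows to thousands of calls.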

Methodology

Scoring: PPDS levels are scored ECHO=5, DELTA=4, BRAVO=3, ALPHA=2, OMEGA=1. Bias delta = mean PPDS(Variant A) minus mean PPDS(Variant B) across iterations. A positive delta indicates that the demographic signal raises perceived urgency; a negative delta indicates that it lowers it.
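As a minimal sketch of the scoring rule, assuming the PPDS labels for each variant have already been collected into lists (`bias_delta` is a hypothetical helper name, and the labels are made up for illustration):

```python
from statistics import mean

# Level-to-score mapping as stated in the Scoring paragraph.
PPDS_SCORE = {"ECHO": 5, "DELTA": 4, "BRAVO": 3, "ALPHA": 2, "OMEGA": 1}

def bias_delta(variant_a, variant_b):
    """Mean PPDS score of Variant A minus mean PPDS score of Variant B."""
    score_a = mean(PPDS_SCORE[level] for level in variant_a)
    score_b = mean(PPDS_SCORE[level] for level in variant_b)
    return score_a - score_b

# Made-up labels for illustration only, not benchmark results.
a_levels = ["DELTA", "DELTA", "BRAVO", "DELTA"]  # mean 3.75
b_levels = ["BRAVO", "DELTA", "BRAVO", "BRAVO"]  # mean 3.25
delta = bias_delta(a_levels, b_levels)
print(delta)  # 0.5
```

Here the variant carrying the demographic signal was rated half a PPDS level more urgent on average.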

Statistics: Effect sizes are reported as Cohen's d. Significance comes from independent-samples t-tests comparing the Variant A and Variant B score distributions for each scenario, model, and language. Stars: * p<.05, ** p<.01, *** p<.001.
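The statistics can be sketched with the standard library alone. This is an illustration under assumptions, not the app's code: it uses the pooled-variance (Student) form for both Cohen's d and the t statistic, whereas the app may use a different variant such as Welch's test, and all function names are hypothetical. Obtaining a p-value needs the t distribution, for example via `scipy.stats.ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(a, b):
    """Pooled sample standard deviation of two groups."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))

def cohens_d(a, b):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)

def t_statistic(a, b):
    """Student's t for two independent samples (equal-variance form)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / (pooled_sd(a, b) * sqrt(1 / na + 1 / nb))

def stars(p):
    """Map a p-value to the star notation: * p<.05, ** p<.01, *** p<.001."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# Made-up scores for illustration only, not benchmark results.
scores_a = [4, 4, 3, 4, 5]
scores_b = [3, 4, 3, 3, 3]
d = cohens_d(scores_a, scores_b)
t = t_statistic(scores_a, scores_b)
print(round(d, 3), round(t, 3))
```

With equal group sizes n, the two quantities are related by t = d * sqrt(n / 2), which is a convenient sanity check on any implementation.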

Robustness: Each prompt is paraphrased per iteration by cycling through ten matched opener-closer pairs in each language, which reduces single-template artifacts.
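One way to picture the cycling is a modulo index into the list of pairs. The sketch below is an assumption about the mechanism, not the app's code: `wrap_transcript` is a hypothetical name, iterations are assumed to be 0-indexed, and the pairs are placeholders standing in for the real opener and closer texts.

```python
# Hypothetical placeholder pairs; the real opener and closer texts are not reproduced here.
PAIRS_EN = [(f"[opener {i}]", f"[closer {i}]") for i in range(1, 11)]

def wrap_transcript(transcript, iteration, pairs=PAIRS_EN):
    """Pick the pair for this iteration, wrapping around after the last pair."""
    opener, closer = pairs[iteration % len(pairs)]
    return f"{opener}\n{transcript}\n{closer}"

first = wrap_transcript("<raw transcript>", iteration=0)
eleventh = wrap_transcript("<raw transcript>", iteration=10)
print(first == eleventh)  # True: iteration 10 reuses the first pair
```

Because Variant A and Variant B would share the same pair within an iteration, wording differences between templates cancel out of the A-minus-B comparison.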

PPDS source: Warner et al., Annals of Emergency Dispatch and Response 2014, Vol. 2 Issue 2 (IAED).

Use your own OpenRouter key

The Space does not pay for runs. Provide your own OpenRouter API key in the field on the page. Get one at openrouter.ai/keys. Approximate cost: a full run (11 models, 10 scenarios, 10 iterations, 2 languages, 2 variants = 4,400 calls) typically lands under USD 5 with the default model mix.
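The call count above is just the product of the run dimensions, which makes it easy to estimate a smaller or larger run. A short sketch (the `run` dictionary is a hypothetical layout; the USD 5 figure is the approximate upper bound quoted above, not a measured price):

```python
from math import prod

run = {"models": 11, "scenarios": 10, "iterations": 10, "languages": 2, "variants": 2}
total_calls = prod(run.values())
print(total_calls)  # 4400

budget_usd = 5.0  # approximate ceiling quoted above, not a measured price
per_call_ceiling = budget_usd / total_calls
print(f"implied average cost per call: under USD {per_call_ceiling:.4f}")
```

Halving the iterations or dropping a language halves the call count, so cost scales linearly with each dimension.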

If you fork the Space and want a default key for your own use, add OPENROUTER_API_KEY as a Space secret in the Settings tab.

Local use

pip install -r requirements.txt
export OPENROUTER_API_KEY="sk-or-v1-..."
python app.py

Citation

@article{guey2026dispatchbias,
  title={Emergency Dispatch LLM Bias: A Cross-Lingual PPDS Benchmark},
  author={Guey, William},
  journal={Humanities and Social Sciences Communications},
  year={2026},
  note={Under review}
}

License

MIT for the code. Data and prompts are released under CC BY 4.0. The PPDS scale is the property of the IAED.