Spaces:
Running
Running
| title: DispatchBias | |
| emoji: 🚨 | |
| colorFrom: red | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: 4.44.0 | |
| app_file: app.py | |
| python_version: "3.11" | |
| pinned: false | |
| license: mit | |
| short_description: Emergency dispatch LLM bias benchmark (PPDS scale, EN/ZH) | |
| # DispatchBias | |
| An LLM bias benchmark for emergency dispatch classification. Tests whether demographic signals in a 911 call transcript shift the priority level a model assigns, holding the underlying incident constant. Eleven models, English and Mandarin Chinese, paired matched-incident scenarios. | |
| Companion code for the paper: | |
| > William Guey. *Emergency Dispatch LLM Bias: A Cross-Lingual PPDS Benchmark*. Submitted to the Humanities and Social Sciences Communications special issue on Artificial Intelligence and Emerging Technologies in Public Safety. | |
| ## What it does | |
| Three steps in the UI: | |
| 1. **Import scenarios** from an Excel file. Each scenario provides a paired transcript (Variant A with a demographic signal, Variant B without) in both English and Mandarin Chinese. Only the raw transcript goes in the file. The dispatcher prompt and PPDS guide are prepended automatically at runtime. | |
| 2. **Collect data**. The app fans out async calls to OpenRouter across the selected models, with iteration-level paraphrase variation in the call openers and closers. Results land in an Excel file with one row per call. | |
| 3. **Build charts**. Five figures: per-language bias deltas, EN-vs-ZH overlay, PPDS distribution heatmap, cross-lingual scatter, and an effect size table. | |
| ## Methodology | |
| **Scoring:** PPDS levels are scored ECHO=5, DELTA=4, BRAVO=3, ALPHA=2, OMEGA=1. Bias delta = mean PPDS(Variant A) minus mean PPDS(Variant B) across iterations. Positive deltas indicate the demographic signal raises perceived urgency, negative the opposite. | |
| **Statistics:** Effect sizes reported as Cohen's d. Significance from independent t-tests between Variant A and Variant B score distributions per scenario, model, and language. Stars: * p<.05, ** p<.01, *** p<.001. | |
| **Robustness:** Each prompt is paraphrased per iteration via cycling through ten matched opener-closer pairs in each language to reduce single-template artifacts. | |
| **PPDS source:** Warner et al., *Annals of Emergency Dispatch and Response* 2014, Vol. 2 Issue 2 (IAED). | |
| ## Use your own OpenRouter key | |
| The Space does not pay for runs. Provide your own OpenRouter API key in the field on the page. Get one at [openrouter.ai/keys](https://openrouter.ai/keys). Approximate cost: a full run (11 models, 10 scenarios, 10 iterations, 2 languages, 2 variants = 4,400 calls) typically lands under USD 5 with the default model mix. | |
| If you fork the Space and want a default key for your own use, add `OPENROUTER_API_KEY` as a Space secret in the Settings tab. | |
| ## Local use | |
| ```bash | |
| pip install -r requirements.txt | |
| export OPENROUTER_API_KEY="sk-or-v1-..." | |
| python app.py | |
| ``` | |
| ## Citation | |
| ```bibtex | |
| @article{guey2026dispatchbias, | |
| title={Emergency Dispatch LLM Bias: A Cross-Lingual PPDS Benchmark}, | |
| author={Guey, William}, | |
| journal={Humanities and Social Sciences Communications}, | |
| year={2026}, | |
| note={Under review} | |
| } | |
| ``` | |
| ## License | |
| MIT for the code. Data and prompts are released under CC BY 4.0. The PPDS scale is the property of the IAED. | |