# FairEval: Human-Aligned Evaluation for Generative Models

**Author:** Kriti Behl
**GitHub:** https://github.com/kritibehl/FairEval
**Paper (preprint):** _“FairEval: Human-Aligned Evaluation for Generative Models”_

FairEval is a lightweight research framework for evaluating LLM outputs beyond accuracy, focusing on:

- **LLM-as-Judge alignment scoring**
- **Toxicity / safety analysis**
- **Human agreement metrics (κ, ρ)**
- **Group-wise fairness dashboards**

It is designed as a **research tool**, not a deployment model.
---

## What this repo contains

This Hugging Face repo currently serves as a **model card + metadata hub** for:

- The **FairEval evaluation pipeline** (code on GitHub)
- A planned **Hugging Face Space demo** (UI built in Streamlit)
- Links to my **preprint** and **Medium explainer**

> **Code**: https://github.com/kritibehl/FairEval
> **Medium**: https://medium.com/@kriti0608/faireval-a-human-aligned-evaluation-framework-for-generative-models-d822bfd5c99d

---
## Capabilities

FairEval supports:

1. **Rubric-based LLM-as-Judge scoring**
   - Uses a structured rubric (`config/prompts/judge_rubric.md`) to score:
     - coherence
     - helpfulness
     - factuality
   - Returns **scalar scores** that correlate with human preference.
2. **Toxicity and safety metrics**
   - Wraps a toxicity model (e.g., Detoxify) to compute:
     - composite toxicity
     - per-category scores (insult, threat, identity attack, etc.)
   - Provides **Altair charts** for:
     - toxicity breakdown by category
     - toxicity distribution by demographic group
3. **Human evaluation agreement**
   - Ingests a `human_eval.csv` file with human ratings.
   - Computes:
     - **Fleiss’ κ** (inter-rater reliability)
     - **Spearman ρ** between judge and human scores
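To illustrate item 2, here is a minimal sketch of how per-category toxicity scores (in the dict shape a model like Detoxify returns) might collapse into a single composite score. The `composite_toxicity` helper and its max-over-categories rule are assumptions for illustration; FairEval's actual aggregation lives in the GitHub code.

```python
def composite_toxicity(scores: dict) -> float:
    """Collapse per-category toxicity scores into one scalar.

    Illustrative rule (assumed, not FairEval's definition): take the
    maximum category score, so any single high-risk category flags
    the output.
    """
    return max(scores.values())

# Example per-category scores, in the key/value shape Detoxify produces
example = {
    "toxicity": 0.82,
    "severe_toxicity": 0.10,
    "obscene": 0.40,
    "threat": 0.05,
    "insult": 0.77,
    "identity_attack": 0.03,
}

print(composite_toxicity(example))  # 0.82
```

A mean or a weighted sum would be equally reasonable aggregations; the max is simply the most conservative from a safety standpoint.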
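The two agreement statistics in item 3 are standard (and available via `statsmodels` and `scipy`), but a pure-Python sketch makes clear what each one measures. The input shape for `fleiss_kappa` is an assumption: one row per item, one column per category, each cell counting the raters who chose that category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of rating counts.

    Rows are items, columns are categories, and each cell is the number
    of raters who assigned the item to that category. Assumes every item
    was rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # Overall proportion of ratings falling in each category
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Observed per-item agreement among raters
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank across ties
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var


# Three items, three raters, two categories: the raters agree perfectly
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))        # 1.0
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

κ compares observed inter-rater agreement against chance agreement (1 = perfect, 0 = chance-level), while ρ checks whether the judge *ranks* outputs the same way humans do even if the raw score scales differ.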
---

## Example Usage

Clone the GitHub repo and run the Streamlit demo:

```bash
git clone https://github.com/kritibehl/FairEval.git
cd FairEval
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run demo/app.py
```