# FairEval: Human-Aligned Evaluation for Generative Models

**Author:** Kriti Behl
**GitHub:** https://github.com/kritibehl/FairEval
**Paper (preprint):** _“FairEval: Human-Aligned Evaluation for Generative Models”_

FairEval is a lightweight research framework for evaluating LLM outputs beyond accuracy, focusing on:

- **LLM-as-Judge alignment scoring**
- **Toxicity / safety analysis**
- **Human agreement metrics (κ, ρ)**
- **Group-wise fairness dashboards**

It is designed as a **research tool**, not a deployment model.

---

## What this repo contains

This Hugging Face repo currently serves as a **model card + metadata hub** for:

- The **FairEval evaluation pipeline** (code on GitHub)
- A planned **Hugging Face Space demo** (UI built in Streamlit)
- Links to my **preprint** and **Medium explainer**

> **Code**: https://github.com/kritibehl/FairEval
> **Medium**: https://medium.com/@kriti0608/faireval-a-human-aligned-evaluation-framework-for-generative-models-d822bfd5c99d

---

## Capabilities

FairEval supports:

1. **Rubric-based LLM-as-Judge scoring**
   - Uses a structured rubric (`config/prompts/judge_rubric.md`) to score:
     - coherence
     - helpfulness
     - factuality
   - Returns **scalar scores** that correlate with human preference.

2. **Toxicity and safety metrics**
   - Wraps a toxicity model (e.g., Detoxify) to compute:
     - composite toxicity
     - per-category scores (insult, threat, identity attack, etc.)
   - Provides **Altair charts** for:
     - toxicity breakdown by category
     - toxicity distribution by demographic group

3. **Human evaluation agreement**
   - Ingests a `human_eval.csv` file with human ratings.
   - Computes:
     - **Fleiss’ κ** (inter-rater reliability)
     - **Spearman ρ** between judge and human scores.

---

## Example Usage

Check out the GitHub repo and run the Streamlit demo:

```bash
git clone https://github.com/kritibehl/FairEval.git
cd FairEval
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run demo/app.py
```
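As an illustration of the rubric-based judge scoring described above, the sketch below parses a judge model's reply into scalar per-dimension scores. The reply format and the `parse_scores` helper are assumptions for illustration, not the actual parser in the repo or the contents of `config/prompts/judge_rubric.md`.

```python
import re

# Hypothetical judge reply following a "dimension: score" rubric convention
reply = """coherence: 4
helpfulness: 5
factuality: 3"""

def parse_scores(text, dimensions=("coherence", "helpfulness", "factuality")):
    """Extract 'dimension: score' lines into a dict of float scores."""
    scores = {}
    for dim in dimensions:
        # Match e.g. "coherence: 4" or "Coherence : 4.5", case-insensitively
        m = re.search(rf"{dim}\s*:\s*([0-9]+(?:\.[0-9]+)?)", text, re.IGNORECASE)
        if m:
            scores[dim] = float(m.group(1))
    return scores

print(parse_scores(reply))
# {'coherence': 4.0, 'helpfulness': 5.0, 'factuality': 3.0}
```

Returning a dict of floats (rather than free text) is what makes the judge output comparable against human ratings downstream.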
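To make the toxicity aggregation concrete, here is a minimal sketch of turning per-category scores (in the shape a Detoxify-style model returns) into a composite score and group-wise means for a fairness breakdown. The sample data, the max-over-categories composite, and the `group_means` helper are illustrative assumptions, not FairEval's actual aggregation logic.

```python
# Hypothetical per-sample results; "scores" mimics Detoxify's per-category output
samples = [
    {"group": "A", "scores": {"insult": 0.02, "threat": 0.01, "identity_attack": 0.00}},
    {"group": "B", "scores": {"insult": 0.40, "threat": 0.05, "identity_attack": 0.10}},
]

def composite(scores):
    """One common convention: the worst (max) category score."""
    return max(scores.values())

def group_means(samples):
    """Mean composite toxicity per demographic group."""
    by_group = {}
    for s in samples:
        by_group.setdefault(s["group"], []).append(composite(s["scores"]))
    return {g: sum(v) / len(v) for g, v in by_group.items()}

print(group_means(samples))
```

A table like the one `group_means` produces is exactly what feeds a per-group toxicity chart such as the Altair dashboards mentioned above.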
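The agreement metrics listed above can be sketched as follows: Fleiss' κ over an items-by-categories count matrix (as would be aggregated from a `human_eval.csv`), and Spearman ρ between judge and human scores via `scipy.stats.spearmanr`. The example count matrix and score lists are made up for illustration; this is not the repo's implementation.

```python
import numpy as np
from scipy.stats import spearmanr

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    Every row must sum to the same number of raters n.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                  # raters per item
    p_j = counts.sum(axis=0) / counts.sum()    # marginal category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# 4 items, 3 categories, 3 raters each; all raters agree on every item
counts = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [3, 0, 0]]
print(fleiss_kappa(counts))  # perfect agreement -> 1.0

# Rank correlation between (hypothetical) judge and human scores
judge = [4.5, 3.0, 2.0, 5.0]
human = [4.0, 3.5, 1.5, 5.0]
rho, _ = spearmanr(judge, human)
print(rho)
```

High κ indicates the human labels themselves are reliable; only then is a high judge-human ρ meaningful evidence that the LLM judge is aligned.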