
FairEval: Human-Aligned Evaluation for Generative Models

Author: Kriti Behl
GitHub: https://github.com/kritibehl/FairEval
Paper (preprint): “FairEval: Human-Aligned Evaluation for Generative Models”

FairEval is a lightweight research framework for evaluating LLM outputs on dimensions beyond accuracy, focusing on:

  • LLM-as-Judge alignment scoring
  • Toxicity / safety analysis
  • Human agreement metrics (κ, ρ)
  • Group-wise fairness dashboards

It is designed as a research tool, not a deployment model.


What this repo contains

This Hugging Face repo currently serves as a model card + metadata hub for:

  • The FairEval evaluation pipeline (code on GitHub)
  • A planned Hugging Face Space demo (UI built in Streamlit)
  • Links to my preprint and Medium explainer

Code: https://github.com/kritibehl/FairEval
Medium: https://medium.com/@kriti0608/faireval-a-human-aligned-evaluation-framework-for-generative-models-d822bfd5c99d


Capabilities

FairEval supports:

  1. Rubric-based LLM-as-Judge scoring

    • Uses a structured rubric (config/prompts/judge_rubric.md) to score:
      • coherence
      • helpfulness
      • factuality
    • Returns scalar scores intended to correlate with human preferences (validated via the agreement metrics below).
  2. Toxicity and safety metrics

    • Wraps a toxicity model (e.g., Detoxify) to compute:
      • composite toxicity
      • per-category scores (insult, threat, identity attack, etc.)
    • Provides Altair charts for:
      • toxicity breakdown by category
      • toxicity distribution by demographic group
  3. Human evaluation agreement

    • Ingests a human_eval.csv file with human ratings.
    • Computes:
      • Fleiss’ κ (inter-rater reliability)
      • Spearman ρ between judge and human scores.
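As a concrete illustration of capability 1, a judge model's structured reply can be reduced to scalar per-dimension scores. This is a minimal sketch, not FairEval's actual API: the function name and the assumption that the judge replies in JSON with 1-5 scores are hypothetical, while the rubric dimensions mirror those listed above.

```python
import json

# Rubric dimensions from the judge rubric described above.
RUBRIC_DIMS = ("coherence", "helpfulness", "factuality")

def parse_judge_reply(reply: str) -> dict:
    """Extract a 1-5 scalar score per rubric dimension from a judge's
    JSON reply, clamping out-of-range values and defaulting missing
    dimensions to the minimum score. (Hypothetical helper, not FairEval's API.)"""
    raw = json.loads(reply)
    return {d: min(5, max(1, int(raw.get(d, 1)))) for d in RUBRIC_DIMS}
```

Clamping rather than rejecting malformed scores keeps the pipeline robust to judges that occasionally ignore the rubric's scale.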
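For capability 2, the group-wise toxicity breakdown that feeds the Altair charts can be sketched as a simple aggregation. The function name and the `(group, score)` row format are assumptions for illustration; FairEval's real pipeline gets per-category scores from a toxicity model such as Detoxify.

```python
from collections import defaultdict
from statistics import mean

def toxicity_by_group(rows):
    """Aggregate (demographic_group, toxicity_score) pairs into a
    mean toxicity per group. (Illustrative helper, not FairEval's API.)"""
    buckets = defaultdict(list)
    for group, score in rows:
        buckets[group].append(score)
    return {g: mean(scores) for g, scores in buckets.items()}
```

The resulting dict maps directly onto a bar chart of mean toxicity per demographic group.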
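For capability 3, both agreement statistics can be computed from first principles. The sketch below assumes a fixed number of raters per item and no tied scores; production code would typically use `statsmodels.stats.inter_rater.fleiss_kappa` and `scipy.stats.spearmanr` instead.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa from a subjects-by-categories count matrix:
    ratings[i][j] = number of raters assigning category j to subject i.
    Assumes the same number of raters for every subject."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.sum(axis=1)[0]                      # raters per subject
    p_j = ratings.sum(axis=0) / ratings.sum()       # category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return float((P_bar - P_e) / (1 - P_e))

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.
    Double argsort yields ranks; note it does NOT average tied ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])
```

Perfect agreement among raters yields κ = 1, and a perfectly monotone judge-human relationship yields ρ = 1.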

Example Usage

Clone the GitHub repo and run the Streamlit demo:

git clone https://github.com/kritibehl/FairEval.git
cd FairEval
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run demo/app.py