# FairEval: Human-Aligned Evaluation for Generative Models

**Author:** Kriti Behl
**GitHub:** https://github.com/kritibehl/FairEval
**Paper (preprint):** “FairEval: Human-Aligned Evaluation for Generative Models”
FairEval is a lightweight research framework for evaluating LLM outputs beyond accuracy — focusing on:
- LLM-as-Judge alignment scoring
- Toxicity / safety analysis
- Human agreement metrics (κ, ρ)
- Group-wise fairness dashboards
It is designed as a research tool, not a deployment model.
## What this repo contains
This Hugging Face repo currently serves as a model card + metadata hub for:
- The FairEval evaluation pipeline (code on GitHub)
- A planned Hugging Face Space demo (UI built in Streamlit)
- Links to my preprint and Medium explainer.
- **Code:** https://github.com/kritibehl/FairEval
- **Medium:** https://medium.com/@kriti0608/faireval-a-human-aligned-evaluation-framework-for-generative-models-d822bfd5c99d
## Capabilities
FairEval supports:
### Rubric-based LLM-as-Judge scoring

- Uses a structured rubric (`config/prompts/judge_rubric.md`) to score:
  - coherence
  - helpfulness
  - factuality
- Returns scalar scores that correlate with human preference.
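As an illustration, here is a minimal sketch of turning a judge model's free-text rubric response into scalar scores. The `parse_judge_scores` helper and the `dimension: score` response format are hypothetical, not part of the FairEval API:

```python
import re

# Rubric dimensions assumed to match config/prompts/judge_rubric.md
RUBRIC_DIMENSIONS = ("coherence", "helpfulness", "factuality")

def parse_judge_scores(response: str) -> dict[str, float]:
    """Extract one scalar score per rubric dimension from a judge's reply.

    Expects lines like "coherence: 4" anywhere in the response text.
    Dimensions the judge omitted are simply absent from the result.
    """
    scores = {}
    for dim in RUBRIC_DIMENSIONS:
        m = re.search(rf"{dim}\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)", response, re.I)
        if m:
            scores[dim] = float(m.group(1))
    return scores
```

Averaging the per-dimension values then yields the single scalar that gets compared against human preference.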
### Toxicity and safety metrics

- Wraps a toxicity model (e.g., Detoxify) to compute:
  - composite toxicity
  - per-category scores (insult, threat, identity attack, etc.)
- Provides Altair charts for:
  - toxicity breakdown by category
  - toxicity distribution by demographic group
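A minimal sketch of the aggregation step, assuming Detoxify-style per-category scores (a `category -> probability` dict); both helpers are illustrative, not FairEval's actual API:

```python
def composite_toxicity(per_category: dict[str, float]) -> float:
    """Collapse per-category scores into one composite value.

    Uses the max over categories: an output is treated as being
    as toxic as its worst-scoring category.
    """
    return max(per_category.values(), default=0.0)

def toxicity_by_group(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Mean composite toxicity per demographic group.

    `rows` pairs a group label with a composite score; the result
    is the kind of table a group-wise fairness chart is drawn from.
    """
    buckets: dict[str, list[float]] = {}
    for group, score in rows:
        buckets.setdefault(group, []).append(score)
    return {g: sum(v) / len(v) for g, v in buckets.items()}
```

Max-aggregation is one common design choice; a weighted mean over categories is an equally reasonable alternative.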
### Human evaluation agreement

- Ingests a `human_eval.csv` file with human ratings.
- Computes:
  - Fleiss’ κ (inter-rater reliability)
  - Spearman ρ between judge and human scores
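The two agreement metrics can be sketched in pure Python (standard definitions, independent of FairEval's own implementation):

```python
from statistics import mean

def fleiss_kappa(table: list[list[int]]) -> float:
    """Fleiss' kappa; table[i][j] = raters assigning item i to category j."""
    n = sum(table[0])                       # raters per item (assumed constant)
    N = len(table)
    # Per-item agreement, then chance agreement from category marginals.
    p_bar = mean((sum(c * c for c in row) - n) / (n * (n - 1)) for row in table)
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (N * n)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of average ranks (ties handled)."""
    def ranks(v: list[float]) -> list[float]:
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1           # average 1-based rank for ties
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

κ close to 1 indicates strong inter-rater reliability; ρ close to 1 indicates the judge ranks outputs the way humans do.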
## Example Usage

Check out the GitHub repo and run the Streamlit demo:

```bash
git clone https://github.com/kritibehl/FairEval.git
cd FairEval
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run demo/app.py
```