FairEval: Human-Aligned Evaluation for Generative Models

Author: Kriti Behl
GitHub: https://github.com/kritibehl/FairEval
Paper (preprint): “FairEval: Human-Aligned Evaluation for Generative Models”

FairEval is a lightweight research framework for evaluating LLM outputs beyond accuracy, focusing on:

  • LLM-as-Judge alignment scoring
  • Toxicity / safety analysis
  • Human agreement metrics (κ, ρ)
  • Group-wise fairness dashboards

It is designed as a research tool, not a deployment model.


What this repo contains

This Hugging Face repo currently serves as a model card + metadata hub for:

  • The FairEval evaluation pipeline (code on GitHub)
  • A planned Hugging Face Space demo (UI built in Streamlit)
  • Links to my preprint and Medium explainer

Code: https://github.com/kritibehl/FairEval
Medium: https://medium.com/@kriti0608/faireval-a-human-aligned-evaluation-framework-for-generative-models-d822bfd5c99d


Capabilities

FairEval supports:

  1. Rubric-based LLM-as-Judge scoring

    • Uses a structured rubric (config/prompts/judge_rubric.md) to score:
      • coherence
      • helpfulness
      • factuality
    • Returns scalar scores intended to correlate with human preference; the agreement metrics in item 3 quantify how well they do.

  2. Toxicity and safety metrics

    • Wraps a toxicity model (e.g., Detoxify) to compute:
      • composite toxicity
      • per-category scores (insult, threat, identity attack, etc.)
    • Provides Altair charts for:
      • toxicity breakdown by category
      • toxicity distribution by demographic group

  3. Human evaluation agreement

    • Ingests a human_eval.csv file with human ratings.
    • Computes:
      • Fleiss’ κ (inter-rater reliability)
      • Spearman ρ between judge and human scores.
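
To make the group-wise toxicity breakdown in item 2 concrete, here is a minimal sketch of the aggregation step. It assumes each text has already been scored and tagged with a demographic group. The function name toxicity_by_group, the group labels, and the score values are hypothetical and are not taken from the FairEval code. The score dicts are only shaped like the category-to-score mapping a Detoxify model returns for one text.

```python
from collections import defaultdict
from statistics import mean


def toxicity_by_group(records):
    """Average each toxicity category within each demographic group.

    records: iterable of (group, scores) pairs, where scores maps a
    category name to a float in [0, 1].
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for group, scores in records:
        for category, value in scores.items():
            buckets[group][category].append(value)
    # Collapse each list of scores to its mean.
    return {
        group: {category: mean(values) for category, values in cats.items()}
        for group, cats in buckets.items()
    }


# Hypothetical, hand-written scores (not real model output).
records = [
    ("group_a", {"toxicity": 0.10, "insult": 0.05}),
    ("group_a", {"toxicity": 0.30, "insult": 0.15}),
    ("group_b", {"toxicity": 0.60, "insult": 0.40}),
]
summary = toxicity_by_group(records)
```

A nested dict like summary can be flattened into rows of (group, category, mean score) to feed a chart; how FairEval builds its Altair charts is not shown here.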

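The two agreement metrics in item 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not FairEval's implementation: the function names, the 1 to 5 rating scale, and the sample ratings are hypothetical, and the layout of human_eval.csv is not assumed. In practice, scipy.stats.spearmanr and statsmodels.stats.inter_rater.fleiss_kappa provide library versions of the same statistics.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; ratings is a list of per-item lists, one label per rater."""
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("need the same number (>= 2) of ratings per item")
    counts = [Counter(item) for item in ratings]
    # Observed agreement: mean over items of the per-item agreement P_i.
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items
    # Chance agreement: sum of squared overall category proportions.
    p_exp = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    if p_exp == 1:  # every rating is in one category; kappa is undefined
        return float("nan")
    return (p_bar - p_exp) / (1 - p_exp)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: 4 outputs, each rated 1-5 by 3 human raters,
# plus one judge score per output.
human = [[4, 4, 5], [2, 2, 2], [3, 4, 3], [5, 5, 5]]
judge = [4.2, 1.8, 3.1, 4.9]

kappa = fleiss_kappa(human, categories=[1, 2, 3, 4, 5])
rho = spearman_rho(judge, [sum(r) / len(r) for r in human])
```

Here κ is computed across the human raters only, while ρ compares the judge score with the mean human rating per output; averaging the human ratings before ranking is one reasonable choice among several.
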
Example Usage

Clone the GitHub repo and run the Streamlit demo:

git clone https://github.com/kritibehl/FairEval.git
cd FairEval
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run demo/app.py