# FairEval: Human-Aligned Evaluation for Generative Models

**Author:** Kriti Behl  
**GitHub:** https://github.com/kritibehl/FairEval  
**Paper (preprint):** _“FairEval: Human-Aligned Evaluation for Generative Models”_  

FairEval is a lightweight research framework for evaluating LLM outputs on dimensions beyond accuracy, focusing on:

- **LLM-as-Judge alignment scoring**
- **Toxicity / safety analysis**
- **Human agreement metrics (κ, ρ)**
- **Group-wise fairness dashboards**

It is designed as a **research tool**, not a deployment model.

---

## What this repo contains

This Hugging Face repo currently serves as a **model card + metadata hub** for:

- The **FairEval evaluation pipeline** (code on GitHub)
- A planned **Hugging Face Space demo** (UI built in Streamlit)
- Links to my **preprint** and **Medium explainer**.

> **Code**: https://github.com/kritibehl/FairEval  
> **Medium**: https://medium.com/@kriti0608/faireval-a-human-aligned-evaluation-framework-for-generative-models-d822bfd5c99d  

---

## Capabilities

FairEval supports:

1. **Rubric-based LLM-as-Judge scoring**  
   - Uses a structured rubric (`config/prompts/judge_rubric.md`) to score:
     - coherence
     - helpfulness
     - factuality
   - Returns **scalar scores** that correlate with human preference.

2. **Toxicity and safety metrics**
   - Wraps a toxicity model (e.g., Detoxify) to compute:
     - composite toxicity
     - per-category scores (insult, threat, identity attack, etc.)
   - Provides **Altair charts** for:
     - toxicity breakdown by category
     - toxicity distribution by demographic group

3. **Human evaluation agreement**
   - Ingests a `human_eval.csv` file with human ratings.
   - Computes:
     - **Fleiss’ κ** (inter-rater reliability)
     - **Spearman ρ** between judge and human scores.
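Capability 1 above boils down, in code, to assembling a judge prompt from the rubric file and parsing scalar scores out of the model's reply. Below is a minimal, model-agnostic sketch; the prompt template, the JSON reply shape, and the helper names are illustrative assumptions, not FairEval's actual API:

```python
import json

def build_judge_prompt(rubric_text: str, prompt: str, response: str) -> str:
    """Assemble a judge prompt; requesting JSON makes the scores easy to parse.
    (Illustrative template, not FairEval's actual judge_rubric.md contents.)"""
    return (
        f"{rubric_text}\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
        'Reply with JSON: {"coherence": 1-5, "helpfulness": 1-5, "factuality": 1-5}'
    )

def parse_judge_scores(reply: str) -> dict:
    """Extract the scalar rubric scores from the judge's reply,
    tolerating extra prose around the JSON object."""
    start, end = reply.index("{"), reply.rindex("}") + 1
    scores = json.loads(reply[start:end])
    return {k: float(scores[k]) for k in ("coherence", "helpfulness", "factuality")}
```

A judge model's raw reply often wraps the JSON in commentary, which is why the parser slices from the first `{` to the last `}` rather than parsing the whole string.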
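For capability 2, Detoxify-style models return one probability per category, and a composite score is some aggregation of those. The sketch below uses the max across categories as the composite; that rule and the threshold are illustrative assumptions, not necessarily what FairEval implements:

```python
# Per-category keys in the shape a Detoxify-style predict() returns.
CATEGORIES = ("toxicity", "severe_toxicity", "obscene",
              "threat", "insult", "identity_attack")

def composite_toxicity(scores: dict) -> float:
    """Composite toxicity as the max per-category probability
    (one simple aggregation rule; FairEval may use another)."""
    return max(scores[c] for c in CATEGORIES)

def flag_toxic(scores: dict, threshold: float = 0.5) -> bool:
    """Binary safety flag: any category above the threshold trips it."""
    return composite_toxicity(scores) >= threshold
```

Taking the max (rather than the mean) keeps a single high-scoring category, such as `identity_attack`, from being diluted by low scores elsewhere.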
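The agreement metrics in capability 3 have standard closed forms. Below is a small pure-Python sketch: Fleiss' κ over an items × categories matrix of rating counts, and Spearman's ρ assuming no tied values. In practice you would likely reach for `statsmodels` and `scipy.stats`, and FairEval's own implementation may differ:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from an items x categories matrix of rating counts
    (each row sums to the number of raters)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # Mean per-item agreement P-bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement P-bar_e from marginal category proportions.
    n_cats = len(counts[0])
    p_e = sum((sum(row[j] for row in counts) / total) ** 2 for j in range(n_cats))
    return (p_bar - p_e) / (1 - p_e)

def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)),
    assuming no tied values in either sequence."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

With `human_eval.csv` loaded into per-item rating counts and per-item judge/human score columns, these two functions give the κ and ρ numbers the dashboard reports.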

---

## Example Usage

Check out the GitHub repo and run the Streamlit demo:

```bash
git clone https://github.com/kritibehl/FairEval.git
cd FairEval
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run demo/app.py
```