akseljoonas HF Staff committed on
Commit 00f76b2 · 1 Parent(s): b5fffed

adding readme

Files changed (1): eval/README.md +76 -0

eval/README.md ADDED
@@ -0,0 +1,76 @@
# HF-Agent Eval

Rubric-based evaluation pipeline implementing [Rubrics as Rewards](https://arxiv.org/abs/2410.13254) (RaR-Explicit).

## Pipeline

```
QA pairs → generate_rubrics.py → evaluate.py → scores
```

### 1. Generate Rubrics

Creates instance-specific evaluation criteria from the question and reference answer.

```bash
python eval/generate_rubrics.py \
  --infile qa_pairs.jsonl \
  --outfile qa_rubrics.jsonl \
  --model anthropic/claude-sonnet-4-5-20250929 \
  --push-to-hub akseljoonas/hf-agent-benchmark@rubrics
```

**Input format:**
```json
{"question": "...", "solution": "...", "thread": [...]}
```

**Output:** 7-20 weighted criteria per question (Essential: +5, Important: +3 to +4, Optional: +1 to +2, Pitfall: -1 to -2)
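
The rubric file's exact schema isn't documented here; as an illustration only (field names are assumptions, not the script's guaranteed output), a generated record might look like:

```json
{
  "question": "...",
  "criteria": [
    {"criterion": "...", "category": "Essential", "weight": 5},
    {"criterion": "...", "category": "Pitfall", "weight": -2}
  ]
}
```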

### 2. Evaluate Responses

Scores responses against the generated rubrics via LLM-as-judge.

```python
from evaluate import evaluate_dataset_with_rubrics

evaluate_dataset_with_rubrics(
    input_file="responses.jsonl",
    rubric_file="qa_rubrics.jsonl",
    ground_truth_file="qa_pairs.jsonl",
    output_file="results.jsonl",
    model="gpt-4o-mini",
    push_to_hub="akseljoonas/hf-agent-benchmark@evaluations",
)
```

**Output:** Normalized score in [0, 1], plus per-criterion satisfaction and judge reasoning
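
As an illustration only (field names are assumptions, not the evaluator's guaranteed output), a results record could look like:

```json
{
  "question": "...",
  "score": 0.82,
  "criteria": [
    {"criterion": "...", "satisfied": true, "weight": 5}
  ],
  "reasoning": "..."
}
```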

## HuggingFace Integration

Both scripts upload DataFrames to the Hub before saving JSONL locally:

```python
from hf_dataset_io import df_to_hub, hub_to_df

# Upload
df_to_hub(df, "username/dataset@config", split="train")

# Download
df = hub_to_df("username/dataset@config", split="train")
```

Use the `@config` notation to organize subsets: `@rubrics`, `@evaluations`, `@ground-truth`
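
`hf_dataset_io` is a project-local module whose internals aren't shown here; a spec string like `username/dataset@config` can be split with a small helper along these lines (a sketch, not the module's actual code — `parse_hub_spec` is a hypothetical name):

```python
def parse_hub_spec(spec: str) -> tuple[str, str]:
    """Split a 'username/dataset@config' spec into (repo_id, config_name).

    A spec without an '@' falls back to the 'default' config, mirroring how
    the datasets library names unconfigured subsets. Illustrative only;
    hf_dataset_io may parse specs differently.
    """
    repo_id, _, config = spec.partition("@")
    return repo_id, config or "default"


print(parse_hub_spec("akseljoonas/hf-agent-benchmark@rubrics"))
# → ('akseljoonas/hf-agent-benchmark', 'rubrics')
```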

## Key Parameters

- **--max-concurrent**: Parallel workers (default: 30 for rubric generation, 10 for evaluation)
- **--push-to-hub**: Auto-upload results to the HF Hub (e.g., `user/dataset@rubrics`)
- **--model**: LiteLLM model string
- **split**: `train` for rubrics, `test` for evaluations

## Scoring

RaR-Explicit: `score = Σ(weight × satisfied) / Σ(positive_weights)`

Only positive weights enter the denominator, so the score is normalized to [0, 1]; it is clipped at 0 if triggered pitfalls would make it negative.
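
The formula above can be sketched in Python (a minimal illustration, not the evaluator's actual code; criterion records are assumed to carry `weight` and `satisfied` fields):

```python
def rar_explicit_score(criteria: list[dict]) -> float:
    """Normalized RaR-Explicit score: Σ(weight × satisfied) / Σ(positive weights)."""
    earned = sum(c["weight"] for c in criteria if c["satisfied"])
    possible = sum(c["weight"] for c in criteria if c["weight"] > 0)
    # Pitfalls (negative weights) can drag the raw sum below zero; clip at 0.
    return max(0.0, earned / possible) if possible else 0.0


criteria = [
    {"weight": 5, "satisfied": True},   # Essential, met
    {"weight": 3, "satisfied": False},  # Important, missed
    {"weight": -2, "satisfied": True},  # Pitfall, triggered
]
print(rar_explicit_score(criteria))  # → 0.375, i.e. (5 - 2) / (5 + 3)
```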