Representation-as-a-Judge
Repo for the paper: "Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry", accepted by ICLR 2026.
Plug-and-Play
1. Requirements
```shell
pip install -r requirements.txt
```
2. Evaluate your question and prediction pairs on reasoning tasks
We have uploaded trained classifiers for the datasets used in our paper. You can also apply them to similar tasks, e.g., evaluating SVAMP with our trained GSM8K/MATH classifiers.
The classifiers are built on Qwen3-1.7B and scikit-learn; make sure your hardware can run Qwen3-1.7B inference. They evaluate reasoning pairs across 5 aspects (ROSCOE, ICLR 2023):
Semantic Consistency, Logicality, Informativeness, Fluency, Factuality
For each pair, the multiclass classifiers assign each aspect a score from 1 to 5, and the binary classifiers assign each aspect a 0/1 score (low quality/high quality).
The evaluation adds the five aspect scores to each dict, along with a key "total_score" that sums them. You can also set --top_percent (default=1.0) to keep only the top fraction of the data, ranked by "total_score".
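To illustrate how --top_percent filtering works, here is a minimal sketch; the helper name `filter_top_percent` is hypothetical and the repo's actual implementation may differ:

```python
def filter_top_percent(results, top_percent=1.0):
    """Keep the top fraction of samples ranked by 'total_score' (descending)."""
    ranked = sorted(results, key=lambda r: r["total_score"], reverse=True)
    keep = max(1, int(len(ranked) * top_percent))
    return ranked[:keep]

# Illustrative scored samples
scored = [
    {"question": "q1", "prediction": "p1", "total_score": 16},
    {"question": "q2", "prediction": "p2", "total_score": 9},
    {"question": "q3", "prediction": "p3", "total_score": 21},
]
top = filter_top_percent(scored, top_percent=0.5)
# int(3 * 0.5) = 1, so only the highest-scoring entry (q3) is kept
```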
We support 2 input formats (each dict must contain the keys "question" and "prediction"):
- The input is a Python list of question and prediction pairs:

```python
from quick_eval import evaluate_samples

# Your input list
samples = [
    {"question": "What is 2+2?", "prediction": "4"},
    {"question": "What is 3+3?", "prediction": "6"}
]

# Evaluate and get results
results = evaluate_samples(samples, clf_root='gsm8k_multi_clfs', batch_size=16, top_percent=1.0)
# Returns: [
#   {"question": "...", "prediction": "...",
#    "semantic_consistency": 5, "logicality": 4,
#    "informativeness": 1, "fluency": 1, "factuality": 5,
#    "total_score": 16},
#   ...
# ]
```
- The input is a JSON file containing pairs like the above, e.g. Meta-Llama-3-8B-Instruct_gsm8k_results.json. This produces an evaluated JSON file named like Meta-Llama-3-8B-Instruct_gsm8k_results_Qwen3-1.7B_multi_evaled.json:

```shell
python quick_eval.py --clf_root 'gsm8k_multi_clfs' --batch_size 16 --file_path 'Meta-Llama-3-8B-Instruct_math_results.json' --top_percent 1.0
```
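The evaluated JSON file can then be inspected with standard tooling. A small sketch of computing per-aspect means over the output (the entries below are illustrative; in practice you would load the list with `json.load`):

```python
# Illustrative contents of an evaluated JSON file
# (in practice: results = json.load(open("..._multi_evaled.json")))
results = [
    {"question": "What is 2+2?", "prediction": "4",
     "semantic_consistency": 5, "logicality": 4, "informativeness": 1,
     "fluency": 1, "factuality": 5, "total_score": 16},
    {"question": "What is 3+3?", "prediction": "6",
     "semantic_consistency": 4, "logicality": 5, "informativeness": 2,
     "fluency": 1, "factuality": 4, "total_score": 16},
]

ASPECTS = ["semantic_consistency", "logicality", "informativeness",
           "fluency", "factuality"]

# Average each aspect's score over all evaluated pairs
means = {a: sum(r[a] for r in results) / len(results) for a in ASPECTS}
print(means)
```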