Representation-as-a-Judge
Repo for the paper: "Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry", accepted by ICLR 2026.
Plug-and-Play
1. Requirements
```shell
pip install -r requirements.txt
```
2. Evaluate your question and prediction pairs on reasoning tasks
We have uploaded trained classifiers for the datasets used in our paper. You can also apply them to similar tasks, e.g., evaluating SVAMP with our trained GSM8K/MATH classifiers.
The classifiers are built on Qwen3-1.7B and scikit-learn; make sure your hardware can run Qwen3-1.7B inference. They evaluate reasoning pairs across 5 aspects (ROSCOE, ICLR 2023):
Semantic Consistency, Logicality, Informativeness, Fluency, Factuality
For each pair, the multiclass classifiers assign each aspect a score from 1 to 5, and the binary classifiers assign each aspect a 0/1 score (low quality/high quality).
The evaluation adds the five aspect scores to each dict, along with a key "total_score" that sums them. You can also set --top_percent (default=1.0) to keep only the top fraction of the data, ranked by "total_score".
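To illustrate how --top_percent filtering works, here is a minimal sketch; the helper name `filter_top_percent` is hypothetical and the repo's actual implementation may differ:

```python
def filter_top_percent(results, top_percent=1.0):
    """Keep the top fraction of samples ranked by 'total_score' (descending)."""
    ranked = sorted(results, key=lambda r: r["total_score"], reverse=True)
    keep = max(1, int(len(ranked) * top_percent))
    return ranked[:keep]

# Illustrative scored samples
scored = [
    {"question": "q1", "prediction": "p1", "total_score": 16},
    {"question": "q2", "prediction": "p2", "total_score": 9},
    {"question": "q3", "prediction": "p3", "total_score": 21},
]
top = filter_top_percent(scored, top_percent=0.5)
# int(3 * 0.5) = 1, so only the highest-scoring entry (q3) is kept
```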
We support 2 input formats (each dict must contain the keys "question" and "prediction"):
- The input is a Python list of question and prediction pairs:

```python
from quick_eval import evaluate_samples

# Your input list
samples = [
    {"question": "What is 2+2?", "prediction": "4"},
    {"question": "What is 3+3?", "prediction": "6"}
]

# Evaluate and get results
results = evaluate_samples(samples, clf_root='gsm8k_multi_clfs', batch_size=16, top_percent=1.0)
# Returns: [
#   {"question": "...", "prediction": "...",
#    "semantic_consistency": 5, "logicality": 4,
#    "informativeness": 1, "fluency": 1, "factuality": 5,
#    "total_score": 16},
#   ...
# ]
```
- The input is a JSON file containing pairs like the above, e.g. Meta-Llama-3-8B-Instruct_gsm8k_results.json. This produces an evaluated JSON file named like Meta-Llama-3-8B-Instruct_gsm8k_results_Qwen3-1.7B_multi_evaled.json:

```shell
python quick_eval.py --clf_root 'gsm8k_multi_clfs' --batch_size 16 --file_path 'Meta-Llama-3-8B-Instruct_math_results.json' --top_percent 1.0
```
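The evaluated JSON file can then be inspected with standard tooling. A small sketch of computing per-aspect means over the output (the entries below are illustrative; in practice you would load the list with `json.load`):

```python
# Illustrative contents of an evaluated JSON file
# (in practice: results = json.load(open("..._multi_evaled.json")))
results = [
    {"question": "What is 2+2?", "prediction": "4",
     "semantic_consistency": 5, "logicality": 4, "informativeness": 1,
     "fluency": 1, "factuality": 5, "total_score": 16},
    {"question": "What is 3+3?", "prediction": "6",
     "semantic_consistency": 4, "logicality": 5, "informativeness": 2,
     "fluency": 1, "factuality": 4, "total_score": 16},
]

ASPECTS = ["semantic_consistency", "logicality", "informativeness",
           "fluency", "factuality"]

# Average each aspect's score over all evaluated pairs
means = {a: sum(r[a] for r in results) / len(results) for a in ASPECTS}
print(means)
```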