AGC-Judge
AGC-Judge is an open-weight scorer for LLM creativity, fine-tuned on bias-corrected three-judge ratings produced by Judge Response Theory (JRT). It is released alongside the paper AGC-Bench: Measuring Artificial General Creativity (NeurIPS 2026 Evaluations and Datasets Track, anonymous submission).
Quick start
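from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the tokenizer and the LoRA adapter together with its base model.
tokenizer = AutoTokenizer.from_pretrained("agcbench-2026/AGC-Judge")
model = AutoPeftModelForCausalLM.from_pretrained(
    "agcbench-2026/AGC-Judge",
    torch_dtype="auto",
    device_map="auto",
)

# Scoring prompt template; the {placeholders} are filled in below.
template = (
    "Benchmark rubric:\n{rubric}\n\n"
    "Prompt:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Output a single integer score on the scale specified in the rubric. "
    "No explanation, no formatting, just the number.\n\nScore:"
)

# Placeholder inputs for illustration; substitute your own rubric, task prompt, and response.
prompt = template.format(
    rubric="<benchmark rubric text>",
    instruction="<task prompt>",
    response="<response to score>",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=4)
# Decode only the newly generated tokens, i.e. the score.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))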
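The decoded completion is free text, so downstream code usually needs to turn it into an integer and check it against the rubric's scale. A minimal sketch, assuming the score appears as the first integer in the completion; `parse_score` and its `lo`/`hi` bounds are hypothetical helpers, not part of the AGC-Judge release:

```python
import re
from typing import Optional


def parse_score(text: str, lo: int, hi: int) -> Optional[int]:
    """Return the first integer in `text` if it lies in [lo, hi], else None.

    Hypothetical helper: `lo` and `hi` are the rubric's scale bounds,
    which the caller must supply.
    """
    match = re.search(r"-?\d+", text)
    if match is None:
        return None  # no integer found in the completion
    score = int(match.group())
    # Reject values outside the rubric's scale instead of clamping them.
    return score if lo <= score <= hi else None


print(parse_score(" 4", 1, 5))
print(parse_score("n/a", 1, 5))
```

Returning `None` for unparseable or out-of-range completions, rather than clamping, keeps malformed outputs visible so they can be counted or retried.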
Model details
- Base model: Qwen/Qwen3-30B-A3B-Instruct-2507
- Fine-tuning: LoRA adapter on the base model
- LoRA hyperparameters: rank 16, alpha 32, learning rate 1e-4, 1 epoch, AdamW
- Training data: 48,299 JRT-corrected creativity ratings on 24 LLM-judge benchmarks from AGC-Bench
- Validation data: 6,883 held-out items
- Compute: 2 × H100-80GB SXM, 20-minute fine-tune
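For readers less familiar with LoRA, the rank and alpha listed above determine two quantities: the scaling applied to the low-rank update (alpha / rank in the standard LoRA formulation) and the number of trainable parameters added per adapted weight matrix, rank × (d_in + d_out). A small sketch of that arithmetic follows. The 2048 × 2048 matrix shape is a placeholder for illustration, not a dimension taken from the base model, and the set of adapted modules is not stated in this card:

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling factor on the low-rank update in standard LoRA: alpha / rank."""
    return alpha / rank


def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added to one d_out x d_in weight matrix.

    LoRA adds A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 32  # values from the model card
scaling = lora_scaling(RANK, ALPHA)

# Placeholder shape for illustration only (assumption, not the base model's).
added = lora_params(RANK, d_in=2048, d_out=2048)

print(scaling, added)
```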
Intended use
AGC-Judge is a drop-in scoring head for creativity benchmarks that otherwise require an LLM-as-judge. It predicts the JRT-corrected gold score (the consensus of three frontier vendor judges with per-judge severity bias removed) at a fraction of the inference cost of running those frontier judges directly.
Primary use cases:
- Re-scoring AGC-Bench cells when reproducing the benchmark on new models
- Extending AGC-Bench with new creativity benchmarks under a calibrated open-weight scoring head
- Substantive creativity-evaluation research where reproducible scoring is needed without dependency on closed-weight frontier judges
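Re-scoring many responses under one rubric amounts to filling the Quick start prompt template once per (instruction, response) pair. A minimal sketch, reusing that template verbatim; `build_prompts` and the example items are hypothetical and only illustrate the shape of the input:

```python
from typing import Iterable, List, Tuple

# Prompt template copied from the Quick start section.
TEMPLATE = (
    "Benchmark rubric:\n{rubric}\n\n"
    "Prompt:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Output a single integer score on the scale specified in the rubric. "
    "No explanation, no formatting, just the number.\n\nScore:"
)


def build_prompts(rubric: str, items: Iterable[Tuple[str, str]]) -> List[str]:
    """Fill the template for each (instruction, response) pair (hypothetical helper)."""
    return [
        TEMPLATE.format(rubric=rubric, instruction=instruction, response=response)
        for instruction, response in items
    ]


# Toy inputs for illustration only.
prompts = build_prompts(
    "Rate creativity from 1 to 5.",
    [("Name a new use for a brick.", "A doorstop."),
     ("Name a new use for a brick.", "A sundial gnomon.")],
)
print(len(prompts))
```

Each resulting string can then be tokenized and passed to `model.generate` as in Quick start.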
Performance
On AGC-Bench held-out splits:
| Split | Spearman ρ |
|---|---|
| In-distribution test | 0.94 |
| 10 held-out frontier models | 0.94 |
| 3 held-out benchmarks (novel rubrics) | 0.83 |
| Cohort leaderboard reproduction (JRT-corrected sub-composite) | ≥ 0.97 on every split |
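Spearman ρ is the Pearson correlation of the two rank vectors, with tied values sharing the mean of their ranks. To reproduce numbers like those in the table on your own predictions, a dependency-free sketch is given below; `scipy.stats.spearmanr` is the usual library route. The score lists are toy values for illustration, not AGC-Bench data, and this is not the paper's evaluation script:

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all items tied with order[start].
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # 1-based mean rank of the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values: predicted scores vs. reference scores.
predicted = [3, 4, 2, 5, 4]
reference = [3, 5, 1, 4, 4]
print(round(spearman_rho(predicted, reference), 3))
```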
On the out-of-distribution Orwig creative-writing corpus, AGC-Judge tracks human ratings about as well as a frontier GPT judge, with ~40% smaller residual source-preference bias (per-source mean gap +0.51 vs +0.83 on the 1–5 scale).
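The ~40% figure follows from the two per-source gaps quoted above. As a quick check of the arithmetic (variable names are illustrative):

```python
# Per-source mean gaps on the 1-5 scale, as quoted in the text.
agc_judge_gap = 0.51
frontier_gap = 0.83

# Fraction of the frontier judge's gap that remains, and the relative reduction.
remaining = agc_judge_gap / frontier_gap
relative_reduction = 1 - remaining

print(f"remaining: {remaining:.1%}, reduction: {relative_reduction:.1%}")
# prints "remaining: 61.4%, reduction: 38.6%"
```

In other words, AGC-Judge retains roughly 60% of the frontier judge's gap, which is a reduction of roughly 40%.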
Limitations and bias
- AGC-Judge was fine-tuned exclusively on LLM-produced responses rated by frontier LLMs; it has not been trained on human-produced creative responses. For cross-source (human vs. LLM) scoring, use the fairness-aware inference prompt described in the AGC-Bench paper (Appendix on AGC-Human dissociations), which discloses the LLM self-preference bias documented in prior judge work and labels the response source before scoring.
- The model inherits a moderate, characterizable LLM-stylistic preference from its supervision. The bias is ~40% smaller than that of the closed-weight frontier judge it was distilled from (consistent with the +0.51 vs +0.83 gap reported under Performance), but non-zero.
- AGC-Judge is calibrated on creativity-rubric scoring (Likert and wide-range scales). It is not a general-purpose evaluator and should not be used outside creativity-rating contexts without re-evaluation.
Citation
@inproceedings{agcbench2026,
title = {AGC-Bench: Measuring Artificial General Creativity},
author = {Anonymous Authors},
booktitle = {Advances in Neural Information Processing Systems
Datasets and Benchmarks Track (under review)},
year = {2026}
}
License
Apache 2.0 (inherits from the base model).