AGC-Judge

AGC-Judge is an open-weight scorer for LLM creativity, fine-tuned on bias-corrected three-judge ratings produced by Judge Response Theory (JRT). It is the open-weight scoring artifact released alongside AGC-Bench: Measuring Artificial General Creativity (NeurIPS 2026 Evaluations and Datasets Track, anonymous submission).

Quick start

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("agcbench-2026/AGC-Judge")
model = AutoPeftModelForCausalLM.from_pretrained(
    "agcbench-2026/AGC-Judge",
    torch_dtype="auto",
    device_map="auto",
)

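# Prompt template; the three fields must be filled in before tokenizing.
PROMPT_TEMPLATE = (
    "Benchmark rubric:\n{rubric}\n\n"
    "Prompt:\n{instruction}\n\nResponse:\n{response}\n\n"
    "Output a single integer score on the scale specified in the rubric. "
    "No explanation, no formatting, just the number.\n\nScore:"
)

# Illustrative placeholder values; substitute your benchmark's rubric,
# the task prompt, and the model response you want scored.
prompt = PROMPT_TEMPLATE.format(
    rubric="Rate the creativity of the response from 1 (low) to 5 (high).",
    instruction="Write a two-line poem about rust.",
    response="Iron remembers the rain; it blushes, then lets go.",
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=4)
# Decode only the newly generated tokens (the score), not the echoed prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))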
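The decoded output is free text, so downstream code usually needs to turn it into an integer and check it against the rubric's scale. The following is a minimal sketch of one way to do that; `parse_score` and its `lo`/`hi` bounds are hypothetical helpers, not part of the released model or of AGC-Bench.

```python
import re
from typing import Optional


def parse_score(text: str, lo: int, hi: int) -> Optional[int]:
    """Return the first integer found in `text` if it lies in [lo, hi], else None.

    Hypothetical helper: the scale bounds must come from the rubric you used.
    """
    match = re.search(r"-?\d+", text)
    if match is None:
        return None  # the model produced no digits at all
    score = int(match.group())
    # Reject values outside the rubric's scale instead of clamping them.
    return score if lo <= score <= hi else None
```

For a 1–5 rubric, `parse_score(decoded_text, 1, 5)` returns either a valid integer or `None`. Treating `None` as a failed generation to retry or log is a design choice left to the caller.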

Model details

Base model: Qwen/Qwen3-30B-A3B-Instruct-2507
Fine-tuning: LoRA adapter on the base model
LoRA hyperparameters: rank 16, alpha 32, learning rate 1e-4, 1 epoch, AdamW
Training data: 48,299 JRT-corrected creativity ratings on 24 LLM-judge benchmarks from AGC-Bench
Validation data: 6,883 held-out items
Compute: 2 × H100-80GB SXM, 20-minute fine-tune
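To make the listed hyperparameters concrete, the sketch below records them in a small dataclass and derives two quantities from them. It assumes the standard LoRA formulation (update scaled by alpha / rank, adapter matrices of shape d_in × r and r × d_out), which this card does not explicitly confirm. The class name and the layer dimensions in the usage line are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    # Values taken from the "Model details" list above.
    rank: int = 16
    alpha: int = 32
    learning_rate: float = 1e-4
    epochs: int = 1
    optimizer: str = "AdamW"

    @property
    def scaling(self) -> float:
        """Multiplier on the low-rank update, assuming standard LoRA (alpha / rank)."""
        return self.alpha / self.rank

    def adapter_params(self, d_in: int, d_out: int) -> int:
        """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
        return self.rank * (d_in + d_out)


cfg = LoraHyperparams()
print(cfg.scaling)                     # 2.0
print(cfg.adapter_params(2048, 2048))  # 65536 (2048 is an illustrative width)
```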

Intended use

AGC-Judge is a drop-in scoring head for creativity benchmarks that otherwise require an LLM-as-judge. It predicts the JRT-corrected gold score (the consensus of three frontier vendor judges with per-judge severity bias removed) at a fraction of the inference cost of running those frontier judges directly.

Primary use cases:

Re-scoring AGC-Bench cells when reproducing the benchmark on new models
Extending AGC-Bench with new creativity benchmarks under a calibrated open-weight scoring head
Substantive creativity-evaluation research where reproducible scoring is needed without dependency on closed-weight frontier judges

Performance

On AGC-Bench held-out splits:

Split	Spearman ρ
In-distribution test	0.94
10 held-out frontier models	0.94
3 held-out benchmarks (novel rubrics)	0.83
Cohort leaderboard reproduction (JRT-corrected sub-composite)	≥ 0.97 on every split
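For readers reproducing these numbers, Spearman ρ is the Pearson correlation of the rank-transformed scores, with tied values sharing their average rank. The sketch below is a dependency-free illustration of that definition, not the evaluation code used for the table. The function names are hypothetical, and in practice `scipy.stats.spearmanr` computes the same statistic.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation between predicted scores x and reference scores y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(vx * vy)
```

Calling `spearman_rho(judge_scores, gold_scores)` on two equal-length lists gives a value in [-1, 1]. The statistic depends only on the ordering of the scores, so judge and gold scores need not share a scale.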

On the out-of-distribution Orwig creative-writing corpus, AGC-Judge tracks human ratings about as well as a frontier GPT judge with ~40% smaller residual source-preference bias (per-source mean gap +0.51 vs +0.83 on the 1–5 scale).
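The "~40% smaller" figure can be checked directly from the two gaps quoted above. This is plain arithmetic on the reported numbers, and the variable names are illustrative.

```python
agc_judge_gap = 0.51   # per-source mean gap for AGC-Judge (1-5 scale)
frontier_gap = 0.83    # per-source mean gap for the frontier GPT judge

remaining_fraction = agc_judge_gap / frontier_gap  # share of the bias that remains
relative_reduction = 1 - remaining_fraction        # share of the bias removed

print(f"{relative_reduction:.1%} smaller, {remaining_fraction:.1%} remaining")
# prints "38.6% smaller, 61.4% remaining"
```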

Limitations and bias

AGC-Judge was fine-tuned exclusively on LLM-produced responses rated by frontier LLMs; it has not been trained on human-produced creative responses. For cross-source (human vs. LLM) scoring, use the fairness-aware inference prompt described in the AGC-Bench paper (appendix on AGC-Human dissociations). That prompt discloses the LLM self-preference bias documented in prior judge work and labels the response source before scoring.
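The model inherits a moderate, characterizable LLM-stylistic preference from its supervision. Per the Orwig comparison under Performance, its residual bias is ~40% smaller than that of the closed-weight frontier judge it was distilled from (roughly 60% of the original bias remains), so the bias is reduced but non-zero.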
AGC-Judge is calibrated on creativity-rubric scoring (Likert and wide-range scales). It is not a general-purpose evaluator and should not be used outside creativity-rating contexts without re-evaluation.

Citation

@inproceedings{agcbench2026,
  title     = {AGC-Bench: Measuring Artificial General Creativity},
  author    = {Anonymous Authors},
  booktitle = {Advances in Neural Information Processing Systems
               Datasets and Benchmarks Track (under review)},
  year      = {2026}
}

License

Apache 2.0 (inherited from the base model).
