Rubric-Grounded Faithfulness Evaluation Code
This repository hosts the anonymized executable code accompanying the NeurIPS 2026 Evaluations and Datasets submission:
- From Scores to Checks: Rubric-Grounded Faithfulness Evaluation for AI-Generated Images
It contains the scripts, prompt templates, and dependency files needed to inspect or rerun the paper's evaluation logic. The released dataset/resource assets are hosted separately under the OpenReview Dataset URL.
Included
- code/: training, evaluation, routing, bootstrap, and cross-dataset inference scripts
- external_eval/: live external-judge scripts, prompt templates, and external-evaluation README
- requirements-topic2.txt: main Python dependencies
- LICENSES.md: upstream license and terms summary
- .gitignore
Not included
- derived dataset/resource files
- source benchmark images
- external reference-prediction CSVs
- provider credentials
How reviewers should use this code
- Download the released resource from the OpenReview Dataset URL.
- Place the dataset bundle beside this code bundle or copy its data/, results/, and external_eval/reference_predictions/ directories into the locations expected by the scripts.
- Install dependencies:
python -m pip install -r requirements-topic2.txt
python -m pip install -r external_eval/requirements_qwen_baseline.txt
- Run the basic preflight check after the dataset bundle is in place:
python code/preflight.py --data-root data/topic2_aigciqa2023
In the release-only setup this check should confirm that derived annotations and manifests are present; it is expected to report ready=False until the upstream benchmark images are joined locally.
The hosted dataset bundle is sufficient for core result reproduction from released annotations, predictions, and metric files. A full rerun from upstream benchmark images requires joining the original image assets and metadata, and only that fuller setup should be checked with:
python code/preflight.py --data-root data/topic2_aigciqa2023 --strict
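Before running the preflight script, it can help to confirm that the directories copied from the dataset bundle are where the scripts will look for them. The sketch below is not part of the repository and does not reproduce what code/preflight.py checks. The helper name `missing_release_dirs` and the assumption that all three directories sit directly under one bundle root are illustrative only.

```python
from pathlib import Path

# Directories this README says come from the dataset bundle.
# Assumption: they sit directly under a single bundle root.
EXPECTED_DIRS = (
    "data",
    "results",
    "external_eval/reference_predictions",
)


def missing_release_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (base / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_release_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected release directories are present.")
```

An empty result only means the directories exist; the repository's own preflight check remains the authority on whether their contents are complete.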
Core release-only reproduction
The hosted dataset and code releases are enough to verify the paper's central reported numbers from saved artifacts.
Aggregate the internal Rubric probe on the main split:
python code/aggregate_eval_metrics.py --run-dir results/fullgold_eval/main2026/rubric --metric-name test_metrics.json --output-json tmp/rubric_main2026_aggregate.json
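To make the aggregation step easier to inspect, here is a minimal sketch of averaging metric files found under a run directory. It is not the logic of code/aggregate_eval_metrics.py. It assumes, without confirmation from the release, that each run writes a flat JSON object of numeric metrics to a file named test_metrics.json somewhere below the run directory. The function name `aggregate_metrics` and the output fields are hypothetical.

```python
import json
import statistics
from pathlib import Path


def aggregate_metrics(run_dir, metric_name="test_metrics.json"):
    """Average every numeric metric over all files named metric_name under run_dir.

    Assumption: each file holds a flat JSON object of metric name -> number.
    """
    values = {}
    for path in sorted(Path(run_dir).rglob(metric_name)):
        with open(path, encoding="utf-8") as handle:
            metrics = json.load(handle)
        for key, value in metrics.items():
            # Skip non-numeric entries (and booleans, which are ints in Python).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values.setdefault(key, []).append(float(value))
    return {
        key: {
            "mean": statistics.fmean(vals),
            # Sample standard deviation needs at least two runs.
            "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
            "n": len(vals),
        }
        for key, vals in values.items()
    }
```

Reporting `n` alongside the mean makes it visible when a run directory holds fewer metric files than expected.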
Re-evaluate the released Qwen global reference predictions on the full-gold split:
python code/eval_external_predictions.py --prediction-csv external_eval/reference_predictions/qwen_vl_global_predictions_split2026_test.csv --data-root data/topic2_aigciqa2023 --protocol generator_holdout --split-seed 2026 --split test --output-dir tmp/external_qwen_global_split2026
Bootstrap the central in-domain Rubric-vs-Direct difference from released prediction frames:
python code/bootstrap_compare.py --run-a results/fullgold_eval/main2026/direct --run-b results/fullgold_eval/main2026/rubric --name-a direct --name-b rubric --metric macro_f1 --subdir . --filename test_predictions_aligned.csv --output-dir tmp/bootstrap_main2026
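For readers unfamiliar with the comparison above, the following is a generic sketch of a paired percentile bootstrap on the macro-F1 difference between two aligned prediction sets. It is not taken from code/bootstrap_compare.py, whose resampling scheme, interval type, and input columns may differ. The function names, the default of 1000 resamples, and the choice to fix the label set from the gold labels are assumptions made for illustration.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over a fixed label set."""
    scores = []
    for label in labels:
        tp = fp = fn = 0
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
        denom = 2 * tp + fp + fn
        # A class absent from both truth and predictions scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_boot=1000, alpha=0.05, seed=0):
    """Return (observed, lower, upper) for macro-F1(b) minus macro-F1(a)."""
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("inputs must be aligned row by row")
    labels = sorted(set(y_true))  # fixed across resamples
    n = len(y_true)
    rng = random.Random(seed)
    observed = macro_f1(y_true, pred_b, labels) - macro_f1(y_true, pred_a, labels)
    diffs = []
    for _ in range(n_boot):
        # Resample row indices once and apply them to both systems (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(truth, b, labels) - macro_f1(truth, a, labels))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return observed, lower, upper
```

Drawing one set of indices per resample and applying it to both systems is what makes the comparison paired: each resample scores the two systems on exactly the same items.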
Re-evaluate the released cross-dataset Qwen global predictions on the relabeled T2I-CompBench stress test:
python code/eval_cross_dataset_predictions.py --prediction-csv external_eval/reference_predictions/qwen_vl_global_predictions_full2400.csv --labels-csv data/p2_t2i_compbench_human_eval/cross_dataset_eval_labels.csv --output-dir tmp/p2_qwen_global_full2400
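Evaluating a prediction CSV against a separate labels CSV comes down to joining the two on a shared identifier and scoring the matched rows. The sketch below illustrates that join using only the standard library. It is not the logic of code/eval_cross_dataset_predictions.py, and the column names `image_id`, `prediction`, and `label` are hypothetical placeholders rather than the released schema.

```python
import csv
import io


def join_and_score(pred_file, label_file, key="image_id",
                   pred_col="prediction", label_col="label"):
    """Join two CSV streams on key and report accuracy over matched rows.

    All column names are hypothetical defaults, not the released schema.
    """
    labels = {row[key]: row[label_col] for row in csv.DictReader(label_file)}
    matched = correct = 0
    unmatched = []
    for row in csv.DictReader(pred_file):
        truth = labels.get(row[key])
        if truth is None:
            # Prediction with no gold label: record it instead of scoring it.
            unmatched.append(row[key])
            continue
        matched += 1
        correct += int(row[pred_col] == truth)
    return {
        "matched": matched,
        "unmatched_ids": unmatched,
        "accuracy": correct / matched if matched else float("nan"),
    }


if __name__ == "__main__":
    # Tiny in-memory example; real use would pass open file handles.
    preds = io.StringIO("image_id,prediction\na,1\nb,0\nc,1\n")
    gold = io.StringIO("image_id,label\na,1\nb,1\n")
    print(join_and_score(preds, gold))
```

Returning the unmatched identifiers separately keeps a silent ID mismatch between the two files from being mistaken for a change in accuracy.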
Scope
This bundle is intended to let reviewers inspect:
- the internal same-backbone probe implementations
- the evaluation and aggregation logic
- the routing and bootstrap scripts
- the live external-judge prompting scripts and templates
The primary released evaluation resource is hosted separately under the OpenReview Dataset URL and documented there through README.md, EVALUATION_CARD.md, and croissant.json.