Rubric-Grounded Faithfulness Evaluation Code

This repository hosts the anonymized executable code accompanying the NeurIPS 2026 Evaluations and Datasets submission:

From Scores to Checks: Rubric-Grounded Faithfulness Evaluation for AI-Generated Images

It contains the scripts, prompt templates, and dependency files needed to inspect or rerun the paper's evaluation logic. The released dataset/resource assets are hosted separately under the OpenReview Dataset URL.

Included

code/: training, evaluation, routing, bootstrap, and cross-dataset inference scripts
external_eval/: live external-judge scripts, prompt templates, and external-evaluation README
requirements-topic2.txt: main Python dependencies
LICENSES.md: upstream license and terms summary
.gitignore

Not included

derived dataset/resource files
source benchmark images
external reference-prediction CSVs
provider credentials

How reviewers should use this code

Download the released resource from the OpenReview Dataset URL.
Place the dataset bundle beside this code bundle or copy its data/, results/, and external_eval/reference_predictions/ directories into the locations expected by the scripts.
Install dependencies:

python -m pip install -r requirements-topic2.txt
python -m pip install -r external_eval/requirements_qwen_baseline.txt
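
Before running anything, the layout described in the placement step can be sanity-checked. The following is a minimal sketch and not part of the released code; it only assumes the three directory names mentioned above, and the function name `missing_release_dirs` is hypothetical. The repository's own `code/preflight.py` remains the authoritative check.

```python
from pathlib import Path

# Directory names taken from the placement step; the check itself is illustrative.
EXPECTED_DIRS = ("data", "results", "external_eval/reference_predictions")


def missing_release_dirs(bundle_root):
    """Return the expected directories that are absent under bundle_root."""
    root = Path(bundle_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_release_dirs(".")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all expected directories found")
```

Run it from the directory that holds the code bundle; an empty result only means the directories exist, not that their contents are complete.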

Run the basic preflight check after the dataset bundle is in place:

python code/preflight.py --data-root data/topic2_aigciqa2023

In the release-only setup this check should confirm that derived annotations and manifests are present; it is expected to report ready=False until the upstream benchmark images are joined locally.

The hosted dataset bundle is sufficient to reproduce the core results from the released annotations, predictions, and metric files. A full rerun from the upstream benchmark images additionally requires joining the original image assets and metadata locally; run the strict check only in that fuller setup:

python code/preflight.py --data-root data/topic2_aigciqa2023 --strict

Core release-only reproduction

The hosted dataset and code releases are enough to verify the paper's central reported numbers from saved artifacts.

Aggregate the internal Rubric probe on the main split:

python code/aggregate_eval_metrics.py --run-dir results/fullgold_eval/main2026/rubric --metric-name test_metrics.json --output-json tmp/rubric_main2026_aggregate.json
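
For readers who want to see the shape of such an aggregation, here is a rough sketch. It is not the logic of `code/aggregate_eval_metrics.py`; it assumes, hypothetically, that the run directory contains one `test_metrics.json` per sub-run and that each file is a flat JSON object of numeric metrics. The function name `aggregate_metrics` and the choice of population standard deviation are illustrative.

```python
import json
from pathlib import Path
from statistics import mean, pstdev


def aggregate_metrics(run_dir, metric_name="test_metrics.json"):
    """Average every numeric field across all metric files found under run_dir."""
    values = {}
    for path in sorted(Path(run_dir).rglob(metric_name)):
        with open(path) as fh:
            metrics = json.load(fh)
        for key, value in metrics.items():
            # Skip non-numeric entries; bool is excluded because it subclasses int.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values.setdefault(key, []).append(float(value))
    return {
        key: {"mean": mean(vals), "std": pstdev(vals), "n": len(vals)}
        for key, vals in values.items()
    }
```

Comparing the `n` field against the expected number of sub-runs is a cheap way to notice a missing metric file before trusting the mean.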

Re-evaluate the released Qwen global reference predictions on the full-gold split:

python code/eval_external_predictions.py --prediction-csv external_eval/reference_predictions/qwen_vl_global_predictions_split2026_test.csv --data-root data/topic2_aigciqa2023 --protocol generator_holdout --split-seed 2026 --split test --output-dir tmp/external_qwen_global_split2026

Bootstrap the central in-domain Rubric-vs-Direct difference from released prediction frames:

python code/bootstrap_compare.py --run-a results/fullgold_eval/main2026/direct --run-b results/fullgold_eval/main2026/rubric --name-a direct --name-b rubric --metric macro_f1 --subdir . --filename test_predictions_aligned.csv --output-dir tmp/bootstrap_main2026
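
The idea behind a paired bootstrap on macro-F1 can be sketched as follows. This is an illustrative, standard-library-only version, not the implementation in `code/bootstrap_compare.py`: the function names, the percentile interval, and the default of 1000 resamples are assumptions. It presumes the two prediction lists are already aligned row by row with the same gold labels, as the filename `test_predictions_aligned.csv` suggests.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """Bootstrap the macro-F1 difference (B minus A), resampling rows jointly."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        # The same resampled indices are used for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(macro_f1(t, b) - macro_f1(t, a))
    diffs.sort()
    lo = diffs[int(0.025 * (n_resamples - 1))]
    hi = diffs[int(0.975 * (n_resamples - 1))]
    observed = macro_f1(y_true, pred_b) - macro_f1(y_true, pred_a)
    return observed, (lo, hi)
```

Resampling the same indices for both systems keeps item-level difficulty paired, which is why the interval on the difference is typically tighter than one built from two independent bootstraps.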

Re-evaluate the released cross-dataset Qwen global predictions on the relabeled T2I-CompBench stress test:

python code/eval_cross_dataset_predictions.py --prediction-csv external_eval/reference_predictions/qwen_vl_global_predictions_full2400.csv --labels-csv data/p2_t2i_compbench_human_eval/cross_dataset_eval_labels.csv --output-dir tmp/p2_qwen_global_full2400
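
Scoring a prediction CSV against a separate labels CSV amounts to a keyed join followed by a metric. The sketch below shows that pattern under stated assumptions: the column names `item_id`, `prediction`, and `label` are hypothetical placeholders, not the schema of the released files, and plain accuracy stands in for whatever metrics `code/eval_cross_dataset_predictions.py` actually reports.

```python
import csv


def load_column(path, key_col, value_col):
    """Read one CSV column into a dict keyed by key_col."""
    with open(path, newline="") as fh:
        return {row[key_col]: row[value_col] for row in csv.DictReader(fh)}


def score_predictions(pred_csv, labels_csv, key_col="item_id",
                      pred_col="prediction", label_col="label"):
    """Join predictions to labels on key_col and report accuracy plus join coverage."""
    preds = load_column(pred_csv, key_col, pred_col)
    labels = load_column(labels_csv, key_col, label_col)
    shared = sorted(set(preds) & set(labels))
    correct = sum(preds[k] == labels[k] for k in shared)
    return {
        "n_scored": len(shared),
        "n_unmatched_predictions": len(preds) - len(shared),
        "n_unmatched_labels": len(labels) - len(shared),
        "accuracy": correct / len(shared) if shared else float("nan"),
    }
```

Reporting the unmatched counts alongside the score makes a silent key mismatch between the two files visible instead of letting it shrink the evaluated set unnoticed.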

Scope

This bundle is intended to let reviewers inspect:

the internal same-backbone probe implementations
the evaluation and aggregation logic
the routing and bootstrap scripts
the live external-judge prompting scripts and templates

The primary released evaluation resource is hosted separately under the OpenReview Dataset URL and documented there through README.md, EVALUATION_CARD.md, and croissant.json.
