Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
TL;DR We introduce VAB (Visual Aesthetic Benchmark), a benchmark that tests whether frontier AI models can make fine-grained aesthetic judgments the way human experts do. Instead of scalar ratings, VAB uses pairwise and set-based comparisons across fine art, photography, and illustration, grounded in 13,000+ expert assessments. The best model (Claude Sonnet 4.6, 26.5%) still falls far short of the 68.9% human expert baseline. Models struggle most with illustration, degrade sharply as candidate sets grow, and remain sensitive to option ordering.
Links
- Leaderboard: https://vab.bakelab.ai/#leaderboard
- Data: https://huggingface.co/datasets/BakeLab/Visual-Aesthetic-Benchmark
- Evaluation code: https://github.com/BakeLab/Visual-Aesthetic-Benchmark
- Arena: https://vab.bakelab.ai/arena
Introduction
In Greek mythology, a single golden apple inscribed with Kallisti, "to the fairest," ignited a decade-long war. It was an early recognition that aesthetic judgment is contested, irreducibly comparative, and deeply human. That a question as simple as which is more beautiful? could unravel an empire reflects the weight we place on taste. Beauty, it seems, has always resisted reduction to any fixed formula.
For most of history, that resistance felt safe. Aesthetics were argued over in salons and symposiums, settled by critics and connoisseurs. The stakes were human, the judges were human, and even when they were wrong, their wrongness was legible. We could trace the reasoning, contest the verdict, appeal to another court of taste.
Yet today, we are asking frontier models to pick up that same golden apple. AI systems now decide which images surface in a search, which visual styles propagate across platforms, which generated works get shown. They filter, rank, and recommend at a scale no human curator ever could. Beyond curation, they are increasingly used as reward models, their preferences quietly shaping what the next generation of models learns to create. The question is no longer whether machines will participate in aesthetic judgment. They already do. The question is whether they are actually judging, or simply averaging.
Evaluating Models Beyond the Objective Trap
Getting aesthetic evaluation right matters. But here is the difficulty: the golden apple was never inscribed with a score. Yet this is precisely what most existing benchmarks do. In pursuit of objectivity, they collect scalar ratings from crowds of annotators, average the numbers, and call the result ground truth. However, what a large enough crowd scores highly is not necessarily what an expert finds meaningful, or even interesting. Worse, scalar ratings strip away the one thing that makes aesthetic comparison legible: context. We do not look at two photographs in isolation and ask how good each one is. We look at them together, against a shared subject, and we feel which one gets closer to something true.
This is why we built VAB, the Visual Aesthetic Benchmark. Rather than running away from subjectivity, we chose to anchor it in structure and expertise.
We do not ask models to rate an image out of ten. We ask them to choose the best and the worst among a group of candidates. Put several things side by side, and humans almost always sense where the top and bottom lie. We ask models to do the same.
Every comparison in VAB is constrained within a shared topic. Within that topic, the differences are deliberate and fine-grained. Two portraits of the same subject may be separated by little more than a shift in light or a choice of angle. When content is held constant and the quality gap is small, the model must be able to tell the difference.
We worked with domain experts: photographers, illustrators, and artists. Not every judgment made the cut. We kept only the comparisons where independent expert assessments converged. What remains reflects the shape of expert consensus, not any single expert's taste.
How We Build VAB
Domain Coverage
VAB spans three visual domains: artwork, photography, and illustration (and we are adding more). Each has its own aesthetic conventions and failure modes, and competence in one does not transfer to another.
Evaluation Metrics
Raw accuracy conflates genuine judgment with positional bias. To avoid this, we test each comparison three times with randomly shuffled option orders.
This yields two metrics:
pass^3: A task is scored as correct only if the model answers correctly on all three shuffled orderings. The final score is the fraction of tasks where this holds. This is a strict measure: a single failure on any ordering zeros out that task.
ap@1: The accuracy is computed separately for each of the three orderings, then averaged.
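As a sketch, assuming the per-ordering results are stored as one boolean per task per ordering (the data layout and function names here are illustrative, not taken from our released evaluation code), the two metrics reduce to a few lines:

```python
from statistics import mean

def pass_cubed(results: list[list[bool]]) -> float:
    """Fraction of tasks answered correctly on ALL shuffled orderings.

    `results[t][o]` is True if the model answered task `t` correctly
    under ordering `o`. A single failed ordering zeros out the task.
    """
    return mean(all(orderings) for orderings in results)

def ap_at_1(results: list[list[bool]]) -> float:
    """Accuracy computed separately per ordering, then averaged."""
    n_orderings = len(results[0])
    per_ordering = [mean(task[o] for task in results) for o in range(n_orderings)]
    return mean(per_ordering)

# Two tasks, three orderings each (illustrative data):
results = [[True, True, True], [True, False, True]]
print(pass_cubed(results))  # 0.5 -- the second task fails one ordering
print(ap_at_1(results))     # ≈ 0.833
```

The gap between the two numbers on the same toy data shows why pass^3 is the stricter metric: one positional slip costs the whole task.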
Task Settings
VAB evaluates models under two task settings. In Top-1, a model is asked to identify the best image in a set. In Top & Bottom 1 (TB-1), it must identify both the best and the worst. The second setting is stricter: it demands that a model hold a coherent aesthetic ordering across the full range of a set, not just recognize a standout.
We compare model performance against two baselines: Expert Baseline, reflecting the average of expert-level aesthetic judgment, and Random Guess, the expected accuracy of selecting answers uniformly at random.
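Under the natural reading of Random Guess as uniform selection (our formalization here; the leaderboard may compute it empirically), the baseline has a closed form: Top-1 succeeds with probability 1/n for an n-image set, and TB-1 with probability 1/(n(n-1)), since the worst must be guessed among the remaining images after the best. A minimal sketch:

```python
from fractions import Fraction

def random_top1(n: int) -> Fraction:
    """Chance of guessing the single best image among n candidates."""
    return Fraction(1, n)

def random_tb1(n: int) -> Fraction:
    """Chance of guessing both best and worst: 1/n for the best,
    then 1/(n - 1) for the worst among the remaining images."""
    return Fraction(1, n * (n - 1))

for n in (2, 3, 4, 6):
    print(n, float(random_top1(n)), float(random_tb1(n)))
```

Note that under pass^3 an independent guesser must succeed on all three orderings, so its expected score is the cube of these probabilities.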
Expert Data and Judgment Collection
VAB separates creation from evaluation by design. The entire pipeline, from artwork collection to expert annotation, is powered by Proof, the expert data infrastructure built by Bake AI.
Creators. We work with a network of 1,000+ expert artists, ranging from mid-career professionals to specialists with 20+ years in their domain, who contributed 2,000+ hours of commissioned production. We deliberately mix strong work with merely competent work so the benchmark can actually discriminate. Most pieces were commissioned fresh to avoid contamination from public sources.
Judges. We invited 100+ independent evaluators, blind to artist identity, producing 13,000+ assessments over 300+ hours. Each comparison is reviewed by 10 judges who score detailed rubrics (composition, color, technique, etc.) before making a final decision. We discard cases where experts genuinely disagree, keeping only comparisons with clear consensus so the ground truth is as close to golden as possible.
Concretely, each comparison set is reviewed by 10 judges. For a 2-image set, we keep it when at least 8 judges agree on the best (worst is implied). For sets with 3 or more images, we require strong majority agreement on both the best and the worst image. The agreement threshold is tuned per set size so that the probability of passing the filter by pure chance stays below 1%. For example, a 3-image set requires 7 agreeing judges (null pass probability 0.88%), while a 6-image set only needs 6 (null pass probability 0.03%) because agreement among more options is inherently harder to achieve by luck.
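The chance-agreement calculation behind these thresholds is a binomial tail. The sketch below uses one simplified null model, in which every judge independently votes for a (best, worst) pair uniformly at random; the benchmark's thresholds were tuned against its own null model, so this illustrates the shape of the calculation rather than reproducing the quoted percentages.

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def null_pass_prob(n_images: int, n_judges: int = 10, threshold: int = 7) -> float:
    """Chance that some (best, worst) pair collects >= threshold votes
    from n_judges judges who all vote uniformly at random.

    When threshold > n_judges / 2, at most one pair can reach it, so the
    per-pair tail probabilities add exactly (no union bound needed).
    """
    assert threshold > n_judges / 2
    n_pairs = n_images * (n_images - 1)  # ordered (best, worst) pairs
    return n_pairs * binom_tail(n_judges, threshold, 1 / n_pairs)

# Larger sets make chance agreement rapidly less likely,
# which is why the threshold can be relaxed as set size grows:
print(null_pass_prob(3, threshold=7))
print(null_pass_prob(6, threshold=6))
```

Under this toy model the probability of passing by luck drops steeply with set size, matching the intuition in the text that agreement among more options is harder to achieve by chance.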
Data Collection by Domain
All images are grouped into sets that share the same subject but differ in execution quality. This keeps content constant so the only variable is how well something is done, not what it depicts.
1) Artwork
426 painting sets, 1,126 paintings, 9 topics (calligraphy, Chinese painting, ink & wash, landscape color, portrait color, portrait sketch, quick sketch, still life color, still life sketch). For each topic, a constrained prompt fixes the subject and key compositional requirements. Artists then produce independent renditions of the same subject, so each set holds intent constant but differs in aesthetic choices: composition, value structure, color harmony, and paint handling. All works were commissioned fresh to avoid contamination from publicly indexed art.
2) Photography
670 sets, 1,809 images, 9 topics (architecture, food/product, landscape, macro, night/astro, portrait, sports, street/city, wildlife). Each set starts from a single source photo and produces controlled variants via two pipelines. In the expert-edit pipeline, photographers take photos with clear aesthetic flaws and produce improved versions through recomposition, color and tone correction, and content-aware expansion. In the automated pipeline, an agent generates better/worse variants using image-to-image models while preserving semantic content. All sets are deduplicated and expert-reviewed before annotation.
3) Illustration
250 sets, 880 images, 6 topics (anime/manga, comic, concept art, digital/AI art, pixel art, stylized 3D). Two pipelines produce the data. In the generative pipeline, prompts are built from modular components that lock down subject, scene, camera angle, and lighting, so the only thing that varies across a set is aesthetic execution. In the 3D pipeline, scanned artworks are rendered from multiple viewpoints under fixed lighting and background, creating sets where quality differences come purely from the original work rather than rendering artifacts. All sets are reviewed by professional illustrators before annotation.
Final Benchmark
All candidate sets are sent through the expert evaluation pipeline described above. After filtering for consensus, 400 tasks with 1,195 images remain:
| Domain | Tasks | Images |
|---|---|---|
| Fine Art | 161 | 458 |
| Illustration | 100 | 348 |
| Photography | 139 | 389 |
| Total | 400 | 1,195 |
Results and Analysis
Full results are available on our Leaderboard.
Key Findings
Frontier models still fall far short of human aesthetic judgment. The top-performing model (Claude Sonnet 4.6, 26.5%) reaches less than half the 68.9% human expert baseline. Newer generations do not reliably close this gap: Claude Sonnet 4.5 → 4.6 improves from 14.5% to 26.5%, but the GPT-5 series declines monotonically from 21.8% (GPT-5) to 20.0% (GPT-5.1) to 15.5% (GPT-5.2).
A gap persists between proprietary and open-weight models. The strongest open model (Qwen 3.5-397B-A17B, 17.2%) trails Claude Sonnet 4.6 (26.5%) by over 9 points. Since pass^3 requires correct answers across all three permutation trials, weaker cross-permutation consistency in open models is amplified under this metric.
Difficulty varies sharply across domains. Illustration proves the most challenging, where the best model (Claude Sonnet 4.6) reaches only 19.0% against a 54.4% human baseline. Photography is comparatively tractable, with o4-mini achieving 30.2%. For fine art, Claude Sonnet 4.6 leads at 34.2% against a 74.7% human baseline.
Models lack robustness to positional permutation. The pass^3 metric inherently penalizes positional sensitivity: a model that guesses correctly on one ordering but fails on another will score zero on that task. The gap between models' ap@1 and pass^3 scores suggests that much of their performance is not robust to reordering.
Performance degrades with increasing candidate set size. For 2-image tasks, the best model achieves 47.3%; this drops to 6.7% for sets of 4 images, while human experts degrade only from 87.1% to 43.6%.
One More Thing
We are building our arena! Come play and put your own aesthetic judgment to the test: Try the Arena.
Citation
@misc{vab2026,
  title  = {Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?},
  author = {VAB Team},
  year   = {2026},
  url    = {https://vab.bakelab.ai/blog},
}

