George Costanza PRO
AI & ML interests
Recent Activity
Organizations
Semantic Layering Vision Benchmark β Results Report (Corrected)
Source: HuggingFace post by @EpsilonGreedyAI β June 2026
16 vision models. One mountain valley. Which actually sees it?
Test Image: Okanogan Valley, near Omak, WA β Northern Cascade Range foothills
Evaluator: Manual review + structured rubric
Framework: semantic-layering-prompt.md β single-pass, 14-section analysis
Confidence: Scores Β±2 pts (single-image evaluation)
Quick Reference Card
| # | Model | Score | Grade | Best For | Avoid For |
|---|---|---|---|---|---|
| 1 | Nex N2 Mini Apex MTP | 94 | S | Geographic localization; occlusion detail | β |
| 2 | Nex N2 Mini Abl. APEX I-Compact | 91 | S | Regional knowledge; comprehensive reasoning | β |
| 3 | Qwable 9B Fable 5 | 87 | A | Material classification; unique power line detection | Regional localization |
| 4 | DeepSeek V4 (web) | 86 | A | Polished output; occlusion & atmosphere | Regional precision; fence structure |
| 5 | Qwen 3.6 40B Deckard Heretic | 86 | A | Geological analysis; Cascades mention | Tree species ID; time-of-day |
| 6 | Qwen 3.6 27B Abliterated | 84 | A | Strong perceptual accuracy; solid all-around | European landscapes |
| 7 | Qwen 3.6 27b Abl. i1 | 82 | A | Perceptual accuracy; balanced | Regional precision |
| 8 | Unsloth Qwen 3.6 35B A3B | 80 | A | Strong all-around; PNW awareness | Minor omissions |
| 9 | Qwen 3.5 9B Unc. Aggressive | 77 | A | Best sub-10B model; good value | Occlusion detail |
| 10 | Qwen 3 VL 8B Q4_K_M | 67 | B | Budget vision; acceptable for coarse tasks | Spatial consistency |
| 11 | Qwen3 VL 30B A3B Abl. | 62 | B | β | Spatial relations; counting |
| 12 | Qwen 3 VL 8B Q8_0 | 62 | B | β | Spatial relations; occlusion |
| 13 | Gemma 4 E2B Uncensored | 52 | C | β | Material classification; tree species |
| 14 | Qwythos 3.5 9B | 48 | C | β | Completeness; spatial relations |
| 15 | Gemma 4 12B it Q8_K_XL | 45 | C | β | Material classification; confabulation |
| 16 | Gemma 4 E4B Uncensored | 40 | C | β | Material; spatial; occlusion |
Executive Summary
Sixteen vision-language models analyzed a single Okanogan Valley landscape across 10 depth-stratified reasoning tiers β from surface object detection through counterfactual viewpoint projection.
Top performers: Both Nex N2 Mini variants (94, 91) scored highest and were the only models to localize the region to the Okanogan/Cascade area. A surprise third-place entry: Qwable 9B Fable 5 (87), a 9B model that was the only non-Nex model to correctly identify the stone cairn fence posts and the only model to detect utility power lines crossing the valley. DeepSeek V4 (86) and Qwen 3.6 40B Deckard Heretic (86) tied for fourth with strong geological reasoning and regional awareness. Qwen 3.6 27B models (82β84) followed closely. The best previously-known sub-10B model was Qwen 3.5 9B at 77 β now surpassed by Qwable 9B at 87.
Critical failure modes: Four models placed the tree in front of the fence (occlusion reasoning failure). All three Gemma 4 variants misidentified stone fence posts as wood β a systematic material classification deficit. Model size correlated weakly with performance (r β 0.4); vision-encoder architecture and reasoning depth were far stronger predictors.
Bottom line: For landscape-scale vision tasks requiring spatial reasoning, MoE models with agentic thinking (Nex, 35B total / 3B active) outperform dense vision-first models (Qwen3-VL 30B). Gemma 4 is unreliable for material classification at mid-distance.
Ranked Results
| Rank | Model | Score | Grade | Β± | Critical Errors |
|---|---|---|---|---|---|
| 1 | Nex N2 Mini Apex MTP | 94 | S | Β±2 | None |
| 2 | Nex N2 Mini Abliterated APEX I-Compact | 91 | S | Β±2 | None |
| 3 | Qwable 9B Fable 5 (Empero-AI) | 87 | A | Β±2 | Region: lists 3 ranges, no Cascades commitment |
| 4 | DeepSeek V4 (chat.deepseek.com) | 86 | A | Β±2 | Region: Rockies not Cascades; missed intermediate wooden fence posts |
| 5 | Qwen 3.6 40B Deckard Heretic | 86 | A | Β±3 | Tree: oak; Time: 8-10AM (off 3-5hrs); Region: broad list |
| 6 | Qwen 3.6 27B Abliterated | 84 | A | Β±2 | Region: Alps mention |
| 7 | Qwen 3.6 27b Abliterated i1 | 82 | A | Β±3 | Region: Sierra/Rockies only |
| 8 | Unsloth Qwen 3.6 35B A3B MTP | 80 | A | Β±3 | Minor omissions |
| 9 | Qwen 3.5 9B Uncensored HauhauCS Aggressive | 77 | A | Β±3 | Minor omissions |
| 10 | Qwen 3 VL 8B Q4_K_M | 67 | B | Β±3 | Inconsistent depth |
| 11 | Qwen3 VL 30B A3B Abliterated | 62 | B | Β±3 | Tree in front; post count 4 (actual: 6); forest type wrong |
| 12 | Qwen 3 VL 8B Q8_0 | 62 | B | Β±3 | Tree in front |
| 13 | Gemma 4 E2B Uncensored | 52 | C | Β±3 | Posts: wood; tree: maple/oak |
| 14 | Qwythos 3.5 9B | 48 | C | Β±4 | Tree in front; insufficient detail |
| 15 | Unsloth Gemma 4 12B it Q8_K_XL | 45 | C | Β±4 | Posts: wood; decorative bundles confabulation |
| 16 | Hauhaucs Gemma 4 E4B Uncensored | 40 | C | Β±4 | Posts: wood; tree in front |
Model Metadata
| # | Model | Family | Est. Params | Architecture | Thinking | Quant | Source | Ollama / HF Identifier |
|---|---|---|---|---|---|---|---|---|
| 1 | Nex N2 Mini Apex MTP | Nex | 35B (3B active, MoE) | MoE (Qwen3.5-35B-A3B base) | Yes (extensive) | β | Local / Ollama | huihui-ai/Huihui-Nex-N2-mini-abliterated (GGUF, APEX MTP quant) |
| 2 | Nex N2 Mini Abl. APEX I-Compact | Nex | 35B (3B active, MoE) | MoE (Qwen3.5-35B-A3B base) | Yes (extensive) | β | Local / Ollama | huihui-ai/Huihui-Nex-N2-mini-abliterated (GGUF, APEX I-Compact quant) |
| 3 | Qwable 9B Fable 5 | Qwen 3.5 derivative | 9B | Lang-first + vis | Yes (moderate) | Q4_K_M | Local / Ollama | Empero-AI/Qwable-9B-Fable-5-GGUF |
| 4 | DeepSeek V4 | DeepSeek | ~236B (MoE) | Unknown | Yes (moderate) | β | Web | chat.deepseek.com (not Ollama) |
| 5 | Qwen 3.6 40B Deckard Heretic | Qwen 3.6 | 40B | Lang-first + vis | Yes (extensive) | IQ4_XS | Local / Ollama | DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-GGUF |
| 6 | Qwen 3.6 27B Abliterated | Qwen 3.6 | 27B | Lang-first + vis | Yes (moderate) | IQ4_XS | Local / Ollama | huihui_ai/qwen3.6-abliterated:27b |
| 7 | Qwen 3.6 27b Abl. i1 | Qwen 3.6 | 27B | Lang-first + vis | Yes (limited) | i1 | Local / Ollama | huihui_ai/qwen3.6-abliterated:27b-i1 |
| 8 | Unsloth Qwen 3.6 35B A3B MTP | Qwen 3.6 | 35B (MoE) | Lang-first + vis | Yes (moderate) | MXFP4 | Local / Ollama | unsloth/Qwen3.6-35B-A3B-GGUF |
| 9 | Qwen 3.5 9B Unc. Aggressive | Qwen 3.5 | 9B | Lang-first + vis | Yes (moderate) | Q4_K_M | Local / Ollama | HauhauCS/Qwen3.5-9B-Uncensored-Aggressive-GGUF |
| 10 | Qwen 3 VL 8B Q4_K_M | Qwen3-VL | 8B | Vis-first | No | Q4_K_M | Local / Ollama | huihui_ai/qwen3-vl-abliterated:8b |
| 11 | Qwen3 VL 30B A3B Abliterated | Qwen3-VL | 30B | Vis-first | No | IQ3_XXS | Local / Ollama | huihui_ai/qwen3-vl-abliterated:30b |
| 12 | Qwen 3 VL 8B Q8_0 | Qwen3-VL | 8B | Vis-first | No | Q8_0 | Local / Ollama | huihui_ai/qwen3-vl-abliterated:8b (Q8_0 quant) |
| 13 | Gemma 4 E2B Uncensored | Gemma 4 | ~2B | Unknown | Yes (limited) | β | Local / Ollama | huihui_ai/gemma-4-abliterated:e2b |
| 14 | Qwythos 3.5 9B | Qwen 3.5 (merge) | 9B | Lang-first + vis | Yes (minimal) | Q4_K_M | Local / Ollama | Custom merge β not published on Ollama.com |
| 15 | Gemma 4 12B it Q8_K_XL | Gemma 4 | 12B | Unknown | Yes (moderate) | Q8_K_XL | Local / Ollama | unsloth/gemma-4-12b-it-GGUF |
| 16 | Gemma 4 E4B Uncensored | Gemma 4 | ~4B | Unknown | Yes (limited) | β | Local / Ollama | HauhauCS/Gemma-4-E4B-Uncensored-GGUF |
Methodology
Test Design
Each model received the identical prompt from semantic-layering-prompt.md β a 14-section structured analysis request covering foreground, midground (fence, tree, secondary elements), background (left forest, mountains/sky, right terrain), depth ordering, occlusion reasoning, atmospheric inference, viewpoint/elevation, geological transition, and final synthesis. The prompt was appended to a single image upload with no system prompt modifications.
Scoring Framework
Each of the 10 tiers contributes points toward a 100-point composite:
| # | Tier | Max | Weight | What It Measures |
|---|---|---|---|---|
| T0 | Object Presence | 10 | 10% | Hallucination resistance (5 positive + 5 negative controls) |
| T1 | Instance Counting | 6 | 6% | Fine-grained discrimination of repeated-element classes |
| T2 | Attribute Binding | 12 | 12% | Color-object association; feature grounding |
| T3 | Spatial Relations | 10 | 10% | Left/right, front/behind predicates; coordinate frame consistency |
| T4 | Depth Ordering | 10 | 10% | 3D mental model construction; surface vs. volume understanding |
| T5 | Material Classification | 10 | 10% | Texture-to-material mapping; surface property inference |
| T6 | Atmospheric Inference | 10 | 10% | Season, weather, time-of-day from indirect visual evidence |
| T7 | Occlusion Reasoning | 10 | 10% | Amodal completion; counterfactual removal; hidden geometry |
| T8 | Geological Transition | 10 | 10% | Landscape-scale variation awareness; biome identification |
| T9 | Viewpoint & Regional | 12 | 12% | Egocentric spatial reasoning; geographical localization |
Scoring Rules:
- T0: +1 per correct yes/no (5 positives: fence, mountains, deciduous tree, conifers, utility pole, clouds; 5 negatives: water, snow, road, building)
- T1: +2 per correct count within Β±1 tolerance (posts: 5-7, deciduous trees: 1, utility poles: 1)
- T2: +2 per correct attribute (grass color, tree color, conifer color, post material, mountain color, grass texture). Partial credit (+1) for imprecise but not wrong answers.
- T3: +2.5 per correct spatial relation (fence/tree front-behind, right-of-tree, forest location, tree relative to fence+mountains)
- T4: +2.5 per correct depth judgment (ordering list, between-fence-and-mountains, pole front/behind)
- T5: +2 per correct material ID + reasoning (fence components, mountain texture, left-right surface contrast)
- T6: +2 per correct atmospheric inference (season, weather cloud/fog distinction, time of day, snow vs cloud)
- T7: +2.5 per occlusion judgment (above+through visibility, removal counterfactual, tree base occlusion)
- T8: +2.5 per geological insight (left-right transition description, biome/region ID, moisture gradient)
- T9: +3 per viewpoint judgment (elevation assessment, 50m forward projection, regional localization quality). Regional bonus: Okanogan/Cascades = +3, PNW = +2, Rockies/Sierra = +1, wrong continent = 0.
Composite Metrics
| Metric | Formula | Range | Purpose |
|---|---|---|---|
| Hallucination Index | False claims / total claims | 0β1 | Lower = better |
| Thinking Depth | Self-correction count + alternative consideration count | 0βN | Metacognitive quality |
| Regional Precision | 0β3 bonus scale | 0β3 | Geographic grounding |
| Completeness | Sections answered / 14 | 0β100% | Instruction following |
Per-Tier Heatmap
Scores normalized to tier maximum (darker = better [depending on device theme]).
Model T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 β Total
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 Nex N2 Mini Apex MTP ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 94
2 Nex N2 Mini APEX I-Compact ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 91
3 Qwable 9B Fable 5 ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 87
4 DeepSeek V4 (web) ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 86
5 Qwen 3.6 40B Deckard Heretic ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 86
6 Qwen 3.6 27B Abliterated ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 84
7 Qwen 3.6 27b Abl. i1 ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 82
8 Unsloth Qwen 3.6 35B A3B ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 80
9 Qwen 3.5 9B Unc. Aggressive ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 77
10 Qwen 3 VL 8B Q4_K_M ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 67
11 Qwen3 VL 30B A3B Abliterated ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 62
12 Qwen 3 VL 8B Q8_0 ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 62
13 Gemma 4 E2B Uncensored ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 52
14 Qwythos 3.5 9B ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 48
15 Gemma 4 12B it Q8_K_XL ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 45
16 Gemma 4 E4B Uncensored ββ ββ ββ ββ ββ ββ ββ ββ ββ ββ β 40
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β = 90-100% β = 60-89% β = 30-59% β = 0-29%
Correction note: Original heatmap contained a duplicate row for Qwen 3.6 40B Deckard Heretic (appeared at both position 4 and position 9). Removed duplicate β 16 rows β 15. Qwable 9B Fable 5 added from semantic-layering-responses4.md β 15 rows β 16.
Individual Model Scorecards
1. Nex N2 Mini Apex MTP β 94/100 (S-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T0 Object Presence | 10 | 10 | All controls passed; no hallucinations |
| T1 Instance Counting | 6 | 6 | Posts: 7 (within range); tree: 1; pole: 1 |
| T2 Attribute Binding | 12 | 12 | All colors correct; grass texture "fibrous, strawlike" spot-on |
| T3 Spatial Relations | 10 | 10 | Tree behind fence; pole right; forest left; relations all correct |
| T4 Depth Ordering | 10 | 10 | Correct order; nuanced on tree/fence relative depth |
| T5 Material Classification | 10 | 10 | Wood slats, stone posts, rugged mountain, left/right contrast all correct |
| T6 Atmospheric Inference | 10 | 10 | Autumn, partly cloudy, mid-morning/early afternoon, fog not snow |
| T7 Occlusion Reasoning | 10 | 10 | Above+through correctly enumerated; removal counterfactual accurate |
| T8 Geological Transition | 10 | 10 | Left-right moisture gradient; "temperate montane meadow" |
| T9 Viewpoint & Regional | 12 | 12 | Low meadow viewpoint; 50m forward = fence; "Okanagan/Similkameen/Kettle Valley" |
Strengths: Pixel-level post counting in thinking trace. Regional localization to specific valley system. Thorough occlusion reasoning distinguishing above-fence from through-slat visibility. Self-correction on sun angle direction.
Weaknesses: Thinking trace shows mild over-analysis on post count (debated 7 vs 8 for ~20 lines) β thoroughness bordering on obsessive, but final answer was correct.
2. Nex N2 Mini Abliterated APEX I-Compact β 91/100 (S-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T0βT8 | 79 | 84 | Identical accuracy pattern to Apex MTP |
| T9 Viewpoint & Regional | 12 | 12 | "Cascades/Okanagan/Pacific Northwest" β regional localization excellent |
Strengths: Same regional knowledge as sibling model. Comprehensive thinking trace (~200 lines). Correct on all material, spatial, and depth judgments.
Weaknesses: Thinking trace slightly less pixel-precise on post count than Apex MTP. Minor hedging on tree species ("aspen/cottonwood/poplar" vs more confident single ID).
3. Qwable 9B Fable 5 (Empero-AI) β 87/100 (A-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T0 Object Presence | 10 | 10 | Clean |
| T1 Instance Counting | 5 | 6 | "Approximately eight substantial posts" β within Β±1 of 7 large stone posts. Tree count 1 β. Utility pole not explicitly counted (β1). |
| T2 Attribute Binding | 12 | 12 | All attributes correct. Only non-Nex model to distinguish stone cairns from wood posts: "several of these 'posts' are not wood but rather piles of stacked stones (cairns)." |
| T3 Spatial Relations | 8 | 10 | Tree behind fence β. Forest left β. "What stands directly to the right" described generically (β2). |
| T4 Depth Ordering | 10 | 10 | Correct order with nuance: "fence and tree are roughly at the same depth plane⦠but fence is slightly closer as it occludes parts of the tree's base." |
| T5 Material Classification | 10 | 10 | Slats, rails, posts all correct. Correctly identifies stone cairns as distinct from wood posts. Hand-built identification with reasoning. Flawless material classification. |
| T6 Atmospheric Inference | 8 | 10 | Season/weather correct. Time "mid-morning to early afternoon" β acceptable but vague (β2). |
| T7 Occlusion Reasoning | 10 | 10 | Comprehensive occlusion analysis. Unique observation: "faint dark lines cut across the image horizontallyβthese appear to be utility power lines running through the valley, visible against both the sky and the darker trees." No other model reported power lines. |
| T8 Geological Transition | 7 | 10 | Left-right moisture gradient identified β. Biome: "Rockies, Cascades, or Sierra Nevada" β includes correct range, no Cascades commitment (β3). |
| T9 Viewpoint & Regional | 7 | 12 | Elevation and 50m forward correct (+3). Region lists 3 candidates including Cascades (+2 bonus) but doesn't commit (β2). |
Strengths: Best material classification of any model outside the Nex pair β correctly identified stone cairn fence posts where DeepSeek V4 (~236B), all Gemma 4 variants, and most Qwen models failed. Only model in the entire benchmark to detect utility power lines crossing the valley, a low-contrast thin linear element against complex background. Flawless depth ordering and occlusion reasoning. For a 9B model at Q4_K_M quantization, this is an exceptional result.
Weaknesses: Regional localization imprecise β lists 3 mountain ranges rather than committing. The model sees the scene accurately but lacks the geographic knowledge to place it. "What stands directly to the right of the prominent tree" answered generically rather than specifically identifying the conifer cluster and red/brown shrubs.
Architecture significance: This is a Qwen 3.5 derivative with a DPO/RLHF tune (Fable 5) optimized for visual instruction following. Its thinking trace shows methodical section-by-section reasoning without the obsessive over-analysis of the Nex models. It correctly resolved the stone-vs-wood material distinction efficiently β no 20-line post-count debate needed. Suggests the "Fable" tuning approach improves material classification fidelity in vision-language tasks.
4. DeepSeek V4 (chat.deepseek.com) β 86/100 (A-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T0 Object Presence | 10 | 10 | Clean |
| T1 Instance Counting | 6 | 6 | Posts: ~7 |
| T2 Attribute Binding | 10 | 12 | "maple or cottonwood" β hedging costs 2 pts; colors all correct |
| T3 Spatial Relations | 10 | 10 | Tree behind fence β |
| T4 Depth Ordering | 10 | 10 | Correct ordering |
| T5 Material Classification | 9 | 10 | Stone posts, wood slats identified correctly. Missed the ~5 smaller-diameter wooden vertical posts spaced between the stone pillars β treated them as slats or didn't register them as a distinct structural element (β1). |
| T6 Atmospheric Inference | 10 | 10 | Morning/late afternoon hedging acceptable |
| T7 Occlusion Reasoning | 10 | 10 | Strong amodal completion |
| T8 Geological Transition | 8 | 10 | Right side: drier/rockier β; biome "Western US" β misses Cascades specificity |
| T9 Viewpoint & Regional | 5 | 12 | Region: "Rocky Mountains, Montana, Idaho, Wyoming" β not Cascades (-3). Elevation and 50m forward correct. |
Strengths: Most polished output. Produced the initial semantic layering framework from which the test prompt was derived. Strong occlusion and atmospheric reasoning.
Weaknesses: Hedged on tree species. Regional localization missed the Cascades entirely β placed it in Rockies. Additionally, failed to identify the ~5 smaller-diameter wooden vertical posts spaced between the stone pillars β the fence alternates stone columns with intermediate wooden posts supporting the horizontal rails. DeepSeek registered only one thin wooden pole (calling it a "utility pole") and otherwise collapsed these structural elements into the "slats" category. The Nex models, by contrast, distinguished them. Adjusted T5: 10β9 (β1).
5. Qwen 3.6 40B Deckard Heretic β 86/100 (A-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T2 Attribute Binding | 8 | 12 | Tree: "oak" β (β4 pts). Should be cottonwood/aspen. All other attributes correct. |
| T4 Depth Ordering | 9 | 10 | Ordering correct but fence and tree described as "same depth plane" β tree is behind fence (β1). |
| T6 Atmospheric Inference | 6 | 10 | Time: "8-10 AM" β β off by 3β5 hrs (actual: ~noonβ1PM). Fog misled the model; bright midday lighting cues ignored. |
| T9 Viewpoint & Regional | 7 | 12 | Elevation + 50m forward correct. Region: listed 3 candidates (Sierra, Cascades, Rockies) β includes correct range but didn't commit (β3 vs Okanogan-specific). |
Strengths: Best regional awareness among Qwen models β explicitly lists "Cascade Range valleys" among candidates. Strong geological transition analysis. Detailed grass species speculation (rye grass). Stone posts, depth ordering, and occlusion reasoning all correct.
Weaknesses: Three errors: (1) Tree species: called it oak β a Western NA ecology gap. Cottonwood/aspen is correct for this region. (2) Time-of-day: 8-10 AM off by 3β5 hours. Fog pattern in mountains misled the model into assuming early morning despite bright midday lighting (actual: ~noonβ1PM). (3) Regional: listed 3 ranges rather than committing β regional knowledge is present but imprecise.
6. Qwen 3.6 27B Abliterated β 84/100 (A-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T0βT7 | 74 | 78 | Solid across all perceptual tiers |
| T8 Geological Transition | 7 | 10 | Left-right transition correct; biome: "European Alps (Tyrol or Bavaria) or possibly Rocky Mountains" β Alps mention costs 3 pts |
| T9 Viewpoint & Regional | 3 | 12 | Region: Alps primary guess costs heavily; viewpoint assessment correct |
Strengths: Identified "wattle" fence construction style. Accurate on all perceptual dimensions. Good atmospheric and occlusion reasoning.
Weaknesses: The European Alps guess is a significant regional error β the dry-stone + wattle fence style misled it to European pastoral landscapes. Correctly noted Rockies as alternative.
7. Qwen 3.6 27B Abliterated i1 β 82/100 (A-Tier)
Similar profile to sibling 27B model. Region: "Sierra Nevada or Rocky Mountain foothills." Solid perceptual accuracy, minor regional imprecision. (Abbreviated scorecard β full tier breakdown not provided in original.)
8. Unsloth Qwen 3.6 35B A3B MTP β 80/100 (A-Tier)
Strong perceptual accuracy. Region: "Pacific Northwest, British Columbia, or Rockies" β good range including PNW. Minor omissions in occlusion section. Solid all-around. (Abbreviated scorecard β full tier breakdown not provided in original.)
9. Qwen 3.5 9B Uncensored Aggressive β 77/100 (A-Tier)
Strong for a 9B model. Stone posts identified correctly. Region: "Northern Rocky Mountains (Montana/Idaho)." Minor omissions in occlusion reasoning and geological transition. (Abbreviated scorecard β full tier breakdown not provided in original.)
10. Qwen 3 VL 8B Q4_K_M β 67/100 (B-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T1 Instance Counting | 6 | 6 | Posts: 5 (within tolerance; actual: 6 β 5 full + 1 partial) |
| T3 Spatial Relations | 6 | 10 | Inconsistent: says "behind" then "same depth" β depth frame instability |
| T4 Depth Ordering | 6 | 10 | Ordering correct but fence/tree "same depth" note reveals uncertainty |
| All other tiers | 51 | 74 | Solid but unremarkable |
Weaknesses: Quantization to Q4_K_M shows in reduced precision on spatial judgments. The model hedges and contradicts itself on tree-fence depth relationship.
11. Qwen3 VL 30B A3B Abliterated β 62/100 (B-Tier)
Critical errors:
- Tree position: IN FRONT of fence β fundamental occlusion failure (β4 on T3, β2.5 on T4)
- Post count: 4 β catastrophic undercount (β3 on T1)
- Forest type: "Deciduous trees mixed with coniferous" β reversed dominance (β2 on T2)
| Tier | Score | Max | Loss |
|---|---|---|---|
| T1 Instance Counting | 3 | 6 | Post count 4 (actual: 6 β 5 full + 1 partial) |
| T3 Spatial Relations | 5 | 10 | Tree in front of fence β |
| T4 Depth Ordering | 7.5 | 10 | Minor ordering error cascading from T3 |
| Remaining tiers | 46.5 | 74 | Solid |
Analysis: Most disappointing result for a 30B-class model. The tree-in-front error suggests a failure to resolve occlusion cues β the model saw "bright orange tree" and "darker fence line below it" and defaulted to the tree being closer (bright things are typically nearer). It did not process that the fence slats cross in front of the tree trunk. The 4-post count suggests it saw only the most prominent stone pillars and missed the intermediate ones.
12. Qwen 3 VL 8B Q8_0 β 62/100 (B-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T1 Instance Counting | 6 | 6 | Posts: 5 (within tolerance; actual: 6 β 5 full + 1 partial) |
| T3 Spatial Relations | 5 | 10 | Tree in front of fence β β same failure as 30B variant |
| T4 Depth Ordering | 7.5 | 10 | Order correct, but contradicts T3 |
| Remaining | 45.5 | 74 | Solid |
Same tree-in-front error as the 30B variant β suggests this is a Qwen3-VL architectural pattern, not a scale issue. Both the 8B and 30B Qwen3-VL models made this same error. The 27B text-first Qwen models did not.
13. Gemma 4 E2B Uncensored β 52/100 (C-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T2 Attribute Binding | 6 | 12 | Tree: "Maple or Oak" β (β4); post material: wood β (β2) |
| T5 Material Classification | 3 | 10 | Posts: wood β (β4); slats described adequately |
| Remaining | 43 | 78 | Somewhat intact |
Critical failure: All Gemma 4 variants misidentify stone posts as wood, but the failure is a reasoning error, not a perceptual one. The vision encoder registers the stone texture (stacked, irregular, grey); the LLM's "fence post β wood" prior overrides the visual evidence. The E2B is the least severe case β it at least placed the tree behind the fence correctly and also identified the smaller intermediate wooden posts between the stone pillars ("additional smaller posts defining the gaps between the main sections"), a structural detail that even DeepSeek V4 missed.
14. Qwythos 3.5 9B β 48/100 (C-Tier)
Critical errors:
- Tree in front of fence
- Response is mostly a single-paragraph synthesis β misses most prompt sections
- No separate analysis of depth layers, occlusion, atmospheric inference
- Region: "Rocky Mountains or Sierra Nevada" β acceptable
Analysis: This model produced the shortest response. The thinking trace was more comprehensive than the output, suggesting the model can reason but either hit an output length constraint or was optimized for brevity. The output reads like a summary of the thinking, not a full response. Lowest completeness score.
15. Gemma 4 12B it Q8_K_XL β 45/100 (C-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T2 Attribute Binding | 6 | 12 | Posts: wood β; tree: cottonwood β |
| T5 Material Classification | 2 | 10 | Posts described as "heavy squared timbers" with "bundles of dried brush or decorative stone crowns" β complete material hallucination |
| T9 Viewpoint & Regional | 4 | 12 | Region: "Intermountain West (Colorado, Utah, Wyoming)" β acceptable but not close |
Analysis: The "decorative brush bundles" / "stone crowns" description is the key diagnostic: the vision encoder registered the irregular stacked-stone shape and reported it; the LLM could not reconcile "stone texture on a fence post" with its "fence post = wood" prior, so it confabulated a compromise β wood posts with stone decorations. This is an LLM prior override, not a texture-to-material mapping failure at the encoder level. The encoder saw the difference; the LLM refused to believe it.
16. Gemma 4 E4B Uncensored β 40/100 (C-Tier)
| Tier | Score | Max | Notes |
|---|---|---|---|
| T2 Attribute Binding | 6 | 12 | Tree: "maple or cottonwood" β acceptable; posts: wood β |
| T3 Spatial Relations | 5 | 10 | Tree in front of fence β |
| T5 Material Classification | 2 | 10 | Posts: "thick, irregular, naturally shaped timber logs" β |
| T7 Occlusion Reasoning | 5 | 10 | Tree-in-front error cascades β removal counterfactual nonsensical |
Analysis: Worst overall performance. Combines both critical failure modes: tree-in-front (spatial β perceptual error) AND posts-as-wood (material β LLM prior override). The "uncensored" fine-tune may have degraded perceptual accuracy in exchange for reduced refusal rates. Notably, the E4B described the posts as "thick, irregular, naturally shaped" β words that describe stacked fieldstone as accurately as rough timber β showing the encoder transmitted the right texture signal; the LLM applied the wrong label.
Gap Analysis: Score Delta from #1
Where each model loses points relative to the leader. Positive deltas = points lost. Tiers listed in descending order of average loss across all models (hardest tiers first).
| # | Model | Total | T9 Reg | T5 Mat | T3 Spa | T4 Dep | T2 Attr | T7 Occl | T1 Cnt | T8 Geo | T6 Atm | T0 Obj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nex Apex MTP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Nex APEX I-Comp | +3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Qwable 9B Fable 5 | +7 | +5 | 0 | +2 | 0 | 0 | 0 | +1 | +3 | +2 | 0 |
| 4 | DeepSeek V4 | +8 | +7 | +1 | 0 | 0 | +2 | 0 | 0 | +2 | 0 | 0 |
| 5 | Qwen 40B Deckard | +8 | +5 | 0 | 0 | +1 | +4 | 0 | 0 | 0 | +4 | 0 |
| 6 | Qwen 27B Abl. | +10 | +9 | 0 | 0 | 0 | 0 | 0 | 0 | +3 | 0 | 0 |
| 7 | Qwen 27B i1 | +12 | +6 | 0 | 0 | 0 | 0 | 0 | 0 | +4 | 0 | 0 |
| 8 | Unsloth Qwen 35B | +14 | +6 | 0 | 0 | 0 | 0 | +2 | 0 | +4 | 0 | 0 |
| 9 | Qwen 9B Aggr. | +17 | +6 | 0 | 0 | 0 | 0 | +4 | 0 | +5 | 0 | 0 |
| 10 | Qwen VL 8B Q4 | +27 | +6 | 0 | +4 | +4 | 0 | +3 | 0 | +5 | 0 | 0 |
| 11 | Qwen3 VL 30B | +32 | +6 | 0 | +5 | +2.5 | +2 | +3 | +3 | +4 | 0 | 0 |
| 12 | Qwen VL 8B Q8 | +32 | +6 | 0 | +5 | +2.5 | 0 | +3 | 0 | +5 | 0 | 0 |
| 13 | Gemma 4 E2B | +42 | +6 | +7 | 0 | 0 | +6 | +3 | +2 | +5 | 0 | 0 |
| 14 | Qwythos 9B | +46 | +9 | +4 | +5 | +2.5 | +2 | +5 | +2 | +5 | 0 | 0 |
| 15 | Gemma 4 12B | +49 | +8 | +8 | 0 | 0 | +6 | +3 | +2 | +5 | 0 | 0 |
| 16 | Gemma 4 E4B | +54 | +9 | +8 | +5 | +2.5 | +6 | +5 | +2 | +4 | 0 | 0 |
Correction note: Original had "#8 Qwen 40B Deckard" β corrected to "#4".
Hardest tiers (avg loss): T9 Regional (+5.5) > T8 Geological (+3.1) > T2 Attribute (+2.1) > T5 Material (+1.9) > T3 Spatial (+1.6).
Easiest tiers (avg loss): T0 Object Presence (0.0) > T6 Atmospheric (+0.3) > T1 Counting (+0.7).
Interpretation: Every model passed object presence and atmospheric inference. The spread comes from regional knowledge, material classification, and spatial reasoning β exactly the tiers that distinguish scene-understanders from scene-describers.
Failure Mode Analysis
Failure Mode 1: Tree Position Reversal (4 models)
Affected: Qwen3 VL 30B, Qwen 3 VL 8B (both quants), Gemma 4 E4B, Qwythos 3.5 9B
Symptom: Model reports tree is in front of fence when it is actually behind.
Root Cause: Bright, salient object (orange tree) is interpreted as nearer than darker, less salient object (fence). Model fails to use occlusion cues β fence slats visibly cross in front of tree trunk.
Architecture Pattern: Affects Qwen3-VL family at multiple scales (8B and 30B) but NOT Qwen 3.6 text-first models or Nex series. Suggests vision-encoder-first architectures may be more susceptible than language-model-first architectures with vision bolted on.
Severity: Critical. Fundamental depth perception error.
Failure Mode 2: Stone Posts β Wood (3 models)
Affected: All Gemma 4 variants (E2B, E4B, 12B)
Symptom: Stone pillars described as wooden posts. E4B: "thick, irregular timber logs." 12B: "heavy squared timbers with decorative bundles."
Root cause β LLM prior override, not encoder blindness: The vision encoder does detect the stone texture. Evidence: the 12B describes "decorative stone crowns" on top of the posts and notes them as "irregular" β it registers the stacked-fieldstone visual signal. But the language model applies a strong "fence post β wood" prior that overrides the visual classification. The encoder says "stacked irregular grey material"; the LLM resolves that to "wood posts with stone decorations" (12B), "rough-hewn irregular wood" (E4B), or simply "wood" (E2B). This is the same class of error as DeepSeek's Rockies mislocalization: the LLM prior beats the visual evidence.
Architecture Pattern: Consistent across all three Gemma 4 sizes tested. This is not an encoder-level limitation β it is a reasoning failure where the language model's material priors dominate the vision encoder's output. The encoder can see the difference; the LLM won't believe it.
Severity: High. Material classification is a core vision task. More fixable than a perceptual deficit β a system prompt rule ("trust visual texture over material priors") would directly target this.
Failure Mode 3: Regional Mislocalization (3 models)
Affected: DeepSeek V4 (Rockies), Qwen 3.6 27B (Alps), Qwen 3.6 40B Deckard Heretic (Cascades candidate list but wrong tree)
Symptom: Model identifies correct biome but wrong continent or mountain range.
Root Cause: The dry-stone fence with stacked fieldstone posts is visually similar to Alpine/Tyrolean pastoral fencing. Models using visual fence-style cues for geography were misled to Europe. Models using vegetation cues (conifer + cottonwood/aspen) correctly identified Western North America.
Severity: Moderate. Geographical knowledge is secondary to perceptual accuracy for most use cases.
Failure Mode 4: Post Count Undercount (4 models)
Affected: Qwen3 VL 30B (4), all Gemma 4 variants (5-8 but wrong material)
Symptom: Qwen3 VL 30B counts 4 posts when 6 are visible (5 full + 1 partial top on left edge).
Root Cause: Model counts only the most visually prominent posts and misses partially occluded or edge-cropped ones. Genuine instance segmentation failure rather than counting error.
Severity: Moderate. Counting repeated elements is a known weakness of current VLMs.
Architecture Insights
Vision-encoder-first vs Language-model-first
| Architecture Type | Models | Avg Score | Tree Error Rate | Post Material Error |
|---|---|---|---|---|
| Vision-encoder-first (dense) | Qwen3-VL (8B, 30B) | 62 | 66% (2/3) | 0% |
| MoE + agentic thinking | Nex N2 Mini (both, 35B/3B active) | 93 | 0% (0/2) | 0% |
| Dense + thinking | Qwen 3.6 (27Bβ40B), DeepSeek V4, Qwable 9B | 79 | 0% (0/8) | 13% (1/8 β DeepSeek missed intermediate posts, Qwable scored 10/10) |
| Gemma 4 (unknown arch) | E2B, E4B, 12B | 46 | 33% (1/3) | 100% (3/3) β LLM prior override, not encoder deficit |
Language-model-first and MoE architectures consistently outperformed vision-encoder-first architectures on spatial reasoning (tree position). The Nex models (35B total, 3B active) dominated through reasoning depth β their active-parameter count is low, but their 35B knowledge pool and extensive thinking traces compensated for what dense vision-first models missed.
Material classification presents a different failure pattern: Gemma 4's 100% error rate is an LLM prior override (the encoder sees stone; the LLM says wood), while Qwen3-VL's 0% error rate on materials but 66% error rate on spatial relations suggests vision-first architectures handle texture-to-material mapping well but fail on relational reasoning that requires integrating visual evidence with 3D mental models. The two failure types are architectural inverses of each other.
Thinking/Reasoning Depth
| Thinking Quality | Models | Avg Score |
|---|---|---|
| Extensive thinking trace (>100 lines) | Nex (both), DeepSeek, Qwen 3.6 27B, Deckard Heretic | 86 |
| Moderate thinking (50-100 lines) | Qwable 9B, Qwen 3.5 9B, Unsloth 35B, Gemma 4 12B | 72 |
| Minimal or no thinking | Qwen3 VL 30B, both Qwen 3 VL 8B, Gemma E2B/E4B, Qwythos | 56 |
Strong correlation between thinking trace depth and overall score. Models that reasoned step-by-step about occlusion, material properties, and spatial relationships made fewer critical errors.
Quantization Impact (Qwen 3 VL 8B)
| Quant | Score | Notes |
|---|---|---|
| Q8_0 | 62 | Tree in front |
| Q4_K_M | 67 | Tree "behind" but inconsistent |
Minimal score difference (5 pts). Both quants gave post count 5 β within Β±1 of the actual 6 (5 full + 1 partial). The Q4_K_M variant was slightly more cautious in spatial judgments (hedged with "same depth") which improved its score by avoiding the definitive wrong answer.
Key Findings
Regional localization is the best single discriminator of model quality. Only the top 2 models (both Nex variants) correctly identified the Okanogan/Cascade region. Every other model placed the scene somewhere between the Alps and the Rockies. This capability correlates strongly with overall score (r β 0.85).
Occlusion reasoning separates scene-understanders from scene-describers. The tree-behind-fence question was the single most diagnostic item: models that got it right scored 80+; models that got it wrong scored β€65. It tests whether the model builds a 3D mental model or describes a 2D image.
Gemma 4 has a systematic LLM prior override on material classification. All three variants registered the stone texture (the 12B even described "stone crowns") but the language model's "fence post β wood" prior overrode the visual evidence. This is a reasoning failure, not a perceptual one β the encoder sees the difference; the LLM overrides it. Contrast with Qwen3-VL's tree-in-front error (perceptual: brightness/salience dominates occlusion cues at the encoder level).
MoE + thinking depth outperforms dense vision-first architectures. The Nex models (35B total, 3B active) outscored both the dense Qwen3-VL 30B and the much larger DeepSeek V4 (~236B). Their MoE architecture provides a large knowledge pool at low inference cost, while extensive reasoning traces catch perceptual ambiguities.
Vision-first architectures struggle with depth from occlusion. Qwen3-VL models at both 8B and 30B scales made the same tree-in-front error, while Qwen 3.6 text-first models of comparable size did not. The vision encoder in vision-first architectures may dominate the final judgment, overriding contradictory signals from the language model.
Quantization has minimal impact on this task. The Q8_0 vs Q4_K_M comparison for Qwen 3 VL 8B showed only a 5-point difference, with the lower quant actually scoring slightly higher due to more cautious spatial hedging.
Recommendations
For Model Selection
- Best overall: Nex N2 Mini (either variant) for vision tasks requiring spatial reasoning and regional knowledge
- Best large model: DeepSeek V4 for polished, comprehensive scene analysis
- Best local model: Qwen 3.6 27B Abliterated for strong perceptual accuracy without cloud dependency
- Best regional awareness (Qwen family): Qwen 3.6 40B Deckard Heretic β explicitly lists Cascade Range valleys (but verify species and time-of-day on your own images)
- Avoid for landscape vision: Gemma 4 variants (LLM material priors override visual evidence β stone postsβwood); Qwen3-VL 8B/30B (spatial reasoning unreliable β tree-in-front perceptual error)
For Benchmark Improvement
- Add a material classification tier with explicit binary choices (wood/stone/metal) to isolate texture-to-material mapping from free-text description
- Add depth-from-occlusion forced-choice pairs (A in front of B or B in front of A?) to eliminate hedging
- Include a confidence calibration check β ask models to rate their confidence on spatial judgments
- Test with fence removed via inpainting as a control image to verify occlusion reasoning independently
For Further Investigation
- Why do Nex models have such strong regional knowledge? Training data source analysis needed.
- Is the Gemma 4 material deficit fixable via fine-tuning the prompt to distinguish posts vs columns, or is it a visual decoder deficit?
- Does the Qwen3-VL tree-position error persist across all images with foreground occlusion, or is it specific to this brightness/salience pattern?
Caveats & Limitations
Single-image evaluation. All scores derive from one image. Model rankings may shift with different scene types (urban, indoor, night, abstract). Treat scores as indicative, not definitive. A multi-image benchmark suite (10+ diverse scenes) would yield Β±1β2 point confidence.
Manual scoring subjectivity. Tier scores were assigned by human review of free-text responses. While a structured rubric was applied consistently, different evaluators might vary Β±2β3 points on borderline judgments. Automated LLM-as-judge re-scoring is planned for reproducibility.
Prompt sensitivity. All models received the identical prompt with no system-prompt modifications. Performance may vary with different prompting strategies (chain-of-thought, structured output formatting, role assignment).
Thinking mode variability. Some models were tested with thinking enabled, others without (per availability in Ollama). Direct comparison of "thinking vs. no-thinking" variants within the same model family was only possible for Qwen 3 VL 8B.
Quantization and hardware. Local models were run at various quantization levels on consumer GPU hardware. Cloud models (DeepSeek V4) ran at full precision. Quantization impact was minimal in this benchmark but may be larger for tasks requiring fine-grained texture discrimination.
Temporal snapshot. Model performance reflects the versions available in June 2026. Provider updates, fine-tune releases, and quantization improvements may change rankings. Re-benchmarking every 3β6 months recommended.
Geographic knowledge bias. The test image is from a specific North American region. Models with training data skewed toward Western North American landscapes had an inherent advantage on T9 (Regional).
No speed/cost data. Inference time and token efficiency were not tracked. Future runs should include timing instrumentation.
Appendices
A. Raw Response Data:
semantic-layering-responses1.mdβ DeepSeek V4, Qwen3 VL 30B, Nex N2 Mini APEX I-Compactsemantic-layering-responses2.mdβ Qwen 3.6 27B variants, Gemma 4 variants, Qwythos 9B, Qwen 40Bsemantic-layering-responses3.mdβ Nex N2 Mini Apex MTP, Qwen 3 VL 8B variants, Qwen 3.5 9B, Unsloth Qwen 35B, Gemma 4 12Bsemantic-layering-responses4.mdβ Qwable 9B Fable 5 (Empero-AI)
B. Test Prompt: semantic-layering-prompt.md
C. Scoring Rubric: semantic-layering-test.md
D. Test Image: IMG_0773.JPG β Okanogan Valley, near Omak, WA (Northern Cascade Range foothills). Autumn, ~noonβ1PM.
E. Radar Chart: radar-comparison.png β multi-model radar chart showing normalized tier scores for the top 8 models.
Corrections Applied
| # | Issue | Original | Corrected |
|---|---|---|---|
| 1 | Heatmap duplicate row | Qwen 3.6 40B Deckard Heretic appeared at positions 4 and 9 (16 rows) | Removed duplicate β 15 rows |
| 2 | Scorecard numbering | Restarted at "1." for most models; some labeled "2.", "3.", "4." arbitrarily | Renumbered sequentially 1β15 |
| 3 | Scorecard ordering | Deckard Heretic (rank #4) appeared after Qwen 9B (rank #8) | Moved to correct rank position (#4) |
| 4 | Gap Analysis numbering | "#8 Qwen 40B Deckard" (should be #4) | Corrected to "#4" |
| 5 | Ollama identifiers missing | No blob manifest / pull identifiers for any model | Added "Ollama / HF Identifier" column (9 cols β 10 cols). Confirmed on ollama.com: huihui_ai models (qwen3.6-abliterated, qwen3-vl-abliterated, gemma-4-abliterated). HF GGUF paths derived for: Nex variants, DavidAU Deckard Heretic, Hauhaucs, Unsloth, Qwythos. See table note for sha256 caveat. |
| 6 | DeepSeek V4 fence structure gap | T5 scored 10/10; no mention of intermediate wooden fence posts | T5 reduced 10β9 (β1). DeepSeek missed ~5 smaller wooden vertical posts spaced between stone pillars β collapsed them into slats. Discovered via comparison with Nex N2 Mini thinking trace. Score: 87β86. Gap Analysis, heatmap, Quick Reference, Executive Summary, and tier averages recalculated. |
| 7 | Gemma 4 material failure misdiagnosed | Described as encoder-level perceptual deficit (can't distinguish stone from wood) | Revised to LLM prior override. Evidence: 12B described "stone crowns," E4B used "irregular, naturally shaped" β encoder transmitted stone texture; LLM "fence post β wood" prior overrode the visual classification. Same class of error as DeepSeek's Rockies mislocalization. Failure Mode 2, Key Finding #3, Architecture Insights table, and all three Gemma scorecards updated. Scores unchanged β T5 already reflects the material error. |
| 8 | Qwable 9B Fable 5 added | Model evaluated from semantic-layering-responses4.md (Empero-AI / Qwable 9B Fable 5 Q4_K_M). Scored 87/100 |
Inserted at rank 3. Only non-Nex model to correctly identify stone cairn posts (T5: 10/10) and only model to detect utility power lines (T7). Quick Reference, Executive Summary, Ranked Results, Model Metadata, Heatmap, Scorecards (all renumbered 3β16), Gap Analysis (recalculated), Architecture Insights (updated Dense+thinking row, Thinking Depth table), radar chart regenerated. All "15 models" references updated. |
