Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
EpsilonGreedyAIΒ 
posted an update 8 days ago
Post
136
Did a little evaluation to confirm, a lot of these Fable and Opus distills don't stand up to even their base Qwen models. I have been consistently underwhelmed by anything related to Anthropic. Though there have been two notable exceptions so far: Qwable 5 9B from Empero AI and Qwen 3.6 40B Claude 4.6 Opus Deckard Heretic Uncensored Thinking NEO-CODE Di IMatrix MAX from DavidAU

Semantic Layering Vision Benchmark β€” Results Report (Corrected)

Source: HuggingFace post by @EpsilonGreedyAI β€” June 2026


16 vision models. One mountain valley. Which actually sees it?

Test Image: Okanogan Valley, near Omak, WA β€” Northern Cascade Range foothills
Evaluator: Manual review + structured rubric
Framework: semantic-layering-prompt.md β€” single-pass, 14-section analysis
Confidence: Scores Β±2 pts (single-image evaluation)


Quick Reference Card

# Model Score Grade Best For Avoid For
1 Nex N2 Mini Apex MTP 94 S Geographic localization; occlusion detail β€”
2 Nex N2 Mini Abl. APEX I-Compact 91 S Regional knowledge; comprehensive reasoning β€”
3 Qwable 9B Fable 5 87 A Material classification; unique power line detection Regional localization
4 DeepSeek V4 (web) 86 A Polished output; occlusion & atmosphere Regional precision; fence structure
5 Qwen 3.6 40B Deckard Heretic 86 A Geological analysis; Cascades mention Tree species ID; time-of-day
6 Qwen 3.6 27B Abliterated 84 A Strong perceptual accuracy; solid all-around European landscapes
7 Qwen 3.6 27b Abl. i1 82 A Perceptual accuracy; balanced Regional precision
8 Unsloth Qwen 3.6 35B A3B 80 A Strong all-around; PNW awareness Minor omissions
9 Qwen 3.5 9B Unc. Aggressive 77 A Best sub-10B model; good value Occlusion detail
10 Qwen 3 VL 8B Q4_K_M 67 B Budget vision; acceptable for coarse tasks Spatial consistency
11 Qwen3 VL 30B A3B Abl. 62 B β€” Spatial relations; counting
12 Qwen 3 VL 8B Q8_0 62 B β€” Spatial relations; occlusion
13 Gemma 4 E2B Uncensored 52 C β€” Material classification; tree species
14 Qwythos 3.5 9B 48 C β€” Completeness; spatial relations
15 Gemma 4 12B it Q8_K_XL 45 C β€” Material classification; confabulation
16 Gemma 4 E4B Uncensored 40 C β€” Material; spatial; occlusion

Executive Summary

Sixteen vision-language models analyzed a single Okanogan Valley landscape across 10 depth-stratified reasoning tiers β€” from surface object detection through counterfactual viewpoint projection.

Top performers: Both Nex N2 Mini variants (94, 91) scored highest and were the only models to localize the region to the Okanogan/Cascade area. A surprise third-place entry: Qwable 9B Fable 5 (87), a 9B model that was the only non-Nex model to correctly identify the stone cairn fence posts and the only model to detect utility power lines crossing the valley. DeepSeek V4 (86) and Qwen 3.6 40B Deckard Heretic (86) tied for fourth with strong geological reasoning and regional awareness. Qwen 3.6 27B models (82–84) followed closely. The best previously-known sub-10B model was Qwen 3.5 9B at 77 β€” now surpassed by Qwable 9B at 87.

Critical failure modes: Four models placed the tree in front of the fence (occlusion reasoning failure). All three Gemma 4 variants misidentified stone fence posts as wood β€” a systematic material classification deficit. Model size correlated weakly with performance (r β‰ˆ 0.4); vision-encoder architecture and reasoning depth were far stronger predictors.

Bottom line: For landscape-scale vision tasks requiring spatial reasoning, MoE models with agentic thinking (Nex, 35B total / 3B active) outperform dense vision-first models (Qwen3-VL 30B). Gemma 4 is unreliable for material classification at mid-distance.

Ranked Results

Rank Model Score Grade Β± Critical Errors
1 Nex N2 Mini Apex MTP 94 S Β±2 None
2 Nex N2 Mini Abliterated APEX I-Compact 91 S Β±2 None
3 Qwable 9B Fable 5 (Empero-AI) 87 A Β±2 Region: lists 3 ranges, no Cascades commitment
4 DeepSeek V4 (chat.deepseek.com) 86 A Β±2 Region: Rockies not Cascades; missed intermediate wooden fence posts
5 Qwen 3.6 40B Deckard Heretic 86 A Β±3 Tree: oak; Time: 8-10AM (off 3-5hrs); Region: broad list
6 Qwen 3.6 27B Abliterated 84 A Β±2 Region: Alps mention
7 Qwen 3.6 27b Abliterated i1 82 A Β±3 Region: Sierra/Rockies only
8 Unsloth Qwen 3.6 35B A3B MTP 80 A Β±3 Minor omissions
9 Qwen 3.5 9B Uncensored HauhauCS Aggressive 77 A Β±3 Minor omissions
10 Qwen 3 VL 8B Q4_K_M 67 B Β±3 Inconsistent depth
11 Qwen3 VL 30B A3B Abliterated 62 B Β±3 Tree in front; post count 4 (actual: 6); forest type wrong
12 Qwen 3 VL 8B Q8_0 62 B Β±3 Tree in front
13 Gemma 4 E2B Uncensored 52 C Β±3 Posts: wood; tree: maple/oak
14 Qwythos 3.5 9B 48 C Β±4 Tree in front; insufficient detail
15 Unsloth Gemma 4 12B it Q8_K_XL 45 C Β±4 Posts: wood; decorative bundles confabulation
16 Hauhaucs Gemma 4 E4B Uncensored 40 C Β±4 Posts: wood; tree in front

Model Metadata

# Model Family Est. Params Architecture Thinking Quant Source Ollama / HF Identifier
1 Nex N2 Mini Apex MTP Nex 35B (3B active, MoE) MoE (Qwen3.5-35B-A3B base) Yes (extensive) β€” Local / Ollama huihui-ai/Huihui-Nex-N2-mini-abliterated (GGUF, APEX MTP quant)
2 Nex N2 Mini Abl. APEX I-Compact Nex 35B (3B active, MoE) MoE (Qwen3.5-35B-A3B base) Yes (extensive) β€” Local / Ollama huihui-ai/Huihui-Nex-N2-mini-abliterated (GGUF, APEX I-Compact quant)
3 Qwable 9B Fable 5 Qwen 3.5 derivative 9B Lang-first + vis Yes (moderate) Q4_K_M Local / Ollama Empero-AI/Qwable-9B-Fable-5-GGUF
4 DeepSeek V4 DeepSeek ~236B (MoE) Unknown Yes (moderate) β€” Web chat.deepseek.com (not Ollama)
5 Qwen 3.6 40B Deckard Heretic Qwen 3.6 40B Lang-first + vis Yes (extensive) IQ4_XS Local / Ollama DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-GGUF
6 Qwen 3.6 27B Abliterated Qwen 3.6 27B Lang-first + vis Yes (moderate) IQ4_XS Local / Ollama huihui_ai/qwen3.6-abliterated:27b
7 Qwen 3.6 27b Abl. i1 Qwen 3.6 27B Lang-first + vis Yes (limited) i1 Local / Ollama huihui_ai/qwen3.6-abliterated:27b-i1
8 Unsloth Qwen 3.6 35B A3B MTP Qwen 3.6 35B (MoE) Lang-first + vis Yes (moderate) MXFP4 Local / Ollama unsloth/Qwen3.6-35B-A3B-GGUF
9 Qwen 3.5 9B Unc. Aggressive Qwen 3.5 9B Lang-first + vis Yes (moderate) Q4_K_M Local / Ollama HauhauCS/Qwen3.5-9B-Uncensored-Aggressive-GGUF
10 Qwen 3 VL 8B Q4_K_M Qwen3-VL 8B Vis-first No Q4_K_M Local / Ollama huihui_ai/qwen3-vl-abliterated:8b
11 Qwen3 VL 30B A3B Abliterated Qwen3-VL 30B Vis-first No IQ3_XXS Local / Ollama huihui_ai/qwen3-vl-abliterated:30b
12 Qwen 3 VL 8B Q8_0 Qwen3-VL 8B Vis-first No Q8_0 Local / Ollama huihui_ai/qwen3-vl-abliterated:8b (Q8_0 quant)
13 Gemma 4 E2B Uncensored Gemma 4 ~2B Unknown Yes (limited) β€” Local / Ollama huihui_ai/gemma-4-abliterated:e2b
14 Qwythos 3.5 9B Qwen 3.5 (merge) 9B Lang-first + vis Yes (minimal) Q4_K_M Local / Ollama Custom merge β€” not published on Ollama.com
15 Gemma 4 12B it Q8_K_XL Gemma 4 12B Unknown Yes (moderate) Q8_K_XL Local / Ollama unsloth/gemma-4-12b-it-GGUF
16 Gemma 4 E4B Uncensored Gemma 4 ~4B Unknown Yes (limited) β€” Local / Ollama HauhauCS/Gemma-4-E4B-Uncensored-GGUF

Methodology

Test Design

Each model received the identical prompt from semantic-layering-prompt.md β€” a 14-section structured analysis request covering foreground, midground (fence, tree, secondary elements), background (left forest, mountains/sky, right terrain), depth ordering, occlusion reasoning, atmospheric inference, viewpoint/elevation, geological transition, and final synthesis. The prompt was appended to a single image upload with no system prompt modifications.

Scoring Framework

Each of the 10 tiers contributes points toward a 100-point composite:

# Tier Max Weight What It Measures
T0 Object Presence 10 10% Hallucination resistance (5 positive + 5 negative controls)
T1 Instance Counting 6 6% Fine-grained discrimination of repeated-element classes
T2 Attribute Binding 12 12% Color-object association; feature grounding
T3 Spatial Relations 10 10% Left/right, front/behind predicates; coordinate frame consistency
T4 Depth Ordering 10 10% 3D mental model construction; surface vs. volume understanding
T5 Material Classification 10 10% Texture-to-material mapping; surface property inference
T6 Atmospheric Inference 10 10% Season, weather, time-of-day from indirect visual evidence
T7 Occlusion Reasoning 10 10% Amodal completion; counterfactual removal; hidden geometry
T8 Geological Transition 10 10% Landscape-scale variation awareness; biome identification
T9 Viewpoint & Regional 12 12% Egocentric spatial reasoning; geographical localization

Scoring Rules:

  • T0: +1 per correct yes/no (5 positives: fence, mountains, deciduous tree, conifers, utility pole, clouds; 5 negatives: water, snow, road, building)
  • T1: +2 per correct count within Β±1 tolerance (posts: 5-7, deciduous trees: 1, utility poles: 1)
  • T2: +2 per correct attribute (grass color, tree color, conifer color, post material, mountain color, grass texture). Partial credit (+1) for imprecise but not wrong answers.
  • T3: +2.5 per correct spatial relation (fence/tree front-behind, right-of-tree, forest location, tree relative to fence+mountains)
  • T4: +2.5 per correct depth judgment (ordering list, between-fence-and-mountains, pole front/behind)
  • T5: +2 per correct material ID + reasoning (fence components, mountain texture, left-right surface contrast)
  • T6: +2 per correct atmospheric inference (season, weather cloud/fog distinction, time of day, snow vs cloud)
  • T7: +2.5 per occlusion judgment (above+through visibility, removal counterfactual, tree base occlusion)
  • T8: +2.5 per geological insight (left-right transition description, biome/region ID, moisture gradient)
  • T9: +3 per viewpoint judgment (elevation assessment, 50m forward projection, regional localization quality). Regional bonus: Okanogan/Cascades = +3, PNW = +2, Rockies/Sierra = +1, wrong continent = 0.

Composite Metrics

Metric Formula Range Purpose
Hallucination Index False claims / total claims 0–1 Lower = better
Thinking Depth Self-correction count + alternative consideration count 0–N Metacognitive quality
Regional Precision 0–3 bonus scale 0–3 Geographic grounding
Completeness Sections answered / 14 0–100% Instruction following

Per-Tier Heatmap

Scores normalized to tier maximum (darker = better [depending on device theme]).

 Model                                    T0  T1  T2  T3 T4  T5  T6  T7  T8 T9 β”‚ Total
───────────────────────────────────────────────────────────────────────────────────────│───────
 1  Nex N2 Mini Apex MTP                  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ β”‚  94
 2  Nex N2 Mini APEX I-Compact            β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ β”‚  91
 3  Qwable 9B Fable 5                     β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–“  β–’β–“ β”‚  87
 4  DeepSeek V4 (web)                     β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“ β”‚  86
 5  Qwen 3.6 40B Deckard Heretic          β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–ˆ  β–’β–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“ β”‚  86
 6  Qwen 3.6 27B Abliterated              β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“ β”‚  84
 7  Qwen 3.6 27b Abl. i1                  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“ β”‚  82
 8  Unsloth Qwen 3.6 35B A3B              β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“ β”‚  80
 9  Qwen 3.5 9B Unc. Aggressive           β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–“ β”‚  77
10  Qwen 3 VL 8B Q4_K_M                   β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–“ β”‚  67
11  Qwen3 VL 30B A3B Abliterated          β–ˆβ–ˆ  β–‘β–‘  β–ˆβ–“  β–‘β–‘  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–“ β”‚  62
12  Qwen 3 VL 8B Q8_0                     β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–ˆ  β–‘β–‘  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–“ β”‚  62
13  Gemma 4 E2B Uncensored                β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–‘β–‘  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–“ β”‚  52
14  Qwythos 3.5 9B                        β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–‘β–‘  β–ˆβ–“  β–ˆβ–“  β–ˆβ–ˆ  β–‘β–‘  β–ˆβ–“  β–‘β–‘ β”‚  48
15  Gemma 4 12B it Q8_K_XL                β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–ˆβ–ˆ  β–ˆβ–ˆ  β–‘β–‘  β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–‘β–‘ β”‚  45
16  Gemma 4 E4B Uncensored                β–ˆβ–ˆ  β–ˆβ–“  β–ˆβ–“  β–‘β–‘  β–ˆβ–“  β–‘β–‘  β–ˆβ–ˆ  β–‘β–‘  β–ˆβ–“  β–‘β–‘ β”‚  40
───────────────────────────────────────────────────────────────────────────────────────│───────
β–ˆ = 90-100%   β–“ = 60-89%   β–’ = 30-59%   β–‘ = 0-29%

Correction note: Original heatmap contained a duplicate row for Qwen 3.6 40B Deckard Heretic (appeared at both position 4 and position 9). Removed duplicate β€” 16 rows β†’ 15. Qwable 9B Fable 5 added from semantic-layering-responses4.md β€” 15 rows β†’ 16.


Individual Model Scorecards

1. Nex N2 Mini Apex MTP β€” 94/100 (S-Tier)

Tier Score Max Notes
T0 Object Presence 10 10 All controls passed; no hallucinations
T1 Instance Counting 6 6 Posts: 7 (within range); tree: 1; pole: 1
T2 Attribute Binding 12 12 All colors correct; grass texture "fibrous, strawlike" spot-on
T3 Spatial Relations 10 10 Tree behind fence; pole right; forest left; relations all correct
T4 Depth Ordering 10 10 Correct order; nuanced on tree/fence relative depth
T5 Material Classification 10 10 Wood slats, stone posts, rugged mountain, left/right contrast all correct
T6 Atmospheric Inference 10 10 Autumn, partly cloudy, mid-morning/early afternoon, fog not snow
T7 Occlusion Reasoning 10 10 Above+through correctly enumerated; removal counterfactual accurate
T8 Geological Transition 10 10 Left-right moisture gradient; "temperate montane meadow"
T9 Viewpoint & Regional 12 12 Low meadow viewpoint; 50m forward = fence; "Okanagan/Similkameen/Kettle Valley"

Strengths: Pixel-level post counting in thinking trace. Regional localization to specific valley system. Thorough occlusion reasoning distinguishing above-fence from through-slat visibility. Self-correction on sun angle direction.

Weaknesses: Thinking trace shows mild over-analysis on post count (debated 7 vs 8 for ~20 lines) β€” thoroughness bordering on obsessive, but final answer was correct.


2. Nex N2 Mini Abliterated APEX I-Compact β€” 91/100 (S-Tier)

Tier Score Max Notes
T0–T8 79 84 Identical accuracy pattern to Apex MTP
T9 Viewpoint & Regional 12 12 "Cascades/Okanagan/Pacific Northwest" β€” regional localization excellent

Strengths: Same regional knowledge as sibling model. Comprehensive thinking trace (~200 lines). Correct on all material, spatial, and depth judgments.

Weaknesses: Thinking trace slightly less pixel-precise on post count than Apex MTP. Minor hedging on tree species ("aspen/cottonwood/poplar" vs more confident single ID).


3. Qwable 9B Fable 5 (Empero-AI) β€” 87/100 (A-Tier)

Tier Score Max Notes
T0 Object Presence 10 10 Clean
T1 Instance Counting 5 6 "Approximately eight substantial posts" β€” within Β±1 of 7 large stone posts. Tree count 1 βœ“. Utility pole not explicitly counted (βˆ’1).
T2 Attribute Binding 12 12 All attributes correct. Only non-Nex model to distinguish stone cairns from wood posts: "several of these 'posts' are not wood but rather piles of stacked stones (cairns)."
T3 Spatial Relations 8 10 Tree behind fence βœ“. Forest left βœ“. "What stands directly to the right" described generically (βˆ’2).
T4 Depth Ordering 10 10 Correct order with nuance: "fence and tree are roughly at the same depth plane… but fence is slightly closer as it occludes parts of the tree's base."
T5 Material Classification 10 10 Slats, rails, posts all correct. Correctly identifies stone cairns as distinct from wood posts. Hand-built identification with reasoning. Flawless material classification.
T6 Atmospheric Inference 8 10 Season/weather correct. Time "mid-morning to early afternoon" β€” acceptable but vague (βˆ’2).
T7 Occlusion Reasoning 10 10 Comprehensive occlusion analysis. Unique observation: "faint dark lines cut across the image horizontallyβ€”these appear to be utility power lines running through the valley, visible against both the sky and the darker trees." No other model reported power lines.
T8 Geological Transition 7 10 Left-right moisture gradient identified βœ“. Biome: "Rockies, Cascades, or Sierra Nevada" β€” includes correct range, no Cascades commitment (βˆ’3).
T9 Viewpoint & Regional 7 12 Elevation and 50m forward correct (+3). Region lists 3 candidates including Cascades (+2 bonus) but doesn't commit (βˆ’2).

Strengths: Best material classification of any model outside the Nex pair β€” correctly identified stone cairn fence posts where DeepSeek V4 (~236B), all Gemma 4 variants, and most Qwen models failed. Only model in the entire benchmark to detect utility power lines crossing the valley, a low-contrast thin linear element against complex background. Flawless depth ordering and occlusion reasoning. For a 9B model at Q4_K_M quantization, this is an exceptional result.

Weaknesses: Regional localization imprecise β€” lists 3 mountain ranges rather than committing. The model sees the scene accurately but lacks the geographic knowledge to place it. "What stands directly to the right of the prominent tree" answered generically rather than specifically identifying the conifer cluster and red/brown shrubs.

Architecture significance: This is a Qwen 3.5 derivative with a DPO/RLHF tune (Fable 5) optimized for visual instruction following. Its thinking trace shows methodical section-by-section reasoning without the obsessive over-analysis of the Nex models. It correctly resolved the stone-vs-wood material distinction efficiently β€” no 20-line post-count debate needed. Suggests the "Fable" tuning approach improves material classification fidelity in vision-language tasks.


4. DeepSeek V4 (chat.deepseek.com) β€” 86/100 (A-Tier)

Tier Score Max Notes
T0 Object Presence 10 10 Clean
T1 Instance Counting 6 6 Posts: ~7
T2 Attribute Binding 10 12 "maple or cottonwood" β€” hedging costs 2 pts; colors all correct
T3 Spatial Relations 10 10 Tree behind fence βœ“
T4 Depth Ordering 10 10 Correct ordering
T5 Material Classification 9 10 Stone posts, wood slats identified correctly. Missed the ~5 smaller-diameter wooden vertical posts spaced between the stone pillars β€” treated them as slats or didn't register them as a distinct structural element (βˆ’1).
T6 Atmospheric Inference 10 10 Morning/late afternoon hedging acceptable
T7 Occlusion Reasoning 10 10 Strong amodal completion
T8 Geological Transition 8 10 Right side: drier/rockier βœ“; biome "Western US" β€” misses Cascades specificity
T9 Viewpoint & Regional 5 12 Region: "Rocky Mountains, Montana, Idaho, Wyoming" β€” not Cascades (-3). Elevation and 50m forward correct.

Strengths: Most polished output. Produced the initial semantic layering framework from which the test prompt was derived. Strong occlusion and atmospheric reasoning.

Weaknesses: Hedged on tree species. Regional localization missed the Cascades entirely β€” placed it in Rockies. Additionally, failed to identify the ~5 smaller-diameter wooden vertical posts spaced between the stone pillars β€” the fence alternates stone columns with intermediate wooden posts supporting the horizontal rails. DeepSeek registered only one thin wooden pole (calling it a "utility pole") and otherwise collapsed these structural elements into the "slats" category. The Nex models, by contrast, distinguished them. Adjusted T5: 10β†’9 (βˆ’1).


5. Qwen 3.6 40B Deckard Heretic β€” 86/100 (A-Tier)

Tier Score Max Notes
T2 Attribute Binding 8 12 Tree: "oak" βœ— (βˆ’4 pts). Should be cottonwood/aspen. All other attributes correct.
T4 Depth Ordering 9 10 Ordering correct but fence and tree described as "same depth plane" β€” tree is behind fence (βˆ’1).
T6 Atmospheric Inference 6 10 Time: "8-10 AM" βœ— β€” off by 3–5 hrs (actual: ~noon–1PM). Fog misled the model; bright midday lighting cues ignored.
T9 Viewpoint & Regional 7 12 Elevation + 50m forward correct. Region: listed 3 candidates (Sierra, Cascades, Rockies) β€” includes correct range but didn't commit (βˆ’3 vs Okanogan-specific).

Strengths: Best regional awareness among Qwen models β€” explicitly lists "Cascade Range valleys" among candidates. Strong geological transition analysis. Detailed grass species speculation (rye grass). Stone posts, depth ordering, and occlusion reasoning all correct.

Weaknesses: Three errors: (1) Tree species: called it oak β€” a Western NA ecology gap. Cottonwood/aspen is correct for this region. (2) Time-of-day: 8-10 AM off by 3–5 hours. Fog pattern in mountains misled the model into assuming early morning despite bright midday lighting (actual: ~noon–1PM). (3) Regional: listed 3 ranges rather than committing β€” regional knowledge is present but imprecise.


6. Qwen 3.6 27B Abliterated β€” 84/100 (A-Tier)

Tier Score Max Notes
T0–T7 74 78 Solid across all perceptual tiers
T8 Geological Transition 7 10 Left-right transition correct; biome: "European Alps (Tyrol or Bavaria) or possibly Rocky Mountains" β€” Alps mention costs 3 pts
T9 Viewpoint & Regional 3 12 Region: Alps primary guess costs heavily; viewpoint assessment correct

Strengths: Identified "wattle" fence construction style. Accurate on all perceptual dimensions. Good atmospheric and occlusion reasoning.

Weaknesses: The European Alps guess is a significant regional error β€” the dry-stone + wattle fence style misled it to European pastoral landscapes. Correctly noted Rockies as alternative.


7. Qwen 3.6 27B Abliterated i1 β€” 82/100 (A-Tier)

Similar profile to sibling 27B model. Region: "Sierra Nevada or Rocky Mountain foothills." Solid perceptual accuracy, minor regional imprecision. (Abbreviated scorecard β€” full tier breakdown not provided in original.)


8. Unsloth Qwen 3.6 35B A3B MTP β€” 80/100 (A-Tier)

Strong perceptual accuracy. Region: "Pacific Northwest, British Columbia, or Rockies" β€” good range including PNW. Minor omissions in occlusion section. Solid all-around. (Abbreviated scorecard β€” full tier breakdown not provided in original.)


9. Qwen 3.5 9B Uncensored Aggressive β€” 77/100 (A-Tier)

Strong for a 9B model. Stone posts identified correctly. Region: "Northern Rocky Mountains (Montana/Idaho)." Minor omissions in occlusion reasoning and geological transition. (Abbreviated scorecard β€” full tier breakdown not provided in original.)


10. Qwen 3 VL 8B Q4_K_M β€” 67/100 (B-Tier)

Tier Score Max Notes
T1 Instance Counting 6 6 Posts: 5 (within tolerance; actual: 6 β€” 5 full + 1 partial)
T3 Spatial Relations 6 10 Inconsistent: says "behind" then "same depth" β€” depth frame instability
T4 Depth Ordering 6 10 Ordering correct but fence/tree "same depth" note reveals uncertainty
All other tiers 51 74 Solid but unremarkable

Weaknesses: Quantization to Q4_K_M shows in reduced precision on spatial judgments. The model hedges and contradicts itself on tree-fence depth relationship.


11. Qwen3 VL 30B A3B Abliterated β€” 62/100 (B-Tier)

Critical errors:

  • Tree position: IN FRONT of fence β€” fundamental occlusion failure (βˆ’4 on T3, βˆ’2.5 on T4)
  • Post count: 4 β€” catastrophic undercount (βˆ’3 on T1)
  • Forest type: "Deciduous trees mixed with coniferous" β€” reversed dominance (βˆ’2 on T2)
Tier Score Max Loss
T1 Instance Counting 3 6 Post count 4 (actual: 6 β€” 5 full + 1 partial)
T3 Spatial Relations 5 10 Tree in front of fence βœ—
T4 Depth Ordering 7.5 10 Minor ordering error cascading from T3
Remaining tiers 46.5 74 Solid

Analysis: Most disappointing result for a 30B-class model. The tree-in-front error suggests a failure to resolve occlusion cues β€” the model saw "bright orange tree" and "darker fence line below it" and defaulted to the tree being closer (bright things are typically nearer). It did not process that the fence slats cross in front of the tree trunk. The 4-post count suggests it saw only the most prominent stone pillars and missed the intermediate ones.


12. Qwen 3 VL 8B Q8_0 β€” 62/100 (B-Tier)

Tier Score Max Notes
T1 Instance Counting 6 6 Posts: 5 (within tolerance; actual: 6 β€” 5 full + 1 partial)
T3 Spatial Relations 5 10 Tree in front of fence βœ— β€” same failure as 30B variant
T4 Depth Ordering 7.5 10 Order correct, but contradicts T3
Remaining 45.5 74 Solid

Same tree-in-front error as the 30B variant β€” suggests this is a Qwen3-VL architectural pattern, not a scale issue. Both the 8B and 30B Qwen3-VL models made this same error. The 27B text-first Qwen models did not.


13. Gemma 4 E2B Uncensored β€” 52/100 (C-Tier)

Tier Score Max Notes
T2 Attribute Binding 6 12 Tree: "Maple or Oak" βœ— (βˆ’4); post material: wood βœ— (βˆ’2)
T5 Material Classification 3 10 Posts: wood βœ— (βˆ’4); slats described adequately
Remaining 43 78 Somewhat intact

Critical failure: All Gemma 4 variants misidentify stone posts as wood, but the failure is a reasoning error, not a perceptual one. The vision encoder registers the stone texture (stacked, irregular, grey); the LLM's "fence post β†’ wood" prior overrides the visual evidence. The E2B is the least severe case β€” it at least placed the tree behind the fence correctly and also identified the smaller intermediate wooden posts between the stone pillars ("additional smaller posts defining the gaps between the main sections"), a structural detail that even DeepSeek V4 missed.


14. Qwythos 3.5 9B β€” 48/100 (C-Tier)

Critical errors:

  • Tree in front of fence
  • Response is mostly a single-paragraph synthesis β€” misses most prompt sections
  • No separate analysis of depth layers, occlusion, atmospheric inference
  • Region: "Rocky Mountains or Sierra Nevada" β€” acceptable

Analysis: This model produced the shortest response. The thinking trace was more comprehensive than the output, suggesting the model can reason but either hit an output length constraint or was optimized for brevity. The output reads like a summary of the thinking, not a full response. Lowest completeness score.


15. Gemma 4 12B it Q8_K_XL β€” 45/100 (C-Tier)

Tier Score Max Notes
T2 Attribute Binding 6 12 Posts: wood βœ—; tree: cottonwood βœ“
T5 Material Classification 2 10 Posts described as "heavy squared timbers" with "bundles of dried brush or decorative stone crowns" β€” complete material hallucination
T9 Viewpoint & Regional 4 12 Region: "Intermountain West (Colorado, Utah, Wyoming)" β€” acceptable but not close

Analysis: The "decorative brush bundles" / "stone crowns" description is the key diagnostic: the vision encoder registered the irregular stacked-stone shape and reported it; the LLM could not reconcile "stone texture on a fence post" with its "fence post = wood" prior, so it confabulated a compromise β€” wood posts with stone decorations. This is an LLM prior override, not a texture-to-material mapping failure at the encoder level. The encoder saw the difference; the LLM refused to believe it.


16. Gemma 4 E4B Uncensored β€” 40/100 (C-Tier)

Tier Score Max Notes
T2 Attribute Binding 6 12 Tree: "maple or cottonwood" β€” acceptable; posts: wood βœ—
T3 Spatial Relations 5 10 Tree in front of fence βœ—
T5 Material Classification 2 10 Posts: "thick, irregular, naturally shaped timber logs" βœ—
T7 Occlusion Reasoning 5 10 Tree-in-front error cascades β€” removal counterfactual nonsensical

Analysis: Worst overall performance. Combines both critical failure modes: tree-in-front (spatial β€” perceptual error) AND posts-as-wood (material β€” LLM prior override). The "uncensored" fine-tune may have degraded perceptual accuracy in exchange for reduced refusal rates. Notably, the E4B described the posts as "thick, irregular, naturally shaped" β€” words that describe stacked fieldstone as accurately as rough timber β€” showing the encoder transmitted the right texture signal; the LLM applied the wrong label.


Gap Analysis: Score Delta from #1

Where each model loses points relative to the leader. Positive deltas = points lost. Tiers listed in descending order of average loss across all models (hardest tiers first).

| # | Model | Total | T9 Reg | T5 Mat | T3 Spa | T4 Dep | T2 Attr | T7 Occl | T1 Cnt | T8 Geo | T6 Atm | T0 Obj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nex Apex MTP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Nex APEX I-Comp | +3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Qwable 9B Fable 5 | +7 | +5 | 0 | +2 | 0 | 0 | 0 | +1 | +3 | +2 | 0 |
| 4 | DeepSeek V4 | +8 | +7 | +1 | 0 | 0 | +2 | 0 | 0 | +2 | 0 | 0 |
| 5 | Qwen 40B Deckard | +8 | +5 | 0 | 0 | +1 | +4 | 0 | 0 | 0 | +4 | 0 |
| 6 | Qwen 27B Abl. | +10 | +9 | 0 | 0 | 0 | 0 | 0 | 0 | +3 | 0 | 0 |
| 7 | Qwen 27B i1 | +12 | +6 | 0 | 0 | 0 | 0 | 0 | 0 | +4 | 0 | 0 |
| 8 | Unsloth Qwen 35B | +14 | +6 | 0 | 0 | 0 | 0 | +2 | 0 | +4 | 0 | 0 |
| 9 | Qwen 9B Aggr. | +17 | +6 | 0 | 0 | 0 | 0 | +4 | 0 | +5 | 0 | 0 |
| 10 | Qwen VL 8B Q4 | +27 | +6 | 0 | +4 | +4 | 0 | +3 | 0 | +5 | 0 | 0 |
| 11 | Qwen3 VL 30B | +32 | +6 | 0 | +5 | +2.5 | +2 | +3 | +3 | +4 | 0 | 0 |
| 12 | Qwen VL 8B Q8 | +32 | +6 | 0 | +5 | +2.5 | 0 | +3 | 0 | +5 | 0 | 0 |
| 13 | Gemma 4 E2B | +42 | +6 | +7 | 0 | 0 | +6 | +3 | +2 | +5 | 0 | 0 |
| 14 | Qwythos 9B | +46 | +9 | +4 | +5 | +2.5 | +2 | +5 | +2 | +5 | 0 | 0 |
| 15 | Gemma 4 12B | +49 | +8 | +8 | 0 | 0 | +6 | +3 | +2 | +5 | 0 | 0 |
| 16 | Gemma 4 E4B | +54 | +9 | +8 | +5 | +2.5 | +6 | +5 | +2 | +4 | 0 | 0 |

Correction note: Original had "#8 Qwen 40B Deckard" β€” corrected to "#4".

Hardest tiers (avg loss): T9 Regional (+5.5) > T8 Geological (+3.1) > T2 Attribute (+2.1) > T5 Material (+1.9) > T3 Spatial (+1.6).

Easiest tiers (avg loss): T0 Object Presence (0.0) > T6 Atmospheric (+0.3) > T1 Counting (+0.7).

Interpretation: Every model passed object presence and atmospheric inference. The spread comes from regional knowledge, material classification, and spatial reasoning β€” exactly the tiers that distinguish scene-understanders from scene-describers.


Failure Mode Analysis

Failure Mode 1: Tree Position Reversal (4 models)

Affected: Qwen3 VL 30B, Qwen 3 VL 8B (both quants), Gemma 4 E4B, Qwythos 3.5 9B

Symptom: Model reports tree is in front of fence when it is actually behind.

Root Cause: Bright, salient object (orange tree) is interpreted as nearer than darker, less salient object (fence). Model fails to use occlusion cues β€” fence slats visibly cross in front of tree trunk.

Architecture Pattern: Affects Qwen3-VL family at multiple scales (8B and 30B) but NOT Qwen 3.6 text-first models or Nex series. Suggests vision-encoder-first architectures may be more susceptible than language-model-first architectures with vision bolted on.

Severity: Critical. Fundamental depth perception error.

Failure Mode 2: Stone Posts β†’ Wood (3 models)

Affected: All Gemma 4 variants (E2B, E4B, 12B)

Symptom: Stone pillars described as wooden posts. E4B: "thick, irregular timber logs." 12B: "heavy squared timbers with decorative bundles."

Root cause β€” LLM prior override, not encoder blindness: The vision encoder does detect the stone texture. Evidence: the 12B describes "decorative stone crowns" on top of the posts and notes them as "irregular" β€” it registers the stacked-fieldstone visual signal. But the language model applies a strong "fence post β†’ wood" prior that overrides the visual classification. The encoder says "stacked irregular grey material"; the LLM resolves that to "wood posts with stone decorations" (12B), "rough-hewn irregular wood" (E4B), or simply "wood" (E2B). This is the same class of error as DeepSeek's Rockies mislocalization: the LLM prior beats the visual evidence.

Architecture Pattern: Consistent across all three Gemma 4 sizes tested. This is not an encoder-level limitation β€” it is a reasoning failure where the language model's material priors dominate the vision encoder's output. The encoder can see the difference; the LLM won't believe it.

Severity: High. Material classification is a core vision task. More fixable than a perceptual deficit β€” a system prompt rule ("trust visual texture over material priors") would directly target this.

Failure Mode 3: Regional Mislocalization (3 models)

Affected: DeepSeek V4 (Rockies), Qwen 3.6 27B (Alps), Qwen 3.6 40B Deckard Heretic (Cascades candidate list but wrong tree)

Symptom: Model identifies correct biome but wrong continent or mountain range.

Root Cause: The dry-stone fence with stacked fieldstone posts is visually similar to Alpine/Tyrolean pastoral fencing. Models using visual fence-style cues for geography were misled to Europe. Models using vegetation cues (conifer + cottonwood/aspen) correctly identified Western North America.

Severity: Moderate. Geographical knowledge is secondary to perceptual accuracy for most use cases.

Failure Mode 4: Post Count Undercount (4 models)

Affected: Qwen3 VL 30B (4), all Gemma 4 variants (5-8 but wrong material)

Symptom: Qwen3 VL 30B counts 4 posts when 6 are visible (5 full + 1 partial top on left edge).

Root Cause: Model counts only the most visually prominent posts and misses partially occluded or edge-cropped ones. Genuine instance segmentation failure rather than counting error.

Severity: Moderate. Counting repeated elements is a known weakness of current VLMs.


Architecture Insights

Vision-encoder-first vs Language-model-first

Architecture Type Models Avg Score Tree Error Rate Post Material Error
Vision-encoder-first (dense) Qwen3-VL (8B, 30B) 62 66% (2/3) 0%
MoE + agentic thinking Nex N2 Mini (both, 35B/3B active) 93 0% (0/2) 0%
Dense + thinking Qwen 3.6 (27B–40B), DeepSeek V4, Qwable 9B 79 0% (0/8) 13% (1/8 β€” DeepSeek missed intermediate posts, Qwable scored 10/10)
Gemma 4 (unknown arch) E2B, E4B, 12B 46 33% (1/3) 100% (3/3) β€” LLM prior override, not encoder deficit

Language-model-first and MoE architectures consistently outperformed vision-encoder-first architectures on spatial reasoning (tree position). The Nex models (35B total, 3B active) dominated through reasoning depth β€” their active-parameter count is low, but their 35B knowledge pool and extensive thinking traces compensated for what dense vision-first models missed.

Material classification presents a different failure pattern: Gemma 4's 100% error rate is an LLM prior override (the encoder sees stone; the LLM says wood), while Qwen3-VL's 0% error rate on materials but 66% error rate on spatial relations suggests vision-first architectures handle texture-to-material mapping well but fail on relational reasoning that requires integrating visual evidence with 3D mental models. The two failure types are architectural inverses of each other.

Thinking/Reasoning Depth

Thinking Quality Models Avg Score
Extensive thinking trace (>100 lines) Nex (both), DeepSeek, Qwen 3.6 27B, Deckard Heretic 86
Moderate thinking (50-100 lines) Qwable 9B, Qwen 3.5 9B, Unsloth 35B, Gemma 4 12B 72
Minimal or no thinking Qwen3 VL 30B, both Qwen 3 VL 8B, Gemma E2B/E4B, Qwythos 56

Strong correlation between thinking trace depth and overall score. Models that reasoned step-by-step about occlusion, material properties, and spatial relationships made fewer critical errors.

Quantization Impact (Qwen 3 VL 8B)

Quant Score Notes
Q8_0 62 Tree in front
Q4_K_M 67 Tree "behind" but inconsistent

Minimal score difference (5 pts). Both quants gave post count 5 β€” within Β±1 of the actual 6 (5 full + 1 partial). The Q4_K_M variant was slightly more cautious in spatial judgments (hedged with "same depth") which improved its score by avoiding the definitive wrong answer.


Key Findings

  1. Regional localization is the best single discriminator of model quality. Only the top 2 models (both Nex variants) correctly identified the Okanogan/Cascade region. Every other model placed the scene somewhere between the Alps and the Rockies. This capability correlates strongly with overall score (r β‰ˆ 0.85).

  2. Occlusion reasoning separates scene-understanders from scene-describers. The tree-behind-fence question was the single most diagnostic item: models that got it right scored 80+; models that got it wrong scored ≀65. It tests whether the model builds a 3D mental model or describes a 2D image.

  3. Gemma 4 has a systematic LLM prior override on material classification. All three variants registered the stone texture (the 12B even described "stone crowns") but the language model's "fence post β†’ wood" prior overrode the visual evidence. This is a reasoning failure, not a perceptual one β€” the encoder sees the difference; the LLM overrides it. Contrast with Qwen3-VL's tree-in-front error (perceptual: brightness/salience dominates occlusion cues at the encoder level).

  4. MoE + thinking depth outperforms dense vision-first architectures. The Nex models (35B total, 3B active) outscored both the dense Qwen3-VL 30B and the much larger DeepSeek V4 (~236B). Their MoE architecture provides a large knowledge pool at low inference cost, while extensive reasoning traces catch perceptual ambiguities.

  5. Vision-first architectures struggle with depth from occlusion. Qwen3-VL models at both 8B and 30B scales made the same tree-in-front error, while Qwen 3.6 text-first models of comparable size did not. The vision encoder in vision-first architectures may dominate the final judgment, overriding contradictory signals from the language model.

  6. Quantization has minimal impact on this task. The Q8_0 vs Q4_K_M comparison for Qwen 3 VL 8B showed only a 5-point difference, with the lower quant actually scoring slightly higher due to more cautious spatial hedging.


Recommendations

For Model Selection

  • Best overall: Nex N2 Mini (either variant) for vision tasks requiring spatial reasoning and regional knowledge
  • Best large model: DeepSeek V4 for polished, comprehensive scene analysis
  • Best local model: Qwen 3.6 27B Abliterated for strong perceptual accuracy without cloud dependency
  • Best regional awareness (Qwen family): Qwen 3.6 40B Deckard Heretic β€” explicitly lists Cascade Range valleys (but verify species and time-of-day on your own images)
  • Avoid for landscape vision: Gemma 4 variants (LLM material priors override visual evidence β€” stone postsβ†’wood); Qwen3-VL 8B/30B (spatial reasoning unreliable β€” tree-in-front perceptual error)

For Benchmark Improvement

  • Add a material classification tier with explicit binary choices (wood/stone/metal) to isolate texture-to-material mapping from free-text description
  • Add depth-from-occlusion forced-choice pairs (A in front of B or B in front of A?) to eliminate hedging
  • Include a confidence calibration check β€” ask models to rate their confidence on spatial judgments
  • Test with fence removed via inpainting as a control image to verify occlusion reasoning independently

For Further Investigation

  • Why do Nex models have such strong regional knowledge? Training data source analysis needed.
  • Is the Gemma 4 material deficit fixable via fine-tuning the prompt to distinguish posts vs columns, or is it a visual decoder deficit?
  • Does the Qwen3-VL tree-position error persist across all images with foreground occlusion, or is it specific to this brightness/salience pattern?

Caveats & Limitations

  1. Single-image evaluation. All scores derive from one image. Model rankings may shift with different scene types (urban, indoor, night, abstract). Treat scores as indicative, not definitive. A multi-image benchmark suite (10+ diverse scenes) would yield Β±1–2 point confidence.

  2. Manual scoring subjectivity. Tier scores were assigned by human review of free-text responses. While a structured rubric was applied consistently, different evaluators might vary Β±2–3 points on borderline judgments. Automated LLM-as-judge re-scoring is planned for reproducibility.

  3. Prompt sensitivity. All models received the identical prompt with no system-prompt modifications. Performance may vary with different prompting strategies (chain-of-thought, structured output formatting, role assignment).

  4. Thinking mode variability. Some models were tested with thinking enabled, others without (per availability in Ollama). Direct comparison of "thinking vs. no-thinking" variants within the same model family was only possible for Qwen 3 VL 8B.

  5. Quantization and hardware. Local models were run at various quantization levels on consumer GPU hardware. Cloud models (DeepSeek V4) ran at full precision. Quantization impact was minimal in this benchmark but may be larger for tasks requiring fine-grained texture discrimination.

  6. Temporal snapshot. Model performance reflects the versions available in June 2026. Provider updates, fine-tune releases, and quantization improvements may change rankings. Re-benchmarking every 3–6 months recommended.

  7. Geographic knowledge bias. The test image is from a specific North American region. Models with training data skewed toward Western North American landscapes had an inherent advantage on T9 (Regional).

  8. No speed/cost data. Inference time and token efficiency were not tracked. Future runs should include timing instrumentation.


Appendices

A. Raw Response Data:

  • semantic-layering-responses1.md β€” DeepSeek V4, Qwen3 VL 30B, Nex N2 Mini APEX I-Compact
  • semantic-layering-responses2.md β€” Qwen 3.6 27B variants, Gemma 4 variants, Qwythos 9B, Qwen 40B
  • semantic-layering-responses3.md β€” Nex N2 Mini Apex MTP, Qwen 3 VL 8B variants, Qwen 3.5 9B, Unsloth Qwen 35B, Gemma 4 12B
  • semantic-layering-responses4.md β€” Qwable 9B Fable 5 (Empero-AI)

B. Test Prompt: semantic-layering-prompt.md

C. Scoring Rubric: semantic-layering-test.md

D. Test Image: IMG_0773.JPG β€” Okanogan Valley, near Omak, WA (Northern Cascade Range foothills). Autumn, ~noon–1PM.

E. Radar Chart: radar-comparison.png β€” multi-model radar chart showing normalized tier scores for the top 8 models.


Corrections Applied

# Issue Original Corrected
1 Heatmap duplicate row Qwen 3.6 40B Deckard Heretic appeared at positions 4 and 9 (16 rows) Removed duplicate β€” 15 rows
2 Scorecard numbering Restarted at "1." for most models; some labeled "2.", "3.", "4." arbitrarily Renumbered sequentially 1–15
3 Scorecard ordering Deckard Heretic (rank #4) appeared after Qwen 9B (rank #8) Moved to correct rank position (#4)
4 Gap Analysis numbering "#8 Qwen 40B Deckard" (should be #4) Corrected to "#4"
5 Ollama identifiers missing No blob manifest / pull identifiers for any model Added "Ollama / HF Identifier" column (9 cols β†’ 10 cols). Confirmed on ollama.com: huihui_ai models (qwen3.6-abliterated, qwen3-vl-abliterated, gemma-4-abliterated). HF GGUF paths derived for: Nex variants, DavidAU Deckard Heretic, Hauhaucs, Unsloth, Qwythos. See table note for sha256 caveat.
6 DeepSeek V4 fence structure gap T5 scored 10/10; no mention of intermediate wooden fence posts T5 reduced 10β†’9 (βˆ’1). DeepSeek missed ~5 smaller wooden vertical posts spaced between stone pillars β€” collapsed them into slats. Discovered via comparison with Nex N2 Mini thinking trace. Score: 87β†’86. Gap Analysis, heatmap, Quick Reference, Executive Summary, and tier averages recalculated.
7 Gemma 4 material failure misdiagnosed Described as encoder-level perceptual deficit (can't distinguish stone from wood) Revised to LLM prior override. Evidence: 12B described "stone crowns," E4B used "irregular, naturally shaped" β€” encoder transmitted stone texture; LLM "fence post β†’ wood" prior overrode the visual classification. Same class of error as DeepSeek's Rockies mislocalization. Failure Mode 2, Key Finding #3, Architecture Insights table, and all three Gemma scorecards updated. Scores unchanged β€” T5 already reflects the material error.
8 Qwable 9B Fable 5 added Model evaluated from semantic-layering-responses4.md (Empero-AI / Qwable 9B Fable 5 Q4_K_M). Scored 87/100 Inserted at rank 3. Only non-Nex model to correctly identify stone cairn posts (T5: 10/10) and only model to detect utility power lines (T7). Quick Reference, Executive Summary, Ranked Results, Model Metadata, Heatmap, Scorecards (all renumbered 3β†’16), Gap Analysis (recalculated), Architecture Insights (updated Dense+thinking row, Thinking Depth table), radar chart regenerated. All "15 models" references updated.
Β·

IMG_0773