Join the conversation

Join the community of Machine Learners and AI enthusiasts.

posted an update 8 days ago

Post

136

Did a little evaluation to confirm, a lot of these Fable and Opus distills don't stand up to even their base Qwen models. I have been consistently underwhelmed by anything related to Anthropic. Though there have been two notable exceptions so far: Qwable 5 9B from Empero AI and Qwen 3.6 40B Claude 4.6 Opus Deckard Heretic Uncensored Thinking NEO-CODE Di IMatrix MAX from DavidAU

EpsilonGreedyAI

8 days ago

•

edited 8 days ago

Semantic Layering Vision Benchmark — Results Report (Corrected)

Source: HuggingFace post by @EpsilonGreedyAI — June 2026

16 vision models. One mountain valley. Which actually sees it?

Test Image: Okanogan Valley, near Omak, WA — Northern Cascade Range foothills
Evaluator: Manual review + structured rubric
Framework: semantic-layering-prompt.md — single-pass, 14-section analysis
Confidence: Scores ±2 pts (single-image evaluation)

Quick Reference Card

#	Model	Score	Grade	Best For	Avoid For
1	Nex N2 Mini Apex MTP	94	S	Geographic localization; occlusion detail	—
2	Nex N2 Mini Abl. APEX I-Compact	91	S	Regional knowledge; comprehensive reasoning	—
3	Qwable 9B Fable 5	87	A	Material classification; unique power line detection	Regional localization
4	DeepSeek V4 (web)	86	A	Polished output; occlusion & atmosphere	Regional precision; fence structure
5	Qwen 3.6 40B Deckard Heretic	86	A	Geological analysis; Cascades mention	Tree species ID; time-of-day
6	Qwen 3.6 27B Abliterated	84	A	Strong perceptual accuracy; solid all-around	European landscapes
7	Qwen 3.6 27b Abl. i1	82	A	Perceptual accuracy; balanced	Regional precision
8	Unsloth Qwen 3.6 35B A3B	80	A	Strong all-around; PNW awareness	Minor omissions
9	Qwen 3.5 9B Unc. Aggressive	77	A	Best sub-10B model; good value	Occlusion detail
10	Qwen 3 VL 8B Q4_K_M	67	B	Budget vision; acceptable for coarse tasks	Spatial consistency
11	Qwen3 VL 30B A3B Abl.	62	B	—	Spatial relations; counting
12	Qwen 3 VL 8B Q8_0	62	B	—	Spatial relations; occlusion
13	Gemma 4 E2B Uncensored	52	C	—	Material classification; tree species
14	Qwythos 3.5 9B	48	C	—	Completeness; spatial relations
15	Gemma 4 12B it Q8_K_XL	45	C	—	Material classification; confabulation
16	Gemma 4 E4B Uncensored	40	C	—	Material; spatial; occlusion

Executive Summary

Sixteen vision-language models analyzed a single Okanogan Valley landscape across 10 depth-stratified reasoning tiers — from surface object detection through counterfactual viewpoint projection.

Top performers: Both Nex N2 Mini variants (94, 91) scored highest and were the only models to localize the region to the Okanogan/Cascade area. A surprise third-place entry: Qwable 9B Fable 5 (87), a 9B model that was the only non-Nex model to correctly identify the stone cairn fence posts and the only model to detect utility power lines crossing the valley. DeepSeek V4 (86) and Qwen 3.6 40B Deckard Heretic (86) tied for fourth with strong geological reasoning and regional awareness. Qwen 3.6 27B models (82–84) followed closely. The best previously-known sub-10B model was Qwen 3.5 9B at 77 — now surpassed by Qwable 9B at 87.

Critical failure modes: Four models placed the tree in front of the fence (occlusion reasoning failure). All three Gemma 4 variants misidentified stone fence posts as wood — a systematic material classification deficit. Model size correlated weakly with performance (r ≈ 0.4); vision-encoder architecture and reasoning depth were far stronger predictors.

Bottom line: For landscape-scale vision tasks requiring spatial reasoning, MoE models with agentic thinking (Nex, 35B total / 3B active) outperform dense vision-first models (Qwen3-VL 30B). Gemma 4 is unreliable for material classification at mid-distance.

Ranked Results

Rank	Model	Score	Grade	±	Critical Errors
1	Nex N2 Mini Apex MTP	94	S	±2	None
2	Nex N2 Mini Abliterated APEX I-Compact	91	S	±2	None
3	Qwable 9B Fable 5 (Empero-AI)	87	A	±2	Region: lists 3 ranges, no Cascades commitment
4	DeepSeek V4 (chat.deepseek.com)	86	A	±2	Region: Rockies not Cascades; missed intermediate wooden fence posts
5	Qwen 3.6 40B Deckard Heretic	86	A	±3	Tree: oak; Time: 8-10AM (off 3-5hrs); Region: broad list
6	Qwen 3.6 27B Abliterated	84	A	±2	Region: Alps mention
7	Qwen 3.6 27b Abliterated i1	82	A	±3	Region: Sierra/Rockies only
8	Unsloth Qwen 3.6 35B A3B MTP	80	A	±3	Minor omissions
9	Qwen 3.5 9B Uncensored HauhauCS Aggressive	77	A	±3	Minor omissions
10	Qwen 3 VL 8B Q4_K_M	67	B	±3	Inconsistent depth
11	Qwen3 VL 30B A3B Abliterated	62	B	±3	Tree in front; post count 4 (actual: 6); forest type wrong
12	Qwen 3 VL 8B Q8_0	62	B	±3	Tree in front
13	Gemma 4 E2B Uncensored	52	C	±3	Posts: wood; tree: maple/oak
14	Qwythos 3.5 9B	48	C	±4	Tree in front; insufficient detail
15	Unsloth Gemma 4 12B it Q8_K_XL	45	C	±4	Posts: wood; decorative bundles confabulation
16	Hauhaucs Gemma 4 E4B Uncensored	40	C	±4	Posts: wood; tree in front

Model Metadata

#	Model	Family	Est. Params	Architecture	Thinking	Quant	Source	Ollama / HF Identifier
1	Nex N2 Mini Apex MTP	Nex	35B (3B active, MoE)	MoE (Qwen3.5-35B-A3B base)	Yes (extensive)	—	Local / Ollama	`huihui-ai/Huihui-Nex-N2-mini-abliterated` (GGUF, APEX MTP quant)
2	Nex N2 Mini Abl. APEX I-Compact	Nex	35B (3B active, MoE)	MoE (Qwen3.5-35B-A3B base)	Yes (extensive)	—	Local / Ollama	`huihui-ai/Huihui-Nex-N2-mini-abliterated` (GGUF, APEX I-Compact quant)
3	Qwable 9B Fable 5	Qwen 3.5 derivative	9B	Lang-first + vis	Yes (moderate)	Q4_K_M	Local / Ollama	`Empero-AI/Qwable-9B-Fable-5-GGUF`
4	DeepSeek V4	DeepSeek	~236B (MoE)	Unknown	Yes (moderate)	—	Web	chat.deepseek.com (not Ollama)
5	Qwen 3.6 40B Deckard Heretic	Qwen 3.6	40B	Lang-first + vis	Yes (extensive)	IQ4_XS	Local / Ollama	`DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-GGUF`
6	Qwen 3.6 27B Abliterated	Qwen 3.6	27B	Lang-first + vis	Yes (moderate)	IQ4_XS	Local / Ollama	`huihui_ai/qwen3.6-abliterated:27b`
7	Qwen 3.6 27b Abl. i1	Qwen 3.6	27B	Lang-first + vis	Yes (limited)	i1	Local / Ollama	`huihui_ai/qwen3.6-abliterated:27b-i1`
8	Unsloth Qwen 3.6 35B A3B MTP	Qwen 3.6	35B (MoE)	Lang-first + vis	Yes (moderate)	MXFP4	Local / Ollama	`unsloth/Qwen3.6-35B-A3B-GGUF`
9	Qwen 3.5 9B Unc. Aggressive	Qwen 3.5	9B	Lang-first + vis	Yes (moderate)	Q4_K_M	Local / Ollama	`HauhauCS/Qwen3.5-9B-Uncensored-Aggressive-GGUF`
10	Qwen 3 VL 8B Q4_K_M	Qwen3-VL	8B	Vis-first	No	Q4_K_M	Local / Ollama	`huihui_ai/qwen3-vl-abliterated:8b`
11	Qwen3 VL 30B A3B Abliterated	Qwen3-VL	30B	Vis-first	No	IQ3_XXS	Local / Ollama	`huihui_ai/qwen3-vl-abliterated:30b`
12	Qwen 3 VL 8B Q8_0	Qwen3-VL	8B	Vis-first	No	Q8_0	Local / Ollama	`huihui_ai/qwen3-vl-abliterated:8b` (Q8_0 quant)
13	Gemma 4 E2B Uncensored	Gemma 4	~2B	Unknown	Yes (limited)	—	Local / Ollama	`huihui_ai/gemma-4-abliterated:e2b`
14	Qwythos 3.5 9B	Qwen 3.5 (merge)	9B	Lang-first + vis	Yes (minimal)	Q4_K_M	Local / Ollama	Custom merge — not published on Ollama.com
15	Gemma 4 12B it Q8_K_XL	Gemma 4	12B	Unknown	Yes (moderate)	Q8_K_XL	Local / Ollama	`unsloth/gemma-4-12b-it-GGUF`
16	Gemma 4 E4B Uncensored	Gemma 4	~4B	Unknown	Yes (limited)	—	Local / Ollama	`HauhauCS/Gemma-4-E4B-Uncensored-GGUF`

Methodology

Test Design

Each model received the identical prompt from semantic-layering-prompt.md — a 14-section structured analysis request covering foreground, midground (fence, tree, secondary elements), background (left forest, mountains/sky, right terrain), depth ordering, occlusion reasoning, atmospheric inference, viewpoint/elevation, geological transition, and final synthesis. The prompt was appended to a single image upload with no system prompt modifications.

Scoring Framework

Each of the 10 tiers contributes points toward a 100-point composite:

#	Tier	Max	Weight	What It Measures
T0	Object Presence	10	10%	Hallucination resistance (5 positive + 5 negative controls)
T1	Instance Counting	6	6%	Fine-grained discrimination of repeated-element classes
T2	Attribute Binding	12	12%	Color-object association; feature grounding
T3	Spatial Relations	10	10%	Left/right, front/behind predicates; coordinate frame consistency
T4	Depth Ordering	10	10%	3D mental model construction; surface vs. volume understanding
T5	Material Classification	10	10%	Texture-to-material mapping; surface property inference
T6	Atmospheric Inference	10	10%	Season, weather, time-of-day from indirect visual evidence
T7	Occlusion Reasoning	10	10%	Amodal completion; counterfactual removal; hidden geometry
T8	Geological Transition	10	10%	Landscape-scale variation awareness; biome identification
T9	Viewpoint & Regional	12	12%	Egocentric spatial reasoning; geographical localization

Scoring Rules:

T0: +1 per correct yes/no (5 positives: fence, mountains, deciduous tree, conifers, utility pole, clouds; 5 negatives: water, snow, road, building)
T1: +2 per correct count within ±1 tolerance (posts: 5-7, deciduous trees: 1, utility poles: 1)
T2: +2 per correct attribute (grass color, tree color, conifer color, post material, mountain color, grass texture). Partial credit (+1) for imprecise but not wrong answers.
T3: +2.5 per correct spatial relation (fence/tree front-behind, right-of-tree, forest location, tree relative to fence+mountains)
T4: +2.5 per correct depth judgment (ordering list, between-fence-and-mountains, pole front/behind)
T5: +2 per correct material ID + reasoning (fence components, mountain texture, left-right surface contrast)
T6: +2 per correct atmospheric inference (season, weather cloud/fog distinction, time of day, snow vs cloud)
T7: +2.5 per occlusion judgment (above+through visibility, removal counterfactual, tree base occlusion)
T8: +2.5 per geological insight (left-right transition description, biome/region ID, moisture gradient)
T9: +3 per viewpoint judgment (elevation assessment, 50m forward projection, regional localization quality). Regional bonus: Okanogan/Cascades = +3, PNW = +2, Rockies/Sierra = +1, wrong continent = 0.

Composite Metrics

Metric	Formula	Range	Purpose
Hallucination Index	False claims / total claims	0–1	Lower = better
Thinking Depth	Self-correction count + alternative consideration count	0–N	Metacognitive quality
Regional Precision	0–3 bonus scale	0–3	Geographic grounding
Completeness	Sections answered / 14	0–100%	Instruction following

Per-Tier Heatmap

Scores normalized to tier maximum (darker = better [depending on device theme]).

 Model                                    T0  T1  T2  T3 T4  T5  T6  T7  T8 T9 │ Total
───────────────────────────────────────────────────────────────────────────────────────│───────
 1  Nex N2 Mini Apex MTP                  ██  ██  ██  ██  ██  ██  ██  ██  ██  ██ │  94
 2  Nex N2 Mini APEX I-Compact            ██  ██  ██  ██  ██  ██  ██  ██  ██  ██ │  91
 3  Qwable 9B Fable 5                     ██  █▓  ██  █▓  ██  ██  █▓  ██  █▓  ▒▓ │  87
 4  DeepSeek V4 (web)                     ██  ██  ██  ██  ██  ██  ██  ██  ██  █▓ │  86
 5  Qwen 3.6 40B Deckard Heretic          ██  ██  █▓  ██  █▓  ██  ▒▓  ██  ██  █▓ │  86
 6  Qwen 3.6 27B Abliterated              ██  ██  ██  ██  ██  ██  ██  ██  ██  █▓ │  84
 7  Qwen 3.6 27b Abl. i1                  ██  ██  ██  ██  ██  ██  ██  ██  █▓  █▓ │  82
 8  Unsloth Qwen 3.6 35B A3B              ██  ██  ██  ██  ██  ██  ██  ██  █▓  █▓ │  80
 9  Qwen 3.5 9B Unc. Aggressive           ██  ██  ██  ██  ██  ██  ██  █▓  █▓  █▓ │  77
10  Qwen 3 VL 8B Q4_K_M                   ██  ██  ██  █▓  █▓  ██  ██  █▓  █▓  █▓ │  67
11  Qwen3 VL 30B A3B Abliterated          ██  ░░  █▓  ░░  █▓  ██  ██  █▓  █▓  █▓ │  62
12  Qwen 3 VL 8B Q8_0                     ██  ██  ██  ░░  █▓  ██  ██  █▓  █▓  █▓ │  62
13  Gemma 4 E2B Uncensored                ██  █▓  █▓  ██  ██  ░░  ██  █▓  █▓  █▓ │  52
14  Qwythos 3.5 9B                        ██  █▓  █▓  ░░  █▓  █▓  ██  ░░  █▓  ░░ │  48
15  Gemma 4 12B it Q8_K_XL                ██  █▓  █▓  ██  ██  ░░  ██  █▓  █▓  ░░ │  45
16  Gemma 4 E4B Uncensored                ██  █▓  █▓  ░░  █▓  ░░  ██  ░░  █▓  ░░ │  40
───────────────────────────────────────────────────────────────────────────────────────│───────
█ = 90-100%   ▓ = 60-89%   ▒ = 30-59%   ░ = 0-29%

Correction note: Original heatmap contained a duplicate row for Qwen 3.6 40B Deckard Heretic (appeared at both position 4 and position 9). Removed duplicate — 16 rows → 15. Qwable 9B Fable 5 added from semantic-layering-responses4.md — 15 rows → 16.

Individual Model Scorecards

1. Nex N2 Mini Apex MTP — 94/100 (S-Tier)

Tier	Score	Max	Notes
T0 Object Presence	10	10	All controls passed; no hallucinations
T1 Instance Counting	6	6	Posts: 7 (within range); tree: 1; pole: 1
T2 Attribute Binding	12	12	All colors correct; grass texture "fibrous, strawlike" spot-on
T3 Spatial Relations	10	10	Tree behind fence; pole right; forest left; relations all correct
T4 Depth Ordering	10	10	Correct order; nuanced on tree/fence relative depth
T5 Material Classification	10	10	Wood slats, stone posts, rugged mountain, left/right contrast all correct
T6 Atmospheric Inference	10	10	Autumn, partly cloudy, mid-morning/early afternoon, fog not snow
T7 Occlusion Reasoning	10	10	Above+through correctly enumerated; removal counterfactual accurate
T8 Geological Transition	10	10	Left-right moisture gradient; "temperate montane meadow"
T9 Viewpoint & Regional	12	12	Low meadow viewpoint; 50m forward = fence; "Okanagan/Similkameen/Kettle Valley"

Strengths: Pixel-level post counting in thinking trace. Regional localization to specific valley system. Thorough occlusion reasoning distinguishing above-fence from through-slat visibility. Self-correction on sun angle direction.

Weaknesses: Thinking trace shows mild over-analysis on post count (debated 7 vs 8 for ~20 lines) — thoroughness bordering on obsessive, but final answer was correct.

2. Nex N2 Mini Abliterated APEX I-Compact — 91/100 (S-Tier)

Tier	Score	Max	Notes
T0–T8	79	84	Identical accuracy pattern to Apex MTP
T9 Viewpoint & Regional	12	12	"Cascades/Okanagan/Pacific Northwest" — regional localization excellent

Strengths: Same regional knowledge as sibling model. Comprehensive thinking trace (~200 lines). Correct on all material, spatial, and depth judgments.

Weaknesses: Thinking trace slightly less pixel-precise on post count than Apex MTP. Minor hedging on tree species ("aspen/cottonwood/poplar" vs more confident single ID).

3. Qwable 9B Fable 5 (Empero-AI) — 87/100 (A-Tier)

Tier	Score	Max	Notes
T0 Object Presence	10	10	Clean
T1 Instance Counting	5	6	"Approximately eight substantial posts" — within ±1 of 7 large stone posts. Tree count 1 ✓. Utility pole not explicitly counted (−1).
T2 Attribute Binding	12	12	All attributes correct. Only non-Nex model to distinguish stone cairns from wood posts: "several of these 'posts' are not wood but rather piles of stacked stones (cairns)."
T3 Spatial Relations	8	10	Tree behind fence ✓. Forest left ✓. "What stands directly to the right" described generically (−2).
T4 Depth Ordering	10	10	Correct order with nuance: "fence and tree are roughly at the same depth plane… but fence is slightly closer as it occludes parts of the tree's base."
T5 Material Classification	10	10	Slats, rails, posts all correct. Correctly identifies stone cairns as distinct from wood posts. Hand-built identification with reasoning. Flawless material classification.
T6 Atmospheric Inference	8	10	Season/weather correct. Time "mid-morning to early afternoon" — acceptable but vague (−2).
T7 Occlusion Reasoning	10	10	Comprehensive occlusion analysis. Unique observation: "faint dark lines cut across the image horizontally—these appear to be utility power lines running through the valley, visible against both the sky and the darker trees." No other model reported power lines.
T8 Geological Transition	7	10	Left-right moisture gradient identified ✓. Biome: "Rockies, Cascades, or Sierra Nevada" — includes correct range, no Cascades commitment (−3).
T9 Viewpoint & Regional	7	12	Elevation and 50m forward correct (+3). Region lists 3 candidates including Cascades (+2 bonus) but doesn't commit (−2).

Strengths: Best material classification of any model outside the Nex pair — correctly identified stone cairn fence posts where DeepSeek V4 (~236B), all Gemma 4 variants, and most Qwen models failed. Only model in the entire benchmark to detect utility power lines crossing the valley, a low-contrast thin linear element against complex background. Flawless depth ordering and occlusion reasoning. For a 9B model at Q4_K_M quantization, this is an exceptional result.

Weaknesses: Regional localization imprecise — lists 3 mountain ranges rather than committing. The model sees the scene accurately but lacks the geographic knowledge to place it. "What stands directly to the right of the prominent tree" answered generically rather than specifically identifying the conifer cluster and red/brown shrubs.

Architecture significance: This is a Qwen 3.5 derivative with a DPO/RLHF tune (Fable 5) optimized for visual instruction following. Its thinking trace shows methodical section-by-section reasoning without the obsessive over-analysis of the Nex models. It correctly resolved the stone-vs-wood material distinction efficiently — no 20-line post-count debate needed. Suggests the "Fable" tuning approach improves material classification fidelity in vision-language tasks.

4. DeepSeek V4 (chat.deepseek.com) — 86/100 (A-Tier)

Tier	Score	Max	Notes
T0 Object Presence	10	10	Clean
T1 Instance Counting	6	6	Posts: ~7
T2 Attribute Binding	10	12	"maple or cottonwood" — hedging costs 2 pts; colors all correct
T3 Spatial Relations	10	10	Tree behind fence ✓
T4 Depth Ordering	10	10	Correct ordering
T5 Material Classification	9	10	Stone posts, wood slats identified correctly. Missed the ~5 smaller-diameter wooden vertical posts spaced between the stone pillars — treated them as slats or didn't register them as a distinct structural element (−1).
T6 Atmospheric Inference	10	10	Morning/late afternoon hedging acceptable
T7 Occlusion Reasoning	10	10	Strong amodal completion
T8 Geological Transition	8	10	Right side: drier/rockier ✓; biome "Western US" — misses Cascades specificity
T9 Viewpoint & Regional	5	12	Region: "Rocky Mountains, Montana, Idaho, Wyoming" — not Cascades (-3). Elevation and 50m forward correct.

Strengths: Most polished output. Produced the initial semantic layering framework from which the test prompt was derived. Strong occlusion and atmospheric reasoning.

Weaknesses: Hedged on tree species. Regional localization missed the Cascades entirely — placed it in Rockies. Additionally, failed to identify the ~5 smaller-diameter wooden vertical posts spaced between the stone pillars — the fence alternates stone columns with intermediate wooden posts supporting the horizontal rails. DeepSeek registered only one thin wooden pole (calling it a "utility pole") and otherwise collapsed these structural elements into the "slats" category. The Nex models, by contrast, distinguished them. Adjusted T5: 10→9 (−1).

5. Qwen 3.6 40B Deckard Heretic — 86/100 (A-Tier)

Tier	Score	Max	Notes
T2 Attribute Binding	8	12	Tree: "oak" ✗ (−4 pts). Should be cottonwood/aspen. All other attributes correct.
T4 Depth Ordering	9	10	Ordering correct but fence and tree described as "same depth plane" — tree is behind fence (−1).
T6 Atmospheric Inference	6	10	Time: "8-10 AM" ✗ — off by 3–5 hrs (actual: ~noon–1PM). Fog misled the model; bright midday lighting cues ignored.
T9 Viewpoint & Regional	7	12	Elevation + 50m forward correct. Region: listed 3 candidates (Sierra, Cascades, Rockies) — includes correct range but didn't commit (−3 vs Okanogan-specific).

Strengths: Best regional awareness among Qwen models — explicitly lists "Cascade Range valleys" among candidates. Strong geological transition analysis. Detailed grass species speculation (rye grass). Stone posts, depth ordering, and occlusion reasoning all correct.

Weaknesses: Three errors: (1) Tree species: called it oak — a Western NA ecology gap. Cottonwood/aspen is correct for this region. (2) Time-of-day: 8-10 AM off by 3–5 hours. Fog pattern in mountains misled the model into assuming early morning despite bright midday lighting (actual: ~noon–1PM). (3) Regional: listed 3 ranges rather than committing — regional knowledge is present but imprecise.

6. Qwen 3.6 27B Abliterated — 84/100 (A-Tier)

Tier	Score	Max	Notes
T0–T7	74	78	Solid across all perceptual tiers
T8 Geological Transition	7	10	Left-right transition correct; biome: "European Alps (Tyrol or Bavaria) or possibly Rocky Mountains" — Alps mention costs 3 pts
T9 Viewpoint & Regional	3	12	Region: Alps primary guess costs heavily; viewpoint assessment correct

Strengths: Identified "wattle" fence construction style. Accurate on all perceptual dimensions. Good atmospheric and occlusion reasoning.

Weaknesses: The European Alps guess is a significant regional error — the dry-stone + wattle fence style misled it to European pastoral landscapes. Correctly noted Rockies as alternative.

7. Qwen 3.6 27B Abliterated i1 — 82/100 (A-Tier)

Similar profile to sibling 27B model. Region: "Sierra Nevada or Rocky Mountain foothills." Solid perceptual accuracy, minor regional imprecision. (Abbreviated scorecard — full tier breakdown not provided in original.)

8. Unsloth Qwen 3.6 35B A3B MTP — 80/100 (A-Tier)

Strong perceptual accuracy. Region: "Pacific Northwest, British Columbia, or Rockies" — good range including PNW. Minor omissions in occlusion section. Solid all-around. (Abbreviated scorecard — full tier breakdown not provided in original.)

9. Qwen 3.5 9B Uncensored Aggressive — 77/100 (A-Tier)

Strong for a 9B model. Stone posts identified correctly. Region: "Northern Rocky Mountains (Montana/Idaho)." Minor omissions in occlusion reasoning and geological transition. (Abbreviated scorecard — full tier breakdown not provided in original.)

10. Qwen 3 VL 8B Q4_K_M — 67/100 (B-Tier)

Tier	Score	Max	Notes
T1 Instance Counting	6	6	Posts: 5 (within tolerance; actual: 6 — 5 full + 1 partial)
T3 Spatial Relations	6	10	Inconsistent: says "behind" then "same depth" — depth frame instability
T4 Depth Ordering	6	10	Ordering correct but fence/tree "same depth" note reveals uncertainty
All other tiers	51	74	Solid but unremarkable

Weaknesses: Quantization to Q4_K_M shows in reduced precision on spatial judgments. The model hedges and contradicts itself on tree-fence depth relationship.

11. Qwen3 VL 30B A3B Abliterated — 62/100 (B-Tier)

Critical errors:

Tree position: IN FRONT of fence — fundamental occlusion failure (−4 on T3, −2.5 on T4)
Post count: 4 — catastrophic undercount (−3 on T1)
Forest type: "Deciduous trees mixed with coniferous" — reversed dominance (−2 on T2)

Tier	Score	Max	Loss
T1 Instance Counting	3	6	Post count 4 (actual: 6 — 5 full + 1 partial)
T3 Spatial Relations	5	10	Tree in front of fence ✗
T4 Depth Ordering	7.5	10	Minor ordering error cascading from T3
Remaining tiers	46.5	74	Solid

Analysis: Most disappointing result for a 30B-class model. The tree-in-front error suggests a failure to resolve occlusion cues — the model saw "bright orange tree" and "darker fence line below it" and defaulted to the tree being closer (bright things are typically nearer). It did not process that the fence slats cross in front of the tree trunk. The 4-post count suggests it saw only the most prominent stone pillars and missed the intermediate ones.

12. Qwen 3 VL 8B Q8_0 — 62/100 (B-Tier)

Tier	Score	Max	Notes
T1 Instance Counting	6	6	Posts: 5 (within tolerance; actual: 6 — 5 full + 1 partial)
T3 Spatial Relations	5	10	Tree in front of fence ✗ — same failure as 30B variant
T4 Depth Ordering	7.5	10	Order correct, but contradicts T3
Remaining	45.5	74	Solid

Same tree-in-front error as the 30B variant — suggests this is a Qwen3-VL architectural pattern, not a scale issue. Both the 8B and 30B Qwen3-VL models made this same error. The 27B text-first Qwen models did not.

13. Gemma 4 E2B Uncensored — 52/100 (C-Tier)

Tier	Score	Max	Notes
T2 Attribute Binding	6	12	Tree: "Maple or Oak" ✗ (−4); post material: wood ✗ (−2)
T5 Material Classification	3	10	Posts: wood ✗ (−4); slats described adequately
Remaining	43	78	Somewhat intact

Critical failure: All Gemma 4 variants misidentify stone posts as wood, but the failure is a reasoning error, not a perceptual one. The vision encoder registers the stone texture (stacked, irregular, grey); the LLM's "fence post → wood" prior overrides the visual evidence. The E2B is the least severe case — it at least placed the tree behind the fence correctly and also identified the smaller intermediate wooden posts between the stone pillars ("additional smaller posts defining the gaps between the main sections"), a structural detail that even DeepSeek V4 missed.

14. Qwythos 3.5 9B — 48/100 (C-Tier)

Critical errors:

Tree in front of fence
Response is mostly a single-paragraph synthesis — misses most prompt sections
No separate analysis of depth layers, occlusion, atmospheric inference
Region: "Rocky Mountains or Sierra Nevada" — acceptable

Analysis: This model produced the shortest response. The thinking trace was more comprehensive than the output, suggesting the model can reason but either hit an output length constraint or was optimized for brevity. The output reads like a summary of the thinking, not a full response. Lowest completeness score.

15. Gemma 4 12B it Q8_K_XL — 45/100 (C-Tier)

Tier	Score	Max	Notes
T2 Attribute Binding	6	12	Posts: wood ✗; tree: cottonwood ✓
T5 Material Classification	2	10	Posts described as "heavy squared timbers" with "bundles of dried brush or decorative stone crowns" — complete material hallucination
T9 Viewpoint & Regional	4	12	Region: "Intermountain West (Colorado, Utah, Wyoming)" — acceptable but not close

Analysis: The "decorative brush bundles" / "stone crowns" description is the key diagnostic: the vision encoder registered the irregular stacked-stone shape and reported it; the LLM could not reconcile "stone texture on a fence post" with its "fence post = wood" prior, so it confabulated a compromise — wood posts with stone decorations. This is an LLM prior override, not a texture-to-material mapping failure at the encoder level. The encoder saw the difference; the LLM refused to believe it.

16. Gemma 4 E4B Uncensored — 40/100 (C-Tier)

Tier	Score	Max	Notes
T2 Attribute Binding	6	12	Tree: "maple or cottonwood" — acceptable; posts: wood ✗
T3 Spatial Relations	5	10	Tree in front of fence ✗
T5 Material Classification	2	10	Posts: "thick, irregular, naturally shaped timber logs" ✗
T7 Occlusion Reasoning	5	10	Tree-in-front error cascades — removal counterfactual nonsensical

Analysis: Worst overall performance. Combines both critical failure modes: tree-in-front (spatial — perceptual error) AND posts-as-wood (material — LLM prior override). The "uncensored" fine-tune may have degraded perceptual accuracy in exchange for reduced refusal rates. Notably, the E4B described the posts as "thick, irregular, naturally shaped" — words that describe stacked fieldstone as accurately as rough timber — showing the encoder transmitted the right texture signal; the LLM applied the wrong label.

Gap Analysis: Score Delta from #1

Where each model loses points relative to the leader. Positive deltas = points lost. Tiers listed in descending order of average loss across all models (hardest tiers first).

| # | Model | Total | T9 Reg | T5 Mat | T3 Spa | T4 Dep | T2 Attr | T7 Occl | T1 Cnt | T8 Geo | T6 Atm | T0 Obj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nex Apex MTP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Nex APEX I-Comp | +3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Qwable 9B Fable 5 | +7 | +5 | 0 | +2 | 0 | 0 | 0 | +1 | +3 | +2 | 0 |
| 4 | DeepSeek V4 | +8 | +7 | +1 | 0 | 0 | +2 | 0 | 0 | +2 | 0 | 0 |
| 5 | Qwen 40B Deckard | +8 | +5 | 0 | 0 | +1 | +4 | 0 | 0 | 0 | +4 | 0 |
| 6 | Qwen 27B Abl. | +10 | +9 | 0 | 0 | 0 | 0 | 0 | 0 | +3 | 0 | 0 |
| 7 | Qwen 27B i1 | +12 | +6 | 0 | 0 | 0 | 0 | 0 | 0 | +4 | 0 | 0 |
| 8 | Unsloth Qwen 35B | +14 | +6 | 0 | 0 | 0 | 0 | +2 | 0 | +4 | 0 | 0 |
| 9 | Qwen 9B Aggr. | +17 | +6 | 0 | 0 | 0 | 0 | +4 | 0 | +5 | 0 | 0 |
| 10 | Qwen VL 8B Q4 | +27 | +6 | 0 | +4 | +4 | 0 | +3 | 0 | +5 | 0 | 0 |
| 11 | Qwen3 VL 30B | +32 | +6 | 0 | +5 | +2.5 | +2 | +3 | +3 | +4 | 0 | 0 |
| 12 | Qwen VL 8B Q8 | +32 | +6 | 0 | +5 | +2.5 | 0 | +3 | 0 | +5 | 0 | 0 |
| 13 | Gemma 4 E2B | +42 | +6 | +7 | 0 | 0 | +6 | +3 | +2 | +5 | 0 | 0 |
| 14 | Qwythos 9B | +46 | +9 | +4 | +5 | +2.5 | +2 | +5 | +2 | +5 | 0 | 0 |
| 15 | Gemma 4 12B | +49 | +8 | +8 | 0 | 0 | +6 | +3 | +2 | +5 | 0 | 0 |
| 16 | Gemma 4 E4B | +54 | +9 | +8 | +5 | +2.5 | +6 | +5 | +2 | +4 | 0 | 0 |

Correction note: Original had "#8 Qwen 40B Deckard" — corrected to "#4".

Hardest tiers (avg loss): T9 Regional (+5.5) > T8 Geological (+3.1) > T2 Attribute (+2.1) > T5 Material (+1.9) > T3 Spatial (+1.6).

Easiest tiers (avg loss): T0 Object Presence (0.0) > T6 Atmospheric (+0.3) > T1 Counting (+0.7).

Interpretation: Every model passed object presence and atmospheric inference. The spread comes from regional knowledge, material classification, and spatial reasoning — exactly the tiers that distinguish scene-understanders from scene-describers.

Failure Mode Analysis

Failure Mode 1: Tree Position Reversal (4 models)

Affected: Qwen3 VL 30B, Qwen 3 VL 8B (both quants), Gemma 4 E4B, Qwythos 3.5 9B

Symptom: Model reports tree is in front of fence when it is actually behind.

Root Cause: Bright, salient object (orange tree) is interpreted as nearer than darker, less salient object (fence). Model fails to use occlusion cues — fence slats visibly cross in front of tree trunk.

Architecture Pattern: Affects Qwen3-VL family at multiple scales (8B and 30B) but NOT Qwen 3.6 text-first models or Nex series. Suggests vision-encoder-first architectures may be more susceptible than language-model-first architectures with vision bolted on.

Severity: Critical. Fundamental depth perception error.

Failure Mode 2: Stone Posts → Wood (3 models)

Affected: All Gemma 4 variants (E2B, E4B, 12B)

Symptom: Stone pillars described as wooden posts. E4B: "thick, irregular timber logs." 12B: "heavy squared timbers with decorative bundles."

Root cause — LLM prior override, not encoder blindness: The vision encoder does detect the stone texture. Evidence: the 12B describes "decorative stone crowns" on top of the posts and notes them as "irregular" — it registers the stacked-fieldstone visual signal. But the language model applies a strong "fence post → wood" prior that overrides the visual classification. The encoder says "stacked irregular grey material"; the LLM resolves that to "wood posts with stone decorations" (12B), "rough-hewn irregular wood" (E4B), or simply "wood" (E2B). This is the same class of error as DeepSeek's Rockies mislocalization: the LLM prior beats the visual evidence.

Architecture Pattern: Consistent across all three Gemma 4 sizes tested. This is not an encoder-level limitation — it is a reasoning failure where the language model's material priors dominate the vision encoder's output. The encoder can see the difference; the LLM won't believe it.

Severity: High. Material classification is a core vision task. More fixable than a perceptual deficit — a system prompt rule ("trust visual texture over material priors") would directly target this.

Failure Mode 3: Regional Mislocalization (3 models)

Affected: DeepSeek V4 (Rockies), Qwen 3.6 27B (Alps), Qwen 3.6 40B Deckard Heretic (Cascades candidate list but wrong tree)

Symptom: Model identifies correct biome but wrong continent or mountain range.

Root Cause: The dry-stone fence with stacked fieldstone posts is visually similar to Alpine/Tyrolean pastoral fencing. Models using visual fence-style cues for geography were misled to Europe. Models using vegetation cues (conifer + cottonwood/aspen) correctly identified Western North America.

Severity: Moderate. Geographical knowledge is secondary to perceptual accuracy for most use cases.

Failure Mode 4: Post Count Undercount (4 models)

Affected: Qwen3 VL 30B (4), all Gemma 4 variants (5-8 but wrong material)

Symptom: Qwen3 VL 30B counts 4 posts when 6 are visible (5 full + 1 partial top on left edge).

Root Cause: Model counts only the most visually prominent posts and misses partially occluded or edge-cropped ones. Genuine instance segmentation failure rather than counting error.

Severity: Moderate. Counting repeated elements is a known weakness of current VLMs.

Architecture Insights

Vision-encoder-first vs Language-model-first

Architecture Type	Models	Avg Score	Tree Error Rate	Post Material Error
Vision-encoder-first (dense)	Qwen3-VL (8B, 30B)	62	66% (2/3)	0%
MoE + agentic thinking	Nex N2 Mini (both, 35B/3B active)	93	0% (0/2)	0%
Dense + thinking	Qwen 3.6 (27B–40B), DeepSeek V4, Qwable 9B	79	0% (0/8)	13% (1/8 — DeepSeek missed intermediate posts, Qwable scored 10/10)
Gemma 4 (unknown arch)	E2B, E4B, 12B	46	33% (1/3)	100% (3/3) — LLM prior override, not encoder deficit

Language-model-first and MoE architectures consistently outperformed vision-encoder-first architectures on spatial reasoning (tree position). The Nex models (35B total, 3B active) dominated through reasoning depth — their active-parameter count is low, but their 35B knowledge pool and extensive thinking traces compensated for what dense vision-first models missed.

Material classification presents a different failure pattern: Gemma 4's 100% error rate is an LLM prior override (the encoder sees stone; the LLM says wood), while Qwen3-VL's 0% error rate on materials but 66% error rate on spatial relations suggests vision-first architectures handle texture-to-material mapping well but fail on relational reasoning that requires integrating visual evidence with 3D mental models. The two failure types are architectural inverses of each other.

Thinking/Reasoning Depth

Thinking Quality	Models	Avg Score
Extensive thinking trace (>100 lines)	Nex (both), DeepSeek, Qwen 3.6 27B, Deckard Heretic	86
Moderate thinking (50-100 lines)	Qwable 9B, Qwen 3.5 9B, Unsloth 35B, Gemma 4 12B	72
Minimal or no thinking	Qwen3 VL 30B, both Qwen 3 VL 8B, Gemma E2B/E4B, Qwythos	56

Strong correlation between thinking trace depth and overall score. Models that reasoned step-by-step about occlusion, material properties, and spatial relationships made fewer critical errors.

Quantization Impact (Qwen 3 VL 8B)

Quant	Score	Notes
Q8_0	62	Tree in front
Q4_K_M	67	Tree "behind" but inconsistent

Minimal score difference (5 pts). Both quants gave post count 5 — within ±1 of the actual 6 (5 full + 1 partial). The Q4_K_M variant was slightly more cautious in spatial judgments (hedged with "same depth") which improved its score by avoiding the definitive wrong answer.

Key Findings

Regional localization is the best single discriminator of model quality. Only the top 2 models (both Nex variants) correctly identified the Okanogan/Cascade region. Every other model placed the scene somewhere between the Alps and the Rockies. This capability correlates strongly with overall score (r ≈ 0.85).
Occlusion reasoning separates scene-understanders from scene-describers. The tree-behind-fence question was the single most diagnostic item: models that got it right scored 80+; models that got it wrong scored ≤65. It tests whether the model builds a 3D mental model or describes a 2D image.
Gemma 4 has a systematic LLM prior override on material classification. All three variants registered the stone texture (the 12B even described "stone crowns") but the language model's "fence post → wood" prior overrode the visual evidence. This is a reasoning failure, not a perceptual one — the encoder sees the difference; the LLM overrides it. Contrast with Qwen3-VL's tree-in-front error (perceptual: brightness/salience dominates occlusion cues at the encoder level).
MoE + thinking depth outperforms dense vision-first architectures. The Nex models (35B total, 3B active) outscored both the dense Qwen3-VL 30B and the much larger DeepSeek V4 (~236B). Their MoE architecture provides a large knowledge pool at low inference cost, while extensive reasoning traces catch perceptual ambiguities.
Vision-first architectures struggle with depth from occlusion. Qwen3-VL models at both 8B and 30B scales made the same tree-in-front error, while Qwen 3.6 text-first models of comparable size did not. The vision encoder in vision-first architectures may dominate the final judgment, overriding contradictory signals from the language model.
Quantization has minimal impact on this task. The Q8_0 vs Q4_K_M comparison for Qwen 3 VL 8B showed only a 5-point difference, with the lower quant actually scoring slightly higher due to more cautious spatial hedging.

Recommendations

For Model Selection

Best overall: Nex N2 Mini (either variant) for vision tasks requiring spatial reasoning and regional knowledge
Best large model: DeepSeek V4 for polished, comprehensive scene analysis
Best local model: Qwen 3.6 27B Abliterated for strong perceptual accuracy without cloud dependency
Best regional awareness (Qwen family): Qwen 3.6 40B Deckard Heretic — explicitly lists Cascade Range valleys (but verify species and time-of-day on your own images)
Avoid for landscape vision: Gemma 4 variants (LLM material priors override visual evidence — stone posts→wood); Qwen3-VL 8B/30B (spatial reasoning unreliable — tree-in-front perceptual error)

For Benchmark Improvement

Add a material classification tier with explicit binary choices (wood/stone/metal) to isolate texture-to-material mapping from free-text description
Add depth-from-occlusion forced-choice pairs (A in front of B or B in front of A?) to eliminate hedging
Include a confidence calibration check — ask models to rate their confidence on spatial judgments
Test with fence removed via inpainting as a control image to verify occlusion reasoning independently

For Further Investigation

Why do Nex models have such strong regional knowledge? Training data source analysis needed.
Is the Gemma 4 material deficit fixable via fine-tuning the prompt to distinguish posts vs columns, or is it a visual decoder deficit?
Does the Qwen3-VL tree-position error persist across all images with foreground occlusion, or is it specific to this brightness/salience pattern?

Caveats & Limitations

Single-image evaluation. All scores derive from one image. Model rankings may shift with different scene types (urban, indoor, night, abstract). Treat scores as indicative, not definitive. A multi-image benchmark suite (10+ diverse scenes) would yield ±1–2 point confidence.
Manual scoring subjectivity. Tier scores were assigned by human review of free-text responses. While a structured rubric was applied consistently, different evaluators might vary ±2–3 points on borderline judgments. Automated LLM-as-judge re-scoring is planned for reproducibility.
Prompt sensitivity. All models received the identical prompt with no system-prompt modifications. Performance may vary with different prompting strategies (chain-of-thought, structured output formatting, role assignment).
Thinking mode variability. Some models were tested with thinking enabled, others without (per availability in Ollama). Direct comparison of "thinking vs. no-thinking" variants within the same model family was only possible for Qwen 3 VL 8B.
Quantization and hardware. Local models were run at various quantization levels on consumer GPU hardware. Cloud models (DeepSeek V4) ran at full precision. Quantization impact was minimal in this benchmark but may be larger for tasks requiring fine-grained texture discrimination.
Temporal snapshot. Model performance reflects the versions available in June 2026. Provider updates, fine-tune releases, and quantization improvements may change rankings. Re-benchmarking every 3–6 months recommended.
Geographic knowledge bias. The test image is from a specific North American region. Models with training data skewed toward Western North American landscapes had an inherent advantage on T9 (Regional).
No speed/cost data. Inference time and token efficiency were not tracked. Future runs should include timing instrumentation.

Appendices

A. Raw Response Data:

semantic-layering-responses1.md — DeepSeek V4, Qwen3 VL 30B, Nex N2 Mini APEX I-Compact
semantic-layering-responses2.md — Qwen 3.6 27B variants, Gemma 4 variants, Qwythos 9B, Qwen 40B
semantic-layering-responses3.md — Nex N2 Mini Apex MTP, Qwen 3 VL 8B variants, Qwen 3.5 9B, Unsloth Qwen 35B, Gemma 4 12B
semantic-layering-responses4.md — Qwable 9B Fable 5 (Empero-AI)

B. Test Prompt: semantic-layering-prompt.md

C. Scoring Rubric: semantic-layering-test.md

D. Test Image: IMG_0773.JPG — Okanogan Valley, near Omak, WA (Northern Cascade Range foothills). Autumn, ~noon–1PM.

E. Radar Chart: radar-comparison.png — multi-model radar chart showing normalized tier scores for the top 8 models.

Corrections Applied

#	Issue	Original	Corrected
1	Heatmap duplicate row	Qwen 3.6 40B Deckard Heretic appeared at positions 4 and 9 (16 rows)	Removed duplicate — 15 rows
2	Scorecard numbering	Restarted at "1." for most models; some labeled "2.", "3.", "4." arbitrarily	Renumbered sequentially 1–15
3	Scorecard ordering	Deckard Heretic (rank #4) appeared after Qwen 9B (rank #8)	Moved to correct rank position (#4)
4	Gap Analysis numbering	"#8 Qwen 40B Deckard" (should be #4)	Corrected to "#4"
5	Ollama identifiers missing	No blob manifest / pull identifiers for any model	Added "Ollama / HF Identifier" column (9 cols → 10 cols). Confirmed on ollama.com: huihui_ai models (qwen3.6-abliterated, qwen3-vl-abliterated, gemma-4-abliterated). HF GGUF paths derived for: Nex variants, DavidAU Deckard Heretic, Hauhaucs, Unsloth, Qwythos. See table note for sha256 caveat.
6	DeepSeek V4 fence structure gap	T5 scored 10/10; no mention of intermediate wooden fence posts	T5 reduced 10→9 (−1). DeepSeek missed ~5 smaller wooden vertical posts spaced between stone pillars — collapsed them into slats. Discovered via comparison with Nex N2 Mini thinking trace. Score: 87→86. Gap Analysis, heatmap, Quick Reference, Executive Summary, and tier averages recalculated.
7	Gemma 4 material failure misdiagnosed	Described as encoder-level perceptual deficit (can't distinguish stone from wood)	Revised to LLM prior override. Evidence: 12B described "stone crowns," E4B used "irregular, naturally shaped" — encoder transmitted stone texture; LLM "fence post → wood" prior overrode the visual classification. Same class of error as DeepSeek's Rockies mislocalization. Failure Mode 2, Key Finding #3, Architecture Insights table, and all three Gemma scorecards updated. Scores unchanged — T5 already reflects the material error.
8	Qwable 9B Fable 5 added	Model evaluated from `semantic-layering-responses4.md` (Empero-AI / Qwable 9B Fable 5 Q4_K_M). Scored 87/100	Inserted at rank 3. Only non-Nex model to correctly identify stone cairn posts (T5: 10/10) and only model to detect utility power lines (T7). Quick Reference, Executive Summary, Ranked Results, Model Metadata, Heatmap, Scorecards (all renumbered 3→16), Gap Analysis (recalculated), Architecture Insights (updated Dense+thinking row, Thinking Depth table), radar chart regenerated. All "15 models" references updated.

EpsilonGreedyAI

8 days ago

In this post

EpsilonGreedyAI George Costanza