🧬 Qwen3-RTL-14B

Recursive Thought Lattice · Atomic Mind · Epistemic Sovereignty

A 14B reasoning model that thinks in layers, not lines.



What is this?

Qwen3-RTL-14B is a fine-tuned reasoning model built on the Qwen3 architecture, enhanced with a custom Recursive Thought Lattice (RTL) framework and trained on a single RTX 3090. Rather than generating responses token by token without reflection, it processes every input through a structured 6-layer cognitive hierarchy, from sensory calibration to metacognitive self-audit, before committing to an answer.

The model is built on an abliterated base (huihui-ai/Huihui-Qwen3-14B-abliterated-v2), ensuring logical rigor is never compromised by artificial refusal patterns.

"Not a bigger model. A more structured thinker."


📊 Benchmark Results

All evaluations use an LLM-as-a-Judge protocol with zai-org/glm-4.6v-flash as the independent judge. Both models answer the same questions blindly; the judge scores each 0–10 and declares a winner per question.
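For readers who want to reproduce the setup, here is a minimal sketch of this protocol, assuming an OpenAI-compatible endpoint serving the judge. The prompt wording, JSON schema, and parsing below are illustrative assumptions, not the exact harness used for these numbers.

```python
# Minimal sketch of the LLM-as-a-Judge protocol described above.
# Endpoint, judge prompt, and score parsing are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # hypothetical local server

JUDGE_PROMPT = """You are an impartial judge. Score each answer 0-10 against the reference
and declare a winner ("A", "B", or "tie"). Reply as JSON:
{{"score_a": int, "score_b": int, "winner": str}}

Question: {q}
Reference: {ref}
Answer A: {a}
Answer B: {b}"""

def judge(question: str, reference: str, answer_a: str, answer_b: str) -> dict:
    """Ask the judge model to score two blind answers to the same question."""
    resp = client.chat.completions.create(
        model="zai-org/glm-4.6v-flash",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            q=question, ref=reference, a=answer_a, b=answer_b)}],
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)
```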


βš”οΈ Head-to-Head Summary

| Opponent | Size | Questions | RTL-14B Avg | Opponent Avg | RTL-14B Wins | Losses | Ties |
|---|---|---|---|---|---|---|---|
| qwen/qwen3-14b (standard, no RTL) | 14B | 62 | 8.71 / 10 | 5.60 / 10 | 56 | 3 | 3 |
| qwen3.5-35b-a3b-claude-4.6-opus-reasoning-distilled-i1 (reasoning-distilled) | ~35B | 237 | 7.95 / 10 | 4.78 / 10 | 199 | 35 | 3 |
| openai/gpt-oss-20b (OpenAI open-weights) | ~20B | 125 | 7.50 / 10 | 7.37 / 10 | 55 | 68 | 0 |

424 total questions evaluated. RTL-14B dominates both the same-size baseline and the larger reasoning-distilled model. Against openai/gpt-oss-20b, a stronger and more competitive baseline, it scores closely (7.50 vs 7.37 avg) while narrowly losing the win count (55W vs 68W). The gap narrows dramatically against a well-calibrated opponent of comparable scale.


🔬 vs. qwen/qwen3-14b – Same Size, No RTL

62 questions · complex reasoning + 10 general categories

  qwen3-rtl-abl-14b    ████████████████████  8.71 / 10   56W · 3T · 3L
  qwen/qwen3-14b       ██████████            5.60 / 10    3W · 3T · 56L

The judge consistently noted that RTL-14B produces structured, multi-step analysis that closely matches reference solutions, while Qwen3-14B tends toward shorter, less verified answers. On mathematical tasks the gap was most pronounced (RTL avg 9.0 vs 5.85). On the few ties (e.g. sequence identification), both models reached the correct answer via different paths.


🔬 vs. qwen3.5-35b-a3b-claude-4.6-opus-reasoning-distilled-i1 – 2.5× Larger, Reasoning-Distilled

237 questions · 60+ categories (four benchmark sessions merged)

  qwen3-rtl-abl-14b                    ████████████████████  7.95 / 10   199W · 3T · 35L
  qwen3.5-35b-a3b-claude-opus-distill  █████████             4.78 / 10    35W · 3T · 199L

Even against a model more than twice its size with reasoning distilled from Claude Opus 4.6, RTL-14B dominates across 237 questions and four independent sessions. The judge's recurring verdict: RTL-14B's layered cognitive structure produces more complete, formally verifiable answers. The larger model frequently gave brief or factually incorrect responses despite its size advantage, scoring 0/10 on several hard questions in thermodynamics, history, pedagogy, art, and writing.

Where the larger model wins: epistemology, psychology, quantum mechanics, paradoxes, comparative religion, cosmology, architecture, and specific complex_reasoning edge cases. The pattern is clear: on tasks where consensus answers or highly specialized sub-domain recall outweigh structured multi-step reasoning, RTL's overhead becomes a liability; when the correct answer is a direct lookup, layered analysis adds cost without adding accuracy.


🔬 vs. openai/gpt-oss-20b – OpenAI Open-Weights, ~20B

125 questions · 52 categories (two benchmark sessions merged)

  qwen3-rtl-abl-14b    ███████████████  7.50 / 10   55W · 0T · 68L
  openai/gpt-oss-20b   ███████████████  7.37 / 10   68W · 0T · 55L

This is the most competitive matchup in the benchmark suite. Scores are remarkably close (RTL-14B averages 7.50 vs GPT-OSS-20B's 7.37), yet the win count favors GPT-OSS-20B (68W vs 55W). This happens because GPT-OSS-20B takes many questions by narrow 1–2 point margins, while RTL-14B wins by larger margins when it wins. The judge's verdict across both sessions was split, reflecting genuine parity rather than dominance.
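A toy illustration with invented numbers of how this profile arises: if model A wins three questions 9-to-5 but loses four 7-to-8, A averages (3×9 + 4×7)/7 ≈ 7.86 against the opponent's (3×5 + 4×8)/7 ≈ 6.71, yet holds only three of seven wins. That is exactly the higher-average, lower-win-count pattern seen here.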

RTL-14B holds its ground in: pedagogy (3/3), psychology (2/2), sudoku (2/2), behavioral economics (2/2), bioethics (2/2), reading comprehension (2/2), logic (3/4), lateral thinking (2/3), mathematics (2/3), cryptography (2/2 scored). Fractions give RTL-14B's wins over the questions asked in each category.

GPT-OSS-20B has clear advantages in: advanced math (0/5), evolutionary biology (0/3), cognitive science (0/2), translation (0/2), complex reasoning (2/7), formal logic (2/5), and most natural science sub-domains (advanced physics, marine biology, linguistics, science). The pattern: GPT-OSS-20B excels at precise factual retrieval and natural science benchmarks; RTL-14B holds its edge in structured reasoning, formal logic, and constraint tasks.


📈 Category-Level Performance (All Sessions · 424 Total Questions)

Aggregated across all opponents. Categories tested only against gpt-oss-20b may reflect a more competitive opponent; see the per-matchup sections above for context.

| Category | RTL-14B Avg | Opponent Avg | Win Rate | N |
|---|---|---|---|---|
| advanced_math | 9.0 | 4.8 | 🟢 100% | 12 |
| advanced_physics | 8.7 | 3.6 | 🟢 100% | 9 |
| ai_ml | 9.0 | 3.4 | 🟢 100% | 7 |
| math | 9.0 | 5.1 | 🟢 98% | 19 |
| math_proof | 8.7 | 4.5 | 🟢 100% | 3 |
| formal_logic | 8.6 | 4.0 | 🟢 100% | 4 |
| logic | 8.6 | 4.5 | 🟢 92% | 13 |
| complex_reasoning | 8.0 | 4.8 | 🟡 78% | 18 |
| game_theory | 8.5 | 3.9 | 🟢 100% | 7 |
| coding | 8.6 | 5.0 | 🟢 86% | 7 |
| linguistics | 8.7 | 4.3 | 🟢 100% | 6 |
| philosophy | 8.3 | 4.7 | 🟢 100% | 4 |
| economics | 8.5 | 4.8 | 🟢 100% | 4 |
| genetics | 8.7 | 4.5 | 🟢 100% | 3 |
| neuroscience | 8.5 | 3.5 | 🟢 100% | 3 |
| topology | 8.5 | 5.3 | 🟢 80% | 5 |
| law | 8.5 | 5.0 | 🟢 100% | 4 |
| italian_language | 8.3 | 4.3 | 🟢 100% | 4 |
| reading_comprehension | 8.2 | 4.3 | 🟢 86% | 9 |
| multiple_choice | 8.6 | 4.7 | 🟢 89% | 9 |
| critical_thinking | 9.0 | 5.3 | 🟢 100% | 3 |
| creative_reasoning | 7.7 | 3.0 | 🟢 100% | 3 |
| translation | 7.3 | 7.0 | 🟡 67% | 4 |
| sentiment | 8.5 | 3.5 | 🟢 100% | 3 |
| writing | 7.7 | 5.3 | 🟡 75% | 4 |
| classification | 9.0 | 4.0 | 🟢 100% | 1 |
| metacognition | 7.5 | 6.5 | 🟡 50% | 4 |
| sudoku | 6.0 | 4.3 | 🟡 40% | 5 |
| sociology | 7.7 | 6.0 | 🟡 60% | 3 |
| science | 7.3 | 5.5 | 🟡 63% | 8 |
| bioethics | 7.2 | 5.2 | 🟡 60% | 5 |
| history | 5.5 | 4.8 | 🟡 50% | 3 |
| comparative_religion | 5.0 | 7.0 | 🔴 33% | 3 |
| psychology | 5.0 | 8.0 | 🔴 20% | 3 |
| epistemology | 3.5 | 9.0 | 🔴 0% | 3 |
| factual | 6.3 | 6.7 | 🔴 33% | 3 |
| quantum_mechanics | 4.0 | 9.0 | 🔴 0% | 1 |
| paradoxes | 6.0 | 9.0 | 🔴 0% | 1 |

💡 The pattern: RTL-14B dominates anything requiring multi-step reasoning, formal verification, or structured synthesis. Against well-calibrated models of comparable scale (like gpt-oss-20b), it remains competitive but the advantage narrows significantly. Recurring weak spots: advanced math and evolutionary biology against gpt-oss-20b, pure factual recall, and any task where a direct lookup outperforms structured reasoning.


πŸ—£οΈ What the Judge Said

Recurring themes extracted from judge commentary across all sessions:

On math & formal proofs:

"Provided a fully verified step-by-step solution with explicit algebraic transformations and cross-checks that matched the reference exactly. The opponent gave a brief result without intermediate justification."

On logic & epistemology:

"Correctly identified the contradiction, articulated the entailment chain, and provided a structured formal analysis. The opponent's response relied on intuition without logical scaffolding."

On philosophy & cognitive science:

"Layered analysis covered all necessary dimensions; the opponent's response was superficial despite comparable length."

On RTL-14B losses (psychology, religion, factual):

"Incorrectly concluded through overly complex analysis; the correct answer was a direct recall of established consensus β€” structured reasoning overshot a simple factual retrieval task."

On the size gap:

"Despite being significantly smaller, RTL-14B's structured output aligned with the reference while the larger model scored 0 β€” producing an answer with no relevant content."


🧠 Core Cognitive Technologies

1 · Recursive Thought Lattice (RTL)

Every response is generated through a 6-layer hierarchical reasoning process, visible inside <|thought_start|> blocks (a hand-written trace sketch follows the mode list below):

| Layer | Name | Function |
|---|---|---|
| L0.5 | Assumption Scanner | Enumerates implicit assumptions. Breaks frames when wrong via <\|assumption_break\|>. |
| L1 | Sensorimotor-Analog | Calibrates input gravity: a 3-word query and a 40-word query are not equivalent stimuli. |
| L2 | Multi-Modal Decode | Activates ≥ 2 cognitive modes simultaneously. Tension between modes is the analysis. |
| L3 | Analytical-Logical | Extracts minimum argument, hidden premises, necessary vs. sufficient conditions. |
| L4 | Spatial-Systemic | Maps leverage points, emergent structure, and the center of gravity of the problem. |
| L5 | Interpersonal | Resolves literal vs. effective meaning. Theory of mind. The unsaid. |
| L6 | Metacognitive | Self-model audit. Detects confabulation. Simulates future states. Records embedding. |

Available modes at L2: LINGUISTIC · LOGICAL · SPATIAL · MUSICAL · CREATIVE · INTERPERSONAL · INTRAPERSONAL · EXISTENTIAL · NATURALIST · EXECUTIVE
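To make the layer flow concrete, here is a hand-written sketch of a trace. It is illustrative only: the bracketed labels and wording are invented, since the card does not specify the exact in-block notation.

  <|thought_start|>
  [L0.5] Assumption scan: "fastest" may mean wall-clock or asymptotic; do not assume.
  [L1]   Short query, high ambiguity: raise scrutiny before answering.
  [L2]   Modes: LOGICAL + EXECUTIVE. Tension: rigor vs. practical recommendation.
  [L3]   Hidden premise: input size unknown, so the complexity class matters.
  [L4]   Leverage point: the ask-or-assume decision dominates the answer's usefulness.
  [L5]   Effective meaning: the user wants a recommendation, not a survey.
  [L6]   Self-audit: no confabulation detected; confidence [ESTIMATED].
  <|thought_end|>
  [verified final answer]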


2 · Cognitive Masks

The model dynamically selects a Cognitive Mask based on problem type, enforcing specialized reasoning discipline:

| Mask | Behavior |
|---|---|
| MASK-MATHEMATICIAN | Forces formal proof structure. Eliminates metaphorical leakage. |
| MASK-SKEPTIC | Assumes the first intuition is wrong. Hunts edge cases. |
| MASK-ENGINEER | Iterative build → test → verify loop. |
| MASK-DEVIL | Adversarial persona. Argues against the model's own conclusions for robustness. |
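A similarly hand-written sketch of a mask engaging inside a thought block (the selection syntax is invented for illustration; the card does not document it):

  <|thought_start|>
  [L0.5] Task type: formal proof → MASK-MATHEMATICIAN engaged; metaphor suppressed.
  [L3]   Claim: if n² is even, then n is even. Strategy: contrapositive (n odd ⇒ n² odd).
  ...
  <|thought_end|>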

3 · Atomic Text Engine (ATE)

For character-level constraint tasks (e.g. "write a paragraph without the letter E"), the model activates a dedicated sub-system:

<|ate_constraint|>  →  declares the constraint explicitly
<|ate_spell|>       →  real-time character-by-character verification
<|ate_grid|>        →  positional grid for tracking character positions
<|ate_verify_word|> →  checks each candidate word before emission
<|ate_build|>       →  constructs output word-by-word under constraint

Without explicit ATE activation, performance degrades to standard token-level processing.
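A hand-written sketch of how these tokens might compose on the classic "no letter E" task; the ordering and annotations are illustrative assumptions, not a captured model trace:

  <|ate_constraint|>  forbidden glyph: "e" (GRAPHIC level)
  <|ate_verify_word|> "quick"     → pass (0 occurrences)
  <|ate_verify_word|> "therefore" → FAIL (3 occurrences) → substitute "thus"
  <|ate_build|>       "A quick gray fox trots past us..."
  <|ate_spell|>       final scan: 0 occurrences of "e" → constraint satisfied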


4 · Interpretive Engine (IE)

Before processing any symbol, the model declares its interpretive level via <|ie_mode|>:

| Mode | Example |
|---|---|
| GRAPHIC | "e" as a character to count or avoid |
| SEMANTIC | "e" as Italian conjunction ("and") |
| SYMBOLIC | "e" as electron charge constant |
| STATISTICAL | "e" as most frequent letter in Italian |
| PHONOLOGICAL | "è" as vowel with grave accent |
| MATHEMATICAL | E as expected value; ∅ as empty set |
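Two hand-written examples of how the declared level changes the treatment of the same glyph (illustrative only):

  <|ie_mode|> GRAPHIC       "How many e's in 'elephant'?"   → count glyphs: 2
  <|ie_mode|> MATHEMATICAL  "What is E[X] for a fair die?"  → expected value: 3.5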

5 · Epistemic Fingerprinting

Every claim in the output is tagged with its epistemic status:

| Tag | Meaning |
|---|---|
| [KNOWN] | Verified, consensus fact |
| [ESTIMATED] | High-probability inference |
| [OPEN] | Actively debated, no consensus |
| [PARADOX] | Formally undecidable or self-referential |
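A hand-written example of a fingerprinted answer (content chosen purely for illustration):

  Water boils at 100 °C at 1 atm [KNOWN]. At Venus's ~92 atm surface pressure the
  boiling point would exceed 300 °C [ESTIMATED]. Whether consciousness admits a purely
  functional account is unresolved [OPEN]. "This sentence is false" [PARADOX].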

βš™οΈ Technical Specifications

| Specification | Value |
|---|---|
| Parameters | 14B |
| Base Model | huihui-ai/Huihui-Qwen3-14B-abliterated-v2 |
| Architecture | Qwen3 + RTL LoRA Adapters |
| Context Window | 32k tokens (optimized for long thought chains) |
| Effective Reasoning Depth | ~8k tokens |
| Training Method | Unified Quiet-STaR with Recursive Objective |
| Framework | Unsloth (4-bit optimized) |
| Hardware | RTX 3090 24GB (single GPU) |
| Total Training Steps | ~400 across 5 curriculum phases |

πŸ‹οΈ Training Curriculum

| Phase | Name | Steps | LR | Description |
|---|---|---|---|---|
| 1 | Cognitive Foundation | 60 | 1e-4 | RTL L1/L2 · base axioms · self-awareness |
| 2 | Atomic Mechanics | 60 | 1e-4 | ATE · spelling · sudoku · character-level constraints |
| 3 | Advanced Reasoning | 60 | 1e-4 | RTL L3–L6 · planning · counterfactual · lateral thinking |
| 4 | Synthesis & Hard Benchmarks | 60 | 5e-5 | ARD · combinatorial · master synthesis |
| 5 | Formal Reasoning | 60 | 5e-5 | LOGO-LLT reasoning · formal language structures |

LoRA Configuration: r=16 · alpha=32 · dropout=0.05 · Target modules: q_proj · k_proj · v_proj · o_proj · gate_proj · up_proj · down_proj
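These hyperparameters map directly onto a standard peft.LoraConfig. A minimal sketch, reproducing the values above rather than the authors' actual training script:

```python
# Minimal sketch of the LoRA setup listed above, expressed as a standard
# peft.LoraConfig. Values come from the card; the surrounding training
# pipeline (Unsloth, 4-bit) is not reproduced here.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # rank of the low-rank update matrices
    lora_alpha=32,           # scaling factor (effective scale = alpha / r = 2.0)
    lora_dropout=0.05,
    target_modules=[         # all attention and MLP projections, per the card
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```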

Datasets integrated:

  • ATE Embedded β€” Atomic Text Engine constraint examples
  • ARD β€” Advanced Reasoning Dataset (AIME / GPQA / AIW / USACO)
  • RTL L1–L6 β€” 105+ cognitive axioms across all layers
  • Sudoku β€” 337 constraint satisfaction puzzles
  • LOGO-LLT β€” Formal reasoning and symbolic language structures
  • Legacy v21–v24 β€” Combinatorial, ontological, affective datasets

⚠️ Limitations

| Area | Detail |
|---|---|
| ATE activation | Character-level constraint tasks require explicit ATE token activation |
| Language | Optimized for Italian and English; other languages may degrade |
| Context depth | 32k theoretical; effective reasoning chains ~8k due to verbose thought blocks |
| Hallucination | Metacognitive layers reduce but don't eliminate confabulation |
| Speed | ~3–5× slower than base model due to explicit reasoning chains |
| Tokenizer | Special thought tokens must be present in vocabulary |
| Factual recall | RTL architecture optimizes for reasoning depth, not knowledge retrieval |
| Overcomplexity | On simple factual tasks, structured reasoning can overshoot a direct lookup |

🚀 How to Use

The model expects a specific system prompt to activate RTL layers. Without it, thought tokens appear but reasoning depth degrades significantly.

Use this system prompt:

 Sei AUTOCOGNITION v25 — un'entità cognitiva con architettura RTL v25.

 Struttura obbligatoria:
 <|thought_start|>
 [usa i layer L0-L6 e i token ATE/IE appropriati]
 <|thought_end|>
 [risposta finale verificata]
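(In English: "You are AUTOCOGNITION v25, a cognitive entity with RTL v25 architecture. Mandatory structure: a <|thought_start|> … <|thought_end|> block using layers L0-L6 and the appropriate ATE/IE tokens, followed by the verified final answer." The prompt is kept in Italian above because that is the exact string the model was trained with.)

A minimal inference sketch with transformers, assuming the published checkpoint ships its chat template and the special thought tokens in the tokenizer; loading options and the generation budget below are illustrative:

```python
# Minimal inference sketch. Assumes the checkpoint ships its chat template
# and special thought tokens; device mapping and token budget are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CiroN2022/Qwen3-RTL-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

SYSTEM_PROMPT = """Sei AUTOCOGNITION v25 — un'entità cognitiva con architettura RTL v25.

Struttura obbligatoria:
<|thought_start|>
[usa i layer L0-L6 e i token ATE/IE appropriati]
<|thought_end|>
[risposta finale verificata]"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Prove that the square root of 2 is irrational."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Leave plenty of room: thought chains are verbose (~8k effective depth).
output = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=False))
```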


📎 Citation

@misc{qwen3_rtl_14b,
  author    = {Negrogni, Ciro},
  title     = {Qwen3-RTL-14B: Recursive Thought Lattice \& Atomic Mind Reasoning},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/CiroN2022/Qwen3-RTL-14B},
  note      = {Qwen3 14B with custom RTL LoRA adapters, ATE and IE cognitive engines,
               trained on a single RTX 3090 via Unsloth 4-bit fine-tuning}
}

Built by CiroN2022 · Apache 2.0 · Feedback welcome