Napoleon-3.5-12B
A 12B model built by surgically expanding Qwen3.5-9B from 32 to 40 layers.
Napoleon-3.5-12B is a research model that explores a different route to making language models smarter: instead of training a bigger model from scratch, we identify the most important parts of an existing model using mathematical analysis, duplicate those parts, and then train only the new copies to specialize. The vision encoder is untouched, so multimodal capabilities (image/video) are fully preserved.
How it works: the Napoleon method
The core idea
A language model is like a stack of processing layers. Each layer transforms your input a little more, adding understanding, context, and reasoning. But not all layers are equally important: some are critical to the model's intelligence, while others are nearly redundant.
Napoleon's approach: Find the most important layers, copy them, and train only the copies. This adds capacity exactly where the model needs it most.
Step 1: X-ray the model (SVD Cartography)
Before changing anything, we need to understand what each layer does. We use a mathematical technique called Singular Value Decomposition (SVD); think of it as an X-ray for neural networks. For each of the 32 layers, we measure:
- How much information it compresses (stable rank, spectral entropy): layers that compress a lot are doing heavy lifting
- What happens if we remove it (skip-loss delta): if removing a layer breaks the model, that layer is critical
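As a rough illustration of what these spectral measurements look like in code (the exact formulas used for the atlas may differ; stable rank and spectral entropy are shown here in their common textbook forms):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank = ||W||_F^2 / ||W||_2^2. High values mean the
    layer spreads its energy across many directions."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

def spectral_entropy(W: np.ndarray) -> float:
    """Shannon entropy of the normalized squared singular values.
    Low entropy means the layer compresses information into few directions."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s ** 2 / (s ** 2).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# An identity matrix spreads energy perfectly evenly: maximal stable rank
# and maximal entropy for its size.
W = np.eye(4)
print(stable_rank(W))       # 4.0
print(spectral_entropy(W))  # log(4) ~= 1.386
```

In practice these metrics would be computed per weight matrix (attention and MLP projections) and aggregated per layer.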
This produces an atlas: a complete map of the model's internal structure. Here's what we found for Qwen3.5-9B:
| Block | Role | Skip-loss delta | Verdict |
|---|---|---|---|
| Layers 0-4 | Input processing | +5.20 (model breaks without them) | Critical: duplicate these |
| Layers 28-32 | Output refinement | +2.28 (high transformation score) | Critical: duplicate these |
| Layers 8-16 | Middle processing | ~0 (model barely notices removal) | Redundant: leave alone |
Step 2: Validate with experiments (Ablation Runs)
We don't blindly trust the X-ray. We ran 6 different experiments, each training the model with different layer selection strategies (all layers, only MLPs, only attention, late layers only, etc.) to confirm that the SVD-guided selection actually picks the best layers. This step validates the atlas before we commit to the expensive duplication.
Step 3: Duplicate and train (Block Duplication + LoRA)
Now the surgery:
Copy the critical blocks. Layers 0-3 are duplicated and inserted at positions 4-7. Layers 28-31 are duplicated at positions 36-39. The model grows from 32 to 40 layers (~9B to ~12B parameters).
Freeze everything except the copies. The original 32 layers keep their weights untouched. We only train the 8 new layers using LoRA (Low-Rank Adaptation), a lightweight training technique that adds small trainable matrices (~8.2M parameters, just 0.07% of the total model).
Train the copies to specialize. Using 98,000 samples (math reasoning + general knowledge), the duplicated layers learn to add value on top of what the originals already do. Training took ~10 hours on a single NVIDIA GB10 GPU.
The result: a model with more depth and capacity at the points that matter most, without retraining the entire network.
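The surgery above can be sketched on a toy layer stack. This is a minimal illustration, not the actual Napoleon code: plain `nn.Linear` modules stand in for transformer blocks, and the LoRA wiring is omitted (the copies are simply left fully trainable here):

```python
import copy
import torch.nn as nn

def duplicate_blocks(layers, spans):
    """Freeze the originals, then insert a trainable deep copy of each
    (start, end) span of layers right after that span."""
    out = list(layers)
    for layer in out:
        for p in layer.parameters():
            p.requires_grad = False
    # Process later spans first so earlier insertions don't shift indices.
    for start, end in sorted(spans, reverse=True):
        copies = [copy.deepcopy(out[i]) for i in range(start, end)]
        for c in copies:
            for p in c.parameters():
                p.requires_grad = True  # only the new copies train
        out[end:end] = copies
    return nn.ModuleList(out)

# Toy 32-layer stack; duplicate layers 0-3 and 28-31 as in Napoleon.
stack = nn.ModuleList(nn.Linear(8, 8) for _ in range(32))
expanded = duplicate_blocks(stack, [(0, 4), (28, 32)])
print(len(expanded))  # 40: copies of 0-3 land at 4-7, copies of 28-31 at 36-39
trainable = sum(p.requires_grad for layer in expanded for p in layer.parameters())
print(trainable)  # 16 tensors (8 copied Linear layers x weight+bias)
```

Inserting the later span first keeps the span indices valid, which is why the copies of layers 28-31 end up at positions 36-39 after the copies of 0-3 push everything four slots down.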
Architecture
| | Qwen3.5-9B (base) | Napoleon-3.5-12B |
|---|---|---|
| Layers | 32 | 40 (+8 duplicated) |
| Parameters | ~9B | ~12B |
| Architecture | Hybrid (3x DeltaNet + 1x Full Attention, repeating) | Same pattern, extended |
| Vision encoder | Qwen3.5 ViT (27 layers) | Unchanged |
| Context length | 262,144 tokens | 262,144 tokens |
Qwen3.5's hybrid design: Unlike standard transformers, Qwen3.5 alternates between fast linear-attention layers (DeltaNet) and full-attention layers in a 3:1 pattern. When duplicating blocks, we preserve this pattern: the duplicated layers inherit the same attention type as their originals.
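Assuming `layer_types` is a flat list of per-layer strings (as in typical Hugging Face configs; the exact string values below are illustrative placeholders), the 40-entry list follows from duplicating the type entries along with the blocks:

```python
def expand_layer_types(types, spans):
    """Duplicate the type entries for each (start, end) span so the
    copied layers inherit the attention type of their originals."""
    out = list(types)
    # Later spans first, so earlier insertions don't shift indices.
    for start, end in sorted(spans, reverse=True):
        out[end:end] = out[start:end]
    return out

# Qwen3.5's 3:1 pattern: three linear-attention (DeltaNet) layers,
# then one full-attention layer, repeated across 32 layers.
base = (["linear_attention"] * 3 + ["full_attention"]) * 8
expanded = expand_layer_types(base, [(0, 4), (28, 32)])
print(len(expanded))                     # 40
print(expanded.count("full_attention"))  # 10 (8 original + 2 duplicated)
```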
Benchmark Results
Evaluation on GSM8K (grade-school math, 30 samples, non-thinking mode):
| Model | GSM8K accuracy | Inference time |
|---|---|---|
| Qwen3.5-9B (baseline) | 29/30 = 96.7% | 1,213s |
| Napoleon-3.5-12B | 23/30 = 76.7% | 4,163s |
| Delta | -20.0% | 3.4x slower |
Honest assessment
The duplicated model shows degraded math performance compared to the strong Qwen3.5-9B baseline. This is expected for an initial experiment:
- Insufficient training: Only 500 steps with LoRA on 0.07% of parameters. The duplicated layers need more training to learn to work with the rest of the network.
- Inference overhead: 8 additional layers increase latency ~3.4x without KV-cache optimization for the new architecture.
- Base model is already excellent: Qwen3.5-9B scores 96.7% on GSM8K β there's very little room for improvement on this benchmark.
This is a research model. The contribution is the Napoleon method itself (SVD-guided layer duplication as a scaling technique), not the final benchmark score. Future iterations with longer training, better datasets, and architecture-aware KV-cache could close the gap.
Training details
- Dataset: 75% nvidia/OpenMathReasoning + 25% mlabonne/FineTome-100k (98,000 samples)
- Steps: 500, batch size 8, learning rate 5e-5 (cosine decay), warmup 10 steps
- Optimizer: adamw_8bit, bf16 precision
- LoRA config: r=16, alpha=16, applied to `in_proj_qkv`, `out_proj`, `gate_proj`, `up_proj`, `down_proj` on duplicated layers only
- Hardware: NVIDIA GB10 (DGX Spark, 120GB unified memory)
- Training time: ~10 hours for 500 steps
- Training loss: 1.39 → 0.50 (64% reduction over training)
SVD Atlas
The full SVD cartography results for all 32 original layers are available in `atlas.json`. Key findings:
- Baseline perplexity loss: 1.6754
- Most critical blocks: 0-4 (skip_delta=+5.20), 28-32 (skip_delta=+2.28)
- Most compressible blocks: 8-16 (near-zero skip delta; removing them barely affects output)
- SVD-guided target layers: [2, 3, 5, 7, 11, 15, 19, 23, 26, 27, 29, 31]
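The skip-loss delta can be estimated by temporarily bypassing a block and re-measuring the loss. Below is a toy sketch of the idea, not the atlas code itself (a real measurement would run the actual model over a held-out perplexity set):

```python
import torch
import torch.nn as nn

class Skippable(nn.Module):
    """Wraps a layer so it can be bypassed (replaced by identity) on demand."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer
        self.skip = False

    def forward(self, x):
        return x if self.skip else self.layer(x)

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

def skip_loss_delta(layers, loss_fn, x, y, start, end):
    """Loss increase when layers [start, end) are bypassed."""
    with torch.no_grad():
        base = loss_fn(run(layers, x), y).item()
        for layer in layers[start:end]:
            layer.skip = True
        skipped = loss_fn(run(layers, x), y).item()
        for layer in layers[start:end]:
            layer.skip = False  # restore the model
    return skipped - base

# Toy stack: 8 skippable blocks, MSE loss against a fixed target.
torch.manual_seed(0)
layers = nn.ModuleList(Skippable(nn.Linear(4, 4)) for _ in range(8))
x, y = torch.randn(16, 4), torch.randn(16, 4)
delta = skip_loss_delta(layers, nn.MSELoss(), x, y, 0, 2)
```

A large positive delta flags the block as critical; a delta near zero flags it as compressible.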
Usage
This repo contains the full merged model (40 layers, ~21 GB); just load and use:
Text generation
```python
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "baconnier/Napoleon-3.5-12B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("baconnier/Napoleon-3.5-12B")

# For text-only, use the inner tokenizer (the processor is multimodal)
text_tok = processor.tokenizer

# Non-thinking mode: pre-fill empty <think> block to skip chain-of-thought
prompt = (
    "<|im_start|>user\n"
    "Explain quantum entanglement in simple terms.\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<think>\n\n</think>\n\n"
)

inputs = text_tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(text_tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Multimodal (image/video)
The Qwen3.5 vision encoder is fully preserved:
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt", enable_thinking=False,
)
inputs = {k: v.to(model.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
Serving with vLLM
```bash
vllm serve baconnier/Napoleon-3.5-12B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3
```
Important note on Qwen3.5 hybrid architecture
Qwen3.5 uses a hybrid attention architecture, alternating between DeltaNet (linear attention) and full-attention layers in a 3:1 pattern. The `config.json` in this repo contains the correct 40-entry `layer_types` list that reflects the duplicated blocks. This is handled automatically by `from_pretrained`.
Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 21.2 GB | Full merged model (40 layers, LoRA weights baked in) |
| `adapter_model.safetensors` | 31 MB | Standalone LoRA adapter (for advanced users who want to apply it differently) |
| `config.json` | 3 KB | Model config with correct 40-layer architecture and `layer_types` |
| `atlas.json` | 167 KB | Complete SVD cartography of all 32 original Qwen3.5-9B layers |
| `benchmark_gsm8k_30samples.json` | – | Raw GSM8K evaluation results |
| `trainer_state.json` | 19 KB | Full training log (500 steps) |
What's next
- Longer training (5,000+ steps) for better layer integration
- Architecture-aware KV-cache to fix the inference overhead on duplicated layers
- Benchmarks on broader tasks (MMLU, coding, multilingual)
- Exploring duplication of compressible (middle) layers as "cheap capacity" vs critical layers
Citation
```bibtex
@misc{napoleon3,
  title={Napoleon 3: SVD-Guided Layer Duplication for LLM Scaling},
  author={Lo\"{i}c Baconnier},
  year={2026},
  url={https://huggingface.co/baconnier/Napoleon-3.5-12B}
}
```