Napoleon-3.5-12B
A 12B model built by surgically expanding Qwen3.5-9B from 32 to 40 layers.
Napoleon-3.5-12B is a research model that explores a different route to making language models smarter: instead of training a bigger model from scratch, we identify the most important parts of an existing model using mathematical analysis, duplicate those parts, and then train only the new copies to specialize. The vision encoder is untouched, so multimodal capabilities (image/video) are fully preserved.
How it works: the Napoleon method
The core idea
A language model is like a stack of processing layers. Each layer transforms your input a little more, adding understanding, context, and reasoning. But not all layers are equally important: some are critical to the model's intelligence, while others are nearly redundant.
Napoleon's approach: Find the most important layers, copy them, and train only the copies. This adds capacity exactly where the model needs it most.
Step 1: X-ray the model (SVD Cartography)
Before changing anything, we need to understand what each layer does. We use a mathematical technique called Singular Value Decomposition (SVD); think of it as an X-ray for neural networks. For each of the 32 layers, we measure:
- How much information it compresses (stable rank, spectral entropy): layers that compress a lot are doing heavy lifting
- What happens if we remove it (skip-loss delta): if removing a layer breaks the model, that layer is critical
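As a rough illustration of what these spectral measurements look like in code (the exact formulas used for the atlas may differ; stable rank and spectral entropy are shown here in their common textbook forms):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank = ||W||_F^2 / ||W||_2^2. High values mean the
    layer spreads its energy across many directions."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

def spectral_entropy(W: np.ndarray) -> float:
    """Shannon entropy of the normalized squared singular values.
    Low entropy means the layer compresses information into few directions."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s ** 2 / (s ** 2).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# An identity matrix spreads energy perfectly evenly: maximal stable rank
# and maximal entropy for its size.
W = np.eye(4)
print(stable_rank(W))       # 4.0
print(spectral_entropy(W))  # log(4) ~= 1.386
```

In practice these metrics would be computed per weight matrix (attention and MLP projections) and aggregated per layer.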
This produces an atlas: a complete map of the model's internal structure. Here's what we found for Qwen3.5-9B:
| Block | Role | Skip-loss delta | Verdict |
|---|---|---|---|
| Layers 0-4 | Input processing | +5.20 (model breaks without them) | Critical: duplicate these |
| Layers 28-32 | Output refinement | +2.28 (high transformation score) | Critical: duplicate these |
| Layers 8-16 | Middle processing | ~0 (model barely notices removal) | Redundant: leave alone |
Step 2: Validate with experiments (Ablation Runs)
We don't blindly trust the X-ray. We ran 6 different experiments, each training the model with different layer selection strategies (all layers, only MLPs, only attention, late layers only, etc.) to confirm that the SVD-guided selection actually picks the best layers. This step validates the atlas before we commit to the expensive duplication.
Step 3: Duplicate and train (Block Duplication + LoRA)
Now the surgery:
Copy the critical blocks. Layers 0-3 are duplicated and inserted at positions 4-7. Layers 28-31 are duplicated at positions 36-39. The model grows from 32 to 40 layers (~9B to ~12B parameters).
Freeze everything except the copies. The original 32 layers keep their weights untouched. We only train the 8 new layers using LoRA (Low-Rank Adaptation), a lightweight training technique that adds small trainable matrices (~8.2M parameters, just 0.07% of the total model).
Train the copies to specialize. Using 98,000 samples (math reasoning + general knowledge), the duplicated layers learn to add value on top of what the originals already do. Training took ~10 hours on a single NVIDIA GB10 GPU.
The result: a model with more depth and capacity at the points that matter most, without retraining the entire network.
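The surgery above can be sketched on a toy layer stack. This is a minimal illustration, not the actual Napoleon code: plain `nn.Linear` modules stand in for transformer blocks, and the LoRA wiring is omitted (the copies are simply left fully trainable here):

```python
import copy
import torch.nn as nn

def duplicate_blocks(layers, spans):
    """Freeze the originals, then insert a trainable deep copy of each
    (start, end) span of layers right after that span."""
    out = list(layers)
    for layer in out:
        for p in layer.parameters():
            p.requires_grad = False
    # Process later spans first so earlier insertions don't shift indices.
    for start, end in sorted(spans, reverse=True):
        copies = [copy.deepcopy(out[i]) for i in range(start, end)]
        for c in copies:
            for p in c.parameters():
                p.requires_grad = True  # only the new copies train
        out[end:end] = copies
    return nn.ModuleList(out)

# Toy 32-layer stack; duplicate layers 0-3 and 28-31 as in Napoleon.
stack = nn.ModuleList(nn.Linear(8, 8) for _ in range(32))
expanded = duplicate_blocks(stack, [(0, 4), (28, 32)])
print(len(expanded))  # 40: copies of 0-3 land at 4-7, copies of 28-31 at 36-39
trainable = sum(p.requires_grad for layer in expanded for p in layer.parameters())
print(trainable)  # 16 tensors (8 copied Linear layers x weight+bias)
```

Inserting the later span first keeps the span indices valid, which is why the copies of layers 28-31 end up at positions 36-39 after the copies of 0-3 push everything four slots down.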
Architecture
| | Qwen3.5-9B (base) | Napoleon-3.5-12B |
|---|---|---|
| Layers | 32 | 40 (+8 duplicated) |
| Parameters | ~9B | ~12B |
| Architecture | Hybrid (3x DeltaNet + 1x Full Attention, repeating) | Same pattern, extended |
| Vision encoder | Qwen3.5 ViT (27 layers) | Unchanged |
| Context length | 262,144 tokens | 262,144 tokens |
Qwen3.5's hybrid design: Unlike standard transformers, Qwen3.5 alternates between fast linear-attention layers (DeltaNet) and full-attention layers in a 3:1 pattern. When duplicating blocks, we preserve this pattern: the duplicated layers inherit the same attention type as their originals.
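Assuming `layer_types` is a flat list of per-layer strings (as in typical Hugging Face configs; the exact string values below are illustrative placeholders), the 40-entry list follows from duplicating the type entries along with the blocks:

```python
def expand_layer_types(types, spans):
    """Duplicate the type entries for each (start, end) span so the
    copied layers inherit the attention type of their originals."""
    out = list(types)
    # Later spans first, so earlier insertions don't shift indices.
    for start, end in sorted(spans, reverse=True):
        out[end:end] = out[start:end]
    return out

# Qwen3.5's 3:1 pattern: three linear-attention (DeltaNet) layers,
# then one full-attention layer, repeated across 32 layers.
base = (["linear_attention"] * 3 + ["full_attention"]) * 8
expanded = expand_layer_types(base, [(0, 4), (28, 32)])
print(len(expanded))                     # 40
print(expanded.count("full_attention"))  # 10 (8 original + 2 duplicated)
```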
Benchmark Results
Evaluation on GSM8K (grade-school math, 30 samples, non-thinking mode):
| Model | GSM8K accuracy | Inference time |
|---|---|---|
| Qwen3.5-9B (baseline) | 29/30 = 96.7% | 1,213s |
| Napoleon-3.5-12B | 23/30 = 76.7% | 4,163s |
| Delta | -20.0% | 3.4x slower |
Honest assessment
The duplicated model shows degraded math performance compared to the strong Qwen3.5-9B baseline. This is expected for an initial experiment:
- Insufficient training: Only 500 steps with LoRA on 0.07% of parameters. The duplicated layers need more training to learn to work with the rest of the network.
- Inference overhead: 8 additional layers increase latency ~3.4x without KV-cache optimization for the new architecture.
- Base model is already excellent: Qwen3.5-9B scores 96.7% on GSM8K β there's very little room for improvement on this benchmark.
This is a research model. The contribution is the Napoleon method itself (SVD-guided layer duplication as a scaling technique), not the final benchmark score. Future iterations with longer training, better datasets, and architecture-aware KV-cache could close the gap.
Training details
- Dataset: 75% nvidia/OpenMathReasoning + 25% mlabonne/FineTome-100k (98,000 samples)
- Steps: 500, batch size 8, learning rate 5e-5 (cosine decay), warmup 10 steps
- Optimizer: adamw_8bit, bf16 precision
- LoRA config: r=16, alpha=16, applied to `in_proj_qkv`, `out_proj`, `gate_proj`, `up_proj`, `down_proj` on duplicated layers only
- Hardware: NVIDIA GB10 (DGX Spark, 120GB unified memory)
- Training time: ~10 hours for 500 steps
- Training loss: 1.39 → 0.50 (64% reduction over training)
SVD Atlas
The full SVD cartography results for all 32 original layers are available in `atlas.json`. Key findings:
- Baseline perplexity loss: 1.6754
- Most critical blocks: 0-4 (skip_delta=+5.20), 28-32 (skip_delta=+2.28)
- Most compressible blocks: 8-16 (near-zero skip delta; removing them barely affects output)
- SVD-guided target layers: [2, 3, 5, 7, 11, 15, 19, 23, 26, 27, 29, 31]
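The skip-loss delta can be estimated by temporarily bypassing a block and re-measuring the loss. Below is a toy sketch of the idea, not the atlas code itself (a real measurement would run the actual model over a held-out perplexity set):

```python
import torch
import torch.nn as nn

class Skippable(nn.Module):
    """Wraps a layer so it can be bypassed (replaced by identity) on demand."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer
        self.skip = False

    def forward(self, x):
        return x if self.skip else self.layer(x)

def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x

def skip_loss_delta(layers, loss_fn, x, y, start, end):
    """Loss increase when layers [start, end) are bypassed."""
    with torch.no_grad():
        base = loss_fn(run(layers, x), y).item()
        for layer in layers[start:end]:
            layer.skip = True
        skipped = loss_fn(run(layers, x), y).item()
        for layer in layers[start:end]:
            layer.skip = False  # restore the model
    return skipped - base

# Toy stack: 8 skippable blocks, MSE loss against a fixed target.
torch.manual_seed(0)
layers = nn.ModuleList(Skippable(nn.Linear(4, 4)) for _ in range(8))
x, y = torch.randn(16, 4), torch.randn(16, 4)
delta = skip_loss_delta(layers, nn.MSELoss(), x, y, 0, 2)
```

A large positive delta flags the block as critical; a delta near zero flags it as compressible.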
Usage
This repo contains the full merged model (40 layers, ~21 GB); just load and use:
Text generation
```python
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "baconnier/Napoleon-3.5-12B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("baconnier/Napoleon-3.5-12B")

# For text-only, use the inner tokenizer (the processor is multimodal)
text_tok = processor.tokenizer

# Non-thinking mode: pre-fill empty <think> block to skip chain-of-thought
prompt = (
    "<|im_start|>user\n"
    "Explain quantum entanglement in simple terms.\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<think>\n\n</think>\n\n"
)

inputs = text_tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(text_tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Multimodal (image/video)
The Qwen3.5 vision encoder is fully preserved:
```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What do you see in this image?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt", enable_thinking=False,
)
inputs = {k: v.to(model.device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
Serving with vLLM
```bash
vllm serve baconnier/Napoleon-3.5-12B \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --reasoning-parser qwen3
```
Important note on Qwen3.5 hybrid architecture
Qwen3.5 uses a hybrid attention architecture, alternating between DeltaNet (linear attention) and full-attention layers in a 3:1 pattern. The `config.json` in this repo contains the correct 40-entry `layer_types` list that reflects the duplicated blocks. This is handled automatically by `from_pretrained`.
Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 21.2 GB | Full merged model (40 layers, LoRA weights baked in) |
| `adapter_model.safetensors` | 31 MB | Standalone LoRA adapter (for advanced users who want to apply it differently) |
| `config.json` | 3 KB | Model config with correct 40-layer architecture and `layer_types` |
| `atlas.json` | 167 KB | Complete SVD cartography of all 32 original Qwen3.5-9B layers |
| `benchmark_gsm8k_30samples.json` | – | Raw GSM8K evaluation results |
| `trainer_state.json` | 19 KB | Full training log (500 steps) |
What's next
- Longer training (5,000+ steps) for better layer integration
- Architecture-aware KV-cache to fix the inference overhead on duplicated layers
- Benchmarks on broader tasks (MMLU, coding, multilingual)
- Exploring duplication of compressible (middle) layers as "cheap capacity" vs critical layers
Citation
```bibtex
@misc{napoleon3,
  title={Napoleon 3: SVD-Guided Layer Duplication for LLM Scaling},
  author={Lo\"{i}c Baconnier},
  year={2026},
  url={https://huggingface.co/baconnier/Napoleon-3.5-12B}
}
```