Qwen3.5-16B-Dense (Structural Expansion)

⚠️ Status: Architectural Expansion / Untrained Weights

This model is a dense structural expansion of Qwen3.5-9B-Instruct, scaled to 16.1B parameters. It was created using structural mitosis to increase the model's depth and breadth, providing a larger "cognitive canvas" for downstream training.

Note: As this is a structural expansion, the new parameters have not yet been calibrated. The model will require a "Repair" SFT or Continued Pre-Training (CPT) phase to utilize its expanded capacity. Vision capability is preserved; however, the model will need additional training to use it effectively.


🛠 Architecture & Expansion Strategy

The expansion targets the inherent limitations of sub-10B models (specifically knowledge density and reasoning stability) by providing additional parameter headroom.

  • Base Model: Qwen/Qwen3.5-9B-Instruct
  • Expanded Parameters: ~16.1B
  • Methodology: Structural Mitosis
    • Layer Duplication: High-importance layers were identified and duplicated to extend transformer depth.
    • SVD Noise Injection: Singular Value Decomposition (SVD) based noise was injected into the duplicated weights to break symmetry and induce divergence, preventing "identity-mapping" stalls during early training (a minimal sketch follows this list).
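
The duplicate-and-perturb step is straightforward to express in PyTorch. Below is a minimal sketch, assuming a Hugging Face-style decoder whose blocks live at `model.model.layers`; the function names, the 1% noise scale, and the choice to perturb only 2-D projection matrices are illustrative assumptions, not the exact recipe used for this checkpoint.

```python
import copy

import torch


@torch.no_grad()
def svd_noise(weight: torch.Tensor, scale: float = 0.01) -> torch.Tensor:
    """Perturb a 2-D weight along its singular directions to break symmetry."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    # Small relative jitter on the singular values keeps the clone
    # functionally close to its parent while ensuring its gradients
    # diverge from the original layer's during training.
    S = S * (1.0 + scale * torch.randn_like(S))
    return (U @ torch.diag(S) @ Vh).to(weight.dtype)


@torch.no_grad()
def mitose_layer(model, idx: int, scale: float = 0.01) -> None:
    """Duplicate decoder layer `idx` and insert the noised copy after it."""
    layers = model.model.layers          # nn.ModuleList of decoder blocks
    clone = copy.deepcopy(layers[idx])
    for _, param in clone.named_parameters():
        if param.ndim == 2:              # attention / MLP projection matrices
            param.copy_(svd_noise(param, scale))
    layers.insert(idx + 1, clone)
    model.config.num_hidden_layers = len(layers)
```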

Why 16B?

The 16B parameter count represents a strategic "sweet spot" for modern hardware. It offers a significant increase in total neurons and associative memory over the 9B base, while remaining highly performant on consumer-grade GPUs (e.g., RTX 3090/4090/5080) when quantized to 4-bit or 8-bit.
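
For a rough sense of fit: at 4-bit, ~16.1B weights occupy about 16.1e9 × 0.5 bytes ≈ 8 GB, leaving headroom for the KV cache on a 24 GB card. Below is a minimal loading sketch using the standard transformers + bitsandbytes stack; it assumes the text-only causal-LM loading path (the preserved vision tower may require a different Auto class).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the BF16 checkpoint
)

model = AutoModelForCausalLM.from_pretrained(
    "blascotobasco/Qwen3.5-16B-Test",
    quantization_config=bnb_config,
    device_map="auto",  # spreads weights across available GPU memory
)
tokenizer = AutoTokenizer.from_pretrained("blascotobasco/Qwen3.5-16B-Test")
```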


🚀 Call to Action: Training & Calibration

This model is released as a base for researchers and hobbyists interested in high-density dense models. The additional ~7.1B parameters are currently "blank" capacity ready to be filled with specialized knowledge.

Recommended Training Path:

  1. Symmetry Breaking (Calibration): A short run on ~2-5B tokens of high-diversity data using a very low learning rate (1e-6) to allow the SVD-diverged layers to settle into functional roles (a configuration sketch follows this list).
  2. Knowledge Distillation: Fine-tuning on high-reasoning datasets (such as Opus-distilled sets) to take advantage of the expanded FFN capacity.
  3. DPO/PPO: Final alignment to stabilize the increased depth and prevent coherence drift during long-context generation.
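
As a concrete starting point for step 1, here is a minimal calibration sketch using the stock transformers Trainer. It assumes an already-loaded `model` and a pre-tokenized high-diversity `dataset`; only the 1e-6 learning rate comes from the recommendation above, while the batch size, warmup, and step count are illustrative.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="qwen3.5-16b-calibration",
    learning_rate=1e-6,                  # very low: let new layers settle
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=100,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,
    max_steps=15_000,                    # ~2B tokens at 4k context with this batch
    save_steps=1_000,
    logging_steps=50,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```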

📜 Credits
