# Qwen3.5-16B-Dense (Structural Expansion)

⚠️ Status: Architectural Expansion / Untrained Weights
This model is a dense structural expansion of Qwen3.5-9B-Instruct, scaled to 16.1B parameters. It was created using structural mitosis to increase the model's depth and width, providing a larger "cognitive canvas" for downstream training.
Note: As this is a structural expansion, the new parameters have not yet been calibrated. The model will require a "Repair" SFT or Continued Pre-Training (CPT) phase to utilize its expanded capacity. Vision capability is preserved; however, the model will need training to use it effectively.
## Architecture & Expansion Strategy
The expansion targets the inherent limitations of sub-10B models, specifically knowledge density and reasoning stability, by providing additional parameter headroom.
- Base Model: Qwen/Qwen3.5-9B-Instruct
- Expanded Parameters: ~16.1B
- Methodology: Structural Mitosis
  - Layer Duplication: High-importance layers were identified and duplicated to extend transformer depth.
  - SVD Noise Injection: Singular Value Decomposition (SVD) based noise was injected into the duplicated weights to break symmetry and induce divergence, preventing "identity-mapping" stalls during early training.
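The SVD noise-injection step above can be sketched as follows. This is an illustrative assumption about how the technique might work, not the framework's actual implementation; the function name `svd_noise_inject` and the noise scale are invented for the example.

```python
import numpy as np

def svd_noise_inject(weight: np.ndarray, scale: float = 0.02,
                     seed: int = 0) -> np.ndarray:
    """Perturb a duplicated weight matrix along its singular directions.

    Noise is added to the singular values so the copied layer diverges
    from the original while roughly preserving its spectral structure,
    avoiding an exact identity mapping between the twin layers.
    """
    rng = np.random.default_rng(seed)
    U, S, Vh = np.linalg.svd(weight, full_matrices=False)
    # Noise proportional to each singular value: dominant directions
    # shift more in absolute terms, near-zero directions barely move.
    noise = rng.standard_normal(S.shape) * S * scale
    return U @ np.diag(S + noise) @ Vh

# Duplicate a toy projection matrix and break symmetry:
w = np.random.default_rng(1).standard_normal((64, 64))
w_copy = svd_noise_inject(w.copy())
assert not np.allclose(w, w_copy)  # the twin has diverged...
assert np.allclose(np.linalg.norm(w), np.linalg.norm(w_copy), rtol=0.1)  # ...only slightly
```

Scaling the noise by the singular values (rather than adding uniform noise) keeps the perturbation proportional to each direction's importance, which is one plausible way to get divergence without destroying the layer's function.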
### Why 16B?
The 16B parameter count represents a strategic "sweet spot" for modern hardware. It offers a significant increase in total parameters and associative memory over the 9B base, while remaining highly performant on consumer-grade GPUs (e.g., RTX 3090/4090/5080) when quantized to 4-bit or 8-bit.
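As a back-of-envelope check on the VRAM claim, a rough weight-footprint estimate at 16.1B parameters (the 1.1× overhead factor for quantization scales and runtime buffers is an assumption for this sketch):

```python
def quantized_footprint_gb(n_params: float, bits: int,
                           overhead: float = 1.1) -> float:
    """Rough VRAM estimate for quantized weights.

    overhead covers quantization scales/zero-points and runtime
    buffers; real usage also depends on KV cache and context length.
    """
    return n_params * bits / 8 / 1e9 * overhead

print(round(quantized_footprint_gb(16.1e9, 4), 1))  # ~8.9 GB: fits 16 GB cards
print(round(quantized_footprint_gb(16.1e9, 8), 1))  # ~17.7 GB: fits 24 GB cards
```

Under these assumptions, 4-bit fits comfortably on a 16 GB card such as an RTX 5080, while 8-bit calls for a 24 GB card such as an RTX 3090/4090.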
## Call to Action: Training & Calibration
This model is released as a base for researchers and hobbyists interested in high-density dense models. The additional ~7.1B parameters are currently "blank" capacity ready to be filled with specialized knowledge.
Recommended Training Path:

1. Symmetry Breaking (Calibration): A short run on ~2-5B tokens of high-diversity data using a very low learning rate (1e-6) to allow the SVD-diverged layers to settle into functional roles.
2. Knowledge Distillation: Fine-tuning on high-reasoning datasets (such as Opus-distilled sets) to take advantage of the expanded FFN capacity.
3. DPO/PPO: Final alignment to stabilize the increased depth and prevent coherence drift during long-context generation.
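For the calibration step, the token budget translates into optimizer steps roughly as follows. The sequence length and global batch size below are illustrative assumptions, not recommendations from the model authors:

```python
def calibration_steps(total_tokens: float, seq_len: int = 4096,
                      global_batch: int = 128) -> int:
    """Optimizer steps implied by a token budget at a given batch geometry."""
    tokens_per_step = seq_len * global_batch  # tokens consumed per step
    return int(total_tokens // tokens_per_step)

config = {
    "learning_rate": 1e-6,                  # the very low LR from step 1
    "steps_low": calibration_steps(2e9),    # lower end of the 2-5B range
    "steps_high": calibration_steps(5e9),   # upper end
}
print(config)  # roughly 3.8k-9.5k steps at this geometry
```

The point of the sketch is that "2-5B tokens" at this batch geometry is only a few thousand steps, which is consistent with calling it a short symmetry-breaking run rather than a full training phase.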
## Credits
- Base Architecture: Qwen/Qwen3.5-9B-Instruct
- Expansion Framework: Self-designed "Structural Mitosis" framework