# INTELLECT-3-V

A vision-language model created by grafting the language model weights from [INTELLECT-3](https://huggingface.co/PrimeIntellect/INTELLECT-3) into the [GLM-4.6V](https://huggingface.co/THUDM/GLM-4.6V) architecture.
## Motivation

INTELLECT-3 is a strong open-source language model, but it lacks vision capabilities. GLM-4.6V is a vision-language model whose language model architecture is identical. Replacing GLM-4.6V's language model weights with INTELLECT-3's, while preserving the vision encoder and projection layers, yields a vision-language model powered by INTELLECT-3.
## Architecture

Both models share the same language model backbone:

- 46 transformer layers (layer 0 uses a dense MLP; layers 1-45 are MoE)
- Hidden dimension of 4096
- 128 routed experts plus shared experts per MoE layer
- Grouped-Query Attention (q_proj output dim 12288, k/v_proj output dim 1024)
- Vocabulary size of 151552
- BF16 weights

GLM-4.6V additionally includes:

- A 24-layer vision transformer (hidden dim 1536)
- A visual merger projecting vision features into the LLM hidden dimension
- A downsampling convolution for spatial compression
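The attention projection shapes imply the GQA head layout. A minimal sketch of the arithmetic, assuming a head dimension of 128 (the head dimension is not stated in this card, so treat it as an assumption):

```python
# Derive the GQA head layout from the projection shapes listed above.
# HEAD_DIM = 128 is an assumption; the output dims come from the weight
# shapes (q_proj: [12288, 4096], k_proj/v_proj: [1024, 4096]).
HIDDEN_DIM = 4096
Q_PROJ_OUT = 12288   # 3x the hidden dimension
KV_PROJ_OUT = 1024
HEAD_DIM = 128       # assumed

num_q_heads = Q_PROJ_OUT // HEAD_DIM          # 96 query heads
num_kv_heads = KV_PROJ_OUT // HEAD_DIM        # 8 key/value heads
gqa_group_size = num_q_heads // num_kv_heads  # 12 query heads per KV head

print(num_q_heads, num_kv_heads, gqa_group_size)  # 96 8 12
```

Under that assumption, 12 query heads share each key/value head, which is why the k/v projections are so much smaller than the q projection.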
## What Was Grafted

The following weights were copied from INTELLECT-3 into GLM-4.6V:

| INTELLECT-3 | GLM-4.6V |
|-------------|----------|
| `model.layers.*` | `model.language_model.layers.*` |
| `model.norm.weight` | `model.language_model.norm.weight` |
## What Was Preserved (from GLM-4.6V)

- `model.language_model.embed_tokens.weight` — kept to maintain vision token compatibility
- `lm_head.weight` — kept aligned with embed_tokens
- `model.visual.*` — entire vision encoder and merger preserved
## Rationale

**Why replace the final norm?** The RMSNorm after the last transformer layer is tightly coupled to the layer outputs it normalizes. INTELLECT-3's norm was trained end-to-end with its layers and learned to normalize their specific output distribution.

**Why keep embed_tokens?** The vision merger projects visual features into the same embedding space as text tokens. Replacing embed_tokens could break the alignment between text and vision embeddings. Additionally, lm_head is often tied or co-trained with embed_tokens.

**Why not replace lm_head?** Same reasoning — keeping lm_head and embed_tokens together maintains their learned relationship.
## Known Limitations

1. **Embedding space mismatch**: INTELLECT-3's layers learned representations in a potentially different embedding space than GLM-4.6V's. This may degrade both language and vision-language performance.
2. **Vision-language alignment**: The visual merger was trained to project into GLM-4.6V's representation space; INTELLECT-3's layers may expect different internal representations, potentially affecting vision-language tasks.
3. **Tokenizer compatibility**: Both models have the same vocabulary size (151552), but verify tokenizer compatibility for your use case.
## Creation Script

The model was created using `graft_intellect3_to_glm.py`:

```bash
python graft_intellect3_to_glm.py \
  --intellect3 ~/models/INTELLECT-3 \
  --glm ~/models/GLM-4.6V \
  --output ~/models/INTELLECT-3-V
```
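The core of the graft can be sketched in memory with state dicts as plain dicts. This is an illustrative simplification, not the script itself: the real script would stream sharded safetensors files rather than holding two ~100B-parameter models in memory, and the placeholder string "tensors" below stand in for real weights:

```python
def graft(intellect3_sd: dict, glm_sd: dict) -> dict:
    """Return a GLM-4.6V-shaped state dict whose transformer stack and
    final norm come from INTELLECT-3. The vision tower, embeddings, and
    lm_head are kept from GLM-4.6V."""
    out = dict(glm_sd)  # start from GLM-4.6V: vision + embeddings + lm_head
    for key, tensor in intellect3_sd.items():
        if key.startswith("model.layers.") or key == "model.norm.weight":
            out["model.language_model." + key[len("model."):]] = tensor
    return out

# Toy example with placeholder "tensors":
i3 = {"model.layers.0.self_attn.q_proj.weight": "I3_Q",
      "model.norm.weight": "I3_NORM",
      "model.embed_tokens.weight": "I3_EMB"}
glm = {"model.language_model.layers.0.self_attn.q_proj.weight": "GLM_Q",
       "model.language_model.norm.weight": "GLM_NORM",
       "model.language_model.embed_tokens.weight": "GLM_EMB",
       "model.visual.patch_embed.proj.weight": "GLM_VIS",
       "lm_head.weight": "GLM_HEAD"}
merged = graft(i3, glm)
assert merged["model.language_model.norm.weight"] == "I3_NORM"       # grafted
assert merged["model.language_model.embed_tokens.weight"] == "GLM_EMB"  # preserved
```

Note that INTELLECT-3's `model.embed_tokens.weight` is simply never copied, so GLM-4.6V's embeddings, lm_head, and vision weights pass through untouched.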
## Source Model Architectures

### INTELLECT-3

```
lm_head.weight,[151552,4096],BF16
model.embed_tokens.weight,[151552,4096],BF16
model.layers.0.mlp.down_proj.weight,[4096,10944],BF16
model.layers.0.mlp.gate_proj.weight,[10944,4096],BF16
model.layers.0.mlp.up_proj.weight,[10944,4096],BF16
model.layers.[0-45].input_layernorm.weight,[4096],BF16
model.layers.[0-45].post_attention_layernorm.weight,[4096],BF16
model.layers.[0-45].self_attn.k_proj.bias,[1024],BF16
model.layers.[0-45].self_attn.k_proj.weight,[1024,4096],BF16
model.layers.[0-45].self_attn.o_proj.weight,[4096,12288],BF16
model.layers.[0-45].self_attn.q_proj.bias,[12288],BF16
model.layers.[0-45].self_attn.q_proj.weight,[12288,4096],BF16
model.layers.[0-45].self_attn.v_proj.bias,[1024],BF16
model.layers.[0-45].self_attn.v_proj.weight,[1024,4096],BF16
model.layers.[1-45].mlp.experts.[0-127].down_proj.weight,[4096,1408],BF16
model.layers.[1-45].mlp.experts.[0-127].gate_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.experts.[0-127].up_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.gate.e_score_correction_bias,[128],F32
model.layers.[1-45].mlp.gate.weight,[128,4096],BF16
model.layers.[1-45].mlp.shared_experts.down_proj.weight,[4096,1408],BF16
model.layers.[1-45].mlp.shared_experts.gate_proj.weight,[1408,4096],BF16
model.layers.[1-45].mlp.shared_experts.up_proj.weight,[1408,4096],BF16
model.norm.weight,[4096],BF16
```
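As a sanity check, the shapes in the listing can be summed by hand to estimate the total language model parameter count (derived purely from the shapes above, not taken from either model card; the F32 gate-correction biases are counted like any other parameter):

```python
# Sum the INTELLECT-3 shapes listed above into a total parameter count.
V, H = 151552, 4096           # vocab size, hidden dimension
Q, KV = 12288, 1024           # q and k/v projection output dims
DENSE_FF, EXPERT_FF = 10944, 1408
N_LAYERS, N_MOE, N_EXP = 46, 45, 128

embed_and_head = 2 * V * H                    # embed_tokens + lm_head
attn = Q * H + Q + 2 * (KV * H + KV) + H * Q  # q/k/v (with biases) + o_proj
norms = 2 * H                                 # input + post-attention layernorms
dense_mlp = 3 * DENSE_FF * H                  # layer 0: gate/up/down
expert_mlp = 3 * EXPERT_FF * H                # one expert: gate/up/down
moe_mlp = N_EXP * expert_mlp + expert_mlp + (N_EXP * H + N_EXP)  # experts + shared + gate

total = embed_and_head + H + N_LAYERS * (attn + norms) + dense_mlp + N_MOE * moe_mlp
print(f"{total / 1e9:.1f}B parameters")  # 106.9B
```

The result of roughly 107B total parameters is consistent with the shapes being a complete listing of the language model.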
### GLM-4.6V

```
lm_head.weight,[151552,4096],BF16
model.language_model.embed_tokens.weight,[151552,4096],BF16
model.language_model.layers.0.mlp.down_proj.weight,[4096,10944],BF16
model.language_model.layers.0.mlp.gate_proj.weight,[10944,4096],BF16
model.language_model.layers.0.mlp.up_proj.weight,[10944,4096],BF16
model.language_model.layers.[0-45].input_layernorm.weight,[4096],BF16
model.language_model.layers.[0-45].post_attention_layernorm.weight,[4096],BF16
model.language_model.layers.[0-45].self_attn.k_proj.bias,[1024],BF16
model.language_model.layers.[0-45].self_attn.k_proj.weight,[1024,4096],BF16
model.language_model.layers.[0-45].self_attn.o_proj.weight,[4096,12288],BF16
model.language_model.layers.[0-45].self_attn.q_proj.bias,[12288],BF16
model.language_model.layers.[0-45].self_attn.q_proj.weight,[12288,4096],BF16
model.language_model.layers.[0-45].self_attn.v_proj.bias,[1024],BF16
model.language_model.layers.[0-45].self_attn.v_proj.weight,[1024,4096],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].down_proj.weight,[4096,1408],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].gate_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.experts.[0-127].up_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.gate.e_score_correction_bias,[128],F32
model.language_model.layers.[1-45].mlp.gate.weight,[128,4096],BF16
model.language_model.layers.[1-45].mlp.shared_experts.down_proj.weight,[4096,1408],BF16
model.language_model.layers.[1-45].mlp.shared_experts.gate_proj.weight,[1408,4096],BF16
model.language_model.layers.[1-45].mlp.shared_experts.up_proj.weight,[1408,4096],BF16
model.language_model.norm.weight,[4096],BF16
model.visual.blocks.[0-23].attn.proj.weight,[1536,1536],BF16
model.visual.blocks.[0-23].attn.qkv.weight,[4608,1536],BF16
model.visual.blocks.[0-23].mlp.down_proj.weight,[1536,4096],BF16
model.visual.blocks.[0-23].mlp.gate_proj.weight,[4096,1536],BF16
model.visual.blocks.[0-23].mlp.up_proj.weight,[4096,1536],BF16
model.visual.blocks.[0-23].norm[1-2].weight,[1536],BF16
model.visual.downsample.bias,[4096],BF16
model.visual.downsample.weight,[4096,1536,2,2],BF16
model.visual.embeddings.position_embedding.weight,[576,1536],BF16
model.visual.merger.down_proj.weight,[4096,10944],BF16
model.visual.merger.gate_proj.weight,[10944,4096],BF16
model.visual.merger.post_projection_norm.bias,[4096],BF16
model.visual.merger.post_projection_norm.weight,[4096],BF16
model.visual.merger.proj.weight,[4096,4096],BF16
model.visual.merger.up_proj.weight,[10944,4096],BF16
model.visual.patch_embed.proj.bias,[1536],BF16
model.visual.patch_embed.proj.weight,[1536,3,2,14,14],BF16
model.visual.post_conv_layernorm.weight,[1536],BF16
model.visual.post_layernorm.weight,[1536],BF16
```
## License

Please refer to the licenses of the source models:

- [INTELLECT-3 License](https://huggingface.co/PrimeIntellect/INTELLECT-3)
- [GLM-4.6V License](https://huggingface.co/THUDM/GLM-4.6V)

## Acknowledgments

- [Prime Intellect](https://www.primeintellect.ai/) for INTELLECT-3
- [THUDM](https://github.com/THUDM) for GLM-4.6V