CPT Training Different Modules Guide
Overview
By default, the CPT (Continual Pre-Training) configuration in /workspace/Trainer-kit/CPT/config.yaml trains only attention projection layers using LoRA adapters. This guide explains how to modify the configuration to train other modules.
Current Default Configuration
```yaml
peft:
  enabled: true
  target_modules: "auto"
```
When target_modules: "auto" is set, the script automatically detects and trains these attention layers:
- `q_proj` - Query projection
- `k_proj` - Key projection
- `v_proj` - Value projection
- `o_proj` - Output projection
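As a rough sketch of what this auto-detection can look like, the snippet below scans module names for known projection suffixes. The module names and the `infer_attention_targets` helper are illustrative, not taken from the actual script:

```python
# Hypothetical sketch of "auto" detection: scan module names for
# known attention-projection suffixes.
ATTENTION_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj")

def infer_attention_targets(module_names):
    """Return the unique projection suffixes present among the names."""
    found = set()
    for name in module_names:
        for suffix in ATTENTION_SUFFIXES:
            if name.endswith(suffix):
                found.add(suffix)
    return sorted(found)

# Example module names as they might appear in a transformer model
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
]
print(infer_attention_targets(names))
# ['k_proj', 'o_proj', 'q_proj', 'v_proj']
```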
How to Train Other Modules
Method 1: Explicit Target Modules
Replace "auto" with a list of specific module names you want to train:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"  # Add MLP down projection
    - "mlp.gate_proj"  # Add MLP gate projection
    - "mlp.up_proj"    # Add MLP up projection
```
Method 2: Custom Module Lists
For different model architectures, here are common modules you can train:
LLaMA/Qwen-style Models
LLaMA and Qwen use the same projection naming, so one list covers both:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
Mixtral-style (MoE) Models
Gemma follows the same LLaMA/Qwen naming above; Mixtral-style MoE models instead expose per-expert projections:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.experts.*.w1"  # Expert projection w1
    - "mlp.experts.*.w2"  # Expert projection w2
    - "mlp.experts.*.w3"  # Expert projection w3
```
Module Types You Can Train
1. Attention Layers
- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV projection (in some models)
- `c_attn` - Combined attention projection (GPT-2-style models)
2. MLP/Feed-Forward Layers
- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection
- `mlp.down_proj` - Down projection
- `mlp.fc1` - First linear layer (in some architectures)
- `mlp.fc2` - Second linear layer
- `w1`, `w2`, `w3` - Alternative naming (e.g. Mixtral experts)
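To gauge how much adding modules grows the adapter, recall that LoRA adds two low-rank matrices per targeted linear layer, roughly `r * (d_in + d_out)` parameters each. A back-of-envelope sketch with hypothetical LLaMA-7B-like dimensions (the sizes below are assumptions for illustration):

```python
def lora_param_count(shapes, r):
    """LoRA adds two low-rank matrices per targeted linear layer:
    A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) params."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Hypothetical 4096-dim model with 32 layers, rank r=16
hidden, inter, layers, r = 4096, 11008, 32, 16
attn = [(hidden, hidden)] * 4 * layers  # q/k/v/o per layer
mlp = [(hidden, inter), (hidden, inter), (inter, hidden)] * layers  # gate/up/down

print(f"attention only:  {lora_param_count(attn, r):,}")
print(f"attention + MLP: {lora_param_count(attn + mlp, r):,}")
```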
3. Embedding Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"  # Token embeddings
    - "lm_head"             # Language model head
```
4. Normalization Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "input_layernorm"           # Input normalization
    - "post_attention_layernorm"  # Post-attention norm
    - "final_layernorm"           # Final normalization
```

Note: standard LoRA attaches low-rank matrices to linear and embedding layers, so whether normalization weights can go in `target_modules` depends on your `peft` version; some versions expect them under `modules_to_save` instead. Verify before relying on this.
5. MoE (Mixture of Experts) Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.experts.*.w1"  # Expert projection w1
    - "mlp.experts.*.w2"  # Expert projection w2
    - "mlp.experts.*.w3"  # Expert projection w3
    - "mlp.gate"          # Expert routing gate
```
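How wildcard patterns like `mlp.experts.*.w1` are expanded depends on the training script (PEFT itself also accepts plain suffix strings or a regex). A minimal illustration of glob-style matching over hypothetical MoE module names:

```python
from fnmatch import fnmatch

patterns = ["mlp.experts.*.w1", "mlp.gate"]

# Hypothetical MoE module names, for illustration only
names = [
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.w1",
    "model.layers.0.mlp.experts.0.w2",
    "model.layers.0.mlp.experts.1.w1",
]

def matches(name, patterns):
    # Match against the tail of the dotted path, the way
    # suffix-style target_modules matching typically works
    return any(fnmatch(name, "*" + p) for p in patterns)

selected = [n for n in names if matches(n, patterns)]
print(selected)
```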
Advanced Configuration Examples
Train Multiple Layer Types
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```
Conservative Approach (Only MLPs)
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
Comprehensive Approach (All Main Layers)
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```
How to Find Module Names for Your Model
Method 1: Automatic Detection
Run the script once with `target_modules: "auto"`; it will log which modules it found:

```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```
Method 2: Manual Inspection
Inspect your model structure:

```python
from transformers import AutoModelForCausalLM

# Use the causal-LM class so head modules such as lm_head are included
model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

# Print all module names
for name, module in model.named_modules():
    print(name)
```
Method 3: Use PEFT's Built-in Function
The script includes an `_infer_target_modules()` helper that can also identify available modules.
Considerations
1. Memory Usage
- More modules = More memory: Training additional layers requires more GPU memory
- Monitor VRAM usage: Use `nvidia-smi` to monitor memory consumption
- Adjust batch size: You may need to reduce `per_device_train_batch_size`
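For a back-of-envelope VRAM estimate, each trainable parameter typically costs its weight plus a gradient and Adam optimizer states; roughly 16 bytes per parameter under common mixed-precision setups, though the exact figure depends on your optimizer and precision settings (the numbers below are illustrative assumptions):

```python
def adapter_memory_mb(trainable_params, bytes_per_param=16):
    """Rough estimate: bf16 weight + gradient plus fp32 Adam moments
    and master weights come to roughly 16 bytes per trainable param."""
    return trainable_params * bytes_per_param / 1024**2

# e.g. ~40M trainable LoRA params
print(f"{adapter_memory_mb(40_000_000):.0f} MB")  # ~610 MB
```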
2. Training Time
- More modules = Longer training: Each additional layer increases computation time
- Learning rate adjustments: You might need to reduce `learning_rate` when training more layers
3. Performance Trade-offs
- Attention only: Fast training, good for language understanding
- MLP only: Fast training, good for knowledge storage
- Both attention + MLP: Slower but potentially better performance
- All layers: Slowest but most comprehensive adaptation
4. Model Architecture Differences
Different model families use different module naming conventions:
- LLaMA: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- Qwen: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- Gemma: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- Mixtral: `mlp.experts.*.w1`, `mlp.experts.*.w2`, `mlp.experts.*.w3`
Best Practices
1. Start Conservative
Begin with just attention layers, then gradually add more modules if needed:
```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add MLPs
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add more if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```
2. Monitor Overfitting
- Use evaluation split to monitor performance
- Adjust `learning_rate` if overfitting occurs
- Consider increasing `lora_dropout` to reduce overfitting
3. Resource Management
- Start with a small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` if you reduce the batch size
- Monitor GPU memory usage throughout training
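When you shrink the per-device batch, raise `gradient_accumulation_steps` to keep the effective batch size constant (multiplied by the number of GPUs when training data-parallel):

```python
def effective_batch_size(per_device, accum_steps, num_gpus=1):
    """Effective batch = per-device batch x accumulation steps x GPUs."""
    return per_device * accum_steps * num_gpus

# Keeping the effective batch at 16 while shrinking the per-device batch
assert effective_batch_size(4, 4) == effective_batch_size(1, 16) == 16
print(effective_batch_size(1, 16))  # 16
```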
4. Model-Specific Tuning
Different models may benefit from different module combinations:
- Code models: Focus on attention + MLP layers
- Chat models: Attention layers are most important
- Reasoning models: All layers might be beneficial
Example: Training Custom Modules
Complete Configuration Example
```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"

train:
  num_train_epochs: 2
  learning_rate: 1e-5  # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```
This configuration will train:
- All attention projection layers
- All MLP projection layers
- Input normalization layers

It also uses a reduced learning rate to accommodate the additional trainable parameters.
Remember to always test with a small number of steps first to ensure your configuration works correctly before running full training.
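One low-risk way to do such a dry run is to cap the step count; the exact key depends on your trainer (Hugging Face `TrainingArguments` calls it `max_steps`), so treat this fragment as a hypothetical example:

```yaml
train:
  max_steps: 10  # smoke test only; remove for the full run
```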