# CPT Training Different Modules Guide

## Overview

By default, the CPT (Continual Pre-Training) configuration in `/workspace/Trainer-kit/CPT/config.yaml` trains only **attention projection layers** using LoRA adapters. This guide explains how to modify the configuration to train other modules.

## Current Default Configuration

```yaml
peft:
  enabled: true
  target_modules: "auto"
```

When `target_modules: "auto"` is set, the script automatically detects and trains these attention layers:

- `q_proj` - Query projection
- `k_proj` - Key projection
- `v_proj` - Value projection
- `o_proj` - Output projection

## How to Train Other Modules

### Method 1: Explicit Target Modules

Replace `"auto"` with a list of the specific module names you want to train:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"  # Add MLP down projection
    - "mlp.gate_proj"  # Add MLP gate projection
    - "mlp.up_proj"    # Add MLP up projection
```

### Method 2: Architecture-Specific Module Lists

Different model families expose different module names. Common choices:

#### LLaMA-, Qwen-, and Gemma-style Models

These dense architectures share the same attention and MLP naming:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

#### Mixtral-style (MoE) Models

Mixtral routes MLP computation through experts; in the Hugging Face implementation the expert projections live under `block_sparse_moe`:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "block_sparse_moe.experts.*.w1"  # Expert layer 1
    - "block_sparse_moe.experts.*.w2"  # Expert layer 2
    - "block_sparse_moe.experts.*.w3"  # Expert layer 3
```

Note that wildcard patterns such as `experts.*.w1` only work if the training script expands them; plain PEFT matches list entries as name suffixes, so verify these against your model's `named_modules()` output.

## Module Types You Can Train

### 1. Attention Layers

- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV projection (in some models)
- `c_attn` - Combined attention projection in older models
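When `target_modules` is given as a list, PEFT-style libraries resolve each entry by matching it against the *end* of the dotted module path, which is why a bare `q_proj` hits the query projection in every layer. A minimal sketch of that matching logic (the module names below are illustrative, Llama-style, not taken from the script):

```python
# Sketch of PEFT-style suffix matching for a target_modules list.
# Names are illustrative Llama-style paths, not from a real model.
MODULE_NAMES = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.input_layernorm",
]

def match_target_modules(names, targets):
    """Return names whose dotted path ends with one of the targets,
    mirroring how PEFT resolves a target_modules *list*."""
    return [
        n for n in names
        if any(n == t or n.endswith("." + t) for t in targets)
    ]

attn_only = match_target_modules(
    MODULE_NAMES, ["q_proj", "k_proj", "v_proj", "o_proj"]
)
print(len(attn_only))  # 4: only the attention projections match
```

Because matching is suffix-based, `mlp.down_proj` is stricter than `down_proj`: the former only matches down projections that actually sit inside a submodule named `mlp`.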
### 2. MLP/Feed-Forward Layers

- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection
- `mlp.down_proj` - Down projection
- `mlp.fc1` - First feed-forward layer
- `mlp.fc2` - Second feed-forward layer
- `w1`, `w2`, `w3` - Alternative naming (e.g. Mixtral experts)

### 3. Embedding Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"  # Token embeddings
    - "lm_head"             # Language model head
```

### 4. Normalization Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "input_layernorm"           # Input normalization
    - "post_attention_layernorm"  # Post-attention norm
    - "final_layernorm"           # Final normalization
```

Note that LoRA itself applies low-rank updates to linear layers; whether normalization layers can be listed here depends on the script. In stock PEFT they are typically trained in full via `modules_to_save` rather than `target_modules`.

### 5. MoE (Mixture of Experts) Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "block_sparse_moe.experts.*.w1"  # Expert layer 1
    - "block_sparse_moe.experts.*.w2"  # Expert layer 2
    - "block_sparse_moe.experts.*.w3"  # Expert layer 3
    - "block_sparse_moe.gate"          # Expert routing gate
```

## Advanced Configuration Examples

### Conservative Approach (Only MLPs)

```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

### Comprehensive Approach (All Main Layers)

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```

## How to Find Module Names for Your Model

### Method 1: Automatic Detection

Run the script once with `target_modules: "auto"`; it will log which modules it found:

```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```

### Method 2: Manual Inspection

Inspect your model structure:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("/workspace/Models/YourModel")

# Print all module names
for name, module in model.named_modules():
    print(name)
```
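A useful variation on the dump above is to narrow the output to LoRA-eligible layers: filter `named_modules()` to `nn.Linear` and collect the unique trailing names. Sketched here on a toy module (the toy structure is an assumption for illustration; a model loaded with `AutoModel.from_pretrained` exposes the same `named_modules()` API):

```python
import torch.nn as nn

# Toy stand-in for a transformer block; only the naming matters here.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.o_proj = nn.Linear(8, 8)
        self.gate_proj = nn.Linear(8, 16)

model = nn.ModuleList([Block() for _ in range(2)])

# Unique trailing names of all Linear layers: candidate target_modules.
suffixes = sorted({
    name.split(".")[-1]
    for name, mod in model.named_modules()
    if isinstance(mod, nn.Linear)
})
print(suffixes)  # ['gate_proj', 'o_proj', 'q_proj']
```

The resulting suffixes are exactly the short names (`q_proj`, `gate_proj`, ...) that a `target_modules` list expects.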
### Method 3: Use the Script's Built-in Helper

The script includes a `_infer_target_modules()` function that can help identify available modules.

## Considerations

### 1. Memory Usage

- **More modules = more memory**: training additional layers requires more GPU memory
- **Monitor VRAM usage**: use `nvidia-smi` to watch memory consumption
- **Adjust batch size**: you may need to reduce `per_device_train_batch_size`

### 2. Training Time

- **More modules = longer training**: each additional layer increases computation time
- **Learning rate adjustments**: you might need to reduce `learning_rate` when training more layers

### 3. Performance Trade-offs

- **Attention only**: fast training, good for language understanding
- **MLP only**: fast training, good for knowledge storage
- **Attention + MLP**: slower but potentially better performance
- **All layers**: slowest but most comprehensive adaptation

### 4. Model Architecture Differences

Different model families use different module naming conventions:

- **LLaMA**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Qwen**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Gemma**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Mixtral**: `block_sparse_moe.experts.*.w1`, etc.

## Best Practices

### 1. Start Conservative

Begin with just attention layers, then gradually add more modules if needed:

```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add MLPs
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add more if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```

### 2. Monitor Overfitting

- Use an evaluation split to monitor performance
- Lower `learning_rate` if overfitting occurs
- Consider increasing `lora_dropout` to reduce overfitting
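The memory and time costs above scale with the number of adapted weight matrices. A back-of-envelope sketch of LoRA trainable-parameter counts makes the trade-off concrete (the dimensions are illustrative, roughly a 7B Llama-style model, and grouped-query attention is ignored for simplicity):

```python
# Each LoRA-adapted Linear of shape (out, in) adds r * (in + out)
# trainable weights: A is (r x in), B is (out x r).
# Dimensions are illustrative, not from a specific checkpoint.
HIDDEN, INTERMEDIATE, LAYERS, R = 4096, 11008, 32, 16

def lora_params(shapes, r=R):
    return sum(r * (i + o) for (i, o) in shapes)

attn = [(HIDDEN, HIDDEN)] * 4  # q/k/v/o projections
mlp = [(HIDDEN, INTERMEDIATE)] * 2 + [(INTERMEDIATE, HIDDEN)]  # gate/up/down

attn_only = LAYERS * lora_params(attn)
attn_plus_mlp = LAYERS * lora_params(attn + mlp)

print(f"attention only:  {attn_only:,}")       # 16,777,216
print(f"attention + MLP: {attn_plus_mlp:,}")   # 39,976,960
```

Adding the three MLP projections here roughly doubles the adapter size, and optimizer state grows with it, which is why the phased approach above pairs extra modules with batch-size and learning-rate adjustments.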
### 3. Resource Management

- Start with a small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` if you reduce the batch size
- Monitor GPU memory usage throughout training

### 4. Model-Specific Tuning

Different models may benefit from different module combinations:

- **Code models**: focus on attention + MLP layers
- **Chat models**: attention layers are most important
- **Reasoning models**: all layers might be beneficial

## Example: Training Custom Modules

### Complete Configuration Example

```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"

train:
  num_train_epochs: 2
  learning_rate: 1e-5  # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```

This configuration trains:

- All attention projection layers
- All MLP projection layers
- Input normalization layers

It uses a reduced learning rate to accommodate the additional trainable parameters. Always test with a small number of steps first to make sure your configuration works before launching a full run.
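One cheap pre-flight check before that full run is confirming that every configured target actually matches something in the model, so a typo or an architecture mismatch fails immediately instead of silently training nothing. A sketch, using a hypothetical helper name and a toy model in place of the real checkpoint:

```python
import torch.nn as nn

def check_targets(model, targets):
    """Raise if any configured target matches no module in the model.
    Hypothetical helper; matching mirrors PEFT's suffix rule for lists."""
    names = [n for n, _ in model.named_modules()]
    missing = [
        t for t in targets
        if not any(n == t or n.endswith("." + t) for n in names)
    ]
    if missing:
        raise ValueError(f"target_modules not found in model: {missing}")

# Toy model with Llama-style attention naming.
toy = nn.ModuleDict({"q_proj": nn.Linear(4, 4), "o_proj": nn.Linear(4, 4)})

check_targets(toy, ["q_proj", "o_proj"])   # passes silently
try:
    check_targets(toy, ["mlp.down_proj"])  # typical typo / arch mismatch
except ValueError as e:
    print("caught:", e)
```

Running the same check against the model named in `repo_id`, with the `target_modules` list from the config, takes seconds and catches most naming mistakes before any GPU time is spent.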