# CPT Training Different Modules Guide
## Overview
By default, the CPT (Continual Pre-Training) configuration in `/workspace/Trainer-kit/CPT/config.yaml` trains only **attention projection layers** using LoRA adapters. This guide explains how to modify the configuration to train other modules.
## Current Default Configuration
```yaml
peft:
  enabled: true
  target_modules: "auto"
```
When `target_modules: "auto"` is set, the script automatically detects and trains these attention layers:
- `q_proj` - Query projection
- `k_proj` - Key projection
- `v_proj` - Value projection
- `o_proj` - Output projection
## How to Train Other Modules
### Method 1: Explicit Target Modules
Replace `"auto"` with a list of the specific module names you want to train:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"  # Add MLP down projection
    - "mlp.gate_proj"  # Add MLP gate projection
    - "mlp.up_proj"    # Add MLP up projection
```
Note that PEFT matches each list entry as a suffix of the full module path, so `"mlp.down_proj"` matches `model.layers.0.mlp.down_proj` (and plain `"down_proj"` would too).
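The suffix-matching rule above can be sketched in a few lines of Python; the helper name `matches_target` is ours, but the rule mirrors how PEFT resolves a `target_modules` list against `model.named_modules()` names:

```python
def matches_target(module_name: str, targets: list[str]) -> bool:
    """PEFT-style matching for a target_modules list: an entry matches a
    module whose full dotted name equals it or ends with "." + entry."""
    return any(
        module_name == t or module_name.endswith("." + t) for t in targets
    )

# Full dotted names as they appear in model.named_modules()
print(matches_target("model.layers.0.self_attn.q_proj", ["q_proj"]))      # True
print(matches_target("model.layers.3.mlp.down_proj", ["mlp.down_proj"]))  # True
print(matches_target("model.layers.3.mlp.down_proj", ["up_proj"]))        # False
```

This is why a bare suffix like `"down_proj"` is usually enough, and why a more qualified entry like `"mlp.down_proj"` is useful when the same leaf name appears in more than one place.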
### Method 2: Custom Module Lists
For different model architectures, here are common modules you can train:
#### LLaMA/Llama-style Models
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
#### Qwen-style Models
Qwen2-style models use the same projection names as LLaMA, so the configuration above applies unchanged.
#### Mixtral-style (MoE) Models
Gemma also follows the LLaMA naming above. Mixtral's expert weights live under `block_sparse_moe.experts`; because PEFT matches list entries as name suffixes (literal `*` wildcards in a list are not expanded), target the per-expert projections by their final names:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "w1"  # per-expert gate projection
    - "w2"  # per-expert down projection
    - "w3"  # per-expert up projection
```
## Module Types You Can Train
### 1. Attention Layers
- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV projection (e.g. Phi-3)
- `c_attn` - Combined attention projection in GPT-2-style models
### 2. MLP/Feed-Forward Layers
- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection
- `mlp.down_proj` - Down projection
- `fc1` / `fc2` - First/second feed-forward layer (e.g. OPT, Phi)
- `w1`, `w2`, `w3` - Mixtral-style per-expert naming
### 3. Embedding Layers
LoRA can adapt embedding and output-head layers directly:
```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"  # Token embeddings
    - "lm_head"             # Language model head
```
If you want these layers trained fully rather than through low-rank adapters, PEFT's `modules_to_save` option is the usual route.
### 4. Normalization Layers
LoRA adapters only wrap linear and embedding layers, so norm layers (`input_layernorm`, `post_attention_layernorm`, `final_layernorm`) cannot go in `target_modules`; PEFT will reject them. Train them fully via `modules_to_save` instead (this assumes the config forwards the key to `LoraConfig`):
```yaml
peft:
  enabled: true
  target_modules: "auto"
  modules_to_save:  # trained fully, not via low-rank adapters
    - "input_layernorm"
    - "post_attention_layernorm"
    - "final_layernorm"
```
### 5. MoE (Mixture of Experts) Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "w1"    # per-expert gate projection
    - "w2"    # per-expert down projection
    - "w3"    # per-expert up projection
    - "gate"  # expert routing gate (block_sparse_moe.gate in Mixtral)
```
Be cautious with the routing gate: it is small, and adapting it changes which experts receive tokens.
## Advanced Configuration Examples
### Train Multiple Layer Types
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
Normalization layers are not linear, so they belong in `modules_to_save` rather than `target_modules`.
### Conservative Approach (Only MLPs)
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
### Comprehensive Approach (All Main Layers)
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  modules_to_save:  # norm layers cannot take LoRA adapters; train them fully
    - "input_layernorm"
    - "post_attention_layernorm"
```
## How to Find Module Names for Your Model
### Method 1: Automatic Detection
Run the script once with `target_modules: "auto"`; it will log which modules it found:
```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```
### Method 2: Manual Inspection
Inspect your model's structure directly. Loading with `AutoModelForCausalLM` (rather than bare `AutoModel`) ensures the `lm_head` shows up too:
```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

# Print every Linear layer's full dotted name -- the usual LoRA targets
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
```
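The dump above repeats the same leaf names once per layer. A small helper (ours, not part of the script) can reduce it to the unique trailing names, which are exactly the candidates for `target_modules`:

```python
def candidate_targets(module_names: list[str]) -> list[str]:
    """Collect unique leaf names (the part after the last dot) from a
    model.named_modules() dump, skipping bare layer indices."""
    seen: list[str] = []
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]
        if leaf and not leaf.isdigit() and leaf not in seen:
            seen.append(leaf)
    return seen

names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",  # duplicate leaf, kept once
]
print(candidate_targets(names))  # ['q_proj', 'k_proj', 'gate_proj']
```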
### Method 3: Use the Script's Built-in Function
The script includes an `_infer_target_modules()` helper that can identify available modules.
## Considerations
### 1. Memory Usage
- **More modules = more memory**: Training additional layers requires more GPU memory
- **Monitor VRAM usage**: Use `nvidia-smi` to watch memory consumption
- **Adjust batch size**: You may need to reduce `per_device_train_batch_size`
### 2. Training Time
- **More modules = longer training**: Each additional layer increases computation time
- **Learning rate adjustments**: You might need to reduce `learning_rate` when training more layers
### 3. Performance Trade-offs
- **Attention only**: Fast training; adapts how the model routes and mixes information
- **MLP only**: Fast training; adapts where factual knowledge is stored
- **Attention + MLP**: Slower, but usually a better quality-for-cost balance
- **All layers**: Slowest but most comprehensive adaptation
### 4. Model Architecture Differences
Different model families use different module naming conventions:
- **LLaMA, Qwen, Gemma**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Mixtral**: `block_sparse_moe.experts.<i>.w1`, `w2`, `w3`
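These conventions can be encoded as a simple lookup keyed by the Hugging Face `config.model_type`. The table below is a hypothetical sketch that mirrors (but is not) PEFT's built-in mapping; always verify the names against your actual model:

```python
# Hypothetical lookup from HF config.model_type to common LoRA targets.
COMMON_TARGETS = {
    "llama":   ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "qwen2":   ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "gemma":   ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "mixtral": ["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
}

def targets_for(model_type: str) -> list[str]:
    # Fall back to attention-only targets for unknown architectures
    return COMMON_TARGETS.get(model_type, ["q_proj", "k_proj", "v_proj", "o_proj"])

print(targets_for("mixtral"))
```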
## Best Practices
### 1. Start Conservative
Begin with just the attention layers, then gradually add more modules if needed:
```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add an MLP projection
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add the rest of the MLP if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```
### 2. Monitor Overfitting
- Use an evaluation split to monitor performance
- Lower the `learning_rate` if overfitting occurs
- Consider raising `lora_dropout` to reduce overfitting
### 3. Resource Management
- Start with a small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` if you reduce the batch size
- Monitor GPU memory usage throughout training
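To reason about how rank and module count affect memory, it helps to count the extra parameters LoRA adds. A minimal estimator (the function name and shapes are ours for illustration):

```python
def lora_param_count(r: int, shapes: list[tuple[int, int]]) -> int:
    """Extra trainable parameters LoRA adds: for each targeted Linear of
    shape (out_features, in_features), A is (r, in) and B is (out, r),
    so each layer contributes r * (in + out) parameters."""
    return sum(r * (in_f + out_f) for out_f, in_f in shapes)

# e.g. q_proj and v_proj in a 4096-dim model, rank 16:
print(lora_param_count(16, [(4096, 4096), (4096, 4096)]))  # 2 * 16 * 8192 = 262144
```

Doubling the rank or the number of targeted layers roughly doubles the adapter size, which is why `r: 16` is a sensible starting point when many modules are targeted.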
### 4. Model-Specific Tuning
Different models may benefit from different module combinations:
- **Code models**: Focus on attention + MLP layers
- **Chat models**: Attention layers are most important
- **Reasoning models**: All layers might be beneficial
## Example: Training Custom Modules
### Complete Configuration Example
```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"
peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  modules_to_save:      # norm layers cannot take LoRA adapters; train them fully
    - "input_layernorm"
train:
  num_train_epochs: 2
  learning_rate: 1e-5  # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```
This configuration trains:
- All attention projection layers
- All MLP projection layers
- Input normalization layers

The learning rate is reduced to accommodate the additional trainable parameters. Always test with a small number of steps first to ensure your configuration works correctly before running full training.
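Since the `train:` block above uses Hugging Face `TrainingArguments`-style keys, a quick smoke test might look like the fragment below (this assumes `max_steps` and `logging_steps` are passed through to the trainer):

```yaml
train:
  max_steps: 20               # overrides num_train_epochs for a quick dry run
  per_device_train_batch_size: 1
  logging_steps: 1            # confirm the loss is logged and finite
```

If the dry run completes and the loss is a finite, decreasing number, restore the original `num_train_epochs` and launch the full run.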