# CPT Training Different Modules Guide
## Overview
By default, the CPT (Continual Pre-Training) configuration in `/workspace/Trainer-kit/CPT/config.yaml` trains only **attention projection layers** using LoRA adapters. This guide explains how to modify the configuration to train other modules.
## Current Default Configuration
```yaml
peft:
  enabled: true
  target_modules: "auto"
```
When `target_modules: "auto"` is set, the script automatically detects and trains these attention layers (a sketch of how such detection can work follows the list):
- `q_proj` - Query projection
- `k_proj` - Key projection
- `v_proj` - Value projection
- `o_proj` - Output projection
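For reference, the sketch below illustrates one way such suffix-based detection can be implemented. It is not necessarily the CPT script's exact logic (that lives in its `_infer_target_modules()` helper), and the model path is a placeholder:
```python
# Illustrative sketch only: infer attention-projection targets by suffix.
import torch.nn as nn
from transformers import AutoModelForCausalLM

ATTENTION_SUFFIXES = {"q_proj", "k_proj", "v_proj", "o_proj", "qkv_proj", "c_attn"}

model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")  # placeholder path

found = set()
for name, module in model.named_modules():
    # Standard LoRA adapters wrap nn.Linear layers, so only consider those
    if isinstance(module, nn.Linear) and name.rsplit(".", 1)[-1] in ATTENTION_SUFFIXES:
        found.add(name.rsplit(".", 1)[-1])

print(f"Inferred target_modules: {sorted(found)}")
```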
## How to Train Other Modules
### Method 1: Explicit Target Modules
Replace `"auto"` with a list of specific module names you want to train:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"   # Add MLP down projection
    - "mlp.gate_proj"   # Add MLP gate projection
    - "mlp.up_proj"     # Add MLP up projection
```
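Assuming the trainer forwards this list to a standard `peft.LoraConfig` (an assumption about the script's internals), the equivalent Python looks roughly like the sketch below. Note that with a list, PEFT matches each entry against the *end* of the full module path, so `mlp.down_proj` and plain `down_proj` both hit `model.layers.N.mlp.down_proj`:
```python
# Hedged sketch of the equivalent PEFT setup; hyperparameters are illustrative
# and the model path is a placeholder.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check what is actually trainable
```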
### Method 2: Custom Module Lists
For different model architectures, here are common modules you can train:
#### LLaMA- and Qwen-style Models
LLaMA and Qwen models share the same attention and MLP module names:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
#### Mixtral-style (MoE) Models
Mixture-of-experts models such as Mixtral name the per-expert projections `w1`/`w2`/`w3`; the exact parent path depends on the implementation (recent `transformers` releases use `block_sparse_moe.experts`), so verify it against your model. Gemma is a dense model and follows the LLaMA-style naming above.
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.experts.*.w1"   # per-expert w1 (gate) projection
    - "mlp.experts.*.w2"   # per-expert w2 (down) projection
    - "mlp.experts.*.w3"   # per-expert w3 (up) projection
```
## Module Types You Can Train
### 1. Attention Layers
- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV (in some models)
- `c_attn` - Combined attention projection in GPT-2-style models
### 2. MLP/Feed-Forward Layers
- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection
- `mlp.down_proj` - Down projection
- `mlp.fc1` - First feed-forward layer (in some architectures)
- `mlp.fc2` - Second feed-forward layer (in some architectures)
- `w1`, `w2`, `w3` - Alternative naming (e.g. per-expert projections in MoE models)
### 3. Embedding Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"   # Token embeddings
    - "lm_head"              # Language model head
```
### 4. Normalization Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "input_layernorm"            # Input normalization
    - "post_attention_layernorm"   # Post-attention norm
    - "final_layernorm"            # Final normalization
```
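Caution: standard PEFT LoRA adapters wrap linear and embedding layers, and many PEFT versions reject LayerNorm modules listed in `target_modules`. If the CPT script forwards extra LoRA settings (an assumption worth verifying), `modules_to_save` is the usual way to make embeddings, the LM head, or normalization layers fully trainable and saved alongside the adapter:
```python
# Hedged sketch: full fine-tuning of selected modules via modules_to_save,
# alongside LoRA on the attention projections. Verify that the CPT config
# actually exposes modules_to_save before relying on this.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # These modules are trained in full (no low-rank decomposition) and saved:
    modules_to_save=["embed_tokens", "lm_head", "input_layernorm"],
    task_type="CAUSAL_LM",
)
```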
### 5. MoE (Mixture of Experts) Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.experts.*.w1"   # per-expert w1 (gate) projection
    - "mlp.experts.*.w2"   # per-expert w2 (down) projection
    - "mlp.experts.*.w3"   # per-expert w3 (up) projection
    - "mlp.gate"           # Expert routing gate (router)
```
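Note that with a plain list, PEFT matches entries against the end of each module path, so wildcard-style entries such as `mlp.experts.*.w1` may not match anything. Passing `target_modules` as a single string makes PEFT interpret it as a regular expression over the full module path instead; whether the CPT script passes a string through unchanged is an assumption to verify:
```python
# Hedged sketch: regex-string form of target_modules for MoE expert projections.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    # Matches e.g. model.layers.0.block_sparse_moe.experts.3.w1 (Mixtral naming)
    target_modules=r".*\.experts\.\d+\.(w1|w2|w3)",
    task_type="CAUSAL_LM",
)
```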
## Advanced Configuration Examples
### Train Multiple Layer Types
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```
### Conservative Approach (Only MLPs)
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
### Comprehensive Approach (All Main Layers)
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```
## How to Find Module Names for Your Model
### Method 1: Automatic Detection
Run the script once with `target_modules: "auto"` - it will log which modules it found:
```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```
### Method 2: Manual Inspection
Inspect your model structure:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

# Print every module name so you can pick valid LoRA targets
for name, module in model.named_modules():
    print(name)
```
### Method 3: Use the Script's Helper Function
The CPT script includes an `_infer_target_modules()` helper that can identify which modules are available to target.
## Considerations
### 1. Memory Usage
- **More modules = More memory**: Training additional layers requires more GPU memory
- **Monitor VRAM usage**: Use `nvidia-smi` (or the small helper sketched below) to monitor memory consumption
- **Adjust batch size**: You may need to reduce `per_device_train_batch_size`
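If you prefer watching memory from inside the training process rather than a separate `nvidia-smi` window, a small helper along these lines (a generic sketch, not part of the CPT script) can be called between steps or from a callback:
```python
# Hedged helper sketch for quick VRAM checks from Python.
import torch

def report_vram(tag: str = "") -> None:
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[vram] {tag}: allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")

report_vram("after model load")
```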
### 2. Training Time
- **More modules = Longer training**: Each additional layer increases computation time
- **Learning rate adjustments**: You might need to reduce `learning_rate` when training more layers
### 3. Performance Trade-offs
- **Attention only**: Fast training, good for language understanding
- **MLP only**: Fast training, good for knowledge storage
- **Both attention + MLP**: Slower but potentially better performance
- **All layers**: Slowest but most comprehensive adaptation
### 4. Model Architecture Differences
Different model families use different module naming conventions:
- **LLaMA**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Qwen**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Gemma**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Mixtral**: per-expert `w1`, `w2`, `w3` projections under the experts module (verify the exact parent path with the snippet below)
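A quick way to confirm which convention your model follows is to print the distinct leaf names of its linear layers (a generic sketch; the model path is a placeholder):
```python
# Hedged sketch: list the unique linear-layer suffixes, which are the names
# you would put in target_modules.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

suffixes = {
    name.rsplit(".", 1)[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(sorted(suffixes))
```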
## Best Practices
### 1. Start Conservative
Begin with just attention layers, then gradually add more modules if needed:
```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
# Phase 2: Add MLPs
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]
# Phase 3: Add more if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```
### 2. Monitor Overfitting
- Use evaluation split to monitor performance
- Adjust `learning_rate` if overfitting occurs
- Consider `lora_dropout` to reduce overfitting
### 3. Resource Management
- Start with small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` when reducing batch size, so the effective batch size stays roughly constant (see the arithmetic sketch below)
- Monitor GPU memory usage throughout training
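When trading batch size for accumulation, the quantity to keep roughly constant is the effective batch size. A trivial worked example (numbers are illustrative):
```python
# Effective batch size = per-device batch size x accumulation steps x number of GPUs.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1  # adjust to your setup

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(f"Effective batch size: {effective_batch_size}")  # 16
```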
### 4. Model-Specific Tuning
Different models may benefit from different module combinations:
- **Code models**: Focus on attention + MLP layers
- **Chat models**: Attention layers are most important
- **Reasoning models**: All layers might be beneficial
## Example: Training Custom Modules
### Complete Configuration Example
```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"

train:
  num_train_epochs: 2
  learning_rate: 1e-5   # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```
This configuration will train:
- All attention projection layers
- All MLP projection layers
- Input normalization layers (see the caveat about norm layers and `modules_to_save` above)

It uses a reduced learning rate to accommodate the additional trainable parameters.
Remember to always test with a small number of steps first to ensure your configuration works correctly before running full training.
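One standalone way to do that, separate from the trainer itself, is a quick smoke test that applies the LoRA configuration and runs a single forward/backward pass on a dummy batch. This is only a hedged sketch: the model path and hyperparameters mirror the example above, and it is not the CPT script's own test mode.
```python
# Hedged smoke-test sketch: confirm the LoRA targets apply and fit in memory
# before launching a full CPT run. Norm layers are omitted here because
# standard PEFT LoRA may not wrap LayerNorm (see the modules_to_save note above).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

batch = tokenizer("A short smoke-test sequence.", return_tensors="pt").to(model.device)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
print(f"Smoke-test loss: {loss.item():.4f}")
```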