
CPT Training Different Modules Guide

Overview

By default, the CPT (Continual Pre-Training) configuration in /workspace/Trainer-kit/CPT/config.yaml trains only attention projection layers using LoRA adapters. This guide explains how to modify the configuration to train other modules.

Current Default Configuration

```yaml
peft:
  enabled: true
  target_modules: "auto"
```

When target_modules: "auto" is set, the script automatically detects and trains these attention layers:

  • q_proj - Query projection
  • k_proj - Key projection
  • v_proj - Value projection
  • o_proj - Output projection
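
The exact detection logic lives in the training script, but a minimal sketch of suffix-based inference (assuming the script scans module names and keeps the known attention projections; the sample names below are illustrative, not taken from a real model) could look like:

```python
# Hypothetical sketch of "auto" target-module inference:
# scan the model's module names and keep the known attention projections.
ATTENTION_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj")

def infer_target_modules(module_names):
    """Return the unique attention-projection leaf names found in the model."""
    found = []
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]  # "model.layers.0.self_attn.q_proj" -> "q_proj"
        if leaf in ATTENTION_SUFFIXES and leaf not in found:
            found.append(leaf)
    return found

# Illustrative module names from a LLaMA-style model:
names = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
]
print(infer_target_modules(names))  # -> ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```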

How to Train Other Modules

Method 1: Explicit Target Modules

Replace "auto" with a list of specific module names you want to train:

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"    # Add MLP down projection
    - "mlp.gate_proj"    # Add MLP gate projection
    - "mlp.up_proj"      # Add MLP up projection
```

Method 2: Custom Module Lists

For different model architectures, here are common modules you can train:

LLaMA/Llama-style Models

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

Qwen-style Models

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

Mixtral-style (MoE) Models

Note: Gemma uses the same LLaMA-style naming as above; the expert layout below applies to MoE models such as Mixtral, which names its MLP block block_sparse_moe.

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "block_sparse_moe.experts.*.w1"    # Expert gate projection
    - "block_sparse_moe.experts.*.w2"    # Expert down projection
    - "block_sparse_moe.experts.*.w3"    # Expert up projection
```

Module Types You Can Train

1. Attention Layers

  • q_proj - Query projections
  • k_proj - Key projections
  • v_proj - Value projections
  • o_proj - Output projections
  • qkv_proj - Combined QKV projection (e.g., Phi-3)
  • c_attn - Combined attention projection in GPT-2-style models

2. MLP/Feed-Forward Layers

  • mlp.gate_proj - Gate projection
  • mlp.up_proj - Up projection
  • mlp.down_proj - Down projection
  • mlp.fc1 - First MLP layer (GPT-style models)
  • mlp.fc2 - Second MLP layer (GPT-style models)
  • w1, w2, w3 - Alternative naming (e.g., Mixtral experts)

3. Embedding Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"  # Token embeddings
    - "lm_head"             # Language model head
```

4. Normalization Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "input_layernorm"          # Input normalization
    - "post_attention_layernorm" # Post-attention norm
    - "final_layernorm"          # Final normalization
```

Note: standard PEFT LoRA attaches adapters only to linear, embedding, and convolution layers. If your PEFT version rejects normalization layers as targets, train them fully via modules_to_save instead.

5. MoE (Mixture of Experts) Layers

```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.experts.*.w1"     # Expert gate projection
    - "mlp.experts.*.w2"     # Expert down projection
    - "mlp.experts.*.w3"     # Expert up projection
    - "mlp.gate"             # Expert routing gate
```

Note: the exact prefix depends on the architecture; Mixtral, for example, names these modules block_sparse_moe.experts rather than mlp.experts.
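
Wildcard entries like those above may not be expanded by every PEFT version when target_modules is a list; recent PEFT releases instead accept a single regex string, matched with re.fullmatch against each full module name. The matching itself is plain Python (the module names below are illustrative Mixtral-style names, not read from a real checkpoint):

```python
import re

# Hypothetical Mixtral-style module names (illustrative only)
names = [
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.block_sparse_moe.experts.7.w2",
    "model.layers.0.self_attn.q_proj",
]

# A single regex string covering every expert's w1/w2/w3 projection
pattern = r".*\.experts\.\d+\.w[123]"
matched = [n for n in names if re.fullmatch(pattern, n)]
print(matched)
```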

Advanced Configuration Examples

Train Multiple Layer Types

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```

Conservative Approach (Only MLPs)

```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

Comprehensive Approach (All Main Layers)

```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```

How to Find Module Names for Your Model

Method 1: Automatic Detection

Run the script once with target_modules: "auto"; it will log which modules it found:

```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```

Method 2: Manual Inspection

Inspect your model structure:

```python
from transformers import AutoModelForCausalLM

# AutoModelForCausalLM (rather than AutoModel) also loads lm_head
model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

# Print all module names
for name, module in model.named_modules():
    print(name)
```
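
Printing every module name can produce thousands of lines; collapsing them to unique leaf names gives a quick per-architecture summary. A stdlib-only sketch (the names list here is an illustrative sample; in practice feed it `[n for n, _ in model.named_modules()]`):

```python
from collections import Counter

# Illustrative sample for a two-layer LLaMA-style model;
# in practice: names = [n for n, _ in model.named_modules()]
names = [
    "model.layers.0.self_attn.q_proj", "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj", "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj", "model.layers.1.self_attn.k_proj",
    "model.layers.1.mlp.gate_proj", "model.layers.1.mlp.down_proj",
    "lm_head",
]

# Count how often each leaf name occurs across layers
leaf_counts = Counter(name.rsplit(".", 1)[-1] for name in names)
for leaf, count in sorted(leaf_counts.items()):
    print(f"{leaf}: {count}")
```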

Method 3: Use the Script's Helper Function

The script includes an _infer_target_modules() function that can help identify available modules.

Considerations

1. Memory Usage

  • More modules = More memory: Training additional layers requires more GPU memory
  • Monitor VRAM usage: Use nvidia-smi to monitor memory consumption
  • Adjust batch size: You may need to reduce per_device_train_batch_size
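
The "more modules = more memory" point can be quantified: for each targeted Linear layer of shape (out_features, in_features), a rank-r LoRA adapter adds r * (in_features + out_features) trainable parameters (an r x in matrix A plus an out x r matrix B). A rough estimate, where the layer shapes below approximate a LLaMA-7B-style model and are illustrative assumptions, not values from your config:

```python
def lora_params(in_features, out_features, r):
    """Trainable parameters LoRA adds to one Linear: A (r x in) + B (out x r)."""
    return r * (in_features + out_features)

# Illustrative shapes, roughly LLaMA-7B-like: hidden 4096, MLP 11008, 32 layers
hidden, inter, layers, r = 4096, 11008, 32, 16

attn = 4 * lora_params(hidden, hidden, r)  # q/k/v/o projections
mlp = 2 * lora_params(hidden, inter, r) + lora_params(inter, hidden, r)  # gate/up + down

print(f"per-layer attention: {attn:,}")          # 524,288
print(f"per-layer MLP:       {mlp:,}")           # 724,992
print(f"total: {(attn + mlp) * layers:,}")       # 39,976,960
```

Adding the MLP projections here more than doubles the adapter size relative to attention-only training, before any optimizer-state overhead.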

2. Training Time

  • More modules = Longer training: Each additional layer increases computation time
  • Learning rate adjustments: You might need to reduce learning_rate when training more layers

3. Performance Trade-offs

  • Attention only: Fast training, good for language understanding
  • MLP only: Fast training, good for knowledge storage
  • Both attention + MLP: Slower but potentially better performance
  • All layers: Slowest but most comprehensive adaptation

4. Model Architecture Differences

Different model families use different module naming conventions:

  • LLaMA: mlp.gate_proj, mlp.up_proj, mlp.down_proj
  • Qwen: mlp.gate_proj, mlp.up_proj, mlp.down_proj
  • Gemma: mlp.gate_proj, mlp.up_proj, mlp.down_proj
  • Mixtral: mlp.experts.*.w1, etc.

Best Practices

1. Start Conservative

Begin with just attention layers, then gradually add more modules if needed:

```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add MLPs
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add more if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```

2. Monitor Overfitting

  • Use evaluation split to monitor performance
  • Adjust learning_rate if overfitting occurs
  • Consider lora_dropout to reduce overfitting

3. Resource Management

  • Start with small LoRA rank (r: 16) if training many modules
  • Increase gradient_accumulation_steps if reducing batch size
  • Monitor GPU memory usage throughout training
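
The batch-size and accumulation bullets interact through the effective batch size: per_device_train_batch_size x gradient_accumulation_steps x number of GPUs. A quick sanity check (pure arithmetic, no trainer required):

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Samples seen per optimizer step scales with all three factors."""
    return per_device * grad_accum * num_gpus

# Halving the per-device batch and doubling accumulation keeps it unchanged
print(effective_batch_size(2, 8))   # 16
print(effective_batch_size(1, 16))  # 16
```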

4. Model-Specific Tuning

Different models may benefit from different module combinations:

  • Code models: Focus on attention + MLP layers
  • Chat models: Attention layers are most important
  • Reasoning models: All layers might be beneficial

Example: Training Custom Modules

Complete Configuration Example

```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
    - "input_layernorm"

train:
  num_train_epochs: 2
  learning_rate: 1e-5  # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```

This configuration will train:

  • All attention projection layers
  • All MLP projection layers
  • Input normalization layers

It uses a reduced learning rate to accommodate the additional trainable parameters.

Remember to always test with a small number of steps first to ensure your configuration works correctly before running full training.