# CPT Training Different Modules Guide
## Overview
By default, the CPT (Continual Pre-Training) configuration in `/workspace/Trainer-kit/CPT/config.yaml` trains only **attention projection layers** using LoRA adapters. This guide explains how to modify the configuration to train other modules.
## Current Default Configuration
```yaml
peft:
  enabled: true
  target_modules: "auto"
```
When `target_modules: "auto"` is set, the script automatically detects and trains these attention layers:
- `q_proj` - Query projection
- `k_proj` - Key projection
- `v_proj` - Value projection
- `o_proj` - Output projection
## How to Train Other Modules
### Method 1: Explicit Target Modules
Replace `"auto"` with a list of the specific module names you want to train:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"  # Add MLP down projection
    - "mlp.gate_proj"  # Add MLP gate projection
    - "mlp.up_proj"    # Add MLP up projection
```
Note that PEFT matches each list entry as a suffix of the full module path, so `"mlp.down_proj"` matches `model.layers.0.mlp.down_proj` (and plain `"down_proj"` would too).
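The suffix-matching rule above can be sketched in a few lines of Python; the helper name `matches_target` is ours, but the rule mirrors how PEFT resolves a `target_modules` list against `model.named_modules()` names:

```python
def matches_target(module_name: str, targets: list[str]) -> bool:
    """PEFT-style matching for a target_modules list: an entry matches a
    module whose full dotted name equals it or ends with "." + entry."""
    return any(
        module_name == t or module_name.endswith("." + t) for t in targets
    )

# Full dotted names as they appear in model.named_modules()
print(matches_target("model.layers.0.self_attn.q_proj", ["q_proj"]))      # True
print(matches_target("model.layers.3.mlp.down_proj", ["mlp.down_proj"]))  # True
print(matches_target("model.layers.3.mlp.down_proj", ["up_proj"]))        # False
```

This is why a bare suffix like `"down_proj"` is usually enough, and why a more qualified entry like `"mlp.down_proj"` is useful when the same leaf name appears in more than one place.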
### Method 2: Custom Module Lists
For different model architectures, here are common modules you can train:
#### LLaMA/Llama-style Models
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
#### Qwen-style Models
Qwen2-style models use the same projection names as LLaMA, so the configuration above applies unchanged.
#### Mixtral-style (MoE) Models
Gemma also follows the LLaMA naming above. Mixtral's expert weights live under `block_sparse_moe.experts`; because PEFT matches list entries as name suffixes (literal `*` wildcards in a list are not expanded), target the per-expert projections by their final names:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "w1"  # per-expert gate projection
    - "w2"  # per-expert down projection
    - "w3"  # per-expert up projection
```
## Module Types You Can Train
### 1. Attention Layers
- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV projection (e.g. Phi-3)
- `c_attn` - Combined attention projection in GPT-2-style models
### 2. MLP/Feed-Forward Layers
- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection
- `mlp.down_proj` - Down projection
- `fc1` / `fc2` - First/second feed-forward layer (e.g. OPT, Phi)
- `w1`, `w2`, `w3` - Mixtral-style per-expert naming
### 3. Embedding Layers
LoRA can adapt embedding and output-head layers directly:
```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"  # Token embeddings
    - "lm_head"             # Language model head
```
If you want these layers trained fully rather than through low-rank adapters, PEFT's `modules_to_save` option is the usual route.
### 4. Normalization Layers
LoRA adapters only wrap linear and embedding layers, so norm layers (`input_layernorm`, `post_attention_layernorm`, `final_layernorm`) cannot go in `target_modules`; PEFT will reject them. Train them fully via `modules_to_save` instead (this assumes the config forwards the key to `LoraConfig`):
```yaml
peft:
  enabled: true
  target_modules: "auto"
  modules_to_save:  # trained fully, not via low-rank adapters
    - "input_layernorm"
    - "post_attention_layernorm"
    - "final_layernorm"
```
### 5. MoE (Mixture of Experts) Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "w1"    # per-expert gate projection
    - "w2"    # per-expert down projection
    - "w3"    # per-expert up projection
    - "gate"  # expert routing gate (block_sparse_moe.gate in Mixtral)
```
Be cautious with the routing gate: it is small, and adapting it changes which experts receive tokens.
## Advanced Configuration Examples
### Train Multiple Layer Types
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
Normalization layers are not linear, so they belong in `modules_to_save` rather than `target_modules`.
### Conservative Approach (Only MLPs)
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```
### Comprehensive Approach (All Main Layers)
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  modules_to_save:  # norm layers cannot take LoRA adapters; train them fully
    - "input_layernorm"
    - "post_attention_layernorm"
```
## How to Find Module Names for Your Model
### Method 1: Automatic Detection
Run the script once with `target_modules: "auto"`; it will log which modules it found:
```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```
### Method 2: Manual Inspection
Inspect your model's structure directly. Loading with `AutoModelForCausalLM` (rather than bare `AutoModel`) ensures the `lm_head` shows up too:
```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

# Print every Linear layer's full dotted name -- the usual LoRA targets
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
```
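The dump above repeats the same leaf names once per layer. A small helper (ours, not part of the script) can reduce it to the unique trailing names, which are exactly the candidates for `target_modules`:

```python
def candidate_targets(module_names: list[str]) -> list[str]:
    """Collect unique leaf names (the part after the last dot) from a
    model.named_modules() dump, skipping bare layer indices."""
    seen: list[str] = []
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]
        if leaf and not leaf.isdigit() and leaf not in seen:
            seen.append(leaf)
    return seen

names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",  # duplicate leaf, kept once
]
print(candidate_targets(names))  # ['q_proj', 'k_proj', 'gate_proj']
```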
### Method 3: Use the Script's Built-in Function
The script includes an `_infer_target_modules()` helper that can identify available modules.
## Considerations
### 1. Memory Usage
- **More modules = more memory**: Training additional layers requires more GPU memory
- **Monitor VRAM usage**: Use `nvidia-smi` to watch memory consumption
- **Adjust batch size**: You may need to reduce `per_device_train_batch_size`
### 2. Training Time
- **More modules = longer training**: Each additional layer increases computation time
- **Learning rate adjustments**: You might need to reduce `learning_rate` when training more layers
### 3. Performance Trade-offs
- **Attention only**: Fast training; adapts how the model routes and mixes information
- **MLP only**: Fast training; adapts where factual knowledge is stored
- **Attention + MLP**: Slower, but usually a better quality-for-cost balance
- **All layers**: Slowest but most comprehensive adaptation
### 4. Model Architecture Differences
Different model families use different module naming conventions:
- **LLaMA, Qwen, Gemma**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj`
- **Mixtral**: `block_sparse_moe.experts.<i>.w1`, `w2`, `w3`
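These conventions can be encoded as a simple lookup keyed by the Hugging Face `config.model_type`. The table below is a hypothetical sketch that mirrors (but is not) PEFT's built-in mapping; always verify the names against your actual model:

```python
# Hypothetical lookup from HF config.model_type to common LoRA targets.
COMMON_TARGETS = {
    "llama":   ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "qwen2":   ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "gemma":   ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "mixtral": ["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
}

def targets_for(model_type: str) -> list[str]:
    # Fall back to attention-only targets for unknown architectures
    return COMMON_TARGETS.get(model_type, ["q_proj", "k_proj", "v_proj", "o_proj"])

print(targets_for("mixtral"))
```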
## Best Practices
### 1. Start Conservative
Begin with just the attention layers, then gradually add more modules if needed:
```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add an MLP projection
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add the rest of the MLP if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```
### 2. Monitor Overfitting
- Use an evaluation split to monitor performance
- Lower the `learning_rate` if overfitting occurs
- Consider raising `lora_dropout` to reduce overfitting
### 3. Resource Management
- Start with a small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` if you reduce the batch size
- Monitor GPU memory usage throughout training
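To reason about how rank and module count affect memory, it helps to count the extra parameters LoRA adds. A minimal estimator (the function name and shapes are ours for illustration):

```python
def lora_param_count(r: int, shapes: list[tuple[int, int]]) -> int:
    """Extra trainable parameters LoRA adds: for each targeted Linear of
    shape (out_features, in_features), A is (r, in) and B is (out, r),
    so each layer contributes r * (in + out) parameters."""
    return sum(r * (in_f + out_f) for out_f, in_f in shapes)

# e.g. q_proj and v_proj in a 4096-dim model, rank 16:
print(lora_param_count(16, [(4096, 4096), (4096, 4096)]))  # 2 * 16 * 8192 = 262144
```

Doubling the rank or the number of targeted layers roughly doubles the adapter size, which is why `r: 16` is a sensible starting point when many modules are targeted.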
### 4. Model-Specific Tuning
Different models may benefit from different module combinations:
- **Code models**: Focus on attention + MLP layers
- **Chat models**: Attention layers are most important
- **Reasoning models**: All layers might be beneficial
## Example: Training Custom Modules
### Complete Configuration Example
```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"
peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  modules_to_save:      # norm layers cannot take LoRA adapters; train them fully
    - "input_layernorm"
train:
  num_train_epochs: 2
  learning_rate: 1e-5  # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```
This configuration trains:
- All attention projection layers
- All MLP projection layers
- Input normalization layers

The learning rate is reduced to accommodate the additional trainable parameters. Always test with a small number of steps first to ensure your configuration works correctly before running full training.
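Since the `train:` block above uses Hugging Face `TrainingArguments`-style keys, a quick smoke test might look like the fragment below (this assumes `max_steps` and `logging_steps` are passed through to the trainer):

```yaml
train:
  max_steps: 20               # overrides num_train_epochs for a quick dry run
  per_device_train_batch_size: 1
  logging_steps: 1            # confirm the loss is logged and finite
```

If the dry run completes and the loss is a finite, decreasing number, restore the original `num_train_epochs` and launch the full run.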