# CPT Training Different Modules Guide

## Overview

By default, the CPT (Continual Pre-Training) configuration in `/workspace/Trainer-kit/CPT/config.yaml` trains only **attention projection layers** using LoRA adapters. This guide explains how to modify the configuration to train other modules.

## Current Default Configuration

```yaml
peft:
  enabled: true
  target_modules: "auto"
```

When `target_modules: "auto"` is set, the script automatically detects and trains these attention layers:
- `q_proj` - Query projection
- `k_proj` - Key projection  
- `v_proj` - Value projection
- `o_proj` - Output projection

## How to Train Other Modules

### Method 1: Explicit Target Modules

Replace `"auto"` with a list of specific module names you want to train:

```yaml
peft:
  enabled: true
  target_modules: 
    - "q_proj"
    - "k_proj" 
    - "v_proj"
    - "o_proj"
    - "mlp.down_proj"    # Add MLP down projection
    - "mlp.gate_proj"    # Add MLP gate projection
    - "mlp.up_proj"      # Add MLP up projection
```
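Under the hood, PEFT resolves each entry by matching it against the end of every module's full path. A minimal sketch of that rule (simplified from PEFT's actual matching logic):

```python
def matches_target(module_name: str, targets: list[str]) -> bool:
    # Simplified PEFT rule: a module matches if its full name equals
    # a target or ends with "." followed by the target
    return any(module_name == t or module_name.endswith(f".{t}") for t in targets)

names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.mlp.gate_proj",
]
targets = ["q_proj", "mlp.down_proj"]
print([n for n in names if matches_target(n, targets)])
# → ['model.layers.0.self_attn.q_proj', 'model.layers.0.mlp.down_proj']
```

This is why the `mlp.` prefix is optional: `down_proj` alone would match the same layer.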

### Method 2: Custom Module Lists

For different model architectures, here are common modules you can train:

#### LLaMA/Qwen/Gemma-style Models
These families share the same attention and MLP module names, so one list covers all three:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

#### Mixtral-style (MoE) Models
In Hugging Face Mixtral checkpoints the experts live under `block_sparse_moe`, and each expert's `w1`/`w3`/`w2` are the gate/up/down projections:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "w1"    # Expert gate projection
    - "w2"    # Expert down projection
    - "w3"    # Expert up projection
```
Listing the bare `w1`/`w2`/`w3` names targets those projections in every expert.

## Module Types You Can Train

### 1. Attention Layers
- `q_proj` - Query projections
- `k_proj` - Key projections
- `v_proj` - Value projections
- `o_proj` - Output projections
- `qkv_proj` - Combined QKV projection (some models)
- `c_attn` - Combined QKV projection in GPT-2-style models

### 2. MLP/Feed-Forward Layers
- `mlp.gate_proj` - Gate projection
- `mlp.up_proj` - Up projection  
- `mlp.down_proj` - Down projection
- `mlp.fc1` - First linear layer (e.g., Phi-style models)
- `mlp.fc2` - Second linear layer
- `w1`, `w2`, `w3` - Mixtral-style expert naming

### 3. Embedding Layers
PEFT can wrap embeddings with a LoRA `Embedding` adapter; `lm_head` is an ordinary linear layer:
```yaml
peft:
  enabled: true
  target_modules:
    - "model.embed_tokens"  # Token embeddings
    - "lm_head"             # Language model head
```

### 4. Normalization Layers
Standard LoRA adapters only wrap linear, embedding, and convolution layers, so LayerNorm/RMSNorm modules cannot go in `target_modules` (PEFT will reject them). If the script forwards its config to PEFT's `LoraConfig`, train norm layers in full via `modules_to_save` instead:
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "v_proj"
  modules_to_save:
    - "input_layernorm"           # Input normalization
    - "post_attention_layernorm"  # Post-attention norm
    - "final_layernorm"           # Final normalization
```

### 5. MoE (Mixture of Experts) Layers
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.experts.*.w1"     # Expert layer 1
    - "mlp.experts.*.w2"     # Expert layer 2  
    - "mlp.experts.*.w3"     # Expert layer 3
    - "mlp.gate"             # Expert routing gate
```

## Advanced Configuration Examples

### Train Multiple Layer Types
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj" 
    - "mlp.down_proj"
    - "input_layernorm"
    - "post_attention_layernorm"
```

### Conservative Approach (Only MLPs)
```yaml
peft:
  enabled: true
  target_modules:
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
```

### Comprehensive Approach (All Main Layers)
```yaml
peft:
  enabled: true
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  modules_to_save:    # norm layers are fully fine-tuned rather than LoRA-wrapped
    - "input_layernorm"
    - "post_attention_layernorm"
```

## How to Find Module Names for Your Model

### Method 1: Automatic Detection
Run the script once with `target_modules: "auto"` - it will log which modules it found:

```
Using auto-inferred target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```

### Method 2: Manual Inspection
Inspect your model structure:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# AutoModelForCausalLM (rather than AutoModel) keeps the lm_head in the graph
model = AutoModelForCausalLM.from_pretrained("/workspace/Models/YourModel")

# Linear layers are the usual LoRA targets
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
```

### Method 3: Use PEFT's Built-in Function
The script includes `_infer_target_modules()` function that can help identify available modules.
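The function's implementation isn't shown here; a plausible sketch, assuming it checks which well-known attention suffixes actually occur among the model's module names:

```python
KNOWN_ATTENTION_SUFFIXES = ["q_proj", "k_proj", "v_proj", "o_proj", "qkv_proj", "c_attn"]

def infer_target_modules(module_names: list[str]) -> list[str]:
    """Return the known attention suffixes present in the model."""
    return [
        suffix
        for suffix in KNOWN_ATTENTION_SUFFIXES
        if any(name.split(".")[-1] == suffix for name in module_names)
    ]

names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
]
print(infer_target_modules(names))
# → ['q_proj', 'k_proj', 'v_proj', 'o_proj']
```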

## Considerations

### 1. Memory Usage
- **More modules = More memory**: Training additional layers requires more GPU memory
- **Monitor VRAM usage**: Use `nvidia-smi` to monitor memory consumption
- **Adjust batch size**: You may need to reduce `per_device_train_batch_size`
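The extra memory scales with the number of wrapped layers: each LoRA-wrapped Linear of shape `(out, in)` adds `r * (in + out)` trainable parameters. A rough estimate with illustrative dimensions (the hidden size and layer count here are hypothetical, not taken from any specific model):

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    # LoRA adds two low-rank factors: A (r x in) and B (out x r)
    return r * (in_features + out_features)

hidden = 4096   # hypothetical hidden size
layers = 32     # hypothetical layer count
r = 16
# Four attention projections per layer, each hidden x hidden
total = layers * 4 * lora_params(hidden, hidden, r)
print(f"{total:,} trainable LoRA parameters")
# → 16,777,216 trainable LoRA parameters
```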

### 2. Training Time
- **More modules = Longer training**: Each additional layer increases computation time
- **Learning rate adjustments**: You might need to reduce `learning_rate` when training more layers

### 3. Performance Trade-offs
- **Attention only**: Fast training, good for language understanding
- **MLP only**: Fast training, good for knowledge storage
- **Both attention + MLP**: Slower but potentially better performance
- **All layers**: Slowest but most comprehensive adaptation

### 4. Model Architecture Differences
Different model families use different module naming conventions:
- **LLaMA / Qwen / Gemma**: `mlp.gate_proj`, `mlp.up_proj`, `mlp.down_proj` (identical naming across these families)
- **Mixtral**: experts under `block_sparse_moe`, with `w1`/`w2`/`w3` projections

## Best Practices

### 1. Start Conservative
Begin with just attention layers, then gradually add more modules if needed:
```yaml
# Phase 1: Start here
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]

# Phase 2: Add MLPs
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.down_proj"]

# Phase 3: Add more if needed
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```

### 2. Monitor Overfitting
- Use evaluation split to monitor performance
- Adjust `learning_rate` if overfitting occurs
- Consider `lora_dropout` to reduce overfitting

### 3. Resource Management
- Start with small LoRA rank (`r: 16`) if training many modules
- Increase `gradient_accumulation_steps` if reducing batch size
- Monitor GPU memory usage throughout training
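The second bullet works because the effective batch size is the product of per-device batch, accumulation steps, and GPU count; halving one while doubling the other leaves it unchanged:

```python
def effective_batch(per_device: int, accum_steps: int, num_gpus: int = 1) -> int:
    # Total examples contributing to each optimizer step
    return per_device * accum_steps * num_gpus

# Halving the per-device batch while doubling accumulation preserves it
print(effective_batch(2, 8), effective_batch(1, 16))
# → 16 16
```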

### 4. Model-Specific Tuning
Different models may benefit from different module combinations:
- **Code models**: Focus on attention + MLP layers
- **Chat models**: Attention layers are most important
- **Reasoning models**: All layers might be beneficial

## Example: Training Custom Modules

### Complete Configuration Example
```yaml
model:
  repo_id: "/workspace/Models/Devstral-Small-2-24B-Instruct-2512"
  torch_dtype: "bfloat16"

peft:
  enabled: true
  r: 64
  lora_alpha: 128
  lora_dropout: 0.05
  bias: "none"
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  modules_to_save:
    - "input_layernorm"

train:
  num_train_epochs: 2
  learning_rate: 1e-5  # Reduced due to more modules
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
```

This configuration trains:
- All attention projection layers
- All MLP projection layers
- Input normalization layers

It uses a reduced learning rate to accommodate the additional trainable parameters.

Remember to always test with a small number of steps first to ensure your configuration works correctly before running full training.
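Assuming the `train:` section maps onto standard Hugging Face `TrainingArguments` fields, a smoke-test override might look like:

```yaml
train:
  max_steps: 10      # hypothetical cap for a dry run; remove for full training
  logging_steps: 1   # log every step to verify the loss is moving
```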