# Instruction Fine-Tuning Script

This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It's based on the original CPT script but adapted specifically for instruction input/output pairs.

## Key Differences from CPT

1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Proper Loss Masking**: Loss is only computed on the response/output portion, not on the instruction and input
5. **Automatic Label Creation**: Labels are automatically created with -100 masking for instruction tokens
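
The masking in points 4 and 5 can be sketched as follows (`build_labels` is a hypothetical helper for illustration, not necessarily the function name used in `run_instruct.py`):

```python
# Hypothetical sketch of prompt-masked label creation, assuming the prompt
# (instruction + input) and response have already been tokenized separately.
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response token ids, masking prompt positions
    with -100 so the loss is only computed on the response tokens."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

ids, labels = build_labels([101, 2023, 2003], [1996, 3437, 102])
# labels -> [-100, -100, -100, 1996, 3437, 102]
```

Positions labeled `-100` are ignored by the cross-entropy loss, which is why the model is only trained to produce the response.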

## Supported Data Formats

### JSONL Structure
Each line should be a JSON object with the following fields:
```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```
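
Records in this shape can be read with the standard library alone; a minimal loader sketch (the script's actual loading code may differ):

```python
import json

def load_instruction_records(path):
    """Yield one dict per JSONL line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Parsing a single record:
record = json.loads('{"instruction": "Summarize.", "input": "", "output": "OK."}')
```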

### Formatting Options

#### 1. ChatML Format (Default)
Uses the model's chat template with system/user/assistant roles:
```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```

#### 2. Alpaca Format
Uses the classic Alpaca instruction format:
```yaml
data:
  format_type: "alpaca"
```

#### 3. Custom Format
Define your own template:
```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```
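
A `custom_template` like the one above is presumably filled via `str.format`-style substitution; a minimal sketch (the real formatting function in `run_instruct.py` may differ):

```python
def apply_custom_template(template, example):
    """Fill an {instruction}/{input}/{output}-style template from one record."""
    return template.format(
        instruction=example["instruction"],
        input=example["input"],
        output=example["output"],
    )

text = apply_custom_template(
    "Instruction: {instruction}\nInput: {input}\nOutput: {output}",
    {"instruction": "Add 2+2.", "input": "", "output": "4"},
)
```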

## Configuration

Key configuration options in `config_instruct.yaml`:

### Data Configuration
```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1  # if no eval file provided
  
  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"
  
  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."
  
  # Tokenization
  max_length: 2048
```
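
When no `eval_jsonl` is given, the `eval_split_ratio` split might be implemented roughly like this (a hypothetical sketch, including the seed; the script's actual split logic may differ):

```python
import random

def train_eval_split(records, eval_ratio=0.1, seed=42):
    """Shuffle and carve off an eval fraction when no eval file is provided."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_eval = max(1, int(len(records) * eval_ratio))
    return records[n_eval:], records[:n_eval]

train, eval_set = train_eval_split(range(100), eval_ratio=0.1)
# 90 training records, 10 eval records
```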

### Training Configuration
```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```
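
With the values above, the effective batch size per device is the product of the two batch settings:

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16

# Gradients are accumulated over 16 micro-batches of 1 example each
# before every optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
```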

## Usage

### Basic Usage
```bash
python run_instruct.py --config config_instruct.yaml
```

### Merge Only (after training)
```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```

## Example Data Format

See `instruct_data.jsonl` for the expected data format. A few sample records:

```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}

{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}

{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}
```
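
Before training, it can help to verify that every record carries the expected fields; a hypothetical validation check (field names assume the default configuration above):

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}

def validate_record(line):
    """Parse one JSONL line and raise if any required field is missing."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

ok = validate_record('{"instruction": "Hi", "input": "", "output": "Hello"}')
```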

## Key Features

1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Proper Loss Masking**: Only computes loss on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters with base model

## Best Practices

1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose appropriate system prompts for your use case
4. **Token Length**: Set appropriate `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
2. **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support

### Debug Mode
Add logging to see formatted examples:
```python
# In format_instruction function, add:
print(f"Formatted: {formatted_text}")
```

## File Structure

```
CPT/
β”œβ”€β”€ run_instruct.py          # Main instruction fine-tuning script
β”œβ”€β”€ config_instruct.yaml     # Configuration file
β”œβ”€β”€ instruct_data.jsonl      # Example instruction data
β”œβ”€β”€ README_instruct.md       # This documentation
└── runs/                    # Training outputs
    └── instruct_run_v1/
        β”œβ”€β”€ logs/
        β”œβ”€β”€ checkpoints/
        β”œβ”€β”€ best_adapter/
        └── final_model/
```

## Migration from CPT

To migrate from the original CPT script:

1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)

The script maintains the same CLI interface and most configuration options for easy migration.
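
Converting plain CPT text data into the instruction format could look roughly like this (the instruction wording is a placeholder you would choose for your task; how to split text into prompt and continuation is up to you):

```python
import json

def text_to_instruction_record(text, instruction="Continue the following text."):
    """Wrap one raw CPT text chunk as an instruction/input/output record.
    For simplicity the whole chunk becomes the output here."""
    return {"instruction": instruction, "input": "", "output": text}

line = json.dumps(text_to_instruction_record("Some document text."))
```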