# Instruction Fine-Tuning Script
This script (`run_instruct.py`) fine-tunes language models on instruction-following tasks. It is based on the original continued pre-training (CPT) script, adapted specifically for structured instruction input/output pairs.
## Key Differences from CPT
1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Proper Loss Masking**: Loss is computed only on the response/output portion, not on the instruction or input
5. **Automatic Label Creation**: Labels are automatically created with -100 masking for instruction tokens
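The masking in points 4 and 5 can be sketched as follows (`build_labels` is a hypothetical helper name; the actual implementation in `run_instruct.py` may differ):

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response token IDs, masking the prompt.

    Label positions set to -100 are ignored by PyTorch's cross-entropy
    loss, so only the response tokens contribute to the gradient.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

For example, with a 3-token prompt and a 2-token response, the first three label positions are `-100` and the last two are the response token IDs.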
## Supported Data Formats
### JSONL Structure
Each line should be a JSON object with the following fields:
```json
{
"instruction": "Your instruction here",
"input": "Optional input context (can be empty string)",
"output": "Expected response"
}
```
### Formatting Options
#### 1. ChatML Format (Default)
Uses the model's chat template with system/user/assistant roles:
```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```
#### 2. Alpaca Format
Uses the classic Alpaca instruction format:
```yaml
data:
  format_type: "alpaca"
```
#### 3. Custom Format
Define your own template:
```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```
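To illustrate how the Alpaca and custom options could render a record (a sketch: the preamble below follows the classic Alpaca template, the exact wording in the script may differ, and ChatML goes through the tokenizer's chat template instead):

```python
# Classic Alpaca templates, with and without an input field.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def format_example(example, format_type="alpaca", custom_template=None):
    """Render one instruction record as plain text."""
    if format_type == "custom":
        return custom_template.format(**example)
    if example.get("input"):
        return ALPACA_WITH_INPUT.format(**example)
    return ALPACA_NO_INPUT.format(**example)
```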
## Configuration
Key configuration options in `config_instruct.yaml`:
### Data Configuration
```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1                  # if no eval file provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."

  # Tokenization
  max_length: 2048
```
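When `eval_jsonl` is not set, the script holds out `eval_split_ratio` of the training examples for evaluation. A split along these lines could be implemented as follows (a hypothetical sketch; the script's actual behavior, e.g. seeding, may differ):

```python
import random

def split_train_eval(records, eval_split_ratio=0.1, seed=42):
    """Hold out a fraction of the records for evaluation."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_split_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]
```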
### Training Configuration
```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```
## Usage
### Basic Usage
```bash
python run_instruct.py --config config_instruct.yaml
```
### Merge Only (after training)
```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```
## Example Data Format
See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:
```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n if n < 0:\n raise ValueError(...)"}
```
## Key Features
1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Proper Loss Masking**: Only computes loss on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters with base model
## Best Practices
1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose appropriate system prompts for your use case
4. **Token Length**: Set appropriate `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory
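On the last point: with the example configuration above (batch size 1, 16 accumulation steps), the optimizer sees an effective batch of 16 per device. A tiny helper (hypothetical, not part of the script) makes the arithmetic explicit:

```python
def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Effective (global) batch size seen by the optimizer per update step."""
    return per_device * grad_accum * num_devices
```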
## Troubleshooting
### Common Issues
1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
2. **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support
### Debug Mode
Add a temporary print (or `logging.debug` call) to inspect formatted examples:
```python
# In format_instruction function, add:
print(f"Formatted: {formatted_text}")
```
## File Structure
```
CPT/
├── run_instruct.py        # Main instruction fine-tuning script
├── config_instruct.yaml   # Configuration file
├── instruct_data.jsonl    # Example instruction data
├── README_instruct.md     # This documentation
└── runs/                  # Training outputs
    └── instruct_run_v1/
        ├── logs/
        ├── checkpoints/
        ├── best_adapter/
        └── final_model/
```
## Migration from CPT
To migrate from the original CPT script:
1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)
The script maintains the same CLI interface and most configuration options for easy migration.