# Instruction Fine-Tuning Script
This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It's based on the original CPT script but adapted specifically for instruction input/output pairs.
## Key Differences from CPT
1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Proper Loss Masking**: Loss is only computed on the response/output portion, not on the instruction and input
5. **Automatic Label Creation**: Labels are automatically created with -100 masking for instruction tokens
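The masking in points 4 and 5 can be sketched in plain Python: labels mirror the input IDs, except every prompt position is set to -100 so the cross-entropy loss skips it (PyTorch's `CrossEntropyLoss` ignores index -100 by default). The function and token IDs below are illustrative, not taken from `run_instruct.py`:

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Create labels that mask the prompt so loss is computed only on the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    # ignore_index positions contribute nothing to the loss.
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Prompt tokens (instruction + input) are masked; response tokens keep their IDs.
input_ids, labels = build_labels([101, 2023, 2003], [1996, 3437, 102])
```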
## Supported Data Formats
### JSONL Structure
Each line should be a JSON object with the following fields:
```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```
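A minimal stdlib loader for this layout might look like the following; the field names match the defaults above, but the helper itself is a sketch, not part of the script:

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def load_instruct_jsonl(path):
    """Load one JSON object per line, validating the expected fields."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            examples.append(record)
    return examples
```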
### Formatting Options
#### 1. ChatML Format (Default)
Uses the model's chat template with system/user/assistant roles:
```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```
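For reference, ChatML wraps each turn in `<|im_start|>` / `<|im_end|>` markers. A rough sketch of the rendering is below; the actual script delegates this to the model's chat template, so treat the helper as illustrative only:

```python
def to_chatml(system_prompt, instruction, inp, output):
    """Render one example in ChatML-style markup (illustrative, not the script's code)."""
    user = instruction if not inp else f"{instruction}\n\n{inp}"
    turns = [("system", system_prompt), ("user", user), ("assistant", output)]
    return "".join(f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in turns)
```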
#### 2. Alpaca Format
Uses the classic Alpaca instruction format:
```yaml
data:
  format_type: "alpaca"
```
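The classic Alpaca prompt (from the Stanford Alpaca release) uses `### Instruction:` / `### Response:` headers, with a shorter variant when `input` is empty. A sketch, assuming the script follows that convention:

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def format_alpaca(example):
    """Pick the variant based on whether the example carries an input field."""
    template = ALPACA_WITH_INPUT if example.get("input") else ALPACA_NO_INPUT
    return template.format(**example)
```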
#### 3. Custom Format
Define your own template:
```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```
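The `{instruction}`, `{input}`, and `{output}` placeholders are plain `str.format`-style fields. Presumably the script fills them roughly like this (the helper name is hypothetical):

```python
def format_custom(template, example):
    """Fill a custom template; a missing placeholder raises KeyError early."""
    return template.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
        output=example["output"],
    )
```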
## Configuration
Key configuration options in `config_instruct.yaml`:
### Data Configuration
```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1  # used if no eval file is provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."

  # Tokenization
  max_length: 2048
```
### Training Configuration
```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```
## Usage
### Basic Usage
```bash
python run_instruct.py --config config_instruct.yaml
```
### Merge Only (after training)
```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```
## Example Data Format
See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:
```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n if n < 0:\n raise ValueError(...)"}
```
## Key Features
1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Proper Loss Masking**: Only computes loss on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters with base model
## Best Practices
1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose appropriate system prompts for your use case
4. **Token Length**: Set appropriate `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory
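On the last point: the effective batch size is the product of per-device batch size, gradient accumulation steps, and GPU count, so a memory-driven reduction in one factor can be offset by raising another. With the defaults shown above (assuming a single-GPU run):

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1  # assumption: single GPU

# Sequences contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps this product, and thus the optimization behavior, roughly unchanged.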
## Troubleshooting
### Common Issues
1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
2. **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support
### Debug Mode
Add a temporary print to inspect formatted examples:
```python
# In format_instruction function, add:
print(f"Formatted: {formatted_text}")
```
## File Structure
```
CPT/
β”œβ”€β”€ run_instruct.py # Main instruction fine-tuning script
β”œβ”€β”€ config_instruct.yaml # Configuration file
β”œβ”€β”€ instruct_data.jsonl # Example instruction data
β”œβ”€β”€ README_instruct.md # This documentation
└── runs/ # Training outputs
└── instruct_run_v1/
β”œβ”€β”€ logs/
β”œβ”€β”€ checkpoints/
β”œβ”€β”€ best_adapter/
└── final_model/
```
## Migration from CPT
To migrate from the original CPT script:
1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)
The script maintains the same CLI interface and most configuration options for easy migration.