# Instruction Fine-Tuning Script

This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It is based on the original CPT script but adapted specifically for instruction input/output pairs.
## Key Differences from CPT

- **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
- **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
- **No Text Packing**: Each example is treated as a complete instruction-response pair
- **Proper Loss Masking**: Loss is computed only on the response/output portion, not on the instruction and input
- **Automatic Label Creation**: Labels are created automatically, with `-100` masking for instruction tokens
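The masking behavior above can be sketched in a few lines. This is an illustrative example, not the script's actual implementation; `build_labels` is a hypothetical helper operating on already-tokenized IDs:

```python
# Minimal sketch of -100 label masking for instruction tuning (illustrative;
# not run_instruct.py's actual code). Prompt positions (instruction + input)
# are set to -100 so the cross-entropy loss ignores them, and only the
# response tokens contribute to the loss.

def build_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) with prompt positions masked to -100."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

ids, labels = build_labels([101, 102, 103], [201, 202])
print(ids)     # [101, 102, 103, 201, 202]
print(labels)  # [-100, -100, -100, 201, 202]
```

The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why masked positions contribute nothing to the gradient.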
## Supported Data Formats

### JSONL Structure

Each line should be a JSON object with the following fields:

```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```
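A loader for this structure can be sketched as follows (a hypothetical helper for illustration; the script's own loader may differ, e.g. in how it reports errors):

```python
import json

# Sketch of loading and validating instruction JSONL records.
REQUIRED_FIELDS = ("instruction", "input", "output")

def load_jsonl_lines(lines):
    """Parse JSONL lines, skipping blanks and checking required fields."""
    records = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise ValueError(f"line {n}: missing fields {missing}")
        records.append(rec)
    return records

sample = ['{"instruction": "Say hi", "input": "", "output": "Hi!"}']
print(load_jsonl_lines(sample)[0]["output"])  # Hi!
```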
## Formatting Options

### 1. ChatML Format (Default)

Uses the model's chat template with system/user/assistant roles:

```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```
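The resulting ChatML layout looks roughly like this sketch. Note that in practice the script uses the model's own chat template (e.g. via the tokenizer) rather than hand-built strings; this hand-rolled version only illustrates the structure:

```python
# Sketch of the ChatML message layout (illustrative only; the script relies
# on the tokenizer's chat template, not manual string building).
def to_chatml(system_prompt, instruction, inp, output):
    user = instruction if not inp else f"{instruction}\n\n{inp}"
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>"
    )

print(to_chatml("You are a helpful assistant.", "Say hi", "", "Hi!"))
```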
### 2. Alpaca Format

Uses the classic Alpaca instruction format:

```yaml
data:
  format_type: "alpaca"
```
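For reference, the classic Alpaca template (as popularized by Stanford Alpaca) can be sketched like this; the script's exact wording may differ slightly:

```python
# Sketch of the classic Alpaca prompt template (standard wording from the
# Stanford Alpaca project; illustrative, not necessarily the script's exact
# strings). A separate template is used when "input" is empty.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def to_alpaca(rec):
    tmpl = ALPACA_WITH_INPUT if rec.get("input") else ALPACA_NO_INPUT
    return tmpl.format(**rec)

print(to_alpaca({"instruction": "Say hi", "input": "", "output": "Hi!"}))
```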
### 3. Custom Format

Define your own template:

```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```
## Configuration

Key configuration options in `config_instruct.yaml`:

### Data Configuration

```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1                  # used if no eval file is provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."

  # Tokenization
  max_length: 2048
```
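The `eval_split_ratio` behavior (used when no `eval_jsonl` is given) amounts to holding out a fraction of the training records. A minimal sketch, with a hypothetical helper; the script may split differently, e.g. deterministically by index:

```python
import random

# Sketch of an eval_split_ratio-style holdout split (illustrative).
def split_train_eval(records, eval_ratio=0.1, seed=42):
    """Shuffle records and hold out eval_ratio of them for evaluation."""
    records = list(records)
    random.Random(seed).shuffle(records)  # seeded for reproducibility
    n_eval = max(1, int(len(records) * eval_ratio))
    return records[n_eval:], records[:n_eval]

train, eval_set = split_train_eval(range(20), eval_ratio=0.1)
print(len(train), len(eval_set))  # 18 2
```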
### Training Configuration

```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```
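These two settings combine into the effective batch size (per device; multiply further by device count on multi-GPU setups):

```python
# Effective batch size from the values above: per-device batch size times
# gradient accumulation steps (per device).
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 16
```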
## Usage

### Basic Usage

```bash
python run_instruct.py --config config_instruct.yaml
```

### Merge Only (after training)

```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```
## Example Data Format

See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:

```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}
```
## Key Features

- **Multiple Format Support**: ChatML, Alpaca, and custom templates
- **Flexible Field Mapping**: Configure custom field names for your data
- **Proper Loss Masking**: Computes loss only on the response portion
- **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
- **Evaluation Support**: Automatic evaluation split or a separate eval file
- **Checkpointing**: Resume training from checkpoints
- **Model Merging**: Merge trained adapters with the base model
## Best Practices

- **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
- **Format Consistency**: Use the same format for training and inference
- **System Prompts**: Choose appropriate system prompts for your use case
- **Token Length**: Set an appropriate `max_length` based on your model and data
- **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory
## Troubleshooting

### Common Issues

- **CUDA Out of Memory**: Reduce the batch size or enable 4-bit quantization
- **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
- **Poor Quality**: Check data format consistency and quality
- **Tokenizer Issues**: Ensure your model has proper chat template support
### Debug Mode

Add logging to see formatted examples:

```python
# In the format_instruction function, add:
print(f"Formatted: {formatted_text}")
```
## File Structure

```
CPT/
├── run_instruct.py        # Main instruction fine-tuning script
├── config_instruct.yaml   # Configuration file
├── instruct_data.jsonl    # Example instruction data
├── README_instruct.md     # This documentation
└── runs/                  # Training outputs
    └── instruct_run_v1/
        ├── logs/
        ├── checkpoints/
        ├── best_adapter/
        └── final_model/
```
## Migration from CPT

To migrate from the original CPT script:

1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)

The script maintains the same CLI interface and most configuration options for easy migration.
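The first migration step (converting plain text into instruction records) can be sketched as follows. The split heuristic and instruction wording here are hypothetical; adapt both to your task:

```python
import json

# Sketch of converting a plain CPT text record into the instruction JSONL
# format expected by run_instruct.py (hypothetical conversion: treats the
# first half of each text as context and the second half as the target).
def text_to_instruction_record(text, instruction="Continue the following text."):
    head, tail = text[: len(text) // 2], text[len(text) // 2 :]
    return {"instruction": instruction, "input": head, "output": tail}

rec = text_to_instruction_record("abcdef")
print(json.dumps(rec))
```

Each returned dict can then be written as one line of the training JSONL with `json.dumps`.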