# Instruction Fine-Tuning Script

This script (`run_instruct.py`) fine-tunes language models on instruction-following tasks. It is based on the original CPT script, adapted for structured instruction input/output pairs.
## Key Differences from CPT

1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Response-Only Loss Masking**: Loss is computed only on the response/output portion, not on the instruction or input
5. **Automatic Label Creation**: Labels are created automatically, with `-100` masking the instruction tokens
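The masking described above can be sketched as follows. This is a minimal illustration with placeholder token IDs, not the script's actual code; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to the loss.

```python
# Illustrative sketch of response-only loss masking.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt portion of labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Placeholder token IDs: three prompt tokens, two response tokens.
input_ids, labels = build_labels([101, 102, 103], [201, 202])
```

Only the last two label positions carry real token IDs, so gradient flows only through the response.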
## Supported Data Formats

### JSONL Structure

Each line should be a JSON object with the following fields:

```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```
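Reading this format means parsing one JSON object per line. A minimal loader sketch (a hypothetical helper for illustration, not taken from `run_instruct.py`):

```python
import json

def load_jsonl(path):
    """Parse one JSON object per non-blank line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records
```

Each returned record is then expected to expose the instruction/input/output fields configured in the YAML.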
### Formatting Options

#### 1. ChatML Format (Default)

Uses the model's chat template with system/user/assistant roles:

```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```
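In practice the script resolves this through the tokenizer's chat template, but the underlying ChatML convention wraps each turn in `<|im_start|>` / `<|im_end|>` markers. A rough sketch of the rendered text (illustrative only; the exact special tokens vary by model):

```python
def format_chatml(system_prompt, instruction, input_text, output):
    """Render one example as ChatML-style turns (illustrative sketch)."""
    # Fold the optional input into the user turn when present.
    user = instruction if not input_text else f"{instruction}\n\n{input_text}"
    turns = [("system", system_prompt), ("user", user), ("assistant", output)]
    return "".join(
        f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in turns
    )

text = format_chatml(
    "You are a helpful assistant.",
    "What is the capital of France?", "",
    "The capital of France is Paris.",
)
```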
#### 2. Alpaca Format

Uses the classic Alpaca instruction format:

```yaml
data:
  format_type: "alpaca"
```
#### 3. Custom Format

Define your own template:

```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```
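The placeholders are standard Python `str.format` fields, so rendering an example reduces to (a minimal sketch with made-up example content):

```python
template = "Instruction: {instruction}\nInput: {input}\nOutput: {output}"

example = {
    "instruction": "Translate to French.",
    "input": "Hello",
    "output": "Bonjour",
}

# str.format substitutes each {field} with the matching record value.
formatted = template.format(**example)
```

Any field name used in the template must exist in every record, otherwise `format` raises a `KeyError`.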
## Configuration

Key configuration options in `config_instruct.yaml`:

### Data Configuration

```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1                  # used if no eval file is provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."

  # Tokenization
  max_length: 2048
```
### Training Configuration

```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```
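With these values, the effective batch size per optimizer step is the per-device batch size times the accumulation steps (times the device count; a single GPU is assumed below):

```python
# Effective batch size under gradient accumulation (single-GPU assumption).
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 1  # assumption: one GPU

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
```

Raising `gradient_accumulation_steps` keeps memory use flat while increasing the effective batch size.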
## Usage

### Basic Usage

```bash
python run_instruct.py --config config_instruct.yaml
```

### Merge Only (after training)

```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```
## Example Data Format

See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:

```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}
```
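A quick sanity check that each line parses and carries the expected fields can catch format problems before training. This is a hypothetical helper for illustration, not part of the script:

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_line(line):
    """Parse one JSONL line; return (record, list of missing fields)."""
    record = json.loads(line)  # raises json.JSONDecodeError on malformed lines
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    return record, missing

record, missing = validate_line(
    '{"instruction": "What is the capital of France?", '
    '"input": "", "output": "The capital of France is Paris."}'
)
```

If you remapped field names via `instruction_field` and friends in the config, validate against those names instead.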
## Key Features

1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Response-Only Loss Masking**: Computes loss only on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or a separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters into the base model
## Best Practices

1. **Data Quality**: Ensure your instruction-response pairs are high quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose a system prompt appropriate for your use case
4. **Token Length**: Set `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation to fit your GPU memory
## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce the batch size or enable 4-bit quantization
2. **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support

### Debug Mode

Add logging to see formatted examples:

```python
# In the format_instruction function, add:
print(f"Formatted: {formatted_text}")
```
## File Structure

```
CPT/
├── run_instruct.py        # Main instruction fine-tuning script
├── config_instruct.yaml   # Configuration file
├── instruct_data.jsonl    # Example instruction data
├── README_instruct.md     # This documentation
└── runs/                  # Training outputs
    └── instruct_run_v1/
        ├── logs/
        ├── checkpoints/
        ├── best_adapter/
        └── final_model/
```
## Migration from CPT

To migrate from the original CPT script:

1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)

The script keeps the same CLI interface and most configuration options, so migration is straightforward.