# Instruction Fine-Tuning Script

This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It's based on the original CPT script but adapted specifically for instruction input/output pairs.

## Key Differences from CPT

1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Proper Loss Masking**: Loss is computed only on the response/output portion, not on the instruction and input
5. **Automatic Label Creation**: Labels are created automatically, with `-100` masking for instruction tokens

## Supported Data Formats

### JSONL Structure

Each line should be a JSON object with the following fields:

```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```

### Formatting Options

#### 1. ChatML Format (Default)

Uses the model's chat template with system/user/assistant roles:

```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```

#### 2. Alpaca Format

Uses the classic Alpaca instruction format:

```yaml
data:
  format_type: "alpaca"
```

#### 3. Custom Format

Define your own template:

```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```

## Configuration

Key configuration options in `config_instruct.yaml`:

### Data Configuration

```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1                  # if no eval file provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."
  # Tokenization
  max_length: 2048
```

### Training Configuration

```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```

## Usage

### Basic Usage

```bash
python run_instruct.py --config config_instruct.yaml
```

### Merge Only (after training)

```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```

## Example Data Format

See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:

```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}
```

## Key Features

1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Proper Loss Masking**: Only computes loss on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters with the base model

## Best Practices

1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose appropriate system prompts for your use case
4. **Token Length**: Set an appropriate `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
2.
   **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support

### Debug Mode

Add logging to see formatted examples:

```python
# In format_instruction function, add:
print(f"Formatted: {formatted_text}")
```

## File Structure

```
CPT/
├── run_instruct.py       # Main instruction fine-tuning script
├── config_instruct.yaml  # Configuration file
├── instruct_data.jsonl   # Example instruction data
├── README_instruct.md    # This documentation
└── runs/                 # Training outputs
    └── instruct_run_v1/
        ├── logs/
        ├── checkpoints/
        ├── best_adapter/
        └── final_model/
```

## Migration from CPT

To migrate from the original CPT script:

1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)

The script maintains the same CLI interface and most configuration options for easy migration.
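## Appendix: Checking Your Converted Data

When converting data to the JSONL structure described above, a quick sanity check catches malformed lines before training starts. This is a minimal sketch, not part of `run_instruct.py` itself; the field names mirror the defaults in `config_instruct.yaml`, and `check_jsonl_line` is a hypothetical helper.

```python
import json

# Default field names from config_instruct.yaml; adjust if you remapped them.
REQUIRED_FIELDS = ("instruction", "input", "output")

def check_jsonl_line(line: str, fields=REQUIRED_FIELDS) -> dict:
    """Parse one JSONL line and verify it has the expected string fields."""
    record = json.loads(line)
    missing = [f for f in fields if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    non_str = [f for f in fields if not isinstance(record[f], str)]
    if non_str:
        raise ValueError(f"non-string fields: {non_str}")
    return record

good = check_jsonl_line('{"instruction": "Say hi.", "input": "", "output": "Hi!"}')
```

Running this over every line of your training file before launching a long job is cheap insurance against a crash midway through tokenization.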
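## Appendix: Loss Masking Sketch

The loss masking described under Key Differences can be illustrated with a short sketch. This is not the script's actual implementation: `format_alpaca`, `build_example`, and the toy whitespace tokenizer are illustrative assumptions, standing in for the real formatting functions and the model's tokenizer. The key idea is that prompt positions get label `-100`, which `CrossEntropyLoss` ignores, so loss is computed only on the response.

```python
# Sketch of prompt/response loss masking for instruction tuning.
# All helper names here are hypothetical, not the script's real API.

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's CrossEntropyLoss

def format_alpaca(instruction: str, inp: str) -> str:
    """Render the prompt portion in a classic Alpaca-style layout."""
    if inp:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

def build_example(tokenize, instruction: str, inp: str, output: str) -> dict:
    """Tokenize prompt and response separately, then mask the prompt in labels."""
    prompt_ids = tokenize(format_alpaca(instruction, inp))
    response_ids = tokenize(output)
    input_ids = prompt_ids + response_ids
    # Loss is computed only where labels != IGNORE_INDEX, i.e. on the response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

# Toy whitespace "tokenizer" just to demonstrate the shapes involved.
vocab: dict = {}
def toy_tokenize(text: str) -> list:
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

example = build_example(toy_tokenize, "Add the numbers.", "2 and 3", "The sum is 5.")
assert len(example["input_ids"]) == len(example["labels"])
```

With a real tokenizer, the same idea usually requires tokenizing the prompt and the full prompt+response text and masking by prefix length, since subword tokenizers can merge tokens across the boundary.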