# Instruction Fine-Tuning Script

This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It's based on the original CPT script but adapted specifically for instruction input/output pairs.

## Key Differences from CPT

1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Proper Loss Masking**: Loss is computed only on the response/output portion, not on the instruction and input
5. **Automatic Label Creation**: Labels are created automatically, with `-100` masking for instruction tokens

## Supported Data Formats

### JSONL Structure

Each line should be a JSON object with the following fields:

```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```

### Formatting Options

#### 1. ChatML Format (Default)

Uses the model's chat template with system/user/assistant roles:

```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```

#### 2. Alpaca Format

Uses the classic Alpaca instruction format:

```yaml
data:
  format_type: "alpaca"
```

#### 3. Custom Format

Define your own template:

```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```

## Configuration

Key configuration options in `config_instruct.yaml`:

### Data Configuration

```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1                  # if no eval file provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."
  # Tokenization
  max_length: 2048
```

### Training Configuration

```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```

## Usage

### Basic Usage

```bash
python run_instruct.py --config config_instruct.yaml
```

### Merge Only (after training)

```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```

## Example Data Format

See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:

```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}
```

## Key Features

1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Proper Loss Masking**: Only computes loss on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters with the base model

## Best Practices

1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose appropriate system prompts for your use case
4. **Token Length**: Set an appropriate `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
2.
   **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support

### Debug Mode

Add logging to see formatted examples:

```python
# In format_instruction function, add:
print(f"Formatted: {formatted_text}")
```

## File Structure

```
CPT/
├── run_instruct.py       # Main instruction fine-tuning script
├── config_instruct.yaml  # Configuration file
├── instruct_data.jsonl   # Example instruction data
├── README_instruct.md    # This documentation
└── runs/                 # Training outputs
    └── instruct_run_v1/
        ├── logs/
        ├── checkpoints/
        ├── best_adapter/
        └── final_model/
```

## Migration from CPT

To migrate from the original CPT script:

1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)

The script maintains the same CLI interface and most configuration options for easy migration.
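## Appendix: Checking Your Converted Data

When converting data to the JSONL structure described above, a quick sanity check catches malformed lines before training starts. This is a minimal sketch, not part of `run_instruct.py` itself; the field names mirror the defaults in `config_instruct.yaml`, and `check_jsonl_line` is a hypothetical helper.

```python
import json

# Default field names from config_instruct.yaml; adjust if you remapped them.
REQUIRED_FIELDS = ("instruction", "input", "output")

def check_jsonl_line(line: str, fields=REQUIRED_FIELDS) -> dict:
    """Parse one JSONL line and verify it has the expected string fields."""
    record = json.loads(line)
    missing = [f for f in fields if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    non_str = [f for f in fields if not isinstance(record[f], str)]
    if non_str:
        raise ValueError(f"non-string fields: {non_str}")
    return record

good = check_jsonl_line('{"instruction": "Say hi.", "input": "", "output": "Hi!"}')
```

Running this over every line of your training file before launching a long job is cheap insurance against a crash midway through tokenization.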
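## Appendix: Loss Masking Sketch

The loss masking described under Key Differences can be illustrated with a short sketch. This is not the script's actual implementation: `format_alpaca`, `build_example`, and the toy whitespace tokenizer are illustrative assumptions, standing in for the real formatting functions and the model's tokenizer. The key idea is that prompt positions get label `-100`, which `CrossEntropyLoss` ignores, so loss is computed only on the response.

```python
# Sketch of prompt/response loss masking for instruction tuning.
# All helper names here are hypothetical, not the script's real API.

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's CrossEntropyLoss

def format_alpaca(instruction: str, inp: str) -> str:
    """Render the prompt portion in a classic Alpaca-style layout."""
    if inp:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

def build_example(tokenize, instruction: str, inp: str, output: str) -> dict:
    """Tokenize prompt and response separately, then mask the prompt in labels."""
    prompt_ids = tokenize(format_alpaca(instruction, inp))
    response_ids = tokenize(output)
    input_ids = prompt_ids + response_ids
    # Loss is computed only where labels != IGNORE_INDEX, i.e. on the response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

# Toy whitespace "tokenizer" just to demonstrate the shapes involved.
vocab: dict = {}
def toy_tokenize(text: str) -> list:
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

example = build_example(toy_tokenize, "Add the numbers.", "2 and 3", "The sum is 5.")
assert len(example["input_ids"]) == len(example["labels"])
```

With a real tokenizer, the same idea usually requires tokenizing the prompt and the full prompt+response text and masking by prefix length, since subword tokenizers can merge tokens across the boundary.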