Instruction Fine-Tuning Script

This script (run_instruct.py) fine-tunes language models on instruction-following tasks. It is based on the original CPT script, adapted for structured instruction/input/output pairs.

Key Differences from CPT

  1. Data Format: Handles structured instruction data with separate fields for instruction, input, and output
  2. Formatting Options: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
  3. No Text Packing: Each example is treated as a complete instruction-response pair
  4. Proper Loss Masking: Loss is computed only on the response/output portion, not on the instruction or input
  5. Automatic Label Creation: Labels are automatically created with -100 masking for instruction tokens
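
The masking described in points 4 and 5 can be sketched as follows. This is a minimal illustration of the standard -100 convention, not the script's actual internals; the helper name `build_labels` is hypothetical.

```python
# Sketch of response-only loss masking (illustrative; run_instruct.py's
# actual helper may be named and structured differently).
def build_labels(prompt_ids, response_ids):
    # Prompt positions get -100, which Hugging Face loss functions ignore;
    # response positions keep their token ids and contribute to the loss.
    return [-100] * len(prompt_ids) + list(response_ids)

# Three prompt tokens masked out, two response tokens supervised.
labels = build_labels([101, 2023, 2003], [3437, 102])
```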

Supported Data Formats

JSONL Structure

Each line should be a JSON object with the following fields:

{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
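
A file in this format can be read with a few lines of standard-library Python. This loader is a sketch under the default field names, not part of run_instruct.py:

```python
import json

def load_instruct_jsonl(path):
    """Read instruction records from a JSONL file, skipping blank lines
    and checking that the three expected fields are present."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            for field in ("instruction", "input", "output"):
                if field not in record:
                    raise ValueError(f"missing {field!r} in line: {line[:80]}")
            records.append(record)
    return records
```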

Formatting Options

1. ChatML Format (Default)

Uses the model's chat template with system/user/assistant roles:

data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
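
The rendered text looks roughly like the plain-string version below. In practice the script would typically delegate to the tokenizer's chat template (e.g. tokenizer.apply_chat_template); this standalone formatter is an assumption for illustration, using the standard ChatML markers:

```python
def format_chatml(instruction, input_text, output, system_prompt):
    """Render one example with standard ChatML markers. A sketch only:
    run_instruct.py would normally use the model's own chat template."""
    # Fold the optional input into the user turn when present.
    user_turn = f"{instruction}\n\n{input_text}" if input_text else instruction
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_turn}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>"
    )
```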

2. Alpaca Format

Uses the classic Alpaca instruction format:

data:
  format_type: "alpaca"
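
The classic Alpaca format uses two templates, one with and one without the input field. The preamble wording below follows the original Alpaca release; the script's exact strings may differ:

```python
# Classic Alpaca templates (sketch; wording follows the original Alpaca
# release, which may not match run_instruct.py exactly).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def format_alpaca(record):
    # Fall back to the shorter template when the input field is empty.
    template = ALPACA_WITH_INPUT if record.get("input") else ALPACA_NO_INPUT
    return template.format(**record)
```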

3. Custom Format

Define your own template:

data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
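
A custom template is filled with standard Python str.format substitution, so the placeholders must be exactly {instruction}, {input}, and {output} (literal braces in the template would need doubling as {{ }}). A minimal sketch:

```python
def format_custom(record, template):
    """Fill a custom template via str.format; a missing input field
    is substituted as an empty string."""
    return template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
        output=record["output"],
    )

template = "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```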

Configuration

Key configuration options in config_instruct.yaml:

Data Configuration

data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1  # if no eval file provided
  
  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"
  
  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."
  
  # Tokenization
  max_length: 2048

Training Configuration

train:
  max_steps: 100   # takes precedence over num_train_epochs when set > 0
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
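
With these values, each optimizer step accumulates gradients over per_device_train_batch_size × gradient_accumulation_steps examples per device. A quick check, assuming a single GPU:

```python
# Effective batch size per optimizer step (single-GPU assumption;
# multiply by your device count for multi-GPU runs).
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 1

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
```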

Usage

Basic Usage

python run_instruct.py --config config_instruct.yaml

Merge Only (after training)

python run_instruct.py --config config_instruct.yaml --merge-only

Example Data Format

See instruct_data.jsonl for examples of the expected data format. Here are a few examples:

{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}

{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}

{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n    if n < 0:\n        raise ValueError(...)"}

Key Features

  1. Multiple Format Support: ChatML, Alpaca, and custom templates
  2. Flexible Field Mapping: Configure custom field names for your data
  3. Proper Loss Masking: Only computes loss on the response portion
  4. PEFT/LoRA Support: Efficient fine-tuning with LoRA
  5. Evaluation Support: Automatic evaluation split or separate eval file
  6. Checkpointing: Resume training from checkpoints
  7. Model Merging: Merge trained adapters with base model

Best Practices

  1. Data Quality: Ensure your instruction-response pairs are high-quality and consistent
  2. Format Consistency: Use the same format for training and inference
  3. System Prompts: Choose appropriate system prompts for your use case
  4. Token Length: Set appropriate max_length based on your model and data
  5. Batch Size: Adjust batch size and gradient accumulation based on your GPU memory

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce batch size or enable 4-bit quantization
  2. Slow Training: Reduce max_length, or raise per_device_train_batch_size (lowering gradient_accumulation_steps to match) if memory allows
  3. Poor Quality: Check data format consistency and quality
  4. Tokenizer Issues: Ensure your model has proper chat template support

Debug Mode

Add logging to see formatted examples:

# In format_instruction function, add:
print(f"Formatted: {formatted_text}")

File Structure

CPT/
├── run_instruct.py          # Main instruction fine-tuning script
├── config_instruct.yaml     # Configuration file
├── instruct_data.jsonl      # Example instruction data
├── README_instruct.md       # This documentation
└── runs/                    # Training outputs
    └── instruct_run_v1/
        ├── logs/
        ├── checkpoints/
        ├── best_adapter/
        └── final_model/

Migration from CPT

To migrate from the original CPT script:

  1. Convert your text data to instruction format
  2. Update your configuration file
  3. Choose appropriate formatting options
  4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)

The script maintains the same CLI interface and most configuration options for easy migration.