# Instruction Fine-Tuning Script
This script (`run_instruct.py`) is designed for fine-tuning language models on instruction-following tasks. It's based on the original CPT script but adapted specifically for instruction input/output pairs.
## Key Differences from CPT
1. **Data Format**: Handles structured instruction data with separate fields for instruction, input, and output
2. **Formatting Options**: Supports multiple instruction formats (ChatML, Alpaca, custom templates)
3. **No Text Packing**: Each example is treated as a complete instruction-response pair
4. **Proper Loss Masking**: Loss is only computed on the response/output portion, not on the instruction and input
5. **Automatic Label Creation**: Labels are automatically created with -100 masking for instruction tokens
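The masking in points 4 and 5 can be sketched in plain Python: labels mirror the input IDs, except every prompt position is set to -100 so the cross-entropy loss skips it (PyTorch's `CrossEntropyLoss` ignores index -100 by default). The function and token IDs below are illustrative, not taken from `run_instruct.py`:

```python
def build_labels(prompt_ids, response_ids, ignore_index=-100):
    """Create labels that mask the prompt so loss is computed only on the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    # ignore_index positions contribute nothing to the loss.
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Prompt tokens (instruction + input) are masked; response tokens keep their IDs.
input_ids, labels = build_labels([101, 2023, 2003], [1996, 3437, 102])
```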
## Supported Data Formats
### JSONL Structure
Each line should be a JSON object with the following fields:
```json
{
  "instruction": "Your instruction here",
  "input": "Optional input context (can be empty string)",
  "output": "Expected response"
}
```
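A minimal stdlib loader for this layout might look like the following; the field names match the defaults above, but the helper itself is a sketch, not part of the script:

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def load_instruct_jsonl(path):
    """Load one JSON object per line, validating the expected fields."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            examples.append(record)
    return examples
```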
### Formatting Options
#### 1. ChatML Format (Default)
Uses the model's chat template with system/user/assistant roles:
```yaml
data:
  format_type: "chatml"
  system_prompt: "You are a helpful assistant."
```
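For reference, ChatML wraps each turn in `<|im_start|>` / `<|im_end|>` markers. A rough sketch of the rendering is below; the actual script delegates this to the model's chat template, so treat the helper as illustrative only:

```python
def to_chatml(system_prompt, instruction, inp, output):
    """Render one example in ChatML-style markup (illustrative, not the script's code)."""
    user = instruction if not inp else f"{instruction}\n\n{inp}"
    turns = [("system", system_prompt), ("user", user), ("assistant", output)]
    return "".join(f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in turns)
```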
#### 2. Alpaca Format
Uses the classic Alpaca instruction format:
```yaml
data:
  format_type: "alpaca"
```
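The classic Alpaca prompt (from the Stanford Alpaca release) uses `### Instruction:` / `### Response:` headers, with a shorter variant when `input` is empty. A sketch, assuming the script follows that convention:

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def format_alpaca(example):
    """Pick the variant based on whether the example carries an input field."""
    template = ALPACA_WITH_INPUT if example.get("input") else ALPACA_NO_INPUT
    return template.format(**example)
```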
#### 3. Custom Format
Define your own template:
```yaml
data:
  format_type: "custom"
  custom_template: "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
```
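The `{instruction}`, `{input}`, and `{output}` placeholders are plain `str.format`-style fields. Presumably the script fills them roughly like this (the helper name is hypothetical):

```python
def format_custom(template, example):
    """Fill a custom template; a missing placeholder raises KeyError early."""
    return template.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
        output=example["output"],
    )
```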
## Configuration
Key configuration options in `config_instruct.yaml`:
### Data Configuration
```yaml
data:
  train_jsonl: "path/to/your/train.jsonl"
  eval_jsonl: "path/to/your/eval.jsonl"  # optional
  eval_split_ratio: 0.1  # used if no eval file is provided

  # Field names in your data
  instruction_field: "instruction"
  input_field: "input"
  output_field: "output"

  # Formatting
  format_type: "chatml"  # "chatml" | "alpaca" | "custom"
  system_prompt: "You are a helpful assistant."

  # Tokenization
  max_length: 2048
```
### Training Configuration
```yaml
train:
  max_steps: 100
  num_train_epochs: 3
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  learning_rate: 5e-5
  # ... other training parameters
```
## Usage
### Basic Usage
```bash
python run_instruct.py --config config_instruct.yaml
```
### Merge Only (after training)
```bash
python run_instruct.py --config config_instruct.yaml --merge-only
```
## Example Data Format
See `instruct_data.jsonl` for examples of the expected data format. Here are a few examples:
```json
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
{"instruction": "Translate the following English text to French.", "input": "Hello, how are you today?", "output": "Bonjour, comment allez-vous aujourd'hui?"}
{"instruction": "Write a Python function that calculates factorial.", "input": "", "output": "def factorial(n):\n if n < 0:\n raise ValueError(...)"}
```
## Key Features
1. **Multiple Format Support**: ChatML, Alpaca, and custom templates
2. **Flexible Field Mapping**: Configure custom field names for your data
3. **Proper Loss Masking**: Only computes loss on the response portion
4. **PEFT/LoRA Support**: Efficient fine-tuning with LoRA
5. **Evaluation Support**: Automatic evaluation split or separate eval file
6. **Checkpointing**: Resume training from checkpoints
7. **Model Merging**: Merge trained adapters with base model
## Best Practices
1. **Data Quality**: Ensure your instruction-response pairs are high-quality and consistent
2. **Format Consistency**: Use the same format for training and inference
3. **System Prompts**: Choose appropriate system prompts for your use case
4. **Token Length**: Set appropriate `max_length` based on your model and data
5. **Batch Size**: Adjust batch size and gradient accumulation based on your GPU memory
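On the last point: the effective batch size is the product of per-device batch size, gradient accumulation steps, and GPU count, so a memory-driven reduction in one factor can be offset by raising another. With the defaults shown above (assuming a single-GPU run):

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1  # assumption: single GPU

# Sequences contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps this product, and thus the optimization behavior, roughly unchanged.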
## Troubleshooting
### Common Issues
1. **CUDA Out of Memory**: Reduce batch size or enable 4-bit quantization
2. **Slow Training**: Increase `gradient_accumulation_steps` or reduce `max_length`
3. **Poor Quality**: Check data format consistency and quality
4. **Tokenizer Issues**: Ensure your model has proper chat template support
### Debug Mode
Add a temporary print to inspect formatted examples:
```python
# In format_instruction function, add:
print(f"Formatted: {formatted_text}")
```
## File Structure
```
CPT/
β”œβ”€β”€ run_instruct.py # Main instruction fine-tuning script
β”œβ”€β”€ config_instruct.yaml # Configuration file
β”œβ”€β”€ instruct_data.jsonl # Example instruction data
β”œβ”€β”€ README_instruct.md # This documentation
└── runs/ # Training outputs
└── instruct_run_v1/
β”œβ”€β”€ logs/
β”œβ”€β”€ checkpoints/
β”œβ”€β”€ best_adapter/
└── final_model/
```
## Migration from CPT
To migrate from the original CPT script:
1. Convert your text data to instruction format
2. Update your configuration file
3. Choose appropriate formatting options
4. Adjust training parameters (instruction fine-tuning typically needs fewer steps)
The script maintains the same CLI interface and most configuration options for easy migration.