# 🚀 CodeLlama Fine-Tuning Guide

**Last Updated:** November 25, 2025

---

## 📋 Overview

This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.

---

## 🎯 Features

### ✅ Implemented Features

1. **Optimized Hyperparameters** - based on `HYPERPARAMETER_ANALYSIS.md`
   - Max Length: 1536
   - LoRA Rank: 48
   - LoRA Alpha: 96
   - LoRA Dropout: 0.15
   - Learning Rate: 2e-5
   - Epochs: 5
   - And more...
2. **Checkpoint Resume** - automatically resume from the last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - continue training an existing fine-tuned model with new data
4. **Fresh Training** - start from scratch (optionally clearing old checkpoints)

---

## 🚀 Quick Start

### Start Fresh Training

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15
```

Or use the convenience script:

```bash
bash start_training.sh
```

---

## 🔄 Resuming from Checkpoint

### Automatic Resume (Recommended)

If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    [other parameters...]
```

The script will automatically find the latest checkpoint and resume from there.
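Auto-resume of this kind typically works by scanning the output directory for the highest-numbered `checkpoint-<step>` folder. A minimal sketch of that lookup, assuming HuggingFace-style checkpoint directory names (the function name `find_latest_checkpoint` is illustrative, not necessarily what the script uses internally):

```python
import os
import re

def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        full = os.path.join(output_dir, name)
        # Only numbered checkpoint directories count; logs etc. are skipped.
        if match and os.path.isdir(full):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, full
    return best_path
```

With checkpoints saved every 25 steps as configured above, a run interrupted after 80 steps would resume from `checkpoint-75`.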
### Manual Resume

To resume from a specific checkpoint:

```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```

### Force Fresh Training

To start fresh (ignoring existing checkpoints):

```bash
--fresh
```

This removes old checkpoints and starts from scratch.

---

## 📈 Incremental Fine-Tuning

### Continue Training an Existing Model with New Data

When you have new data and want to continue training an existing fine-tuned model:

```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other parameters...]
```

**Key Points:**

- `--adapter-path` points to the previous fine-tuned model
- `--output-dir` should be a new directory (or the same one if you want to update in place)
- The new dataset builds on the model's existing knowledge
- Training continues from where it left off

### Example Workflow

```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --dataset initial_data.jsonl \
    --output-dir model-v1

# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v1 \
    --dataset additional_data.jsonl \
    --output-dir model-v2

# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v2 \
    --dataset even_more_data.jsonl \
    --output-dir model-v3
```

---

## 🛑 Stopping Training

### Graceful Stop

Training automatically saves checkpoints at regular intervals (every 25 steps by default). To stop:

1. Press `Ctrl+C` once - training will finish the current step and save
2. Wait for the checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`

### Force Stop

If needed, you can force-kill the process:

```bash
# Find the training process
ps aux | grep finetune_codellama

# Kill it by PID
kill <PID>
```

The last saved checkpoint will still be available for resume.

---

## 📊 Monitoring Training

### Check Training Status

```bash
# View latest logs
tail -f training-outputs/codellama-fifo-v1/training.log

# List available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*

# View training config
cat training-outputs/codellama-fifo-v1/training_config.json
```

### Check GPU Usage

```bash
watch -n 1 nvidia-smi
```

---

## 🔧 All Command-Line Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset (JSONL) |
| `--output-dir` | **Required** | Output directory for the fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint (`auto` or a path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation interval (steps) |
| `--save-steps` | 25 | Checkpoint save interval (steps) |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging interval (steps) |

---

## 📁 Directory Structure

```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/     # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl            # Training data
│           ├── val.jsonl              # Validation data
│           └── test.jsonl             # Test data
├── training-outputs/
│   └── codellama-fifo-v1/             # Fine-tuned model
│       ├── checkpoint-25/             # Checkpoint 1
│       ├── checkpoint-50/             # Checkpoint 2
│       ├── checkpoint-75/             # Checkpoint 3 (latest)
│       ├── adapter_config.json        # LoRA config
│       ├── adapter_model.safetensors  # LoRA weights
│       └── training_config.json       # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py      # Training script
```

---

## ⚠️ Important Notes

### Dataset Format

The dataset must be in JSONL format with `instruction` and `response` fields:

```json
{
  "instruction": "System prompt + task description",
  "response": "Expected code output with ```verilog markers"
}
```

### Checkpoint Behavior

- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept (to save disk space)
- The best model (lowest validation loss) is automatically loaded at the end of training
- Checkpoints include the full training state for seamless resume

### Incremental Fine-Tuning Tips

1. **Same base model** - always use the same base model as the original training
2. **New output directory** - use a new output directory for each incremental training session
3. **Preserve the original** - keep the original fine-tuned model safe (don't overwrite it)
4. **Compatible data** - new data should follow the same format and domain

### Fresh Training vs. Incremental

- **Fresh training**: starts from the base model (no `--adapter-path`)
- **Incremental**: continues from a fine-tuned model (`--adapter-path` specified)
- **Resume**: continues from a checkpoint (same training session)

---

## 🐛 Troubleshooting

### Training Stops Unexpectedly

```bash
# Check whether a checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*

# Resume automatically
--resume-from-checkpoint auto
```

### Out of Memory

- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to keep the effective batch size constant

### Model Not Improving

- Check dataset quality
- Adjust the learning rate (try 1e-5 or 3e-5)
- Increase the number of epochs
- Check validation loss trends

---

## 📚 Related Documents

- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress

---

**Happy Fine-Tuning! 🚀**
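---

## ➕ Appendix: Effective Batch Size

The out-of-memory advice earlier in this guide relies on one identity: the effective batch size is the per-device batch size times the gradient accumulation steps (times the number of GPUs, if more than one). A quick sketch of the arithmetic, purely for illustration (the function is not part of the training script):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

# Defaults from this guide: --batch-size 2, --gradient-accumulation 4
default = effective_batch_size(2, 4)       # 8 samples per update

# OOM workaround: halve the batch size, double the accumulation steps;
# the effective batch size is unchanged.
low_memory = effective_batch_size(1, 8)    # still 8 samples per update
```

Because gradients accumulate over more micro-batches, the low-memory configuration trains more slowly per step, but each optimizer update sees the same amount of data, so the optimized hyperparameters remain valid.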