# CodeLlama Fine-Tuning Guide
**Last Updated:** November 25, 2025
---
## Overview
This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
---
## Features
### Implemented Features
1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
   - Max Length: 1536
   - LoRA Rank: 48
   - LoRA Alpha: 96
   - LoRA Dropout: 0.15
   - Learning Rate: 2e-5
   - Epochs: 5
   - And more...
2. **Checkpoint Resume** - Automatically resume from the last checkpoint if training is interrupted
3. **Incremental Fine-Tuning** - Continue training an existing fine-tuned model with new data
4. **Fresh Training** - Start from scratch (optionally clearing old checkpoints)
---
## Quick Start
### Start Fresh Training
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --max-length 1536 \
    --num-epochs 5 \
    --batch-size 2 \
    --gradient-accumulation 4 \
    --learning-rate 2e-5 \
    --lora-r 48 \
    --lora-alpha 96 \
    --lora-dropout 0.15
```
Or use the convenience script:
```bash
bash start_training.sh
```
---
## Resuming from Checkpoint
### Automatic Resume (Recommended)
If training is interrupted, simply run the same command again with `--resume-from-checkpoint auto`:
```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split/train.jsonl \
    --output-dir training-outputs/codellama-fifo-v1 \
    --resume-from-checkpoint auto \
    [other parameters...]
```
The script will automatically find the latest checkpoint and resume from there.
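Under the hood, `auto` resolution amounts to scanning the output directory for the highest-numbered `checkpoint-N` folder. A minimal sketch of that idea (the function name and exact logic are illustrative, not necessarily the script's actual implementation):

```python
import os
import re

def find_latest_checkpoint(output_dir: str):
    """Return the path of the highest-numbered checkpoint-N directory,
    or None if no checkpoints exist. Illustrative sketch of what
    '--resume-from-checkpoint auto' does; the real script may differ."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            checkpoints.append((int(match.group(1)), name))
    if not checkpoints:
        return None
    _, latest = max(checkpoints)
    return os.path.join(output_dir, latest)
```

Sorting numerically (not lexically) matters here: `checkpoint-100` must beat `checkpoint-75`.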
### Manual Resume
To resume from a specific checkpoint, pass its path:
```bash
--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
```
### Force Fresh Training
To start fresh (ignoring existing checkpoints):
```bash
--fresh
```
This will remove old checkpoints and start from scratch.
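Conceptually, `--fresh` amounts to deleting the `checkpoint-*` directories in the output directory before training begins. A hedged sketch (the helper name is hypothetical; the real script may handle this differently):

```python
import glob
import os
import shutil

def clear_checkpoints(output_dir: str) -> int:
    """Delete all checkpoint-* directories under output_dir and return how
    many were removed. Sketch of the '--fresh' behavior; files such as the
    final adapter weights are left untouched."""
    removed = 0
    for path in glob.glob(os.path.join(output_dir, "checkpoint-*")):
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed += 1
    return removed
```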
---
## Incremental Fine-Tuning
### Continue Training an Existing Model with New Data
When you have new data and want to continue training an existing fine-tuned model:
```bash
python3 scripts/training/finetune_codellama.py \
    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
    --adapter-path training-outputs/codellama-fifo-v1 \
    --dataset datasets/processed/new_data.jsonl \
    --output-dir training-outputs/codellama-fifo-v2 \
    [other parameters...]
```
**Key Points:**
- `--adapter-path` points to the previously fine-tuned adapter
- `--output-dir` should be a new directory (or the same one if you want to overwrite the previous adapter)
- Training continues from the existing adapter weights, so knowledge from earlier data is retained while the model adapts to the new dataset
### Example Workflow
```bash
# Step 1: Initial training
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --dataset initial_data.jsonl \
    --output-dir model-v1

# Step 2: Add more data (incremental)
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v1 \
    --dataset additional_data.jsonl \
    --output-dir model-v2

# Step 3: Add even more data
python3 scripts/training/finetune_codellama.py \
    --base-model /path/to/base \
    --adapter-path model-v2 \
    --dataset even_more_data.jsonl \
    --output-dir model-v3
```
---
## Stopping Training
### Graceful Stop
Training automatically saves checkpoints at regular intervals (every 25 steps by default). To stop:
1. Press `Ctrl+C` once - training will finish the current step and save
2. Wait for the checkpoint to be saved
3. Resume later with `--resume-from-checkpoint auto`
### Force Stop
If needed, you can force-kill the process:
```bash
# Find the training process
ps aux | grep finetune_codellama

# Kill the process
kill <PID>
```
The last saved checkpoint will still be available for resume.
---
## Monitoring Training
### Check Training Status
```bash
# View the latest logs
tail -f training-outputs/codellama-fifo-v1/training.log

# Check available checkpoints
ls -lh training-outputs/codellama-fifo-v1/checkpoint-*

# View the training config
cat training-outputs/codellama-fifo-v1/training_config.json
```
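Each checkpoint directory also contains a `trainer_state.json` file written by the Hugging Face Trainer, recording the global step, epoch, and best metric so far. A small helper to summarize it (illustrative, not part of the training script; the key names follow the Trainer's state file):

```python
import json
import os

def checkpoint_progress(checkpoint_dir: str) -> dict:
    """Summarize training progress from a checkpoint's trainer_state.json.
    Returns the global step, epoch, and best validation metric recorded
    so far (None for any key the Trainer has not populated yet)."""
    state_path = os.path.join(checkpoint_dir, "trainer_state.json")
    with open(state_path) as f:
        state = json.load(f)
    return {
        "global_step": state.get("global_step"),
        "epoch": state.get("epoch"),
        "best_metric": state.get("best_metric"),
    }
```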
### Check GPU Usage
```bash
watch -n 1 nvidia-smi
```
---
## All Command-Line Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--base-model` | **Required** | Base model path or HuggingFace ID |
| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
| `--dataset` | **Required** | Path to training dataset JSONL |
| `--output-dir` | **Required** | Output directory for fine-tuned model |
| `--resume-from-checkpoint` | None | Resume from checkpoint (`auto` or path) |
| `--fresh` | False | Force fresh training (ignore checkpoints) |
| `--max-length` | 1536 | Max sequence length |
| `--num-epochs` | 5 | Number of epochs |
| `--batch-size` | 2 | Batch size per device |
| `--gradient-accumulation` | 4 | Gradient accumulation steps |
| `--learning-rate` | 2e-5 | Learning rate |
| `--lora-r` | 48 | LoRA rank |
| `--lora-alpha` | 96 | LoRA alpha |
| `--lora-dropout` | 0.15 | LoRA dropout |
| `--warmup-ratio` | 0.1 | Warmup ratio |
| `--eval-steps` | 25 | Evaluation steps |
| `--save-steps` | 25 | Save steps |
| `--early-stopping-patience` | 5 | Early stopping patience |
| `--logging-steps` | 5 | Logging steps |
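Note that the effective batch size is `--batch-size` × `--gradient-accumulation` (2 × 4 = 8 with the defaults), which in turn determines how many optimizer steps an epoch takes. A quick sketch of the arithmetic, with the dataset size as an illustrative input:

```python
import math

def schedule_summary(num_examples, batch_size=2, grad_accum=4, num_epochs=5):
    """Compute the effective batch size and optimizer-step counts implied by
    the defaults above. num_examples is your dataset size (illustrative)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }

# e.g. a 400-example dataset: effective batch 8, 50 steps/epoch, 250 total
```

This is handy for choosing `--save-steps` and `--eval-steps`: with 250 total steps, saving every 25 steps yields ten checkpoints over the run.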
---
## Directory Structure
```
codellama-migration/
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/     # Base model
├── datasets/
│   └── processed/
│       └── split/
│           ├── train.jsonl            # Training data
│           ├── val.jsonl              # Validation data
│           └── test.jsonl             # Test data
├── training-outputs/
│   └── codellama-fifo-v1/             # Fine-tuned model
│       ├── checkpoint-25/             # Checkpoint 1
│       ├── checkpoint-50/             # Checkpoint 2
│       ├── checkpoint-75/             # Checkpoint 3 (latest)
│       ├── adapter_config.json        # LoRA config
│       ├── adapter_model.safetensors  # LoRA weights
│       └── training_config.json       # Training config
└── scripts/
    └── training/
        └── finetune_codellama.py      # Training script
```
---
## Important Notes
### Dataset Format
The dataset must be in JSONL format - one JSON object per line, each with `instruction` and `response` fields:
```json
{"instruction": "System prompt + task description", "response": "Expected code output with ```verilog markers"}
```
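Before launching a long run, it can be worth validating the dataset up front. A minimal pre-flight check (not part of the training script; the field requirements follow the format above):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def validate_jsonl(path: str):
    """Check that every line of a JSONL dataset parses and has the required
    string fields. Returns a list of (line_number, error) tuples; an empty
    list means the file is valid."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((lineno, f"invalid JSON: {exc}"))
                continue
            for field in REQUIRED_FIELDS:
                if field not in record or not isinstance(record[field], str):
                    errors.append((lineno, f"missing or non-string field: {field}"))
    return errors
```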
### Checkpoint Behavior
- Checkpoints are saved every `--save-steps` steps (default: 25)
- Only the last 3 checkpoints are kept (to save disk space)
- The best model (lowest validation loss) is automatically loaded at the end
- Checkpoints include the full training state for seamless resume
### Incremental Fine-Tuning Tips
1. **Use the same base model** - Always use the same base model as in the original training
2. **New output directory** - Use a new output directory for each incremental training session
3. **Preserve the original** - Keep the original fine-tuned model safe (don't overwrite it)
4. **Compatible data** - New data should follow the same format and domain
### Fresh Training vs Incremental
- **Fresh Training**: Start from the base model (no `--adapter-path`)
- **Incremental**: Continue from a fine-tuned model (`--adapter-path` specified)
- **Resume**: Continue from a checkpoint (same training session)
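The three modes above can be summarized as a small decision function (illustrative only; the argument names mirror the CLI flags, and the real script's precedence among flags may differ):

```python
def training_mode(adapter_path=None, resume_from_checkpoint=None, fresh=False):
    """Classify which mode a run is in, mirroring the table above.
    Assumed precedence: --fresh overrides everything, then checkpoint
    resume, then incremental fine-tuning from an adapter."""
    if fresh:
        return "fresh"
    if resume_from_checkpoint:
        return "resume"
    if adapter_path:
        return "incremental"
    return "fresh"
```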
---
## Troubleshooting
### Training Stops Unexpectedly
```bash
# Check whether a checkpoint exists
ls training-outputs/codellama-fifo-v1/checkpoint-*

# Resume automatically
--resume-from-checkpoint auto
```
### Out of Memory
- Reduce `--batch-size` (e.g., from 2 to 1)
- Reduce `--max-length` (e.g., from 1536 to 1024)
- Increase `--gradient-accumulation` to maintain the effective batch size
### Model Not Improving
- Check dataset quality
- Adjust the learning rate (try 1e-5 or 3e-5)
- Increase epochs
- Check validation loss trends
---
## Related Documents
- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
- `MIGRATION_PROGRESS.md` - Migration status and progress
---
**Happy Fine-Tuning!**