Elinnos
/

codellama-fine-tuning


Prithvik-1 committed on Nov 25, 2025

Commit

ff9646f

1 Parent(s): 47f1a10

Upload TRAINING_GUIDE.md with huggingface_hub

Files changed (1)

TRAINING_GUIDE.md +319 -0

TRAINING_GUIDE.md ADDED

	@@ -0,0 +1,319 @@

+# 🚀 CodeLlama Fine-Tuning Guide
+**Last Updated:** November 25, 2025
+---
+## 📋 Overview
+This guide explains how to use the optimized CodeLlama fine-tuning script with checkpoint resume and incremental fine-tuning capabilities.
+---
+## 🎯 Features
+### ✅ Implemented Features
+1. **Optimized Hyperparameters** - Based on `HYPERPARAMETER_ANALYSIS.md`
+   - Max Length: 1536
+   - LoRA Rank: 48
+   - LoRA Alpha: 96
+   - LoRA Dropout: 0.15
+   - Learning Rate: 2e-5
+   - Epochs: 5
+   - Remaining defaults are listed under "All Command-Line Arguments"
+2. **Checkpoint Resume** - Automatically resume from last checkpoint if training is interrupted
+3. **Incremental Fine-Tuning** - Continue training from existing fine-tuned model with new data
+4. **Fresh Training** - Start from scratch (optionally clear old checkpoints)
+---
+## 🚀 Quick Start
+### Start Fresh Training
+```bash
+cd /workspace/ftt/codellama-migration
+python3 scripts/training/finetune_codellama.py \
+    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
+    --dataset datasets/processed/split/train.jsonl \
+    --output-dir training-outputs/codellama-fifo-v1 \
+    --max-length 1536 \
+    --num-epochs 5 \
+    --batch-size 2 \
+    --gradient-accumulation 4 \
+    --learning-rate 2e-5 \
+    --lora-r 48 \
+    --lora-alpha 96 \
+    --lora-dropout 0.15
+```
+Or use the convenience script:
+```bash
+bash start_training.sh
+```
+---
+## 🔄 Resuming from Checkpoint
+### Automatic Resume (Recommended)
+If training is interrupted, rerun the same command with `--resume-from-checkpoint auto` added:
+```bash
+python3 scripts/training/finetune_codellama.py \
+    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
+    --dataset datasets/processed/split/train.jsonl \
+    --output-dir training-outputs/codellama-fifo-v1 \
+    --resume-from-checkpoint auto \
+    [other parameters...]
+```
+The script will automatically find the latest checkpoint and resume from there.
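The guide does not show how `auto` locates the checkpoint. A minimal sketch of one way to do it, assuming checkpoints are directories named `checkpoint-<step>` inside the output directory (as in the directory listing later in this guide); `find_latest_checkpoint` is a hypothetical helper, not a function from the training script:

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-25" and captures the step number.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for entry in root.iterdir():
        match = _CHECKPOINT_RE.match(entry.name)
        if entry.is_dir() and match:
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    # Compare step numbers numerically so checkpoint-100 beats checkpoint-75.
    return max(candidates, key=lambda item: item[0])[1]
```

Sorting by the parsed integer matters: a plain string sort would rank `checkpoint-75` above `checkpoint-100`. If the script is built on the Hugging Face `Trainer` (not stated here), `transformers.trainer_utils.get_last_checkpoint` offers similar behavior.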
+### Manual Resume
+To resume from a specific checkpoint:
+```bash
+--resume-from-checkpoint training-outputs/codellama-fifo-v1/checkpoint-25
+```
+### Force Fresh Training
+To start fresh (ignore existing checkpoints):
+```bash
+--fresh
+```
+This removes old checkpoints and starts from scratch, so back up any checkpoints you want to keep first.
+---
+## 📈 Incremental Fine-Tuning
+### Continue Training Existing Model with New Data
+When you have new data and want to continue training an existing fine-tuned model:
+```bash
+python3 scripts/training/finetune_codellama.py \
+    --base-model /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct \
+    --adapter-path training-outputs/codellama-fifo-v1 \
+    --dataset datasets/processed/new_data.jsonl \
+    --output-dir training-outputs/codellama-fifo-v2 \
+    [other parameters...]
+```
+**Key Points:**
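+- `--adapter-path` points to the previously fine-tuned LoRA adapter
+- `--output-dir` should be a new directory; reusing the previous one overwrites the earlier adapter
+- Only the new dataset is used in this run; earlier knowledge is carried over through the adapter weights, not by merging datasets
+- Training starts a new run from the adapter's weights. To continue an interrupted run instead, use `--resume-from-checkpoint`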
+### Example Workflow
+```bash
+# Step 1: Initial training
+python3 scripts/training/finetune_codellama.py \
+    --base-model /path/to/base \
+    --dataset initial_data.jsonl \
+    --output-dir model-v1
+# Step 2: Add more data (incremental)
+python3 scripts/training/finetune_codellama.py \
+    --base-model /path/to/base \
+    --adapter-path model-v1 \
+    --dataset additional_data.jsonl \
+    --output-dir model-v2
+# Step 3: Add even more data
+python3 scripts/training/finetune_codellama.py \
+    --base-model /path/to/base \
+    --adapter-path model-v2 \
+    --dataset even_more_data.jsonl \
+    --output-dir model-v3
+```
+---
+## 🛑 Stopping Training
+### Graceful Stop
+Training automatically saves a checkpoint every `--save-steps` steps (25 by default). To stop:
+1. Press `Ctrl+C` once. Training finishes the current step and saves.
+2. Wait for the checkpoint to be written.
+3. Resume later with `--resume-from-checkpoint auto`.
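How the script implements the graceful stop is not shown in this guide. One common pattern, sketched here under that caveat, is a `SIGINT` handler that only sets a flag, with the training loop checking the flag after each step. `GracefulStop` and `train` are hypothetical names, and `on_step` stands in for one real optimizer step:

```python
import signal
from typing import Callable, List, Optional


class GracefulStop:
    """Records that Ctrl+C was pressed instead of interrupting immediately."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # Route SIGINT (Ctrl+C) to the flag-setting handler.
        signal.signal(signal.SIGINT, self.handle)

    def handle(self, signum, frame) -> None:
        self.requested = True


def train(total_steps: int, stop: GracefulStop, save_steps: int = 25,
          on_step: Optional[Callable[[int], None]] = None) -> List[int]:
    """Run a toy loop and return the steps at which a checkpoint was saved."""
    saved = []
    for step in range(1, total_steps + 1):
        if on_step is not None:
            on_step(step)  # placeholder for the actual training step
        # Save on the regular schedule, or immediately if a stop was requested.
        if step % save_steps == 0 or stop.requested:
            saved.append(step)
        if stop.requested:
            break
    return saved
```

Because the handler never raises, the step in progress always completes before the loop saves and exits, which matches the behavior described in the list above.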
+### Force Stop
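+If the process does not stop, terminate it from another shell:
+```bash
+# Find the training process (pgrep -f matches against the full command line)
+pgrep -af finetune_codellama
+# Send SIGTERM; add -9 only if the process ignores it
+kill <PID>
+```
+The last saved checkpoint remains available for resume, but any steps completed after it are lost.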
+---
+## 📊 Monitoring Training
+### Check Training Status
+```bash
+# View latest logs
+tail -f training-outputs/codellama-fifo-v1/training.log
+# Check available checkpoints (-d lists the directories themselves, not their contents)
+ls -dlh training-outputs/codellama-fifo-v1/checkpoint-*
+# View training config
+cat training-outputs/codellama-fifo-v1/training_config.json
+```
+### Check GPU Usage
+```bash
+watch -n 1 nvidia-smi
+```
+---
+## 🔧 All Command-Line Arguments
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--base-model` | **Required** | Base model path or Hugging Face model ID |
+| `--adapter-path` | None | Path to existing LoRA adapter (incremental fine-tuning) |
+| `--dataset` | **Required** | Path to training dataset JSONL |
+| `--output-dir` | **Required** | Output directory for fine-tuned model |
+| `--resume-from-checkpoint` | None | Resume from checkpoint ('auto' or path) |
+| `--fresh` | False | Force fresh training (ignore checkpoints) |
+| `--max-length` | 1536 | Max sequence length |
+| `--num-epochs` | 5 | Number of epochs |
+| `--batch-size` | 2 | Batch size per device |
+| `--gradient-accumulation` | 4 | Gradient accumulation steps |
+| `--learning-rate` | 2e-5 | Learning rate |
+| `--lora-r` | 48 | LoRA rank |
+| `--lora-alpha` | 96 | LoRA alpha |
+| `--lora-dropout` | 0.15 | LoRA dropout |
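+| `--warmup-ratio` | 0.1 | Fraction of total training steps used for learning-rate warmup |
+| `--eval-steps` | 25 | Run evaluation every N steps |
+| `--save-steps` | 25 | Save a checkpoint every N steps |
+| `--early-stopping-patience` | 5 | Evaluations without improvement before training stops |
+| `--logging-steps` | 5 | Log metrics every N steps |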
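
For readers who want to script around these flags, the table can be expressed as an `argparse` parser. This is a reconstruction from the table only, not the parser in `finetune_codellama.py`; the function name `build_parser` and the argument types are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the table above."""
    p = argparse.ArgumentParser(description="CodeLlama LoRA fine-tuning (sketch)")
    # Required paths
    p.add_argument("--base-model", required=True)
    p.add_argument("--dataset", required=True)
    p.add_argument("--output-dir", required=True)
    # Optional starting points
    p.add_argument("--adapter-path", default=None)
    p.add_argument("--resume-from-checkpoint", default=None)  # 'auto' or a path
    p.add_argument("--fresh", action="store_true")
    # Hyperparameters with the documented defaults
    p.add_argument("--max-length", type=int, default=1536)
    p.add_argument("--num-epochs", type=int, default=5)
    p.add_argument("--batch-size", type=int, default=2)
    p.add_argument("--gradient-accumulation", type=int, default=4)
    p.add_argument("--learning-rate", type=float, default=2e-5)
    p.add_argument("--lora-r", type=int, default=48)
    p.add_argument("--lora-alpha", type=int, default=96)
    p.add_argument("--lora-dropout", type=float, default=0.15)
    p.add_argument("--warmup-ratio", type=float, default=0.1)
    p.add_argument("--eval-steps", type=int, default=25)
    p.add_argument("--save-steps", type=int, default=25)
    p.add_argument("--early-stopping-patience", type=int, default=5)
    p.add_argument("--logging-steps", type=int, default=5)
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

`argparse` converts dashes to underscores, so `--lora-r` is read as `args.lora_r`.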
+---
+## 📁 Directory Structure
+```
+codellama-migration/
+├── models/
+│   └── base-models/
+│       └── CodeLlama-7B-Instruct/    # Base model
+├── datasets/
+│   └── processed/
+│       └── split/
+│           ├── train.jsonl            # Training data
+│           ├── val.jsonl              # Validation data
+│           └── test.jsonl             # Test data
+├── training-outputs/
+│   └── codellama-fifo-v1/            # Fine-tuned model
+│       ├── checkpoint-25/             # Checkpoint 1
+│       ├── checkpoint-50/             # Checkpoint 2
+│       ├── checkpoint-75/             # Checkpoint 3 (latest)
+│       ├── adapter_config.json        # LoRA config
+│       ├── adapter_model.safetensors  # LoRA weights
+│       └── training_config.json       # Training config
+└── scripts/
+    └── training/
+        └── finetune_codellama.py      # Training script
+```
+---
+## ⚠️ Important Notes
+### Dataset Format
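+The dataset must be in JSONL format: one JSON object per line, each with `instruction` and `response` string fields. The example below is pretty-printed for readability; in the file, each record occupies a single line: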
+```json
+{
+  "instruction": "System prompt + task description",
+  "response": "Expected code output with ```verilog markers"
+}
+```
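Checking a dataset against this format before a long run can save a failed job. A minimal sketch of such a check, assuming both fields must be non-empty strings (the guide only names the fields); `validate_jsonl` is a hypothetical helper:

```python
import json
from typing import List

REQUIRED_FIELDS = ("instruction", "response")


def validate_jsonl(path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for field in REQUIRED_FIELDS:
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"line {line_no}: missing or empty '{field}'")
    return problems
```

Calling `validate_jsonl("datasets/processed/split/train.jsonl")` and printing the result gives a quick pre-flight report. A pretty-printed record like the one above would fail this check, because each of its lines is not valid JSON on its own.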
+### Checkpoint Behavior
+- A checkpoint is saved every `--save-steps` steps (default: 25)
+- Only the last 3 checkpoints are kept (to save disk space)
+- The best model (lowest validation loss) is automatically loaded at the end
+- Checkpoints include the full training state for seamless resume
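To see which checkpoints should be on disk at a given point, the save interval and retention limit can be combined in a few lines. This is a simplified model, assuming strictly "keep the newest three" and ignoring any extra retention of the best checkpoint; `retained_checkpoints` is a hypothetical helper:

```python
from typing import List


def retained_checkpoints(current_step: int, save_steps: int = 25,
                         keep_last: int = 3) -> List[int]:
    """Return the step numbers of the checkpoints expected to remain on disk."""
    # Every multiple of save_steps up to the current step has been saved once.
    saved = list(range(save_steps, current_step + 1, save_steps))
    # Older checkpoints are deleted, leaving only the newest keep_last.
    return saved[-keep_last:]


print(retained_checkpoints(75))   # [25, 50, 75]
print(retained_checkpoints(130))  # [75, 100, 125]
```

The second call shows the practical consequence: after step 130, resuming from `checkpoint-25` or `checkpoint-50` is no longer possible.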
+### Incremental Fine-Tuning Tips
+1. **Use same base model** - Always use the same base model as the original training
+2. **New output directory** - Use a new output directory for each incremental training session
+3. **Preserve original** - Keep the original fine-tuned model safe (don't overwrite)
+4. **Compatible data** - New data should follow the same format and domain
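The first three tips can be checked before launching a run. The sketch below assumes the adapter directory contains an `adapter_config.json` with a `base_model_name_or_path` field, which is what the PEFT library writes when saving a LoRA adapter; whether this script saves adapters through PEFT is not stated in the guide. `check_incremental_setup` is a hypothetical helper, and comparing only the final path component is a deliberately loose heuristic:

```python
import json
from pathlib import Path
from typing import List


def check_incremental_setup(adapter_dir: str, base_model: str,
                            output_dir: str) -> List[str]:
    """Return warnings about an incremental fine-tuning setup (empty if none)."""
    warnings = []
    config_path = Path(adapter_dir) / "adapter_config.json"
    if not config_path.is_file():
        return [f"no adapter_config.json found in {adapter_dir}"]
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Tip 1: the adapter should have been trained on the same base model.
    recorded = config.get("base_model_name_or_path")
    if recorded and Path(recorded).name != Path(base_model).name:
        warnings.append(
            f"adapter was trained on '{recorded}', not '{base_model}'"
        )

    # Tips 2 and 3: writing into the adapter directory would overwrite it.
    if Path(output_dir).resolve() == Path(adapter_dir).resolve():
        warnings.append("output directory is the same as the adapter directory")
    return warnings
```

An empty list does not prove the setup is right; it only rules out the two mistakes the tips warn about.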
+### Fresh Training vs Incremental
+- **Fresh Training**: Start from base model (no `--adapter-path`)
+- **Incremental**: Continue from fine-tuned model (`--adapter-path` specified)
+- **Resume**: Continue from checkpoint (same training session)
+---
+## 🐛 Troubleshooting
+### Training Stops Unexpectedly
+```bash
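+# Check whether a checkpoint exists (-d lists the directories themselves)
+ls -d training-outputs/codellama-fifo-v1/checkpoint-*
+# If one exists, rerun the original training command with this flag added:
+#   --resume-from-checkpoint auto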
+```
+### Out of Memory
+- Reduce `--batch-size` (e.g., from 2 to 1)
+- Reduce `--max-length` (e.g., from 1536 to 1024)
+- Increase `--gradient-accumulation` to maintain effective batch size
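The trade-off in the last bullet is simple arithmetic: the effective batch size is the per-device batch size times the accumulation steps (times the number of devices). A small sketch with hypothetical helper names, assuming a single GPU unless stated otherwise:

```python
def effective_batch_size(batch_size: int, gradient_accumulation: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return batch_size * gradient_accumulation * num_devices


def accumulation_for(target_effective: int, batch_size: int,
                     num_devices: int = 1) -> int:
    """Smallest accumulation value that reaches the target effective batch size."""
    per_step = batch_size * num_devices
    return -(-target_effective // per_step)  # ceiling division


# Defaults from this guide: batch size 2, accumulation 4.
print(effective_batch_size(2, 4))  # 8
# After dropping --batch-size to 1, keep the same effective size:
print(accumulation_for(8, 1))      # 8
```

So with the guide's defaults, reducing `--batch-size` from 2 to 1 calls for `--gradient-accumulation 8` to keep each update at 8 examples.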
+### Model Not Improving
+- Check dataset quality
+- Adjust learning rate (try 1e-5 or 3e-5)
+- Increase epochs
+- Check validation loss trends
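Validation loss trends can be read programmatically if the checkpoints contain a `trainer_state.json` with a `log_history` list, as the Hugging Face `Trainer` writes. That the script uses `Trainer` is an assumption here, not something the guide states. Both function names are hypothetical:

```python
import json
from typing import List, Tuple


def eval_loss_history(state_path: str) -> List[Tuple[int, float]]:
    """Return (step, eval_loss) pairs in logged order from a trainer_state.json file."""
    with open(state_path, encoding="utf-8") as handle:
        state = json.load(handle)
    return [
        (entry["step"], entry["eval_loss"])
        for entry in state.get("log_history", [])
        if "eval_loss" in entry and "step" in entry
    ]


def evals_since_best(history: List[Tuple[int, float]]) -> int:
    """Count evaluations that have run since the lowest validation loss."""
    if not history:
        return 0
    losses = [loss for _, loss in history]
    best_index = losses.index(min(losses))
    return len(losses) - 1 - best_index
```

A count approaching `--early-stopping-patience` (default 5) suggests the run is about to stop early, and a loss that rises while training loss keeps falling points to overfitting rather than a learning-rate problem.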
+---
+## 📚 Related Documents
+- `HYPERPARAMETER_ANALYSIS.md` - Detailed hyperparameter recommendations
+- `DATASET_SPLIT_VALIDATION_GUIDE.md` - Dataset preparation guide
+- `MIGRATION_PROGRESS.md` - Migration status and progress
+---
+**Happy Fine-Tuning! 🚀**