# Fine-tuning Guide: XCoder-80K Dataset
This guide explains how to fine-tune Ollama models on the XCoder-80K code dataset.
## Overview
The `finetune_models.py` script fine-tunes open-source code models on the XCoder-80K dataset from Hugging Face:
| Ollama Model | HuggingFace Model | Size | Recommended |
|---|---|---|---|
| `llama3.2:latest` | meta-llama/Llama-2-7b-hf | 7B | ✅ Best for code |
| `gemma3:4b` | google/gemma-7b | 7B | ✅ Good alternative |
| `gemma3:1b` | google/gemma-2b | 2B | Lightweight option |
| `llava:latest` | Not suitable | Multimodal | ❌ Skip (vision-only) |
**Dataset:** [banksy235/XCoder-80K](https://huggingface.co/datasets/banksy235/XCoder-80K)
- 80,000 code examples
- Covers multiple programming languages
- Suitable for code generation and repair
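To sanity-check the data before a full run, you can load and inspect a few examples with `datasets` (a minimal sketch, assuming the standard `train` split; check the dataset card for the actual field names):
```python
from datasets import load_dataset

# Downloads on first use, then served from the local cache
ds = load_dataset("banksy235/XCoder-80K", split="train")
print(f"Examples: {len(ds)}")

# Peek at the first record to confirm the schema before training
print(ds[0])
```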
## Installation
### Quick Install (Recommended)
**Windows:**
```bash
install_finetune.bat
```
**Linux/macOS:**
```bash
bash install_finetune.sh
```
### Manual Installation
1. **Install PyTorch with CUDA 12.1 support:**
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
2. **Install fine-tuning dependencies:**
```bash
pip install -r requirements-finetune.txt
```
3. **Verify installation:**
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'GPU: {torch.cuda.is_available()}')"
```
### Install Hugging Face CLI (Optional)
For easier dataset management:
```bash
# macOS/Linux
curl -LsSf https://hf.co/cli/install.sh | bash -s
# Or via pip
pip install huggingface_hub
# Login (for private datasets)
huggingface-cli login
```
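To confirm the login took effect, `huggingface_hub` exposes a `whoami` helper:
```python
from huggingface_hub import whoami

# Prints the account associated with the stored token
print(whoami()["name"])
```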
## Usage
### Option 1: Fine-tune Single Model
Fine-tune Llama-2-7b on XCoder-80K (the recommended starting point):
```bash
python finetune_models.py --model llama3.2 \
--num-epochs 3 \
--batch-size 4 \
--learning-rate 2e-4
```
### Option 2: Fine-tune All Models Sequentially
```bash
python finetune_models.py --all-models \
--num-epochs 3 \
--batch-size 4 \
--max-samples 5000
```
### Option 3: Custom Configuration
```bash
python finetune_models.py \
--model llama3.2 \
--output-dir ./my_finetuned_models \
--num-epochs 5 \
--batch-size 8 \
--learning-rate 1e-4 \
--max-samples 10000 \
--no-lora # Disable LoRA (full fine-tuning)
```
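By default the script trains a LoRA adapter instead of updating every weight. The setup is conceptually close to the following `peft` configuration (a sketch with illustrative values, not the script's actual defaults; see `finetune_models.py` for those):
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Inject small low-rank matrices into the attention projections;
# only these are trained, which is why VRAM use drops so sharply.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],   # which layers get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```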
## Training Arguments Explained
| Argument | Default | Description |
|---|---|---|
| `--model` | `llama3.2` | Model to fine-tune |
| `--all-models` | False | Fine-tune all available models |
| `--output-dir` | `./finetuned_models` | Where to save fine-tuned models |
| `--num-epochs` | 3 | Training epochs (more = longer training) |
| `--batch-size` | 4 | Batch size (larger = more VRAM needed) |
| `--learning-rate` | 2e-4 | Learning rate (lower = smaller, more conservative updates) |
| `--max-samples` | None | Limit samples (None = use all 80K) |
| `--no-lora` | False | Disable LoRA (full fine-tuning) |
| `--no-gradient-checkpointing` | False | Disable gradient checkpointing |
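To estimate a run's length before launching it, note that the number of optimizer steps is roughly `epochs × ceil(samples / batch_size)` (ignoring any gradient accumulation the script may apply internally):
```python
import math

epochs, batch_size, num_samples = 3, 4, 5_000  # e.g. --max-samples 5000

steps = epochs * math.ceil(num_samples / batch_size)
print(f"~{steps} optimizer steps")  # 3 * 1250 = 3750
```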
## Output
After training, models are saved to:
```
finetuned_models/
├── llama3_2/
│   ├── final/
│   │   ├── pytorch_model.bin
│   │   ├── config.json
│   │   └── tokenizer.json
│   └── metadata.json
├── gemma3_4b/
│   └── ...
└── gemma3_1b/
    └── ...
```
## Using Fine-tuned Models with Ollama
After fine-tuning, you can create custom Ollama models. Create a `Modelfile`:
```dockerfile
FROM llama3.2:latest
# Apply the fine-tuned LoRA adapter on top of the base model
ADAPTER ./finetuned_models/llama3_2/final
# Optional: Set parameters
PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```
Then create and run:
```bash
ollama create my-finetuned-llama -f Modelfile
ollama run my-finetuned-llama "your prompt here"
```
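You can also query the model programmatically through Ollama's local REST API (this assumes the Ollama server is running on its default port, 11434):
```python
import requests

# One non-streaming completion from the locally served model
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-finetuned-llama",
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,
    },
)
print(resp.json()["response"])
```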
Or load the fine-tuned weights directly with `transformers`:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "./finetuned_models/llama3_2/final"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a completion from a code prompt
inputs = tokenizer("def fibonacci", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
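Note that this direct load assumes `final/` holds a full model. If the script saved only LoRA adapter weights (the default training mode), attach them to the base model with `peft` instead; a minimal sketch:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./finetuned_models/llama3_2/final")

# Optionally fold the adapter into the base weights for standalone use
model = model.merge_and_unload()
```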
## Hardware Requirements
| Configuration | VRAM | Training Speed | Recommended |
|---|---|---|---|
| RTX 4090 (24GB) | 24GB | ~2 hours | ✅ Excellent |
| RTX 4080 (16GB) | 16GB | ~3-4 hours | ✅ Good |
| RTX 4070 (12GB) | 12GB | ~5-6 hours | Acceptable |
| Tesla T4 (16GB) | 16GB | ~4-5 hours | Cloud-friendly |
| CPU only | N/A | ~1-2 days | Not recommended |
**Optimization Tips:**
- Use `--batch-size 2` for GPUs with <12GB VRAM
- Use `--max-samples 1000` to train on a subset first
- LoRA (default) uses roughly 70% less VRAM than full fine-tuning
- Gradient checkpointing (default) reduces VRAM by roughly 30%
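To see which row of the table matches your machine, query the GPU from Python:
```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Total VRAM in GiB; compare against the table above
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU detected; expect CPU-only training speeds")
```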
## Integration with CodeArena RL
To use fine-tuned models with the CodeArena RL environment:
1. **Export to Ollama** (see above)
2. **Update Dashboard.jsx** to use the new model:
```javascript
const [ollamaModel, setOllamaModel] = useState('my-finetuned-llama');
```
3. **Or update ollama_rl_rollout.py:**
```bash
python ollama_rl_rollout.py --ollama-model my-finetuned-llama
```
## Monitoring Training
Training logs are saved to TensorBoard format:
```bash
tensorboard --logdir ./finetuned_models/llama3_2
```
Open http://localhost:6006 to monitor:
- Training loss
- Learning rate schedules
- GPU usage
## Troubleshooting
### Out of Memory (OOM)
```bash
# Reduce batch size
python finetune_models.py --batch-size 2
# Or limit samples
python finetune_models.py --max-samples 1000
```
### Slow Training
- Ensure GPU is being used: `nvidia-smi`
- Use smaller model: `--model gemma3:1b`
- Reduce `max_length` in the tokenization step (in code; see the sketch below)
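Shorter sequences mean less compute per step. If the script tokenizes with `transformers`, the change looks something like this (illustrative; the actual call lives in `finetune_models.py`):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Truncating to 512 tokens (instead of a longer default) caps per-step cost
tokens = tokenizer("def fibonacci(n):", truncation=True, max_length=512)
print(len(tokens["input_ids"]))
```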
### Dataset Not Found
```bash
# Download manually first
python -c "from datasets import load_dataset; load_dataset('banksy235/XCoder-80K')"
# Or use Hugging Face CLI
hf download banksy235/XCoder-80K --repo-type dataset
```
## Dataset Structure
The XCoder-80K dataset contains code examples with metadata. The script automatically handles:
- Multi-language code (Python, JavaScript, Java, C++, etc.)
- Code with comments and docstrings
- Various programming tasks (algorithms, utilities, etc.)
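Conceptually, each record is flattened into a single training string before tokenization, along the lines of the sketch below (the `instruction` and `output` field names are assumptions for illustration; consult the dataset card for the real schema):
```python
def format_example(record: dict) -> str:
    """Join an instruction/response pair into one training string."""
    instruction = record.get("instruction", "")
    output = record.get("output", "")
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
```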
## Next Steps
1. **Run fine-tuning:** `python finetune_models.py --model llama3.2`
2. **Monitor training:** `tensorboard --logdir ./finetuned_models/llama3_2`
3. **Export to Ollama:** Create custom Modelfile and `ollama create`
4. **Test in CodeArena:** Update dashboard to use fine-tuned model
5. **Measure improvements:** Run `python plot_rewards.py` to see RL performance gains
## References
- [XCoder-80K Dataset](https://huggingface.co/datasets/banksy235/XCoder-80K)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl)
- [Ollama Documentation](https://ollama.ai)
- [PEFT (Parameter-Efficient Fine-Tuning)](https://github.com/huggingface/peft)