# 🚀 CodeLlama Inference Guide
**Last Updated:** November 25, 2025
---
## 📋 Overview
This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.
---
## 🎯 Quick Start
### Basic Inference (Single Prompt)
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```
Or use the test script:
```bash
bash test_inference.sh
```
### Interactive Mode
```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```
Type prompts interactively; enter `quit` or `exit` to stop.
---
## ⚙️ Command-Line Arguments
### Core Arguments (local mode)
| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to fine-tuned model | `training-outputs/codellama-fifo-v1` |
### Optional Arguments
| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (Interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |
---
## 📝 Examples
### Example 1: Basic Inference
```bash
python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```
### Example 2: Custom Parameters
```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5
```
### Example 3: Merged Weights (Faster Inference)
```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"
```
**Note:** `--merge-weights` merges LoRA adapters into the base model. This takes longer to load but runs inference faster.
### Example 4: Custom Base Model Path
```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"
```
---
## 🎛️ Generation Parameters
### Temperature
- **0.1-0.3**: Very deterministic, focused outputs (recommended for code generation)
- **0.5-0.7**: Balanced creativity and determinism
- **0.8-1.0**: More creative, varied outputs
**Default:** `0.3` (optimized for code generation)
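Temperature rescales the model's logits before sampling: dividing by a value below 1 sharpens the probability distribution toward the top token. A minimal pure-Python sketch (illustrative only, not the inference script's code) makes the effect concrete:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                          # toy next-token logits
cold = softmax_with_temperature(logits, 0.2)      # near-deterministic
warm = softmax_with_temperature(logits, 1.0)      # more spread out
```

With `temperature=0.2` the top token takes almost all the probability mass, which is why low values are recommended for code generation.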
### Max New Tokens
- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks
**Default:** `800` tokens
---
## 🔧 Model Loading
### Automatic Base Model Detection
The script automatically detects the base model in this order:
1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: Reads `training_config.json` from model directory
4. HuggingFace: Falls back to `codellama/CodeLlama-7b-Instruct-hf`
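The fallback order above can be sketched in plain Python. This is an illustrative reconstruction, not the script's actual code, and the `base_model` key name in `training_config.json` is an assumption:

```python
import json
from pathlib import Path

def resolve_base_model(model_path, base_model_path=None):
    """Pick a base model following the documented fallback order."""
    # 1. An explicit --base-model-path always wins
    if base_model_path:
        return base_model_path
    # 2. Local default checkout of the base model
    local_default = Path("models/base-models/CodeLlama-7B-Instruct")
    if local_default.is_dir():
        return str(local_default)
    # 3. Base model recorded in the adapter's training config
    config = Path(model_path) / "training_config.json"
    if config.is_file():
        recorded = json.loads(config.read_text()).get("base_model")
        if recorded:
            return recorded
    # 4. Fall back to the HuggingFace Hub identifier
    return "codellama/CodeLlama-7b-Instruct-hf"
```

For example, `resolve_base_model("training-outputs/codellama-fifo-v1")` returns the Hub identifier only when nothing local is found.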
### LoRA Adapter vs Merged Model
- **LoRA Adapter (default)**: Faster loading, uses adapter weights
- **Merged Model (`--merge-weights`)**: Slower loading, but faster inference
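Why merging speeds up inference can be seen with toy matrices (pure Python, illustrative numbers only, no PEFT code): the adapter path computes `xW + (xB)A` with two extra multiplies per layer, while merging folds `BA` into `W` once up front so each forward pass needs only `xW'`.

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matadd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

# Toy base weight W (2x2) and rank-1 LoRA factors B (2x1), A (1x2)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[2.0, 4.0]]

x = [[1.0, 1.0]]  # one input row vector

# Adapter path: y = xW + (xB)A  -- extra work on every forward pass
y_adapter = matadd(matmul(x, W), matmul(matmul(x, B), A))

# Merged path: fold BA into W once, then y = xW'
W_merged = matadd(W, matmul(B, A))
y_merged = matmul(x, W_merged)
```

Both paths produce identical outputs; the merged model just pays the `BA` cost once at load time instead of on every token.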
---
## 📊 Output Format
The inference script automatically:
- Extracts Verilog code from fenced `` ```verilog `` markdown blocks
- Removes conversation wrappers
- Returns clean RTL code
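A minimal sketch of this extraction step (illustrative only; the script's actual parser may differ) pulls the first fenced Verilog block out of the raw response and falls back to the full text when no fence is present:

```python
import re

def extract_verilog(response: str) -> str:
    """Return the first ```verilog fenced block from a model response,
    or the stripped raw text if no fence is found."""
    match = re.search(r"```verilog\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()

reply = "Sure! Here is the module:\n```verilog\nmodule top;\nendmodule\n```\nDone."
code = extract_verilog(reply)  # "module top;\nendmodule"
```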
### Example Output
**Input:**
```
Generate a synchronous FIFO with 8-bit data width, depth 4
```
**Output:**
```verilog
module sync_fifo_8b_4d (
    input        clk,
    input        rst,
    input        write_en,
    input        read_en,
    input  [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```
---
## 🚀 Performance Tips
### 1. Use Merged Weights for Repeated Inference
If running many inferences, merge weights once:
```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1
# Subsequent runs use cached merged model (if saved)
```
### 2. Adjust Max Tokens Based on Task
```bash
# Short responses
--max-new-tokens 400
# Long code blocks
--max-new-tokens 1200
```
### 3. Lower Temperature for Code Generation
```bash
# Very deterministic (recommended)
--temperature 0.2
# Slightly more varied
--temperature 0.5
```
---
## 📁 File Structure
```
codellama-migration/
├── scripts/
│   └── inference/
│       └── inference_codellama.py     # Updated inference script
├── training-outputs/
│   └── codellama-fifo-v1/             # Fine-tuned model
│       ├── adapter_model.safetensors  # LoRA weights
│       ├── adapter_config.json
│       └── training_config.json
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/     # Base model
└── test_inference.sh                  # Test script
```
---
## 🔍 Troubleshooting
### Model Not Found
```text
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```
**Solution:** Check that the model path is correct:
```bash
ls -lh training-outputs/codellama-fifo-v1/
```
### Base Model Not Found
If base model detection fails, specify explicitly:
```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```
### Out of Memory
1. Ensure quantization is enabled (default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough memory
### Slow Inference
1. Use `--merge-weights` for faster inference
2. Reduce `--max-new-tokens`
3. Lower `--temperature` toward greedy decoding (the speed effect is minor; output length dominates generation time)
---
## 📚 Related Documents
- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status
---
**Happy Inferencing! 🎉**