CodeLlama Inference Guide
Last Updated: November 25, 2025
Overview
This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.
Quick Start

Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration
python3 scripts/inference/inference_codellama.py \
  --mode local \
  --model-path training-outputs/codellama-fifo-v1 \
  --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```
Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
  --mode local \
  --model-path training-outputs/codellama-fifo-v1
```

Type prompts interactively; type `quit` or `exit` to stop.
Command-Line Arguments

Required Arguments (for local mode)

| Argument | Description | Default |
|---|---|---|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to the fine-tuned model | `training-outputs/codellama-fifo-v1` |
Optional Arguments

| Argument | Description | Default |
|---|---|---|
| `--base-model-path` | Path to the base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = more deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |
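The flag tables above can be mirrored in a small `argparse` parser. This is a hypothetical reconstruction of the script's CLI based only on the documented flags and defaults; the real `inference_codellama.py` may define it differently.

```python
import argparse

def build_parser():
    # Sketch of a CLI matching the documented flags; names and
    # defaults mirror the tables above (assumption, not the real code).
    p = argparse.ArgumentParser(description="CodeLlama inference")
    p.add_argument("--mode", choices=["local", "ollama"], default="local")
    p.add_argument("--model-path", default="training-outputs/codellama-fifo-v1")
    p.add_argument("--base-model-path", default=None)
    p.add_argument("--prompt", default=None)  # None -> interactive mode
    p.add_argument("--max-new-tokens", type=int, default=800)
    p.add_argument("--temperature", type=float, default=0.3)
    p.add_argument("--merge-weights", action="store_true")
    p.add_argument("--no-quantization", action="store_true")
    return p

args = build_parser().parse_args(["--prompt", "hello", "--temperature", "0.5"])
print(args.temperature)  # → 0.5
```

Note that `argparse` converts dashes to underscores, so `--max-new-tokens` is read as `args.max_new_tokens`.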
Examples

Example 1: Basic Inference

```bash
python3 scripts/inference/inference_codellama.py \
  --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```

Example 2: Custom Parameters

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --prompt "Your prompt" \
  --max-new-tokens 1200 \
  --temperature 0.5
```

Example 3: Merged Weights (Faster Inference)

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --merge-weights \
  --prompt "Your prompt"
```

Note: `--merge-weights` merges the LoRA adapters into the base model. Loading takes longer, but inference runs faster.

Example 4: Custom Base Model Path

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --base-model-path /path/to/custom/base/model \
  --prompt "Your prompt"
```
Generation Parameters
Temperature
- 0.1-0.3: Very deterministic, focused outputs (recommended for code generation)
- 0.5-0.7: Balanced creativity and determinism
- 0.8-1.0: More creative, varied outputs
Default: 0.3 (optimized for code generation)
Max New Tokens
- 512: Short responses
- 800: Default (balanced)
- 1200+: Longer code blocks
Default: 800 tokens
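The effect of temperature described above can be illustrated with a plain softmax, independent of the inference script: logits are divided by the temperature before normalization, so lower values concentrate probability on the top token.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens the
    # distribution toward the top token, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.3))  # strongly peaked on the first token
print(softmax_with_temperature(logits, 1.0))  # noticeably flatter
```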
Model Loading
Automatic Base Model Detection
The script resolves the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. HuggingFace Hub: falls back to `codellama/CodeLlama-7b-Instruct-hf`
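The fallback chain above can be sketched as a small resolver. This is an illustrative approximation of the documented order, not the script's actual code; the `exists` parameter and `base_model` config key are assumptions.

```python
import json
import os

def resolve_base_model(cli_path, model_dir, exists=os.path.isdir):
    # Hypothetical sketch of the documented fallback chain.
    if cli_path:                                      # 1. explicit flag
        return cli_path
    local_default = "models/base-models/CodeLlama-7B-Instruct"
    if exists(local_default):                         # 2. local default path
        return local_default
    cfg = os.path.join(model_dir, "training_config.json")
    if os.path.isfile(cfg):                           # 3. training config
        with open(cfg) as f:
            base = json.load(f).get("base_model")     # assumed key name
        if base:
            return base
    return "codellama/CodeLlama-7b-Instruct-hf"       # 4. HuggingFace Hub
```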
LoRA Adapter vs Merged Model
- LoRA Adapter (default): faster loading; generation runs through the adapter weights
- Merged Model (`--merge-weights`): slower loading, but faster inference
Output Format
The inference script automatically:
- Extracts Verilog code from ```` ```verilog ```` markdown fences
- Removes conversation wrappers
- Returns clean RTL code
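The fence-stripping step above can be sketched as a small helper. This is an illustrative approximation of the documented post-processing, not the script's actual code.

```python
import re

def extract_verilog(text):
    # Pull the first ```verilog fenced block out of a model response;
    # fall back to the raw text if no fence is found.
    m = re.search(r"```verilog\s*\n(.*?)```", text, re.DOTALL)
    return m.group(1).strip() if m else text.strip()

reply = "Sure!\n```verilog\nmodule top;\nendmodule\n```\nHope this helps."
print(extract_verilog(reply))  # → module top;\nendmodule
```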
Example Output
Input:

```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

Output:

```verilog
module sync_fifo_8b_4d (
    input clk,
    input rst,
    input write_en,
    input read_en,
    input [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```
Performance Tips
1. Use Merged Weights for Repeated Inference
If you run many inferences, merge the weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
  --merge-weights \
  --model-path training-outputs/codellama-fifo-v1

# Subsequent runs reuse the merged model (if saved)
```
2. Adjust Max Tokens Based on Task
```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```
File Structure

```
codellama-migration/
├── scripts/
│   └── inference/
│       └── inference_codellama.py       # Updated inference script
├── training-outputs/
│   └── codellama-fifo-v1/               # Fine-tuned model
│       ├── adapter_model.safetensors    # LoRA weights
│       ├── adapter_config.json
│       └── training_config.json
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/       # Base model
└── test_inference.sh                    # Test script
```
Troubleshooting
Model Not Found
```
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```

Solution: check that the model path is correct:

```bash
ls -lh training-outputs/codellama-fifo-v1/
```
Base Model Not Found
If base model detection fails, specify the path explicitly:

```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```
Out of Memory
- Ensure quantization is enabled (the default on GPU)
- Reduce `--max-new-tokens`
- Use `--no-quantization` only if you have enough memory

Slow Inference

- Use `--merge-weights` for faster inference
- Reduce `--max-new-tokens`
- Lower `--temperature` (less sampling overhead)
Related Documents

- `TRAINING_GUIDE.md` – Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` – Hyperparameter details
- `MIGRATION_PROGRESS.md` – Migration status
Happy Inferencing!