
# 🚀 CodeLlama Inference Guide

Last Updated: November 25, 2025


## 📋 Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.


## 🎯 Quick Start

### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```

### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```

Type your prompts interactively; type `quit` or `exit` to stop.
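The exit handling can be sketched as a small predicate (a sketch only; the actual script's loop may differ):

```python
def should_exit(command: str) -> bool:
    """True when the user typed an exit keyword (case-insensitive)."""
    return command.strip().lower() in {"quit", "exit"}

# The interactive loop is then roughly:
#   while True:
#       prompt = input(">>> ")
#       if should_exit(prompt):
#           break
#       print(generate(prompt))  # generate() stands in for the model call
```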


βš™οΈ Command-Line Arguments

Required Arguments (for local mode)

Argument Description Default
--mode Inference mode: local or ollama local
--model-path Path to fine-tuned model training-outputs/codellama-fifo-v1

Optional Arguments

Argument Description Default
--base-model-path Path to base CodeLlama model Auto-detected from training config
--prompt Single prompt to process (Interactive mode if not provided)
--max-new-tokens Maximum tokens to generate 800
--temperature Generation temperature (lower = deterministic) 0.3
--merge-weights Merge LoRA weights (slower load, faster inference) False
--no-quantization Disable 4-bit quantization Auto (quantized on GPU)
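For reference, the flags above could be declared with `argparse` roughly like this (an illustrative sketch using the names and defaults from the tables, not the script's actual definitions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI surface described above; defaults match the tables.
    p = argparse.ArgumentParser(description="CodeLlama inference (sketch)")
    p.add_argument("--mode", choices=["local", "ollama"], default="local")
    p.add_argument("--model-path", default="training-outputs/codellama-fifo-v1")
    p.add_argument("--base-model-path", default=None)   # None -> auto-detect
    p.add_argument("--prompt", default=None)            # None -> interactive mode
    p.add_argument("--max-new-tokens", type=int, default=800)
    p.add_argument("--temperature", type=float, default=0.3)
    p.add_argument("--merge-weights", action="store_true")
    p.add_argument("--no-quantization", action="store_true")
    return p

args = build_parser().parse_args(["--temperature", "0.5"])
```

Note that argparse exposes hyphenated flags as underscore attributes, e.g. `args.max_new_tokens`.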

πŸ“ Examples

Example 1: Basic Inference

python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"

Example 2: Custom Parameters

python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5

Example 3: Merged Weights (Faster Inference)

python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"

Note: --merge-weights merges LoRA adapters into the base model. This takes longer to load but runs inference faster.

Example 4: Custom Base Model Path

python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"

πŸŽ›οΈ Generation Parameters

Temperature

  • 0.1-0.3: Very deterministic, focused outputs (recommended for code generation)
  • 0.5-0.7: Balanced creativity and determinism
  • 0.8-1.0: More creative, varied outputs

Default: 0.3 (optimized for code generation)
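The effect is easy to demonstrate: logits are divided by the temperature before the softmax, so a low temperature concentrates probability on the top token. A self-contained illustration (not code from the script):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                       # toy token scores
cold = softmax_with_temperature(logits, 0.3)   # sharp, near-greedy
hot = softmax_with_temperature(logits, 1.0)    # flatter, more varied
```

At `temperature=0.3` the top token takes almost all of the probability mass, while at `1.0` the distribution stays noticeably more spread out.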

### Max New Tokens

- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks

Default: `800` tokens


## 🔧 Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. HuggingFace Hub: falls back to `codellama/CodeLlama-7b-Instruct-hf`
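That fallback chain might look like the following sketch (the `exists` parameter is injected purely for illustration, and the `base_model` key inside `training_config.json` is an assumption):

```python
import json
from pathlib import Path

LOCAL_DEFAULT = Path("models/base-models/CodeLlama-7B-Instruct")
HF_FALLBACK = "codellama/CodeLlama-7b-Instruct-hf"

def resolve_base_model(cli_arg, model_dir, exists=lambda p: Path(p).is_dir()):
    # 1. Explicit --base-model-path wins
    if cli_arg:
        return cli_arg
    # 2. Local default path, if present on disk
    if exists(LOCAL_DEFAULT):
        return str(LOCAL_DEFAULT)
    # 3. training_config.json in the fine-tuned model directory
    cfg = Path(model_dir) / "training_config.json"
    if cfg.is_file():
        return json.loads(cfg.read_text()).get("base_model", HF_FALLBACK)
    # 4. HuggingFace Hub fallback
    return HF_FALLBACK
```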

### LoRA Adapter vs. Merged Model

- **LoRA adapter** (default): faster loading; adapter weights are applied on top of the base model at runtime
- **Merged model** (`--merge-weights`): slower loading, but faster inference
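Why merging helps: LoRA's effective weight is `W + (alpha/r) * B @ A`, so the adapter path pays for two extra matrix multiplies on every forward pass, while merging folds that delta into `W` once at load time. A tiny pure-Python check with toy numbers (not real model weights):

```python
# Toy LoRA merge: effective weight W_eff = W + (alpha / r) * B @ A

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * v for v in row] for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
A = [[0.5, -0.5]]              # LoRA A (r=1, shape 1x2)
B = [[2.0], [4.0]]             # LoRA B (shape 2x1)
alpha, r = 2.0, 1
x = [[3.0, 1.0]]               # input row vector

# Adapter path: base matmul plus two extra LoRA matmuls per forward pass
adapter_out = add(matmul(x, W), scale(matmul(matmul(x, B), A), alpha / r))

# Merged path: fold the delta into W once, then a single matmul per pass
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged_out = matmul(x, W_merged)
```

Both paths produce identical outputs; merging simply moves the adapter cost from every forward pass to load time.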

## 📊 Output Format

The inference script automatically:

- Extracts Verilog code from markdown code blocks (`` ```verilog `` fences)
- Removes conversation wrappers
- Returns clean RTL code
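The fence extraction can be sketched with a regular expression (an illustrative sketch, not the script's actual implementation; the fence string is built programmatically here only to avoid nesting code fences in this document):

```python
import re

FENCE = "`" * 3  # the literal three-backtick fence

def extract_verilog(response: str) -> str:
    # Pull the body of the first verilog-fenced block; fall back to raw text.
    pattern = FENCE + r"verilog\s*\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

reply = ("Sure! Here is the FIFO:\n" + FENCE + "verilog\n"
         "module top;\nendmodule\n" + FENCE + "\nHope that helps.")
```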

### Example Output

**Input:**

```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**

```verilog
module sync_fifo_8b_4d (
  input clk,
  input rst,
  input write_en,
  input read_en,
  input [7:0] write_data,
  output [7:0] read_data
);
  // ... code ...
endmodule
```

## 🚀 Performance Tips

### 1. Use Merged Weights for Repeated Inference

If you will run many inferences, merge the weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use the cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

πŸ“ File Structure

codellama-migration/
β”œβ”€β”€ scripts/
β”‚   └── inference/
β”‚       └── inference_codellama.py    # Updated inference script
β”œβ”€β”€ training-outputs/
β”‚   └── codellama-fifo-v1/            # Fine-tuned model
β”‚       β”œβ”€β”€ adapter_model.safetensors # LoRA weights
β”‚       β”œβ”€β”€ adapter_config.json
β”‚       └── training_config.json
β”œβ”€β”€ models/
β”‚   └── base-models/
β”‚       └── CodeLlama-7B-Instruct/    # Base model
└── test_inference.sh                 # Test script

πŸ” Troubleshooting

Model Not Found

Error: Model path training-outputs/codellama-fifo-v1 does not exist

Solution: Check that the model path is correct:

ls -lh training-outputs/codellama-fifo-v1/

Base Model Not Found

If base model detection fails, specify explicitly:

--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct

### Out of Memory

1. Ensure 4-bit quantization is enabled (the default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough GPU memory

### Slow Inference

1. Use `--merge-weights` for faster per-token inference
2. Reduce `--max-new-tokens` (fewer tokens to generate)
3. Lower `--temperature`; more deterministic outputs tend to be shorter and more focused

## 📚 Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

Happy Inferencing! 🎉