
# 🚀 CodeLlama Inference Guide

Last Updated: November 25, 2025


## 📋 Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.


## 🎯 Quick Start

### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```

### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```

Type your prompts interactively; type `quit` or `exit` to stop.
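The exit handling can be sketched as a small predicate (a sketch only; the actual script's loop may differ):

```python
def should_exit(command: str) -> bool:
    """True when the user typed an exit keyword (case-insensitive)."""
    return command.strip().lower() in {"quit", "exit"}

# The interactive loop is then roughly:
#   while True:
#       prompt = input(">>> ")
#       if should_exit(prompt):
#           break
#       print(generate(prompt))  # generate() stands in for the model call
```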


βš™οΈ Command-Line Arguments

Required Arguments (for local mode)

Argument Description Default
--mode Inference mode: local or ollama local
--model-path Path to fine-tuned model training-outputs/codellama-fifo-v1

Optional Arguments

Argument Description Default
--base-model-path Path to base CodeLlama model Auto-detected from training config
--prompt Single prompt to process (Interactive mode if not provided)
--max-new-tokens Maximum tokens to generate 800
--temperature Generation temperature (lower = deterministic) 0.3
--merge-weights Merge LoRA weights (slower load, faster inference) False
--no-quantization Disable 4-bit quantization Auto (quantized on GPU)
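For reference, the flags above could be declared with `argparse` roughly like this (an illustrative sketch using the names and defaults from the tables, not the script's actual definitions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI surface described above; defaults match the tables.
    p = argparse.ArgumentParser(description="CodeLlama inference (sketch)")
    p.add_argument("--mode", choices=["local", "ollama"], default="local")
    p.add_argument("--model-path", default="training-outputs/codellama-fifo-v1")
    p.add_argument("--base-model-path", default=None)   # None -> auto-detect
    p.add_argument("--prompt", default=None)            # None -> interactive mode
    p.add_argument("--max-new-tokens", type=int, default=800)
    p.add_argument("--temperature", type=float, default=0.3)
    p.add_argument("--merge-weights", action="store_true")
    p.add_argument("--no-quantization", action="store_true")
    return p

args = build_parser().parse_args(["--temperature", "0.5"])
```

Note that argparse exposes hyphenated flags as underscore attributes, e.g. `args.max_new_tokens`.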

πŸ“ Examples

Example 1: Basic Inference

python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"

Example 2: Custom Parameters

python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5

Example 3: Merged Weights (Faster Inference)

python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"

Note: --merge-weights merges LoRA adapters into the base model. This takes longer to load but runs inference faster.

Example 4: Custom Base Model Path

python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"

πŸŽ›οΈ Generation Parameters

Temperature

  • 0.1-0.3: Very deterministic, focused outputs (recommended for code generation)
  • 0.5-0.7: Balanced creativity and determinism
  • 0.8-1.0: More creative, varied outputs

Default: 0.3 (optimized for code generation)
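The effect is easy to demonstrate: logits are divided by the temperature before the softmax, so a low temperature concentrates probability on the top token. A self-contained illustration (not code from the script):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                       # toy token scores
cold = softmax_with_temperature(logits, 0.3)   # sharp, near-greedy
hot = softmax_with_temperature(logits, 1.0)    # flatter, more varied
```

At `temperature=0.3` the top token takes almost all of the probability mass, while at `1.0` the distribution stays noticeably more spread out.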

### Max New Tokens

- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks

Default: `800` tokens


## 🔧 Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. HuggingFace Hub: falls back to `codellama/CodeLlama-7b-Instruct-hf`
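That fallback chain might look like the following sketch (the `exists` parameter is injected purely for illustration, and the `base_model` key inside `training_config.json` is an assumption):

```python
import json
from pathlib import Path

LOCAL_DEFAULT = Path("models/base-models/CodeLlama-7B-Instruct")
HF_FALLBACK = "codellama/CodeLlama-7b-Instruct-hf"

def resolve_base_model(cli_arg, model_dir, exists=lambda p: Path(p).is_dir()):
    # 1. Explicit --base-model-path wins
    if cli_arg:
        return cli_arg
    # 2. Local default path, if present on disk
    if exists(LOCAL_DEFAULT):
        return str(LOCAL_DEFAULT)
    # 3. training_config.json in the fine-tuned model directory
    cfg = Path(model_dir) / "training_config.json"
    if cfg.is_file():
        return json.loads(cfg.read_text()).get("base_model", HF_FALLBACK)
    # 4. HuggingFace Hub fallback
    return HF_FALLBACK
```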

### LoRA Adapter vs. Merged Model

- **LoRA adapter** (default): faster loading; adapter weights are applied on top of the base model at runtime
- **Merged model** (`--merge-weights`): slower loading, but faster inference
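Why merging helps: LoRA's effective weight is `W + (alpha/r) * B @ A`, so the adapter path pays for two extra matrix multiplies on every forward pass, while merging folds that delta into `W` once at load time. A tiny pure-Python check with toy numbers (not real model weights):

```python
# Toy LoRA merge: effective weight W_eff = W + (alpha / r) * B @ A

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * v for v in row] for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
A = [[0.5, -0.5]]              # LoRA A (r=1, shape 1x2)
B = [[2.0], [4.0]]             # LoRA B (shape 2x1)
alpha, r = 2.0, 1
x = [[3.0, 1.0]]               # input row vector

# Adapter path: base matmul plus two extra LoRA matmuls per forward pass
adapter_out = add(matmul(x, W), scale(matmul(matmul(x, B), A), alpha / r))

# Merged path: fold the delta into W once, then a single matmul per pass
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged_out = matmul(x, W_merged)
```

Both paths produce identical outputs; merging simply moves the adapter cost from every forward pass to load time.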

## 📊 Output Format

The inference script automatically:

- Extracts Verilog code from markdown code blocks (`` ```verilog `` fences)
- Removes conversation wrappers
- Returns clean RTL code
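The fence extraction can be sketched with a regular expression (an illustrative sketch, not the script's actual implementation; the fence string is built programmatically here only to avoid nesting code fences in this document):

```python
import re

FENCE = "`" * 3  # the literal three-backtick fence

def extract_verilog(response: str) -> str:
    # Pull the body of the first verilog-fenced block; fall back to raw text.
    pattern = FENCE + r"verilog\s*\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

reply = ("Sure! Here is the FIFO:\n" + FENCE + "verilog\n"
         "module top;\nendmodule\n" + FENCE + "\nHope that helps.")
```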

### Example Output

**Input:**

```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**

```verilog
module sync_fifo_8b_4d (
  input clk,
  input rst,
  input write_en,
  input read_en,
  input [7:0] write_data,
  output [7:0] read_data
);
  // ... code ...
endmodule
```

## 🚀 Performance Tips

### 1. Use Merged Weights for Repeated Inference

If you will run many inferences, merge the weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use the cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

πŸ“ File Structure

codellama-migration/
β”œβ”€β”€ scripts/
β”‚   └── inference/
β”‚       └── inference_codellama.py    # Updated inference script
β”œβ”€β”€ training-outputs/
β”‚   └── codellama-fifo-v1/            # Fine-tuned model
β”‚       β”œβ”€β”€ adapter_model.safetensors # LoRA weights
β”‚       β”œβ”€β”€ adapter_config.json
β”‚       └── training_config.json
β”œβ”€β”€ models/
β”‚   └── base-models/
β”‚       └── CodeLlama-7B-Instruct/    # Base model
└── test_inference.sh                 # Test script

πŸ” Troubleshooting

Model Not Found

Error: Model path training-outputs/codellama-fifo-v1 does not exist

Solution: Check that the model path is correct:

ls -lh training-outputs/codellama-fifo-v1/

Base Model Not Found

If base model detection fails, specify explicitly:

--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct

### Out of Memory

1. Ensure 4-bit quantization is enabled (the default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough GPU memory

### Slow Inference

1. Use `--merge-weights` for faster per-token inference
2. Reduce `--max-new-tokens` (fewer tokens to generate)
3. Lower `--temperature`; more deterministic outputs tend to be shorter and more focused

## 📚 Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

Happy Inferencing! 🎉