# 🚀 CodeLlama Inference Guide

**Last Updated:** November 25, 2025

---

## 📋 Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.

---

## 🎯 Quick Start

### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration
python3 scripts/inference/inference_codellama.py \
  --mode local \
  --model-path training-outputs/codellama-fifo-v1 \
  --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```

### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
  --mode local \
  --model-path training-outputs/codellama-fifo-v1
```

Type your prompts interactively; type `quit` or `exit` to stop.

---

## ⚙️ Command-Line Arguments

### Core Arguments (for local mode)

| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to the fine-tuned model | `training-outputs/codellama-fifo-v1` |

### Optional Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to the base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = more deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |

---

## 📝 Examples

### Example 1: Basic Inference

```bash
python3 scripts/inference/inference_codellama.py \
  --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```

### Example 2: Custom Parameters

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --prompt "Your prompt" \
  --max-new-tokens 1200 \
  --temperature 0.5
```

### Example 3: Merged Weights (Faster Inference)
```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --merge-weights \
  --prompt "Your prompt"
```

**Note:** `--merge-weights` merges the LoRA adapters into the base model. This takes longer to load but runs inference faster.

### Example 4: Custom Base Model Path

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --base-model-path /path/to/custom/base/model \
  --prompt "Your prompt"
```

---

## 🎛️ Generation Parameters

### Temperature

- **0.1-0.3**: Very deterministic, focused outputs (recommended for code generation)
- **0.5-0.7**: Balanced creativity and determinism
- **0.8-1.0**: More creative, varied outputs

**Default:** `0.3` (optimized for code generation)

### Max New Tokens

- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks

**Default:** `800` tokens

---

## 🔧 Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. HuggingFace: falls back to `codellama/CodeLlama-7b-Instruct-hf`

### LoRA Adapter vs. Merged Model

- **LoRA adapter (default)**: Faster loading, uses adapter weights
- **Merged model (`--merge-weights`)**: Slower loading, but faster inference

---

## 📊 Output Format

The inference script automatically:

- Extracts Verilog code from markdown code blocks (`` ```verilog `` fences)
- Removes conversation wrappers
- Returns clean RTL code

### Example Output

**Input:**

```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**

```verilog
module sync_fifo_8b_4d (
    input clk,
    input rst,
    input write_en,
    input read_en,
    input [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```

---
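The base-model fallback order described in the Model Loading section can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation; the function name `find_base_model` and the `base_model` config key are assumptions for the sketch.

```python
import json
from pathlib import Path

# Assumed defaults, mirroring the detection order documented above
LOCAL_DEFAULT = Path("models/base-models/CodeLlama-7B-Instruct")
HF_FALLBACK = "codellama/CodeLlama-7b-Instruct-hf"


def find_base_model(model_dir: str, base_model_path: str | None = None) -> str:
    """Resolve the base model using the documented four-step fallback."""
    # 1. An explicit --base-model-path argument always wins
    if base_model_path:
        return base_model_path
    # 2. Local default path, if it exists on disk
    if LOCAL_DEFAULT.is_dir():
        return str(LOCAL_DEFAULT)
    # 3. Training config saved alongside the fine-tuned adapter
    config_file = Path(model_dir) / "training_config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text())
        if config.get("base_model"):
            return config["base_model"]
    # 4. Fall back to the HuggingFace Hub identifier
    return HF_FALLBACK
```

Each step only fires when every earlier step fails, so an explicit `--base-model-path` short-circuits all filesystem and config lookups.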
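The code-extraction step from the Output Format section can be sketched like this. It is a simplified stand-in for whatever the script actually does; the function name `extract_verilog` is hypothetical.

```python
import re


def extract_verilog(response: str) -> str:
    """Pull the first ```verilog fenced block out of a model response.

    Falls back to the raw (stripped) text when no fenced block is found.
    """
    match = re.search(r"```verilog\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()


response = "Here is the module:\n```verilog\nmodule foo;\nendmodule\n```\nDone."
print(extract_verilog(response))  # -> module foo;\nendmodule
```

The non-greedy `(.*?)` with `re.DOTALL` stops at the first closing fence, so conversational text before and after the code block is discarded.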
## 🚀 Performance Tips

### 1. Use Merged Weights for Repeated Inference

If you are running many inferences, merge the weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
  --merge-weights \
  --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use the cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

---

## 📁 File Structure

```
codellama-migration/
├── scripts/
│   └── inference/
│       └── inference_codellama.py     # Updated inference script
├── training-outputs/
│   └── codellama-fifo-v1/             # Fine-tuned model
│       ├── adapter_model.safetensors  # LoRA weights
│       ├── adapter_config.json
│       └── training_config.json
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/     # Base model
└── test_inference.sh                  # Test script
```

---

## 🔍 Troubleshooting

### Model Not Found

```
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```

**Solution:** Check that the model path is correct:

```bash
ls -lh training-outputs/codellama-fifo-v1/
```

### Base Model Not Found

If base model detection fails, specify the path explicitly:

```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```

### Out of Memory

1. Ensure quantization is enabled (the default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough memory

### Slow Inference

1. Use `--merge-weights` for faster inference
2. Reduce `--max-new-tokens`

---

## 📚 Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

---

**Happy Inferencing! 🎉**