# CodeLlama Inference Guide

**Last Updated:** November 25, 2025

---

## Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.

---

## Quick Start
### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```
### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```

Enter prompts interactively; type `quit` or `exit` to stop.

---
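Under the hood, the interactive loop behaves roughly like the following sketch. The function names here are illustrative, not the script's actual internals:

```python
def interactive_loop(generate, read_line=input):
    """Read prompts until the user types 'quit' or 'exit' (case-insensitive)."""
    replies = []
    while True:
        prompt = read_line("Prompt> ").strip()
        if prompt.lower() in ("quit", "exit"):
            break
        if prompt:  # ignore empty lines
            replies.append(generate(prompt))
    return replies
```

Blank input is skipped rather than sent to the model, so an accidental Enter does not trigger a generation.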
## Command-Line Arguments

### Required Arguments (for local mode)

| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to the fine-tuned model | `training-outputs/codellama-fifo-v1` |

### Optional Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to the base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (Interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = more deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |

---
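An `argparse` setup matching the tables above might look like this. This is a sketch of how the flags and defaults could be declared, not the script's exact parser:

```python
import argparse

def build_parser():
    """Declare the CLI flags listed in the argument tables."""
    p = argparse.ArgumentParser(description="CodeLlama inference")
    p.add_argument("--mode", choices=["local", "ollama"], default="local")
    p.add_argument("--model-path", default="training-outputs/codellama-fifo-v1")
    p.add_argument("--base-model-path", default=None)  # auto-detected when omitted
    p.add_argument("--prompt", default=None)           # interactive mode when omitted
    p.add_argument("--max-new-tokens", type=int, default=800)
    p.add_argument("--temperature", type=float, default=0.3)
    p.add_argument("--merge-weights", action="store_true")
    p.add_argument("--no-quantization", action="store_true")
    return p
```

Boolean flags use `store_true`, so `--merge-weights` and `--no-quantization` take no value; all other flags accept one.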
## Examples

### Example 1: Basic Inference

```bash
python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```

### Example 2: Custom Parameters

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5
```
### Example 3: Merged Weights (Faster Inference)

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"
```

**Note:** `--merge-weights` merges the LoRA adapters into the base model. This takes longer to load but runs inference faster.

### Example 4: Custom Base Model Path

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"
```

---
| --- | |
| ## ποΈ Generation Parameters | |
| ### Temperature | |
| - **0.1-0.3**: Very deterministic, focused outputs (recommended for code generation) | |
| - **0.5-0.7**: Balanced creativity and determinism | |
| - **0.8-1.0**: More creative, varied outputs | |
| **Default:** `0.3` (optimized for code generation) | |
| ### Max New Tokens | |
| - **512**: Short responses | |
| - **800**: Default (balanced) | |
| - **1200+**: Longer code blocks | |
| **Default:** `800` tokens | |
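Temperature works by scaling the model's logits before sampling: lower values concentrate probability on the top token, higher values flatten the distribution. A minimal illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.3)  # peaked: top token dominates
warm = softmax_with_temperature(logits, 1.0)  # flatter: more varied sampling
```

With the same logits, the `0.3` distribution puts almost all mass on the top token, which is why low temperatures suit code generation.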
---

## Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. HuggingFace: falls back to `codellama/CodeLlama-7b-Instruct-hf`
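The fallback chain above can be sketched as a small resolver function. The `"base_model"` config key and the function name are assumptions for illustration; the script's actual key may differ:

```python
import json
from pathlib import Path

def resolve_base_model(cli_arg=None,
                       local_default="models/base-models/CodeLlama-7B-Instruct",
                       model_dir="training-outputs/codellama-fifo-v1",
                       hf_fallback="codellama/CodeLlama-7b-Instruct-hf"):
    """Mirror the detection order: CLI arg, local path, training config, HF hub."""
    if cli_arg:                      # 1. explicit --base-model-path wins
        return cli_arg
    if Path(local_default).is_dir():  # 2. local default checkout
        return local_default
    config = Path(model_dir) / "training_config.json"
    if config.is_file():             # 3. whatever the model was trained from
        base = json.loads(config.read_text()).get("base_model")
        if base:
            return base
    return hf_fallback               # 4. download from the Hub
```

Each step only fires when the previous one yields nothing, so an explicit flag always overrides auto-detection.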
### LoRA Adapter vs Merged Model

- **LoRA adapter (default)**: Faster loading, uses adapter weights
- **Merged model (`--merge-weights`)**: Slower loading, but faster inference

---
## Output Format

The inference script automatically:

- Extracts Verilog code from markdown code blocks (```` ```verilog ```` fences)
- Removes conversation wrappers
- Returns clean RTL code
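The extraction step can be approximated with a regular expression. This is a sketch of the idea; the script's actual cleanup logic may differ:

```python
import re

def extract_verilog(text):
    """Pull the first ```verilog fenced block out of a model response."""
    match = re.search(r"```verilog\s*\n(.*?)```", text, re.DOTALL)
    # Fall back to the raw (stripped) text when no fence is present
    return match.group(1).strip() if match else text.strip()
```

Responses without a fenced block pass through unchanged, so plain-code replies are not lost.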
### Example Output

**Input:**

```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**

```verilog
module sync_fifo_8b_4d (
    input clk,
    input rst,
    input write_en,
    input read_en,
    input [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```
---

## Performance Tips

### 1. Use Merged Weights for Repeated Inference

If you run many inferences, merge the weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use the cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

---
| --- | |
| ## π File Structure | |
| ``` | |
| codellama-migration/ | |
| βββ scripts/ | |
| β βββ inference/ | |
| β βββ inference_codellama.py # Updated inference script | |
| βββ training-outputs/ | |
| β βββ codellama-fifo-v1/ # Fine-tuned model | |
| β βββ adapter_model.safetensors # LoRA weights | |
| β βββ adapter_config.json | |
| β βββ training_config.json | |
| βββ models/ | |
| β βββ base-models/ | |
| β βββ CodeLlama-7B-Instruct/ # Base model | |
| βββ test_inference.sh # Test script | |
| ``` | |
---

## Troubleshooting

### Model Not Found

```
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```

**Solution:** Check that the model path is correct:

```bash
ls -lh training-outputs/codellama-fifo-v1/
```

### Base Model Not Found

If base model detection fails, specify the path explicitly:

```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```

### Out of Memory

1. Ensure quantization is enabled (the default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough memory

### Slow Inference

1. Use `--merge-weights` for faster inference
2. Reduce `--max-new-tokens`
3. Note that lowering `--temperature` has little effect on speed; decoding time is dominated by the model's forward passes, not by sampling

---
## Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

---

**Happy Inferencing!**