# CodeLlama Inference Guide

**Last Updated:** November 25, 2025

---

## Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.

---

## Quick Start

### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```

### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```

Type your prompts interactively; enter `quit` or `exit` to stop.
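
The interactive loop behaves roughly like the minimal sketch below, where `generate_response` is a hypothetical stand-in for the script's actual generation function:

```python
# Minimal sketch of the interactive loop; generate_response is a hypothetical
# stand-in for whatever the script calls to run the model.
def interactive_loop(generate_response):
    while True:
        prompt = input(">>> ").strip()
        if prompt.lower() in ("quit", "exit"):
            break
        if prompt:
            print(generate_response(prompt))
```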

---

## Command-Line Arguments

### Core Arguments (local mode)

Both arguments have sensible defaults, so neither is strictly required:

| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to the fine-tuned model | `training-outputs/codellama-fifo-v1` |

### Optional Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to the base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (Interactive mode if not provided) |
| `--max-new-tokens` | Maximum number of tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = more deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |
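
If you want to reproduce this CLI surface in your own tooling, the documented flags translate to an `argparse` setup roughly like the sketch below. It mirrors the tables above, not the script's actual source, so the real definitions in `inference_codellama.py` may differ.

```python
import argparse

# Sketch of the documented CLI; the real script may define these differently.
parser = argparse.ArgumentParser(description="CodeLlama inference")
parser.add_argument("--mode", choices=["local", "ollama"], default="local")
parser.add_argument("--model-path", default="training-outputs/codellama-fifo-v1")
parser.add_argument("--base-model-path", default=None)  # auto-detected if omitted
parser.add_argument("--prompt", default=None)           # interactive mode if omitted
parser.add_argument("--max-new-tokens", type=int, default=800)
parser.add_argument("--temperature", type=float, default=0.3)
parser.add_argument("--merge-weights", action="store_true")
parser.add_argument("--no-quantization", action="store_true")
args = parser.parse_args()
```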

---

## Examples

### Example 1: Basic Inference

```bash
python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```

### Example 2: Custom Parameters

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5
```

### Example 3: Merged Weights (Faster Inference)

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"
```

**Note:** `--merge-weights` merges the LoRA adapters into the base model. This takes longer to load but runs inference faster.

### Example 4: Custom Base Model Path

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"
```

---

## Generation Parameters

### Temperature

- **0.1–0.3**: Very deterministic, focused outputs (recommended for code generation)
- **0.5–0.7**: Balanced creativity and determinism
- **0.8–1.0**: More creative, varied outputs

**Default:** `0.3` (optimized for code generation)

### Max New Tokens

- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks

**Default:** `800` tokens
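
Both knobs map onto a standard Hugging Face `generate()` call. The sketch below shows the typical wiring under that assumption; it is not the script's exact code, and the model ID shown is just the public base checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; substitute your merged or fine-tuned model as needed.
model_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Generate a synchronous FIFO...", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=800,   # --max-new-tokens
    do_sample=True,       # sampling must be enabled for temperature to apply
    temperature=0.3,      # --temperature
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```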

---

## Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order (sketched below):

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. Hugging Face: falls back to `codellama/CodeLlama-7b-Instruct-hf`
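
In code, that fallback chain amounts to something like the following sketch. The `base_model` key read from `training_config.json` is an assumption; check your config for the actual field name.

```python
import json
from pathlib import Path

def resolve_base_model(model_path, base_model_path=None):
    """Sketch of the documented detection order; the config key name is assumed."""
    if base_model_path:                            # 1. explicit --base-model-path
        return base_model_path
    local_default = Path("models/base-models/CodeLlama-7B-Instruct")
    if local_default.exists():                     # 2. local default path
        return str(local_default)
    cfg = Path(model_path) / "training_config.json"
    if cfg.exists():                               # 3. training config
        base = json.loads(cfg.read_text()).get("base_model")  # key name assumed
        if base:
            return base
    return "codellama/CodeLlama-7b-Instruct-hf"    # 4. Hugging Face fallback
```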

### LoRA Adapter vs. Merged Model

- **LoRA adapter (default)**: Faster loading; inference runs through the adapter weights
- **Merged model (`--merge-weights`)**: Slower loading, but faster inference
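
With the standard `peft` API, the two paths look roughly like this (a sketch; paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf", device_map="auto"
)

# Default path: attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, "training-outputs/codellama-fifo-v1")

# --merge-weights path: fold the adapter into the base weights once,
# so generation no longer routes through the adapter (faster per token).
model = model.merge_and_unload()
```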

---

## Output Format

The inference script automatically:

- Extracts Verilog code from markdown code blocks (` ```verilog ` fences)
- Removes conversation wrappers
- Returns clean RTL code
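
The extraction step can be approximated with a small regex over the raw generation. The sketch below is an assumption about the approach; the script's actual cleanup may be more involved.

```python
import re

def extract_verilog(text: str) -> str:
    """Pull the first ```verilog fenced block out of a model response."""
    match = re.search(r"```verilog\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()
```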

### Example Output

**Input:**
```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**
```verilog
module sync_fifo_8b_4d (
    input clk,
    input rst,
    input write_en,
    input read_en,
    input [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```

---

## Performance Tips

### 1. Use Merged Weights for Repeated Inference

If running many inferences, merge weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use the cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

---

## File Structure

```
codellama-migration/
├── scripts/
│   └── inference/
│       └── inference_codellama.py    # Updated inference script
├── training-outputs/
│   └── codellama-fifo-v1/            # Fine-tuned model
│       ├── adapter_model.safetensors # LoRA weights
│       ├── adapter_config.json
│       └── training_config.json
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/    # Base model
└── test_inference.sh                 # Test script
```

---

## Troubleshooting

### Model Not Found

```
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```

**Solution:** Check that the model path is correct:

```bash
ls -lh training-outputs/codellama-fifo-v1/
```

### Base Model Not Found

If base model detection fails, specify the path explicitly:

```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```

### Out of Memory

1. Ensure quantization is enabled (the default on GPU; see the sketch below)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough memory
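
For reference, 4-bit loading with `transformers` and `bitsandbytes` is typically configured like the sketch below; the script's exact quantization settings may differ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Typical 4-bit (NF4) setup; the script's exact settings may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```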

### Slow Inference

1. Use `--merge-weights` for faster per-token generation
2. Reduce `--max-new-tokens`
3. Lowering `--temperature` improves determinism but has little effect on speed

---

## Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

---

**Happy Inferencing!**