# 🚀 CodeLlama Inference Guide
**Last Updated:** November 25, 2025
---
## 📋 Overview
This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.
---
## 🎯 Quick Start
### Basic Inference (Single Prompt)
```bash
cd /workspace/ftt/codellama-migration
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```
Or use the test script:
```bash
bash test_inference.sh
```
### Interactive Mode
```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```
Type prompts interactively; enter `quit` or `exit` to stop.
---
## ⚙️ Command-Line Arguments
### Core Arguments (local mode)
| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to fine-tuned model | `training-outputs/codellama-fifo-v1` |
### Optional Arguments
| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (Interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |
---
## 📝 Examples
### Example 1: Basic Inference
```bash
python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```
### Example 2: Custom Parameters
```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5
```
### Example 3: Merged Weights (Faster Inference)
```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"
```
**Note:** `--merge-weights` merges LoRA adapters into the base model. This takes longer to load but runs inference faster.
### Example 4: Custom Base Model Path
```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"
```
---
## 🎛️ Generation Parameters
### Temperature
- **0.1-0.3**: Very deterministic, focused outputs (recommended for code generation)
- **0.5-0.7**: Balanced creativity and determinism
- **0.8-1.0**: More creative, varied outputs
**Default:** `0.3` (optimized for code generation)
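Temperature rescales the model's logits before sampling: dividing by a value below 1 sharpens the probability distribution toward the top token. A minimal pure-Python sketch (illustrative only, not the inference script's code) makes the effect concrete:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                          # toy next-token logits
cold = softmax_with_temperature(logits, 0.2)      # near-deterministic
warm = softmax_with_temperature(logits, 1.0)      # more spread out
```

With `temperature=0.2` the top token takes almost all the probability mass, which is why low values are recommended for code generation.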
### Max New Tokens
- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks
**Default:** `800` tokens
---
## 🔧 Model Loading
### Automatic Base Model Detection
The script automatically detects the base model in this order:
1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: Reads `training_config.json` from model directory
4. HuggingFace: Falls back to `codellama/CodeLlama-7b-Instruct-hf`
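The fallback order above can be sketched in plain Python. This is an illustrative reconstruction, not the script's actual code, and the `base_model` key name in `training_config.json` is an assumption:

```python
import json
from pathlib import Path

def resolve_base_model(model_path, base_model_path=None):
    """Pick a base model following the documented fallback order."""
    # 1. An explicit --base-model-path always wins
    if base_model_path:
        return base_model_path
    # 2. Local default checkout of the base model
    local_default = Path("models/base-models/CodeLlama-7B-Instruct")
    if local_default.is_dir():
        return str(local_default)
    # 3. Base model recorded in the adapter's training config
    config = Path(model_path) / "training_config.json"
    if config.is_file():
        recorded = json.loads(config.read_text()).get("base_model")
        if recorded:
            return recorded
    # 4. Fall back to the HuggingFace Hub identifier
    return "codellama/CodeLlama-7b-Instruct-hf"
```

For example, `resolve_base_model("training-outputs/codellama-fifo-v1")` returns the Hub identifier only when nothing local is found.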
### LoRA Adapter vs Merged Model
- **LoRA Adapter (default)**: Faster loading, uses adapter weights
- **Merged Model (`--merge-weights`)**: Slower loading, but faster inference
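Why merging speeds up inference can be seen with toy matrices (pure Python, illustrative numbers only, no PEFT code): the adapter path computes `xW + (xB)A` with two extra multiplies per layer, while merging folds `BA` into `W` once up front so each forward pass needs only `xW'`.

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matadd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

# Toy base weight W (2x2) and rank-1 LoRA factors B (2x1), A (1x2)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[2.0, 4.0]]

x = [[1.0, 1.0]]  # one input row vector

# Adapter path: y = xW + (xB)A  -- extra work on every forward pass
y_adapter = matadd(matmul(x, W), matmul(matmul(x, B), A))

# Merged path: fold BA into W once, then y = xW'
W_merged = matadd(W, matmul(B, A))
y_merged = matmul(x, W_merged)
```

Both paths produce identical outputs; the merged model just pays the `BA` cost once at load time instead of on every token.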
---
## 📊 Output Format
The inference script automatically:
- Extracts Verilog code from fenced `` ```verilog `` markdown blocks
- Removes conversation wrappers
- Returns clean RTL code
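A minimal sketch of this extraction step (illustrative only; the script's actual parser may differ) pulls the first fenced Verilog block out of the raw response and falls back to the full text when no fence is present:

```python
import re

def extract_verilog(response: str) -> str:
    """Return the first ```verilog fenced block from a model response,
    or the stripped raw text if no fence is found."""
    match = re.search(r"```verilog\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()

reply = "Sure! Here is the module:\n```verilog\nmodule top;\nendmodule\n```\nDone."
code = extract_verilog(reply)  # "module top;\nendmodule"
```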
### Example Output
**Input:**
```
Generate a synchronous FIFO with 8-bit data width, depth 4
```
**Output:**
```verilog
module sync_fifo_8b_4d (
    input        clk,
    input        rst,
    input        write_en,
    input        read_en,
    input  [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```
---
## 🚀 Performance Tips
### 1. Use Merged Weights for Repeated Inference
If running many inferences, merge weights once:
```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1
# Subsequent runs use cached merged model (if saved)
```
### 2. Adjust Max Tokens Based on Task
```bash
# Short responses
--max-new-tokens 400
# Long code blocks
--max-new-tokens 1200
```
### 3. Lower Temperature for Code Generation
```bash
# Very deterministic (recommended)
--temperature 0.2
# Slightly more varied
--temperature 0.5
```
---
## 📁 File Structure
```
codellama-migration/
├── scripts/
│   └── inference/
│       └── inference_codellama.py     # Updated inference script
├── training-outputs/
│   └── codellama-fifo-v1/             # Fine-tuned model
│       ├── adapter_model.safetensors  # LoRA weights
│       ├── adapter_config.json
│       └── training_config.json
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/     # Base model
└── test_inference.sh                  # Test script
```
---
## 🔍 Troubleshooting
### Model Not Found
```text
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```
**Solution:** Check that the model path is correct:
```bash
ls -lh training-outputs/codellama-fifo-v1/
```
### Base Model Not Found
If base model detection fails, specify explicitly:
```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```
### Out of Memory
1. Ensure quantization is enabled (default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough memory
### Slow Inference
1. Use `--merge-weights` for faster inference
2. Reduce `--max-new-tokens`
3. Lower `--temperature` toward greedy decoding (the speed effect is minor; output length dominates generation time)
---
## 📚 Related Documents
- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status
---
**Happy Inferencing! 🎉**