# 🚀 CodeLlama Inference Guide

**Last Updated:** November 25, 2025

---

## 📋 Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.

---

## 🎯 Quick Start

### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration
python3 scripts/inference/inference_codellama.py \
  --mode local \
  --model-path training-outputs/codellama-fifo-v1 \
  --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```

### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
  --mode local \
  --model-path training-outputs/codellama-fifo-v1
```

Type your prompts interactively; type `quit` or `exit` to stop.

---

## ⚙️ Command-Line Arguments

### Core Arguments (for local mode)

| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to the fine-tuned model | `training-outputs/codellama-fifo-v1` |

### Optional Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to the base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = more deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |

---

## 📝 Examples

### Example 1: Basic Inference

```bash
python3 scripts/inference/inference_codellama.py \
  --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```

### Example 2: Custom Parameters

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --prompt "Your prompt" \
  --max-new-tokens 1200 \
  --temperature 0.5
```

### Example 3: Merged Weights (Faster Inference)
```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --merge-weights \
  --prompt "Your prompt"
```

**Note:** `--merge-weights` merges the LoRA adapters into the base model. This takes longer to load but runs inference faster.

### Example 4: Custom Base Model Path

```bash
python3 scripts/inference/inference_codellama.py \
  --model-path training-outputs/codellama-fifo-v1 \
  --base-model-path /path/to/custom/base/model \
  --prompt "Your prompt"
```

---

## 🎛️ Generation Parameters

### Temperature

- **0.1-0.3**: Very deterministic, focused outputs (recommended for code generation)
- **0.5-0.7**: Balanced creativity and determinism
- **0.8-1.0**: More creative, varied outputs

**Default:** `0.3` (optimized for code generation)

### Max New Tokens

- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks

**Default:** `800` tokens

---

## 🔧 Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: reads `training_config.json` from the model directory
4. HuggingFace: falls back to `codellama/CodeLlama-7b-Instruct-hf`

### LoRA Adapter vs. Merged Model

- **LoRA adapter (default)**: Faster loading, uses adapter weights
- **Merged model (`--merge-weights`)**: Slower loading, but faster inference

---

## 📊 Output Format

The inference script automatically:

- Extracts Verilog code from markdown code blocks (`` ```verilog `` fences)
- Removes conversation wrappers
- Returns clean RTL code

### Example Output

**Input:**

```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**

```verilog
module sync_fifo_8b_4d (
    input clk,
    input rst,
    input write_en,
    input read_en,
    input [7:0] write_data,
    output [7:0] read_data
);
    // ... code ...
endmodule
```

---
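The base-model fallback order described in the Model Loading section can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation; the function name `find_base_model` and the `base_model` config key are assumptions for the sketch.

```python
import json
from pathlib import Path

# Assumed defaults, mirroring the detection order documented above
LOCAL_DEFAULT = Path("models/base-models/CodeLlama-7B-Instruct")
HF_FALLBACK = "codellama/CodeLlama-7b-Instruct-hf"


def find_base_model(model_dir: str, base_model_path: str | None = None) -> str:
    """Resolve the base model using the documented four-step fallback."""
    # 1. An explicit --base-model-path argument always wins
    if base_model_path:
        return base_model_path
    # 2. Local default path, if it exists on disk
    if LOCAL_DEFAULT.is_dir():
        return str(LOCAL_DEFAULT)
    # 3. Training config saved alongside the fine-tuned adapter
    config_file = Path(model_dir) / "training_config.json"
    if config_file.is_file():
        config = json.loads(config_file.read_text())
        if config.get("base_model"):
            return config["base_model"]
    # 4. Fall back to the HuggingFace Hub identifier
    return HF_FALLBACK
```

Each step only fires when every earlier step fails, so an explicit `--base-model-path` short-circuits all filesystem and config lookups.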
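The code-extraction step from the Output Format section can be sketched like this. It is a simplified stand-in for whatever the script actually does; the function name `extract_verilog` is hypothetical.

```python
import re


def extract_verilog(response: str) -> str:
    """Pull the first ```verilog fenced block out of a model response.

    Falls back to the raw (stripped) text when no fenced block is found.
    """
    match = re.search(r"```verilog\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()


response = "Here is the module:\n```verilog\nmodule foo;\nendmodule\n```\nDone."
print(extract_verilog(response))  # -> module foo;\nendmodule
```

The non-greedy `(.*?)` with `re.DOTALL` stops at the first closing fence, so conversational text before and after the code block is discarded.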
## 🚀 Performance Tips

### 1. Use Merged Weights for Repeated Inference

If you are running many inferences, merge the weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
  --merge-weights \
  --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use the cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

---

## 📁 File Structure

```
codellama-migration/
├── scripts/
│   └── inference/
│       └── inference_codellama.py     # Updated inference script
├── training-outputs/
│   └── codellama-fifo-v1/             # Fine-tuned model
│       ├── adapter_model.safetensors  # LoRA weights
│       ├── adapter_config.json
│       └── training_config.json
├── models/
│   └── base-models/
│       └── CodeLlama-7B-Instruct/     # Base model
└── test_inference.sh                  # Test script
```

---

## 🔍 Troubleshooting

### Model Not Found

```
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```

**Solution:** Check that the model path is correct:

```bash
ls -lh training-outputs/codellama-fifo-v1/
```

### Base Model Not Found

If base model detection fails, specify the path explicitly:

```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```

### Out of Memory

1. Ensure quantization is enabled (the default on GPU)
2. Reduce `--max-new-tokens`
3. Use `--no-quantization` only if you have enough memory

### Slow Inference

1. Use `--merge-weights` for faster inference
2. Reduce `--max-new-tokens`

---

## 📚 Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

---

**Happy Inferencing! 🎉**