# πŸš€ CodeLlama Inference Guide

**Last Updated:** November 25, 2025

---

## πŸ“‹ Overview

This guide explains how to use the updated CodeLlama inference script with your fine-tuned model.

---

## 🎯 Quick Start

### Basic Inference (Single Prompt)

```bash
cd /workspace/ftt/codellama-migration

python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt here"
```

Or use the test script:

```bash
bash test_inference.sh
```

### Interactive Mode

```bash
python3 scripts/inference/inference_codellama.py \
    --mode local \
    --model-path training-outputs/codellama-fifo-v1
```

Type prompts interactively; enter `quit` or `exit` to stop.
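
The interactive loop can be sketched roughly as follows. This is a simplified illustration, not the script's actual source; the `generate` callable stands in for the real inference call:

```python
def is_exit_command(line: str) -> bool:
    """Return True when the user wants to leave interactive mode."""
    return line.strip().lower() in {"quit", "exit"}

def interactive_loop(generate):
    """Read prompts until the user types 'quit' or 'exit'.

    `generate` maps a prompt string to model output; it stands in
    for the real inference call in inference_codellama.py.
    """
    while True:
        try:
            line = input(">>> ")
        except EOFError:
            break
        if is_exit_command(line):
            break
        if line.strip():
            print(generate(line))
```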

---

## βš™οΈ Command-Line Arguments

### Primary Arguments (local mode)

| Argument | Description | Default |
|----------|-------------|---------|
| `--mode` | Inference mode: `local` or `ollama` | `local` |
| `--model-path` | Path to fine-tuned model | `training-outputs/codellama-fifo-v1` |

### Optional Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--base-model-path` | Path to base CodeLlama model | Auto-detected from training config |
| `--prompt` | Single prompt to process | (Interactive mode if not provided) |
| `--max-new-tokens` | Maximum tokens to generate | `800` |
| `--temperature` | Generation temperature (lower = more deterministic) | `0.3` |
| `--merge-weights` | Merge LoRA weights (slower load, faster inference) | `False` |
| `--no-quantization` | Disable 4-bit quantization | Auto (quantized on GPU) |

---

## πŸ“ Examples

### Example 1: Basic Inference

```bash
python3 scripts/inference/inference_codellama.py \
    --prompt "Generate a synchronous FIFO with 8-bit data width, depth 4"
```

### Example 2: Custom Parameters

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --prompt "Your prompt" \
    --max-new-tokens 1200 \
    --temperature 0.5
```

### Example 3: Merged Weights (Faster Inference)

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --merge-weights \
    --prompt "Your prompt"
```

**Note:** `--merge-weights` merges LoRA adapters into the base model. This takes longer to load but runs inference faster.

### Example 4: Custom Base Model Path

```bash
python3 scripts/inference/inference_codellama.py \
    --model-path training-outputs/codellama-fifo-v1 \
    --base-model-path /path/to/custom/base/model \
    --prompt "Your prompt"
```

---

## πŸŽ›οΈ Generation Parameters

### Temperature

- **0.1-0.3**: Very deterministic, focused outputs (recommended for code generation)
- **0.5-0.7**: Balanced creativity and determinism
- **0.8-1.0**: More creative, varied outputs

**Default:** `0.3` (optimized for code generation)

### Max New Tokens

- **512**: Short responses
- **800**: Default (balanced)
- **1200+**: Longer code blocks

**Default:** `800` tokens
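
In Hugging Face terms, these two settings typically become keyword arguments to `model.generate`. A sketch of the mapping (the `do_sample` coupling is a common convention, assumed rather than confirmed for this script):

```python
def build_generation_kwargs(max_new_tokens: int = 800,
                            temperature: float = 0.3) -> dict:
    """Assemble generation settings; sampling is enabled whenever
    temperature is above zero, otherwise decoding is greedy."""
    return {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "do_sample": temperature > 0,
    }
```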

---

## πŸ”§ Model Loading

### Automatic Base Model Detection

The script automatically detects the base model in this order:

1. `--base-model-path` argument (if provided)
2. Local default path: `models/base-models/CodeLlama-7B-Instruct`
3. Training config: Reads `training_config.json` from model directory
4. HuggingFace: Falls back to `codellama/CodeLlama-7b-Instruct-hf`
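
The fallback chain above can be sketched as follows. The `base_model` key in `training_config.json` is an assumption based on this guide's description, not verified against the script:

```python
import json
import os
from typing import Optional

def detect_base_model(model_path: str,
                      base_model_path: Optional[str] = None) -> str:
    """Resolve the base model using the documented priority order."""
    # 1. An explicit --base-model-path always wins.
    if base_model_path:
        return base_model_path
    # 2. Local default checkout of the base model.
    local_default = "models/base-models/CodeLlama-7B-Instruct"
    if os.path.isdir(local_default):
        return local_default
    # 3. Whatever the training run recorded in its config.
    cfg_path = os.path.join(model_path, "training_config.json")
    if os.path.isfile(cfg_path):
        with open(cfg_path) as f:
            cfg = json.load(f)
        if cfg.get("base_model"):
            return cfg["base_model"]
    # 4. Fall back to the Hugging Face Hub identifier.
    return "codellama/CodeLlama-7b-Instruct-hf"
```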

### LoRA Adapter vs Merged Model

- **LoRA Adapter (default)**: Faster loading, uses adapter weights
- **Merged Model (`--merge-weights`)**: Slower loading, but faster inference

---

## πŸ“Š Output Format

The inference script automatically:
- Extracts Verilog code from fenced markdown blocks (`` ```verilog `` fences)
- Removes conversation wrappers
- Returns clean RTL code
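
A minimal version of that extraction step might look like this (a sketch; the real script's cleanup may be more involved):

```python
import re

def extract_verilog(text: str) -> str:
    """Pull the first ```verilog fenced block out of a model response.

    Falls back to the raw text (with stray backticks stripped) if no
    fenced block is found.
    """
    m = re.search(r"```verilog\s*\n(.*?)```", text, re.DOTALL)
    if m:
        return m.group(1).strip()
    return text.strip().strip("`").strip()
```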

### Example Output

**Input:**
```
Generate a synchronous FIFO with 8-bit data width, depth 4
```

**Output:**
```verilog
module sync_fifo_8b_4d (
  input clk,
  input rst,
  input write_en,
  input read_en,
  input [7:0] write_data,
  output [7:0] read_data
);
  // ... code ...
endmodule
```

---

## πŸš€ Performance Tips

### 1. Use Merged Weights for Repeated Inference

If running many inferences, merge weights once:

```bash
# First run (slower loading)
python3 scripts/inference/inference_codellama.py \
    --merge-weights \
    --model-path training-outputs/codellama-fifo-v1

# Subsequent runs use cached merged model (if saved)
```

### 2. Adjust Max Tokens Based on Task

```bash
# Short responses
--max-new-tokens 400

# Long code blocks
--max-new-tokens 1200
```

### 3. Lower Temperature for Code Generation

```bash
# Very deterministic (recommended)
--temperature 0.2

# Slightly more varied
--temperature 0.5
```

---

## πŸ“ File Structure

```
codellama-migration/
β”œβ”€β”€ scripts/
β”‚   └── inference/
β”‚       └── inference_codellama.py    # Updated inference script
β”œβ”€β”€ training-outputs/
β”‚   └── codellama-fifo-v1/            # Fine-tuned model
β”‚       β”œβ”€β”€ adapter_model.safetensors # LoRA weights
β”‚       β”œβ”€β”€ adapter_config.json
β”‚       └── training_config.json
β”œβ”€β”€ models/
β”‚   └── base-models/
β”‚       └── CodeLlama-7B-Instruct/    # Base model
└── test_inference.sh                 # Test script
```

---

## πŸ” Troubleshooting

### Model Not Found

```bash
Error: Model path training-outputs/codellama-fifo-v1 does not exist
```

**Solution:** Check that the model path is correct:
```bash
ls -lh training-outputs/codellama-fifo-v1/
```

### Base Model Not Found

If base model detection fails, specify explicitly:

```bash
--base-model-path /workspace/ftt/codellama-migration/models/base-models/CodeLlama-7B-Instruct
```

### Out of Memory

1. Ensure quantization is enabled (default on GPU)
2. Reduce `--max-new-tokens`
3. Avoid `--no-quantization`; loading without 4-bit quantization needs far more memory
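
For reference, 4-bit loading in transformers is usually configured via `BitsAndBytesConfig`. The settings below are common choices for LLM weights, shown as a plain dict; the exact values this script uses are an assumption:

```python
def quantization_settings(enabled: bool = True) -> dict:
    """Keyword arguments typically passed to transformers'
    BitsAndBytesConfig when 4-bit loading is on; empty when the
    user passes --no-quantization."""
    if not enabled:
        return {}
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",          # common choice for LLM weights
        "bnb_4bit_use_double_quant": True,     # quantize the quantization constants
        "bnb_4bit_compute_dtype": "bfloat16",  # compute in bf16, store in 4-bit
    }
```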

### Slow Inference

1. Use `--merge-weights` for faster per-token inference
2. Reduce `--max-new-tokens`; generation time scales with output length

---

## πŸ“š Related Documents

- `TRAINING_GUIDE.md` - Fine-tuning guide
- `HYPERPARAMETER_ANALYSIS.md` - Hyperparameter details
- `MIGRATION_PROGRESS.md` - Migration status

---

**Happy Inferencing! πŸŽ‰**