# CodeLlama Fine-Tuned Model Evaluation Report
**Date:** November 25, 2025
**Model:** `codellama-fifo-v1` (Fine-tuned CodeLlama-7B-Instruct)
**Dataset:** 70 training samples, 9 validation samples, 15 test samples
**Training:** 5 epochs, LoRA rank 48, learning rate 2e-5
---
## Executive Summary
The fine-tuned CodeLlama model was evaluated on 2 training samples and 2 test samples. The evaluation shows that while the model attempts to generate RTL code, it does not consistently produce the expected format: instead of clean Verilog code, it often emits text descriptions and explanations.
**Key Findings:**
- **Training Set Average Similarity:** 13.30%
- **Test Set Average Similarity:** 0.93%
- **Overall Average Similarity:** 7.11%
---
## Test Configuration
### Model Details
- **Base Model:** CodeLlama-7B-Instruct
- **Fine-tuned Model:** `training-outputs/codellama-fifo-v1`
- **Training Loss:** 0.221 (final), 0.530 (average)
- **Validation Loss:** 0.371
### Inference Parameters
- **Max New Tokens:** 800
- **Temperature:** 0.3
- **Generation Mode:** Non-streaming
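The similarity scores reported below read as character-level match ratios (e.g. "36/611 characters match"). The exact metric used by the evaluation script is not stated here; a minimal sketch of one plausible implementation, using Python's `difflib` and normalizing by the expected output's length:

```python
from difflib import SequenceMatcher

def similarity(expected: str, generated: str) -> float:
    """Character-level similarity: total size of matching blocks
    divided by the length of the expected output."""
    matcher = SequenceMatcher(None, expected, generated, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected) if expected else 0.0
```

With this definition a perfect reproduction scores 1.0 and a completely disjoint output scores 0.0; the actual script may normalize differently (e.g. by the longer of the two strings).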
---
## Training Samples Evaluation
### Training Sample 1
**Instruction:**
```
Generate a synchronous FIFO with 8-bit data width, depth 4, write_enable, read_enable,
full flag, empty flag, write_err flag (pulses if write when full), and read_err flag
(pulses if read when empty).
```
**Expected Output:**
- Complete Verilog module with all required signals
- Proper FIFO implementation with error flags
- Code wrapped in `` ```verilog `` fences
**Generated Output:**
- Started generating module declaration
- Different signal naming (dout/din instead of read_data/write_data)
- Similarity: **5.89%** (36/611 characters match)
**Analysis:**
The model recognizes the task and begins generating a module, but uses different signal naming conventions and doesn't complete the implementation in the expected format.
---
### Training Sample 2
**Instruction:**
```
Generate a synchronous FIFO with 8-bit data width, depth 16, write_enable, read_enable,
full flag, empty flag, and occupancy output showing number of valid entries (0 to 16).
```
**Expected Output:**
- FIFO with occupancy output
- Depth 16 implementation
- Complete RTL code
**Generated Output:**
- Module declaration with similar structure
- Occupancy output included
- Similarity: **20.70%** (118/570 characters match)
**Analysis:**
Better performance on this sample. The model includes the occupancy output and maintains similar module structure, though implementation details differ.
---
## Test Samples Evaluation
### Test Sample 1
**Instruction:**
```
Generate a synchronous FIFO with 8-bit data width, depth 16, write_enable, read_enable,
full flag, empty flag, and peek capability to read data at arbitrary index without consuming it.
```
**Expected Output:**
- FIFO with peek functionality
- Peek index input and peek data output
**Generated Output:**
- Generated text description instead of code
- Discusses FIFO requirements but doesn't generate code
- Similarity: **1.22%** (31/2543 characters match)
**Analysis:**
The model failed to generate code for this more complex requirement (peek functionality). Instead, it produced explanatory text about FIFO requirements.
---
### Test Sample 2
**Instruction:**
```
Generate a synchronous FIFO with 28-bit data width, depth 32, write_enable, read_enable,
full flag, empty flag.
```
**Expected Output:**
- Standard FIFO with 28-bit data width
- Depth 32 implementation
**Generated Output:**
- Generated text description
- Discusses FIFO behavior but no code
- Similarity: **0.64%** (18/2831 characters match)
**Analysis:**
The same failure mode: the model generates explanatory text rather than Verilog code. The non-standard data width (28 bits) may have contributed to the confusion.
---
## Critical Issues Identified
### 1. **Incomplete Code Generation (Training Samples)**
- **Issue:** Model generates module declaration but stops before implementation
- **Example:** Training Sample 1 only generates port list, missing:
- Internal register declarations
- Always block with FIFO logic
- Assignment statements
**Root Cause:** The model may be emitting an end-of-sequence token early, or generation may be truncated by the 800-token `max_new_tokens` limit before the implementation completes.
### 2. **Text Generation Instead of Code (Test Samples)**
- **Issue:** Model generates repetitive text notes instead of Verilog code
- **Example:** Test Sample 1 generates 3000+ characters of repetitive "Note:" statements
- **Example:** Test Sample 2 repeats blocking assignment notes endlessly
**Root Cause:** Model appears to be hallucinating and repeating patterns. This suggests:
- Possible prompt format mismatch
- Temperature may be causing repetition
- Model may not understand test examples format
### 3. **Signal Naming Inconsistency**
- Training Sample 1 uses `din/dout` instead of `write_data/read_data`
- Different output declarations (e.g., `output reg` vs `output`)
### 4. **Repetition Issues**
- Test samples show severe repetition (same phrase repeated 50+ times)
- Indicates potential inference loop or improper stopping criteria
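This kind of degenerate looping can be flagged automatically by measuring how many word n-grams in the output duplicate an earlier n-gram. The helper below is a hypothetical addition, not part of the current evaluation script:

```python
def repetition_ratio(text: str, n: int = 5) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram;
    values near 1.0 indicate degenerate, looping output."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

A threshold such as `repetition_ratio(output) > 0.5` could mark a generation as failed before any similarity scoring is attempted.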
### 5. **Prompt Format Mismatch**
- Training data includes full system prompt in instruction
- Inference may need to match exact format
- Model might be confused about what to output
---
## Recommendations
### Immediate Actions
1. **Prompt Formatting**
- Ensure prompt matches training format exactly
- Remove system prompt from inference prompt if needed
- Add explicit "Generate Verilog code:" instruction
2. **Temperature Adjustment**
- Try temperature 0.1 for more deterministic output
- Test temperature 0.5 for more creative solutions
- Find optimal balance for code generation
3. **Post-Processing**
- Add code extraction logic to filter out text
- Verify generated code is syntactically correct
- Clean up any conversational wrappers
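The code-extraction step could be sketched as follows, assuming responses either wrap code in `` ```verilog `` fences (as the training data expects) or emit a bare `module ... endmodule` body:

```python
import re
from typing import Optional

def extract_verilog(output: str) -> Optional[str]:
    """Pull Verilog code out of a model response, ignoring any
    surrounding explanation. Returns None if no code is found."""
    # Prefer an explicit fenced code block.
    fenced = re.search(r"```(?:verilog)?\s*\n(.*?)```", output, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    # Fall back to a bare module ... endmodule span.
    bare = re.search(r"\bmodule\b.*?\bendmodule\b", output, re.DOTALL)
    return bare.group(0) if bare else None
```

Returning `None` rather than the raw text makes the "generated text instead of code" failure mode explicit and countable.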
### Training Improvements
1. **Dataset Refinement**
- Ensure all training examples have clean code output
- Remove any conversational elements from responses
- Standardize code formatting
2. **Additional Training**
- Consider more epochs if loss continues decreasing
- Add more diverse FIFO examples
- Include edge cases in training data
3. **Hyperparameter Tuning**
- Current hyperparameters are reasonable
- Could try higher LoRA rank (64) for more capacity
- Adjust learning rate if needed
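The dataset-refinement checks above lend themselves to a small validation script. A sketch, assuming the training file is JSONL with an `output` field holding the fenced Verilog response (the field name and fence convention are assumptions; adjust to the actual schema):

```python
import json

def find_bad_samples(jsonl_path: str) -> list:
    """Return indices of training samples whose output field does not
    contain a fenced Verilog code block."""
    bad = []
    with open(jsonl_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            sample = json.loads(line)
            if "```verilog" not in sample.get("output", ""):
                bad.append(i)
    return bad
```

Running this before re-training would catch any conversational or code-free responses that slipped into the 70 training samples.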
---
## Performance Metrics
| Metric | Training Samples | Test Samples | Overall |
|--------|-----------------|--------------|---------|
| **Average Similarity** | 13.30% | 0.93% | 7.11% |
| **Best Similarity** | 20.70% | 1.22% | - |
| **Worst Similarity** | 5.89% | 0.64% | - |
| **Code Generation Rate** | 100% (partial) | 0% | 50% |
**Code Generation Rate:** Percentage of samples that produced actual Verilog code (even if partial)
---
## Next Steps
1. **Test with Adjusted Prompts**
```python
# Try without system prompt
prompt = "Generate a synchronous FIFO with 8-bit data width, depth 4..."
```
2. **Inference Parameter Testing**
- Test lower temperature (0.1, 0.2)
- Test different max_new_tokens
- Try with merged weights
3. **Dataset Analysis**
- Review training data format
- Ensure consistency
- Check for any formatting issues
4. **Re-training Consideration**
- If issues persist, may need dataset reformatting
- Consider additional training epochs
- Evaluate need for more training data
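For step 2 (inference parameter testing), the sweep could start from a small helper that builds `model.generate()` keyword arguments. The argument names follow the Hugging Face `transformers` API; the helper itself and the `repetition_penalty` value are assumptions:

```python
def build_generation_kwargs(temperature: float, max_new_tokens: int = 800,
                            repetition_penalty: float = 1.2) -> dict:
    """Build keyword arguments for transformers' model.generate().
    A temperature of 0 falls back to greedy (deterministic) decoding."""
    kwargs = {"max_new_tokens": max_new_tokens,
              "repetition_penalty": repetition_penalty}  # assumed value
    if temperature > 0:
        kwargs.update(do_sample=True, temperature=temperature, top_p=0.95)
    else:
        kwargs["do_sample"] = False  # greedy decoding
    return kwargs

# Sweep the values suggested above.
for t in (0.0, 0.1, 0.2, 0.3):
    print(t, build_generation_kwargs(t))
```

The greedy (`temperature=0`) run gives a deterministic baseline to compare sampled runs against, and `repetition_penalty` directly targets the looping behavior seen on the test samples.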
---
## Detailed Results
Full evaluation results are saved in:
- `evaluation_results.json` - Complete JSON with all generated outputs
- `evaluation_output.log` - Terminal output log
---
**Report Generated:** 2025-11-25
**Model Version:** codellama-fifo-v1
**Training Completed:** 2025-11-25 (45 steps, 5 epochs)