CodeLlama Fine-Tuned Model Evaluation Report
Date: November 25, 2025
Model: codellama-fifo-v1 (Fine-tuned CodeLlama-7B-Instruct)
Dataset: 70 training samples, 9 validation samples, 15 test samples
Training: 5 epochs, LoRA rank 48, learning rate 2e-5
Executive Summary
The fine-tuned CodeLlama model was evaluated on 2 training samples and 2 test samples. The evaluation reveals that while the model attempts to generate RTL code, it is not consistently producing code in the expected format: it tends to generate text descriptions and explanations rather than clean Verilog.
Key Findings:
- Training Set Average Similarity: 13.30%
- Test Set Average Similarity: 0.93%
- Overall Average Similarity: 7.11%
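The similarity figures above are character-match ratios (e.g., "36/611 characters match" for Training Sample 1). The report does not specify the exact metric; a minimal sketch, assuming the score counts matched characters via `difflib.SequenceMatcher` matching blocks (the real evaluation script may differ in both the matcher and the denominator):

```python
import difflib

def char_similarity(generated, expected):
    """Count matched characters between generated and expected text.

    Returns (matched_chars, denominator, percent_match).
    Assumption: the report's "X/Y characters match" figures come from
    a matching-blocks count over the expected output; the actual
    evaluation script is not shown and may compute this differently.
    """
    matcher = difflib.SequenceMatcher(None, expected, generated)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    denominator = max(len(expected), 1)
    return matched, denominator, 100.0 * matched / denominator

# Identical strings score 100%; pure prose against RTL scores near 0%.
m, t, pct = char_similarity("module fifo;", "module fifo;")
```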
Test Configuration
Model Details
- Base Model: CodeLlama-7B-Instruct
- Fine-tuned Model: `training-outputs/codellama-fifo-v1`
- Training Loss: 0.221 (final), 0.530 (average)
- Validation Loss: 0.371
Inference Parameters
- Max New Tokens: 800
- Temperature: 0.3
- Generation Mode: Non-streaming
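The parameters above map onto the keyword arguments a Hugging Face `model.generate()` call would take. A sketch only; the actual inference script is not part of this report, and model/tokenizer loading is elided:

```python
# Generation parameters used in this evaluation, expressed as
# generate() keyword arguments (sketch; the real inference script
# is not shown in this report).
generation_kwargs = {
    "max_new_tokens": 800,  # cap on generated length
    "temperature": 0.3,     # mild sampling randomness
    "do_sample": True,      # temperature only takes effect when sampling
    # Non-streaming mode: no streamer object is passed to generate().
}
```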
Training Samples Evaluation
Training Sample 1
Instruction:
Generate a synchronous FIFO with 8-bit data width, depth 4, write_enable, read_enable,
full flag, empty flag, write_err flag (pulses if write when full), and read_err flag
(pulses if read when empty).
Expected Output:
- Complete Verilog module with all required signals
- Proper FIFO implementation with error flags
- Code wrapped in ```verilog markers
Generated Output:
- Started generating module declaration
- Different signal naming (dout/din instead of read_data/write_data)
- Similarity: 5.89% (36/611 characters match)
Analysis: The model recognizes the task and begins generating a module, but uses different signal naming conventions and doesn't complete the implementation in the expected format.
Training Sample 2
Instruction:
Generate a synchronous FIFO with 8-bit data width, depth 16, write_enable, read_enable,
full flag, empty flag, and occupancy output showing number of valid entries (0 to 16).
Expected Output:
- FIFO with occupancy output
- Depth 16 implementation
- Complete RTL code
Generated Output:
- Module declaration with similar structure
- Occupancy output included
- Similarity: 20.70% (118/570 characters match)
Analysis: Better performance on this sample. The model includes the occupancy output and maintains similar module structure, though implementation details differ.
Test Samples Evaluation
Test Sample 1
Instruction:
Generate a synchronous FIFO with 8-bit data width, depth 16, write_enable, read_enable,
full flag, empty flag, and peek capability to read data at arbitrary index without consuming it.
Expected Output:
- FIFO with peek functionality
- Peek index input and peek data output
Generated Output:
- Generated text description instead of code
- Discusses FIFO requirements but doesn't generate code
- Similarity: 1.22% (31/2543 characters match)
Analysis: The model failed to generate code for this more complex requirement (peek functionality). Instead, it produced explanatory text about FIFO requirements.
Test Sample 2
Instruction:
Generate a synchronous FIFO with 28-bit data width, depth 32, write_enable, read_enable,
full flag, empty flag.
Expected Output:
- Standard FIFO with 28-bit data width
- Depth 32 implementation
Generated Output:
- Generated text description
- Discusses FIFO behavior but no code
- Similarity: 0.64% (18/2831 characters match)
Analysis: The same issue appears here: the model generates explanatory text rather than Verilog code. The non-standard data width (28-bit) may have contributed to the confusion.
Critical Issues Identified
1. Incomplete Code Generation (Training Samples)
- Issue: Model generates module declaration but stops before implementation
- Example: Training Sample 1 only generates port list, missing:
- Internal register declarations
- Always block with FIFO logic
- Assignment statements
Root Cause: The model may be stopping early, or inference may be truncating output before completion (e.g., an early EOS token or the max_new_tokens limit).
2. Text Generation Instead of Code (Test Samples)
- Issue: Model generates repetitive text notes instead of Verilog code
- Example: Test Sample 1 generates 3000+ characters of repetitive "Note:" statements
- Example: Test Sample 2 repeats blocking assignment notes endlessly
Root Cause: Model appears to be hallucinating and repeating patterns. This suggests:
- Possible prompt format mismatch
- Temperature may be causing repetition
- Model may not understand test examples format
3. Signal Naming Inconsistency
- Training Sample 1 uses `din`/`dout` instead of `write_data`/`read_data`
- Different output declarations (e.g., `output reg` vs `output`)
4. Repetition Issues
- Test samples show severe repetition (same phrase repeated 50+ times)
- Indicates potential inference loop or improper stopping criteria
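Degenerate repetition like this can be flagged automatically before scoring. A minimal sketch of a hypothetical helper (not part of the evaluation scripts) that measures how much of an output is one repeated line:

```python
from collections import Counter

def repetition_ratio(text):
    """Fraction of non-empty lines taken up by the single most
    frequent line. Values near 1.0 indicate looping output, such as
    the same "Note:" sentence repeated 50+ times."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    _, top_count = Counter(lines).most_common(1)[0]
    return top_count / len(lines)

# A looping output is dominated by a single repeated line:
looped = "Note: blocking assignment.\n" * 50 + "endmodule\n"
```

Outputs that score above a chosen threshold (say, 0.5) could be rejected and regenerated. On the inference side, Hugging Face `generate()` accepts `repetition_penalty` and `no_repeat_ngram_size` arguments that directly target this failure mode.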
5. Prompt Format Mismatch
- Training data includes full system prompt in instruction
- Inference may need to match exact format
- Model might be confused about what to output
Recommendations
Immediate Actions
Prompt Formatting
- Ensure prompt matches training format exactly
- Remove system prompt from inference prompt if needed
- Add explicit "Generate Verilog code:" instruction
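A sketch of aligning the inference prompt with the Llama-2 / CodeLlama-Instruct chat format (`[INST] ... [/INST]`, with an optional `<<SYS>>` block). This assumes the fine-tuning data used that wrapper; verify against the actual training formatter before relying on it:

```python
def build_prompt(instruction, system_prompt=None):
    """Wrap an instruction in the CodeLlama-Instruct chat format.

    Assumption: the fine-tuning data used the standard [INST] wrapper;
    if the training formatter differed, inference must match it instead.
    """
    if system_prompt:
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction}"
    return f"<s>[INST] {instruction} [/INST]"

# With no system prompt, the instruction stands alone inside [INST]:
prompt = build_prompt("Generate a synchronous FIFO with 8-bit data width, depth 4...")
```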
Temperature Adjustment
- Try temperature 0.1 for more deterministic output
- Test temperature 0.5 for more creative solutions
- Find optimal balance for code generation
Post-Processing
- Add code extraction logic to filter out text
- Verify generated code is syntactically correct
- Clean up any conversational wrappers
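The code-extraction step could look like the following sketch: prefer a ```verilog fenced block, fall back to a bare `module ... endmodule` span, and return nothing when the model produced only prose (as in both test samples):

```python
import re

def extract_verilog(raw_output):
    """Pull Verilog source out of a model response.

    Prefers a ```verilog fenced block; falls back to the first
    module ... endmodule span; returns None if neither is found,
    i.e. the model produced only explanatory text.
    """
    fenced = re.search(r"```verilog\s*(.*?)```", raw_output, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    bare = re.search(r"(module\b.*?endmodule)", raw_output, re.DOTALL)
    return bare.group(1) if bare else None
```

A syntax check (e.g., running the extracted text through a Verilog linter such as Verilator or Icarus) would then catch partially generated modules like Training Sample 1.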
Training Improvements
Dataset Refinement
- Ensure all training examples have clean code output
- Remove any conversational elements from responses
- Standardize code formatting
Additional Training
- Consider more epochs if loss continues decreasing
- Add more diverse FIFO examples
- Include edge cases in training data
Hyperparameter Tuning
- Current hyperparameters are reasonable
- Could try higher LoRA rank (64) for more capacity
- Adjust learning rate if needed
Performance Metrics
| Metric | Training Samples | Test Samples | Overall |
|---|---|---|---|
| Average Similarity | 13.30% | 0.93% | 7.11% |
| Best Similarity | 20.70% | 1.22% | - |
| Worst Similarity | 5.89% | 0.64% | - |
| Code Generation Rate | 100% (partial) | 0% | 50% |
Code Generation Rate: Percentage of samples that produced actual Verilog code (even if partial)
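The aggregate figures in the table follow directly from the four per-sample similarity scores reported above:

```python
# Per-sample similarity scores (percent), from the evaluations above.
training = [5.89, 20.70]  # Training Samples 1 and 2
test = [1.22, 0.64]       # Test Samples 1 and 2

train_avg = sum(training) / len(training)        # -> 13.30%
test_avg = sum(test) / len(test)                 # -> 0.93%
overall = sum(training + test) / len(training + test)  # -> 7.11%
```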
Next Steps
Test with Adjusted Prompts
```python
# Try without system prompt
prompt = "Generate a synchronous FIFO with 8-bit data width, depth 4..."
```
Inference Parameter Testing
- Test lower temperature (0.1, 0.2)
- Test different max_new_tokens
- Try with merged weights
Dataset Analysis
- Review training data format
- Ensure consistency
- Check for any formatting issues
Re-training Consideration
- If issues persist, may need dataset reformatting
- Consider additional training epochs
- Evaluate need for more training data
Detailed Results
Full evaluation results are saved in:
- `evaluation_results.json` - Complete JSON with all generated outputs
- `evaluation_output.log` - Terminal output log
Report Generated: 2025-11-25
Model Version: codellama-fifo-v1
Training Completed: 2025-11-25 (45 steps, 5 epochs)