CodeLlama Fine-Tuned Model Evaluation Report
Date: November 25, 2025
Model: codellama-fifo-v1 (Fine-tuned CodeLlama-7B-Instruct)
Dataset: 70 training samples, 9 validation samples, 15 test samples
Training: 5 epochs, LoRA rank 48, learning rate 2e-5
Executive Summary
The fine-tuned CodeLlama model was evaluated on 2 training samples and 2 test samples. The evaluation reveals that while the model attempts to generate RTL code, it is not consistently producing code in the expected format: it tends to generate text descriptions and explanations rather than clean Verilog.
Key Findings:
- Training Set Average Similarity: 13.30%
- Test Set Average Similarity: 0.93%
- Overall Average Similarity: 7.11%
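The similarity figures above are character-match ratios (e.g., "36/611 characters match" for Training Sample 1). The report does not specify the exact metric; a minimal sketch, assuming the score counts matched characters via `difflib.SequenceMatcher` matching blocks (the real evaluation script may differ in both the matcher and the denominator):

```python
import difflib

def char_similarity(generated, expected):
    """Count matched characters between generated and expected text.

    Returns (matched_chars, denominator, percent_match).
    Assumption: the report's "X/Y characters match" figures come from
    a matching-blocks count over the expected output; the actual
    evaluation script is not shown and may compute this differently.
    """
    matcher = difflib.SequenceMatcher(None, expected, generated)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    denominator = max(len(expected), 1)
    return matched, denominator, 100.0 * matched / denominator

# Identical strings score 100%; pure prose against RTL scores near 0%.
m, t, pct = char_similarity("module fifo;", "module fifo;")
```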
Test Configuration
Model Details
- Base Model: CodeLlama-7B-Instruct
- Fine-tuned Model: `training-outputs/codellama-fifo-v1`
- Training Loss: 0.221 (final), 0.530 (average)
- Validation Loss: 0.371
Inference Parameters
- Max New Tokens: 800
- Temperature: 0.3
- Generation Mode: Non-streaming
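The parameters above map onto the keyword arguments a Hugging Face `model.generate()` call would take. A sketch only; the actual inference script is not part of this report, and model/tokenizer loading is elided:

```python
# Generation parameters used in this evaluation, expressed as
# generate() keyword arguments (sketch; the real inference script
# is not shown in this report).
generation_kwargs = {
    "max_new_tokens": 800,  # cap on generated length
    "temperature": 0.3,     # mild sampling randomness
    "do_sample": True,      # temperature only takes effect when sampling
    # Non-streaming mode: no streamer object is passed to generate().
}
```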
Training Samples Evaluation
Training Sample 1
Instruction:
Generate a synchronous FIFO with 8-bit data width, depth 4, write_enable, read_enable,
full flag, empty flag, write_err flag (pulses if write when full), and read_err flag
(pulses if read when empty).
Expected Output:
- Complete Verilog module with all required signals
- Proper FIFO implementation with error flags
- Code wrapped in ```verilog markers
Generated Output:
- Started generating module declaration
- Different signal naming (dout/din instead of read_data/write_data)
- Similarity: 5.89% (36/611 characters match)
Analysis: The model recognizes the task and begins generating a module, but uses different signal naming conventions and doesn't complete the implementation in the expected format.
Training Sample 2
Instruction:
Generate a synchronous FIFO with 8-bit data width, depth 16, write_enable, read_enable,
full flag, empty flag, and occupancy output showing number of valid entries (0 to 16).
Expected Output:
- FIFO with occupancy output
- Depth 16 implementation
- Complete RTL code
Generated Output:
- Module declaration with similar structure
- Occupancy output included
- Similarity: 20.70% (118/570 characters match)
Analysis: Better performance on this sample. The model includes the occupancy output and maintains similar module structure, though implementation details differ.
Test Samples Evaluation
Test Sample 1
Instruction:
Generate a synchronous FIFO with 8-bit data width, depth 16, write_enable, read_enable,
full flag, empty flag, and peek capability to read data at arbitrary index without consuming it.
Expected Output:
- FIFO with peek functionality
- Peek index input and peek data output
Generated Output:
- Generated text description instead of code
- Discusses FIFO requirements but doesn't generate code
- Similarity: 1.22% (31/2543 characters match)
Analysis: The model failed to generate code for this more complex requirement (peek functionality). Instead, it produced explanatory text about FIFO requirements.
Test Sample 2
Instruction:
Generate a synchronous FIFO with 28-bit data width, depth 32, write_enable, read_enable,
full flag, empty flag.
Expected Output:
- Standard FIFO with 28-bit data width
- Depth 32 implementation
Generated Output:
- Generated text description
- Discusses FIFO behavior but no code
- Similarity: 0.64% (18/2831 characters match)
Analysis: The same issue appears here: the model generates explanatory text rather than Verilog code. The non-standard data width (28-bit) may have contributed to the confusion.
Critical Issues Identified
1. Incomplete Code Generation (Training Samples)
- Issue: Model generates module declaration but stops before implementation
- Example: Training Sample 1 only generates port list, missing:
- Internal register declarations
- Always block with FIFO logic
- Assignment statements
Root Cause: The model may be stopping early, or inference may be truncating output before completion (e.g., an early EOS token or the max_new_tokens limit).
2. Text Generation Instead of Code (Test Samples)
- Issue: Model generates repetitive text notes instead of Verilog code
- Example: Test Sample 1 generates 3000+ characters of repetitive "Note:" statements
- Example: Test Sample 2 repeats blocking assignment notes endlessly
Root Cause: Model appears to be hallucinating and repeating patterns. This suggests:
- Possible prompt format mismatch
- Temperature may be causing repetition
- Model may not understand test examples format
3. Signal Naming Inconsistency
- Training Sample 1 uses `din`/`dout` instead of `write_data`/`read_data`
- Different output declarations (e.g., `output reg` vs `output`)
4. Repetition Issues
- Test samples show severe repetition (same phrase repeated 50+ times)
- Indicates potential inference loop or improper stopping criteria
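Degenerate repetition like this can be flagged automatically before scoring. A minimal sketch of a hypothetical helper (not part of the evaluation scripts) that measures how much of an output is one repeated line:

```python
from collections import Counter

def repetition_ratio(text):
    """Fraction of non-empty lines taken up by the single most
    frequent line. Values near 1.0 indicate looping output, such as
    the same "Note:" sentence repeated 50+ times."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    _, top_count = Counter(lines).most_common(1)[0]
    return top_count / len(lines)

# A looping output is dominated by a single repeated line:
looped = "Note: blocking assignment.\n" * 50 + "endmodule\n"
```

Outputs that score above a chosen threshold (say, 0.5) could be rejected and regenerated. On the inference side, Hugging Face `generate()` accepts `repetition_penalty` and `no_repeat_ngram_size` arguments that directly target this failure mode.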
5. Prompt Format Mismatch
- Training data includes full system prompt in instruction
- Inference may need to match exact format
- Model might be confused about what to output
Recommendations
Immediate Actions
Prompt Formatting
- Ensure prompt matches training format exactly
- Remove system prompt from inference prompt if needed
- Add explicit "Generate Verilog code:" instruction
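A sketch of aligning the inference prompt with the Llama-2 / CodeLlama-Instruct chat format (`[INST] ... [/INST]`, with an optional `<<SYS>>` block). This assumes the fine-tuning data used that wrapper; verify against the actual training formatter before relying on it:

```python
def build_prompt(instruction, system_prompt=None):
    """Wrap an instruction in the CodeLlama-Instruct chat format.

    Assumption: the fine-tuning data used the standard [INST] wrapper;
    if the training formatter differed, inference must match it instead.
    """
    if system_prompt:
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction}"
    return f"<s>[INST] {instruction} [/INST]"

# With no system prompt, the instruction stands alone inside [INST]:
prompt = build_prompt("Generate a synchronous FIFO with 8-bit data width, depth 4...")
```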
Temperature Adjustment
- Try temperature 0.1 for more deterministic output
- Test temperature 0.5 for more creative solutions
- Find optimal balance for code generation
Post-Processing
- Add code extraction logic to filter out text
- Verify generated code is syntactically correct
- Clean up any conversational wrappers
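The code-extraction step could look like the following sketch: prefer a ```verilog fenced block, fall back to a bare `module ... endmodule` span, and return nothing when the model produced only prose (as in both test samples):

```python
import re

def extract_verilog(raw_output):
    """Pull Verilog source out of a model response.

    Prefers a ```verilog fenced block; falls back to the first
    module ... endmodule span; returns None if neither is found,
    i.e. the model produced only explanatory text.
    """
    fenced = re.search(r"```verilog\s*(.*?)```", raw_output, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    bare = re.search(r"(module\b.*?endmodule)", raw_output, re.DOTALL)
    return bare.group(1) if bare else None
```

A syntax check (e.g., running the extracted text through a Verilog linter such as Verilator or Icarus) would then catch partially generated modules like Training Sample 1.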
Training Improvements
Dataset Refinement
- Ensure all training examples have clean code output
- Remove any conversational elements from responses
- Standardize code formatting
Additional Training
- Consider more epochs if loss continues decreasing
- Add more diverse FIFO examples
- Include edge cases in training data
Hyperparameter Tuning
- Current hyperparameters are reasonable
- Could try higher LoRA rank (64) for more capacity
- Adjust learning rate if needed
Performance Metrics
| Metric | Training Samples | Test Samples | Overall |
|---|---|---|---|
| Average Similarity | 13.30% | 0.93% | 7.11% |
| Best Similarity | 20.70% | 1.22% | - |
| Worst Similarity | 5.89% | 0.64% | - |
| Code Generation Rate | 100% (partial) | 0% | 50% |
Code Generation Rate: Percentage of samples that produced actual Verilog code (even if partial)
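The aggregate figures in the table follow directly from the four per-sample similarity scores reported above:

```python
# Per-sample similarity scores (percent), from the evaluations above.
training = [5.89, 20.70]  # Training Samples 1 and 2
test = [1.22, 0.64]       # Test Samples 1 and 2

train_avg = sum(training) / len(training)        # -> 13.30%
test_avg = sum(test) / len(test)                 # -> 0.93%
overall = sum(training + test) / len(training + test)  # -> 7.11%
```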
Next Steps
Test with Adjusted Prompts
```python
# Try without system prompt
prompt = "Generate a synchronous FIFO with 8-bit data width, depth 4..."
```
Inference Parameter Testing
- Test lower temperature (0.1, 0.2)
- Test different max_new_tokens
- Try with merged weights
Dataset Analysis
- Review training data format
- Ensure consistency
- Check for any formatting issues
Re-training Consideration
- If issues persist, may need dataset reformatting
- Consider additional training epochs
- Evaluate need for more training data
Detailed Results
Full evaluation results are saved in:
- `evaluation_results.json` - Complete JSON with all generated outputs
- `evaluation_output.log` - Terminal output log
Report Generated: 2025-11-25
Model Version: codellama-fifo-v1
Training Completed: 2025-11-25 (45 steps, 5 epochs)