# 📊 CodeLlama Evaluation Summary
Date: November 25, 2025
Model: codellama-fifo-v1
## 🎯 Quick Summary
| Metric | Value |
|---|---|
| Training Samples Avg Similarity | 13.30% |
| Test Samples Avg Similarity | 0.93% |
| Overall Similarity | 7.11% |
| Code Generation Rate | 50% (training only) |
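The summary does not state how similarity is measured; a minimal sketch, assuming a character-level ratio such as Python's `difflib.SequenceMatcher` (the actual metric may differ):

```python
from difflib import SequenceMatcher

def similarity(generated: str, reference: str) -> float:
    """Character-level similarity ratio between two strings, as a percentage."""
    return SequenceMatcher(None, generated, reference).ratio() * 100

def average_similarity(pairs) -> float:
    """Mean similarity over (generated, reference) pairs."""
    return sum(similarity(g, r) for g, r in pairs) / len(pairs)

# Identical strings score 100%, fully disjoint strings score 0%.
print(round(similarity("module adder;", "module adder;")))  # 100
```

Averaging this per-sample score over the training and test splits would yield figures comparable to the table above.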
## ✅ What Worked
- Model Loading: Successfully loads with LoRA adapters
- Training Samples: Partial code generation (module declarations)
- Training Sample 2: 20.70% similarity (best result)
## ❌ Critical Issues
- Incomplete Code: Training samples generate only module declarations
- Text Instead of Code: Test samples produce repetitive natural-language notes rather than code
- Repetition: Severe repetition in test sample outputs
- Early Stopping: Code generation stops before completion
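The repetition issue above can be flagged automatically. A small hypothetical helper (not part of the evaluation pipeline) that scores duplicate word n-grams in an output:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that are duplicates; 0.0 means no repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A phrase looping in the output yields a high ratio.
looping = "note: the code is below " * 10
print(repetition_ratio(looping) > 0.5)  # True
```

Outputs scoring above a chosen threshold (e.g. 0.5) could be counted as degenerate when re-running the evaluation.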
## 🔧 Immediate Actions Needed
### Fix Prompt Format
- Match training data format exactly
- Test without system prompt prefix
- Add explicit code generation instruction
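The exact training template is not shown in this summary; a hypothetical prompt builder illustrating the three checks above (the template string is a placeholder and must be replaced with the real training format):

```python
def build_prompt(instruction: str, use_system_prefix: bool = False) -> str:
    """Assemble an inference prompt. The template below is a placeholder —
    it must match the training-data format exactly."""
    system = "You are a helpful coding assistant.\n" if use_system_prefix else ""
    # Explicit code-only instruction, per the action item above.
    return (
        f"{system}### Instruction:\n{instruction}\n"
        "Respond with code only.\n"
        "### Response:\n"
    )

# Toggle use_system_prefix to A/B test with and without the system prompt.
prompt = build_prompt("Write a 4-bit adder module.")
```

Running the same sample with `use_system_prefix=True` and `False` isolates whether the system prefix is causing the format mismatch.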
### Adjust Inference Parameters
- Try lower temperature (0.1-0.2)
- Increase max_new_tokens
- Test different stopping criteria
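A sketch of the adjusted settings as a generation-config fragment; the parameter names follow the Hugging Face `transformers` `generate()` API, and the values are starting points to tune, not validated results:

```python
# Hypothetical generation settings for a transformers-style generate() call.
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.15,      # within the suggested 0.1-0.2 range
    "max_new_tokens": 1024,   # raised so full module bodies can fit
    "repetition_penalty": 1.2, # discourages the looping seen in test outputs
}
# output_ids = model.generate(**inputs, **gen_kwargs)
```

Stopping criteria (e.g. a custom `StoppingCriteria` or `eos_token_id`) would be varied separately on top of these settings.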
### Check Training Data
- Verify all samples have complete code
- Ensure consistent formatting
- Remove any text-only samples
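The checks above can be scripted. A hypothetical validator, assuming JSONL records with an `"output"` field holding Verilog-style code that should end with `endmodule` (both assumptions follow from the "module declarations" finding and must be adjusted for the real dataset):

```python
import json

def incomplete_samples(lines):
    """Yield indices of samples whose target code looks truncated or text-only."""
    for i, line in enumerate(lines):
        sample = json.loads(line)
        code = sample.get("output", "")
        # A complete Verilog-style target should reach its closing endmodule.
        if "endmodule" not in code:
            yield i

rows = [
    '{"output": "module adder; endmodule"}',
    '{"output": "module adder;"}',        # truncated: no endmodule
    '{"output": "See the note below."}',  # text-only sample
]
print(list(incomplete_samples(rows)))  # [1, 2]
```

Flagged samples would then be completed or removed before retraining.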
### Re-test with Adjusted Prompts
- Use exact training format
- Test simpler prompts
- Verify generation doesn't stop early
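When re-testing, it helps to record why each generation ended. A hypothetical classifier, again assuming Verilog-style targets that close with `endmodule`:

```python
def stop_reason(output_len: int, max_new_tokens: int, text: str) -> str:
    """Classify how a generation ended: hit the token cap (truncated),
    reached the expected code terminator, or stopped early."""
    if output_len >= max_new_tokens:
        return "length-cap"
    if text.rstrip().endswith("endmodule"):  # assumes Verilog-style targets
        return "complete"
    return "early-stop"

print(stop_reason(40, 1024, "module adder;"))  # early-stop
```

Tallying these labels across samples distinguishes a too-small `max_new_tokens` (mostly `length-cap`) from a premature stop token (mostly `early-stop`).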
## 📊 Detailed Results
See EVALUATION_REPORT.md for complete analysis.
Status: ⚠️ NEEDS IMPROVEMENT
Next Steps: Adjust prompts and inference parameters, then re-test.