
# 📊 CodeLlama Evaluation Summary

**Date:** November 25, 2025
**Model:** `codellama-fifo-v1`


## 🎯 Quick Summary

| Metric | Value |
|---|---|
| Training Samples Avg Similarity | 13.30% |
| Test Samples Avg Similarity | 0.93% |
| Overall Similarity | 7.11% |
| Code Generation Rate | 50% (training samples only) |
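The summary does not state how these similarity percentages are computed. A plausible sketch, assuming a character-level `difflib` ratio between generated and reference code (the actual metric may differ):

```python
from difflib import SequenceMatcher

def similarity(generated: str, reference: str) -> float:
    """Character-level similarity between generated and reference code, in percent."""
    return SequenceMatcher(None, generated, reference).ratio() * 100

def avg_similarity(pairs) -> float:
    """Average similarity over (generated, reference) pairs, as in the per-split rows above."""
    return sum(similarity(g, r) for g, r in pairs) / len(pairs)
```

Under this assumption, averaging over the training and test splits separately would yield the two per-split rows, and averaging over both would give the overall figure.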

## ✅ What Worked

1. **Model loading:** the model loads successfully with its LoRA adapters.
2. **Training samples:** partial code generation (module declarations are produced).
3. **Training sample 2:** 20.70% similarity, the best single result.

## ❌ Critical Issues

1. **Incomplete code:** training samples yield only module declarations, never full module bodies.
2. **Text instead of code:** test samples produce prose notes rather than code.
3. **Repetition:** test-sample outputs loop severely on the same phrases.
4. **Early stopping:** generation halts before the code is complete.
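The repetition issue can be quantified with a duplicate n-gram ratio. This helper (a hypothetical name, not part of the evaluation script) flags outputs that loop:

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that are duplicates; values near 1.0 indicate looping output."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

Thresholding this at, say, 0.5 would automatically catch the repetitive test-sample outputs described above instead of relying on manual inspection.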

## 🔧 Immediate Actions Needed

1. **Fix the prompt format**
   - Match the training data format exactly.
   - Test without the system prompt prefix.
   - Add an explicit code-generation instruction.
2. **Adjust inference parameters**
   - Try a lower temperature (0.1–0.2).
   - Increase `max_new_tokens`.
   - Test different stopping criteria.
3. **Check the training data**
   - Verify that every sample contains complete code.
   - Ensure consistent formatting.
   - Remove any text-only samples.
4. **Re-test with adjusted prompts**
   - Use the exact training format.
   - Test simpler prompts.
   - Verify that generation does not stop early.
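Actions 1–3 can be sketched in a few lines. The instruction template, the `endmodule` completeness check, and the parameter values below are illustrative assumptions, not the project's actual format:

```python
# Action 1 (assumed template): no system prefix, explicit code instruction.
def build_prompt(instruction: str) -> str:
    return (
        "### Instruction:\n"
        f"{instruction}\n"
        "Respond with complete Verilog code only.\n"
        "### Response:\n"
    )

# Action 2: starting points for the decoding settings listed above.
GEN_KWARGS = {"temperature": 0.15, "do_sample": True, "max_new_tokens": 1024}

# Action 3: a finished FIFO module must reach `endmodule`; outputs or training
# samples failing this check are truncated or text-only.
def is_complete_module(sample_output: str) -> bool:
    return "endmodule" in sample_output
```

The same `is_complete_module` check can filter truncated or text-only training samples before re-training, and confirm during re-testing (action 4) that generation no longer stops early.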

## 📈 Detailed Results

See EVALUATION_REPORT.md for the complete analysis.


**Status:** ⚠️ NEEDS IMPROVEMENT
**Next steps:** adjust prompts and inference parameters, then re-test.