
Sheikh-2.5-Coder Evaluation Framework

Overview

This framework provides systematic testing and benchmarking for the Sheikh-2.5-Coder model across multiple dimensions, including code generation quality, performance, web development capabilities, and regression detection.

Components

1. Main Evaluation Orchestrator (evaluate_model.py)

  • Purpose: Coordinates all evaluation benchmarks and generates comprehensive reports
  • Features:
    • Integrates all evaluation components
    • Creates HTML dashboards and visualizations
    • Generates detailed markdown reports
    • Manages target achievement tracking
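
A minimal sketch of how such an orchestrator might chain the individual benchmarks is shown below. The function signature and the idea of passing benchmark callables as a dict are illustrative assumptions, not the actual evaluate_model.py interface:

import json
from pathlib import Path

def run_all_benchmarks(benchmarks, model_path, config, output_path, run_id):
    """benchmarks: mapping of name -> callable(model_path, config, output_path, run_id) -> metrics dict."""
    results = {}
    for name, run in benchmarks.items():
        results[name] = run(model_path, config, output_path, run_id)
    # Collect everything into a single JSON file for the report generator
    out = Path(output_path) / f"evaluation_results_{run_id}.json"
    out.write_text(json.dumps(results, indent=2))
    return results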

2. Benchmark Evaluations

MMLU Code Evaluation (mmlu_evaluation.py)

  • Target: >60% accuracy on MMLU Code subset
  • Dataset: lukaemon/mmlu with code subset
  • Metrics: Accuracy, response time, confusion analysis
  • Features:
    • Multiple choice question answering
    • Programming concept understanding
    • Categorized performance analysis
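
As a rough illustration, multiple-choice accuracy reduces to comparing the model's chosen option letter against the gold answer; a simplified sketch, not the exact mmlu_evaluation.py logic:

def mmlu_accuracy(predictions, references):
    """predictions/references: lists of option letters such as 'A'-'D'."""
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0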

HumanEval Coding Tasks (humaneval_evaluation.py)

  • Target: >40% Pass@1
  • Dataset: OpenAI HumanEval
  • Metrics: Pass@1, Pass@k, function correctness, syntax validity
  • Features:
    • Multi-completion generation for Pass@k calculation
    • Automated function testing
    • Code syntax validation
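
Pass@k is usually computed with the unbiased estimator from the HumanEval paper: n completions are sampled per problem, c of them pass the unit tests, and the per-problem estimates are averaged. A sketch of the estimator:

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))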

Web Development Tests (web_dev_tests.py)

  • Target: >75% quality score across web technologies
  • Coverage: JavaScript/TypeScript, React, XML, MDX, CSS
  • Features:
    • Language-specific quality assessment
    • Best practices compliance checking
    • Component pattern recognition
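
As an illustration, a heuristic React quality check might award points for expected patterns; this is purely a sketch and not the actual rubric used in web_dev_tests.py:

import re

def react_quality_score(code):
    """Fraction of simple best-practice checks a generated React snippet passes."""
    checks = [
        bool(re.search(r"\bexport default\b|\bexport\s+function\b", code)),  # component is exported
        bool(re.search(r"\breturn\s*\(", code)),                             # JSX return block present
        "class=" not in code,                                                # uses className, not class
    ]
    return sum(checks) / len(checks)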

3. Performance Benchmarking (performance_benchmark.py)

  • Metrics: Inference speed, memory usage, context scaling, multi-threading
  • Features:
    • Hardware utilization monitoring
    • Batch size optimization testing
    • Memory profiling across quantization levels
    • Context length scalability analysis
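
Tokens per second and peak GPU memory can be measured with standard PyTorch and transformers calls; a minimal sketch (model and tokenizer loading omitted):

import time
import torch

def measure_throughput(model, tokenizer, prompt, max_new_tokens=256):
    """Rough tokens/second and peak GPU memory for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
    return {"tokens_per_second": new_tokens / elapsed, "peak_memory_gb": peak_gb}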

4. Code Quality Assessment (code_quality_tests.py)

  • Targets: >95% syntax validity, >0.65 CodeBLEU score
  • Features:
    • Multi-language syntax validation
    • Code complexity analysis
    • Best practices compliance
    • CodeBLEU score calculation
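
Syntax validity for Python outputs, for example, can be approximated by attempting to parse each sample; a sketch:

import ast

def python_syntax_validity(samples):
    """Fraction of generated Python samples that parse without a SyntaxError."""
    valid = 0
    for code in samples:
        try:
            ast.parse(code)
            valid += 1
        except SyntaxError:
            pass
    return valid / len(samples) if samples else 0.0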

5. Regression Testing (regression_testing.py)

  • Purpose: Detect performance regressions against baselines
  • Features:
    • Statistical significance testing
    • Multi-baseline comparison
    • Automated regression reporting
    • Performance degradation detection
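
Conceptually, a regression is flagged when a metric drops beyond a tolerance relative to the stored baseline, optionally backed by a significance test. A simplified sketch (the actual thresholds and statistics in regression_testing.py may differ):

from scipy import stats

def detect_regression(baseline_scores, current_scores, tolerance=0.02, alpha=0.05):
    """Flag a regression if the mean score drops beyond tolerance and the drop is significant."""
    drop = (sum(baseline_scores) / len(baseline_scores)
            - sum(current_scores) / len(current_scores))
    t_stat, p_value = stats.ttest_ind(baseline_scores, current_scores)
    return drop > tolerance and p_value < alpha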

Configuration

Evaluation Configuration (evaluation_config.yaml)

evaluation:
  model_settings:
    device: "auto"
    dtype: "float16"
    max_new_tokens: 512
    temperature: 0.7
    
  targets:
    mmlu_code_accuracy: 0.60
    humaneval_pass1: 0.40
    codebleu_score: 0.65
    syntax_validity: 0.95
    web_dev_quality: 0.75
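
Each script can read these settings with a standard YAML load; a sketch of pulling out the targets, assuming the structure above:

import yaml

with open("scripts/evaluation_config.yaml") as f:
    config = yaml.safe_load(f)

targets = config["evaluation"]["targets"]
print(targets["humaneval_pass1"])  # 0.40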

Usage

Quick Start

# Run comprehensive evaluation
python scripts/evaluate_model.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./evaluation_results \
    --run_id eval_$(date +%Y%m%d_%H%M%S)

Individual Benchmark Runs

# MMLU Code evaluation
python scripts/mmlu_evaluation.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./results/mmlu \
    --run_id mmlu_eval

# HumanEval evaluation
python scripts/humaneval_evaluation.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./results/humaneval \
    --run_id humaneval_eval

# Web development tests
python scripts/web_dev_tests.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./results/webdev \
    --run_id webdev_eval

# Performance benchmarking
python scripts/performance_benchmark.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./results/performance \
    --run_id perf_eval

# Code quality tests
python scripts/code_quality_tests.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./results/quality \
    --run_id quality_eval

# Regression testing
python scripts/regression_testing.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./results/regression \
    --run_id regression_eval

Advanced Configuration

# Custom targets and settings
python scripts/evaluate_model.py \
    --model_path /path/to/model \
    --config scripts/evaluation_config.yaml \
    --output_path ./evaluation_results \
    --run_id custom_eval \
    --skip_load  # Dry run without model loading

Output Files

Generated Reports

  • comprehensive_report_{run_id}.md - Main evaluation report
  • evaluation_results_{run_id}.json - Detailed JSON results
  • evaluation_summary_{run_id}.csv - CSV summary
  • performance_metrics_{run_id}.json - Performance metrics

Individual Benchmark Outputs

Each benchmark generates:

  • {benchmark}_results_{run_id}.json - Detailed results
  • {benchmark}_detailed_{run_id}.csv - Sample-level data
  • {benchmark}_{run_id}.log - Execution logs

Target Achievement

The framework tracks the following performance targets:

| Benchmark       | Target | Metric          |
|-----------------|--------|-----------------|
| MMLU Code       | >60%   | Accuracy        |
| HumanEval       | >40%   | Pass@1          |
| Web Development | >75%   | Quality Score   |
| Code Quality    | >95%   | Syntax Validity |
| Code Quality    | >0.65  | CodeBLEU Score  |
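
Target achievement can then be derived by comparing each measured metric against its configured threshold; a sketch:

def check_targets(metrics, targets):
    """Return {target_name: True/False} for each configured threshold."""
    return {name: metrics.get(name, 0.0) >= threshold
            for name, threshold in targets.items()}

# Example: check_targets({"humaneval_pass1": 0.43}, {"humaneval_pass1": 0.40})
# -> {"humaneval_pass1": True}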

Performance Expectations

Inference Speed

  • Excellent: >50 tokens/second
  • Good: 30-50 tokens/second
  • Acceptable: 20-30 tokens/second
  • Poor: <20 tokens/second

Memory Usage

  • Efficient: <8GB model size
  • Standard: 8-12GB model size
  • Large: 12-20GB model size

Integration

Continuous Integration

# .github/workflows/evaluation.yml
name: Model Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v2
      - name: Run Evaluation
        run: |
          python scripts/evaluate_model.py \
            --model_path ${{ github.workspace }} \
            --config scripts/evaluation_config.yaml \
            --output_path ./results \
            --run_id ci_${{ github.sha }}

Automated Reporting

The framework integrates with:

  • HuggingFace Evaluate Library: Standard metrics
  • MLflow: Experiment tracking
  • Weights & Biases: Visualization dashboards
  • GitHub Actions: CI/CD integration
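
For example, aggregated metrics can be pushed to MLflow and Weights & Biases with their standard logging calls; a sketch, where the project and run names are placeholders:

import mlflow
import wandb

def log_results(run_id, metrics):
    # MLflow experiment tracking
    with mlflow.start_run(run_name=run_id):
        mlflow.log_metrics(metrics)
    # Weights & Biases dashboard
    wandb.init(project="sheikh-2.5-coder-eval", name=run_id)  # placeholder project name
    wandb.log(metrics)
    wandb.finish()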

Troubleshooting

Common Issues

  1. Model Loading Failures

    # Check model path and permissions
    ls -la /path/to/model
    # Verify CUDA availability
    python -c "import torch; print(torch.cuda.is_available())"
    
  2. Memory Issues

    # Reduce batch sizes, or fall back to CPU, in evaluation_config.yaml
    evaluation:
      model_settings:
        device_map: "cpu"  # Use CPU instead of GPU
    
  3. Dataset Access

    # Login to HuggingFace
    huggingface-cli login
    # Or disable remote code loading
    

Performance Optimization

  1. GPU Memory Optimization

    • Use device_map="auto" for automatic placement
    • Enable gradient checkpointing for memory efficiency
    • Use quantization (int8, int4) for larger models (see the loading sketch after this list)
  2. Speed Optimization

    • Increase batch sizes for throughput
    • Use faster attention implementations
    • Enable TensorRT optimization
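
For the quantization point above, a model can be loaded in 8-bit via transformers and bitsandbytes roughly as follows; this is a sketch, and the exact arguments may vary with your transformers version:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # requires the bitsandbytes package
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/model",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("/path/to/model")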

Customization

Adding New Benchmarks

  1. Create new evaluation script following existing patterns
  2. Add to evaluate_model.py orchestrator
  3. Update evaluation_config.yaml with new settings
  4. Implement result saving and target tracking
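
A new benchmark can follow the same shape as the existing scripts; a skeleton, where the class and method names are illustrative rather than a required interface:

class MyBenchmark:
    """Skeleton for a custom benchmark; adapt it to the orchestrator's interface."""

    def __init__(self, model, tokenizer, config):
        self.model = model
        self.tokenizer = tokenizer
        self.config = config

    def run(self):
        # 1. Load or build the test set
        # 2. Generate completions with self.model
        # 3. Score them and return a metrics dict
        return {"my_metric": 0.0}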

Modifying Targets

Edit evaluation_config.yaml:

targets:
  mmlu_code_accuracy: 0.65  # Increased target
  humaneval_pass1: 0.45     # Increased target
  custom_metric: 0.80       # New metric

Custom Quality Metrics

Extend existing evaluation classes:

def evaluate_custom_metric(self, code_samples):
    # Implement your own quality assessment, e.g. score each sample in [0, 1]
    # and return the mean; score_sample is a hypothetical helper you supply.
    scores = [self.score_sample(sample) for sample in code_samples]
    return sum(scores) / len(scores) if scores else 0.0

Support

Logging and Debugging

  • All scripts generate detailed logs in output directories
  • Enable debug mode in configuration:
    logging:
      level: "DEBUG"
      debug_mode: true
    

Resource Requirements

  • Minimum: 8GB RAM, 1 GPU (4GB VRAM)
  • Recommended: 16GB RAM, 1 GPU (8GB VRAM)
  • Optimal: 32GB RAM, 2+ GPUs (16GB+ VRAM each)

Best Practices

  1. Baseline Comparisons: Always maintain baseline results for regression detection
  2. Incremental Testing: Run individual benchmarks during development
  3. Regular Evaluation: Schedule periodic comprehensive evaluations
  4. Result Archiving: Save evaluation results for historical analysis

License

This evaluation framework is part of the Sheikh-2.5-Coder project. See the main repository for license information.


Note: This framework is designed for systematic model evaluation and should be integrated into continuous development workflows for best results.