Echo Model - Benchmark Evaluation Suite

This package evaluates your echo model on major Hugging Face benchmarks and generates submission files for leaderboards.

🚀 BMAD Method Integration

Echo Prime now includes the BMAD Method (Breakthrough Method of Agile AI Driven Development) - an AI-driven agile development framework with 21+ specialized agents and 50+ guided workflows.

Quick Start with BMAD:

Use BMAD to:

  • Plan and implement new Echo Prime features with structured workflows
  • Collaborate with specialized AI agents (PM, Architect, Developer, QA, etc.)
  • Maintain comprehensive documentation and code quality
  • Scale from quick bug fixes to enterprise-level features

Location: /Users/noone/echo_prime/bmad/
BMAD Artifacts: /Users/noone/echo_prime/_bmad-output/
Local Gate Command: make bmad-gate

Benchmarks Included

  1. GSM8K - Grade school math problems (8K questions)
  2. MMLU-Pro - Advanced multitask language understanding (12K questions)
  3. GPQA - Graduate-level STEM questions (requires access approval)
  4. HLE - Humanity's Last Exam (requires access approval)

Prerequisites

  1. Ollama must be running with the echo model:

    ollama serve
    ollama list  # Verify echo is available
    
  2. Python 3.8+ with pip

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run Evaluations

python3 evaluate_benchmarks.py

This will:

  • Connect to your local Ollama instance
  • Run evaluations on GSM8K and MMLU-Pro (publicly accessible)
  • Generate .eval_results/*.yaml files for Hugging Face submission
  • Display results summary

3. Access Gated Benchmarks (Optional)

To evaluate on GPQA and HLE:

  1. Request access on the GPQA and HLE dataset pages on Hugging Face.

  2. Login to Hugging Face:

    pip install -U "huggingface_hub[cli]"
    huggingface-cli login
    
  3. Re-run the evaluation script

Submitting Results to Hugging Face

Option 1: Create New Model Repo

  1. Go to https://huggingface.co/new-model

  2. Create a repo for your echo model (e.g., your-username/echo)

  3. Clone it locally:

    git clone https://huggingface.co/your-username/echo

  4. Copy the evaluation results from the directory where you ran the script into the clone:

    cp -r .eval_results/ echo/

  5. Commit and push:

    cd echo
    git add .eval_results/
    git commit -m "Add benchmark evaluation results

    Results from evaluations using Inspect AI framework:

    - GSM8K: [score]
    - MMLU-Pro: [score]

    Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
    git push


Option 2: Submit via Pull Request

If the model repo already exists but you don't have write access:

1. Go to the model page on Hugging Face
2. Click the "Community" tab
3. Open a Pull Request
4. Add your `.eval_results/*.yaml` files
5. The PR will show as "community-provided" on the model page

Results Format

Generated YAML files follow the Hugging Face eval results schema:

```yaml
- dataset:
    id: openai/gsm8k          # Benchmark dataset ID
    task_id: default          # Task identifier
  value: 0.8542               # Your model's score
  date: "2026-02-07"          # Evaluation date
  source:
    url: https://...          # Link to evaluation logs
    name: "Evaluation Results"
  notes: "Evaluated using Ollama"
```
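
As an illustration of the layout above, here is a minimal sketch of how a script could render and save one result entry using only the standard library. The function names `render_eval_result` and `write_eval_result` are hypothetical, and the actual writer in evaluate_benchmarks.py may differ.

```python
import json
from pathlib import Path


def render_eval_result(dataset_id, task_id, value, eval_date, source_url, notes):
    """Render one result entry in the layout shown in this section.

    json.dumps handles the quoting: a JSON string is also a valid
    double-quoted YAML scalar.
    """
    lines = [
        "- dataset:",
        f"    id: {dataset_id}",
        f"    task_id: {task_id}",
        f"  value: {value:.4f}",
        f"  date: {json.dumps(eval_date)}",
        "  source:",
        f"    url: {source_url}",
        '    name: "Evaluation Results"',
        f"  notes: {json.dumps(notes)}",
    ]
    return "\n".join(lines) + "\n"


def write_eval_result(out_dir, filename, text):
    """Write the rendered entry to out_dir/filename, creating out_dir if needed."""
    path = Path(out_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

For example, `write_eval_result(".eval_results", "gsm8k.yaml", render_eval_result("openai/gsm8k", "default", 0.8542, "2026-02-07", source_url, "Evaluated using Ollama"))` would produce a file matching the sample entry, where `gsm8k.yaml` is an assumed filename and `source_url` is your own link to the evaluation logs.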

Customization

Adjust Sample Size

Edit evaluate_benchmarks.py and modify the num_samples parameter:

# Evaluate fewer samples for faster testing
gsm8k_score = run_gsm8k_evaluation(model, num_samples=10)

# Evaluate a larger sample for more reliable results (the GSM8K test split has 1,319 problems)
gsm8k_score = run_gsm8k_evaluation(model, num_samples=1000)

Change Model Name

If your model isn't exactly named "echo" in Ollama:

# Option 1: Specify directly in the script
model = "your-model-name"

# Option 2: Pass as command line argument (modify script to accept args)
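
One way Option 2 could look is sketched below with `argparse`. The flag names `--model` and `--num-samples` and their defaults are assumptions for illustration; evaluate_benchmarks.py does not necessarily define them.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Evaluate an Ollama model on benchmarks"
    )
    parser.add_argument(
        "--model", default="echo",
        help="model name as shown by `ollama list`",
    )
    parser.add_argument(
        "--num-samples", type=int, default=10,
        help="number of questions to evaluate per benchmark",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # In evaluate_benchmarks.py these values would replace the hard-coded ones, e.g.
    # gsm8k_score = run_gsm8k_evaluation(args.model, num_samples=args.num_samples)
    print(f"model={args.model} num_samples={args.num_samples}")
```

With this in place, a run such as `python3 evaluate_benchmarks.py --model your-model-name --num-samples 100` would no longer require editing the script.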

Troubleshooting

"Cannot connect to Ollama"

  • Ensure Ollama is running: ollama serve
  • Check it's listening on port 11434: curl http://localhost:11434/api/tags
  • Verify echo model exists: ollama list
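
The same checks can be run from Python before starting a long evaluation. This is a sketch under the assumption that Ollama is on its default port. `fetch_tags` and `model_available` are hypothetical helper names, not functions from evaluate_benchmarks.py.

```python
import json
import urllib.error
import urllib.request


def fetch_tags(base_url="http://localhost:11434", timeout=5.0):
    """Return the parsed JSON from Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_available(tags, model):
    """True if `model` matches an installed model, with or without a tag suffix."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    return any(name == model or name.split(":", 1)[0] == model for name in names)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except (urllib.error.URLError, OSError) as exc:
        print(f"Cannot connect to Ollama: {exc}")
    else:
        print("echo available:", model_available(tags, "echo"))
```

Matching on the part before the colon lets `echo` match an installed `echo:latest`.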

"No module named 'datasets'"

pip install datasets

Evaluation is slow

  • Reduce num_samples for testing
  • Use a GPU-enabled Ollama setup
  • Consider running overnight for full evaluations

Scores seem incorrect

  • Check model output format matches expected format
  • Review the extraction logic in evaluation functions
  • Add debug prints to see model responses

Understanding the Results

Badges on Hugging Face

Once submitted, your results may display badges:

  • 🔒 verified: Evaluation with cryptographic proof (requires HF Jobs)
  • 👥 community: Submitted via open PR (not merged to main)
  • 📊 leaderboard: Links to benchmark leaderboard
  • 📄 source: Links to evaluation logs

Leaderboard Updates

After pushing results to your model repo:

  1. Results appear on your model page within minutes
  2. Leaderboard updates may take a few hours
  3. Benchmark pages aggregate scores across all models

Advanced: Using Inspect AI

For official verified results using Inspect AI:

# Install Inspect AI
pip install inspect-ai

# Run evaluation (requires eval.yaml files from benchmarks)
inspect eval openai/gsm8k@default --model ollama/echo

# Generate submission with verify token
# (Requires running in Hugging Face Jobs for verification)

Resources

License

This evaluation code is provided as-is for benchmarking purposes.
