Echo Model - Benchmark Evaluation Suite

This package evaluates your echo model on major Hugging Face benchmarks and generates submission files for leaderboards.

🚀 BMAD Method Integration

Echo Prime now includes the BMAD Method (Breakthrough Method of Agile AI Driven Development) - an AI-driven agile development framework with 21+ specialized agents and 50+ guided workflows.

Quick Start with BMAD:

📖 See BMAD_INTEGRATION.md for complete documentation
⚡ See BMAD_QUICK_REFERENCE.md for quick commands
🧭 See BMAD Production Readiness for release gate and artifact paths
💬 Run /bmad-help in your AI IDE to get started

Use BMAD to:

Plan and implement new Echo Prime features with structured workflows
Collaborate with specialized AI agents (PM, Architect, Developer, QA, etc.)
Maintain comprehensive documentation and code quality
Scale from quick bug fixes to enterprise-level features

Location: /Users/noone/echo_prime/bmad/
BMAD Artifacts: /Users/noone/echo_prime/_bmad-output/
Local Gate Command: make bmad-gate

Benchmarks Included

GSM8K - Grade school math problems (8K questions)
MMLU-Pro - Advanced multitask language understanding (12K questions)
GPQA - Graduate-level STEM questions (requires access approval)
HLE - Humanity's Last Exam (requires access approval)

Prerequisites

Ollama must be running with the echo model:

ollama serve
ollama list  # Verify echo is available

Python 3.8+ with pip

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run Evaluations

python3 evaluate_benchmarks.py

This will:

Connect to your local Ollama instance
Run evaluations on GSM8K and MMLU-Pro (publicly accessible)
Generate .eval_results/*.yaml files for Hugging Face submission
Display results summary
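As a rough illustration of the first step, a script like this could query the local Ollama server through its `/api/generate` endpoint. This is a minimal sketch, not the code in evaluate_benchmarks.py: the helper names `build_request` and `ask_model` are hypothetical, and the port (11434) and model name (echo) are the ones used elsewhere in this README.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_model(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # With "stream": False the server replies with one JSON object
        # whose "response" field holds the full completion.
        return json.loads(resp.read())["response"]
```

With Ollama running, `ask_model("echo", "What is 12 * 7?")` would return the model's raw text, which the evaluation then has to parse into an answer.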

3. Access Gated Benchmarks (Optional)

To evaluate on GPQA and HLE:

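Request access on each dataset's Hugging Face page:
- GPQA Dataset
- HLE Dataset

Install the Hugging Face Hub client (which provides the CLI) and log in:

pip install -U huggingface_hub
huggingface-cli login

Re-run the evaluation script.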

Submitting Results to Hugging Face

Option 1: Create New Model Repo

Go to https://huggingface.co/new-model
Create a repo for your echo model (e.g., your-username/echo)

Clone it locally into a directory named your-echo-repo, next to your .eval_results/ directory:

git clone https://huggingface.co/your-username/echo your-echo-repo

Copy evaluation results (no trailing slash on the source, so the directory itself is copied):
```
cp -r .eval_results your-echo-repo/
```

Commit and push:

cd your-echo-repo
git add .eval_results/
git commit -m "Add benchmark evaluation results

Results from evaluations using Inspect AI framework:

GSM8K: [score]
MMLU-Pro: [score]

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push


Option 2: Submit via Pull Request

If the model repo already exists but you don't have write access:

1. Go to the model page on Hugging Face
2. Click the "Community" tab
3. Open a Pull Request
4. Add your `.eval_results/*.yaml` files
5. The PR will show as "community-provided" on the model page

Results Format

Generated YAML files follow the Hugging Face eval results schema:

```yaml
- dataset:
    id: openai/gsm8k          # Benchmark dataset ID
    task_id: default          # Task identifier
  value: 0.8542               # Your model's score
  date: "2026-02-07"          # Evaluation date
  source:
    url: https://...          # Link to evaluation logs
    name: "Evaluation Results"
  notes: "Evaluated using Ollama"
```
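For illustration, an entry in the layout shown above could be produced with plain string formatting, which avoids a YAML dependency. This is only a sketch: `format_result` and `write_result` are hypothetical names, the field set is copied from this README rather than from the Hugging Face schema documentation, and the output path is an assumption.

```python
from datetime import date
from pathlib import Path


def format_result(dataset_id: str, task_id: str, score: float,
                  source_url: str, notes: str, eval_date: date) -> str:
    """Render one result entry using the fields shown in this README."""
    return (
        "- dataset:\n"
        f"    id: {dataset_id}\n"
        f"    task_id: {task_id}\n"
        f"  value: {score:.4f}\n"
        f'  date: "{eval_date.isoformat()}"\n'
        "  source:\n"
        f"    url: {source_url}\n"
        '    name: "Evaluation Results"\n'
        f'  notes: "{notes}"\n'
    )


def write_result(path: Path, entry: str) -> None:
    """Write the entry, creating the results directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(entry, encoding="utf-8")
```

A call such as `write_result(Path(".eval_results/gsm8k.yaml"), entry)` would place the file where the submission steps expect it; the file name gsm8k.yaml is a placeholder.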

Customization

Adjust Sample Size

Edit evaluate_benchmarks.py and modify the num_samples parameter:

# Evaluate fewer samples for faster testing
gsm8k_score = run_gsm8k_evaluation(model, num_samples=10)

# Evaluate a larger sample for more reliable results (the GSM8K test split has 1,319 problems)
gsm8k_score = run_gsm8k_evaluation(model, num_samples=1000)
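To see why a 10-sample run is only suitable for smoke testing, it can help to estimate the uncertainty of the reported accuracy. The sketch below uses the normal approximation to a binomial proportion. The function `accuracy_interval` is a hypothetical helper, not part of evaluate_benchmarks.py, and the score of 0.85 is just an example value.

```python
import math


def accuracy_interval(score: float, num_samples: int, z: float = 1.96):
    """Approximate 95% interval for an accuracy measured on num_samples items."""
    stderr = math.sqrt(score * (1.0 - score) / num_samples)
    # Clamp to [0, 1] because accuracy cannot leave that range.
    return max(0.0, score - z * stderr), min(1.0, score + z * stderr)


for n in (10, 1000):
    low, high = accuracy_interval(0.85, n)
    print(f"num_samples={n}: {low:.3f} to {high:.3f}")
```

For a measured score of 0.85, the interval spans roughly 0.63 to 1.00 at 10 samples but only about 0.83 to 0.87 at 1,000.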

Change Model Name

If your model isn't exactly named "echo" in Ollama:

# Option 1: Specify directly in the script
model = "your-model-name"

# Option 2: Pass as command line argument (modify script to accept args)
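One way to implement Option 2 is with `argparse` from the standard library. This is a sketch only: the flags `--model` and `--num-samples` are assumptions and do not exist in the script until you add them.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical command line options for the evaluation script."""
    parser = argparse.ArgumentParser(description="Run benchmark evaluations")
    parser.add_argument("--model", default="echo",
                        help="Ollama model name (default: echo)")
    parser.add_argument("--num-samples", type=int, default=10,
                        help="Number of questions to evaluate per benchmark")
    return parser.parse_args(argv)
```

Inside the script you would then read `args = parse_args()` and pass `args.model` and `args.num_samples` to the evaluation functions, so that `python3 evaluate_benchmarks.py --model your-model-name` works without editing the source.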

Troubleshooting

"Cannot connect to Ollama"

Ensure Ollama is running: ollama serve
Check it's listening on port 11434: curl http://localhost:11434/api/tags
Verify echo model exists: ollama list
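The same checks can be scripted. The sketch below calls the `/api/tags` endpoint and looks for the model name in the response. The helper names are hypothetical, and it assumes the response has the shape `{"models": [{"name": "echo:latest", ...}]}`.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"


def installed_models(tags_response: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]


def has_model(tags_response: dict, wanted: str) -> bool:
    """Match the name with or without its tag, e.g. 'echo' or 'echo:latest'."""
    return any(name == wanted or name.split(":")[0] == wanted
               for name in installed_models(tags_response))


def check_ollama(wanted: str = "echo") -> str:
    """Return 'ok' or a short description of what went wrong."""
    try:
        with urllib.request.urlopen(TAGS_URL, timeout=5) as resp:
            tags = json.loads(resp.read())
    except OSError as exc:  # URLError and connection errors are OSErrors
        return f"Cannot connect to Ollama: {exc}"
    return "ok" if has_model(tags, wanted) else f"Model '{wanted}' not found"
```

Calling `check_ollama()` before a long run separates a server that is not running from a model that has not been pulled or created.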

"No module named 'datasets'"

pip install datasets

Evaluation is slow

Reduce num_samples for testing
Use a GPU-enabled Ollama setup
Consider running overnight for full evaluations

Scores seem incorrect

Check model output format matches expected format
Review the extraction logic in evaluation functions
Add debug prints to see model responses
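Many scoring problems come from answer extraction rather than from the model. As a point of comparison for GSM8K, whose reference answers end with `#### <number>`, a simple extractor might take the last number in the model's output. This is a sketch under that assumption, with hypothetical function names; the logic in evaluate_benchmarks.py may differ.

```python
import re

# An optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_number(text: str):
    """Return the last number in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output: str, reference: str) -> bool:
    """Compare the model's final number with the GSM8K reference answer."""
    gold = extract_number(reference.split("####")[-1])
    pred = extract_number(model_output)
    return gold is not None and pred is not None and abs(pred - gold) < 1e-6
```

Printing `model_output` next to `extract_number(model_output)` for a handful of questions usually shows quickly whether the model or the parser is at fault.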

Understanding the Results

Badges on Hugging Face

Once submitted, your results may display badges:

🔒 verified: Evaluation with cryptographic proof (requires HF Jobs)
👥 community: Submitted via open PR (not merged to main)
📊 leaderboard: Links to benchmark leaderboard
📄 source: Links to evaluation logs

Leaderboard Updates

After pushing results to your model repo:

Results appear on your model page within minutes
Leaderboard updates may take a few hours
Benchmark pages aggregate scores across all models

Advanced: Using Inspect AI

For official verified results using Inspect AI:

# Install Inspect AI
pip install inspect-ai

# Run evaluation (requires eval.yaml files from benchmarks)
inspect eval openai/gsm8k@default --model ollama/echo

# Generate submission with verify token
# (Requires running in Hugging Face Jobs for verification)

Resources

License

This evaluation code is provided as-is for benchmarking purposes.
