| # Echo Model - Benchmark Evaluation Suite | |
| This package evaluates your echo model on major Hugging Face benchmarks and generates submission files for leaderboards. | |
| ## BMAD Method Integration | |
| Echo Prime now includes the **BMAD Method** (Breakthrough Method of Agile AI Driven Development) - an AI-driven agile development framework with 21+ specialized agents and 50+ guided workflows. | |
| **Quick Start with BMAD:** | |
| - See [BMAD_INTEGRATION.md](BMAD_INTEGRATION.md) for complete documentation | |
| - See [BMAD_QUICK_REFERENCE.md](BMAD_QUICK_REFERENCE.md) for quick commands | |
| - See [BMAD Production Readiness](docs/BMAD_PRODUCTION_READINESS.md) for the release gate and artifact paths | |
| - Run `/bmad-help` in your AI IDE to get started | |
| **Use BMAD to:** | |
| - Plan and implement new Echo Prime features with structured workflows | |
| - Collaborate with specialized AI agents (PM, Architect, Developer, QA, etc.) | |
| - Maintain comprehensive documentation and code quality | |
| - Scale from quick bug fixes to enterprise-level features | |
| **Location:** `/Users/noone/echo_prime/bmad/` | |
| **BMAD Artifacts:** `/Users/noone/echo_prime/_bmad-output/` | |
| **Local Gate Command:** `make bmad-gate` | |
| ## Benchmarks Included | |
| 1. **GSM8K** - Grade school math word problems (about 8.5K questions, 1,319 of them in the test split) | |
| 2. **MMLU-Pro** - Advanced multitask language understanding (12K questions) | |
| 3. **GPQA** - Graduate-level STEM questions (requires access approval) | |
| 4. **HLE** - Humanity's Last Exam (requires access approval) | |
| ## Prerequisites | |
| 1. **Ollama** must be running with the echo model: | |
| ```bash | |
| ollama serve | |
| ollama list # Verify echo is available | |
| ``` | |
| 2. **Python 3.8+** with pip | |
| ## Quick Start | |
| ### 1. Install Dependencies | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| ### 2. Run Evaluations | |
| ```bash | |
| python3 evaluate_benchmarks.py | |
| ``` | |
| This will: | |
| - Connect to your local Ollama instance | |
| - Run evaluations on GSM8K and MMLU-Pro (publicly accessible) | |
| - Generate `.eval_results/*.yaml` files for Hugging Face submission | |
| - Display results summary | |
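
The first step above, connecting to the local Ollama instance, can be sketched as a single non-streaming request to Ollama's `/api/generate` endpoint. This is a minimal sketch under stated assumptions, not the code in `evaluate_benchmarks.py`: the helper names `build_generate_request` and `ask_ollama` are hypothetical, and the port is the default one named in the Troubleshooting section.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port (see Troubleshooting)


def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a non-streaming /api/generate call."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt to Ollama and return the generated text.

    Hypothetical helper: the real script may structure this differently.
    """
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False, Ollama returns one JSON object whose
        # "response" field holds the full completion.
        return json.loads(response.read())["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(ask_ollama("echo", "What is 12 * 7? Answer with a number."))
```

Setting `"stream": False` keeps the sketch simple because the whole completion arrives in one JSON object instead of a sequence of chunks.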
| ### 3. Access Gated Benchmarks (Optional) | |
| To evaluate on GPQA and HLE: | |
| 1. Request access: | |
| - [GPQA Dataset](https://huggingface.co/datasets/Idavidrein/gpqa) | |
| - [HLE Dataset](https://huggingface.co/datasets/cais/hle) | |
| 2. Log in to Hugging Face (the CLI ships with the `huggingface_hub` package): | |
| ```bash | |
| pip install -U "huggingface_hub[cli]" | |
| huggingface-cli login | |
| ``` | |
| 3. Re-run the evaluation script | |
| ## Submitting Results to Hugging Face | |
| ### Option 1: Create New Model Repo | |
| 1. Go to https://huggingface.co/new-model | |
| 2. Create a repo for your echo model (e.g., `your-username/echo`) | |
| 3. Clone it locally: | |
| ```bash | |
| git clone https://huggingface.co/your-username/echo | |
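| ``` | |
| 4. From the directory where you ran the clone (assumed to also contain `.eval_results/`), copy the evaluation results into the clone. Omit the trailing slash on the source so that the directory itself is copied under both GNU and BSD `cp`: | |
| ```bash | |
| cp -r .eval_results echo/ | |
| ``` | |
| 5. Commit and push: | |
| ```bash | |
| cd echo | |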
| git add .eval_results/ | |
| git commit -m "Add benchmark evaluation results | |
| Results from evaluations using Inspect AI framework: | |
| - GSM8K: [score] | |
| - MMLU-Pro: [score] | |
| Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>" | |
| git push | |
| ``` | |
| ### Option 2: Submit via Pull Request | |
| If the model repo already exists but you don't have write access: | |
| 1. Go to the model page on Hugging Face | |
| 2. Click the "Community" tab | |
| 3. Open a Pull Request | |
| 4. Add your `.eval_results/*.yaml` files | |
| 5. The PR will show as "community-provided" on the model page | |
| ## Results Format | |
| Generated YAML files follow the Hugging Face eval results schema: | |
| ```yaml | |
| - dataset: | |
|     id: openai/gsm8k # Benchmark dataset ID | |
|     task_id: default # Task identifier | |
|   value: 0.8542 # Your model's score | |
|   date: "2026-02-07" # Evaluation date | |
|   source: | |
|     url: https://... # Link to evaluation logs | |
|     name: "Evaluation Results" | |
|   notes: "Evaluated using Ollama" | |
| ``` | |
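
To show how one such entry could be produced from a score, here is a minimal sketch that formats the fields by hand. The function name `format_eval_entry` is hypothetical, and the nesting (`id` and `task_id` under `dataset`, `url` and `name` under `source`) is an assumption based on the field names in the example above. The sketch does no escaping, so it assumes the values contain no quotes or colons that would need it; a real script would more likely use a YAML library.

```python
def format_eval_entry(
    dataset_id: str,
    task_id: str,
    value: float,
    date: str,
    source_url: str,
    source_name: str,
    notes: str,
) -> str:
    """Format one eval-results entry as YAML text (hypothetical helper).

    Assumes the nested layout suggested by the field names; no escaping
    is performed on the string values.
    """
    lines = [
        "- dataset:",
        f"    id: {dataset_id}",
        f"    task_id: {task_id}",
        f"  value: {value:.4f}",
        f'  date: "{date}"',
        "  source:",
        f"    url: {source_url}",
        f'    name: "{source_name}"',
        f'  notes: "{notes}"',
    ]
    return "\n".join(lines) + "\n"


entry = format_eval_entry(
    dataset_id="openai/gsm8k",
    task_id="default",
    value=0.8542,
    date="2026-02-07",
    source_url="https://example.invalid/logs",  # placeholder URL
    source_name="Evaluation Results",
    notes="Evaluated using Ollama",
)
print(entry)
```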
| ## Customization | |
| ### Adjust Sample Size | |
| Edit `evaluate_benchmarks.py` and modify the `num_samples` parameter: | |
| ```python | |
| # Evaluate fewer samples for faster testing | |
| gsm8k_score = run_gsm8k_evaluation(model, num_samples=10) | |
| # Evaluate a larger sample for more stable results (the full GSM8K test split has 1,319 questions) | |
| gsm8k_score = run_gsm8k_evaluation(model, num_samples=1000) | |
| ``` | |
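
When comparing runs at a reduced sample size, it helps if every run sees the same questions. The README does not say whether `run_gsm8k_evaluation` takes the first `num_samples` items or samples at random, so the following is only a sketch of one reproducible option; `pick_sample_indices` and its `seed` parameter are hypothetical.

```python
import random


def pick_sample_indices(dataset_size: int, num_samples: int, seed: int = 0) -> list:
    """Return sorted indices of a reproducible random subset.

    The same (dataset_size, num_samples, seed) always yields the same
    indices, so scores from separate runs are comparable.
    """
    k = min(num_samples, dataset_size)  # never ask for more than exists
    return sorted(random.Random(seed).sample(range(dataset_size), k))


# Ten questions out of the 1,319 in the GSM8K test split
indices = pick_sample_indices(1319, 10, seed=0)
print(indices)
```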
| ### Change Model Name | |
| If your model is registered in Ollama under a name other than "echo": | |
| ```python | |
| # Option 1: Specify directly in the script | |
| model = "your-model-name" | |
| # Option 2: Pass as command line argument (modify script to accept args) | |
| ``` | |
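
Option 2 could be implemented with `argparse` from the standard library. This is a sketch of one possible interface, not the script's current behavior: the flag names `--model` and `--num-samples` and the default of 10 samples are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for the evaluation script (hypothetical flags)."""
    parser = argparse.ArgumentParser(
        description="Run benchmark evaluations against a local Ollama model."
    )
    parser.add_argument(
        "--model",
        default="echo",
        help="Model name as shown by `ollama list` (default: echo)",
    )
    parser.add_argument(
        "--num-samples",
        type=int,
        default=10,  # assumed default, chosen for quick test runs
        help="Number of questions to evaluate per benchmark",
    )
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try without a shell.
args = parse_args(["--model", "my-echo", "--num-samples", "50"])
print(args.model, args.num_samples)
```

With this in place, an invocation would look like `python3 evaluate_benchmarks.py --model my-echo --num-samples 50`.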
| ## Troubleshooting | |
| ### "Cannot connect to Ollama" | |
| - Ensure Ollama is running: `ollama serve` | |
| - Check it's listening on port 11434: `curl http://localhost:11434/api/tags` | |
| - Verify echo model exists: `ollama list` | |
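
The checks above can also be scripted. The sketch below queries the same `/api/tags` endpoint and looks for the model name in the returned list. The function names `model_is_listed` and `check_ollama` are hypothetical, and the tag-stripping rule (treating `echo:latest` as a match for `echo`) is an assumption about how the model was named when it was created.

```python
import json
import urllib.error
import urllib.request


def model_is_listed(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches a listed model, with or without a tag suffix."""
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def check_ollama(wanted: str = "echo", base_url: str = "http://localhost:11434") -> str:
    """Return a one-line status message describing the Ollama setup."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
            payload = json.loads(response.read())
    except (urllib.error.URLError, OSError) as exc:
        return f"Cannot connect to Ollama at {base_url}: {exc}"
    if model_is_listed(payload, wanted):
        return f"OK: model '{wanted}' is available"
    return f"Ollama is running, but model '{wanted}' is not listed"


# Example call (needs no server to be safe; it reports the failure instead):
# print(check_ollama("echo"))
```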
| ### "No module named 'datasets'" | |
| ```bash | |
| pip install datasets | |
| ``` | |
| ### Evaluation is slow | |
| - Reduce `num_samples` for testing | |
| - Use a GPU-enabled Ollama setup | |
| - Consider running overnight for full evaluations | |
| ### Scores seem incorrect | |
| - Check model output format matches expected format | |
| - Review the extraction logic in evaluation functions | |
| - Add debug prints to see model responses | |
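
For GSM8K in particular, a common source of wrong scores is answer extraction. The script's actual extraction logic is not shown here, so the following is only an illustrative sketch: it takes the last number in the text, preferring whatever follows the `####` marker that GSM8K reference answers use. The names `extract_final_number` and `is_correct` are hypothetical.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str):
    """Return the last number in `text` as a float, or None if there is none.

    If the GSM8K-style marker '####' is present, only the text after the
    last marker is searched.
    """
    marker = text.rfind("####")
    if marker != -1:
        text = text[marker + 4:]
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output: str, reference_answer: str) -> bool:
    """Compare the final numbers of the model output and the reference."""
    predicted = extract_final_number(model_output)
    expected = extract_final_number(reference_answer)
    if predicted is None or expected is None:
        return False
    return abs(predicted - expected) < 1e-6


# Debug print: show what was extracted next to the raw response
response = "She sells 9 eggs at $2 each, so she makes $18."
print(repr(response), "->", extract_final_number(response))
```

Printing the raw response next to the extracted value, as in the last lines, makes it easy to spot cases where the model answered correctly but in a format the extractor missed.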
| ## Understanding the Results | |
| ### Badges on Hugging Face | |
| Once submitted, your results may display badges: | |
| - **verified**: Evaluation with cryptographic proof (requires HF Jobs) | |
| - **community**: Submitted via open PR (not merged to main) | |
| - **leaderboard**: Links to benchmark leaderboard | |
| - **source**: Links to evaluation logs | |
| ### Leaderboard Updates | |
| After pushing results to your model repo: | |
| 1. Results appear on your model page within minutes | |
| 2. Leaderboard updates may take a few hours | |
| 3. Benchmark pages aggregate scores across all models | |
| ## Advanced: Using Inspect AI | |
| For official verified results using Inspect AI: | |
| ```bash | |
| # Install Inspect AI | |
| pip install inspect-ai | |
| # Run evaluation (requires eval.yaml files from benchmarks) | |
| inspect eval openai/gsm8k@default --model ollama/echo | |
| # Generate submission with verify token | |
| # (Requires running in Hugging Face Jobs for verification) | |
| ``` | |
| ## Resources | |
| - [Hugging Face Eval Results Documentation](https://huggingface.co/docs/hub/eval-results) | |
| - [Inspect AI Documentation](https://inspect.aisi.org.uk/) | |
| - [GPQA Benchmark](https://huggingface.co/datasets/Idavidrein/gpqa) | |
| - [MMLU-Pro Benchmark](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) | |
| - [GSM8K Benchmark](https://huggingface.co/datasets/openai/gsm8k) | |
| - [HLE Benchmark](https://huggingface.co/datasets/cais/hle) | |
| ## License | |
| This evaluation code is provided as-is for benchmarking purposes. | |