![Echo Prime Banner](media/banner.png)
# Echo Model - Benchmark Evaluation Suite
This package evaluates your echo model on major Hugging Face benchmarks and generates submission files for leaderboards.
## 🚀 BMAD Method Integration
Echo Prime now includes the **BMAD Method** (Breakthrough Method of Agile AI Driven Development) - an AI-driven agile development framework with 21+ specialized agents and 50+ guided workflows.
**Quick Start with BMAD:**
- 📖 See [BMAD_INTEGRATION.md](BMAD_INTEGRATION.md) for complete documentation
- ⚡ See [BMAD_QUICK_REFERENCE.md](BMAD_QUICK_REFERENCE.md) for quick commands
- 🧭 See [BMAD Production Readiness](docs/BMAD_PRODUCTION_READINESS.md) for release gate and artifact paths
- 💬 Run `/bmad-help` in your AI IDE to get started
**Use BMAD to:**
- Plan and implement new Echo Prime features with structured workflows
- Collaborate with specialized AI agents (PM, Architect, Developer, QA, etc.)
- Maintain comprehensive documentation and code quality
- Scale from quick bug fixes to enterprise-level features
**Location:** `/Users/noone/echo_prime/bmad/`
**BMAD Artifacts:** `/Users/noone/echo_prime/_bmad-output/`
**Local Gate Command:** `make bmad-gate`
## Benchmarks Included
1. **GSM8K** - Grade school math word problems (about 8.5K problems in total; 1,319 in the test split)
2. **MMLU-Pro** - Advanced multitask language understanding (12K questions)
3. **GPQA** - Graduate-level STEM questions (requires access approval)
4. **HLE** - Humanity's Last Exam (requires access approval)
## Prerequisites
1. **Ollama** must be running with the echo model:
```bash
ollama serve   # Start the server (it keeps running, so use a second terminal for the next command)
ollama list    # Verify echo is available
```
2. **Python 3.8+** with pip
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run Evaluations
```bash
python3 evaluate_benchmarks.py
```
This will:
- Connect to your local Ollama instance
- Run evaluations on GSM8K and MMLU-Pro (publicly accessible)
- Generate `.eval_results/*.yaml` files for Hugging Face submission
- Display results summary
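The internals of `evaluate_benchmarks.py` are not shown here, so the following is only a minimal sketch of how a script might send one prompt to a local Ollama instance. It uses Ollama's `/api/generate` REST endpoint on the default port; the helper names `build_generate_request` and `generate` are hypothetical and are not taken from the script.

```python
import json
import urllib.request


def build_generate_request(model, prompt, host="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text.

    Raises urllib.error.URLError if Ollama is not reachable.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

With Ollama running, `generate("echo", "What is 12 * 7?")` would return the model's reply as a string.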
### 3. Access Gated Benchmarks (Optional)
To evaluate on GPQA and HLE:
1. Request access:
- [GPQA Dataset](https://huggingface.co/datasets/Idavidrein/gpqa)
- [HLE Dataset](https://huggingface.co/datasets/cais/hle)
2. Log in to Hugging Face (the CLI ships with the `huggingface_hub` package):
```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
3. Re-run the evaluation script
## Submitting Results to Hugging Face
### Option 1: Create New Model Repo
1. Go to https://huggingface.co/new-model
2. Create a repo for your echo model (e.g., `your-username/echo`)
3. Clone it locally (this creates an `echo/` directory next to your evaluation files):
```bash
git clone https://huggingface.co/your-username/echo
```
4. Copy the evaluation results into the clone:
```bash
cp -r .eval_results echo/
```
5. Commit and push:
```bash
cd echo
git add .eval_results/
git commit -m "Add benchmark evaluation results
Results from evaluations using Inspect AI framework:
- GSM8K: [score]
- MMLU-Pro: [score]
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
git push
```
### Option 2: Submit via Pull Request
If the model repo already exists but you don't have write access:
1. Go to the model page on Hugging Face
2. Click the "Community" tab
3. Open a Pull Request
4. Add your `.eval_results/*.yaml` files
5. The PR will show as "community-provided" on the model page
## Results Format
Generated YAML files follow the Hugging Face eval results schema:
```yaml
- dataset:
    id: openai/gsm8k          # Benchmark dataset ID
    task_id: default          # Task identifier
  value: 0.8542               # Your model's score
  date: "2026-02-07"          # Evaluation date
  source:
    url: https://...          # Link to evaluation logs
    name: "Evaluation Results"
  notes: "Evaluated using Ollama"
```
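As an illustration of the fields above, here is a small sketch that renders one entry as YAML text. It assumes the nested layout in which `id` and `task_id` sit under `dataset`, and `url` and `name` sit under `source`. The function name `format_eval_result` is hypothetical; the actual script may build its files differently, for example with a YAML library.

```python
from datetime import date


def format_eval_result(dataset_id, task_id, value, source_url, eval_date=None,
                       source_name="Evaluation Results",
                       notes="Evaluated using Ollama"):
    """Render one eval-result entry as YAML text (hand-formatted, no YAML library)."""
    # Default to today's date in ISO format when none is given.
    eval_date = eval_date or date.today().isoformat()
    lines = [
        "- dataset:",
        f"    id: {dataset_id}",
        f"    task_id: {task_id}",
        f"  value: {value:.4f}",
        f'  date: "{eval_date}"',
        "  source:",
        f"    url: {source_url}",
        f'    name: "{source_name}"',
        f'  notes: "{notes}"',
    ]
    return "\n".join(lines) + "\n"
```

Hand-formatting keeps the sketch dependency-free, but it does not escape quotes inside `source_name` or `notes`; a YAML library is the safer choice for arbitrary text.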
## Customization
### Adjust Sample Size
Edit `evaluate_benchmarks.py` and modify the `num_samples` parameter:
```python
# Evaluate fewer samples for faster testing
gsm8k_score = run_gsm8k_evaluation(model, num_samples=10)
# Evaluate a larger sample for reported results
# (the GSM8K test split has 1,319 problems, so 1000 is still a subset)
gsm8k_score = run_gsm8k_evaluation(model, num_samples=1000)
```
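Whether the script takes the first `num_samples` items or a random subset is not stated. If you want small test runs to be comparable from one run to the next, one option is to pick a fixed, seeded subset. The helper below is a hypothetical sketch of that idea, not part of `evaluate_benchmarks.py`.

```python
import random


def pick_sample_indices(dataset_size, num_samples, seed=0):
    """Return sorted indices of a reproducible random subset of the dataset."""
    # Asking for more samples than exist simply selects everything.
    if num_samples >= dataset_size:
        return list(range(dataset_size))
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(range(dataset_size), num_samples))
```

Calling it twice with the same seed yields the same indices, so two runs with `num_samples=10` score the model on the same ten questions.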
### Change Model Name
If your model isn't exactly named "echo" in Ollama:
```python
# Option 1: Specify directly in the script
model = "your-model-name"
# Option 2: Pass as command line argument (modify script to accept args)
```
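For Option 2, a minimal sketch of the command-line handling using the standard `argparse` module. The flag names `--model` and `--num-samples` and their defaults are assumptions chosen for illustration; the script does not currently define them.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Run benchmark evaluations against an Ollama model."
    )
    parser.add_argument("--model", default="echo",
                        help="Model name as shown by `ollama list`")
    parser.add_argument("--num-samples", type=int, default=10,
                        help="Number of questions to evaluate per benchmark")
    return parser.parse_args(argv)
```

The script could then call `args = parse_args()` and pass `args.model` and `args.num_samples` to the evaluation functions, invoked as `python3 evaluate_benchmarks.py --model your-model-name --num-samples 100`.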
## Troubleshooting
### "Cannot connect to Ollama"
- Ensure Ollama is running: `ollama serve`
- Check it's listening on port 11434: `curl http://localhost:11434/api/tags`
- Verify echo model exists: `ollama list`
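The same checks can be done from Python before starting a long run. This sketch queries the `/api/tags` endpoint used in the `curl` command above and looks for the model name in the response; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload):
    """Return the model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted="echo"):
    """True if `wanted` is listed, ignoring an optional ':tag' suffix such as ':latest'."""
    for name in model_names(tags_payload):
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=5):
    """Fetch and parse the tag list; raises urllib.error.URLError if Ollama is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A `URLError` from `fetch_tags()` points to the server not running, while `has_model(fetch_tags())` returning `False` points to a missing or differently named model.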
### "No module named 'datasets'"
```bash
pip install datasets
```
### Evaluation is slow
- Reduce `num_samples` for testing
- Use a GPU-enabled Ollama setup
- Consider running overnight for full evaluations
### Scores seem incorrect
- Check model output format matches expected format
- Review the extraction logic in evaluation functions
- Add debug prints to see model responses
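When reviewing the extraction logic, it can help to compare it against a simple baseline. The sketch below shows one common approach for GSM8K: reference answers in the dataset end with `#### <number>`, and the last number in the model's response is treated as its final answer. This is an assumption about a reasonable strategy, not a description of what `evaluate_benchmarks.py` does.

```python
import re

# An optional minus sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(number_text):
    """Convert text like ' 1,234 ' or '$18' to a float, or None if it is not a number."""
    cleaned = number_text.strip().replace(",", "").replace("$", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def extract_gold(answer_field):
    """Read the reference answer that follows the final '####' marker."""
    return normalize(answer_field.split("####")[-1])


def extract_prediction(response):
    """Treat the last number in the model response as its final answer."""
    matches = _NUMBER.findall(response)
    return normalize(matches[-1]) if matches else None


def is_correct(response, answer_field):
    """Compare the predicted and reference answers numerically."""
    gold = extract_gold(answer_field)
    pred = extract_prediction(response)
    return gold is not None and pred is not None and abs(gold - pred) < 1e-6
```

Printing the raw response next to `extract_prediction(response)` for a handful of questions usually shows quickly whether low scores come from the model or from the parsing.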
## Understanding the Results
### Badges on Hugging Face
Once submitted, your results may display badges:
- **🔒 verified**: Evaluation with cryptographic proof (requires HF Jobs)
- **👥 community**: Submitted via open PR (not merged to main)
- **📊 leaderboard**: Links to benchmark leaderboard
- **📄 source**: Links to evaluation logs
### Leaderboard Updates
After pushing results to your model repo:
1. Results appear on your model page within minutes
2. Leaderboard updates may take a few hours
3. Benchmark pages aggregate scores across all models
## Advanced: Using Inspect AI
For official verified results using Inspect AI:
```bash
# Install Inspect AI
pip install inspect-ai
# Run evaluation (requires eval.yaml files from benchmarks)
inspect eval openai/gsm8k@default --model ollama/echo
# Generate submission with verify token
# (Requires running in Hugging Face Jobs for verification)
```
## Resources
- [Hugging Face Eval Results Documentation](https://huggingface.co/docs/hub/eval-results)
- [Inspect AI Documentation](https://inspect.aisi.org.uk/)
- [GPQA Benchmark](https://huggingface.co/datasets/Idavidrein/gpqa)
- [MMLU-Pro Benchmark](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
- [GSM8K Benchmark](https://huggingface.co/datasets/openai/gsm8k)
- [HLE Benchmark](https://huggingface.co/datasets/cais/hle)
## License
This evaluation code is provided as-is for benchmarking purposes.
