workofarttattoo/echo_prime / README.md

workofarttattoo

23 days ago

	![Echo Prime Banner](media/banner.png)
	# Echo Model - Benchmark Evaluation Suite

	This package evaluates your echo model on major Hugging Face benchmarks and generates submission files for leaderboards.

	## 🚀 BMAD Method Integration

	Echo Prime now includes the BMAD Method (Breakthrough Method of Agile AI Driven Development) - an AI-driven agile development framework with 21+ specialized agents and 50+ guided workflows.

	Quick Start with BMAD:
	- 📖 See [BMAD_INTEGRATION.md](BMAD_INTEGRATION.md) for complete documentation
	- ⚡ See [BMAD_QUICK_REFERENCE.md](BMAD_QUICK_REFERENCE.md) for quick commands
	- 🧭 See [BMAD Production Readiness](docs/BMAD_PRODUCTION_READINESS.md) for release gate and artifact paths
	- 💬 Run `/bmad-help` in your AI IDE to get started

	Use BMAD to:
	- Plan and implement new Echo Prime features with structured workflows
	- Collaborate with specialized AI agents (PM, Architect, Developer, QA, etc.)
	- Maintain comprehensive documentation and code quality
	- Scale from quick bug fixes to enterprise-level features

	- Location: `/Users/noone/echo_prime/bmad/`
	- BMAD artifacts: `/Users/noone/echo_prime/_bmad-output/`
	- Local gate command: `make bmad-gate`

	## Benchmarks Included

	1. GSM8K - Grade school math word problems (about 8.5K problems, 1,319 of them in the test split)
	2. MMLU-Pro - Advanced multitask language understanding (12K questions)
	3. GPQA - Graduate-level STEM questions (requires access approval)
	4. HLE - Humanity's Last Exam (requires access approval)

	## Prerequisites

	1. Ollama must be running with the echo model:
	```bash
	ollama serve
	ollama list # Verify echo is available
	```

	2. Python 3.8+ with pip

	## Quick Start

	### 1. Install Dependencies

	```bash
	pip install -r requirements.txt
	```

	### 2. Run Evaluations

	```bash
	python3 evaluate_benchmarks.py
	```

	This will:
	- Connect to your local Ollama instance
	- Run evaluations on GSM8K and MMLU-Pro (publicly accessible)
	- Generate `.eval_results/*.yaml` files for Hugging Face submission
	- Display results summary
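The first step above, connecting to the local Ollama instance, can be sketched as follows. This is a minimal illustration using only the standard library and Ollama's `/api/generate` endpoint on the default port; the helper names `build_generate_request` and `ask_ollama` are hypothetical, and `evaluate_benchmarks.py` may talk to Ollama differently.

```python
import json
import urllib.request

# Default address of a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With Ollama running, `ask_ollama("echo", "What is 12 * 7?")` would return the model's raw text, which the evaluation then has to parse into an answer.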

	### 3. Access Gated Benchmarks (Optional)

	To evaluate on GPQA and HLE:

	1. Request access:
	- [GPQA Dataset](https://huggingface.co/datasets/Idavidrein/gpqa)
	- [HLE Dataset](https://huggingface.co/datasets/cais/hle)

	2. Log in to Hugging Face (the CLI ships with the `huggingface_hub` package):
	```bash
	pip install -U "huggingface_hub[cli]"
	huggingface-cli login
	```

	3. Re-run the evaluation script

	## Submitting Results to Hugging Face

	### Option 1: Create New Model Repo

	1. Go to https://huggingface.co/new-model
	2. Create a repo for your echo model (e.g., `your-username/echo`)
	3. Clone it locally, next to your `.eval_results/` directory:
	```bash
	git clone https://huggingface.co/your-username/echo your-echo-repo
	```

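	4. Copy the evaluation results into the clone (no trailing slash on the source, so the directory itself is copied on both macOS and Linux):
	```bash
	cp -r .eval_results your-echo-repo/
	```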

	5. Commit and push:
	```bash
	cd your-echo-repo
	git add .eval_results/
	git commit -m "Add benchmark evaluation results

	Results from evaluations using Inspect AI framework:
	- GSM8K: [score]
	- MMLU-Pro: [score]

	Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
	git push
	```

	### Option 2: Submit via Pull Request

	If the model repo already exists but you don't have write access:

	1. Go to the model page on Hugging Face
	2. Click the "Community" tab
	3. Open a Pull Request
	4. Add your `.eval_results/*.yaml` files
	5. The PR will show as "community-provided" on the model page

	## Results Format

	Generated YAML files follow the Hugging Face eval results schema:

	```yaml
	- dataset:
	    id: openai/gsm8k        # Benchmark dataset ID
	    task_id: default        # Task identifier
	  value: 0.8542             # Your model's score
	  date: "2026-02-07"        # Evaluation date
	  source:
	    url: https://...        # Link to evaluation logs
	    name: "Evaluation Results"
	  notes: "Evaluated using Ollama"
	```
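As a rough sketch of how one such entry could be produced, the function below formats a result as a YAML string with the standard library only. It assumes that `id` and `task_id` nest under `dataset` and that `url` and `name` nest under `source`; `format_eval_result` is a hypothetical helper, not a function from `evaluate_benchmarks.py`.

```python
from datetime import date
from typing import Optional


def format_eval_result(
    dataset_id: str,
    value: float,
    task_id: str = "default",
    eval_date: Optional[str] = None,
    source_url: Optional[str] = None,
    notes: Optional[str] = None,
) -> str:
    """Format one eval result entry as YAML text (nesting is an assumption)."""
    eval_date = eval_date or date.today().isoformat()
    lines = [
        "- dataset:",
        f"    id: {dataset_id}",
        f"    task_id: {task_id}",
        f"  value: {value}",
        f'  date: "{eval_date}"',
    ]
    # The source block is optional; emit it only when a log URL is known.
    if source_url:
        lines += ["  source:", f"    url: {source_url}", '    name: "Evaluation Results"']
    if notes:
        lines.append(f'  notes: "{notes}"')
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file under `.eval_results/` would give a file in the shape shown above. A YAML library would be the safer choice if the values may contain quotes or colons.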

	## Customization

	### Adjust Sample Size

	Edit `evaluate_benchmarks.py` and modify the `num_samples` parameter:

	```python
	# Evaluate fewer samples for faster testing
	gsm8k_score = run_gsm8k_evaluation(model, num_samples=10)

	# Evaluate a larger sample for official results (the GSM8K test split has 1,319 problems)
	gsm8k_score = run_gsm8k_evaluation(model, num_samples=1000)
	```
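When choosing `num_samples`, it helps to know how noisy a score from a small sample is. The sketch below uses the usual binomial standard error, sqrt(p(1-p)/n), as a rough guide. It treats each question as an independent trial, which is a simplifying assumption, and `accuracy_standard_error` is a hypothetical helper rather than part of the script.

```python
import math


def accuracy_standard_error(accuracy: float, num_samples: int) -> float:
    """Binomial standard error of an accuracy estimated from num_samples questions."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


# Compare the uncertainty at a few sample sizes for an observed accuracy of 0.85.
for n in (10, 100, 1000):
    print(n, round(accuracy_standard_error(0.85, n), 3))
```

At an observed accuracy of 0.85, the standard error is roughly 0.11 with 10 samples, 0.04 with 100, and 0.01 with 1000. A 10-sample run is therefore only a smoke test, not a score to report.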

	### Change Model Name

	If your model isn't exactly named "echo" in Ollama:

	```python
	# Option 1: Specify directly in the script
	model = "your-model-name"

	# Option 2: Pass as command line argument (modify script to accept args)
	```
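Option 2 could look something like the sketch below, which uses `argparse` from the standard library. The flag names `--model` and `--num-samples` and the defaults are assumptions for illustration; the script does not currently define them.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical command-line options for the evaluation script."""
    parser = argparse.ArgumentParser(
        description="Run benchmark evaluations against a local Ollama model."
    )
    parser.add_argument(
        "--model",
        default="echo",
        help="Model name as shown by `ollama list` (default: echo)",
    )
    parser.add_argument(
        "--num-samples",
        type=int,
        default=10,
        help="Number of questions to evaluate per benchmark (default: 10)",
    )
    # argv=None makes argparse read sys.argv[1:], which is what a script wants.
    return parser.parse_args(argv)
```

With this wired into the script's entry point, a run might be started as `python3 evaluate_benchmarks.py --model your-model-name --num-samples 50`, with `args.model` and `args.num_samples` passed to the evaluation functions.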

	## Troubleshooting

	### "Cannot connect to Ollama"

	- Ensure Ollama is running: `ollama serve`
	- Check it's listening on port 11434: `curl http://localhost:11434/api/tags`
	- Verify echo model exists: `ollama list`
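The same three checks can be scripted. The sketch below queries `/api/tags`, the endpoint used in the `curl` command above, and looks for the model by name. The helper names are hypothetical, and the sketch assumes the response is JSON with a `models` list whose entries carry a `name` such as `echo:latest`.

```python
import json
import urllib.error
import urllib.request


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches a listed model, with or without a tag like ':latest'."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return any(name == wanted or name.split(":")[0] == wanted for name in names)


def check_ollama(wanted: str = "echo", base_url: str = "http://localhost:11434") -> str:
    """Return a short diagnostic message about the local Ollama server."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, timeout, DNS failure, and similar problems.
        return f"Cannot connect to Ollama at {base_url}: {exc}"
    if not has_model(payload, wanted):
        return f"Ollama is running, but no model named '{wanted}' was found"
    return f"Ollama is running and '{wanted}' is available"
```

Calling `print(check_ollama())` before a long evaluation run is a cheap way to fail early.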

	### "No module named 'datasets'"

	```bash
	pip install datasets
	```

	### Evaluation is slow

	- Reduce `num_samples` for testing
	- Use a GPU-enabled Ollama setup
	- Consider running overnight for full evaluations

	### Scores seem incorrect

	- Check that the model's output format matches what the answer-extraction logic expects
	- Review the extraction logic in the evaluation functions
	- Add debug prints to compare raw model responses with the extracted answers
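For reference, one common way to score GSM8K is sketched below. Reference solutions in the dataset end with `#### <answer>`, and a simple heuristic takes the last number in the model's response as its prediction. This shows the general technique only; the function names are hypothetical, and the extraction in `evaluate_benchmarks.py` may work differently.

```python
import re
from typing import Optional

# An optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_gold_answer(solution: str) -> str:
    """Return the text after the final '####' marker of a GSM8K reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_predicted_answer(response: str) -> Optional[str]:
    """Return the last number in the response with commas removed, or None."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(response: str, solution: str) -> bool:
    """Compare predicted and gold answers numerically, so '18.00' matches '18'."""
    predicted = extract_predicted_answer(response)
    if predicted is None:
        return False
    try:
        return abs(float(predicted) - float(extract_gold_answer(solution))) < 1e-6
    except ValueError:
        return False
```

Printing `response`, `extract_predicted_answer(response)`, and the gold answer side by side for a few questions usually shows quickly whether a low score comes from the model or from the parsing. The last-number heuristic fails when the model keeps writing numbers after its final answer, so a prompt that asks for a fixed answer format tends to be more reliable.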

	## Understanding the Results

	### Badges on Hugging Face

	Once submitted, your results may display badges:

	- 🔒 verified: Evaluation with cryptographic proof (requires HF Jobs)
	- 👥 community: Submitted via open PR (not merged to main)
	- 📊 leaderboard: Links to benchmark leaderboard
	- 📄 source: Links to evaluation logs

	### Leaderboard Updates

	After pushing results to your model repo:
	1. Results appear on your model page within minutes
	2. Leaderboard updates may take a few hours
	3. Benchmark pages aggregate scores across all models

	## Advanced: Using Inspect AI

	For official verified results using Inspect AI:

	```bash
	# Install Inspect AI
	pip install inspect-ai

	# Run evaluation (requires eval.yaml files from benchmarks)
	inspect eval openai/gsm8k@default --model ollama/echo

	# Generate submission with verify token
	# (Requires running in Hugging Face Jobs for verification)
	```

	## Resources

	- [Hugging Face Eval Results Documentation](https://huggingface.co/docs/hub/eval-results)
	- [Inspect AI Documentation](https://inspect.aisi.org.uk/)
	- [GPQA Benchmark](https://huggingface.co/datasets/Idavidrein/gpqa)
	- [MMLU-Pro Benchmark](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
	- [GSM8K Benchmark](https://huggingface.co/datasets/openai/gsm8k)
	- [HLE Benchmark](https://huggingface.co/datasets/cais/hle)

	## License

	This evaluation code is provided as-is for benchmarking purposes.
