---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
## Setup

### Prerequisites

- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:

   ```bash
   cd InferenceProviderTestingBackend
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your Hugging Face token as an environment variable:

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```

**Important**: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
## Usage

### Starting the Dashboard

Run the Gradio app:

```bash
python app.py
```
### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:

   ```
   meta-llama/Llama-3.2-3B-Instruct fireworks-ai
   meta-llama/Llama-3.2-3B-Instruct together-ai
   Qwen/Qwen2.5-7B-Instruct fireworks-ai
   mistralai/Mistral-7B-Instruct-v0.3 together-ai
   ```

Format: `model_name provider_name` (separated by spaces or tabs)
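
As a sketch, parsing this file comes down to splitting each non-empty line on whitespace (the function name here is illustrative; the actual file handling lives in `utils/io.py`):

```python
def parse_models_providers(text: str) -> list[tuple[str, str]]:
    """Parse model/provider lines; blank lines and extra whitespace are ignored."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()  # splits on any run of spaces or tabs
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs
```

Because `str.split()` with no argument collapses runs of whitespace, spaces and tabs are handled uniformly.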
### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
### Monitoring Jobs

The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; substitute it into `https://huggingface.co/jobs/NAMESPACE/JOBID` to inspect the job

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.
## Configuration

### Tasks Format

The tasks parameter follows the lighteval format. Example:
- `lighteval|mmlu|0|0` - MMLU benchmark
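
For illustration, such a task string can be split into its pipe-separated fields. The field names below are an assumption based on the common lighteval convention (suite, task name, few-shot count, and an optional trailing flag) and are not confirmed by this repo:

```python
def parse_task(task: str) -> dict:
    """Split a lighteval-style task string into named fields (assumed layout)."""
    fields = task.split("|")
    parsed = {"suite": fields[0], "task": fields[1], "few_shot": int(fields[2])}
    if len(fields) > 3:
        # assumed: a trailing 0/1 flag, present in some lighteval versions
        parsed["truncate_few_shot"] = int(fields[3])
    return parsed
```
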
### Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles the daily checkpoint save at midnight (cron-based)
- **Thread safety**: Locks prevent concurrent-access issues when reading or updating job results
- **HuggingFace Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
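
The locking pattern behind the thread-safety point can be sketched with the standard library. All names here are illustrative stand-ins, not the app's actual identifiers:

```python
import threading

job_results = {}                 # shared state: (model, provider) -> status info
results_lock = threading.Lock()  # guards every read and write of job_results

def update_status(model: str, provider: str, status: str) -> None:
    """Called from the monitor thread after polling a job."""
    with results_lock:
        job_results[(model, provider)] = {"status": status}

def snapshot() -> dict:
    """Copy under the lock so the UI thread never renders a half-written update."""
    with results_lock:
        return dict(job_results)
```

Returning a copy from `snapshot()` keeps the lock's critical section short: the UI can format the copy at leisure while the monitor thread keeps writing.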
## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review the logs for specific error messages
### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in the job logs
- It takes the score for each task from the first row in which the task name appears
- The final score is the average of all task scores
- Example table format:

  ```
  | Task | Version | Metric | Value | Stderr |
  | extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
  ```

- If scores don't appear, check the console output for extraction or parsing errors
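
The extraction steps above can be sketched as follows. This is a simplified stand-in for the real parser in `utils/jobs.py`, assuming the table format shown:

```python
from typing import Optional

def extract_average_score(log_text: str) -> Optional[float]:
    """Average the Value column over the first row seen for each task."""
    scores = {}
    for line in log_text.splitlines():
        # data rows look like: | task_name | version | metric | value | stderr |
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 5 and cells[0] not in ("", "Task") and cells[0] not in scores:
            try:
                scores[cells[0]] = float(cells[3])
            except ValueError:
                continue  # header or separator rows carry no numeric value
    return sum(scores.values()) / len(scores) if scores else None
```

On the example table above this averages `0.9100` and `0.5000` to `0.705`; logs with no recognizable table yield `None`.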
## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file