---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
## Features
- Automatic Model Discovery: Fetch popular text-generation models with inference providers from Hugging Face Hub
- Batch Job Launching: Run evaluation jobs for multiple model-provider combinations from a configuration file
- Results Table Dashboard: View all jobs with model, provider, last run, status, current score, and previous score
- Score Tracking: Automatically extracts average scores from completed jobs and tracks history
- Persistent Storage: Results saved to HuggingFace dataset for persistence across restarts
- Individual Job Relaunch: Easily relaunch specific model-provider combinations
- Real-time Monitoring: Auto-refresh results table every 30 seconds
- Daily Checkpoint: Automatic daily save at midnight to preserve state
## Setup

### Prerequisites
- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face
### Installation

- Clone or navigate to this repository:

```bash
cd InferenceProviderTestingBackend
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="your_huggingface_token_here"
```
Important: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
## Usage

### Starting the Dashboard
Run the Gradio app:
```bash
python app.py
```

The dashboard will be available at http://localhost:7860.
### Initialize Models and Providers

Click the "Fetch and Initialize Models/Providers" button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```
Format: model_name provider_name (separated by spaces or tabs)
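Parsing this file is a simple whitespace split per line; a minimal sketch (the function name is illustrative, not necessarily what `utils/io.py` uses):

```python
def read_model_provider_pairs(path="models_providers.txt"):
    """Read (model, provider) pairs from the config file, skipping blank lines."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()  # splits on any run of spaces or tabs
            if len(parts) >= 2:
                model, provider = parts[0], parts[1]
                pairs.append((model, provider))
    return pairs
```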
### Launching Jobs

- Enter the evaluation tasks in the Tasks field (e.g., `lighteval|mmlu|0|0`)
- Verify the config file path (default: `models_providers.txt`)
- Click "Launch Jobs"
The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
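As a rough sketch, the launch step amounts to one Jobs-API call per pair. This assumes the `run_job` helper from recent `huggingface_hub` releases; the exact parameter names, the returned object's attributes, and especially the model/provider argument syntax passed to lighteval are assumptions here, not the app's actual code:

```python
from huggingface_hub import run_job  # Jobs API helper; assumed available in recent huggingface_hub

def launch_all(pairs, tasks="lighteval|mmlu|0|0"):
    """Launch one evaluation job per (model, provider) pair and record its job ID."""
    job_ids = {}
    for model, provider in pairs:
        job = run_job(
            image="hf.co/spaces/OpenEvals/EvalsOnTheHub",  # image from "Job Command Details"
            command=[
                "lighteval", "endpoint", "inference-providers",
                f"model_name={model},provider={provider}",  # ASSUMPTION: lighteval model-spec syntax
                tasks,
                "--push-to-hub", "--save-details", "--results-org", "IPTesting",
            ],
            namespace="IPTesting",  # run under the IPTesting organization
        )
        job_ids[(model, provider)] = job.id  # ASSUMPTION: returned object exposes .id
        print(f"Launched {model} on {provider}: job {job.id}")
    return job_ids
```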
### Monitoring Jobs
The Job Results table displays all jobs with:
- Model: The model being tested
- Provider: The inference provider
- Last Run: Timestamp of when the job was last launched
- Status: Current status (running/complete/failed/cancelled)
- Current Score: Average score from the most recent run
- Previous Score: Average score from the prior run (for comparison)
The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
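One way to wire up a 30-second auto-refresh plus a manual button in Gradio 5 is with `gr.Timer`; this is a generic sketch, not necessarily how `app.py` implements it (the `build_results_table` helper is a placeholder):

```python
import gradio as gr

def build_results_table():
    # Placeholder: the real app would render the shared job_results into table rows
    return []

with gr.Blocks() as demo:
    table = gr.Dataframe(
        headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"]
    )
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)                                    # fires every 30 seconds
    timer.tick(build_results_table, outputs=table)          # automatic refresh
    refresh_btn.click(build_results_table, outputs=table)   # manual refresh

demo.launch()
```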
### Relaunching Individual Jobs
To rerun a specific model-provider combination:
- Enter the model name (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
- Enter the provider name (e.g., `fireworks-ai`)
- Optionally modify the tasks
- Click "Relaunch Job"
When relaunching, the current score automatically moves to previous score for comparison.
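Conceptually, that rotation is just a field swap on the stored entry before the new job starts (the field names here are illustrative):

```python
def rotate_scores(entry):
    """Move the current score into the previous-score slot ahead of a relaunch."""
    entry["previous_score"] = entry.get("current_score")
    entry["current_score"] = None  # populated again once the relaunched job completes
    return entry
```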
## Configuration

### Tasks Format
The tasks parameter follows the lighteval format. Examples:
- `lighteval|mmlu|0|0` - MMLU benchmark
- `lighteval|hellaswag|0|0` - HellaSwag benchmark
### Daily Checkpoint
The system automatically saves all results to the HuggingFace dataset at 00:00 (midnight) every day. This ensures data persistence and prevents data loss from long-running sessions.
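This is the APScheduler cron job mentioned under Architecture; a minimal sketch of that wiring (the `save_results_to_dataset` name is illustrative):

```python
from apscheduler.schedulers.background import BackgroundScheduler

def save_results_to_dataset():
    """Push the in-memory job results to the HuggingFace dataset (see Data Persistence)."""
    ...

scheduler = BackgroundScheduler()
scheduler.add_job(save_results_to_dataset, "cron", hour=0, minute=0)  # every day at 00:00
scheduler.start()
```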
### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
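For example, the results can be pulled with the `datasets` library; the split name and column layout below are assumptions about how the app writes the data:

```python
from datasets import load_dataset

# Requires a token with read access to the IPTesting organization
results = load_dataset("IPTesting/inference-provider-test-results", split="train")

print(results.column_names)  # inspect whatever schema the app stored
print(results[0])            # first stored job result
```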
## Job Command Details
Each job runs with the following configuration:
- Image: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
- Command: `lighteval endpoint inference-providers`
- Namespace: `IPTesting`
- Flags: `--push-to-hub --save-details --results-org IPTesting`
Results are automatically pushed to the IPTesting organization on Hugging Face Hub.
## Architecture
- Main Thread: Runs the Gradio interface
- Monitor Thread: Updates job statuses every 30 seconds and extracts scores from completed jobs
- APScheduler: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- Thread-safe Operations: Uses locks to prevent race conditions when accessing job_results
- HuggingFace Dataset Storage: Persists results to the `IPTesting/inference-provider-test-results` dataset
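The thread-safety point amounts to guarding every read and write of the shared `job_results` structure with one lock; a simplified sketch of the monitor loop (the `check_status` and `extract_score` helpers are stand-ins for the real calls in `utils/jobs.py`):

```python
import threading
import time

job_results = {}                  # shared state: (model, provider) -> result dict
results_lock = threading.Lock()   # guards every access to job_results

def monitor_loop(check_status, extract_score):
    """Poll running jobs every 30 seconds and record statuses/scores under the lock."""
    while True:
        with results_lock:
            snapshot = dict(job_results)  # copy under the lock, do slow work outside it
        for key, entry in snapshot.items():
            if entry.get("status") != "running":
                continue
            status = check_status(entry["job_id"])
            with results_lock:
                job_results[key]["status"] = status
                if status == "complete":
                    job_results[key]["current_score"] = extract_score(entry["job_id"])
        time.sleep(30)

# Started as a daemon thread so it never blocks the Gradio main thread:
# threading.Thread(target=monitor_loop, args=(check_status, extract_score), daemon=True).start()
```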
## Troubleshooting

### Jobs Not Launching
- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages
### Empty Models List
- Ensure you have internet connectivity
- The Hugging Face Hub API must be accessible
- Try running the initialization again
### Job Status Not Updating
- Check your internet connection
- Verify the job IDs are valid
- Check console output for API errors
### Scores Not Appearing
- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:
  ```
  | Task                     | Version | Metric                  | Value  | Stderr |
  | extended:ifeval:0        |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 |         | gpqa_pass@k_with_k      | 0.5000 | 0.0503 |
  ```

- If scores don't appear, check console output for extraction errors or parsing issues
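A simplified version of that extraction, assuming the pipe-delimited table shown above appears verbatim in the job logs (the real parser in `utils/jobs.py` may handle more cases):

```python
def extract_average_score(log_text):
    """Average the first Value seen for each task in the results table found in job logs."""
    scores = {}
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows look like: | <task> | <version> | <metric> | <value> | <stderr> |
        if len(cells) < 6:
            continue
        task, value = cells[1], cells[4]
        if not task or task == "Task" or set(task) <= set("-: "):
            continue                 # skip the header row and any separator rows
        if task in scores:
            continue                 # first row where the task name appears wins
        try:
            scores[task] = float(value)
        except ValueError:
            continue
    return sum(scores.values()) / len(scores) if scores else None
```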
## Files

- `app.py` - Main Gradio application with UI and job management
- `utils/` - Utility package with helper modules:
  - `utils/io.py` - I/O operations: model/provider fetching, file operations, dataset persistence
  - `utils/jobs.py` - Job management: launching, monitoring, score extraction
  - `utils/__init__.py` - Package initialization and exports
- `models_providers.txt` - Configuration file with model-provider combinations
- `requirements.txt` - Python dependencies
- `README.md` - This file
## License
This project is provided as-is for evaluation testing purposes.