---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
## Setup

### Prerequisites

- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:

   ```bash
   cd InferenceProviderTestingBackend
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your Hugging Face token as an environment variable:

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```

**Important**: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
## Usage

### Starting the Dashboard

Run the Gradio app:

```bash
python app.py
```
### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:

   ```
   meta-llama/Llama-3.2-3B-Instruct fireworks-ai
   meta-llama/Llama-3.2-3B-Instruct together-ai
   Qwen/Qwen2.5-7B-Instruct fireworks-ai
   mistralai/Mistral-7B-Instruct-v0.3 together-ai
   ```

Format: `model_name provider_name` (separated by spaces or tabs)
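
As a sketch, parsing this file comes down to splitting each non-empty line on whitespace (the function name here is illustrative; the actual file handling lives in `utils/io.py`):

```python
def parse_models_providers(text: str) -> list[tuple[str, str]]:
    """Parse model/provider lines; blank lines and extra whitespace are ignored."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()  # splits on any run of spaces or tabs
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs
```

Because `str.split()` with no argument collapses runs of whitespace, spaces and tabs are handled uniformly.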
### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
### Monitoring Jobs

The **Job Results** table displays all jobs with:
- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; substitute it into `https://huggingface.co/jobs/NAMESPACE/JOBID` to inspect the job

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.
## Configuration

### Tasks Format

The tasks parameter follows the lighteval format. Example:
- `lighteval|mmlu|0|0` - MMLU benchmark
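
For illustration, such a task string can be split into its pipe-separated fields. The field names below are an assumption based on the common lighteval convention (suite, task name, few-shot count, and an optional trailing flag) and are not confirmed by this repo:

```python
def parse_task(task: str) -> dict:
    """Split a lighteval-style task string into named fields (assumed layout)."""
    fields = task.split("|")
    parsed = {"suite": fields[0], "task": fields[1], "few_shot": int(fields[2])}
    if len(fields) > 3:
        # assumed: a trailing 0/1 flag, present in some lighteval versions
        parsed["truncate_few_shot"] = int(fields[3])
    return parsed
```
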
### Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles the daily checkpoint save at midnight (cron-based)
- **Thread safety**: Locks prevent concurrent-access issues when reading or updating job results
- **HuggingFace Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
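
The locking pattern behind the thread-safety point can be sketched with the standard library. All names here are illustrative stand-ins, not the app's actual identifiers:

```python
import threading

job_results = {}                 # shared state: (model, provider) -> status info
results_lock = threading.Lock()  # guards every read and write of job_results

def update_status(model: str, provider: str, status: str) -> None:
    """Called from the monitor thread after polling a job."""
    with results_lock:
        job_results[(model, provider)] = {"status": status}

def snapshot() -> dict:
    """Copy under the lock so the UI thread never renders a half-written update."""
    with results_lock:
        return dict(job_results)
```

Returning a copy from `snapshot()` keeps the lock's critical section short: the UI can format the copy at leisure while the monitor thread keeps writing.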
## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review the logs for specific error messages
### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in the job logs
- It takes the score for each task from the first row in which the task name appears
- The final score is the average of all task scores
- Example table format:

  ```
  | Task | Version | Metric | Value | Stderr |
  | extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
  ```

- If scores don't appear, check the console output for extraction or parsing errors
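
The extraction steps above can be sketched as follows. This is a simplified stand-in for the real parser in `utils/jobs.py`, assuming the table format shown:

```python
from typing import Optional

def extract_average_score(log_text: str) -> Optional[float]:
    """Average the Value column over the first row seen for each task."""
    scores = {}
    for line in log_text.splitlines():
        # data rows look like: | task_name | version | metric | value | stderr |
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 5 and cells[0] not in ("", "Task") and cells[0] not in scores:
            try:
                scores[cells[0]] = float(cells[3])
            except ValueError:
                continue  # header or separator rows carry no numeric value
    return sum(scores.values()) / len(scores) if scores else None
```

On the example table above this averages `0.9100` and `0.5000` to `0.705`; logs with no recognizable table yield `None`.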
## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file