---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
## Features
- Automatic Model Discovery: Fetch popular text-generation models with inference providers from Hugging Face Hub
- Batch Job Launching: Run evaluation jobs for multiple model-provider combinations from a configuration file
- Results Table Dashboard: View all jobs with model, provider, last run, status, current score, and previous score
- Score Tracking: Automatically extracts average scores from completed jobs and tracks history
- Persistent Storage: Results saved to HuggingFace dataset for persistence across restarts
- Individual Job Relaunch: Easily relaunch specific model-provider combinations
- Real-time Monitoring: Auto-refresh results table every 30 seconds
- Daily Checkpoint: Automatic daily save at midnight to preserve state
## Setup

### Prerequisites
- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face
### Installation

- Clone or navigate to this repository:

```bash
cd InferenceProviderTestingBackend
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="your_huggingface_token_here"
```
Important: Your `HF_TOKEN` must have:
- Permission to call inference providers
- Write access to the `IPTesting` organization
## Usage

### Starting the Dashboard
Run the Gradio app:
```bash
python app.py
```

The dashboard will be available at http://localhost:7860.
### Initialize Models and Providers

Click the "Fetch and Initialize Models/Providers" button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```
Format: model_name provider_name (separated by spaces or tabs)
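Parsing this file is a simple whitespace split per line; a minimal sketch (the function name is illustrative, not necessarily what `utils/io.py` uses):

```python
def read_model_provider_pairs(path="models_providers.txt"):
    """Read (model, provider) pairs from the config file, skipping blank lines."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()  # splits on any run of spaces or tabs
            if len(parts) >= 2:
                model, provider = parts[0], parts[1]
                pairs.append((model, provider))
    return pairs
```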
### Launching Jobs

- Enter the evaluation tasks in the Tasks field (e.g., `lighteval|mmlu|0|0`)
- Verify the config file path (default: `models_providers.txt`)
- Click "Launch Jobs"
The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
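As a rough sketch, the launch step amounts to one Jobs-API call per pair. This assumes the `run_job` helper from recent `huggingface_hub` releases; the exact parameter names, the returned object's attributes, and especially the model/provider argument syntax passed to lighteval are assumptions here, not the app's actual code:

```python
from huggingface_hub import run_job  # Jobs API helper; assumed available in recent huggingface_hub

def launch_all(pairs, tasks="lighteval|mmlu|0|0"):
    """Launch one evaluation job per (model, provider) pair and record its job ID."""
    job_ids = {}
    for model, provider in pairs:
        job = run_job(
            image="hf.co/spaces/OpenEvals/EvalsOnTheHub",  # image from "Job Command Details"
            command=[
                "lighteval", "endpoint", "inference-providers",
                f"model_name={model},provider={provider}",  # ASSUMPTION: lighteval model-spec syntax
                tasks,
                "--push-to-hub", "--save-details", "--results-org", "IPTesting",
            ],
            namespace="IPTesting",  # run under the IPTesting organization
        )
        job_ids[(model, provider)] = job.id  # ASSUMPTION: returned object exposes .id
        print(f"Launched {model} on {provider}: job {job.id}")
    return job_ids
```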
### Monitoring Jobs
The Job Results table displays all jobs with:
- Model: The model being tested
- Provider: The inference provider
- Last Run: Timestamp of when the job was last launched
- Status: Current status (running/complete/failed/cancelled)
- Current Score: Average score from the most recent run
- Previous Score: Average score from the prior run (for comparison)
The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
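One way to wire up a 30-second auto-refresh plus a manual button in Gradio 5 is with `gr.Timer`; this is a generic sketch, not necessarily how `app.py` implements it (the `build_results_table` helper is a placeholder):

```python
import gradio as gr

def build_results_table():
    # Placeholder: the real app would render the shared job_results into table rows
    return []

with gr.Blocks() as demo:
    table = gr.Dataframe(
        headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"]
    )
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)                                    # fires every 30 seconds
    timer.tick(build_results_table, outputs=table)          # automatic refresh
    refresh_btn.click(build_results_table, outputs=table)   # manual refresh

demo.launch()
```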
### Relaunching Individual Jobs
To rerun a specific model-provider combination:
- Enter the model name (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
- Enter the provider name (e.g., `fireworks-ai`)
- Optionally modify the tasks
- Click "Relaunch Job"
When relaunching, the current score automatically moves to previous score for comparison.
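Conceptually, that rotation is just a field swap on the stored entry before the new job starts (the field names here are illustrative):

```python
def rotate_scores(entry):
    """Move the current score into the previous-score slot ahead of a relaunch."""
    entry["previous_score"] = entry.get("current_score")
    entry["current_score"] = None  # populated again once the relaunched job completes
    return entry
```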
## Configuration

### Tasks Format
The tasks parameter follows the lighteval format. Examples:
- `lighteval|mmlu|0|0` - MMLU benchmark
- `lighteval|hellaswag|0|0` - HellaSwag benchmark
### Daily Checkpoint
The system automatically saves all results to the HuggingFace dataset at 00:00 (midnight) every day. This ensures data persistence and prevents data loss from long-running sessions.
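This is the APScheduler cron job mentioned under Architecture; a minimal sketch of that wiring (the `save_results_to_dataset` name is illustrative):

```python
from apscheduler.schedulers.background import BackgroundScheduler

def save_results_to_dataset():
    """Push the in-memory job results to the HuggingFace dataset (see Data Persistence)."""
    ...

scheduler = BackgroundScheduler()
scheduler.add_job(save_results_to_dataset, "cron", hour=0, minute=0)  # every day at 00:00
scheduler.start()
```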
### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:
- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library
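For example, the results can be pulled with the `datasets` library; the split name and column layout below are assumptions about how the app writes the data:

```python
from datasets import load_dataset

# Requires a token with read access to the IPTesting organization
results = load_dataset("IPTesting/inference-provider-test-results", split="train")

print(results.column_names)  # inspect whatever schema the app stored
print(results[0])            # first stored job result
```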
## Job Command Details
Each job runs with the following configuration:
- Image: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
- Command: `lighteval endpoint inference-providers`
- Namespace: `IPTesting`
- Flags: `--push-to-hub --save-details --results-org IPTesting`
Results are automatically pushed to the IPTesting organization on Hugging Face Hub.
## Architecture
- Main Thread: Runs the Gradio interface
- Monitor Thread: Updates job statuses every 30 seconds and extracts scores from completed jobs
- APScheduler: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- Thread-safe Operations: Uses locks to prevent race conditions when accessing job_results
- HuggingFace Dataset Storage: Persists results to the `IPTesting/inference-provider-test-results` dataset
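The thread-safety point amounts to guarding every read and write of the shared `job_results` structure with one lock; a simplified sketch of the monitor loop (the `check_status` and `extract_score` helpers are stand-ins for the real calls in `utils/jobs.py`):

```python
import threading
import time

job_results = {}                  # shared state: (model, provider) -> result dict
results_lock = threading.Lock()   # guards every access to job_results

def monitor_loop(check_status, extract_score):
    """Poll running jobs every 30 seconds and record statuses/scores under the lock."""
    while True:
        with results_lock:
            snapshot = dict(job_results)  # copy under the lock, do slow work outside it
        for key, entry in snapshot.items():
            if entry.get("status") != "running":
                continue
            status = check_status(entry["job_id"])
            with results_lock:
                job_results[key]["status"] = status
                if status == "complete":
                    job_results[key]["current_score"] = extract_score(entry["job_id"])
        time.sleep(30)

# Started as a daemon thread so it never blocks the Gradio main thread:
# threading.Thread(target=monitor_loop, args=(check_status, extract_score), daemon=True).start()
```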
## Troubleshooting

### Jobs Not Launching
- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages
### Empty Models List
- Ensure you have internet connectivity
- The Hugging Face Hub API must be accessible
- Try running the initialization again
### Job Status Not Updating
- Check your internet connection
- Verify the job IDs are valid
- Check console output for API errors
### Scores Not Appearing
- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:
  ```
  | Task                     | Version | Metric                  | Value  | Stderr |
  | extended:ifeval:0        |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 |         | gpqa_pass@k_with_k      | 0.5000 | 0.0503 |
  ```

- If scores don't appear, check console output for extraction errors or parsing issues
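A simplified version of that extraction, assuming the pipe-delimited table shown above appears verbatim in the job logs (the real parser in `utils/jobs.py` may handle more cases):

```python
def extract_average_score(log_text):
    """Average the first Value seen for each task in the results table found in job logs."""
    scores = {}
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows look like: | <task> | <version> | <metric> | <value> | <stderr> |
        if len(cells) < 6:
            continue
        task, value = cells[1], cells[4]
        if not task or task == "Task" or set(task) <= set("-: "):
            continue                 # skip the header row and any separator rows
        if task in scores:
            continue                 # first row where the task name appears wins
        try:
            scores[task] = float(value)
        except ValueError:
            continue
    return sum(scores.values()) / len(scores) if scores else None
```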
## Files

- `app.py` - Main Gradio application with UI and job management
- `utils/` - Utility package with helper modules:
  - `utils/io.py` - I/O operations: model/provider fetching, file operations, dataset persistence
  - `utils/jobs.py` - Job management: launching, monitoring, score extraction
  - `utils/__init__.py` - Package initialization and exports
- `models_providers.txt` - Configuration file with model-provider combinations
- `requirements.txt` - Python dependencies
- `README.md` - This file
## License
This project is provided as-is for evaluation testing purposes.