---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using the Hugging Face Jobs API.

Features

  • Automatic Model Discovery: Fetch popular text-generation models with inference providers from Hugging Face Hub
  • Batch Job Launching: Run evaluation jobs for multiple model-provider combinations from a configuration file
  • Results Table Dashboard: View all jobs with model, provider, last run, status, current score, and previous score
  • Score Tracking: Automatically extracts average scores from completed jobs and tracks history
  • Persistent Storage: Results are saved to a Hugging Face dataset so they survive app restarts
  • Individual Job Relaunch: Easily relaunch specific model-provider combinations
  • Real-time Monitoring: Auto-refresh results table every 30 seconds
  • Daily Checkpoint: Automatic daily save at midnight to preserve state

Setup

Prerequisites

  • Python 3.8+
  • Hugging Face account with API token
  • Access to the IPTesting namespace on Hugging Face

Installation

  1. Clone or navigate to this repository:
cd InferenceProviderTestingBackend
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up your Hugging Face token as an environment variable:
export HF_TOKEN="your_huggingface_token_here"

Important: Your HF_TOKEN must have:

  • Permission to call inference providers
  • Write access to the IPTesting organization

Usage

Starting the Dashboard

Run the Gradio app:

python app.py

The dashboard will be available at http://localhost:7860

Initialize Models and Providers

  1. Click the "Fetch and Initialize Models/Providers" button to automatically populate the models_providers.txt file with popular models and their available inference providers.

  2. Alternatively, manually edit models_providers.txt with your desired model-provider combinations:

meta-llama/Llama-3.2-3B-Instruct  fireworks-ai
meta-llama/Llama-3.2-3B-Instruct  together-ai
Qwen/Qwen2.5-7B-Instruct  fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3  together-ai

Format: one model_name provider_name pair per line, separated by spaces or tabs
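
As a rough sketch, a file in this format could be parsed as follows. The function name, comment handling, and error handling are illustrative, not necessarily what app.py does:

```python
# Sketch: read models_providers.txt into (model, provider) pairs.
# Assumes one whitespace-separated combination per line; blank lines and
# lines starting with "#" are skipped (an assumption, not a documented rule).
def read_model_provider_pairs(path="models_providers.txt"):
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split()  # handles both spaces and tabs
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs
```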

Launching Jobs

  1. Enter the evaluation tasks in the Tasks field (e.g., lighteval|mmlu|0|0)
  2. Verify the config file path (default: models_providers.txt)
  3. Click "Launch Jobs"

The system will:

  • Read all model-provider combinations from the config file
  • Launch a separate evaluation job for each combination
  • Log the job ID and status
  • Monitor job progress automatically

Monitoring Jobs

The Job Results table displays all jobs with:

  • Model: The model being tested
  • Provider: The inference provider
  • Last Run: Timestamp of when the job was last launched
  • Status: Current status (running/complete/failed/cancelled)
  • Current Score: Average score from the most recent run
  • Previous Score: Average score from the prior run (for comparison)

The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
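
A minimal sketch of how such an auto-refreshing table can be wired up in Gradio 5.x, assuming gr.Timer; build_results_dataframe is a hypothetical helper standing in for the app's real table-building code:

```python
# Sketch: results table with a 30-second auto-refresh and a manual refresh button.
import gradio as gr

def build_results_dataframe():
    # Hypothetical placeholder: the real app assembles rows from its job results store.
    return [["meta-llama/Llama-3.2-3B-Instruct", "fireworks-ai",
             "2025-01-01 12:00", "running", "", ""]]

with gr.Blocks() as demo:
    table = gr.Dataframe(
        headers=["Model", "Provider", "Last Run", "Status",
                 "Current Score", "Previous Score"],
        value=build_results_dataframe(),
    )
    gr.Button("Refresh Results").click(build_results_dataframe, outputs=table)
    gr.Timer(30).tick(build_results_dataframe, outputs=table)  # fires every 30 s

demo.launch()
```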

Relaunching Individual Jobs

To rerun a specific model-provider combination:

  1. Enter the model name (e.g., meta-llama/Llama-3.2-3B-Instruct)
  2. Enter the provider name (e.g., fireworks-ai)
  3. Optionally modify the tasks
  4. Click "Relaunch Job"

When a job is relaunched, its current score automatically moves to the Previous Score column so the new result can be compared against it.
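
A minimal sketch of that rotation, with illustrative field names:

```python
# Sketch of the score rotation on relaunch: the current score is shifted
# into the previous-score slot before the new run starts.
def rotate_scores(entry: dict) -> dict:
    entry["previous_score"] = entry.get("current_score")
    entry["current_score"] = None  # filled in once the new run completes
    return entry

entry = {"model": "meta-llama/Llama-3.2-3B-Instruct", "provider": "fireworks-ai",
         "current_score": 0.62, "previous_score": 0.58}
rotate_scores(entry)
# entry["previous_score"] == 0.62; entry["current_score"] is None
```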

Configuration

Tasks Format

The tasks parameter follows the lighteval format. Examples:

  • lighteval|mmlu|0|0 - MMLU benchmark
  • lighteval|hellaswag|0|0 - HellaSwag benchmark

Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at 00:00 (midnight) every day. This ensures data persistence and prevents data loss from long-running sessions.
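
Since the checkpoint is cron-based via APScheduler (see Architecture), the schedule presumably looks something like this sketch; save_results_to_dataset is a hypothetical stand-in for the app's save routine:

```python
# Sketch: daily checkpoint at 00:00 using APScheduler's BackgroundScheduler.
from apscheduler.schedulers.background import BackgroundScheduler

def save_results_to_dataset():
    # Hypothetical placeholder for pushing the current results to the HF dataset.
    print("Saving checkpoint to IPTesting/inference-provider-test-results ...")

scheduler = BackgroundScheduler()
scheduler.add_job(save_results_to_dataset, "cron", hour=0, minute=0)  # midnight, every day
scheduler.start()
```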

Data Persistence

All job results are stored in a HuggingFace dataset (IPTesting/inference-provider-test-results), which means:

  • Results persist across app restarts
  • Historical score comparisons are maintained
  • Data can be accessed programmatically via the HF datasets library (see the sketch after this list)
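
For example, the stored results can be loaded with the datasets library, assuming your token has read access to the dataset; the actual column names depend on how the app saves rows:

```python
# Sketch: load the persisted results programmatically.
from datasets import load_dataset

results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results.column_names)  # actual columns depend on the app's save format
print(results[0])            # first stored row
```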

Job Command Details

Each job runs with the following configuration:

  • Image: hf.co/spaces/OpenEvals/EvalsOnTheHub
  • Command: lighteval endpoint inference-providers
  • Namespace: IPTesting
  • Flags: --push-to-hub --save-details --results-org IPTesting

Results are automatically pushed to the IPTesting organization on Hugging Face Hub.
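
Putting those pieces together, a single launch presumably resembles the sketch below. It assumes the huggingface_hub Jobs API (run_job) and an illustrative model/provider argument format for lighteval; neither is taken verbatim from app.py:

```python
# Sketch: launch one evaluation job with the image, command, flags, and
# namespace listed above. The run_job parameters and the lighteval
# model/provider argument format are assumptions, not the app's exact code.
from huggingface_hub import run_job

def launch_eval_job(model: str, provider: str, tasks: str):
    return run_job(
        image="hf.co/spaces/OpenEvals/EvalsOnTheHub",
        command=[
            "lighteval", "endpoint", "inference-providers",
            f"model_name={model},provider={provider}",  # illustrative argument format
            tasks,
            "--push-to-hub", "--save-details",
            "--results-org", "IPTesting",
        ],
        namespace="IPTesting",
    )

# job = launch_eval_job("meta-llama/Llama-3.2-3B-Instruct", "fireworks-ai", "lighteval|mmlu|0|0")
```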

Architecture

  • Main Thread: Runs the Gradio interface
  • Monitor Thread: Updates job statuses every 30 seconds and extracts scores from completed jobs
  • APScheduler: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
  • Thread-safe Operations: Uses locks to prevent race conditions when accessing job_results (see the sketch after this list)
  • HuggingFace Dataset Storage: Persists results to IPTesting/inference-provider-test-results dataset
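
A minimal sketch of the lock-guarded monitor pattern described above; variable and function names are illustrative:

```python
# Sketch: background monitor thread that updates a shared job_results store
# under a lock, polling every 30 seconds.
import threading
import time

job_results = {}                 # (model, provider) -> {"status": ..., "current_score": ...}
results_lock = threading.Lock()  # guards job_results across the Gradio and monitor threads

def monitor_loop(poll_interval: int = 30):
    while True:
        with results_lock:
            for key, entry in job_results.items():
                # Hypothetical placeholder: query the job's status and, once it
                # completes, extract the score from its logs.
                entry.setdefault("status", "running")
        time.sleep(poll_interval)

threading.Thread(target=monitor_loop, daemon=True).start()
```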

Troubleshooting

Jobs Not Launching

  • Verify your HF_TOKEN is set and has the required permissions
  • Check that the IPTesting namespace exists and you have access
  • Review logs for specific error messages

Empty Models List

  • Ensure you have internet connectivity
  • The Hugging Face Hub API must be accessible
  • Try running the initialization again

Job Status Not Updating

  • Check your internet connection
  • Verify the job IDs are valid
  • Check console output for API errors

Scores Not Appearing

  • Scores are extracted from job logs after completion
  • The extraction parses the results table that appears in job logs
  • It extracts the score for each task (from the first row where the task name appears)
  • The final score is the average of all task scores (a parsing sketch follows this list)
  • Example table format:
    | Task                     | Version | Metric                  | Value  | Stderr |
    | extended:ifeval:0        |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
    | lighteval:gpqa:diamond:0 |         | gpqa_pass@k_with_k      | 0.5000 | 0.0503 |
  • If scores don't appear, check console output for extraction errors or parsing issues
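
A rough sketch of that kind of log parsing, illustrative rather than the app's actual implementation:

```python
# Sketch: pull the first Value per task from the pipe-delimited results
# table in a job's logs and average the task scores.
def extract_average_score(log_text: str):
    scores = {}
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Expect rows shaped like: Task | Version | Metric | Value | Stderr
        if len(cells) >= 5 and cells[0] and cells[0] != "Task":
            task = cells[0]
            if task in scores:
                continue  # keep only the first row per task
            try:
                scores[task] = float(cells[3])
            except ValueError:
                pass
    return sum(scores.values()) / len(scores) if scores else None

# With the example table above: (0.9100 + 0.5000) / 2 == 0.7050
```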

Files

License

This project is provided as-is for evaluation testing purposes.