---
title: BizGenEval Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: true
license: mit
short_description: Official BizGenEval leaderboard on Hugging Face.
tags:
  - leaderboard
---

BizGenEval Leaderboard

This repository hosts the Hugging Face leaderboard for BizGenEval, the benchmark introduced in BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation.

Primary project resources:

  • Project page: https://aka.ms/BizGenEval
  • GitHub: https://github.com/microsoft/BizGenEval
  • Dataset: https://huggingface.co/datasets/microsoft/BizGenEval

The codebase supports:

  1. LOCAL_DEV mode (no HF permission required): reads/writes local namespaced paths under eval-queue/ and eval-results/.
  2. HF mode (with permission): syncs datasets from the Hub and uploads queue requests.

1) Local development quick start (no HF permission)

Step 1. Create and activate a virtual environment, then install dependencies

cd /path/to/BizGenEval-Leaderboard
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Step 2. Bootstrap local demo data

python3 scripts/bootstrap_local_dev.py

This will create:

  • eval-queue/bizgeneval/requests/microsoft/Phi-4o-mini_eval_request_False_float16_Original.json
  • eval-results/bizgeneval/results/microsoft/Phi-4o-mini/summary.json

Step 3. Launch in local mode

export LOCAL_DEV=1
python3 app.py

In LOCAL_DEV mode:

  • snapshot_download is skipped.
  • Model-card/tokenizer checks are skipped during submission.
  • New submissions are written to local eval-queue/bizgeneval/requests/ only (no upload).
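The mode switch is driven by the LOCAL_DEV environment variable (accepted as 1/true/on, per the config notes in section 4). The actual logic lives in src/envs.py; a minimal sketch of such a flag parser, with an illustrative helper name:

```python
import os

# Truthy spellings accepted for boolean env flags ("1/true/on"); the
# extra "yes" is an assumption, not taken from the repo.
_TRUTHY = {"1", "true", "on", "yes"}

def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY

LOCAL_DEV = env_flag("LOCAL_DEV")
```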

2) Supported result file formats

The leaderboard parser currently supports two formats:

A) BizGenEval summary format (recommended)

Put a summary.json under:

eval-results/bizgeneval/results/<org>/<model>/summary.json

Example:

{
  "model_name": "microsoft/Phi-4o-mini",
  "model_sha": "main",
  "by_domain": {
    "slides": {"error_score": 0.8125},
    "webpage": {"error_score": 0.845},
    "poster": {"error_score": 0.7875},
    "chart": {"error_score": 0.8025},
    "scientific_figure": {"error_score": 0.77}
  },
  "by_dimension": {
    "layout": {"error_score": 0.835},
    "attribute": {"error_score": 0.805},
    "text": {"error_score": 0.79},
    "knowledge": {"error_score": 0.775}
  }
}

error_score may be on either a 0~1 or a 0~100 scale; both are accepted and normalized to the displayed 0~100 scale.
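The real normalization lives in the parser (src/leaderboard/read_evals.py); a plausible sketch of the heuristic, assuming values at or below 1.0 are treated as fractions:

```python
def normalize_score(value: float) -> float:
    """Normalize an error_score to the displayed 0-100 scale.

    Scores may arrive on a 0-1 or a 0-100 scale; this heuristic treats
    any value <= 1.0 as a fraction (so a literal 1.0 reads as 100).
    Illustrative only -- not copied from the repo.
    """
    return value * 100.0 if value <= 1.0 else value
```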

B) Legacy template format

Legacy config/results JSON is still accepted for compatibility.

3) Queue file format

Queue entries are JSON files in:

eval-queue/bizgeneval/requests/<org>/*.json

A typical file contains:

  • model
  • revision
  • precision
  • weight_type
  • status (PENDING, RUNNING, FINISHED*)
  • metadata (license, params, likes, ...)

4) Config knobs

Main config file: src/envs.py

  • LOCAL_DEV (env): set to 1/true/on to enable local mode
  • HF_OWNER (env, optional): owner fallback
  • PROJECT_NAMESPACE (env, optional): defaults to bizgeneval
  • HF_SPACE_REPO (env, optional)
  • HF_QUEUE_REPO (env, optional)
  • HF_RESULTS_REPO (env, optional)
  • HF_TOKEN (env): required only for Hub sync/upload
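Each knob resolves from the environment with a fallback. A minimal sketch of that pattern (defaults taken from the repo names listed in this section; the `resolve` helper is illustrative, the real logic lives in src/envs.py):

```python
import os

def resolve(name, default=None):
    """Resolve a config knob from the environment, with a fallback."""
    return os.environ.get(name, default)

PROJECT_NAMESPACE = resolve("PROJECT_NAMESPACE", "bizgeneval")
HF_SPACE_REPO = resolve("HF_SPACE_REPO", "microsoft/BizGenEval-Leaderboard")
HF_QUEUE_REPO = resolve("HF_QUEUE_REPO", "demo-leaderboard-backend/requests")
HF_RESULTS_REPO = resolve("HF_RESULTS_REPO", "demo-leaderboard-backend/results")
HF_TOKEN = resolve("HF_TOKEN")  # required only for Hub sync/upload
```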

Default repo names are:

  • Space: microsoft/BizGenEval-Leaderboard
  • Queue dataset: demo-leaderboard-backend/requests
  • Results dataset: demo-leaderboard-backend/results

5) Key code locations

  • Columns and UI display fields: src/display/utils.py
  • Result parser: src/leaderboard/read_evals.py
  • DataFrame build logic: src/populate.py
  • Submission validation/upload behavior: src/submission/submit.py
  • Task definitions and page text: src/about.py