metadata
title: GRM Leaderboard
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false

GRM Leaderboard

Static Streamlit Space for comparing language models on a game-focused evaluation suite.

What This Project Is

  • Single-repo Hugging Face Space
  • Frontend-only app with no database or external backend
  • Static benchmark registry and static model score data stored in Python files
  • Runtime leaderboard calculation from those local data files

Runtime

  • Platform: Hugging Face Spaces
  • UI framework: Streamlit
  • Entry point: streamlit_app.py
  • Dependencies: requirements.txt
  • Space metadata: this README frontmatter

Main Page Flow

Tab 1:

  • Overview
  • Leaderboard and consolidated score explorer
  • Model detail and benchmark detail panels

Tab 2:

  • Benchmark Library
  • GRM-Bench authored benchmark families

File Ownership

  • streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
  • data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
  • ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
  • app.py: previous Gradio implementation retained during transition
  • benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
  • scores.py: per-model benchmark scores on a 0 to 100 scale
  • scoring.py: category scoring, GRM score calculation, and ranking logic
  • requirements.txt: runtime dependencies
  • README.md: Space metadata and maintainer handoff notes

Data Model

  • benchmarks.py stores BENCHMARKS as a list of dicts
  • Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
  • Valid categories are ROLEPLAY, ACTIONS, and GENERAL
  • scores.py stores MODEL_SCORES keyed by model display name
  • Each model score dict is keyed by benchmark id
  • Missing scores are skipped during weighted averaging
  • scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility
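
The shapes described above can be sketched as follows. This is an illustration only: the benchmark id, model names, values, and the MODEL_METADATA key names are hypothetical placeholders, and get_score is a helper invented for this example rather than a function from the repo.

```python
# All names and values below are hypothetical placeholders.
BENCHMARKS = [
    {
        "id": "example_dialogue",
        "name": "Example Dialogue Benchmark",
        "category": "ROLEPLAY",  # one of ROLEPLAY, ACTIONS, GENERAL
        "calc_weight": 1.0,
        "included_in_grm": True,
        # domain, source, phase, priority, description, summary,
        # methodology, detection_scope, and paper omitted for brevity
    },
]

# Keyed by model display name, then by benchmark id.
MODEL_SCORES = {
    "Example Model A": {"example_dialogue": 72.5},
    "Example Model B": {},  # no score yet for example_dialogue
}

# Key names here are assumptions; the README only lists what is tracked.
MODEL_METADATA = {
    "Example Model A": {
        "family": "Example",
        "size": "8B",
        "precision": "bf16",
        "open_weights": True,
    },
}


def get_score(model_name, benchmark_id):
    """Return the 0 to 100 score, or None when the model has no entry."""
    return MODEL_SCORES.get(model_name, {}).get(benchmark_id)
```

Returning None for an absent entry keeps "missing" distinct from a real score of 0.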

Scoring

  • Category score = sum(score × weight) / sum(weight), summed over the category's benchmarks that have a score
  • GRM score = average of Roleplay, Actions, and General category scores
  • scores.py values stay on a 0 to 100 scale to match the PRD source data
  • Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation
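
A minimal sketch of the two formulas above, not the actual contents of scoring.py. The function names and sample data are hypothetical, and returning None when a category has no usable scores is an assumption about edge-case handling.

```python
CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(benchmarks, model_scores, category):
    """Weighted mean of one model's scores in a category, or None if none exist."""
    weighted_sum = 0.0
    weight_total = 0.0
    for bench in benchmarks:
        # Reference-only dimensions do not count toward the official score.
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = model_scores.get(bench["id"])
        if score is None:  # missing (TBD) scores are skipped
            continue
        weighted_sum += score * bench["calc_weight"]
        weight_total += bench["calc_weight"]
    return weighted_sum / weight_total if weight_total else None


def grm_score(benchmarks, model_scores):
    """Plain average of the three category scores.

    Assumption: the result is None if any category has no usable score.
    """
    parts = [category_score(benchmarks, model_scores, c) for c in CATEGORIES]
    if any(p is None for p in parts):
        return None
    return sum(parts) / len(parts)


# Hypothetical registry and scores for one model.
SAMPLE_BENCHMARKS = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1, "included_in_grm": True},
    {"id": "gen_b", "category": "GENERAL", "calc_weight": 3, "included_in_grm": True},
    {"id": "ref_a", "category": "GENERAL", "calc_weight": 5, "included_in_grm": False},
]
SAMPLE_SCORES = {"rp_a": 80, "rp_b": 50, "act_a": 60, "gen_a": 80, "ref_a": 10}
```

With the sample data, the Roleplay score is (80 × 2 + 50 × 1) / 3 = 70, Actions is 60, and General is 80 because gen_b is missing and ref_a is excluded, so the GRM score is (70 + 60 + 80) / 3 = 70.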

How To Update The Site

Update model scores:

  • Edit scores.py
  • Change benchmark values for an existing model or add a new model block

Update evaluation suite rows or benchmark descriptions:

  • Edit benchmarks.py
  • The evaluation table and benchmark detail sections are generated from this registry

Add a new benchmark:

  • Add the benchmark entry to benchmarks.py
  • Set its category and calc_weight
  • Add corresponding values in scores.py for each model you want included
  • Set included_in_grm to False for future or reference-only dimensions
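
Before committing a new entry, a quick sanity check along these lines may help. This is a sketch: validate_benchmark is a hypothetical helper that does not exist in the repo, and the rule that calc_weight must be a non-negative number is an assumption.

```python
REQUIRED_KEYS = {
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
}
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}


def validate_benchmark(entry):
    """Return a list of problems found in one benchmark dict (empty if none)."""
    problems = []
    missing = sorted(REQUIRED_KEYS - set(entry))
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if entry.get("category") not in VALID_CATEGORIES:
        problems.append("category must be ROLEPLAY, ACTIONS, or GENERAL")
    weight = entry.get("calc_weight")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(weight, bool) or not isinstance(weight, (int, float)) or weight < 0:
        problems.append("calc_weight must be a non-negative number")
    if not isinstance(entry.get("included_in_grm"), bool):
        problems.append("included_in_grm must be True or False")
    return problems
```

Running this over every entry in BENCHMARKS and printing any non-empty result is enough for a manual pre-push check.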

Update the authored GRM-Bench families:

  • Edit GRM_BENCH_DIMENSIONS in benchmarks.py

Update page structure, copy, or styling:

  • Edit streamlit_app.py, data_views.py, or ui_theme.py

Local Development

  • Install dependencies: pip install -r requirements.txt
  • Run locally: streamlit run streamlit_app.py
  • The run command starts a local Streamlit server that serves the same static content as the Space

Deployment Notes

  • The live Space deploys from the remote main branch
  • README frontmatter controls the Space runtime metadata
  • requirements.txt must match imports used by streamlit_app.py
  • Scores in scores.py are static values taken from the PRD; entries still TBD there are stored as missing scores
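
One way to check the requirements.txt rule above is to compare imported module names against the listed packages. The sketch below is a rough aid, not part of the repo: all three functions are hypothetical, and it assumes each import name equals its package name, which does not hold for every package. Standard-library filtering relies on sys.stdlib_module_names (Python 3.10+) and is skipped on older versions.

```python
import ast
import re
import sys


def imported_modules(source):
    """Return the top-level module names imported by Python source text."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def required_packages(requirements_text):
    """Return lowercase package names from requirements.txt-style text."""
    packages = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9._-]+", line)
        if match:
            packages.add(match.group(0).lower())
    return packages


def unlisted_imports(source, requirements_text, local_modules=()):
    """Return imported third-party modules that requirements.txt does not list."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    candidates = imported_modules(source) - set(local_modules) - set(stdlib)
    listed = required_packages(requirements_text)
    return sorted(name for name in candidates if name.lower() not in listed)
```

Pass the repo's own modules (for example data_views and ui_theme) as local_modules so they are not reported as missing packages.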

Maintenance Notes

  • The UI uses Streamlit dataframes and Python-generated data views
  • Leaderboard order is recalculated on each launch from scores.py
  • Custom CSS is injected from ui_theme.py
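
The leaderboard ordering can be sketched as a sort over per-model GRM scores. rank_models is a hypothetical function, and both the tie-break by model name and the placement of unscored models at the end are assumptions rather than confirmed behavior of scoring.py.

```python
def rank_models(grm_scores):
    """Order models by GRM score, highest first.

    grm_scores maps model display name to a score, or None if not computable.
    """
    scored = sorted(
        ((name, score) for name, score in grm_scores.items() if score is not None),
        key=lambda item: (-item[1], item[0]),  # score descending, then name
    )
    unscored = sorted(name for name, score in grm_scores.items() if score is None)

    rows = [
        {"rank": position, "model": name, "grm_score": score}
        for position, (name, score) in enumerate(scored, start=1)
    ]
    # Models without a score are listed last and carry no rank.
    rows += [{"rank": None, "model": name, "grm_score": None} for name in unscored]
    return rows
```

Because the function depends only on its input dict, rerunning it on each launch always reflects the current contents of scores.py.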