---
title: GRM Leaderboard
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false
---

# GRM Leaderboard

Static Streamlit Space for comparing language models on a game-focused evaluation suite.

## What This Project Is

- Single-repo Hugging Face Space
- Frontend-only app with no database or external backend
- Static benchmark registry and static model score data stored in Python files
- Runtime leaderboard calculation from those local data files

## Runtime

- Platform: Hugging Face Spaces
- UI framework: Streamlit
- Entry point: streamlit_app.py
- Dependencies: requirements.txt
- Space metadata: this README frontmatter

## Main Page Flow

Tab 1:
- Overview
- Leaderboard and consolidated score explorer
- Model detail and benchmark detail panels

Tab 2:
- Benchmark Library
- GRM-Bench authored benchmark families

## File Ownership

- streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
- data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
- ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
- app.py: previous Gradio implementation retained during transition
- benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
- scores.py: per-model benchmark scores on a 0 to 100 scale
- scoring.py: category scoring, GRM score calculation, and ranking logic
- requirements.txt: runtime dependencies
- README.md: Space metadata and maintainer handoff notes

## Data Model

- benchmarks.py stores BENCHMARKS as a list of dicts
- Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
- Valid categories are ROLEPLAY, ACTIONS, and GENERAL
- scores.py stores MODEL_SCORES keyed by model display name
- Each model score dict is keyed by benchmark id
- Missing scores are skipped during weighted averaging
- scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility
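
The layout above can be pictured with a minimal sketch. Every id, name, and value below is invented for illustration, the field types are assumptions, and the MODEL_METADATA key names are hypothetical; only the benchmark field names and the three top-level names come from this README.

```python
# benchmarks.py (sketch): one dict per benchmark, using the documented fields.
# All values here are placeholders, not real registry entries.
BENCHMARKS = [
    {
        "id": "example_roleplay",
        "name": "Example Roleplay Benchmark",
        "category": "ROLEPLAY",  # one of ROLEPLAY, ACTIONS, GENERAL
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 1.0,
        "included_in_grm": True,
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
    {
        "id": "example_actions",
        "name": "Example Actions Benchmark",
        "category": "ACTIONS",
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 2.0,
        "included_in_grm": True,
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
]

# scores.py (sketch): model display name -> {benchmark id -> score on 0 to 100}.
# A missing score is shown here as an absent key (an assumption about the encoding).
MODEL_SCORES = {
    "Example Model A": {"example_roleplay": 72.5},  # no example_actions score yet
}

# scores.py (sketch): per-model metadata; these key names are hypothetical.
MODEL_METADATA = {
    "Example Model A": {
        "family": "Example",
        "size": "7B",
        "precision": "bf16",
        "open_weight": True,
    },
}
```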

## Scoring

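- Category score = sum(score x weight) / sum(weight), where weight is the benchmark's calc_weight and both sums run over the category's benchmarks that have a score for the model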
- GRM score = average of Roleplay, Actions, and General category scores
- scores.py values stay on a 0 to 100 scale to match the PRD source data
- Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation
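
As a worked illustration of the formulas above, here is a minimal sketch of the calculation. It is not the code in scoring.py: the function names, the sample data, and the choice to return None when a category has no usable scores are all assumptions made for the example.

```python
from statistics import mean

CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(benchmarks, model_scores, category):
    """Weighted average of a model's scores within one category.

    Skips benchmarks with included_in_grm set to False and benchmarks the
    model has no score for. Returns None if nothing is left to average
    (assumed behavior, not confirmed by the README).
    """
    weighted_total = 0.0
    weight_total = 0.0
    for bench in benchmarks:
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = model_scores.get(bench["id"])
        if score is None:  # missing score: skip it and its weight
            continue
        weighted_total += score * bench["calc_weight"]
        weight_total += bench["calc_weight"]
    return weighted_total / weight_total if weight_total else None


def grm_score(benchmarks, model_scores):
    """Plain average of the three category scores, or None if any is missing."""
    per_category = [category_score(benchmarks, model_scores, c) for c in CATEGORIES]
    if any(value is None for value in per_category):
        return None
    return mean(per_category)


# Invented sample data: ids, weights, and scores are placeholders.
sample_benchmarks = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_b", "category": "ACTIONS", "calc_weight": 3.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_ref", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": False},
]
sample_scores = {"rp_a": 80, "rp_b": 50, "act_a": 60, "gen_a": 80, "gen_ref": 10}

roleplay = category_score(sample_benchmarks, sample_scores, "ROLEPLAY")
actions = category_score(sample_benchmarks, sample_scores, "ACTIONS")
general = category_score(sample_benchmarks, sample_scores, "GENERAL")
overall = grm_score(sample_benchmarks, sample_scores)
```

In this sample, Roleplay is (80 x 2 + 50 x 1) / 3 = 70, Actions is 60 because act_b has no score and is skipped along with its weight, and General is 80 because gen_ref is excluded from the calculation. The resulting GRM score is (70 + 60 + 80) / 3 = 70.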

## How To Update The Site

Update model scores:
- Edit scores.py
- Change benchmark values for an existing model or add a new model block

Update evaluation suite rows or benchmark descriptions:
- Edit benchmarks.py
- The evaluation table and benchmark detail sections are generated from this registry

Add a new benchmark:
- Add the benchmark entry to benchmarks.py
- Set its category and calc_weight
- Add corresponding values in scores.py for each model you want included
- Set included_in_grm to False for future or reference-only dimensions
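
Before committing a new entry, a quick sanity check along these lines can catch common mistakes. This is a hypothetical helper, not part of the repo: check_benchmark_entry and its rules (in particular, requiring a positive numeric calc_weight for entries included in the GRM score) are assumptions layered on the field list and categories documented in this README.

```python
REQUIRED_FIELDS = {
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
}
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}


def check_benchmark_entry(entry, existing_ids):
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = []

    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    if entry.get("category") not in VALID_CATEGORIES:
        problems.append(f"invalid category: {entry.get('category')!r}")

    if entry.get("id") in existing_ids:
        problems.append(f"duplicate id: {entry.get('id')!r}")

    included = entry.get("included_in_grm")
    if not isinstance(included, bool):
        problems.append("included_in_grm should be True or False")

    # Assumed rule: scored entries need a positive numeric weight.
    weight = entry.get("calc_weight")
    is_number = isinstance(weight, (int, float)) and not isinstance(weight, bool)
    if included is True and not (is_number and weight > 0):
        problems.append("calc_weight should be a positive number for scored entries")

    return problems


# Usage with an invented entry: text fields are filled with a placeholder.
new_entry = {field: "placeholder" for field in REQUIRED_FIELDS}
new_entry.update(
    id="example_new_benchmark",
    category="GENERAL",
    calc_weight=1.0,
    included_in_grm=True,
)
issues = check_benchmark_entry(new_entry, existing_ids={"example_roleplay"})
```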

Update the authored GRM-Bench families:
- Edit GRM_BENCH_DIMENSIONS in benchmarks.py

Update page structure, copy, or styling:
- Edit streamlit_app.py, data_views.py, or ui_theme.py

## Local Development

- Install dependencies: pip install -r requirements.txt
- Run locally: streamlit run streamlit_app.py
- The app launches a local Streamlit server using the same static content as the Space

## Deployment Notes

- The live Space deploys from the remote main branch
- README frontmatter controls the Space runtime metadata
- requirements.txt must cover every third-party package imported by streamlit_app.py and the modules it imports
- Current scores in scores.py are static, PRD-backed values; TBD entries are represented as missing scores

## Maintenance Notes

- The UI uses Streamlit dataframes and Python-generated data views
- Leaderboard order is recalculated on each launch from scores.py
- Custom CSS is injected from ui_theme.py