metadata
title: GRM Leaderboard
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false
GRM Leaderboard
Static Streamlit Space for comparing language models on a game-focused evaluation suite.
What This Project Is
- Single-repo Hugging Face Space
- Frontend-only app with no database or external backend
- Static benchmark registry and static model score data stored in Python files
- Runtime leaderboard calculation from those local data files
Runtime
- Platform: Hugging Face Spaces
- UI framework: Streamlit
- Entry point: streamlit_app.py
- Dependencies: requirements.txt
- Space metadata: this README frontmatter
Main Page Flow
Tab 1:
- Overview
- Leaderboard and consolidated score explorer
- Model detail and benchmark detail panels
Tab 2:
- Benchmark Library
- GRM-Bench authored benchmark families
File Ownership
- streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
- data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
- ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
- app.py: previous Gradio implementation retained during transition
- benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
- scores.py: per-model benchmark scores on a 0 to 100 scale
- scoring.py: category scoring, GRM score calculation, and ranking logic
- requirements.txt: runtime dependencies
- README.md: Space metadata and maintainer handoff notes
Data Model
- benchmarks.py stores BENCHMARKS as a list of dicts
- Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
- Valid categories are ROLEPLAY, ACTIONS, and GENERAL
- scores.py stores MODEL_SCORES keyed by model display name
- Each model score dict is keyed by benchmark id
- Missing scores are skipped during weighted averaging
- scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility
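As a rough illustration of the shapes described above, the sketch below builds one benchmark entry and one model score block and checks the entry against the documented field list. The field names and categories come from this README. Every value, the metadata key names, and the `validate_benchmark` helper are hypothetical and are not taken from the actual benchmarks.py or scores.py.

```python
# Field names and categories as listed in this README.
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}
REQUIRED_FIELDS = (
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
)

# Hypothetical registry entry: all values are placeholders.
BENCHMARKS = [
    {
        "id": "example_bench",
        "name": "Example Bench",
        "category": "ROLEPLAY",
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 1.0,
        "included_in_grm": True,
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
]

# Keyed by model display name, then by benchmark id (0 to 100 scale).
MODEL_SCORES = {
    "Example Model": {"example_bench": 72.5},
}

# Metadata key names here are assumptions, not the real schema.
MODEL_METADATA = {
    "Example Model": {
        "family": "placeholder",
        "size": "placeholder",
        "precision": "placeholder",
        "open_weight": True,
    },
}


def validate_benchmark(entry):
    """Return a list of problems found in one benchmark entry (hypothetical helper)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    if entry.get("category") not in VALID_CATEGORIES:
        problems.append(f"invalid category: {entry.get('category')!r}")
    return problems
```

An empty list from `validate_benchmark` means the entry has every documented field and a valid category.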
Scoring
- Category score = sum(score x calc_weight) / sum(calc_weight), taken over the benchmarks in that category that have a score for the model
- GRM score = average of Roleplay, Actions, and General category scores
- scores.py values stay on a 0 to 100 scale to match the PRD source data
- Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation
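The scoring rules above can be sketched as follows. This is a minimal illustration, not the code in scoring.py: the function names, the sample benchmark ids, and the choice to return `None` when a category has no usable scores are all assumptions.

```python
CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(model_scores, benchmarks, category):
    """Weighted average of a model's scores within one category.

    Benchmarks without a score, or with included_in_grm set to False,
    are skipped. Returns None if nothing contributes (an assumption).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for bench in benchmarks:
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = model_scores.get(bench["id"])
        if score is None:  # missing or TBD score
            continue
        weighted_sum += score * bench["calc_weight"]
        weight_total += bench["calc_weight"]
    return weighted_sum / weight_total if weight_total else None


def grm_score(model_scores, benchmarks):
    """Unweighted mean of the three category scores, or None if any is absent."""
    parts = [category_score(model_scores, benchmarks, c) for c in CATEGORIES]
    if any(p is None for p in parts):
        return None
    return sum(parts) / len(parts)


# Hypothetical data for illustration only.
benchmarks = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "ref_x", "category": "GENERAL", "calc_weight": 5.0, "included_in_grm": False},
]
scores = {"rp_a": 80.0, "rp_b": 50.0, "act_a": 60.0, "gen_a": 70.0, "ref_x": 10.0}

roleplay = category_score(scores, benchmarks, "ROLEPLAY")  # (80*2 + 50*1) / 3
overall = grm_score(scores, benchmarks)                    # (70 + 60 + 70) / 3
```

With this sample data the reference-only benchmark `ref_x` does not affect the General score, and dropping `rp_b` from `scores` would leave the Roleplay score at 80 because the missing entry is skipped rather than counted as zero.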
How To Update The Site
Update model scores:
- Edit scores.py
- Change benchmark values for an existing model or add a new model block
Update evaluation suite rows or benchmark descriptions:
- Edit benchmarks.py
- The evaluation table and benchmark detail sections are generated from this registry
Add a new benchmark:
- Add the benchmark entry to benchmarks.py
- Set its category and calc_weight
- Add corresponding values in scores.py for each model you want included
- Set included_in_grm to False for future or reference-only dimensions
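After adding a benchmark, a quick consistency check between the registry and the score data can catch mistyped ids. The sketch below is a hypothetical helper, not part of the repository. It assumes only the shapes described in this README: a list of benchmark dicts with an `id` field and a dict of per-model score dicts keyed by benchmark id.

```python
def check_registry(benchmarks, model_scores):
    """Compare score keys against registered benchmark ids.

    Returns two dicts keyed by model name:
    - unknown: score keys that match no registered benchmark id
    - missing: registered ids the model has no score for
    """
    known = {bench["id"] for bench in benchmarks}
    unknown = {}
    missing = {}
    for model, scores in model_scores.items():
        extra = sorted(set(scores) - known)
        if extra:
            unknown[model] = extra
        absent = sorted(known - set(scores))
        if absent:
            missing[model] = absent
    return unknown, missing


# Hypothetical data: "new_bench" was just added to the registry.
example_benchmarks = [{"id": "old_bench"}, {"id": "new_bench"}]
example_scores = {
    "Model A": {"old_bench": 61.0, "new_bench": 58.0},
    "Model B": {"old_bench": 74.0, "new_bnch": 70.0},  # typo in the id
}

unknown, missing = check_registry(example_benchmarks, example_scores)
```

A missing entry is not necessarily an error, since missing scores are skipped during averaging. An unknown id usually is one, because that score would never be counted.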
Update the authored GRM-Bench families:
- Edit GRM_BENCH_DIMENSIONS in benchmarks.py
Update page structure, copy, or styling:
- Edit streamlit_app.py, data_views.py, or ui_theme.py
Local Development
- Install dependencies: pip install -r requirements.txt
- Run locally: streamlit run streamlit_app.py
- The app launches a local Streamlit server using the same static content as the Space
Deployment Notes
- The live Space deploys from the remote main branch
- README frontmatter controls the Space runtime metadata
- requirements.txt must list every third-party package imported by streamlit_app.py and the modules it imports
- Current scores in scores.py are static PRD-backed values with TBD entries represented as missing scores
Maintenance Notes
- The UI uses Streamlit dataframes and Python-generated data views
- Leaderboard order is recalculated on each launch from scores.py
- Custom CSS is injected from ui_theme.py
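The leaderboard ordering noted above can be sketched as a sort over each model's GRM score. This is an illustration under assumptions, not the ranking logic in scoring.py: the function name, the alphabetical tie-break, and placing unscored models last without a rank are all hypothetical choices.

```python
def rank_models(grm_by_model):
    """Order models by GRM score, highest first.

    grm_by_model maps model display name to a GRM score or None.
    Ties are broken alphabetically by name; models with no score are
    listed last with a rank of None (both are assumptions).
    """
    scored = [(name, score) for name, score in grm_by_model.items() if score is not None]
    unscored = sorted(name for name, score in grm_by_model.items() if score is None)
    scored.sort(key=lambda item: (-item[1], item[0]))
    rows = [(rank, name, score) for rank, (name, score) in enumerate(scored, start=1)]
    rows += [(None, name, None) for name in unscored]
    return rows


# Hypothetical scores for illustration only.
leaderboard = rank_models({"B": 70.0, "A": 70.0, "C": 81.5, "D": None})
```

Because the input is rebuilt from the static score data each time, the resulting order depends only on the contents of scores.py and benchmarks.py at launch.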