| title: GRM Leaderboard | |
| colorFrom: gray | |
| colorTo: blue | |
| sdk: streamlit | |
| app_file: streamlit_app.py | |
| pinned: false | |
| # GRM Leaderboard | |
| Static Streamlit Space for comparing language models on a game-focused evaluation suite. | |
| ## What This Project Is | |
| - Single-repo Hugging Face Space | |
| - Frontend-only app with no database or external backend | |
| - Static benchmark registry and static model score data stored in Python files | |
| - Runtime leaderboard calculation from those local data files | |
| ## Runtime | |
| - Platform: Hugging Face Spaces | |
| - UI framework: Streamlit | |
| - Entry point: streamlit_app.py | |
| - Dependencies: requirements.txt | |
| - Space metadata: this README frontmatter | |
| ## Main Page Flow | |
| Tab 1: | |
| - Overview | |
| - Leaderboard and consolidated score explorer | |
| - Model detail and benchmark detail panels | |
| Tab 2: | |
| - Benchmark Library | |
| - GRM-Bench authored benchmark families | |
| ## File Ownership | |
| - streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition | |
| - data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library | |
| - ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens | |
| - app.py: previous Gradio implementation retained during transition | |
| - benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights | |
| - scores.py: per-model benchmark scores on a 0 to 100 scale | |
| - scoring.py: category scoring, GRM score calculation, and ranking logic | |
| - requirements.txt: runtime dependencies | |
| - README.md: Space metadata and maintainer handoff notes | |
| ## Data Model | |
| - benchmarks.py stores BENCHMARKS as a list of dicts | |
| - Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper | |
| - Valid categories are ROLEPLAY, ACTIONS, and GENERAL | |
| - scores.py stores MODEL_SCORES keyed by model display name | |
| - Each model score dict is keyed by benchmark id | |
| - Missing scores are skipped during weighted averaging | |
| - scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility | |
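The shapes described above can be sketched as follows. This is a minimal illustration, not the contents of the real files: every id, name, and value is hypothetical, the MODEL_METADATA key names are assumed, and `missing_fields` is a hypothetical helper that is not part of the repo.

```python
# Field names taken from the Data Model list above.
REQUIRED_FIELDS = (
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
)
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}

# benchmarks.py: a list of dicts, one per benchmark (hypothetical entry).
BENCHMARKS = [
    {
        "id": "example_roleplay",
        "name": "Example Roleplay Benchmark",
        "category": "ROLEPLAY",
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 1.0,
        "included_in_grm": True,
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
]

# scores.py: model display name -> {benchmark id -> score on a 0 to 100 scale}.
MODEL_SCORES = {
    "Example Model A": {"example_roleplay": 72.5},
}

# scores.py: model display name -> metadata (key names are assumptions).
MODEL_METADATA = {
    "Example Model A": {
        "family": "placeholder",
        "size": "placeholder",
        "precision": "placeholder",
        "open_weight": True,
    },
}


def missing_fields(entry):
    """Return the required benchmark fields that an entry does not define."""
    return [field for field in REQUIRED_FIELDS if field not in entry]
```

A benchmark with no value for a given model is simply left out of that model's dict, which is how a missing score is represented in this sketch.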
| ## Scoring | |
| - Category score = sum(score x calc_weight) / sum(calc_weight), taken over the benchmarks in that category that have a score | |
| - GRM score = average of Roleplay, Actions, and General category scores | |
| - scores.py values stay on a 0 to 100 scale to match the PRD source data | |
| - Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation | |
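The scoring rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in scoring.py: the function names are hypothetical, and returning `None` when a category has no scored benchmarks is an assumed behavior the README does not specify.

```python
from statistics import mean

CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(scores, benchmarks, category):
    """Weighted mean of one model's scores in a category, or None if unscored."""
    weighted_sum = 0.0
    weight_total = 0.0
    for bench in benchmarks:
        # Reference-only dimensions never count toward the official score.
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = scores.get(bench["id"])
        if score is None:
            continue  # missing scores are skipped, not treated as zero
        weighted_sum += score * bench["calc_weight"]
        weight_total += bench["calc_weight"]
    return weighted_sum / weight_total if weight_total else None


def grm_score(scores, benchmarks):
    """Plain average of the three category scores (None if any is missing)."""
    per_category = [category_score(scores, benchmarks, c) for c in CATEGORIES]
    if any(value is None for value in per_category):
        return None  # assumption: no GRM score without all three categories
    return mean(per_category)


# Hypothetical registry and scores, for illustration only.
example_benchmarks = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "ref_only", "category": "GENERAL", "calc_weight": 5.0, "included_in_grm": False},
]
example_scores = {"rp_a": 80.0, "act_a": 60.0, "gen_a": 70.0, "ref_only": 10.0}
```

With these example values, `rp_b` has no score, so the Roleplay category uses only `rp_a`, and `ref_only` is ignored because `included_in_grm` is False.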
| ## How To Update The Site | |
| Update model scores: | |
| - Edit scores.py | |
| - Change benchmark values for an existing model or add a new model block | |
| Update evaluation suite rows or benchmark descriptions: | |
| - Edit benchmarks.py | |
| - The evaluation table and benchmark detail sections are generated from this registry | |
| Add a new benchmark: | |
| - Add the benchmark entry to benchmarks.py | |
| - Set its category and calc_weight | |
| - Add corresponding values in scores.py for each model you want included | |
| - Set included_in_grm to False for future or reference-only dimensions so they stay out of the GRM score | |
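After adding a benchmark, a quick consistency check can catch typos before the Space redeploys. The helper below is a hypothetical sketch and is not part of the repo; it assumes the dict shapes described in the Data Model section and that scored benchmarks need a positive calc_weight.

```python
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}


def validate_registry(benchmarks, model_scores):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    known_ids = set()
    for bench in benchmarks:
        if bench["id"] in known_ids:
            problems.append(f"duplicate benchmark id: {bench['id']}")
        known_ids.add(bench["id"])
        if bench["category"] not in VALID_CATEGORIES:
            problems.append(f"{bench['id']}: invalid category {bench['category']!r}")
        # Assumption: a benchmark that counts toward GRM needs a positive weight.
        if bench["included_in_grm"] and bench["calc_weight"] <= 0:
            problems.append(f"{bench['id']}: scored benchmark needs a positive calc_weight")
    for model, scores in model_scores.items():
        for bench_id, value in scores.items():
            if bench_id not in known_ids:
                problems.append(f"{model}: unknown benchmark id {bench_id}")
            elif not 0 <= value <= 100:
                problems.append(f"{model}: {bench_id} score {value} is outside 0 to 100")
    return problems
```

In a real checkout this could be called as `validate_registry(BENCHMARKS, MODEL_SCORES)` after importing both names from benchmarks.py and scores.py.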
| Update the authored GRM-Bench families: | |
| - Edit GRM_BENCH_DIMENSIONS in benchmarks.py | |
| Update page structure, copy, or styling: | |
| - Edit streamlit_app.py, data_views.py, or ui_theme.py | |
| ## Local Development | |
| - Install dependencies: pip install -r requirements.txt | |
| - Run locally: streamlit run streamlit_app.py | |
| - The app launches a local Streamlit server using the same static content as the Space | |
| ## Deployment Notes | |
| - The live Space deploys from the remote main branch | |
| - README frontmatter controls the Space runtime metadata | |
| - requirements.txt must list every third-party package imported by streamlit_app.py and the local modules it loads | |
| - Current scores in scores.py are static PRD-backed values with TBD entries represented as missing scores | |
| ## Maintenance Notes | |
| - The UI uses Streamlit dataframes and Python-generated data views | |
| - Leaderboard order is recalculated on each launch from scores.py | |
| - Custom CSS is injected from ui_theme.py | |
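The launch-time ranking noted above can be sketched as a simple sort. This is an assumption-laden illustration rather than the logic in scoring.py: `rank_models` is a hypothetical name, and sorting by descending GRM score, breaking ties by model name, and omitting models without a GRM score are all assumed behaviors.

```python
def rank_models(grm_scores):
    """Map {model name: GRM score or None} to a list of (rank, name, score)."""
    # Assumption: models with no GRM score are left off the ranked list.
    scored = [(name, score) for name, score in grm_scores.items() if score is not None]
    # Highest score first; ties fall back to alphabetical order for stability.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [(rank, name, score) for rank, (name, score) in enumerate(scored, start=1)]


# Hypothetical GRM scores, for illustration only.
example_grm = {"Model B": 71.0, "Model A": 71.0, "Model C": 84.5, "Model D": None}
ranking = rank_models(example_grm)
```

Because the input is rebuilt from scores.py on every launch, editing a score and restarting the app is enough to change the order.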