---
title: GRM Leaderboard
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false
---

# GRM Leaderboard

Static Streamlit Space for comparing language models on a game-focused evaluation suite.

## What This Project Is

- Single-repo Hugging Face Space
- Frontend-only app with no database or external backend
- Static benchmark registry and static model score data stored in Python files
- Runtime leaderboard calculation from those local data files

## Runtime

- Platform: Hugging Face Spaces
- UI framework: Streamlit
- Entry point: streamlit_app.py
- Dependencies: requirements.txt
- Space metadata: this README frontmatter

## Main Page Flow

Tab 1:

- Overview
- Leaderboard and consolidated score explorer
- Model detail and benchmark detail panels

Tab 2:

- Benchmark Library
- GRM-Bench authored benchmark families

## File Ownership

- streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
- data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
- ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
- app.py: previous Gradio implementation retained during transition
- benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
- scores.py: per-model benchmark scores on a 0 to 100 scale
- scoring.py: category scoring, GRM score calculation, and ranking logic
- requirements.txt: runtime dependencies
- README.md: Space metadata and maintainer handoff notes

## Data Model

- benchmarks.py stores BENCHMARKS as a list of dicts
- Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
- Valid categories are ROLEPLAY, ACTIONS, and GENERAL
- scores.py stores MODEL_SCORES keyed by model display name
- Each model score dict is keyed by benchmark id
- Missing scores are skipped during weighted averaging
- scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility

## Scoring

- Category score = sum(score x weight) / sum(weight)
- GRM score = average of Roleplay, Actions, and General category scores
- scores.py values stay on a 0 to 100 scale to match the PRD source data
- Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation

## How To Update The Site

Update model scores:

- Edit scores.py
- Change benchmark values for an existing model or add a new model block

Update evaluation suite rows or benchmark descriptions:

- Edit benchmarks.py
- The evaluation table and benchmark detail sections are generated from this registry

Add a new benchmark:

- Add the benchmark entry to benchmarks.py
- Set its category and calc_weight
- Add corresponding values in scores.py for each model you want included
- Set included_in_grm to false for future or reference-only dimensions

Update the authored GRM-Bench families:

- Edit GRM_BENCH_DIMENSIONS in benchmarks.py

Update page structure, copy, or styling:

- Edit streamlit_app.py, data_views.py, or ui_theme.py

## Local Development

- Install dependencies: pip install -r requirements.txt
- Run locally: streamlit run streamlit_app.py
- The app launches a local Streamlit server using the same static content as the Space

## Deployment Notes

- The live Space deploys from the remote main branch
- README frontmatter controls the Space runtime metadata
- requirements.txt must match imports used by streamlit_app.py
- Current scores in scores.py are static PRD-backed values with TBD entries represented as missing scores

## Maintenance Notes

- The UI uses Streamlit dataframes and Python-generated data views
- Leaderboard order is recalculated on each launch from scores.py
- Custom CSS is injected from ui_theme.py
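
## Scoring Example

The rules in the Data Model and Scoring sections can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scoring.py: the benchmark ids, model names, and score values are hypothetical sample data, and the function names `category_score`, `grm_score`, and `rank_models` are assumptions. It also assumes that entries with included_in_grm set to False are dropped inside the category average, and that a category with no usable scores yields None; the actual implementation may handle either case differently.

```python
from statistics import mean

# Hypothetical sample data shaped like BENCHMARKS and MODEL_SCORES (only the fields used here).
BENCHMARKS = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "ref_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": False},
]

MODEL_SCORES = {
    "model-x": {"rp_a": 80.0, "rp_b": 50.0, "act_a": 60.0, "gen_a": 70.0, "ref_a": 10.0},
    "model-y": {"rp_a": 90.0, "act_a": 40.0, "gen_a": 50.0},  # rp_b is TBD, so it is absent
}

CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(scores, category):
    """Weighted average over one category: sum(score x weight) / sum(weight)."""
    weighted_sum = 0.0
    weight_sum = 0.0
    for bench in BENCHMARKS:
        if bench["category"] != category or not bench["included_in_grm"]:
            continue  # assumption: reference-only dimensions are excluded here
        value = scores.get(bench["id"])
        if value is None:
            continue  # missing scores are skipped, along with their weight
        weighted_sum += value * bench["calc_weight"]
        weight_sum += bench["calc_weight"]
    # Assumption: a category with no usable scores has no score at all.
    return weighted_sum / weight_sum if weight_sum else None


def grm_score(scores):
    """Plain average of the three category scores, or None if any is missing."""
    per_category = [category_score(scores, c) for c in CATEGORIES]
    if any(value is None for value in per_category):
        return None
    return mean(per_category)


def rank_models(model_scores):
    """Return (model name, GRM score) pairs, highest first, unscored models last."""
    scored = [(name, grm_score(s)) for name, s in model_scores.items()]
    return sorted(scored, key=lambda item: (item[1] is None, -(item[1] or 0.0)))


if __name__ == "__main__":
    for name, score in rank_models(MODEL_SCORES):
        print(name, "n/a" if score is None else f"{score:.2f}")
```

With this sample data, the Roleplay score for model-x is (80 x 2 + 50 x 1) / (2 + 1) = 70, while model-y's Roleplay score uses only rp_a because rp_b is missing, so its weight drops out of the denominator as well.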