---
title: GRM Leaderboard
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false
---

GRM Leaderboard

Static Streamlit Space for comparing language models on a game-focused evaluation suite.

Tab 1:

Tab 2:

streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
app.py: previous Gradio implementation retained during transition
benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
scores.py: per-model benchmark scores on a 0 to 100 scale
scoring.py: category scoring, GRM score calculation, and ranking logic
requirements.txt: runtime dependencies
README.md: Space metadata and maintainer handoff notes

benchmarks.py stores BENCHMARKS as a list of dicts
Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
Valid categories are ROLEPLAY, ACTIONS, and GENERAL
scores.py stores MODEL_SCORES keyed by model display name
Each model score dict is keyed by benchmark id
Missing scores are skipped during weighted averaging
scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility
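The layout described above can be pictured with a minimal sketch. Everything in it is illustrative: the benchmark id, the model name, the field values, the value types, and the `MODEL_METADATA` key names are hypothetical placeholders, and `get_score` is a helper invented for this example rather than a function from the repository.

```python
# Hypothetical placeholder data showing the shape of the three structures.
# None of these values come from the real benchmarks.py or scores.py.
BENCHMARKS = [
    {
        "id": "example_bench",          # key used inside MODEL_SCORES
        "name": "Example Bench",
        "category": "ROLEPLAY",         # ROLEPLAY, ACTIONS, or GENERAL
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 1.0,             # assumed numeric
        "included_in_grm": True,        # assumed boolean
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
]

# Keyed by model display name, then by benchmark id, on a 0 to 100 scale.
# A benchmark with no score yet is simply absent from the inner dict.
MODEL_SCORES = {
    "Example Model": {"example_bench": 72.5},
}

# Key names here are assumptions; the README only lists the topics covered.
MODEL_METADATA = {
    "Example Model": {
        "family": "placeholder",
        "size": "placeholder",
        "precision": "placeholder",
        "open_weight": True,
    },
}


def get_score(model_name, benchmark_id):
    """Return the score for a model and benchmark, or None if it is missing."""
    return MODEL_SCORES.get(model_name, {}).get(benchmark_id)


print(get_score("Example Model", "example_bench"))
print(get_score("Example Model", "not_yet_scored"))
```

Returning `None` for an absent entry keeps "no score yet" distinct from a genuine score of 0.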

Category score = sum(score × weight) / sum(weight), taken over the benchmarks in that category that have a score
GRM score = average of Roleplay, Actions, and General category scores
scores.py values stay on a 0 to 100 scale to match the PRD source data
Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation
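The scoring rules above can be sketched as follows. This is not the code in scoring.py; it is a minimal illustration that assumes `calc_weight` is the weight, that `included_in_grm` marks officially scored benchmarks, and that a category with no usable scores yields `None` (the README does not say how that case is handled). The benchmark ids and scores are made up.

```python
from statistics import mean

CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(benchmarks, model_scores, category):
    """Weighted average of a model's scores within one category.

    Benchmarks without a score, or not included in GRM, are skipped.
    """
    weighted_total = 0.0
    weight_sum = 0.0
    for bench in benchmarks:
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = model_scores.get(bench["id"])
        if score is None:
            continue  # missing score: contributes neither score nor weight
        weighted_total += score * bench["calc_weight"]
        weight_sum += bench["calc_weight"]
    # Assumption: an empty category has no defined score.
    return weighted_total / weight_sum if weight_sum else None


def grm_score(benchmarks, model_scores):
    """Unweighted mean of the three category scores."""
    per_category = [category_score(benchmarks, model_scores, c) for c in CATEGORIES]
    if any(s is None for s in per_category):
        return None  # assumption: all three categories are required
    return mean(per_category)


# Hypothetical registry and scores, for illustration only.
benchmarks = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_b", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "lib_only", "category": "GENERAL", "calc_weight": 5.0, "included_in_grm": False},
]
scores = {"rp_a": 80.0, "rp_b": 50.0, "act_a": 60.0, "gen_a": 80.0, "lib_only": 10.0}

print(category_score(benchmarks, scores, "ROLEPLAY"))  # (80*2 + 50*1) / 3 = 70.0
print(category_score(benchmarks, scores, "ACTIONS"))   # act_b is missing, so 60.0
print(grm_score(benchmarks, scores))                   # mean(70, 60, 80) = 70.0
```

Because a missing score drops out of both the numerator and the denominator, it does not pull the category average toward zero.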

Update model scores:

Update evaluation suite rows or benchmark descriptions:

Edit benchmarks.py
The evaluation table and benchmark detail sections are generated from this registry
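Since the page is generated from the registry, a quick sanity check after editing benchmarks.py can catch incomplete entries early. The sketch below is a hypothetical helper, not part of the repository; the field list and category set are taken from this README, while the duplicate-id check is an added assumption based on scores being keyed by benchmark id.

```python
REQUIRED_FIELDS = (
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
)
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}


def validate_benchmarks(benchmarks):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_ids = set()
    for index, bench in enumerate(benchmarks):
        label = bench.get("id", f"entry #{index}")
        missing = [field for field in REQUIRED_FIELDS if field not in bench]
        if missing:
            problems.append(f"{label}: missing fields {missing}")
        if "category" in bench and bench["category"] not in VALID_CATEGORIES:
            problems.append(f"{label}: invalid category {bench['category']!r}")
        # Assumption: ids must be unique because scores are keyed by id.
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


# Hypothetical entry with one missing field and a category outside the valid set.
candidate = {field: "placeholder" for field in REQUIRED_FIELDS if field != "paper"}
candidate["id"] = "example_bench"
candidate["category"] = "COMBAT"

for problem in validate_benchmarks([candidate]):
    print(problem)
```

A check like this could be run as `validate_benchmarks(BENCHMARKS)` against the real registry before pushing.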

Add a new benchmark:

Update the authored GRM-Bench families:

Update page structure, copy, or styling:

Install dependencies: pip install -r requirements.txt
Run locally: streamlit run streamlit_app.py
The app launches a local Streamlit server using the same static content as the Space

The live Space deploys from the remote main branch
README frontmatter controls the Space runtime metadata
requirements.txt must match imports used by streamlit_app.py
Current scores in scores.py are static values taken from the PRD; entries still marked TBD are represented as missing scores
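One way to check that requirements.txt covers the app's imports is to compare the two mechanically. The sketch below is a rough, hypothetical helper built on the standard library only. It assumes Python 3.10 or newer (for `sys.stdlib_module_names`), treats the Space's own modules as local, and assumes each package's import name matches its distribution name, which does not hold for every package. The version pins in the example are made up.

```python
import ast
import re
import sys

# Modules that live in this Space and need no requirements.txt entry.
LOCAL_MODULES = {"data_views", "ui_theme", "benchmarks", "scores", "scoring"}


def third_party_imports(source):
    """Top-level package names imported by the given Python source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            found.add(node.module.split(".")[0])
    # sys.stdlib_module_names exists on Python 3.10+; older versions skip this filter.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return {name for name in found if name not in stdlib and name not in LOCAL_MODULES}


def requirement_names(text):
    """Normalized package names listed in requirements.txt-style text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and option lines such as "-r other.txt"
        match = re.match(r"[A-Za-z0-9_.-]+", line)
        if match:
            names.add(match.group(0).lower().replace("-", "_"))
    return names


# Hypothetical file contents, for illustration only.
app_source = "import streamlit as st\nimport pandas as pd\nfrom data_views import build_table\n"
requirements = "streamlit==1.40.0  # hypothetical pin\n"

missing = third_party_imports(app_source) - requirement_names(requirements)
print(sorted(missing))
```

In practice the inputs would come from reading streamlit_app.py and requirements.txt from disk, and the other local modules would need the same check since the app imports them.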