GRM / README.md

mbagdasarova-nvidia

Upload 9 files

5c49242 verified 22 days ago


	---
	title: GRM Leaderboard
	colorFrom: gray
	colorTo: blue
	sdk: streamlit
	app_file: streamlit_app.py
	pinned: false
	---

	# GRM Leaderboard

	Static Streamlit Space for comparing language models on a game-focused evaluation suite.

	## What This Project Is

	- Single-repo Hugging Face Space
	- Frontend-only app with no database or external backend
	- Static benchmark registry and static model score data stored in Python files
	- Runtime leaderboard calculation from those local data files

	## Runtime

	- Platform: Hugging Face Spaces
	- UI framework: Streamlit
	- Entry point: streamlit_app.py
	- Dependencies: requirements.txt
	- Space metadata: this README frontmatter

	## Main Page Flow

	Tab 1:
	- Overview
	- Leaderboard and consolidated score explorer
	- Model detail and benchmark detail panels

	Tab 2:
	- Benchmark Library
	- GRM-Bench authored benchmark families

	## File Ownership

	- streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
	- data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
	- ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
	- app.py: previous Gradio implementation retained during transition
	- benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
	- scores.py: per-model benchmark scores on a 0 to 100 scale
	- scoring.py: category scoring, GRM score calculation, and ranking logic
	- requirements.txt: runtime dependencies
	- README.md: Space metadata and maintainer handoff notes

	## Data Model

	- benchmarks.py stores BENCHMARKS as a list of dicts
	- Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
	- Valid categories are ROLEPLAY, ACTIONS, and GENERAL
	- scores.py stores MODEL_SCORES keyed by model display name
	- Each model score dict is keyed by benchmark id
	- Missing scores are skipped during weighted averaging
	- scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility
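As a rough illustration of the shapes described above, here is a minimal sketch. The benchmark field names and the three category names come from this README. Every value, the metadata key names, and the choice to represent a missing score by omitting the key are assumptions for illustration, not copies of the real files.

```python
# Hypothetical sketch of the data shapes; all values are placeholders.

VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}

# benchmarks.py: BENCHMARKS is a list of dicts, one per benchmark.
BENCHMARKS = [
    {
        "id": "example_roleplay_bench",      # hypothetical benchmark id
        "name": "Example Roleplay Bench",
        "category": "ROLEPLAY",              # one of VALID_CATEGORIES
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 1.0,                  # weight used in the category average
        "included_in_grm": True,             # False for reference-only dimensions
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
]

# scores.py: MODEL_SCORES is keyed by model display name, then by benchmark id.
# Scores are on a 0 to 100 scale. Here a missing (TBD) score is simply absent.
MODEL_SCORES = {
    "Example Model A": {"example_roleplay_bench": 72.5},
    "Example Model B": {},                   # no score yet for this benchmark
}

# scores.py: MODEL_METADATA holds per-model descriptive fields.
# The key names below are assumed; the README only lists the concepts.
MODEL_METADATA = {
    "Example Model A": {
        "family": "placeholder",
        "size": "placeholder",
        "precision": "placeholder",
        "open_weights": True,
    },
}
```

Keeping scores keyed by benchmark id means a lookup such as `MODEL_SCORES[model].get(benchmark["id"])` returns `None` for a missing score, which is convenient for skipping it during averaging.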

	## Scoring

	- Category score = sum(score x weight) / sum(weight), computed over the benchmarks in that category that have a score for the model
	- GRM score = average of Roleplay, Actions, and General category scores
	- scores.py values stay on a 0 to 100 scale to match the PRD source data
	- Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation
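The rules above can be sketched in a few lines of Python. This is not the code in scoring.py; the function names `category_score` and `grm_score` are hypothetical, the example data is invented, and returning `None` when a category has no usable scores is an assumption about an edge case the README does not specify.

```python
from statistics import mean

CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(benchmarks, model_scores, category):
    """Weighted average of one model's scores within a single category."""
    weighted_total = 0.0
    weight_total = 0.0
    for bench in benchmarks:
        # Only benchmarks in this category that count toward the GRM score.
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = model_scores.get(bench["id"])
        if score is None:
            continue  # missing scores are skipped, not treated as zero
        weighted_total += score * bench["calc_weight"]
        weight_total += bench["calc_weight"]
    # Assumption: a category with no usable scores has no category score.
    return weighted_total / weight_total if weight_total else None


def grm_score(benchmarks, model_scores):
    """Plain average of the three category scores."""
    per_category = [category_score(benchmarks, model_scores, c) for c in CATEGORIES]
    if any(value is None for value in per_category):
        return None  # assumption: no GRM score unless all categories are scored
    return mean(per_category)


# Invented example data, on the same 0 to 100 scale as scores.py.
example_benchmarks = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "ref_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": False},
]
example_scores = {"rp_a": 80.0, "rp_b": 50.0, "act_a": 60.0, "gen_a": 80.0, "ref_a": 10.0}

roleplay = category_score(example_benchmarks, example_scores, "ROLEPLAY")
overall = grm_score(example_benchmarks, example_scores)
```

In this example the Roleplay score is (80 x 2 + 50 x 1) / 3 = 70, the reference-only benchmark `ref_a` is ignored, and the GRM score is the average of 70, 60, and 80, which is 70. Dropping `rp_b` from the scores would leave a Roleplay score of 80, because the skipped benchmark's weight also leaves the denominator.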

	## How To Update The Site

	Update model scores:
	- Edit scores.py
	- Change benchmark values for an existing model or add a new model block

	Update evaluation suite rows or benchmark descriptions:
	- Edit benchmarks.py
	- The evaluation table and benchmark detail sections are generated from this registry

	Add a new benchmark:
	- Add the benchmark entry to benchmarks.py
	- Set its category and calc_weight
	- Add corresponding values in scores.py for each model you want included
	- Set included_in_grm to False for future or reference-only dimensions
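Before committing a new entry, a quick sanity check along these lines may catch common mistakes. It is only a sketch: `validate_benchmark` is a hypothetical helper that does not exist in the repo, and the rule that a GRM-included benchmark needs a positive numeric calc_weight is an assumption inferred from the weighted-average formula.

```python
REQUIRED_KEYS = {
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
}
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}


def validate_benchmark(entry, existing_ids):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    if entry.get("category") not in VALID_CATEGORIES:
        problems.append(f"invalid category: {entry.get('category')!r}")

    if entry.get("id") in existing_ids:
        problems.append(f"duplicate id: {entry.get('id')!r}")

    # Assumption: benchmarks that count toward the GRM score need a
    # positive numeric weight, otherwise the weighted average breaks.
    weight = entry.get("calc_weight")
    if entry.get("included_in_grm"):
        if not isinstance(weight, (int, float)) or weight <= 0:
            problems.append(f"calc_weight must be a positive number, got {weight!r}")

    return problems


# Hypothetical new entry with a misspelled category.
new_entry = dict.fromkeys(REQUIRED_KEYS, "placeholder")
new_entry.update(
    id="new_bench", category="ACTION", calc_weight=1.0, included_in_grm=True
)
issues = validate_benchmark(new_entry, existing_ids={"old_bench"})
```

Here `issues` contains a single message about the invalid category `'ACTION'`; correcting it to `ACTIONS` yields an empty list.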

	Update the authored GRM-Bench families:
	- Edit GRM_BENCH_DIMENSIONS in benchmarks.py

	Update page structure, copy, or styling:
	- Edit streamlit_app.py, data_views.py, or ui_theme.py

	## Local Development

	- Install dependencies: pip install -r requirements.txt
	- Run locally: streamlit run streamlit_app.py
	- The app launches a local Streamlit server using the same static content as the Space

	## Deployment Notes

	- The live Space deploys from the remote main branch
	- README frontmatter controls the Space runtime metadata
	- requirements.txt must cover every third-party import used by streamlit_app.py and the modules it imports
	- Current scores in scores.py are static PRD-backed values with TBD entries represented as missing scores

	## Maintenance Notes

	- The UI uses Streamlit dataframes and Python-generated data views
	- Leaderboard order is recalculated on each launch from scores.py
	- Custom CSS is injected from ui_theme.py
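The leaderboard reordering mentioned above could look roughly like the following. This is a hedged sketch, not the ranking logic in scoring.py: `rank_models` is a hypothetical name, and both the tie-break by model name and the placement of unscored models at the bottom without a rank are assumptions.

```python
def rank_models(grm_scores):
    """Order models by GRM score, highest first.

    grm_scores maps model display name -> GRM score, or None when the
    score could not be computed.
    """
    scored = [(name, score) for name, score in grm_scores.items() if score is not None]
    unscored = sorted(name for name, score in grm_scores.items() if score is None)

    # Highest score first; ties fall back to alphabetical order (assumption).
    scored.sort(key=lambda item: (-item[1], item[0]))

    rows = [
        {"rank": position, "model": name, "grm": score}
        for position, (name, score) in enumerate(scored, start=1)
    ]
    # Unscored models are listed last with no rank (assumption).
    rows += [{"rank": None, "model": name, "grm": None} for name in unscored]
    return rows


# Invented scores for illustration.
leaderboard = rank_models(
    {"Model B": 70.0, "Model A": 70.0, "Model C": 81.5, "Model D": None}
)
```

Because the function takes only a name-to-score mapping, rerunning it after any edit to scores.py produces the new order with no stored ranking to update.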