---
title: GRM Leaderboard
colorFrom: gray
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false
---


# GRM Leaderboard

Static Streamlit Space for comparing language models on a game-focused evaluation suite.

## What This Project Is

- Single-repo Hugging Face Space
- Frontend-only app with no database or external backend
- Static benchmark registry and static model score data stored in Python files
- Runtime leaderboard calculation from those local data files

## Runtime

- Platform: Hugging Face Spaces
- UI framework: Streamlit
- Entry point: streamlit_app.py
- Dependencies: requirements.txt
- Space metadata: this README frontmatter



## Main Page Flow



Tab 1:
- Overview
- Leaderboard and consolidated score explorer
- Model detail and benchmark detail panels

Tab 2:
- Benchmark Library
- GRM-Bench authored benchmark families



## File Ownership



- streamlit_app.py: page layout, tabs, controls, score explorer, and benchmark library composition
- data_views.py: table shaping for leaderboard, benchmark matrix, model details, and benchmark library
- ui_theme.py: Streamlit CSS, header HTML, overview copy, and theme tokens
- app.py: previous Gradio implementation retained during transition
- benchmarks.py: benchmark registry, category assignments, PRD metadata, descriptions, summaries, and weights
- scores.py: per-model benchmark scores on a 0 to 100 scale
- scoring.py: category scoring, GRM score calculation, and ranking logic
- requirements.txt: runtime dependencies
- README.md: Space metadata and maintainer handoff notes

## Data Model

- benchmarks.py stores BENCHMARKS as a list of dicts
- Each benchmark entry includes: id, name, category, domain, source, phase, priority, calc_weight, included_in_grm, description, summary, methodology, detection_scope, paper
- Valid categories are ROLEPLAY, ACTIONS, and GENERAL
- scores.py stores MODEL_SCORES keyed by model display name
- Each model score dict is keyed by benchmark id
- Missing scores are skipped during weighted averaging
- scores.py stores MODEL_METADATA for model family, size, precision, and open-weight visibility
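
As a rough illustration of these shapes, the sketch below builds one benchmark entry and one model entry, then checks that every score refers to a known benchmark id. Every id, name, and value is a made-up placeholder, the field types are guesses, and the MODEL_METADATA key names (family, size, precision, open_weight) are assumptions, since only the meaning of those fields is listed above.

```python
# Illustrative shapes only: all ids, names, and values are hypothetical
# placeholders, and the value types are assumed rather than confirmed.
BENCHMARKS = [
    {
        "id": "example_benchmark",
        "name": "Example Benchmark",
        "category": "ROLEPLAY",  # one of ROLEPLAY, ACTIONS, GENERAL
        "domain": "placeholder",
        "source": "placeholder",
        "phase": "placeholder",
        "priority": "placeholder",
        "calc_weight": 1.0,
        "included_in_grm": True,
        "description": "placeholder",
        "summary": "placeholder",
        "methodology": "placeholder",
        "detection_scope": "placeholder",
        "paper": "placeholder",
    },
]

# Keyed by model display name, then by benchmark id; scores use a 0 to 100 scale.
MODEL_SCORES = {
    "Example Model": {"example_benchmark": 72.5},
}

# The metadata key names below are assumptions made for this sketch.
MODEL_METADATA = {
    "Example Model": {
        "family": "placeholder",
        "size": "placeholder",
        "precision": "placeholder",
        "open_weight": True,
    },
}

# Simple consistency check: every scored benchmark id should exist in BENCHMARKS.
benchmark_ids = {b["id"] for b in BENCHMARKS}
for model, scores in MODEL_SCORES.items():
    unknown = set(scores) - benchmark_ids
    print(model, "unknown benchmark ids:", sorted(unknown))
```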

## Scoring

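- Category score = sum(score x weight) / sum(weight), taken over the benchmarks in that category that have a score, where weight is the benchmark's calc_weight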
- GRM score = average of Roleplay, Actions, and General category scores
- scores.py values stay on a 0 to 100 scale to match the PRD source data
- Non-scored dimensions can appear in the Benchmark Library but are excluded from official GRM calculation
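
The rules above can be sketched as follows. This is a minimal illustration and not the code in scoring.py: the function names category_score and grm_score are hypothetical, the sample benchmarks and scores are invented, and returning None when a category has no usable scores is an assumption about edge-case handling.

```python
CATEGORIES = ("ROLEPLAY", "ACTIONS", "GENERAL")


def category_score(model_scores, benchmarks, category):
    """Weighted average over one category, skipping missing scores."""
    weighted_sum = 0.0
    weight_total = 0.0
    for bench in benchmarks:
        # Only benchmarks in this category that count toward the GRM score.
        if bench["category"] != category or not bench["included_in_grm"]:
            continue
        score = model_scores.get(bench["id"])
        if score is None:  # missing (for example TBD) scores are skipped
            continue
        weighted_sum += score * bench["calc_weight"]
        weight_total += bench["calc_weight"]
    if weight_total == 0:
        return None  # assumption: no usable scores means no category score
    return weighted_sum / weight_total


def grm_score(model_scores, benchmarks):
    """Plain average of the three category scores."""
    parts = [category_score(model_scores, benchmarks, c) for c in CATEGORIES]
    if any(p is None for p in parts):
        return None  # assumption: a missing category leaves the GRM score undefined
    return sum(parts) / len(parts)


# Hypothetical registry and scores, reduced to the fields the math needs.
benchmarks = [
    {"id": "rp_a", "category": "ROLEPLAY", "calc_weight": 2.0, "included_in_grm": True},
    {"id": "rp_b", "category": "ROLEPLAY", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "act_a", "category": "ACTIONS", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_a", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "gen_b", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": True},
    {"id": "ref_only", "category": "GENERAL", "calc_weight": 1.0, "included_in_grm": False},
]
scores = {"rp_a": 80.0, "rp_b": 50.0, "act_a": 64.0, "gen_a": 76.0, "ref_only": 10.0}

print(category_score(scores, benchmarks, "ROLEPLAY"))  # (80*2 + 50*1) / 3 = 70.0
print(grm_score(scores, benchmarks))                   # (70 + 64 + 76) / 3 = 70.0
```

In this example gen_b has no score and ref_only has included_in_grm set to False, so neither affects the General category score.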

## How To Update The Site

Update model scores:
- Edit scores.py
- Change benchmark values for an existing model or add a new model block

Update evaluation suite rows or benchmark descriptions:
- Edit benchmarks.py
- The evaluation table and benchmark detail sections are generated from this registry

Add a new benchmark:
- Add the benchmark entry to benchmarks.py
- Set its category and calc_weight
- Add corresponding values in scores.py for each model you want included
- Set included_in_grm to False for future or reference-only dimensions
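
Before committing a new benchmark, a small sanity check along these lines may help catch omissions. It is a sketch only: check_benchmark is a hypothetical helper that does not exist in the repo, the sample entry is invented, and treating "no model has a score yet" as a problem is an assumption about what maintainers want flagged.

```python
# Field names and categories as listed in the Data Model section.
REQUIRED_KEYS = {
    "id", "name", "category", "domain", "source", "phase", "priority",
    "calc_weight", "included_in_grm", "description", "summary",
    "methodology", "detection_scope", "paper",
}
VALID_CATEGORIES = {"ROLEPLAY", "ACTIONS", "GENERAL"}


def check_benchmark(entry, model_scores):
    """Return a list of human-readable problems for one benchmark entry."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    if entry.get("category") not in VALID_CATEGORIES:
        problems.append(f"invalid category: {entry.get('category')!r}")
    # Assumption: a benchmark that counts toward GRM should have at least one score.
    if entry.get("included_in_grm"):
        scored = [m for m, s in model_scores.items() if entry.get("id") in s]
        if not scored:
            problems.append("no model has a score for this benchmark yet")
    return problems


# Hypothetical new entry with placeholder values for the descriptive fields.
entry = dict.fromkeys(REQUIRED_KEYS, "placeholder")
entry.update(id="new_benchmark", category="ACTIONS", calc_weight=1.0, included_in_grm=True)

print(check_benchmark(entry, {"Example Model": {}}))
print(check_benchmark(entry, {"Example Model": {"new_benchmark": 55.0}}))
```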



Update the authored GRM-Bench families:
- Edit GRM_BENCH_DIMENSIONS in benchmarks.py

Update page structure, copy, or styling:
- Edit streamlit_app.py, data_views.py, or ui_theme.py

## Local Development

- Install dependencies: pip install -r requirements.txt
- Run locally: streamlit run streamlit_app.py
- The app launches a local Streamlit server using the same static content as the Space



## Deployment Notes



- The live Space deploys from the remote main branch
- README frontmatter controls the Space runtime metadata
- requirements.txt must match imports used by streamlit_app.py
- Scores in scores.py are static, PRD-backed values; entries still marked TBD are represented as missing scores

## Maintenance Notes

- The UI uses Streamlit dataframes and Python-generated data views
- Leaderboard order is recalculated on each launch from scores.py
- Custom CSS is injected from ui_theme.py
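
The ordering step can be pictured with a sketch like the one below. It is not the ranking logic from scoring.py: rank_models is a hypothetical name, the scores are invented, and placing models without a GRM score at the end in alphabetical order is an assumption.

```python
def rank_models(grm_scores):
    """Sort model names by GRM score, highest first; unscored models go last."""
    scored = [(name, s) for name, s in grm_scores.items() if s is not None]
    unscored = [name for name, s in grm_scores.items() if s is None]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored] + sorted(unscored)


# Hypothetical GRM scores keyed by model display name.
print(rank_models({"Model A": 61.0, "Model B": 74.5, "Model C": None}))
```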