iWorldBench committed on
Commit
697f377
·
1 Parent(s): 27d9916

Deploy iWorld-Bench leaderboard

Files changed (3)
  1. README.md +56 -45
  2. app.py +12 -211
  3. requirements.txt +5 -21
README.md CHANGED
@@ -1,11 +1,10 @@
1
  ---
2
- <<<<<<< HEAD
3
  title: iWorld-Bench Leaderboard
4
  emoji: 🌍
5
  colorFrom: blue
6
  colorTo: green
7
  sdk: gradio
8
- sdk_version: "4.44.0"
9
  python_version: "3.12"
10
  app_file: app.py
11
  pinned: false
@@ -14,52 +13,64 @@ pinned: false
14
  # iWorld-Bench Leaderboard
15
 
16
  A comprehensive benchmark for interactive world models.
17
- =======
18
- title: IWorld Bench
19
- emoji: 🥇
20
- colorFrom: green
21
- colorTo: indigo
22
- sdk: gradio
23
- app_file: app.py
24
- pinned: true
25
- license: mit
26
- short_description: A Benchmark for Interactive World Models
27
- sdk_version: 5.43.1
28
- tags:
29
- - leaderboard
30
- ---
31
 
32
- # Start the configuration
33
-
34
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
35
-
36
- Results files should have the following format and be stored as json files:
37
- ```json
38
- {
39
- "config": {
40
- "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
41
- "model_name": "path of the model on the hub: org/model",
42
- "model_sha": "revision on the hub",
43
- },
44
- "results": {
45
- "task_name": {
46
- "metric_name": score,
47
- },
48
- "task_name2": {
49
- "metric_name": score,
50
- }
51
- }
52
- }
53
  ```
54
 
55
- Request files are created automatically by this tool.
56
 
57
- If you encounter a problem on the Space, don't hesitate to restart it to remove the created eval-queue, eval-queue-bk, eval-results and eval-results-bk folders.
58
 
59
- # Code logic for more complex edits
60
 
61
- You'll find
62
- - the main table's column names and properties in `src/display/utils.py`
63
- - the logic to read all results and request files, then convert them into dataframe rows, in `src/leaderboard/read_evals.py` and `src/populate.py`
64
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
65
- >>>>>>> 274bb98a1643b352ae5569c75aeb43fc9ca01625
 
1
  ---
 
2
  title: iWorld-Bench Leaderboard
3
  emoji: 🌍
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: gradio
7
+ sdk_version: "4.44.1"
8
  python_version: "3.12"
9
  app_file: app.py
10
  pinned: false
 
13
  # iWorld-Bench Leaderboard
14
 
15
  A comprehensive benchmark for interactive world models.
16
 
17
+ ## Local run
18
+
19
+ ```bash
20
+ pip install -r requirements.txt
21
+ python app.py
22
+ ```
23
+
24
+ If you deploy in Docker and Gradio reports that localhost is not accessible, set the environment variable `GRADIO_SHARE=true`. On Hugging Face Spaces, the default (`share` off) is correct.
25
+
26
+ ## Deploy to Hugging Face Space (matches the local `readme更新.txt`)
27
+
28
+ Create a new Space in the web UI: License MIT, SDK **Gradio**, hardware CPU basic, Public. Copy the Git URL `https://huggingface.co/spaces/<username>/<space-name>`.
29
+
30
+ In **WSL** or **Git Bash** (change the path to your repository location):
31
+
32
+ ```bash
33
+ # 0. Enter the project root (the directory containing app.py)
34
+ cd /mnt/d/lab/Thu_lab/iworld-bench
35
+
36
+ # 1. Install the CLI (skip if requirements already includes it)
37
+ pip install -U "huggingface_hub[cli]"
38
+
39
+ # 2. Log in (token: https://huggingface.co/settings/tokens )
40
+ huggingface-cli login
41
+
42
+ # 3. Confirm the login
43
+ huggingface-cli whoami
44
+
45
+ # 4. Initialize git if it has not been initialized yet (optional)
46
+ # git init
47
+
48
+ # 5. Add the remote (replace <username>/<space-name> with your own)
49
+ git remote add origin https://huggingface.co/spaces/<username>/<space-name>
50
+ # If origin already exists, use: git remote set-url origin https://huggingface.co/spaces/<username>/<space-name>
51
+
52
+ # 6. Pull the small commit that the Space generated automatically (skip if this fails)
53
+ git pull origin main --allow-unrelated-histories
54
+ # If there is no main branch: git branch -M main
55
+
56
+ # 7. Add only the files that need to be uploaded (do not add bench/ or zip files)
57
+ git add README.md app.py requirements.txt data/results.csv
58
+ find src -name '*.py' -not -path 'src/bench/*' -exec git add {} +
59
+
60
+ # 8. Commit and push
61
+ git commit -m "Deploy iWorld-Bench leaderboard"
62
+ git push -u origin main
63
  ```
64
 
65
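+ After pushing, choose **⋯ Restart Space** on the Space page. If the build fails, you can set the variable `GRADIO_SHARE` under Space **Settings → Repository secrets** (usually leave it unset; set it to `true` only if you run your own Docker setup and see localhost-related errors).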
66
 
67
+ On Windows, if you only use PowerShell and `find` is not available, use this instead:
68
+
69
+ ```powershell
70
+ git add README.md app.py requirements.txt data/results.csv
71
+ Get-ChildItem -Path src -Filter *.py -Recurse | Where-Object { $_.FullName -notmatch '\\bench\\' } | ForEach-Object { git add $_.FullName }
72
+ ```
73
 
74
+ ## Dependency note
75
 
76
+ Keep `starlette<1.0` (see `requirements.txt`). Starlette 1.0 changed `TemplateResponse`; Gradio 4.44.x is built for the older API. Installing Starlette 1.x can cause `TypeError: unhashable type: 'dict'` when loading the Gradio UI.
 
 
 
 
app.py CHANGED
@@ -1,4 +1,3 @@
1
- <<<<<<< HEAD
2
  from typing import Optional, List
3
  import gradio as gr
4
  import pandas as pd
@@ -18,6 +17,7 @@ radar_plotter = RadarPlotter(data_loader)
18
 
19
  DEFAULT_METRIC = "Average ⭐"
20
 
 
21
  def reload_data():
22
  msg = data_loader.reload_data()
23
  if data_loader.df_all is None or data_loader.df_all.empty:
@@ -54,6 +54,7 @@ def reload_data():
54
  gr.update(choices=category_choices, value="All"), \
55
  html_table, radar_fig
56
 
 
57
  def update_leaderboard_wrapper(metric, top_k, model_filter,
58
  category_filter, sort_mode, selected_metrics):
59
  clean_metric = clean_metric_names([metric])[0]
@@ -80,6 +81,7 @@ def update_leaderboard_wrapper(metric, top_k, model_filter,
80
  radar_fig = radar_plotter.create_radar_chart(radar_df)
81
  return html_table, radar_fig
82
 
 
83
  def create_comparison_plot_wrapper(model_filter, category_filter,
84
  selected_plot_metric, plot_sort_mode):
85
  clean_metric = clean_metric_names([selected_plot_metric])[0]
@@ -92,6 +94,7 @@ def create_comparison_plot_wrapper(model_filter, category_filter,
92
  sort_mode=plot_sort_mode
93
  )
94
 
 
95
  academic_css = get_academic_css()
96
 
97
  with gr.Blocks(css=academic_css) as demo:
@@ -214,215 +217,13 @@ with gr.Blocks(css=academic_css) as demo:
214
  outputs=[status_box, category_dropdown, leaderboard_html, radar_plot],
215
  )
216
 
 
217
  if __name__ == "__main__":
 
 
 
218
  demo.launch(
219
- server_name="0.0.0.0",
220
- server_port=7860,
221
- share=True,
222
- )
223
- =======
224
- import gradio as gr
225
- from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
226
- import pandas as pd
227
- from apscheduler.schedulers.background import BackgroundScheduler
228
- from huggingface_hub import snapshot_download
229
-
230
- from src.about import (
231
- CITATION_BUTTON_LABEL,
232
- CITATION_BUTTON_TEXT,
233
- EVALUATION_QUEUE_TEXT,
234
- INTRODUCTION_TEXT,
235
- LLM_BENCHMARKS_TEXT,
236
- TITLE,
237
- )
238
- from src.display.css_html_js import custom_css
239
- from src.display.utils import (
240
- BENCHMARK_COLS,
241
- COLS,
242
- EVAL_COLS,
243
- EVAL_TYPES,
244
- AutoEvalColumn,
245
- ModelType,
246
- fields,
247
- WeightType,
248
- Precision
249
- )
250
- from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
251
- from src.populate import get_evaluation_queue_df, get_leaderboard_df
252
- from src.submission.submit import add_new_eval
253
-
254
-
255
- def restart_space():
256
- API.restart_space(repo_id=REPO_ID)
257
-
258
- ### Space initialisation
259
- try:
260
- print(EVAL_REQUESTS_PATH)
261
- snapshot_download(
262
- repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
263
- )
264
- except Exception:
265
- restart_space()
266
- try:
267
- print(EVAL_RESULTS_PATH)
268
- snapshot_download(
269
- repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
270
- )
271
- except Exception:
272
- restart_space()
273
-
274
-
275
- LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
276
-
277
- (
278
- finished_eval_queue_df,
279
- running_eval_queue_df,
280
- pending_eval_queue_df,
281
- ) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
282
-
283
- def init_leaderboard(dataframe):
284
- if dataframe is None or dataframe.empty:
285
- raise ValueError("Leaderboard DataFrame is empty or None.")
286
- return Leaderboard(
287
- value=dataframe,
288
- datatype=[c.type for c in fields(AutoEvalColumn)],
289
- select_columns=SelectColumns(
290
- default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
291
- cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
292
- label="Select Columns to Display:",
293
- ),
294
- search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
295
- hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
296
- filter_columns=[
297
- ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
298
- ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
299
- ColumnFilter(
300
- AutoEvalColumn.params.name,
301
- type="slider",
302
- min=0.01,
303
- max=150,
304
- label="Select the number of parameters (B)",
305
- ),
306
- ColumnFilter(
307
- AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
308
- ),
309
- ],
310
- bool_checkboxgroup_label="Hide models",
311
- interactive=False,
312
- )
313
-
314
-
315
- demo = gr.Blocks(css=custom_css)
316
- with demo:
317
- gr.HTML(TITLE)
318
- gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
319
-
320
- with gr.Tabs(elem_classes="tab-buttons") as tabs:
321
- with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
322
- leaderboard = init_leaderboard(LEADERBOARD_DF)
323
-
324
- with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
325
- gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
326
-
327
- with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
328
- with gr.Column():
329
- with gr.Row():
330
- gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
331
-
332
- with gr.Column():
333
- with gr.Accordion(
334
- f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
335
- open=False,
336
- ):
337
- with gr.Row():
338
- finished_eval_table = gr.components.Dataframe(
339
- value=finished_eval_queue_df,
340
- headers=EVAL_COLS,
341
- datatype=EVAL_TYPES,
342
- row_count=5,
343
- )
344
- with gr.Accordion(
345
- f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
346
- open=False,
347
- ):
348
- with gr.Row():
349
- running_eval_table = gr.components.Dataframe(
350
- value=running_eval_queue_df,
351
- headers=EVAL_COLS,
352
- datatype=EVAL_TYPES,
353
- row_count=5,
354
- )
355
-
356
- with gr.Accordion(
357
- f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
358
- open=False,
359
- ):
360
- with gr.Row():
361
- pending_eval_table = gr.components.Dataframe(
362
- value=pending_eval_queue_df,
363
- headers=EVAL_COLS,
364
- datatype=EVAL_TYPES,
365
- row_count=5,
366
- )
367
- with gr.Row():
368
- gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")
369
-
370
- with gr.Row():
371
- with gr.Column():
372
- model_name_textbox = gr.Textbox(label="Model name")
373
- revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
374
- model_type = gr.Dropdown(
375
- choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
376
- label="Model type",
377
- multiselect=False,
378
- value=None,
379
- interactive=True,
380
- )
381
-
382
- with gr.Column():
383
- precision = gr.Dropdown(
384
- choices=[i.value.name for i in Precision if i != Precision.Unknown],
385
- label="Precision",
386
- multiselect=False,
387
- value="float16",
388
- interactive=True,
389
- )
390
- weight_type = gr.Dropdown(
391
- choices=[i.value.name for i in WeightType],
392
- label="Weights type",
393
- multiselect=False,
394
- value="Original",
395
- interactive=True,
396
- )
397
- base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
398
-
399
- submit_button = gr.Button("Submit Eval")
400
- submission_result = gr.Markdown()
401
- submit_button.click(
402
- add_new_eval,
403
- [
404
- model_name_textbox,
405
- base_model_name_textbox,
406
- revision_name_textbox,
407
- precision,
408
- weight_type,
409
- model_type,
410
- ],
411
- submission_result,
412
- )
413
-
414
- with gr.Row():
415
- with gr.Accordion("📙 Citation", open=False):
416
- citation_button = gr.Textbox(
417
- value=CITATION_BUTTON_TEXT,
418
- label=CITATION_BUTTON_LABEL,
419
- lines=20,
420
- elem_id="citation-button",
421
- show_copy_button=True,
422
- )
423
-
424
- scheduler = BackgroundScheduler()
425
- scheduler.add_job(restart_space, "interval", seconds=1800)
426
- scheduler.start()
427
- demo.queue(default_concurrency_limit=40).launch()
428
- >>>>>>> 274bb98a1643b352ae5569c75aeb43fc9ca01625
 
 
1
  from typing import Optional, List
2
  import gradio as gr
3
  import pandas as pd
 
17
 
18
  DEFAULT_METRIC = "Average ⭐"
19
 
20
+
21
  def reload_data():
22
  msg = data_loader.reload_data()
23
  if data_loader.df_all is None or data_loader.df_all.empty:
 
54
  gr.update(choices=category_choices, value="All"), \
55
  html_table, radar_fig
56
 
57
+
58
  def update_leaderboard_wrapper(metric, top_k, model_filter,
59
  category_filter, sort_mode, selected_metrics):
60
  clean_metric = clean_metric_names([metric])[0]
 
81
  radar_fig = radar_plotter.create_radar_chart(radar_df)
82
  return html_table, radar_fig
83
 
84
+
85
  def create_comparison_plot_wrapper(model_filter, category_filter,
86
  selected_plot_metric, plot_sort_mode):
87
  clean_metric = clean_metric_names([selected_plot_metric])[0]
 
94
  sort_mode=plot_sort_mode
95
  )
96
 
97
+
98
  academic_css = get_academic_css()
99
 
100
  with gr.Blocks(css=academic_css) as demo:
 
217
  outputs=[status_box, category_dropdown, leaderboard_html, radar_plot],
218
  )
219
 
220
+
221
  if __name__ == "__main__":
222
+ import os
223
+
224
+ # HF Spaces: leave share off (default). Docker / locked-down hosts: set GRADIO_SHARE=true.
225
  demo.launch(
226
+ server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
227
+ server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
228
+ share=os.environ.get("GRADIO_SHARE", "false").strip().lower() in ("1", "true", "yes"),
229
+ )
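
The new `demo.launch(...)` call reads three environment variables and falls back to defaults when they are unset. A minimal sketch of that parsing, pulled into standalone helpers so it can be checked without starting Gradio, is shown below. The helper names `env_flag` and `launch_kwargs` and the `environ` parameter are assumptions made for illustration; they do not appear in `app.py`.

```python
import os

# Spellings treated as "on", matching the tuple used in the diff above.
TRUTHY = ("1", "true", "yes")


def env_flag(name, default="false", environ=None):
    """Return True when the variable holds one of the truthy spellings.

    Hypothetical helper: whitespace is stripped and case is ignored,
    as in the expression passed to share= in the diff.
    """
    env = os.environ if environ is None else environ
    return env.get(name, default).strip().lower() in TRUTHY


def launch_kwargs(environ=None):
    """Build the keyword arguments for demo.launch() from the environment."""
    env = os.environ if environ is None else environ
    return {
        "server_name": env.get("GRADIO_SERVER_NAME", "0.0.0.0"),
        "server_port": int(env.get("GRADIO_SERVER_PORT", "7860")),
        "share": env_flag("GRADIO_SHARE", environ=env),
    }


if __name__ == "__main__":
    # Prints the values that would be handed to demo.launch(**kwargs).
    print(launch_kwargs())
```

Any value outside the three truthy spellings, for example `on`, leaves `share` disabled, so a typo in the variable falls back to the setting recommended for Hugging Face Spaces.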
 
requirements.txt CHANGED
@@ -1,25 +1,9 @@
1
- <<<<<<< HEAD
2
- gradio>=4.0.0
 
 
3
  huggingface-hub==0.23.0
4
  pandas>=2.0.0
5
  matplotlib>=3.7.0
6
  numpy>=1.24.0
7
- plotly>=5.0.0
8
- =======
9
- APScheduler
10
- black
11
- datasets
12
- gradio
13
- gradio[oauth]
14
- gradio_leaderboard==0.0.13
15
- gradio_client
16
- huggingface-hub>=0.18.0
17
- matplotlib
18
- numpy
19
- pandas
20
- python-dateutil
21
- tqdm
22
- transformers
23
- tokenizers>=0.15.0
24
- sentencepiece
25
- >>>>>>> 274bb98a1643b352ae5569c75aeb43fc9ca01625
 
1
+ # Pin starlette<1: Gradio 4.44.x calls Starlette TemplateResponse with the pre-1.0
2
+ # argument order; Starlette 1.0+ breaks that and triggers Jinja2 "unhashable type: dict".
3
+ gradio>=4.44.1
4
+ starlette>=0.37.0,<1.0.0
5
  huggingface-hub==0.23.0
6
  pandas>=2.0.0
7
  matplotlib>=3.7.0
8
  numpy>=1.24.0
9
+ plotly>=5.0.0
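
The `starlette<1.0.0` pin can be verified in an existing environment before launching the app. The following is a rough sketch that uses only the standard library; the function names are hypothetical, and it checks only the upper bound of the pin by looking at the major version number, not the `>=0.37.0` lower bound.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '0.37.2'."""
    return int(version_string.split(".")[0])


def starlette_is_compatible(version_string):
    """True when the version satisfies the starlette<1.0.0 upper bound."""
    return major_version(version_string) < 1


def check_installed_starlette():
    """Describe whether the installed Starlette respects the pin."""
    try:
        installed = version("starlette")
    except PackageNotFoundError:
        return "starlette is not installed"
    if starlette_is_compatible(installed):
        return f"starlette {installed} satisfies <1.0.0"
    return f"starlette {installed} violates the <1.0.0 pin"


if __name__ == "__main__":
    print(check_installed_starlette())
```

Running the script after `pip install -r requirements.txt` gives a quick signal of whether another package has pulled in Starlette 1.x.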