Space: EmbodiedCity/iWorld-Bench

iWorldBench committed 19 days ago

Commit 697f377 (1 parent: 27d9916)

Deploy iWorld-Bench leaderboard

Files changed (3)

README.md +56 -45
app.py +12 -211
requirements.txt +5 -21

README.md CHANGED

@@ -1,11 +1,10 @@
 ---
-<<<<<<< HEAD
 title: iWorld-Bench Leaderboard
 emoji: 🌍
 colorFrom: blue
 colorTo: green
 sdk: gradio
-sdk_version: "4.44.0"
 python_version: "3.12"
 app_file: app.py
 pinned: false
@@ -14,52 +13,64 @@ pinned: false
 # iWorld-Bench Leaderboard
 A comprehensive benchmark for interactive world models.
-=======
-title: IWorld Bench
-emoji: 🥇
-colorFrom: green
-colorTo: indigo
-sdk: gradio
-app_file: app.py
-pinned: true
-license: mit
-short_description: A Benchmark for Interactive World Models
-sdk_version: 5.43.1
-tags:
-- leaderboard
----
-# Start the configuration
-Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
-Results files should have the following format and be stored as json files:
-```json
-{
-    "config": {
-        "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-        "model_name": "path of the model on the hub: org/model",
-        "model_sha": "revision on the hub",
-    },
-    "results": {
-        "task_name": {
-            "metric_name": score,
-        },
-        "task_name2": {
-            "metric_name": score,
-        }
-    }
-}
 ```
-Request files are created automatically by this tool.
-If you encounter problem on the space, don't hesitate to restart it to remove the create eval-queue, eval-queue-bk, eval-results and eval-results-bk created folder.
-# Code logic for more complex edits
-You'll find
-- the main table' columns names and properties in `src/display/utils.py`
-- the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
-- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
->>>>>>> 274bb98a1643b352ae5569c75aeb43fc9ca01625

 ---
 title: iWorld-Bench Leaderboard
 emoji: 🌍
 colorFrom: blue
 colorTo: green
 sdk: gradio
+sdk_version: "4.44.1"
 python_version: "3.12"
 app_file: app.py
 pinned: false
 # iWorld-Bench Leaderboard
 A comprehensive benchmark for interactive world models.
+## Local run
+```bash
+pip install -r requirements.txt
+python app.py
+```
+If you deploy in Docker and Gradio reports that localhost is not accessible, set the environment variable `GRADIO_SHARE=true`. On Hugging Face Spaces, the default (`share` off) is correct.
+## Deploy to Hugging Face Space (matches the local `readme更新.txt`)
+Create a new Space in the web UI: License MIT, SDK **Gradio**, hardware CPU basic, Public. Copy the Git URL `https://huggingface.co/spaces/<username>/<space-name>`.
+In **WSL** or **Git Bash** (change the path to your repository location):
+```bash
+# 0. Enter the project root (the directory that contains app.py)
+cd /mnt/d/lab/Thu_lab/iworld-bench
+# 1. CLI (skip if requirements already includes it)
+pip install -U "huggingface_hub[cli]"
+# 2. Log in (token: https://huggingface.co/settings/tokens )
+huggingface-cli login
+# 3. Confirm the login
+huggingface-cli whoami
+# 4. Initialize git if not done yet (optional)
+# git init
+# 5. Add the remote (replace <username>/<space-name> with yours)
+git remote add origin https://huggingface.co/spaces/<username>/<space-name>
+# If origin already exists, use: git remote set-url origin https://huggingface.co/spaces/<username>/<space-name>
+# 6. Pull the small commit the Space generated automatically (skip if it fails)
+git pull origin main --allow-unrelated-histories
+# If there is no main branch: git branch -M main
+# 7. Add only the files that need to be deployed (do not add bench/ or zip files)
+git add README.md app.py requirements.txt data/results.csv
+find src -name '*.py' -not -path 'src/bench/*' -exec git add {} +
+# 8. Commit and push
+git commit -m "Deploy iWorld-Bench leaderboard"
+git push -u origin main
 ```
+After pushing, open the Space page and choose **⋯ → Restart Space**. If the build fails, you can set the variable `GRADIO_SHARE` under Space **Settings → Repository secrets** (normally leave it unset; set it to `true` only if you run your own Docker image and see localhost-related errors).
+On Windows, if you only use PowerShell and `find` is unavailable, use this instead:
+```powershell
+git add README.md app.py requirements.txt data/results.csv
+Get-ChildItem -Path src -Filter *.py -Recurse | Where-Object { $_.FullName -notmatch '\\bench\\' } | ForEach-Object { git add $_.FullName }
+```
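For a shell-independent alternative to the `find` and PowerShell commands in step 7, the same file selection can be sketched in Python. This is only an illustration: the helper name `deployable_sources` is hypothetical and not part of the repository, and it assumes the layout implied by the commands (a `src/` directory with an excluded `src/bench/` subtree). It follows the `find` variant, so only files under `src/bench/` are skipped.

```python
from pathlib import Path
from typing import List


def deployable_sources(root: Path) -> List[Path]:
    """Return every *.py file under root/src except those under root/src/bench.

    Hypothetical helper mirroring:
        find src -name '*.py' -not -path 'src/bench/*'
    """
    src = root / "src"
    bench = src / "bench"
    return sorted(
        path
        for path in src.rglob("*.py")
        # Skip anything whose ancestors include src/bench.
        if bench not in path.parents
    )


if __name__ == "__main__":
    # Print repository-relative paths, one per line, ready to pass to `git add`.
    for path in deployable_sources(Path(".")):
        print(path.as_posix())
```

The printed paths could then be passed to `git add` by hand or through a small wrapper script.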
+## Dependency note
+Keep `starlette<1.0` (see `requirements.txt`). Starlette 1.0 changed `TemplateResponse`; Gradio 4.44.x is built for the older API. Installing Starlette 1.x can cause `TypeError: unhashable type: 'dict'` when loading the Gradio UI.
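To confirm that an environment respects the `starlette<1.0` pin before launching the app, a check along the following lines may help. It is a minimal sketch, not part of this commit: the function names are hypothetical, and the version parsing only looks at the leading major component, which is enough for the `<1.0` boundary but is not a full version parser.

```python
from importlib import metadata


def is_pre_1_0(version: str) -> bool:
    """Return True when the major component of a version like '0.37.2' is below 1."""
    major = int(version.split(".", 1)[0])
    return major < 1


def check_installed_starlette() -> str:
    """Describe whether the installed Starlette satisfies the <1.0 pin."""
    try:
        installed = metadata.version("starlette")
    except metadata.PackageNotFoundError:
        return "starlette is not installed"
    if is_pre_1_0(installed):
        return f"starlette {installed} satisfies the <1.0 pin"
    return f"starlette {installed} violates the <1.0 pin"


if __name__ == "__main__":
    print(check_installed_starlette())
```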

app.py CHANGED

@@ -1,4 +1,3 @@
-<<<<<<< HEAD
 from typing import Optional, List
 import gradio as gr
 import pandas as pd
@@ -18,6 +17,7 @@ radar_plotter = RadarPlotter(data_loader)
 DEFAULT_METRIC = "Average ⭐"
 def reload_data():
     msg = data_loader.reload_data()
     if data_loader.df_all is None or data_loader.df_all.empty:
@@ -54,6 +54,7 @@ def reload_data():
            gr.update(choices=category_choices, value="All"), \
            html_table, radar_fig
 def update_leaderboard_wrapper(metric, top_k, model_filter,
                                category_filter, sort_mode, selected_metrics):
     clean_metric = clean_metric_names([metric])[0]
@@ -80,6 +81,7 @@ def update_leaderboard_wrapper(metric, top_k, model_filter,
     radar_fig = radar_plotter.create_radar_chart(radar_df)
     return html_table, radar_fig
 def create_comparison_plot_wrapper(model_filter, category_filter,
                                   selected_plot_metric, plot_sort_mode):
     clean_metric = clean_metric_names([selected_plot_metric])[0]
@@ -92,6 +94,7 @@ def create_comparison_plot_wrapper(model_filter, category_filter,
         sort_mode=plot_sort_mode
     )
 academic_css = get_academic_css()
 with gr.Blocks(css=academic_css) as demo:
@@ -214,215 +217,13 @@ with gr.Blocks(css=academic_css) as demo:
         outputs=[status_box, category_dropdown, leaderboard_html, radar_plot],
     )
 if __name__ == "__main__":
     demo.launch(
-        server_name="0.0.0.0",
-        server_port=7860,
-        share=True,
-    )
-=======
-import gradio as gr
-from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
-import pandas as pd
-from apscheduler.schedulers.background import BackgroundScheduler
-from huggingface_hub import snapshot_download
-from src.about import (
-    CITATION_BUTTON_LABEL,
-    CITATION_BUTTON_TEXT,
-    EVALUATION_QUEUE_TEXT,
-    INTRODUCTION_TEXT,
-    LLM_BENCHMARKS_TEXT,
-    TITLE,
-)
-from src.display.css_html_js import custom_css
-from src.display.utils import (
-    BENCHMARK_COLS,
-    COLS,
-    EVAL_COLS,
-    EVAL_TYPES,
-    AutoEvalColumn,
-    ModelType,
-    fields,
-    WeightType,
-    Precision
-)
-from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
-from src.populate import get_evaluation_queue_df, get_leaderboard_df
-from src.submission.submit import add_new_eval
-def restart_space():
-    API.restart_space(repo_id=REPO_ID)
-### Space initialisation
-try:
-    print(EVAL_REQUESTS_PATH)
-    snapshot_download(
-        repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-try:
-    print(EVAL_RESULTS_PATH)
-    snapshot_download(
-        repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
-(
-    finished_eval_queue_df,
-    running_eval_queue_df,
-    pending_eval_queue_df,
-) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
-def init_leaderboard(dataframe):
-    if dataframe is None or dataframe.empty:
-        raise ValueError("Leaderboard DataFrame is empty or None.")
-    return Leaderboard(
-        value=dataframe,
-        datatype=[c.type for c in fields(AutoEvalColumn)],
-        select_columns=SelectColumns(
-            default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-            cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-            label="Select Columns to Display:",
-        ),
-        search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
-        hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
-        filter_columns=[
-            ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
-            ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
-            ColumnFilter(
-                AutoEvalColumn.params.name,
-                type="slider",
-                min=0.01,
-                max=150,
-                label="Select the number of parameters (B)",
-            ),
-            ColumnFilter(
-                AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
-            ),
-        ],
-        bool_checkboxgroup_label="Hide models",
-        interactive=False,
-    )
-demo = gr.Blocks(css=custom_css)
-with demo:
-    gr.HTML(TITLE)
-    gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
-    with gr.Tabs(elem_classes="tab-buttons") as tabs:
-        with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
-            leaderboard = init_leaderboard(LEADERBOARD_DF)
-        with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
-            gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-        with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
-            with gr.Column():
-                with gr.Row():
-                    gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
-                with gr.Column():
-                    with gr.Accordion(
-                        f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            finished_eval_table = gr.components.Dataframe(
-                                value=finished_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-                    with gr.Accordion(
-                        f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            running_eval_table = gr.components.Dataframe(
-                                value=running_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-                    with gr.Accordion(
-                        f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            pending_eval_table = gr.components.Dataframe(
-                                value=pending_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-            with gr.Row():
-                gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")
-            with gr.Row():
-                with gr.Column():
-                    model_name_textbox = gr.Textbox(label="Model name")
-                    revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
-                    model_type = gr.Dropdown(
-                        choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                        label="Model type",
-                        multiselect=False,
-                        value=None,
-                        interactive=True,
-                    )
-                with gr.Column():
-                    precision = gr.Dropdown(
-                        choices=[i.value.name for i in Precision if i != Precision.Unknown],
-                        label="Precision",
-                        multiselect=False,
-                        value="float16",
-                        interactive=True,
-                    )
-                    weight_type = gr.Dropdown(
-                        choices=[i.value.name for i in WeightType],
-                        label="Weights type",
-                        multiselect=False,
-                        value="Original",
-                        interactive=True,
-                    )
-                    base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
-            submit_button = gr.Button("Submit Eval")
-            submission_result = gr.Markdown()
-            submit_button.click(
-                add_new_eval,
-                [
-                    model_name_textbox,
-                    base_model_name_textbox,
-                    revision_name_textbox,
-                    precision,
-                    weight_type,
-                    model_type,
-                ],
-                submission_result,
-            )
-    with gr.Row():
-        with gr.Accordion("📙 Citation", open=False):
-            citation_button = gr.Textbox(
-                value=CITATION_BUTTON_TEXT,
-                label=CITATION_BUTTON_LABEL,
-                lines=20,
-                elem_id="citation-button",
-                show_copy_button=True,
-            )
-scheduler = BackgroundScheduler()
-scheduler.add_job(restart_space, "interval", seconds=1800)
-scheduler.start()
-demo.queue(default_concurrency_limit=40).launch()
->>>>>>> 274bb98a1643b352ae5569c75aeb43fc9ca01625

 if __name__ == "__main__":
+    import os
+    # HF Spaces: leave share off (default). Docker / locked-down hosts: set GRADIO_SHARE=true.
     demo.launch(
+        server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
+        server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
+        share=os.environ.get("GRADIO_SHARE", "false").strip().lower() in ("1", "true", "yes"),
+    )
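The environment handling added in this hunk can be isolated into a small function so the parsing rules are easy to check without starting Gradio. The sketch below is an illustration only: `env_flag` and `launch_kwargs` are hypothetical names that do not appear in `app.py`, and the optional `environ` argument exists purely so the logic can be exercised with a plain dict. The variable names and defaults are the ones the hunk uses.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Values treated as "on", matching the tuple used in the hunk.
TRUTHY = ("1", "true", "yes")


def env_flag(name: str, environ: Mapping[str, str], default: str = "false") -> bool:
    """Interpret an environment variable as a boolean, ignoring case and padding."""
    return environ.get(name, default).strip().lower() in TRUTHY


def launch_kwargs(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build the keyword arguments that the hunk passes to demo.launch()."""
    env = os.environ if environ is None else environ
    return {
        "server_name": env.get("GRADIO_SERVER_NAME", "0.0.0.0"),
        "server_port": int(env.get("GRADIO_SERVER_PORT", "7860")),
        "share": env_flag("GRADIO_SHARE", env),
    }


if __name__ == "__main__":
    print(launch_kwargs())
```

With no variables set, this yields the Hugging Face Spaces defaults (`share` off, port 7860). Any value outside the truthy tuple, such as `on`, leaves `share` off.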

requirements.txt CHANGED

@@ -1,25 +1,9 @@
-<<<<<<< HEAD
-gradio>=4.0.0
 huggingface-hub==0.23.0
 pandas>=2.0.0
 matplotlib>=3.7.0
 numpy>=1.24.0
-plotly>=5.0.0
-=======
-APScheduler
-black
-datasets
-gradio
-gradio[oauth]
-gradio_leaderboard==0.0.13
-gradio_client
-huggingface-hub>=0.18.0
-matplotlib
-numpy
-pandas
-python-dateutil
-tqdm
-transformers
-tokenizers>=0.15.0
-sentencepiece
->>>>>>> 274bb98a1643b352ae5569c75aeb43fc9ca01625

+# Pin starlette<1: Gradio 4.44.x calls Starlette TemplateResponse with the pre-1.0
+# argument order; Starlette 1.0+ breaks that and triggers Jinja2 "unhashable type: dict".
+gradio>=4.44.1
+starlette>=0.37.0,<1.0.0
 huggingface-hub==0.23.0
 pandas>=2.0.0
 matplotlib>=3.7.0
 numpy>=1.24.0
+plotly>=5.0.0