Spaces: vLAR/PhysInOne-Leaderboard

vLAR committed 12 days ago

Commit

86fd0d5

1 Parent(s): 54708e8

v2

Files changed (14)

DESIGN_v0.5_archive.md +674 -0
Dockerfile.worker +0 -19
requirements.worker.txt +0 -5
src/envs.py +16 -38
src/leaderboard/read_evals.py +22 -75
src/storage/archive.py +0 -149
src/storage/hub.py +16 -102
src/storage/progress.py +0 -159
src/submission/check_validity.py +0 -60
src/submission/frontend.py +5 -24
src/worker/__init__.py +0 -0
src/worker/loop.py +0 -211
src/worker/runner.py +0 -291
worker.py +0 -173

DESIGN_v0.5_archive.md ADDED

	@@ -0,0 +1,674 @@

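+# 📋 PhysInOne Benchmark Multi-Task Automated Evaluation System: Design Document (v0.5)
+> **Maintainer**: [vLAR Group](https://vlar-group.github.io/) (HK PolyU)
+## 0. Key changes in v0.5 (relative to v0.4)
+| Change | v0.4 | v0.5 |
+|---|---|---|
+| Worker Space | To be created | **Created**: [`vLAR/PhysInOne-Leaderboard-Worker`](https://huggingface.co/spaces/vLAR/PhysInOne-Leaderboard-Worker) |
+| Persistent storage | None; all state was packed into the HF dataset | **The Worker mounts a persistent bucket at `/data`** (Spaces persistent storage), which holds all worker-side state |
+| GT / progress / lease / logs / archive | All stored in the HF dataset | **All moved to the Worker's `/data`**; the HF dataset no longer holds them |
+| Role of the HF dataset | Object store + state store + logs (three in one) | Only a **mailbox** between frontend and backend: `submissions/<sid>.json` (pending inbox) + `public/queue.json` (snapshot mirrored by the Worker) + `results/<task>/<team>_best.json` |
+| Where user ZIPs are downloaded | `/tmp/<sid>/...` | **A temporary directory under `/tmp`** (deleted right after evaluation), strictly isolated from `/data` |
+| Repository layout | Single repository (Frontend + Worker code together) | **This repository = Frontend Space only**; Worker code lives in a separate repository |
+| Queue scanning | The Frontend lists and downloads every `submissions/*.json` + `*.progress.json` file one by one | The Worker writes one `public/queue.json` per tick; **the Frontend downloads only this one file** |
+See §11.5 (cross-Space `/data` isolation) and §10 (code structure).
+---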
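+## 0.4. Key changes in v0.4 (relative to v0.3)
+| Change | v0.3 | v0.4 |
+|---|---|---|
+| Evaluation granularity | One submission = one ZIP, evaluated as a whole | **One submission = N scenes**; one ZIP per scene, downloaded / evaluated / written back **scene by scene** |
+| User identity | Self-reported `team_name` | **The owner of `user_dataset` is the identity ID**; `display_name` is for display only |
+| GT layout | Single file `ground_truth/<gt_file>` | **One directory per scene**, `ground_truth/<scene_id>/...`; each task declares its scene subset in `task_manifest/<task>.json` |
+| Resume after interruption | None; if the Worker died, the submission stayed stuck in `running` | **The Worker resumes on restart**, at scene granularity. Adds `submissions/<sid>.progress.json` |
+| Failure lease | None | `running` carries `lease_expires_at`; once it expires the submission can be reclaimed |
+| Log growth | Single file `logs/execution_logs.jsonl` | Sharded by month + size-based rotation; done records are archived periodically |
+| Multiple Workers | Single Worker | Optional horizontal scale-out (CAS prevents two workers from claiming the same sid) |
+| Frontend refresh | Manual button | The Queue tab adds `gr.Timer(15s)` auto-refresh |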
+---
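+## 1. Three-component architecture
+> **Note for v0.5 readers**: the ASCII topology below still describes the v0.4 state, in which the dataset played three roles at once (GT / progress / logs / archive all lived in the dataset). **From v0.5 on**, these artifacts all live in the Worker's `/data`, and the dataset keeps only `submissions/<sid>.json` / `public/queue.json` / `results/`. See §0 and §11.5.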
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│ User's browser                                                       │
+└──────────────────────────┬───────────────────────────────────────────┘
+                           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│ ① Frontend Space (PUBLIC, Gradio)                                    │
+│    vLAR/PhysInOne-Leaderboard                                        │
+│    - Submission form: display_name / task / user_dataset             │
+│    - Writes submissions/<sid>.json (status=pending) to ③             │
+│    - Reads results/ + submissions/ + progress/ and renders the       │
+│      leaderboard / queue / progress views                            │
+└──────────────────────────┬───────────────────────────────────────────┘
+                           │ huggingface_hub API
+                           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│ ③ Private Dataset (PRIVATE): sole persistence layer / message bus    │
+│    vLAR/PhysInOne-Leaderboard (HF dataset, PRIVATE)                  │
+│    submissions/<sid>.json          ← job record                      │
+│                                      (pending/running/done/failed)   │
+│    submissions/<sid>.progress.json ← per-scene results (CAS replace) │
+│    task_manifest/<task>.json       ← scene_id list used by the task  │
+│    ground_truth/<scene_id>/...     ← ground truth, one dir per scene │
+│    results/<task>/<team_id>_best.json ← all-time best (aggregated)   │
+│    logs/exec-YYYY-MM[.partNN].jsonl ← logs: monthly shards + rotation│
+│    archive/submissions-YYYY-MM/    ← archived old done/failed        │
+└──────────────────────────▲───────────────────────────────────────────┘
+                           │ huggingface_hub API (read + write)
+┌──────────────────────────┴───────────────────────────────────────────┐
+│ ② Worker Space (PRIVATE, Docker; backend only:                       │
+│    APScheduler + stdlib HTTP)                                        │
+│    vLAR/PhysInOne-Leaderboard-Worker                                 │
+│    - Startup: scan running with expired lease, reclaim (resume)      │
+│    - tick: scan pending, CAS claim (running + lease)                 │
+│    - For each scene still to do:                                     │
+│         hf_hub_download(user_dataset, <scene_id>.zip) → /tmp         │
+│         unzip → task.evaluate_scene(scene_dir, gt) → metrics         │
+│         CAS-append to progress.json (renews lease) → delete /tmp     │
+│    - All scenes processed → aggregate → status=done + update results/│
+│    - GET :7860/status returns JSON (worker_id / in_flight / state)   │
+└──────────────────────────────────────────────────────────────────────┘
+```
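+### 1.1 Design principles
+- **③ is the only communication bus**: ① and ② never talk to each other directly; everything goes through ③.
+- **The scene is the smallest atomic unit of evaluation**: every task is sliced by scene, which naturally supports resuming after interruption, streaming progress, and optional parallelism.
+- **The Worker assumes it can be killed at any time**: it depends on no in-memory state; process death → lease expiry → a restart or another Worker takes over automatically.
+- **Storage separation**: large user ZIPs stay in the user's own HF dataset; we record only a pointer.
+- **No archiving of user data**: `/tmp` is cleared as soon as a scene has been evaluated; uploaded user content is never cached.
+- **Identity = dataset owner**: inherently prevents name collisions and impersonation, with no OAuth required.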
+---
+## 2. Resource inventory for the three components
+| # | Resource | Type | Visibility | Name | Status |
+|---|---|---|---|---|---|
+| 1 | Frontend Space | Space (Gradio SDK) | **PUBLIC** | `vLAR/PhysInOne-Leaderboard` | ✅ Created |
+| 2 | Worker Space | Space (**Docker SDK, backend only**) + **persistent `/data`** | **PRIVATE** | [`vLAR/PhysInOne-Leaderboard-Worker`](https://huggingface.co/spaces/vLAR/PhysInOne-Leaderboard-Worker) | ✅ Created |
+| 3 | Private Dataset (**mailbox**) | Dataset | **PRIVATE** | `vLAR/PhysInOne-Leaderboard` | ✅ Created |
+### 2.1 Secrets
+| Space | Secret name | Required permissions |
+|---|---|---|
+| Frontend | `HF_TOKEN` | write on ③ |
+| Worker | `HF_TOKEN` | **write** on ③ + **read** on user datasets |
+### 2.2 Worker entry point
+The Worker Space uses **`sdk: docker`** (not gradio). README YAML:
+```yaml
+sdk: docker
+app_port: 7860
+```
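+The repository root provides [`Dockerfile.worker`](Dockerfile.worker) (HF Spaces automatically recognizes the name `Dockerfile`; if the `.worker` suffix is used, `dockerfile_path:` must be set explicitly in the Space README).
+Worker process:
+- APScheduler background loop running `tick()` / `archive` / `restart_space`
+- stdlib `http.server` exposes JSON health status at `:7860/status` (HF Spaces needs the process to listen on a port, otherwise the Space is judged to have crashed)
+- No dependency on Gradio and no UI rendering; all visualization lives in the Frontend Space
+> **On "does HF have buckets?"**: HF has **no S3-style buckets**, only three kinds of git repo: model / dataset / space (large files go through git-lfs). Our dataset plays three roles at once ("object store + state store + logs"), which is why §9 defines a dedicated **anti-bloat strategy**.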
+---
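+## 3. User identity model
+| Concept | Source | Purpose | Mutable? |
+|---|---|---|---|
+| `team_id` | `user_dataset.split("/")[0]` (HF account / org) | Dedup key for the all-time best; results path | **Immutable** |
+| `display_name` | Entered in the form | Display text on the leaderboard | Mutable; overwritten on every submission |
+The leaderboard shows `Awesome Team (userA)`; the sort key is `team_id`.
+> This fully solves the problem of two users both named Team_A overwriting each other, and it **needs no OAuth**: the dataset owner is strongly guaranteed by HF.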
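The identity rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `derive_team_id` and `leaderboard_label` are hypothetical helper names, and the regex is copied from `HF_REPO_ID_REGEX` in `src/envs.py`.

```python
import re

# Same pattern as HF_REPO_ID_REGEX in src/envs.py.
HF_REPO_ID_REGEX = re.compile(r"^[A-Za-z0-9_.\-]{1,96}/[A-Za-z0-9_.\-]{1,96}$")


def derive_team_id(user_dataset: str) -> str:
    """Return the dataset owner (the part before '/') as the immutable team_id."""
    if not HF_REPO_ID_REGEX.match(user_dataset):
        raise ValueError(f"not a valid 'owner/repo' id: {user_dataset!r}")
    return user_dataset.split("/")[0]


def leaderboard_label(display_name: str, team_id: str) -> str:
    """Format the leaderboard cell, e.g. 'Awesome Team (userA)'."""
    return f"{display_name} ({team_id})"
```

Because `team_id` comes from the repo id rather than from the form, two teams that pick the same `display_name` still map to different result files.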
+---
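+## 4. Expected structure of the user dataset
+```
+<user_dataset>/                      ← the user's own HF dataset (public recommended)
+├── README.md                        ← HF dataset card
+└── predictions/
+    └── <task_name>/
+        ├── scene_001.zip
+        ├── scene_002.zip
+        └── ...                      ← one ZIP per scene
+```
+- Users **only need to prepare ZIPs for the tasks they want to enter**; a scene that was not submitted is recorded as `missing` in `progress.json` and is reflected in the aggregated metric (according to the aggregation policy, e.g. a missing scene counts as 0, or is skipped).
+- A single scene ZIP should preferably be < 200MB; the upper limit is configurable.
+- The internal ZIP structure is self-described by each task's `expected_scene_layout` (§6).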
+---
+## 5. Data schema of ③
+> **Note for v0.5 readers**: sections 5.1–5.7 below describe the v0.4 layout, in which the dataset stored everything. **Under v0.5**:
+> * `5.1 submissions/<sid>.json` exists in the dataset only during the pending stage; once the Worker claims it, it is moved to `/data/inflight/<sid>.json` and deleted from the dataset. On submit, the Frontend writes only the envelope (`submission_id` / `team_id` / `display_name` / `task_name` / `user_dataset` / `submitted_at` / `status=pending` / `primary_metric`).
+> * `5.2 progress.json` / `5.3 task_manifest/` / `5.4 ground_truth/` / `5.6 logs/` / `5.7 archive/` **no longer exist** in the dataset; the Worker maintains equivalent files locally under `/data`.
+> * New: `5.8 public/queue.json`, a UI snapshot that the Worker rewrites every tick. See §11.5.
+### 5.1 `submissions/<sid>.json`: submission record
+```json
+{
+  "submission_id": "8e5f6g7h",
+  "team_id": "userA",
+  "display_name": "Awesome Team",
+  "task_name": "task_future_prediction",
+  "user_dataset": "userA/my-results",
+  "submitted_at": "2026-05-20T10:00:00Z",
+  "status": "pending",
+  "started_at": null,
+  "completed_at": null,
+  "lease_expires_at": null,
+  "primary_metric": "psnr",
+  "score": null,
+  "metrics": null,
+  "scenes_total": 50,
+  "scenes_done": 0,
+  "scenes_failed": 0,
+  "error_message": null,
+  "worker_attempt": 0
+}
+```
+**State machine**:
+```
+[pending] ──CAS claim──▶ [running] ──all scenes settled──▶ [done]
+                            │
+                            ├──fatal err──▶ [failed]
+                            │
+                            └──lease expired──▶ still [running]
+                                  ↑                │
+                                  └────CAS reclaim─┘  another worker / a restart takes over
+```
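The state machine can be encoded as a small transition table. The sketch below is illustrative only; `ALLOWED` and `can_transition` are hypothetical names, and the `running → running` entry models a lease reclaim, which changes the lease holder but not the status.

```python
from typing import Dict, Set

# Allowed status transitions, read off the state machine above.
# "running" -> "running" is a lease reclaim by another worker or after a restart.
ALLOWED: Dict[str, Set[str]] = {
    "pending": {"running"},
    "running": {"running", "done", "failed"},
    "done": set(),
    "failed": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the state machine permits moving from `current` to `new`."""
    return new in ALLOWED.get(current, set())
```

A guard like this would typically run inside the CAS update, so that a stale writer cannot move a `done` record back to `running`.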
+### 5.2 `submissions/<sid>.progress.json`: scene-level intermediate results
+```json
+{
+  "submission_id": "8e5f6g7h",
+  "scenes_total": 50,
+  "scene_results": {
+    "scene_001": {"metrics": {"psnr": 31.2}, "ts": "2026-05-20T10:01:00Z"},
+    "scene_002": {"metrics": {"psnr": 29.8}, "ts": "2026-05-20T10:02:00Z"}
+  },
+  "scene_failures": {
+    "scene_003": "missing: scene_003.zip in user dataset"
+  },
+  "last_worker": "worker-<uuid>",
+  "last_updated_at": "2026-05-20T10:02:00Z"
+}
+```
+> Each completed scene triggers a CAS overwrite. It is "one commit per scene", but all scene data lives in **one file**, so the directory never blows up.
+### 5.3 `task_manifest/<task>.json`
+```json
+{
+  "task_name": "task_future_prediction",
+  "scene_ids": ["scene_001", "scene_002", "..."],
+  "primary_metric": "psnr",
+  "scene_layout": "predictions/<task>/<scene_id>.zip"
+}
+```
+Maintained by the task owner; the Worker reads it when it starts each submission to decide which scenes to fetch.
+### 5.4 `ground_truth/<scene_id>/...`
+Uploaded by the task owner. The Worker calls `hf_hub_download(filename=f"ground_truth/{scene_id}/...")` on demand, backed by an **in-process LRU** (configurable capacity, e.g. 16 scenes) to avoid repeated downloads.
+### 5.5 `results/<task>/<team_id>_best.json`
+```json
+{
+  "team_id": "userA",
+  "display_name": "Awesome Team",
+  "task_name": "task_future_prediction",
+  "score": 32.41,
+  "primary_metric": "psnr",
+  "metrics": {"psnr": 32.41, "ssim": 0.91, "lpips": 0.12},
+  "scenes_count": 50,
+  "updated_at": "2026-05-20T10:05:00Z",
+  "submission_id": "8e5f6g7h"
+}
+```
+Default aggregation: arithmetic mean of each metric across scenes. A task can override `aggregate_fn` in its plugin.
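A minimal sketch of the default aggregation, assuming the input maps `scene_id` to that scene's metrics dict (the shape of the `aggregate_fn` argument in §6). `default_aggregate` is a hypothetical name, and skipping scenes with no result is an assumption here, since the policy for missing scenes is still listed as open.

```python
from collections import defaultdict
from typing import Dict


def default_aggregate(scene_results: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Arithmetic mean of each metric key over the scenes that report it."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for metrics in scene_results.values():
        for key, value in metrics.items():
            sums[key] += value
            counts[key] += 1
    # Scenes absent from scene_results are simply skipped (assumption).
    return {key: sums[key] / counts[key] for key in sums}
```

A task that wants missing scenes to count as 0 would instead divide by the manifest's scene count in its own `aggregate_fn`.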
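+### 5.6 `logs/exec-YYYY-MM[.partNN].jsonl`: logs
+- Files are split by **year-month**; when the current month's file exceeds 50MB, logging rolls over to `.part02.jsonl` / `.part03.jsonl`.
+- Append only (CAS read → concatenate → write).
+- Past months are read-only and never touched again.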
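The naming and rotation rule can be sketched as below. `log_path` and `next_log_path` are hypothetical helpers; the 50 MB threshold matches the value stated above (and `LOG_FILE_MAX_BYTES` in the v0.4 `src/envs.py`).

```python
from datetime import datetime

LOG_FILE_MAX_BYTES = 50 * 1024 * 1024  # 50 MB


def log_path(now: datetime, part: int = 1) -> str:
    """Part 1 is logs/exec-YYYY-MM.jsonl; later parts get a .partNN suffix."""
    stem = f"logs/exec-{now:%Y-%m}"
    return f"{stem}.jsonl" if part == 1 else f"{stem}.part{part:02d}.jsonl"


def next_log_path(now: datetime, current_part: int, current_size: int) -> str:
    """Roll over to the next part once the current file exceeds the size limit."""
    if current_size > LOG_FILE_MAX_BYTES:
        current_part += 1
    return log_path(now, current_part)
```

The caller is assumed to reset `current_part` to 1 when the month changes, which gives the monthly sharding for free.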
+### 5.7 `archive/submissions-YYYY-MM/`
+A scheduled job moves `done` / `failed` `submissions/<sid>.{json,progress.json}` older than 30 days into the archive directory, keeping the main directory small and fast.
+### 5.8 `public/queue.json`: Worker → Frontend mirror (**new in v0.5**)
+```json
+{
+  "generated_at": "2026-05-20T10:05:00Z",
+  "worker_id": "worker-<uuid>",
+  "rows": [
+    {
+      "Submitted At": "2026-05-20T10:00:00Z",
+      "ID": "8e5f6g7h",
+      "Team": "Awesome Team",
+      "Owner": "userA",
+      "Task": "task_future_prediction",
+      "Status": "running",
+      "Position": "running",
+      "Scenes": "23/50 (+1 fail)",
+      "Score": null,
+      "Error": ""
+    }
+  ]
+}
+```
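+- Every tick the Worker serializes all in-progress submissions plus the N most recent completed ones from `/data` into this one file, written with a CAS overwrite.
+- The Frontend's `src/leaderboard/read_evals.py::load_queue_df` downloads only this file and displays it sorted by `Submitted At`, newest first.
+- Field semantics stay consistent with the `Position` computation in §7.3.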
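On the Frontend side, reading the snapshot reduces to parsing one JSON document and sorting its rows. The sketch below uses only the standard library and a hypothetical `load_queue_rows` helper; the real `load_queue_df` returns a pandas DataFrame, which is not reproduced here.

```python
import json
from typing import List


def load_queue_rows(snapshot_text: str, limit: int = 100) -> List[dict]:
    """Parse a public/queue.json payload and return its rows, newest first."""
    payload = json.loads(snapshot_text)
    rows = payload.get("rows", [])
    # ISO-8601 UTC timestamps in one fixed format sort correctly as strings.
    rows = sorted(rows, key=lambda r: r.get("Submitted At", ""), reverse=True)
    return rows[:limit]
```

Sorting on the string works only while every `Submitted At` value uses the same `YYYY-MM-DDTHH:MM:SSZ` format, as in the example payload.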
+---
+## 6. Task plugin interface (revised in v0.4)
+```python
+from dataclasses import dataclass, field
+from typing import Any, Callable, Dict, List, Optional
+@dataclass
+class TaskPlugin:
+    name: str
+    display_name: str
+    description: str
+    primary_metric: str
+    # Scene-level evaluation contract -----------------------------------
+    # Required fields must precede defaulted ones, otherwise @dataclass raises
+    # "non-default argument follows default argument".
+    expected_scene_layout: str            # participant-facing description of one scene ZIP's layout
+    validate_scene_fn:   Callable[[str], None]              # validate(scene_sandbox_dir)
+    evaluate_scene_fn:   Callable[[str, Any], Dict[str, float]]
+                                          # evaluate(scene_sandbox_dir, gt_scene) -> metrics
+    load_gt_scene_fn:    Callable[[str, str], Any]
+                                          # (scene_id, gt_dir) -> in-memory GT object
+    higher_is_better: bool = True
+    leaderboard_columns: List[str] = field(default_factory=list)
+    # Aggregation -------------------------------------------------------
+    aggregate_fn: Optional[Callable[[Dict[str, Dict[str, float]]], Dict[str, float]]] = None
+        # Defaults to the arithmetic mean per metric key; a task may override it
+        # (e.g. weighted / median / missing scenes counted as zero)
+```
+> v0.3's whole-submission `validate(sandbox_dir)` / `evaluate(sandbox_dir, gt)` are **removed**; the scene is now the smallest unit.
+---
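+## 7. Worker processing flow (with resume)
+```python
+# Sketch only. now, list_submissions, read_manifest, read_progress, the CAS_* helpers,
+# TempDir, safe_extract_zip, lru_load_gt_scene, sanitize, shutdown_requested, append_log
+# and the `task` plugin object are assumed to be provided by the worker package.
+import os
+from huggingface_hub import hf_hub_download
+from huggingface_hub.utils import EntryNotFoundError
+# ====== boot ======
+def on_startup():
+    # Reclaim every running submission whose lease has expired (its previous worker died).
+    for sub in list_submissions(status="running"):
+        if sub.lease_expires_at < now():
+            CAS_reclaim(sub)
+# ====== periodic tick (every POLL_INTERVAL_SECONDS) ======
+def tick():
+    for sub in list_submissions(status="pending", order_by="submitted_at"):
+        if CAS_claim(sub):       # pending → running + lease
+            dispatch(sub)
+# ====== dispatch (per submission) ======
+def dispatch(sub):
+    manifest = read_manifest(sub.task)    # task_manifest/<sub.task>.json
+    progress = read_progress(sub.sid)     # submissions/<sub.sid>.progress.json (or empty)
+    settled  = set(progress.scene_results) | set(progress.scene_failures)
+    todo     = [s for s in manifest.scene_ids if s not in settled]
+    for scene_id in todo:
+        if shutdown_requested():
+            return                        # graceful exit; resumed on the next startup
+        zip_path = None                   # stays None when the download itself fails
+        try:
+            zip_path = hf_hub_download(
+                sub.user_dataset, f"predictions/{sub.task}/{scene_id}.zip",
+                repo_type="dataset",
+            )
+            with TempDir() as sandbox:
+                safe_extract_zip(zip_path, sandbox)
+                task.validate_scene(sandbox)
+                gt = lru_load_gt_scene(sub.task, scene_id)
+                metrics = task.evaluate_scene(sandbox, gt)
+            CAS_progress_add_result(sub.sid, scene_id, metrics)  # also renews the lease
+        except EntryNotFoundError:
+            CAS_progress_add_failure(sub.sid, scene_id, "missing zip")
+        except Exception as e:
+            CAS_progress_add_failure(sub.sid, scene_id, sanitize(e))
+        finally:
+            if zip_path is not None:
+                try:
+                    os.remove(zip_path)
+                except OSError:
+                    pass
+    # All scenes settled → aggregate and finalize.
+    finalize(sub, read_progress(sub.sid))
+def finalize(sub, progress):
+    agg = task.aggregate(progress.scene_results)
+    # The remaining submission fields are elided in the original design.
+    CAS_update_submission(sub.sid, status="done", score=agg[sub.primary_metric], metrics=agg)
+    CAS_update_best_if_improved(sub.team_id, sub.task, agg)
+    append_log(...)
+```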
+### 7.1 Lease mechanism
+- On claim, write `lease_expires_at = now + LEASE_SECONDS` (default 5 min).
+- After each completed scene, renew to `now + LEASE_SECONDS`.
+- When another worker's tick sees `running && lease_expires_at < now`, it uses CAS to take the submission over for itself.
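The lease arithmetic is simple enough to sketch directly. `new_lease` and `lease_expired` are hypothetical names; timestamps use the same `...Z` format as the JSON examples in §5, and treating a missing lease as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

LEASE_SECONDS = 300  # 5 min default
_FMT = "%Y-%m-%dT%H:%M:%SZ"


def new_lease(now: datetime) -> str:
    """Value to write on claim and again after every completed scene."""
    return (now + timedelta(seconds=LEASE_SECONDS)).strftime(_FMT)


def lease_expired(lease_expires_at: Optional[str], now: datetime) -> bool:
    """True if a running submission may be reclaimed by another worker."""
    if lease_expires_at is None:
        return True  # assumption: a running record without a lease is reclaimable
    expires = datetime.strptime(lease_expires_at, _FMT).replace(tzinfo=timezone.utc)
    return expires < now
```

Both functions expect a timezone-aware UTC `now`, for example `datetime.now(timezone.utc)`.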
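+### 7.2 Multiple Workers
+- Simply run more Worker Space instances; CAS guarantees that two workers never claim the same sid (the losing claim fails).
+- Within one worker, `ThreadPoolExecutor(max_workers=WORKER_CONCURRENCY)` provides **parallelism across submissions** (default 2); re-entry on the same sid is prevented twice over, by the `_in_flight` set and by CAS.
+- Parallelism **within** a submission (several scenes of one submission running at once) is currently **not implemented**; HF Space CPU quota is limited, so the gain would be small.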
+### 7.4 Flowchart (with the resume path)
+```mermaid
+flowchart TD
+    BOOT([Space starts]) --> BR[boot_reclaim:<br/>scan status=running<br/>with expired lease]
+    BR -- "CAS renew lease<br/>worker_attempt++" --> DISPATCH
+    SCHED([APScheduler tick<br/>every POLL_INTERVAL_SECONDS]) --> SNAP
+    SNAP[snapshot_download<br/>submissions/*.json] --> PEND{any pending?}
+    PEND -- no --> SCHED
+    PEND -- "yes<br/>FIFO by submitted_at" --> CAS1
+    CAS1[/"CAS: pending → running<br/>lease_expires_at = now+5min"/]
+    CAS1 -- "409/412 someone else won" --> PEND
+    CAS1 -- OK --> DISPATCH
+    DISPATCH[ThreadPoolExecutor<br/>max=WORKER_CONCURRENCY]
+    DISPATCH --> PM
+    subgraph process_submission["process_submission(sid)  ◇ several sids may run concurrently"]
+        direction TB
+        PM[read task_manifest<br/>scene_ids list] --> PI[load_progress<br/>**resume entry point**]
+        PI --> LOOP{iterate scene_ids}
+        LOOP -- "already in scene_results<br/>or scene_failures" --> LOOP
+        LOOP -- unprocessed --> DL
+        DL[hf_hub_download<br/>scene.zip from user_dataset]
+        DL -- 404/403 --> FAIL
+        DL -- OK --> VZ
+        VZ[validate_zip_filesize<br/>safe_extract_zip → sandbox]
+        VZ -- "over limit / path traversal" --> FAIL
+        VZ -- OK --> EVAL
+        EVAL[task.validate_scene<br/>get_gt_scene with LRU cache<br/>task.evaluate_scene → metrics]
+        EVAL -- exception --> FAIL
+        EVAL -- OK --> OK1
+        OK1[add_scene_result<br/>CAS write progress.json] --> RENEW
+        FAIL[add_scene_failure<br/>CAS write progress.json] --> RENEW
+        RENEW[renew_lease<br/>CAS renew lease +5min] --> LOOP
+        LOOP -- "all settled" --> AGG
+        AGG[aggregate per-scene metrics] --> FIN
+        FIN[finalize_submission<br/>status=done/failed<br/>clear lease]
+        FIN --> BEST[CAS update<br/>results/task/team_best.json<br/>only if better]
+        BEST --> LOG[append_log<br/>logs/exec-YYYY-MM.jsonl]
+    end
+    LOG --> SCHED
+    %% graceful shutdown
+    SIG([SIGTERM/SIGINT]) -.-> SD[request_shutdown<br/>sets _shutdown_event]
+    SD -.-> LOOP
+    LOOP -.-> |"shutdown_requested<br/>return immediately"| EXIT([process exits<br/>status stays running<br/>next boot_reclaim takes over])
+```
+**The key path for resuming** (marked `**resume entry point**` in the diagram): the `scene_results` / `scene_failures` that `load_progress` reads are all results that were already persisted; `LOOP` skips every settled scene, so re-entry on the same sid by any worker is idempotent.
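The idempotency argument comes down to one set difference. A minimal sketch, assuming `progress` is the parsed `progress.json` dict from §5.2 and `remaining_scenes` is a hypothetical helper name:

```python
from typing import Dict, List


def remaining_scenes(scene_ids: List[str], progress: Dict) -> List[str]:
    """Scenes still to evaluate: everything in the manifest that is not yet settled."""
    settled = set(progress.get("scene_results", {})) | set(progress.get("scene_failures", {}))
    # Manifest order is preserved so that every worker walks scenes identically.
    return [s for s in scene_ids if s not in settled]
```

Calling it twice on the same progress file yields the same list, which is why a second worker that re-enters the submission repeats no settled scene.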
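+### 7.3 User view: "how many pending submissions are ahead of me"
+The frontend Queue table computes an extra `Position` column for each record:
+```
+position(my) = #(other pending with submitted_at < my.submitted_at) + #(all running)
+```
+It also renders an "X / N scenes done" progress bar per submission (read from progress.json).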
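The formula translates directly into code. This is a sketch with a hypothetical `queue_position` helper; it assumes each submission dict carries `submission_id`, `status` and an ISO-8601 `submitted_at`, as in §5.1.

```python
from typing import Dict, List


def queue_position(my_sid: str, submissions: List[Dict]) -> int:
    """Number of earlier pending submissions plus all running ones."""
    me = next(s for s in submissions if s["submission_id"] == my_sid)
    pending_ahead = sum(
        1
        for s in submissions
        if s["status"] == "pending"
        and s["submission_id"] != my_sid
        and s["submitted_at"] < me["submitted_at"]
    )
    running = sum(1 for s in submissions if s["status"] == "running")
    return pending_ahead + running
```

A result of 0 means nothing is ahead: no submission is running and no earlier pending submission exists.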
+---
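+## 8. Security
+| Risk | Mitigation |
+|---|---|
+| ZIP bomb (per scene) | Limits on both entry count and bytes + streaming validation |
+| Path traversal / symlinks | Rejected outright |
+| User dataset does not exist / file missing | Recorded per scene in `scene_failures`; other scenes are unaffected |
+| GT leakage | On a scene evaluation exception only "internal evaluation error" is written; the traceback goes to the worker log only |
+| Concurrency conflicts | Per-file CAS (`parent_commit`); only one worker holds the lease for a given sid |
+| Worker crash | Lease expiry + idempotent scene-level resume |
+| Token leakage | Stored only in Space Secrets; never written to any log |
+| Name collision / impersonation | `team_id = dataset owner` replaces self-reported names |
+| Input validation (frontend) | Regex: `user_dataset` has the form `user/repo`, `display_name` is 2-32 characters long, `task_name` is one of the registered tasks |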
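The ZIP-bomb and path-traversal rows can be illustrated with the standard `zipfile` module. This is a hedged sketch, not the Worker's actual `safe_extract_zip`: the signature is an assumption, and the default limits mirror `MAX_ZIP_ENTRIES` and `MAX_EXTRACTED_BYTES` from the v0.4 `src/envs.py`.

```python
import os
import stat
import zipfile


def safe_extract_zip(zip_path: str, dest_dir: str,
                     max_entries: int = 100_000,
                     max_bytes: int = 2 * 1024 ** 3) -> int:
    """Extract into dest_dir, enforcing entry/byte limits; returns bytes written."""
    dest_root = os.path.realpath(dest_dir)
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
        if len(infos) > max_entries:
            raise ValueError("too many entries in archive")
        for info in infos:
            # Unix mode bits live in the high 16 bits of external_attr.
            if stat.S_ISLNK(info.external_attr >> 16):
                raise ValueError(f"symlink entry rejected: {info.filename}")
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            if target != dest_root and not target.startswith(dest_root + os.sep):
                raise ValueError(f"path traversal rejected: {info.filename}")
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Stream in chunks and count real bytes instead of trusting headers.
            with zf.open(info) as src, open(target, "wb") as dst:
                while True:
                    chunk = src.read(1 << 16)
                    if not chunk:
                        break
                    total += len(chunk)
                    if total > max_bytes:
                        raise ValueError("extracted size exceeds limit")
                    dst.write(chunk)
    return total
```

On failure a partially written file may remain, so the caller is expected to extract into a throwaway sandbox directory and delete it afterwards, as the flow in §7 does.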
+---
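+## 9. Anti-bloat strategy (the practical constraint that an HF dataset is a git repo)
+> There are no buckets, so we must actively manage commit history and file count.
+| Strategy | Implementation |
+|---|---|
+| Single-file CAS overwrite | One `progress.json` holds all scene results of a submission; scenes are **not** stored as separate files |
+| Monthly log sharding | Main path `logs/exec-YYYY-MM.jsonl`; a new file is created when the month changes |
+| Log size rotation | Current-month file > 50MB → `exec-YYYY-MM.part02.jsonl` |
+| Main directory archiving | A cron job moves done/failed `submissions/<sid>.*` older than 30 days to `archive/submissions-YYYY-MM/` |
+| Short commit messages | Makes it easy to filter git log by path |
+| Frontend reads of ③ | Served through the HF CDN; the cache layer absorbs the vast majority of read traffic |
+| Emergency fallback (deferred to v0.5) | Move logs/archive to Cloudflare R2 / Backblaze B2; ③ keeps only authoritative results and best snapshots |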
+---
+## 10. Code structure
+As of v0.5, **this repository = Frontend Space only**; the Worker lives in the separate repository `vLAR/PhysInOne-Leaderboard-Worker`.
+### 10.1 This repository (Frontend Space)
+```
+.
+├── app.py                       # Gradio entry point (Leaderboard / Submit / Queue / About)
+├── requirements.txt             # gradio + huggingface-hub + pandas + apscheduler
+├── DESIGN.md
+├── README.md
+└── src/
+    ├── envs.py                  # frontend-only config: repo IDs + mailbox paths + form regexes
+    ├── about.py                 # UI copy
+    ├── populate.py              # thin wrapper for the UI layer
+    ├── display/                 # CSS / table formatting
+    ├── tasks/                   # task registry (the frontend uses display_name / primary_metric)
+    │   ├── base.py
+    │   ├── _template/
+    │   ├── task_future_prediction/
+    │   └── task_property_estimation/
+    ├── submission/
+    │   └── frontend.py          # validates and writes the pending submission to the mailbox
+    ├── storage/
+    │   └── hub.py               # minimal HF I/O: download_file / list / upload_json
+    └── leaderboard/
+        └── read_evals.py        # reads results/ + the public/queue.json snapshot
+```
+**Deleted (v0.4 → v0.5)**: `worker.py`, `Dockerfile.worker`, `requirements.worker.txt`, `src/worker/`, `src/storage/progress.py`, `src/storage/archive.py`, `src/submission/check_validity.py`. All of these moved to the Worker repository.
+### 10.2 Worker repository (a separate repo)
+```
+.
+├── worker.py                    # stdlib http.server + APScheduler
+├── Dockerfile                   # HF Spaces Docker SDK entry point
+├── requirements.txt             # huggingface-hub + apscheduler (no gradio)
+└── src/
+    ├── envs.py                  # /data paths + full lease/poll/limit config
+    ├── worker/
+    │   ├── loop.py              # tick + reclaim; scans the dataset's submissions/ inbox
+    │   ├── runner.py            # scene pipeline (writes /data, downloads to /tmp)
+    │   └── lease.py             # lease backed by local files under /data
+    ├── storage/
+    │   ├── local.py             # progress / lease / archive on /data
+    │   ├── mirror.py            # writes public/queue.json + results/ to the dataset every tick
+    │   └── gt.py                # GT resident on /data; no repeated downloads
+    └── submission/
+        └── check_validity.py    # safe extraction (operates in /tmp)
+```
+---
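+## 11. Design decision FAQ
+This section records the repeated deliberation over the architecture during development, so that future maintainers understand "why not do it this way".
+### 11.1 Can files in an HF dataset be modified, or only added and deleted?
+**They can be modified (overwritten in place).** An HF dataset is essentially a git repository, and `upload_file` supports repeated writes to the same path (each write creates a new commit that replaces the old content). This system implements "state transitions" by calling `upload_file` on `submissions/<sid>.json` repeatedly to change the `status` field. The only practical constraint: every write is a git push (an HTTP request), so the write rate should not be too high.
+### 11.2 Why not a "multi-queue" design (4 separate list files)?
+An intuitively "cleaner" design would maintain 4 list files:
+```
+pending_queue.json    [sid, sid, ...]
+processing_list.json  [sid, sid]
+complete_list.json    [sid (done), sid (failed), ...]
+status/<sid>.json     detailed progress
+```
+**Root cause: no atomicity.**
+Moving a sid from pending to processing takes two steps:
+1. Edit `pending_queue.json` (remove the sid)
+2. Edit `processing_list.json` (add the sid)
+If the Worker crashes or a concurrent write happens between the two steps, the sid can vanish or be claimed twice. Fixing that requires distributed transactions, which are extremely costly to implement on top of git storage.
+**There is also write amplification:** `complete_list.json` grows without bound; each append means "read the whole list + append + write back", so the cost of one write grows linearly with the number of past submissions N, and the cumulative volume written grows as O(N²).
+The current design folds the "4 logical queues" into a single file per sid, uses the `status` field to distinguish the views, and performs atomic CAS at single-file granularity, which avoids both problems.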
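The single-file CAS idea can be demonstrated with an in-memory stand-in. `VersionedStore` and `claim` below are hypothetical and only illustrate the semantics; per §8 the real system relies on `parent_commit` against the HF dataset, which is not reproduced here.

```python
import threading
from typing import Dict, Tuple


class VersionedStore:
    """Toy per-file CAS: a write succeeds only if the caller saw the latest revision."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._files: Dict[str, Tuple[int, dict]] = {}

    def read(self, path: str) -> Tuple[int, dict]:
        with self._lock:
            rev, data = self._files.get(path, (0, {}))
            return rev, dict(data)

    def write_if_unchanged(self, path: str, expected_rev: int, data: dict) -> bool:
        with self._lock:
            rev, _ = self._files.get(path, (0, {}))
            if rev != expected_rev:
                return False  # someone else committed first
            self._files[path] = (rev + 1, dict(data))
            return True


def claim(store: VersionedStore, sid: str, worker_id: str) -> bool:
    """pending -> running in one atomic step on a single file."""
    path = f"submissions/{sid}.json"
    rev, record = store.read(path)
    if record.get("status") != "pending":
        return False
    record.update(status="running", last_worker=worker_id)
    return store.write_if_unchanged(path, rev, record)
```

Since the status lives in one file, there is no window in which a sid has left one list and not yet reached another: a claim either commits entirely or fails.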
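+### 11.3 The current architecture restated in "four-queue" terms
+| Four-queue concept | Counterpart in the current implementation |
+|---|---|
+| pending queue | All `submissions/<sid>.json` with `status=="pending"` |
+| processing list | All `submissions/<sid>.json` with `status=="running"` |
+| complete list | All `submissions/<sid>.json` with `status in {"done","failed"}`; archived to `archive/submissions-YYYY-MM/` after 30 days |
+| processing task status | `submissions/<sid>.progress.json` (scene-level intermediate results, replaced as a whole via CAS) |
+Operation mapping:
+| Four-queue operation | Current implementation |
+|---|---|
+| append to the pending queue | `upload_file("submissions/<sid>.json", {status: "pending"})` |
+| take from pending and add to processing | CAS `status: "pending" → "running"` + write `lease_expires_at` |
+| add to the complete list when processing finishes | CAS `status: "running" → "done"/"failed"` |
+| prioritize processing on restart | `boot_reclaim`: scan `status=="running"` with an expired lease |
+`progress.json` is kept separate from the main `submissions/<sid>.json` because the former is written once per completed scene (high frequency), while the latter is written only at the three state transitions (low frequency). Splitting them keeps high-frequency writes from polluting the commit history of the low-frequency file and avoids spurious CAS conflicts.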
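+### 11.4 Worker scan efficiency optimization (implemented)
+**Original design**: every tick called `snapshot_download(submissions/**)` to download all files in full (including historical done/failed).
+**Problem**: as past submissions accumulate, the number of HTTP etag checks per tick grows linearly, and most of them are for finished jobs that no longer matter.
+**Current optimization** (`src/worker/loop.py`):
+```
+keep an in-memory set _settled_sids (sids that are done/failed)
+tick():
+  API.list_repo_files()  ← 1 API call, returns file names only, no data
+  filter out _settled_sids ← pure in-memory
+  for each remaining sid: hf_hub_download(force_download=False)  ← etag cache, no data transferred if unchanged
+  at finalize(): _mark_settled(sid)
+```
+In steady state (many historical done/failed), the data actually downloaded converges to O(pending + running) rather than O(all past submissions). After a Worker restart `_settled_sids` is empty; the first tick downloads all files again and repopulates the set, after which the efficient mode resumes.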
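+### 11.5 Cross-Space `/data` isolation → mailbox mirroring strategy (v0.5)
+**Fact**: HF Spaces persistent storage is **private to a single Space**; the Frontend Space cannot see the Worker Space's `/data`.
+**Problem**: once v0.5 moves GT / progress / lease / logs / archive into the Worker's `/data`, how does the Frontend render the queue?
+**Option A (adopted)**: **one-way mirroring by the Worker**. Every tick the Worker serializes the current queue state into `public/queue.json` and writes it back to the HF dataset; the Frontend only needs `download_file("public/queue.json", force=False)`, and an etag hit transfers zero bytes.
+  * Pros: keeps "③ is the only communication bus" intact; the frontend code shrinks to a single-file read; when the Worker is offline the Frontend shows a stale snapshot instead of an error.
+  * Cons: the queue lags by at most one tick (~30s). Acceptable.
+**Option B (rejected)**: have the Frontend call the Worker's `:7860/queue` endpoint over HTTP.
+  * Cons: private Frontend → Worker Space access needs a token, and the frontend gets 5xx immediately whenever the Worker restarts; debugging complexity rises significantly.
+**Mailbox (dataset) schema after convergence**:
+| Path | Writer | Reader | Notes |
+|---|---|---|---|
+| `submissions/<sid>.json` | Frontend | Worker | Exists only during the pending stage; once claimed, the Worker immediately copies it to `/data/inflight/` and deletes it from the dataset |
+| `public/queue.json` | Worker | Frontend | UI snapshot of the whole table (including status / progress / position) |
+| `results/<task>/<team>_best.json` | Worker | Frontend | All-time best; rewritten only when the score improves |
+All of v0.4's `submissions/<sid>.progress.json` / `task_manifest/` / `ground_truth/` / `logs/` / `archive/` **no longer exist** in the dataset; their equivalents live under the Worker's `/data`.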
+---
+## 12. ✅ Full TODO list
+### A. Platform resources
+- [x] Frontend Space `vLAR/PhysInOne-Leaderboard`: **Public**, Gradio SDK.
+- [x] Private Dataset `vLAR/PhysInOne-Leaderboard`: **Private**.
+- [x] Worker Space `vLAR/PhysInOne-Leaderboard-Worker`: **Private**, **Docker SDK**, **persistent `/data` mounted**.
+- [ ] Worker Space → Secrets: `HF_TOKEN` (write on the private dataset + read on user datasets).
+- [ ] Frontend Space → Secrets: `HF_TOKEN` (write on the private dataset).
+### B. Private dataset initialization
+- [ ] Upload placeholders `submissions/.gitkeep` / `archive/.gitkeep` / `logs/.gitkeep`.
+- [ ] Upload `task_manifest/<task>.json` (one per task).
+- [ ] Upload `ground_truth/<scene_id>/...` (task owner).
+### C. Code (v0.4 rewrite, **to be implemented in this round**)
+- [ ] `src/tasks/base.py`: switch to the scene-level interface (`evaluate_scene` / `aggregate_fn`).
+- [ ] `src/storage/progress.py`: CAS helper for progress.json (including lease renewal).
+- [ ] `src/worker/lease.py`: claim / renew / reclaim.
+- [ ] `src/worker/runner.py`: scene pipeline + graceful `SIGTERM` shutdown.
+- [ ] `src/worker/loop.py`: tick + startup hook (reclaim expired leases).
+- [ ] `src/storage/archive.py`: cron job for archiving + log rotation.
+- [ ] `src/submission/frontend.py`: drop `filename`, add `display_name`, derive `team_id` from the owner.
+- [ ] `src/leaderboard/read_evals.py`: add progress + position computation.
+- [ ] `app.py`: form field changes + per-submission progress bar + Timer auto-refresh.
+- [ ] Convert the existing `_template` and the two placeholder tasks to scene level.
+### D. Task side (collaborators)
+- [ ] Copy `src/tasks/_template/` and implement `validate_scene` / `evaluate_scene` / `load_gt_scene`.
+- [ ] Upload `task_manifest/<task>.json` declaring the scene_ids used.
+- [ ] Upload `ground_truth/<scene_id>/...`.
+- [ ] Provide a markdown description of the internal structure of the user ZIP (`expected_scene_layout`).
+### E. Open items / awaiting user confirmation
+- [ ] **Worker GPU**: default CPU basic.
+- [ ] **POLL_INTERVAL_SECONDS**: default 30s.
+- [ ] **LEASE_SECONDS**: default 300s (5 min).
+- [ ] **Automatic retry of failed scenes**: no retry by default; an operator can delete the entry from progress.json so that the worker processes it again.
+- [ ] **Per-scene evaluation timeout**: none by default; if needed, add a `multiprocessing` subprocess timeout.
+- [ ] **Allow private user datasets**: public only by default.
+- [ ] **Multiple Workers**: default 1; scale manually according to queue length.
+- [ ] **Aggregation policy**: arithmetic mean by default (missing scenes skipped, or counted as 0? to be decided).
+- [ ] **Archive trigger**: by default once done/failed records are 30 days old; the scheduled cron is run by the worker.

Dockerfile.worker DELETED

@@ -1,19 +0,0 @@
-# Worker Space — pure backend container.
-# Configure the HF Space with `sdk: docker` and this Dockerfile at repo root.
-FROM python:3.11-slim
-ENV PYTHONUNBUFFERED=1 \
-    PIP_NO_CACHE_DIR=1 \
-    PIP_DISABLE_PIP_VERSION_CHECK=1 \
-    HF_HOME=/data/hf
-WORKDIR /app
-# Minimal runtime deps for the worker (no gradio).
-COPY requirements.worker.txt /app/requirements.worker.txt
-RUN pip install --no-cache-dir -r /app/requirements.worker.txt
-COPY . /app
-EXPOSE 7860
-CMD ["python", "worker.py"]

requirements.worker.txt DELETED

@@ -1,5 +0,0 @@
-APScheduler>=3.10
-huggingface-hub>=0.18.0
-pandas
-numpy
-python-dateutil

src/envs.py CHANGED

@@ -1,7 +1,15 @@
-"""Central configuration for v0.4 (scene-level + lease)."""
 import os
 import re
-import uuid
 from huggingface_hub import HfApi
@@ -15,48 +23,18 @@ FRONTEND_REPO_ID = os.environ.get("HF_FRONTEND_SPACE", f"{OWNER}/{BENCHMARK_NAME
 WORKER_REPO_ID = os.environ.get("HF_WORKER_SPACE", f"{OWNER}/{BENCHMARK_NAME}-Worker")
 RESULTS_REPO = os.environ.get("HF_RESULTS_REPO", f"{OWNER}/{BENCHMARK_NAME}")
-# ---------- dataset internal paths -------------------------------------------
-DS_SUBMISSIONS_DIR = "submissions"
-DS_TASK_MANIFEST_DIR = "task_manifest"
-DS_GROUND_TRUTH_DIR = "ground_truth"
-DS_RESULTS_DIR = "results"
-DS_LOGS_DIR = "logs"
-DS_ARCHIVE_DIR = "archive"
-# ---------- local cache ------------------------------------------------------
-CACHE_PATH = os.getenv("HF_HOME", ".")
-DATA_CACHE_PATH = os.path.join(CACHE_PATH, "benchmark-cache")
-for _sub in ("submissions", "results", "manifest", "downloads", "gt"):
-    os.makedirs(os.path.join(DATA_CACHE_PATH, _sub), exist_ok=True)
-SUBMISSIONS_CACHE_PATH = os.path.join(DATA_CACHE_PATH, "submissions")
-RESULTS_CACHE_PATH = os.path.join(DATA_CACHE_PATH, "results")
-MANIFEST_CACHE_PATH = os.path.join(DATA_CACHE_PATH, "manifest")
-DOWNLOADS_CACHE_PATH = os.path.join(DATA_CACHE_PATH, "downloads")
-GT_CACHE_PATH = os.path.join(DATA_CACHE_PATH, "gt")
-# ---------- size limits (per scene) ------------------------------------------
-MAX_SCENE_ZIP_SIZE = int(os.environ.get("MAX_SCENE_ZIP_SIZE", 500 * 1024 * 1024))   # 500 MB
-MAX_EXTRACTED_BYTES = int(os.environ.get("MAX_EXTRACTED_BYTES", 2 * 1024 * 1024 * 1024))  # 2 GB
-MAX_ZIP_ENTRIES = int(os.environ.get("MAX_ZIP_ENTRIES", 100000))
-# ---------- worker -----------------------------------------------------------
-POLL_INTERVAL_SECONDS = int(os.environ.get("POLL_INTERVAL_SECONDS", 30))
-LEASE_SECONDS = int(os.environ.get("LEASE_SECONDS", 300))                 # 5 min
-WORKER_CONCURRENCY = int(os.environ.get("WORKER_CONCURRENCY", 2))         # parallelism across submissions
-WORKER_ID = os.environ.get("WORKER_ID", f"worker-{uuid.uuid4().hex[:8]}")
-GT_LRU_SIZE = int(os.environ.get("GT_LRU_SIZE", 16))
-# ---------- log + archive ----------------------------------------------------
-LOG_FILE_MAX_BYTES = int(os.environ.get("LOG_FILE_MAX_BYTES", 50 * 1024 * 1024))  # 50 MB
-ARCHIVE_AFTER_DAYS = int(os.environ.get("ARCHIVE_AFTER_DAYS", 30))
-ARCHIVE_INTERVAL_SECONDS = int(os.environ.get("ARCHIVE_INTERVAL_SECONDS", 6 * 3600))  # 6 h
 # ---------- validation regex -------------------------------------------------
 DISPLAY_NAME_REGEX = re.compile(r"^[\w\-. ]{2,40}$", re.UNICODE)
 HF_REPO_ID_REGEX = re.compile(r"^[A-Za-z0-9_.\-]{1,96}/[A-Za-z0-9_.\-]{1,96}$")
-SCENE_ID_REGEX = re.compile(r"^[A-Za-z0-9_\-]{1,64}$")
 # ---------- CAS --------------------------------------------------------------
 CAS_MAX_RETRIES = 6
 API = HfApi(token=TOKEN)

+"""Central configuration for Frontend Space (v0.5).
+NOTE: all worker-side state (progress / lease / GT cache / archive / logs)
+lives on the Worker Space's persistent disk (`/data`). The HF dataset is now
+only a thin **mailbox** between Frontend and Worker:
+    submissions/<sid>.json         (Frontend writes; Worker ingests + deletes)
+    public/queue.json              (Worker mirrors snapshot; Frontend reads)
+    results/<task>/<team>_best.json (Worker writes; Frontend reads)
+"""
 import os
 import re
 from huggingface_hub import HfApi
 WORKER_REPO_ID = os.environ.get("HF_WORKER_SPACE", f"{OWNER}/{BENCHMARK_NAME}-Worker")
 RESULTS_REPO = os.environ.get("HF_RESULTS_REPO", f"{OWNER}/{BENCHMARK_NAME}")
+# ---------- dataset paths (mailbox surface only) -----------------------------
+DS_SUBMISSIONS_DIR = "submissions"   # Frontend write-only, Worker consumes
+DS_RESULTS_DIR = "results"           # Worker write, Frontend read
+DS_PUBLIC_DIR = "public"             # Worker write snapshot, Frontend read
+DS_QUEUE_SNAPSHOT = f"{DS_PUBLIC_DIR}/queue.json"
 # ---------- validation regex -------------------------------------------------
 DISPLAY_NAME_REGEX = re.compile(r"^[\w\-. ]{2,40}$", re.UNICODE)
 HF_REPO_ID_REGEX = re.compile(r"^[A-Za-z0-9_.\-]{1,96}/[A-Za-z0-9_.\-]{1,96}$")
 # ---------- CAS --------------------------------------------------------------
 CAS_MAX_RETRIES = 6
 API = HfApi(token=TOKEN)

src/leaderboard/read_evals.py CHANGED

@@ -1,15 +1,16 @@
-"""Read leaderboards + queue with position + per-submission progress."""
 from __future__ import annotations
 import logging
 import os
-from typing import Dict, List, Optional
 import pandas as pd
 from src.envs import (
     DS_RESULTS_DIR,
-    DS_SUBMISSIONS_DIR,
 )
 from src.storage.hub import download_file, list_repo_files_under
 from src.tasks.base import TaskPlugin
@@ -19,7 +20,6 @@ logger = logging.getLogger(__name__)
 def _load_remote_json(path_in_repo: str) -> Optional[dict]:
     """Download (etag-cached) and parse a JSON file from the private dataset."""
-    import json
     local = download_file(path_in_repo, force=False)
     if local is None:
         return None
@@ -70,82 +70,29 @@ def load_task_leaderboard(task: TaskPlugin) -> pd.DataFrame:
     return df[["#", *cols]]
-# --------------------- queue with position + progress -----------------------
-def _scan_submissions() -> List[dict]:
-    """Fetch all non-terminal submission JSONs via list + per-file download."""
-    files = list_repo_files_under(DS_SUBMISSIONS_DIR)
-    out: List[dict] = []
-    for fpath in files:
-        bn = os.path.basename(fpath)
-        if bn.endswith(".progress.json") or not bn.endswith(".json"):
-            continue
-        data = _load_remote_json(fpath)
-        if data:
-            out.append(data)
-    return out
-def _scan_progress() -> Dict[str, dict]:
-    """Fetch all progress JSONs for submissions that are currently in-flight."""
-    files = list_repo_files_under(DS_SUBMISSIONS_DIR)
-    out: Dict[str, dict] = {}
-    for fpath in files:
-        if not fpath.endswith(".progress.json"):
-            continue
-        data = _load_remote_json(fpath)
-        if data and data.get("submission_id"):
-            out[data["submission_id"]] = data
-    return out
 def load_queue_df(limit: int = 100) -> pd.DataFrame:
-    subs = _scan_submissions()
-    progs = _scan_progress()
-    # position: number of pending submissions ahead of this one + all running ones
-    pending_sorted = sorted(
-        [s for s in subs if s.get("status") == "pending"],
-        key=lambda d: d.get("submitted_at", ""),
-    )
-    pending_pos_of: Dict[str, int] = {s["submission_id"]: i + 1
-                                      for i, s in enumerate(pending_sorted)}
-    running_total = sum(1 for s in subs if s.get("status") == "running")
-    rows: List[dict] = []
-    for s in subs:
-        sid = s.get("submission_id", "")
-        status = s.get("status", "?")
-        prog = progs.get(sid, {})
-        n_done = len(prog.get("scene_results", {})) or s.get("scenes_done", 0) or 0
-        n_fail = len(prog.get("scene_failures", {})) or s.get("scenes_failed", 0) or 0
-        n_total = prog.get("scenes_total") or s.get("scenes_total") or 0
-        if status == "pending":
-            position = pending_pos_of.get(sid, 0) + running_total
-            position_str = f"#{position}"
-        elif status == "running":
-            position_str = "running"
-        else:
-            position_str = "—"
-        progress_str = (f"{n_done}/{n_total}"
-                        + (f" (+{n_fail} fail)" if n_fail else ""))
-        rows.append({
-            "Submitted At": s.get("submitted_at"),
-            "ID": sid,
-            "Team": s.get("display_name") or s.get("team_id"),
-            "Owner": s.get("team_id"),
-            "Task": s.get("task_name"),
-            "Status": status,
-            "Position": position_str,
-            "Scenes": progress_str,
-            "Score": s.get("score"),
-            "Error": (s.get("error_message") or "")[:120],
-        })
     if not rows:
-        return pd.DataFrame(columns=["Submitted At", "ID", "Team", "Owner", "Task",
-                                     "Status", "Position", "Scenes", "Score", "Error"])
     df = pd.DataFrame.from_records(rows)
     df = df.sort_values(by="Submitted At", ascending=False, na_position="last").head(limit)
     return df.reset_index(drop=True)
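The position rule in the removed `load_queue_df` (rank among pending submissions plus the number of running ones) is easier to see in isolation. The sketch below restates it as a pure function over plain dicts; `queue_positions` is a hypothetical name, and whether the Worker computes positions this way when it writes `public/queue.json` is an assumption, since the Worker code is not part of this repository.

```python
from typing import Dict, List


def queue_positions(subs: List[dict]) -> Dict[str, str]:
    """Map submission_id -> display position, following the removed logic."""
    # Pending submissions are ranked by submission time (oldest first).
    pending = sorted(
        (s for s in subs if s.get("status") == "pending"),
        key=lambda d: d.get("submitted_at", ""),
    )
    pending_rank = {s["submission_id"]: i + 1 for i, s in enumerate(pending)}
    # Every running submission is ahead of every pending one.
    running_total = sum(1 for s in subs if s.get("status") == "running")

    out: Dict[str, str] = {}
    for s in subs:
        sid = s.get("submission_id", "")
        status = s.get("status", "?")
        if status == "pending":
            out[sid] = f"#{pending_rank.get(sid, 0) + running_total}"
        elif status == "running":
            out[sid] = "running"
        else:
            out[sid] = "-"  # placeholder for terminal states
    return out
```

Sorting on the `submitted_at` string works because the timestamps use a fixed-width UTC format, so lexicographic order matches chronological order.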

+"""Read leaderboards + queue (v0.5: queue is a single Worker-mirrored snapshot)."""
 from __future__ import annotations
+import json
 import logging
 import os
+from typing import List, Optional
 import pandas as pd
 from src.envs import (
+    DS_QUEUE_SNAPSHOT,
     DS_RESULTS_DIR,
 )
 from src.storage.hub import download_file, list_repo_files_under
 from src.tasks.base import TaskPlugin
 def _load_remote_json(path_in_repo: str) -> Optional[dict]:
     """Download (etag-cached) and parse a JSON file from the private dataset."""
     local = download_file(path_in_repo, force=False)
     if local is None:
         return None
     return df[["#", *cols]]
+# --------------------- queue (Worker-mirrored snapshot) ---------------------
+_QUEUE_COLUMNS = ["Submitted At", "ID", "Team", "Owner", "Task",
+                  "Status", "Position", "Scenes", "Score", "Error"]
 def load_queue_df(limit: int = 100) -> pd.DataFrame:
+    """Read the queue snapshot that the Worker mirrors at `public/queue.json`.
+    The Worker keeps the canonical state on its /data persistent disk and
+    rewrites this single file every tick. Frontend only needs one download.
+    """
+    snap = _load_remote_json(DS_QUEUE_SNAPSHOT)
+    if not snap or not isinstance(snap, dict):
+        return pd.DataFrame(columns=_QUEUE_COLUMNS)
+    rows = snap.get("rows") or []
     if not rows:
+        return pd.DataFrame(columns=_QUEUE_COLUMNS)
     df = pd.DataFrame.from_records(rows)
+    for c in _QUEUE_COLUMNS:
+        if c not in df.columns:
+            df[c] = None
+    df = df[_QUEUE_COLUMNS]
     df = df.sort_values(by="Submitted At", ascending=False, na_position="last").head(limit)
     return df.reset_index(drop=True)
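The new `load_queue_df` tolerates a missing snapshot, a missing `rows` key, and rows that lack some columns. A dependency-free sketch of the same normalization is shown below; `normalize_queue_snapshot` is a hypothetical name, and the snapshot shape (`{"rows": [...]}` with ISO-formatted `Submitted At` strings) is inferred from the diff rather than from a published schema.

```python
from typing import Any, Dict, List

QUEUE_COLUMNS = ["Submitted At", "ID", "Team", "Owner", "Task",
                 "Status", "Position", "Scenes", "Score", "Error"]


def normalize_queue_snapshot(snap: Any, limit: int = 100) -> List[Dict[str, Any]]:
    """Return up to `limit` rows with every expected column, newest first."""
    if not isinstance(snap, dict):
        return []
    rows = snap.get("rows") or []
    # Fill absent columns with None and drop any unexpected keys.
    full = [{c: r.get(c) for c in QUEUE_COLUMNS} for r in rows if isinstance(r, dict)]
    # Newest first; rows without a timestamp sort last.
    full.sort(
        key=lambda r: (r["Submitted At"] is not None, r["Submitted At"] or ""),
        reverse=True,
    )
    return full[:limit]
```

The sort key mirrors `sort_values(ascending=False, na_position="last")`: the leading boolean keeps rows with a timestamp ahead of rows without one.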

src/storage/archive.py DELETED Viewed

@@ -1,149 +0,0 @@
-"""Log append (monthly + size rotation) + submission archive cron."""
-from __future__ import annotations
-import json
-import logging
-import os
-import time
-from datetime import datetime, timezone
-from typing import Optional
-from huggingface_hub import hf_hub_download
-from huggingface_hub.utils import EntryNotFoundError, HfHubHTTPError
-from src.envs import (
-    ARCHIVE_AFTER_DAYS,
-    CAS_MAX_RETRIES,
-    DS_ARCHIVE_DIR,
-    DS_LOGS_DIR,
-    DS_SUBMISSIONS_DIR,
-    LOG_FILE_MAX_BYTES,
-    RESULTS_REPO,
-)
-from src.storage.hub import (
-    API,
-    delete_file,
-    list_repo_files_under,
-    read_json,
-    upload_bytes,
-)
-logger = logging.getLogger(__name__)
-def _now() -> datetime:
-    return datetime.now(timezone.utc)
-def _month_prefix() -> str:
-    return _now().strftime("%Y-%m")
-def _candidate_log_paths(month: str) -> list[str]:
-    """Existing log file candidates for this month, sorted ascending."""
-    files = list_repo_files_under(DS_LOGS_DIR)
-    pat = f"{DS_LOGS_DIR}/exec-{month}"
-    return sorted(f for f in files if f.startswith(pat))
-def _resolve_log_target(month: str) -> tuple[str, bytes]:
-    """Return (path, existing bytes) for current append target, applying rotation."""
-    candidates = _candidate_log_paths(month)
-    if not candidates:
-        return (f"{DS_LOGS_DIR}/exec-{month}.jsonl", b"")
-    latest = candidates[-1]
-    try:
-        local = hf_hub_download(repo_id=RESULTS_REPO, repo_type="dataset",
-                                filename=latest, force_download=True)
-        with open(local, "rb") as f:
-            body = f.read()
-    except (EntryNotFoundError, HfHubHTTPError):
-        body = b""
-    if len(body) >= LOG_FILE_MAX_BYTES:
-        # rotate to next part
-        n_parts = sum(1 for c in candidates if ".part" in c) + 1
-        return (f"{DS_LOGS_DIR}/exec-{month}.part{n_parts + 1:02d}.jsonl", b"")
-    return (latest, body)
-def append_log(entry: dict) -> None:
-    line = (json.dumps(entry, ensure_ascii=False) + "\n").encode("utf-8")
-    month = _month_prefix()
-    for attempt in range(CAS_MAX_RETRIES):
-        path, body = _resolve_log_target(month)
-        # parent SHA: relax — log append is best-effort.
-        try:
-            upload_bytes(path, body + line,
-                         f"append log: {entry.get('submission_id', '?')}")
-            return
-        except HfHubHTTPError as e:
-            status = getattr(e.response, "status_code", None)
-            if status in (409, 412):
-                time.sleep(0.4 * (2 ** attempt))
-                continue
-            logger.warning("append_log failed: %s", e)
-            return
-    logger.warning("append_log retries exhausted")
-# ---------------------- archive cron ----------------------------------------
-def _parse_iso(s: Optional[str]) -> Optional[datetime]:
-    if not s:
-        return None
-    try:
-        return datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
-    except Exception:
-        return None
-def archive_old_submissions() -> int:
-    """Move done/failed submissions older than ARCHIVE_AFTER_DAYS to archive/.
-    Implementation: download json -> upload under archive/ -> delete originals.
-    Returns count archived.
-    """
-    cutoff_age = ARCHIVE_AFTER_DAYS * 86400
-    now = _now()
-    files = list_repo_files_under(DS_SUBMISSIONS_DIR)
-    sids = set()
-    for f in files:
-        bn = os.path.basename(f)
-        if bn.endswith(".json"):
-            sids.add(bn.split(".")[0])
-    moved = 0
-    for sid in sids:
-        sub_path = f"{DS_SUBMISSIONS_DIR}/{sid}.json"
-        existing = read_json(sub_path)
-        if existing is None:
-            continue
-        data, _sha = existing
-        if data.get("status") not in ("done", "failed"):
-            continue
-        ts = _parse_iso(data.get("completed_at") or data.get("submitted_at"))
-        if ts is None or (now - ts).total_seconds() < cutoff_age:
-            continue
-        month = ts.strftime("%Y-%m")
-        archive_prefix = f"{DS_ARCHIVE_DIR}/submissions-{month}"
-        try:
-            # upload archived copies
-            upload_bytes(
-                f"{archive_prefix}/{sid}.json",
-                json.dumps(data, ensure_ascii=False, indent=2).encode("utf-8"),
-                f"archive {sid}",
-            )
-            prog_path = f"{DS_SUBMISSIONS_DIR}/{sid}.progress.json"
-            prog = read_json(prog_path)
-            if prog is not None:
-                upload_bytes(
-                    f"{archive_prefix}/{sid}.progress.json",
-                    json.dumps(prog[0], ensure_ascii=False, indent=2).encode("utf-8"),
-                    f"archive progress {sid}",
-                )
-            # delete originals
-            delete_file(sub_path, f"archive cleanup {sid}")
-            delete_file(prog_path, f"archive cleanup {sid} progress")
-            moved += 1
-        except Exception as e:
-            logger.warning("archive %s failed: %s", sid, e)
-    return moved

src/storage/hub.py CHANGED Viewed

@@ -1,37 +1,29 @@
-"""Generic CAS-based read/write helpers for the private dataset.
-All hub access (other than user-dataset downloads) goes through this module.
 """
 from __future__ import annotations
 import json
 import logging
-import os
-import time
 from io import BytesIO
-from typing import Any, Callable, Dict, List, Optional, Tuple
-from huggingface_hub import hf_hub_download, snapshot_download
-from huggingface_hub.utils import EntryNotFoundError, HfHubHTTPError, RepositoryNotFoundError
-from src.envs import (
-    API,
-    CAS_MAX_RETRIES,
-    DATA_CACHE_PATH,
-    RESULTS_REPO,
-)
 logger = logging.getLogger(__name__)
-def _parent_sha() -> Optional[str]:
-    try:
-        return API.dataset_info(repo_id=RESULTS_REPO).sha
-    except Exception:
-        return None
-# --------------------------- download helpers --------------------------------
 def download_file(path_in_repo: str, force: bool = True) -> Optional[str]:
     try:
         return hf_hub_download(
@@ -48,35 +40,7 @@ def download_file(path_in_repo: str, force: bool = True) -> Optional[str]:
         raise
-def read_json(path_in_repo: str) -> Optional[Tuple[dict, Optional[str]]]:
-    local = download_file(path_in_repo)
-    if local is None:
-        return None
-    try:
-        with open(local, "r", encoding="utf-8") as f:
-            return json.load(f), _parent_sha()
-    except json.JSONDecodeError as e:
-        logger.warning("malformed json at %s: %s", path_in_repo, e)
-        return None
-def snapshot_subdir(allow_patterns: List[str],
-                    local_dir: str = DATA_CACHE_PATH) -> str:
-    try:
-        return snapshot_download(
-            repo_id=RESULTS_REPO,
-            repo_type="dataset",
-            local_dir=local_dir,
-            allow_patterns=allow_patterns,
-            etag_timeout=30,
-        )
-    except (RepositoryNotFoundError, HfHubHTTPError) as e:
-        logger.warning("snapshot_subdir failed for %s: %s", allow_patterns, e)
-        return local_dir
 def list_repo_files_under(prefix: str) -> List[str]:
-    """List files under a path-prefix in the private dataset."""
     try:
         all_files = API.list_repo_files(repo_id=RESULTS_REPO, repo_type="dataset")
     except Exception as e:
@@ -86,62 +50,12 @@ def list_repo_files_under(prefix: str) -> List[str]:
     return [f for f in all_files if f.startswith(prefix)]
-# --------------------------- upload helpers ----------------------------------
-def upload_bytes(path_in_repo: str, body: bytes, commit_message: str,
-                 parent_commit: Optional[str] = None) -> None:
-    kwargs: Dict[str, Any] = dict(
         path_or_fileobj=BytesIO(body),
         path_in_repo=path_in_repo,
         repo_id=RESULTS_REPO,
         repo_type="dataset",
         commit_message=commit_message,
     )
-    if parent_commit:
-        kwargs["parent_commit"] = parent_commit
-    API.upload_file(**kwargs)
-def upload_json(path_in_repo: str, payload: dict, commit_message: str,
-                parent_commit: Optional[str] = None) -> None:
-    body = json.dumps(payload, ensure_ascii=False, indent=2).encode("utf-8")
-    upload_bytes(path_in_repo, body, commit_message, parent_commit)
-def delete_file(path_in_repo: str, commit_message: str) -> bool:
-    try:
-        API.delete_file(
-            path_in_repo=path_in_repo,
-            repo_id=RESULTS_REPO,
-            repo_type="dataset",
-            commit_message=commit_message,
-        )
-        return True
-    except (EntryNotFoundError, HfHubHTTPError) as e:
-        logger.warning("delete_file %s failed: %s", path_in_repo, e)
-        return False
-def cas_update_json(path_in_repo: str,
-                    mutate_fn: Callable[[Optional[dict]], Optional[dict]],
-                    commit_message: str) -> Optional[dict]:
-    """Optimistic CAS: read -> mutate -> write with parent_commit.
-    `mutate_fn(current_or_None)` returns the new payload or None to abort.
-    Returns the persisted payload, or the current one if aborted.
-    """
-    for attempt in range(CAS_MAX_RETRIES):
-        existing = read_json(path_in_repo)
-        current, parent = (existing[0], existing[1]) if existing else (None, _parent_sha())
-        new_payload = mutate_fn(current)
-        if new_payload is None:
-            return current
-        try:
-            upload_json(path_in_repo, new_payload, commit_message, parent)
-            return new_payload
-        except HfHubHTTPError as e:
-            status = getattr(e.response, "status_code", None)
-            if status in (409, 412):
-                time.sleep(0.4 * (2 ** attempt))
-                continue
-            raise
-    raise RuntimeError(f"CAS exhausted for {path_in_repo}")
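The removed `cas_update_json` follows a read, mutate, conditional-write loop with exponential backoff on conflicts. The sketch below reproduces that control flow against an in-memory store so it can run without the Hub; `InMemoryStore`, `ConflictError`, and `cas_update` are illustrative stand-ins (the original relied on `parent_commit` and HTTP 409/412 responses instead).

```python
import time
from typing import Callable, Optional, Tuple


class ConflictError(Exception):
    """Stand-in for an HTTP 409/412 response from the store."""


class InMemoryStore:
    """Hypothetical versioned store: a write must name the version it was based on."""

    def __init__(self) -> None:
        self.value: Optional[dict] = None
        self.version = 0

    def read(self) -> Tuple[Optional[dict], int]:
        copy = dict(self.value) if self.value is not None else None
        return copy, self.version

    def write(self, payload: dict, expected_version: int) -> None:
        if expected_version != self.version:
            raise ConflictError("stale parent version")
        self.value = dict(payload)
        self.version += 1


def cas_update(store: InMemoryStore,
               mutate_fn: Callable[[Optional[dict]], Optional[dict]],
               max_retries: int = 6,
               base_delay: float = 0.4,
               sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Optimistic update: retry on conflict, doubling the delay each attempt."""
    for attempt in range(max_retries):
        current, version = store.read()
        new_payload = mutate_fn(current)
        if new_payload is None:
            return current  # mutate_fn asked to abort
        try:
            store.write(new_payload, version)
            return new_payload
        except ConflictError:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("CAS exhausted")
```

Passing `sleep` as a parameter keeps the backoff testable; the retry count and base delay match `CAS_MAX_RETRIES = 6` and the `0.4 * 2 ** attempt` schedule in the removed code.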

+"""Minimal hub I/O for the Frontend Space (v0.5).
+Frontend talks to the dataset only via a tiny mailbox:
+  * `upload_json(submissions/<sid>.json, ...)` — write pending submission
+  * `download_file(public/queue.json)`         — read queue snapshot
+  * `download_file(results/<task>/<team>_best.json)` — read leaderboard rows
+  * `list_repo_files_under(results/<task>)`    — enumerate leaderboard entries
+All CAS / progress / lease logic lives on the Worker (against its /data disk),
+so this module no longer ships those helpers.
 """
 from __future__ import annotations
 import json
 import logging
 from io import BytesIO
+from typing import List, Optional
+from huggingface_hub import hf_hub_download
+from huggingface_hub.utils import EntryNotFoundError, HfHubHTTPError
+from src.envs import API, RESULTS_REPO
 logger = logging.getLogger(__name__)
 def download_file(path_in_repo: str, force: bool = True) -> Optional[str]:
     try:
         return hf_hub_download(
         raise
 def list_repo_files_under(prefix: str) -> List[str]:
     try:
         all_files = API.list_repo_files(repo_id=RESULTS_REPO, repo_type="dataset")
     except Exception as e:
     return [f for f in all_files if f.startswith(prefix)]
+def upload_json(path_in_repo: str, payload: dict, commit_message: str) -> None:
+    body = json.dumps(payload, ensure_ascii=False, indent=2).encode("utf-8")
+    API.upload_file(
         path_or_fileobj=BytesIO(body),
         path_in_repo=path_in_repo,
         repo_id=RESULTS_REPO,
         repo_type="dataset",
         commit_message=commit_message,
     )

src/storage/progress.py DELETED Viewed

@@ -1,159 +0,0 @@
-"""progress.json CAS helpers (per-submission scene results + lease)."""
-from __future__ import annotations
-from datetime import datetime, timedelta, timezone
-from typing import Dict, Optional
-from src.envs import DS_SUBMISSIONS_DIR, LEASE_SECONDS, WORKER_ID
-from src.storage.hub import cas_update_json, read_json
-def _now() -> datetime:
-    return datetime.now(timezone.utc)
-def _iso(dt: datetime) -> str:
-    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
-def progress_path(sid: str) -> str:
-    return f"{DS_SUBMISSIONS_DIR}/{sid}.progress.json"
-def submission_path(sid: str) -> str:
-    return f"{DS_SUBMISSIONS_DIR}/{sid}.json"
-def load_progress(sid: str) -> Dict:
-    """Returns the progress dict (possibly empty skeleton)."""
-    existing = read_json(progress_path(sid))
-    if existing is None:
-        return {
-            "submission_id": sid,
-            "scenes_total": 0,
-            "scene_results": {},
-            "scene_failures": {},
-            "last_worker": None,
-            "last_updated_at": None,
-        }
-    return existing[0]
-def init_progress_if_absent(sid: str, scenes_total: int) -> None:
-    def mutate(current):
-        if current is not None:
-            return None
-        return {
-            "submission_id": sid,
-            "scenes_total": scenes_total,
-            "scene_results": {},
-            "scene_failures": {},
-            "last_worker": WORKER_ID,
-            "last_updated_at": _iso(_now()),
-        }
-    cas_update_json(progress_path(sid), mutate, f"init progress {sid}")
-def add_scene_result(sid: str, scene_id: str, metrics: Dict[str, float]) -> Dict:
-    def mutate(current):
-        cur = current or {
-            "submission_id": sid, "scenes_total": 0,
-            "scene_results": {}, "scene_failures": {},
-        }
-        cur["scene_results"] = dict(cur.get("scene_results", {}))
-        cur["scene_failures"] = dict(cur.get("scene_failures", {}))
-        cur["scene_results"][scene_id] = {"metrics": metrics, "ts": _iso(_now())}
-        cur["scene_failures"].pop(scene_id, None)
-        cur["last_worker"] = WORKER_ID
-        cur["last_updated_at"] = _iso(_now())
-        return cur
-    return cas_update_json(progress_path(sid), mutate, f"scene {scene_id} OK ({sid})") or {}
-def add_scene_failure(sid: str, scene_id: str, reason: str) -> Dict:
-    def mutate(current):
-        cur = current or {
-            "submission_id": sid, "scenes_total": 0,
-            "scene_results": {}, "scene_failures": {},
-        }
-        cur["scene_results"] = dict(cur.get("scene_results", {}))
-        cur["scene_failures"] = dict(cur.get("scene_failures", {}))
-        cur["scene_failures"][scene_id] = reason[:240]
-        cur["last_worker"] = WORKER_ID
-        cur["last_updated_at"] = _iso(_now())
-        return cur
-    return cas_update_json(progress_path(sid), mutate, f"scene {scene_id} FAIL ({sid})") or {}
-# --------------------- submission CAS (status + lease) -----------------------
-def claim_submission(sid: str) -> Optional[dict]:
-    """Atomic pending → running (+ lease). Returns running payload or None."""
-    holder = {"payload": None}
-    def mutate(current):
-        if current is None or current.get("status") != "pending":
-            return None
-        new = dict(current)
-        new["status"] = "running"
-        new["started_at"] = _iso(_now())
-        new["worker_attempt"] = int(current.get("worker_attempt", 0)) + 1
-        new["lease_expires_at"] = _iso(_now() + timedelta(seconds=LEASE_SECONDS))
-        new["last_worker"] = WORKER_ID
-        holder["payload"] = new
-        return new
-    cas_update_json(submission_path(sid), mutate, f"claim {sid}")
-    return holder["payload"]
-def reclaim_submission(sid: str) -> Optional[dict]:
-    """Take over a running submission whose lease has expired."""
-    holder = {"payload": None}
-    def mutate(current):
-        if current is None or current.get("status") != "running":
-            return None
-        lease = current.get("lease_expires_at")
-        if lease and lease > _iso(_now()):
-            return None  # still leased by someone else
-        new = dict(current)
-        new["worker_attempt"] = int(current.get("worker_attempt", 0)) + 1
-        new["lease_expires_at"] = _iso(_now() + timedelta(seconds=LEASE_SECONDS))
-        new["last_worker"] = WORKER_ID
-        holder["payload"] = new
-        return new
-    cas_update_json(submission_path(sid), mutate, f"reclaim {sid}")
-    return holder["payload"]
-def renew_lease(sid: str, scene_counts: Optional[Dict[str, int]] = None) -> None:
-    def mutate(current):
-        if current is None or current.get("status") != "running":
-            return None
-        new = dict(current)
-        new["lease_expires_at"] = _iso(_now() + timedelta(seconds=LEASE_SECONDS))
-        new["last_worker"] = WORKER_ID
-        if scene_counts:
-            new["scenes_done"] = scene_counts.get("done", new.get("scenes_done", 0))
-            new["scenes_failed"] = scene_counts.get("failed", new.get("scenes_failed", 0))
-        return new
-    cas_update_json(submission_path(sid), mutate, f"renew lease {sid}")
-def finalize_submission(sid: str, status: str, score: Optional[float],
-                        metrics: Optional[dict], error_message: Optional[str],
-                        scenes_done: int, scenes_failed: int) -> None:
-    def mutate(current):
-        base = current or {"submission_id": sid}
-        new = dict(base)
-        new["status"] = status
-        new["completed_at"] = _iso(_now())
-        new["lease_expires_at"] = None
-        new["score"] = score
-        new["metrics"] = metrics
-        new["error_message"] = error_message
-        new["scenes_done"] = scenes_done
-        new["scenes_failed"] = scenes_failed
-        return new
-    cas_update_json(submission_path(sid), mutate, f"finalize {sid}={status}")
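The claim and reclaim transitions above are state-machine rules wrapped in CAS calls. Stripped of storage, they can be sketched as pure functions; `try_claim` and `try_reclaim` are hypothetical names, and `LEASE_SECONDS = 600` is an assumed value, since the real constant lived in `src/envs.py` and is not shown in this diff.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

LEASE_SECONDS = 600  # assumption: the actual value is not visible in the diff
ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"


def iso(dt: datetime) -> str:
    return dt.strftime(ISO_FMT)


def _with_lease(sub: dict, now: datetime, worker_id: str) -> dict:
    new = dict(sub)
    new["worker_attempt"] = int(sub.get("worker_attempt", 0)) + 1
    new["lease_expires_at"] = iso(now + timedelta(seconds=LEASE_SECONDS))
    new["last_worker"] = worker_id
    return new


def try_claim(sub: Optional[dict], now: datetime, worker_id: str) -> Optional[dict]:
    """pending -> running with a fresh lease; None if the submission is not pending."""
    if sub is None or sub.get("status") != "pending":
        return None
    new = _with_lease(sub, now, worker_id)
    new["status"] = "running"
    new["started_at"] = iso(now)
    return new


def try_reclaim(sub: Optional[dict], now: datetime, worker_id: str) -> Optional[dict]:
    """Take over a running submission only if its lease has expired."""
    if sub is None or sub.get("status") != "running":
        return None
    lease = sub.get("lease_expires_at")
    # Fixed-width UTC strings compare in chronological order.
    if lease and lease > iso(now):
        return None
    return _with_lease(sub, now, worker_id)
```

Taking `now` as an argument, rather than calling the clock inside, makes the lease rules deterministic to test.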

src/submission/check_validity.py DELETED Viewed

@@ -1,60 +0,0 @@
-"""ZIP safe-extract — anti zip-bomb / path traversal."""
-from __future__ import annotations
-import os
-import zipfile
-from src.envs import MAX_EXTRACTED_BYTES, MAX_SCENE_ZIP_SIZE, MAX_ZIP_ENTRIES
-def validate_zip_filesize(zip_path: str) -> None:
-    size = os.path.getsize(zip_path)
-    if size > MAX_SCENE_ZIP_SIZE:
-        raise ValueError(f"scene ZIP size {size} exceeds the limit of {MAX_SCENE_ZIP_SIZE} bytes.")
-def safe_extract_zip(zip_path: str, dest_dir: str) -> None:
-    dest_real = os.path.realpath(dest_dir)
-    try:
-        zf = zipfile.ZipFile(zip_path, "r")
-    except zipfile.BadZipFile as e:
-        raise ValueError(f"Invalid ZIP file: {e}") from e
-    with zf:
-        infos = zf.infolist()
-        if len(infos) > MAX_ZIP_ENTRIES:
-            raise ValueError(f"ZIP entry count {len(infos)} exceeds the limit of {MAX_ZIP_ENTRIES}.")
-        declared = sum(max(0, i.file_size) for i in infos)
-        if declared > MAX_EXTRACTED_BYTES:
-            raise ValueError(f"Declared extracted size of {declared} bytes exceeds the limit of {MAX_EXTRACTED_BYTES}.")
-        written = 0
-        for info in infos:
-            name = info.filename
-            if name.startswith("/") or ".." in name.replace("\\", "/").split("/"):
-                raise ValueError(f"Illegal path entry: {name}")
-            if (info.external_attr >> 16) & 0xF000 == 0xA000:
-                raise ValueError(f"Symbolic links are not allowed: {name}")
-            target = os.path.realpath(os.path.join(dest_dir, name))
-            if not (target == dest_real or target.startswith(dest_real + os.sep)):
-                raise ValueError(f"Illegal target path: {name}")
-            if info.is_dir():
-                os.makedirs(target, exist_ok=True)
-                continue
-            os.makedirs(os.path.dirname(target), exist_ok=True)
-            with zf.open(info, "r") as src, open(target, "wb") as dst:
-                while True:
-                    chunk = src.read(64 * 1024)
-                    if not chunk:
-                        break
-                    written += len(chunk)
-                    if written > MAX_EXTRACTED_BYTES:
-                        dst.close()
-                        try:
-                            os.remove(target)
-                        except OSError:
-                            pass
-                        raise ValueError(
-                            f"Extracted bytes exceed the limit of {MAX_EXTRACTED_BYTES}; suspected zip bomb."
-                        )
-                    dst.write(chunk)
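The per-entry checks in the removed `safe_extract_zip` (absolute paths, `..` components, symlinks, and targets that resolve outside the destination) can be tried against a ZIP built in memory. The sketch below only reports offending entries and extracts nothing; `unsafe_entries` and `demo` are hypothetical names introduced for illustration.

```python
import io
import os
import tempfile
import zipfile
from typing import List


def unsafe_entries(zf: zipfile.ZipFile, dest_dir: str) -> List[str]:
    """Return names of entries that the removed checks would reject."""
    dest_real = os.path.realpath(dest_dir)
    bad = []
    for info in zf.infolist():
        name = info.filename
        parts = name.replace("\\", "/").split("/")
        # Upper 16 bits of external_attr hold the Unix mode; 0xA000 marks a symlink.
        is_symlink = (info.external_attr >> 16) & 0xF000 == 0xA000
        target = os.path.realpath(os.path.join(dest_dir, name))
        inside = target == dest_real or target.startswith(dest_real + os.sep)
        if name.startswith("/") or ".." in parts or is_symlink or not inside:
            bad.append(name)
    return bad


def demo() -> List[str]:
    """Build a small archive with one safe and two unsafe entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("ok/a.txt", b"fine")
        zf.writestr("../evil.txt", b"escapes the destination")
        link = zipfile.ZipInfo("link")
        link.external_attr = 0o120777 << 16  # Unix symlink mode bits
        zf.writestr(link, b"/etc/passwd")
    buf.seek(0)
    with tempfile.TemporaryDirectory() as dest, zipfile.ZipFile(buf) as zf:
        return unsafe_entries(zf, dest)
```

This covers only the path and symlink checks; the size limits (entry count, declared bytes, and the streaming byte counter that guards against zip bombs) are separate safeguards in the removed code.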

src/submission/frontend.py CHANGED Viewed

@@ -12,10 +12,9 @@ from src.envs import (
     API,
     DISPLAY_NAME_REGEX,
     DS_SUBMISSIONS_DIR,
-    DS_TASK_MANIFEST_DIR,
     HF_REPO_ID_REGEX,
 )
-from src.storage.hub import read_json, upload_json
 from src.tasks import get_task
 logger = logging.getLogger(__name__)
@@ -53,14 +52,6 @@ def _check_user_dataset_exists(user_dataset: str) -> Optional[str]:
         return f"Failed to check dataset: {e}"
-def _scenes_total_for(task_name: str) -> int:
-    """Read scenes_total from task_manifest; 0 if absent."""
-    existing = read_json(f"{DS_TASK_MANIFEST_DIR}/{task_name}.json")
-    if existing is None:
-        return 0
-    return len(existing[0].get("scene_ids") or [])
 def submit_task(display_name: str, task_name: str, user_dataset: str) -> dict:
     display_name = (display_name or "").strip()
     task_name = (task_name or "").strip()
@@ -77,8 +68,9 @@ def submit_task(display_name: str, task_name: str, user_dataset: str) -> dict:
     task = get_task(task_name)
     team_id = user_dataset.split("/")[0]
     submission_id = uuid.uuid4().hex[:10]
-    scenes_total = _scenes_total_for(task_name)
     payload = {
         "submission_id": submission_id,
         "team_id": team_id,
@@ -87,18 +79,7 @@ def submit_task(display_name: str, task_name: str, user_dataset: str) -> dict:
         "user_dataset": user_dataset,
         "submitted_at": _utc_now_iso(),
         "status": "pending",
-        "started_at": None,
-        "completed_at": None,
-        "lease_expires_at": None,
         "primary_metric": task.primary_metric,
-        "score": None,
-        "metrics": None,
-        "scenes_total": scenes_total,
-        "scenes_done": 0,
-        "scenes_failed": 0,
-        "error_message": None,
-        "worker_attempt": 0,
-        "last_worker": None,
     }
     path = f"{DS_SUBMISSIONS_DIR}/{submission_id}.json"
     try:
@@ -109,13 +90,13 @@ def submit_task(display_name: str, task_name: str, user_dataset: str) -> dict:
                 "message": f"❌ Failed to write submission: {e}",
                 "submission_id": None}
     return {
         "status": "accepted",
         "submission_id": submission_id,
         "message": (
             f"✅ Submitted (id `{submission_id}`, status: pending).\n\n"
-            f"Team identity: `{team_id}` · Display name: `{display_name}` · "
-            f"This task has **{scenes_total}** scenes in total.\n\n"
             f"Check progress in the 'Queue' tab."
         ),
     }

     API,
     DISPLAY_NAME_REGEX,
     DS_SUBMISSIONS_DIR,
     HF_REPO_ID_REGEX,
 )
+from src.storage.hub import upload_json
 from src.tasks import get_task
 logger = logging.getLogger(__name__)
         return f"Failed to check dataset: {e}"
 def submit_task(display_name: str, task_name: str, user_dataset: str) -> dict:
     display_name = (display_name or "").strip()
     task_name = (task_name or "").strip()
     task = get_task(task_name)
     team_id = user_dataset.split("/")[0]
     submission_id = uuid.uuid4().hex[:10]
+    # Frontend only writes the mailbox envelope. Worker fills in scenes_total
+    # and per-scene state into its own /data once it ingests this file.
     payload = {
         "submission_id": submission_id,
         "team_id": team_id,
         "user_dataset": user_dataset,
         "submitted_at": _utc_now_iso(),
         "status": "pending",
         "primary_metric": task.primary_metric,
     }
     path = f"{DS_SUBMISSIONS_DIR}/{submission_id}.json"
     try:
                 "message": f"❌ Failed to write submission: {e}",
                 "submission_id": None}
     return {
         "status": "accepted",
         "submission_id": submission_id,
         "message": (
             f"✅ Submitted (id `{submission_id}`, status: pending).\n\n"
+            f"Team identity: `{team_id}` · Display name: `{display_name}`\n\n"
             f"Check progress in the 'Queue' tab."
         ),
     }
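After this change the Frontend writes only a small envelope into the mailbox and leaves all per-scene state to the Worker. A minimal sketch of building that envelope is given below; `build_envelope` is a hypothetical helper, and the `display_name` and `task_name` keys are assumptions, because the diff elides part of the payload (they are inferred from the fields the old queue code read).

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Tuple

DS_SUBMISSIONS_DIR = "submissions"  # same value as in src/envs.py


def build_envelope(display_name: str, task_name: str, user_dataset: str,
                   primary_metric: str) -> Tuple[str, bytes]:
    """Return (path_in_repo, JSON body) for a pending submission."""
    submission_id = uuid.uuid4().hex[:10]
    payload = {
        "submission_id": submission_id,
        # Identity is the owner of the user's dataset, not a self-declared name.
        "team_id": user_dataset.split("/")[0],
        "display_name": display_name,   # assumed key (elided in the diff)
        "task_name": task_name,         # assumed key (elided in the diff)
        "user_dataset": user_dataset,
        "submitted_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "pending",
        "primary_metric": primary_metric,
    }
    path = f"{DS_SUBMISSIONS_DIR}/{submission_id}.json"
    body = json.dumps(payload, ensure_ascii=False, indent=2).encode("utf-8")
    return path, body
```

The returned bytes are what `upload_json` would send to the dataset; fields such as `scenes_total`, `score`, and `lease_expires_at` are deliberately absent because the Worker now owns them.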

src/worker/__init__.py DELETED Viewed

File without changes

src/worker/loop.py DELETED Viewed

@@ -1,211 +0,0 @@
-"""Worker poll loop:
-- on_startup: reclaim any `running` submission whose lease has expired.
-- tick: claim pending submissions and dispatch them to a thread pool.
-- dispatch concurrency = WORKER_CONCURRENCY (task-level parallelism).
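-Scan strategy (avoids repeated full downloads)
--------------------------------
-Maintain an in-memory _settled_sids (the set of sids whose status is done/failed).
-On each tick:
-  1. API.list_repo_files → get all file names (a single API call, no data transfer)
-  2. Filter out _settled_sids (a pure in-memory operation)
-  3. Call hf_hub_download for the remaining sids (local etag cache; an unchanged file hits the cache and transfers no data)
-  4. On finalize, add the sid to _settled_sids
-Over time the settled set grows, so the actual download volume converges to O(pending + running)
-rather than O(all historical submissions).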
-"""
-from __future__ import annotations
-import json
-import logging
-import os
-import threading
-from concurrent.futures import ThreadPoolExecutor
-from datetime import datetime, timezone
-from typing import List, Optional
-from src.envs import (
-    DATA_CACHE_PATH,
-    DS_SUBMISSIONS_DIR,
-    POLL_INTERVAL_SECONDS,
-    WORKER_CONCURRENCY,
-    WORKER_ID,
-)
-from src.storage import progress as progress_store
-from src.storage.hub import download_file, list_repo_files_under
-from src.worker.runner import process_submission, shutdown_requested
-logger = logging.getLogger(__name__)
-_executor = ThreadPoolExecutor(max_workers=max(1, WORKER_CONCURRENCY),
-                               thread_name_prefix="worker")
-_in_flight_lock = threading.Lock()
-_in_flight: set[str] = set()
-# ---- settled set: sids whose status is done/failed (terminal, never re-dispatch) ----
-_settled_lock = threading.Lock()
-_settled_sids: set[str] = set()
-_TERMINAL_STATUSES = {"done", "failed"}
-def _mark_settled(sid: str) -> None:
-    with _settled_lock:
-        _settled_sids.add(sid)
-def _is_settled(sid: str) -> bool:
-    with _settled_lock:
-        return sid in _settled_sids
-def _now_iso() -> str:
-    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
-def _parse_sid(filename: str) -> Optional[str]:
-    """submissions/<sid>.json → sid.  Returns None for progress files."""
-    bn = os.path.basename(filename)
-    if bn.endswith(".progress.json") or not bn.endswith(".json"):
-        return None
-    return bn[: -len(".json")]
-def _fetch_submission(sid: str) -> Optional[dict]:
-    """Download (or serve from etag cache) one submission JSON. Returns None on error."""
-    path = f"{DS_SUBMISSIONS_DIR}/{sid}.json"
-    local = download_file(path, force=False)  # force=False → use etag cache
-    if local is None:
-        return None
-    try:
-        with open(local, "r", encoding="utf-8") as f:
-            return json.load(f)
-    except Exception as e:
-        logger.warning("malformed submission %s: %s", sid, e)
-        return None
-def _scan_submissions() -> List[dict]:
-    """
-    1. list_repo_files → all submission filenames (1 API call, no data download)
-    2. skip settled sids
-    3. hf_hub_download each remaining sid (etag-cached)
-    """
-    all_files = list_repo_files_under(DS_SUBMISSIONS_DIR)
-    rows: List[dict] = []
-    for fpath in all_files:
-        sid = _parse_sid(fpath)
-        if sid is None or _is_settled(sid):
-            continue
-        data = _fetch_submission(sid)
-        if data is None:
-            continue
-        # Opportunistically update settled set from whatever we downloaded
-        if data.get("status") in _TERMINAL_STATUSES:
-            _mark_settled(sid)
-            continue
-        rows.append(data)
-    return rows
-def _list_pending(rows: List[dict]) -> List[dict]:
-    pend = [r for r in rows if r.get("status") == "pending"]
-    pend.sort(key=lambda d: d.get("submitted_at", ""))
-    return pend
-def _list_expired_running(rows: List[dict]) -> List[dict]:
-    now_iso = _now_iso()
-    out = []
-    for r in rows:
-        if r.get("status") != "running":
-            continue
-        lease = r.get("lease_expires_at")
-        if lease and lease < now_iso:
-            out.append(r)
-    return out
-def _try_dispatch(claimed: dict) -> bool:
-    sid = claimed["submission_id"]
-    with _in_flight_lock:
-        if sid in _in_flight:
-            return False
-        _in_flight.add(sid)
-    def _run():
-        try:
-            logger.info("[%s] start sid=%s task=%s", WORKER_ID, sid, claimed.get("task_name"))
-            process_submission(claimed)
-            logger.info("[%s] end   sid=%s", WORKER_ID, sid)
-            # Mark settled so future ticks skip downloading this sid entirely.
-            # (process_submission normally writes done/failed. It can also return early
-            # on shutdown, but the settled set is in-memory and dies with the process.)
-            _mark_settled(sid)
-        except Exception:
-            logger.exception("[%s] dispatch crashed sid=%s", WORKER_ID, sid)
-        finally:
-            with _in_flight_lock:
-                _in_flight.discard(sid)
-    _executor.submit(_run)
-    return True
-def boot_reclaim() -> int:
-    """Reclaim any running-but-lease-expired submissions (call on startup)."""
-    rows = _scan_submissions()
-    expired = _list_expired_running(rows)
-    n = 0
-    for r in expired:
-        if shutdown_requested():
-            break
-        claimed = progress_store.reclaim_submission(r["submission_id"])
-        if claimed is None:
-            continue
-        if _try_dispatch(claimed):
-            n += 1
-    logger.info("[%s] boot_reclaim dispatched %d", WORKER_ID, n)
-    return n
-def tick() -> int:
-    """Periodic tick: claim pending up to free capacity, then dispatch."""
-    if shutdown_requested():
-        return 0
-    with _in_flight_lock:
-        free = max(0, WORKER_CONCURRENCY - len(_in_flight))
-    if free == 0:
-        logger.debug("[%s] tick: no free slots", WORKER_ID)
-        return 0
-    rows = _scan_submissions()
-    pending = _list_pending(rows)
-    if not pending:
-        return 0
-    logger.info("[%s] tick: %d pending, %d free slots", WORKER_ID, len(pending), free)
-    dispatched = 0
-    for r in pending:
-        if dispatched >= free or shutdown_requested():
-            break
-        claimed = progress_store.claim_submission(r["submission_id"])
-        if claimed is None:
-            continue
-        if _try_dispatch(claimed):
-            dispatched += 1
-    return dispatched
-def in_flight_snapshot() -> List[str]:
-    with _in_flight_lock:
-        return sorted(_in_flight)
-def shutdown_executor(wait: bool = True) -> None:
-    _executor.shutdown(wait=wait)
-# poll interval re-export for the worker entry to schedule it
-__all__ = ["tick", "boot_reclaim", "in_flight_snapshot", "shutdown_executor",
-           "POLL_INTERVAL_SECONDS"]
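
The dispatch logic in the deleted `loop.py` combines a thread pool with an in-flight id set so that the same submission is never run twice concurrently. A minimal standalone sketch of that pattern follows. The `Dispatcher` class and its method names are hypothetical and only illustrate the idea; the original used module-level globals instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Dispatcher:
    """Hypothetical wrapper around an executor plus an in-flight id set."""

    def __init__(self, concurrency):
        self.concurrency = max(1, concurrency)
        self._executor = ThreadPoolExecutor(max_workers=self.concurrency)
        self._lock = threading.Lock()
        self._in_flight = set()

    def free_slots(self):
        # Capacity left for new jobs, mirroring the check at the top of tick().
        with self._lock:
            return max(0, self.concurrency - len(self._in_flight))

    def try_dispatch(self, sid, job):
        """Submit job(sid) unless sid is already running; True if submitted."""
        with self._lock:
            if sid in self._in_flight:
                return False
            self._in_flight.add(sid)

        def _run():
            try:
                job(sid)
            finally:
                # Always release the slot, even if the job raised.
                with self._lock:
                    self._in_flight.discard(sid)

        self._executor.submit(_run)
        return True

    def shutdown(self):
        self._executor.shutdown(wait=True)
```

The id is added to the set before the job is submitted, so a second `try_dispatch` call for the same id is rejected even if the first job has not started running yet.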

src/worker/runner.py DELETED

@@ -1,291 +0,0 @@
-"""Per-submission scene pipeline.
-Assumptions:
-- The submission has already been CAS-claimed (status=running, lease set).
-- This module is responsible for:
-    * loading the task manifest
-    * iterating scenes (skipping settled ones — resume-friendly)
-    * downloading & evaluating each scene
-    * updating progress.json + renewing lease
-    * finalizing the submission once all scenes are settled
-"""
-from __future__ import annotations
-import logging
-import os
-import shutil
-import tempfile
-import threading
-from collections import OrderedDict
-from datetime import datetime, timezone
-from typing import Any, Dict, List, Optional, Tuple
-from huggingface_hub import hf_hub_download, snapshot_download
-from huggingface_hub.utils import EntryNotFoundError, HfHubHTTPError
-from src.envs import (
-    DOWNLOADS_CACHE_PATH,
-    DS_GROUND_TRUTH_DIR,
-    DS_TASK_MANIFEST_DIR,
-    GT_CACHE_PATH,
-    GT_LRU_SIZE,
-    RESULTS_REPO,
-    TOKEN,
-    WORKER_ID,
-)
-from src.storage import progress as progress_store
-from src.storage.archive import append_log
-from src.storage.hub import cas_update_json, read_json
-from src.submission.check_validity import safe_extract_zip, validate_zip_filesize
-from src.tasks import get_task
-from src.tasks.base import TaskPlugin
-logger = logging.getLogger(__name__)
-# Cooperative shutdown signal (set by worker entry on SIGTERM/SIGINT).
-_shutdown_event = threading.Event()
-def request_shutdown() -> None:
-    _shutdown_event.set()
-def shutdown_requested() -> bool:
-    return _shutdown_event.is_set()
-# --------------------------- GT cache (LRU) ----------------------------------
-_gt_lock = threading.Lock()
-_gt_lru: "OrderedDict[Tuple[str, str], Any]" = OrderedDict()
-def _load_gt_dir(task_name: str, scene_id: str) -> str:
-    """Snapshot GT dir for this scene into local cache; return local path."""
-    local_root = snapshot_download(
-        repo_id=RESULTS_REPO,
-        repo_type="dataset",
-        local_dir=GT_CACHE_PATH,
-        allow_patterns=[f"{DS_GROUND_TRUTH_DIR}/{scene_id}/**"],
-        etag_timeout=30,
-    )
-    return os.path.join(local_root, DS_GROUND_TRUTH_DIR, scene_id)
-def get_gt_scene(task: TaskPlugin, scene_id: str) -> Any:
-    key = (task.name, scene_id)
-    with _gt_lock:
-        if key in _gt_lru:
-            _gt_lru.move_to_end(key)
-            return _gt_lru[key]
-    gt_dir = _load_gt_dir(task.name, scene_id)
-    gt_obj = task.load_gt_scene(scene_id, gt_dir)
-    with _gt_lock:
-        _gt_lru[key] = gt_obj
-        _gt_lru.move_to_end(key)
-        while len(_gt_lru) > GT_LRU_SIZE:
-            _gt_lru.popitem(last=False)
-    return gt_obj
-# --------------------------- helpers ----------------------------------------
-def _utc_now_iso() -> str:
-    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
-def _read_manifest(task_name: str) -> Optional[dict]:
-    res = read_json(f"{DS_TASK_MANIFEST_DIR}/{task_name}.json")
-    return res[0] if res else None
-def _scene_filename(task_name: str, scene_id: str, layout: Optional[str]) -> str:
-    if layout:
-        return (layout
-                .replace("<task>", task_name)
-                .replace("<scene_id>", scene_id))
-    return f"predictions/{task_name}/{scene_id}.zip"
-# --------------------------- main entry --------------------------------------
-def process_submission(sub: dict) -> None:
-    """Run the scene pipeline for one (already-claimed) submission.
-    Safe to call again on the same sub after a crash — it will resume.
-    """
-    sid = sub["submission_id"]
-    task_name = sub["task_name"]
-    team_id = sub.get("team_id") or sub.get("user_dataset", "/").split("/")[0]
-    user_dataset = sub["user_dataset"]
-    try:
-        task = get_task(task_name)
-    except ValueError as e:
-        progress_store.finalize_submission(
-            sid, "failed", None, None, f"unknown task: {e}", 0, 0,
-        )
-        append_log({"submission_id": sid, "ts": _utc_now_iso(),
-                    "status": "Failed", "error_message": str(e)})
-        return
-    manifest = _read_manifest(task_name)
-    if manifest is None or not manifest.get("scene_ids"):
-        msg = f"task_manifest/{task_name}.json does not exist or is empty."
-        progress_store.finalize_submission(sid, "failed", None, None, msg, 0, 0)
-        append_log({"submission_id": sid, "ts": _utc_now_iso(),
-                    "status": "Failed", "error_message": msg})
-        return
-    scene_ids: List[str] = list(manifest["scene_ids"])
-    layout = manifest.get("scene_layout")
-    progress_store.init_progress_if_absent(sid, len(scene_ids))
-    prog = progress_store.load_progress(sid)
-    # ---- per-scene loop ----
-    for scene_id in scene_ids:
-        if shutdown_requested():
-            logger.info("[%s] %s shutdown requested, exiting", WORKER_ID, sid)
-            return
-        if scene_id in prog.get("scene_results", {}) or scene_id in prog.get("scene_failures", {}):
-            continue
-        zip_path: Optional[str] = None
-        scene_target_dir = os.path.join(DOWNLOADS_CACHE_PATH, sid, scene_id)
-        try:
-            os.makedirs(scene_target_dir, exist_ok=True)
-            zip_path = hf_hub_download(
-                repo_id=user_dataset,
-                repo_type="dataset",
-                filename=_scene_filename(task_name, scene_id, layout),
-                token=TOKEN,
-                local_dir=scene_target_dir,
-                force_download=True,
-            )
-            validate_zip_filesize(zip_path)
-            with tempfile.TemporaryDirectory(prefix=f"bench_{sid}_{scene_id}_") as sandbox:
-                safe_extract_zip(zip_path, sandbox)
-                task.validate_scene(sandbox)
-                gt = get_gt_scene(task, scene_id)
-                metrics = task.evaluate_scene(sandbox, gt)
-            prog = progress_store.add_scene_result(sid, scene_id, metrics)
-        except EntryNotFoundError:
-            prog = progress_store.add_scene_failure(
-                sid, scene_id, "missing: scene zip not found in user dataset"
-            )
-        except HfHubHTTPError as e:
-            status = getattr(e.response, "status_code", "?")
-            prog = progress_store.add_scene_failure(
-                sid, scene_id, f"download error (status {status})"
-            )
-        except ValueError as e:
-            prog = progress_store.add_scene_failure(sid, scene_id, str(e))
-        except Exception:
-            logger.exception("[%s] scene eval crashed sid=%s scene=%s", WORKER_ID, sid, scene_id)
-            prog = progress_store.add_scene_failure(
-                sid, scene_id, "internal evaluation error",
-            )
-        finally:
-            if zip_path:
-                try:
-                    os.remove(zip_path)
-                except OSError:
-                    pass
-            shutil.rmtree(scene_target_dir, ignore_errors=True)
-        # renew lease + update scene counters in submission json
-        progress_store.renew_lease(sid, {
-            "done": len(prog.get("scene_results", {})),
-            "failed": len(prog.get("scene_failures", {})),
-        })
-    # ---- finalize ----
-    settled = set(prog.get("scene_results", {})) | set(prog.get("scene_failures", {}))
-    if not settled.issuperset(scene_ids):
-        # incomplete (e.g. shutdown). Leave running; next worker picks up.
-        return
-    per_scene_metrics = {
-        sid_: entry["metrics"]
-        for sid_, entry in prog.get("scene_results", {}).items()
-    }
-    try:
-        agg = task.aggregate(per_scene_metrics)
-    except Exception as e:
-        logger.exception("[%s] aggregate failed sid=%s", WORKER_ID, sid)
-        progress_store.finalize_submission(
-            sid, "failed", None, None, "aggregation error",
-            len(prog.get("scene_results", {})),
-            len(prog.get("scene_failures", {})),
-        )
-        append_log({"submission_id": sid, "ts": _utc_now_iso(),
-                    "status": "Failed", "error_message": f"aggregate: {e}"})
-        return
-    primary = agg.get(task.primary_metric) if agg else None
-    failures = len(prog.get("scene_failures", {}))
-    successes = len(prog.get("scene_results", {}))
-    final_status = "done" if successes > 0 else "failed"
-    err_msg = None if final_status == "done" else "no scene evaluated successfully"
-    progress_store.finalize_submission(
-        sid, final_status,
-        float(primary) if primary is not None else None,
-        agg if final_status == "done" else None,
-        err_msg,
-        successes, failures,
-    )
-    if final_status == "done":
-        try:
-            is_new, prev = _maybe_update_best(task, team_id, sub.get("display_name", team_id),
-                                              float(primary), agg, sid)
-        except Exception:
-            logger.exception("[%s] best-update failed sid=%s", WORKER_ID, sid)
-            is_new, prev = False, None
-        append_log({
-            "submission_id": sid, "ts": _utc_now_iso(),
-            "status": "Success", "team_id": team_id, "task_name": task_name,
-            "score": primary, "metrics": agg,
-            "is_new_record": is_new, "previous_best": prev,
-            "scenes_done": successes, "scenes_failed": failures,
-        })
-    else:
-        append_log({
-            "submission_id": sid, "ts": _utc_now_iso(),
-            "status": "Failed", "team_id": team_id, "task_name": task_name,
-            "error_message": err_msg,
-            "scenes_done": successes, "scenes_failed": failures,
-        })
-def _maybe_update_best(task: TaskPlugin, team_id: str, display_name: str,
-                       primary_score: float, metrics: dict, submission_id: str
-                       ) -> Tuple[bool, Optional[float]]:
-    from src.envs import DS_RESULTS_DIR
-    path = f"{DS_RESULTS_DIR}/{task.name}/{team_id}_best.json"
-    holder = {"prev": None, "is_new": False}
-    def mutate(current):
-        if current is None:
-            holder["prev"] = None
-            beats = True
-        else:
-            holder["prev"] = float(current.get("score", 0))
-            prev = holder["prev"]
-            beats = (primary_score > prev) if task.higher_is_better else (primary_score < prev)
-        if not beats:
-            return None
-        holder["is_new"] = True
-        return {
-            "team_id": team_id,
-            "display_name": display_name,
-            "task_name": task.name,
-            "score": primary_score,
-            "primary_metric": task.primary_metric,
-            "metrics": metrics,
-            "scenes_count": int(metrics.get("scenes_count", 0)) if isinstance(metrics, dict) else 0,
-            "updated_at": _utc_now_iso(),
-            "submission_id": submission_id,
-        }
-    cas_update_json(path, mutate, f"best {task.name}/{team_id}")
-    return holder["is_new"], holder["prev"]
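
The ground-truth cache in the deleted `runner.py` is an LRU built on `OrderedDict`, with the slow load done outside the lock. The sketch below isolates that technique. The `LruCache` class and the `loader` callback are hypothetical names introduced for illustration, not part of the original module.

```python
import threading
from collections import OrderedDict


class LruCache:
    """Hypothetical thread-safe LRU cache with a caller-supplied loader."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader
        self._lock = threading.Lock()
        self._data = OrderedDict()

    def get(self, key):
        with self._lock:
            if key in self._data:
                # Cache hit: mark as most recently used.
                self._data.move_to_end(key)
                return self._data[key]
        # Cache miss: load outside the lock so other keys are not blocked.
        value = self.loader(key)
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            # Evict least recently used entries beyond capacity.
            while len(self._data) > self.capacity:
                self._data.popitem(last=False)
        return value

    def keys(self):
        with self._lock:
            return list(self._data)
```

Because the load runs outside the lock, two threads that miss on the same key may both load it and the later write wins. That trades a possible duplicate load for not holding the lock during a slow download, which is the same trade-off `get_gt_scene` makes.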

worker.py DELETED

@@ -1,173 +0,0 @@
-"""Worker Space entry — pure backend.
-No Gradio: HF Spaces only requires the container to expose a port; we serve a
-minimal stdlib HTTP endpoint for health/status. Configure the Space with
-`sdk: docker` and a Dockerfile that runs `python worker.py`.
-"""
-from __future__ import annotations
-import json
-import logging
-import signal
-import threading
-from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
-from apscheduler.schedulers.background import BackgroundScheduler
-from src.envs import (
-    API,
-    ARCHIVE_INTERVAL_SECONDS,
-    POLL_INTERVAL_SECONDS,
-    WORKER_CONCURRENCY,
-    WORKER_ID,
-    WORKER_REPO_ID,
-)
-from src.storage.archive import archive_old_submissions
-from src.tasks import TASKS
-from src.worker.loop import (
-    boot_reclaim,
-    in_flight_snapshot,
-    shutdown_executor,
-    tick,
-)
-from src.worker.runner import request_shutdown, shutdown_requested
-logging.basicConfig(
-    level=logging.INFO,
-    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
-)
-logger = logging.getLogger(__name__)
-HTTP_PORT = 7860  # HF Spaces default exposed port
-_last_status = {"text": "scheduler initialising", "last_tick_dispatched": 0}
-# --------------------------- scheduled jobs ---------------------------------
-def _scheduled_tick() -> None:
-    if shutdown_requested():
-        return
-    try:
-        n = tick()
-        _last_status["text"] = (
-            f"last tick dispatched {n}; in-flight={in_flight_snapshot()}"
-        )
-        _last_status["last_tick_dispatched"] = n
-    except Exception as e:
-        logger.exception("tick crashed")
-        _last_status["text"] = f"last tick crashed: {e}"
-def _scheduled_archive() -> None:
-    if shutdown_requested():
-        return
-    try:
-        n = archive_old_submissions()
-        if n:
-            logger.info("[%s] archived %d submissions", WORKER_ID, n)
-    except Exception:
-        logger.exception("archive cron crashed")
-def _restart_space() -> None:
-    try:
-        API.restart_space(repo_id=WORKER_REPO_ID)
-    except Exception as e:
-        logger.warning("restart_space failed: %s", e)
-# --------------------------- health endpoint --------------------------------
-class _HealthHandler(BaseHTTPRequestHandler):
-    def _send(self, code: int, body: bytes, ctype: str = "application/json") -> None:
-        self.send_response(code)
-        self.send_header("Content-Type", ctype)
-        self.send_header("Content-Length", str(len(body)))
-        self.end_headers()
-        self.wfile.write(body)
-    def do_GET(self):  # noqa: N802 (stdlib signature)
-        if self.path in ("/", "/status", "/healthz"):
-            body = {
-                "worker_id": WORKER_ID,
-                "concurrency": WORKER_CONCURRENCY,
-                "poll_interval_seconds": POLL_INTERVAL_SECONDS,
-                "in_flight": in_flight_snapshot(),
-                "shutdown_requested": shutdown_requested(),
-                "last_status": _last_status["text"],
-                "registered_tasks": list(TASKS.keys()),
-            }
-            self._send(200, json.dumps(body, ensure_ascii=False, indent=2).encode("utf-8"))
-        else:
-            self._send(404, b'{"error":"not found"}')
-    def log_message(self, format, *args):  # silence default access logging
-        return
-def _start_http_server() -> ThreadingHTTPServer:
-    server = ThreadingHTTPServer(("0.0.0.0", HTTP_PORT), _HealthHandler)
-    threading.Thread(
-        target=server.serve_forever, name="health-http", daemon=True
-    ).start()
-    logger.info("[%s] health server on :%d", WORKER_ID, HTTP_PORT)
-    return server
-# --------------------------- signal handlers --------------------------------
-def _install_signal_handlers() -> None:
-    def _handler(signum, frame):
-        logger.info(
-            "[%s] received signal %s; requesting graceful shutdown",
-            WORKER_ID, signum,
-        )
-        request_shutdown()
-    for s in (signal.SIGTERM, signal.SIGINT):
-        try:
-            signal.signal(s, _handler)
-        except (ValueError, OSError):
-            pass
-# --------------------------- main -------------------------------------------
-def main() -> None:
-    _install_signal_handlers()
-    try:
-        boot_reclaim()
-    except Exception:
-        logger.exception("boot_reclaim failed")
-    scheduler = BackgroundScheduler()
-    scheduler.add_job(_scheduled_tick, "interval", seconds=POLL_INTERVAL_SECONDS)
-    scheduler.add_job(_scheduled_archive, "interval", seconds=ARCHIVE_INTERVAL_SECONDS)
-    scheduler.add_job(_restart_space, "interval", seconds=6 * 3600)
-    scheduler.start()
-    server = _start_http_server()
-    logger.info("[%s] worker up; poll=%ds concurrency=%d",
-                WORKER_ID, POLL_INTERVAL_SECONDS, WORKER_CONCURRENCY)
-    # Block main thread until shutdown is signalled.
-    while not shutdown_requested():
-        try:
-            signal.pause()  # POSIX: wake on signal
-        except (AttributeError, KeyboardInterrupt):
-            # Windows (no signal.pause) or Ctrl-C inside a debugger: fall back to sleep-based polling
-            import time
-            time.sleep(1.0)
-    logger.info("[%s] shutting down...", WORKER_ID)
-    try:
-        scheduler.shutdown(wait=False)
-    except Exception:
-        pass
-    try:
-        server.shutdown()
-    except Exception:
-        pass
-    shutdown_executor(wait=True)
-    logger.info("[%s] bye", WORKER_ID)
-if __name__ == "__main__":
-    main()
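
The main loop in the deleted `worker.py` blocks on `signal.pause()` and falls back to `time.sleep` where that call is unavailable. One portable alternative, shown here only as a sketch and not as what the deleted file did, is to wait on the `threading.Event` directly. The function names `install_handlers` and `wait_for_shutdown` are hypothetical.

```python
import signal
import threading


def install_handlers(event):
    """Set the event on SIGTERM/SIGINT; ignore platforms that refuse."""
    def _handler(signum, frame):
        event.set()

    for s in (signal.SIGTERM, signal.SIGINT):
        try:
            signal.signal(s, _handler)
        except (ValueError, OSError):
            # Not the main thread, or the signal is unsupported here.
            pass


def wait_for_shutdown(event, poll_seconds=1.0):
    """Block until the event is set, waking periodically so signals are handled."""
    while not event.wait(timeout=poll_seconds):
        pass
```

`Event.wait(timeout=...)` returns `True` as soon as the event is set, so the loop exits promptly on shutdown without a platform-specific branch.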