Spaces: Intel/low_bit_open_llm_leaderboard

lvkaokao committed about 1 month ago

Commit

a09ebea

1 Parent(s): 05268c0

enhance retry logic.

Signed-off-by: lkk12014402 <kaokao.lv@intel.com>

Files changed (5)

docs/eta_logic.md +101 -0
docs/retry_logic.md +251 -0
src/ci_dispatcher.py +163 -89
src/queue_eta.py +3 -2
tests/test_ci_dispatcher.py +190 -41

docs/eta_logic.md ADDED

	@@ -0,0 +1,101 @@

+# ETA (Estimated Wait Time): Runtime Logic and Formula
+## Sequence Diagram
+```mermaid
+sequenceDiagram
+    participant User as User (Browser)
+    participant App as app.py
+    participant Submit as submit.py
+    participant Upload as _upload_to_hub()
+    participant StatusDir as status/ directory
+    participant ETA as queue_eta.py
+    User->>App: Click the Submit button
+    App->>Submit: submit_model() / submit_quant()
+    Note over Submit: 1. Verify that the model exists
+    Note over Submit: 2. Build eval_entry (status="Pending")
+    Note over Submit: 3. Dedup check
+    Submit->>Upload: _upload_to_hub(eval_entry)
+    Upload->>StatusDir: Write status/{user}/{prefix}.json<br/>{"status": "Pending", "script": "auto_eval", ...}
+    Upload-->>Submit: Write succeeded
+    Note over Submit: 4. Update in-memory dedup cache<br/>_EVAL_REQUESTED.add(dedup_key)
+    Submit->>ETA: compute_single_eta(status_path, model_params, concurrency=2)
+    ETA->>StatusDir: os.walk(status/) scans all JSON files
+    StatusDir-->>ETA: Return all status entries
+    Note over ETA: Count by category:<br/>• active_count: Running/Waiting/Triggered<br/>• pending_count: Pending + auto_*<br/>(includes the entry just written)
+    Note over ETA: Compute ETA (see formula below)
+    ETA-->>Submit: Return eta_hours
+    Submit->>Submit: format_eta(eta_hours) → "~3h"
+    Submit-->>User: "Submitted! Estimated completion time: ~3h"
+```
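+## Formula
+```
+ETA = running_remaining + ⌈queue_pos / concurrency⌉ × task_duration
+```
+### Variables
+| Variable | Meaning | How it is computed |
+|------|------|----------|
+| `queue_pos` | Position of the current model in the Pending queue | `max(pending_count, 1)` (the model itself is already counted) |
+| `concurrency` | Maximum number of concurrent tasks | Fixed at `2` |
+| `task_duration` | Estimated duration of a single task | Model ≤ 30B → 3h, model > 30B → 5h |
+| `running_remaining` | Remaining time of the currently running tasks | With active tasks: `sum(duration of each active task) / active_count`<br/>Without active tasks: `0` |
+| `active_count` | Number of tasks currently running | Entries whose status is Running/Waiting/Triggered |
+| `pending_count` | Number of tasks waiting in the queue | Entries with status=Pending and script=auto_* |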
+### Worked Examples
+**Scenario: empty queue, concurrency=2, five ≤30B models submitted in a row**
+```
+┌─────────────┬────────┬──────────┬───────────┬─────────────────────┬─────┐
+│ Submission  │ active │ pending  │ queue_pos │ Formula             │ ETA │
+├─────────────┼────────┼──────────┼───────────┼─────────────────────┼─────┤
+│ Model 1     │ 0      │ 1 (self) │ 1         │ 0 + ⌈1/2⌉×3 = 1×3   │ 3h  │
+│ Model 2     │ 0      │ 2        │ 2         │ 0 + ⌈2/2⌉×3 = 1×3   │ 3h  │
+│ Model 3     │ 0      │ 3        │ 3         │ 0 + ⌈3/2⌉×3 = 2×3   │ 6h  │
+│ Model 4     │ 0      │ 4        │ 4         │ 0 + ⌈4/2⌉×3 = 2×3   │ 6h  │
+│ Model 5     │ 0      │ 5        │ 5         │ 0 + ⌈5/2⌉×3 = 3×3   │ 9h  │
+└─────────────┴────────┴──────────┴───────────┴─────────────────────┴─────┘
+```
+**Scenario: 1 task running, 1 Pending, a new model is submitted**
+```
+active_count = 1, active task estimated at 3h
+pending_count = 2 (1 existing + self)
+queue_pos = 2
+running_remaining = 3/1 = 3h
+ETA = 3 + ⌈2/2⌉×3 = 3 + 3 = 6h
+```
+## Key Code Path
+```
+submit.py: submit_model() / submit_quant()
+    │
+    ├── _upload_to_hub()          ← write the status/ file first
+    │       └── status/{user}/eval_request_xxx.json
+    │
+    ├── _EVAL_REQUESTED.add()     ← update the in-memory dedup cache
+    │
+    └── compute_single_eta()      ← then compute the ETA
+            │
+            ├── os.walk(status/)  ← scan (the model's own file already exists)
+            ├── classify: active vs pending
+            ├── queue_pos = max(pending_count, 1)
+            └── ETA = running_remaining + ⌈pos/concurrency⌉ × task_hours
+```
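The ETA formula documented in docs/eta_logic.md can be sketched in a few lines of Python. This is a minimal illustration, not the code in `src/queue_eta.py`: the function name `sketch_eta_hours`, its parameter list, and the assumption that model size is passed in billions of parameters are all hypothetical. Only the formula, the fixed concurrency of 2, and the 3h/5h task durations come from the document.

```python
import math

CONCURRENCY = 2  # fixed value, per the variables table


def estimate_task_hours(model_params_b: float) -> float:
    """3h for models up to 30B parameters, 5h above (assumed unit: billions)."""
    return 3.0 if model_params_b <= 30 else 5.0


def sketch_eta_hours(pending_count, active_task_hours, model_params_b,
                     concurrency=CONCURRENCY):
    """Hypothetical helper: ETA = running_remaining + ceil(pos / concurrency) * task."""
    # The submission's own status file is already on disk, so it is already
    # part of pending_count; max(..., 1) only guards against an empty scan.
    queue_pos = max(pending_count, 1)
    task_hours = estimate_task_hours(model_params_b)
    if active_task_hours:
        # Average of the estimated durations of the active tasks.
        running_remaining = sum(active_task_hours) / len(active_task_hours)
    else:
        running_remaining = 0.0
    return running_remaining + math.ceil(queue_pos / concurrency) * task_hours


# First worked example: empty queue, five small models submitted in a row.
etas = [sketch_eta_hours(n, [], 7) for n in range(1, 6)]
# Second worked example: one active 3h task, two Pending entries including self.
eta_busy = sketch_eta_hours(2, [3.0], 7)
```

Under these assumptions `etas` reproduces the ETA column of the first example table and `eta_busy` matches the 6h result of the second scenario.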

docs/retry_logic.md ADDED

	@@ -0,0 +1,251 @@

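+# CI Dispatcher Retry Mechanism
+## Core Concepts
+A retry does **not re-run the task immediately**. It resets the status to `Pending` so the entry re-enters the queue. The task runs again only when it is next dispatched, which requires a free concurrency slot and an elapsed 30-minute cooldown.
+### State Data Sources
+| Data source | Location | Purpose |
+|---|---|---|
+| Local status file | `lb_eval/status/<owner>/<model>.json` | Persistence layer, managed in git, read by the leaderboard UI for display |
+| Azure REST API | `GET /pipelines/{id}/runs/{runId}` | Real-time query of the pipeline's actual run state |
+Each scan cycle: `git pull` fetches the latest status → call the Azure API to verify active entries → correct the status → `git push`.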
+---
+## State Transition Diagram
+```mermaid
+stateDiagram-v2
+    [*] --> Pending: User submits
+    Pending --> Triggered: dispatcher calls the Azure API to trigger the pipeline
+    Triggered --> Running: Azure reports inProgress
+    Running --> Finished: upload_results_github.py in CI writes back
+    Triggered --> Pending: Azure reports failed/canceled (retry)
+    Running --> Pending: Azure reports failed/canceled (retry)
+    Running --> Pending: Azure succeeded but no Finished write-back after 1h (retry)
+    Running --> Pending: Azure inProgress for more than 8h (retry)
+    Triggered --> Pending: Azure API unreachable for more than 6h (retry)
+    Triggered --> Pending: No ci_run_id for more than 6h (retry)
+    Pending --> Failed: retry_count >= 3
+    note right of Pending
+        After a retry returns to Pending, the entry waits for:
+        1. The 30-minute cooldown
+        2. A free concurrency slot
+        3. Its turn in submission-time order
+    end note
+    note right of Failed
+        Terminal state, no further retries
+        Users cannot resubmit
+        Requires manual intervention
+    end note
+```
+---
+## Retry Sequence Diagram
+```mermaid
+sequenceDiagram
+    participant S as APScheduler (every 60s)
+    participant D as CIDispatcher
+    participant SF as lb_eval/status/ (git)
+    participant AZ as Azure DevOps API
+    participant CI as CI Pipeline
+    Note over S: ── Scan Cycle Start ──
+    S->>D: scan_and_dispatch()
+    D->>SF: git pull (sync latest)
+    rect rgb(255, 240, 240)
+        Note over D: Step 2: Reconcile Active Entries
+        D->>SF: Read all active entries (Running/Triggered/Waiting)
+        loop Each active entry
+            D->>AZ: GET /runs/{ci_run_id}
+            alt Azure: completed + failed/canceled
+                Note over D: Reset immediately
+                D->>D: _apply_retry_limit(retry_count++)
+                alt retry_count < 3
+                    D->>SF: status = Pending + last_failed_time = now
+                else retry_count >= 3
+                    D->>SF: status = Failed (terminal)
+                end
+            else Azure: completed + succeeded
+                alt time since triggered_time < 1h
+                    Note over D: Within grace period, wait for CI to write back Finished
+                else time since triggered_time > 1h
+                    Note over D: Write-back failed, handle as retry
+                    D->>D: _apply_retry_limit(retry_count++)
+                    D->>SF: status = Pending + last_failed_time = now
+                end
+            else Azure: inProgress
+                alt time since triggered_time < 8h
+                    D->>SF: status = Running (normal)
+                else time since triggered_time > 8h
+                    Note over D: Pipeline hung, handle as retry
+                    D->>D: _apply_retry_limit(retry_count++)
+                    D->>SF: status = Pending + last_failed_time = now
+                end
+            else Azure API unreachable
+                alt time since triggered_time < 6h
+                    Note over D: Keep as is, wait for recovery
+                else time since triggered_time > 6h
+                    D->>D: _apply_retry_limit(retry_count++)
+                    D->>SF: status = Pending + last_failed_time = now
+                end
+            end
+        end
+        D->>SF: git push (batch write-back)
+    end
+    rect rgb(240, 255, 240)
+        Note over D: Step 3-4: Dispatch Pending
+        D->>SF: Read Pending entries
+        Note over D: Filter: skip entries with retry_count > 0 and<br/>last_failed_time less than 30min ago
+        D->>D: Sort by submitted_time
+        D->>D: Take min(pending count, free concurrency slots) entries
+        loop Each entry to dispatch
+            D->>AZ: POST /runs (trigger pipeline)
+            AZ-->>D: run_id
+            D->>SF: status = Triggered, ci_run_id = run_id
+        end
+        D->>SF: git push
+    end
+    Note over S: ── Next cycle in 60s ──
+```
+---
+## Retry Trigger Scenarios in Detail
+### Scenario 1: Azure reports `completed + failed`
+**Cause**: Pod startup failure, script execution error, OOM, and similar.
+```
+Timeline:
+T+0     Pending → Triggered (dispatcher triggers CI)
+T+5min  Triggered → Running (Azure reports inProgress)
+T+30min Azure reports completed+failed
+        → dispatcher resets to Pending (retry_count=1, last_failed_time=T+30min)
+T+59min cooldown not yet over, skip (30min cooldown)
+T+60min cooldown over, slot available → re-trigger
+```
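+### Scenario 2: Azure reports `completed + canceled`
+**Cause**: Manual cancellation or cancellation by Azure scheduling. Handled exactly like failed.
+### Scenario 3: Azure reports `completed + succeeded` but there is no Finished write-back
+**Cause**: `upload_results_github.py` failed (git push conflict, network problem, script bug).
+```
+Timeline:
+T+0     Pending → Triggered
+T+5min  Triggered → Running
+T+3h    Azure reports succeeded, the CI pipeline has ended
+        but status is still Running (upload_results_github.py did not write back Finished)
+        → dispatcher checks: time since triggered_time > 1h (_SUCCEEDED_GRACE_HOURS)
+        → reset to Pending (retry_count=1)
+```
+**Why not reset immediately?** Because `upload_results_github.py` runs before the CI pipeline ends,
+and its git push may still be in progress when Azure reports succeeded, so a 1-hour grace period (measured from `triggered_time`) is given.
+### Scenario 4: Pipeline hung (Azure keeps reporting `inProgress`)
+**Cause**: Process stuck after Pod allocation, GPU driver problem, infinite loop.
+```
+Timeline:
+T+0     Pending → Triggered → Running
+T+8h    Azure still reports inProgress
+        → dispatcher checks: time since triggered_time > 8h (_MAX_ACTIVE_HOURS)
+        → reset to Pending (retry_count=1)
+```
+### Scenario 5: Azure API unreachable
+**Cause**: Azure server-side outage, network partition, expired PAT.
+```
+Timeline:
+T+0     Pending → Triggered (triggered normally)
+T+1min  dispatcher calls Azure API → timeout/connection failure
+        → keep as is (may be a transient problem)
+T+6h    Azure API still unreachable
+        → dispatcher checks: time since triggered_time > 6h
+        → reset to Pending (retry_count=1)
+```
+### Scenario 6: No ci_run_id
+**Cause**: The trigger call to the Azure API returned successfully but the process crashed while writing the status file, or the trigger call itself
+timed out, leaving status written as Triggered without a run_id.
+Handled the same way as Scenario 5 (6-hour timeout).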
+---
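+## Constants
+| Constant | Value | Meaning |
+|---|---|---|
+| `_SUCCEEDED_GRACE_HOURS` | 1h | Grace period for the Finished write-back after Azure reports succeeded |
+| `_MAX_ACTIVE_HOURS` | 8h | Absolute timeout for any active entry |
+| `_API_UNREACHABLE_HOURS` | 6h | Fallback timeout when the Azure API is unreachable |
+| `_RETRY_COOLDOWN_HOURS` | 0.5h (30min) | Cooldown after a retry returns the entry to Pending |
+| `_MAX_RETRIES` | 3 | Maximum number of retries, after which the entry is marked Failed |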
+---
+## Concurrency Safety
+### Protection Mechanisms
+1. **Single-threaded scheduling**: APScheduler `max_instances=1 + coalesce=True`, so two scans never run in parallel
+2. **Git mutual exclusion**: `git_lock` (threading.Condition) guards all git operations
+3. **Non-fast-forward protection**: if CI pushes Finished at the same time, the dispatcher's push is rejected with a conflict
+4. **Push retry**: the dispatcher's reconcile push is attempted up to 3 times (pull → retry push)
+### Possible Race Scenario
+```
+Dispatcher                              CI Pipeline
+    │                                       │
+    ├─ git pull                             │
+    ├─ read status: Running                 │
+    ├─ query Azure: succeeded               │
+    ├─ within grace period, keep Running    │
+    │                                       ├─ upload_results_github.py
+    │                                       ├─ status = Finished
+    │                                       ├─ git push ✓
+    ├─ git push (changes to other entries)  │
+    │   └─ conflict! pull → retry push ✓    │
+    │                                       │
+    ├─ next scan: git pull                  │
+    ├─ read status: Finished                │
+    ├─ not active, skip ✓                   │
+```
+**Conclusion**: Even when concurrent pushes conflict, no data is lost. In the worst case, this cycle's reconcile results take effect in the next cycle.
+---
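+## Users Cannot Resubmit
+Even when a model is marked `Failed` (retries exhausted), the user cannot resubmit the same model.
+`already_submitted_models()` scans all JSON files under `status/` and `pending_requests/`
+regardless of status. Failed models require manual handling by an administrator.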
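The retry scenarios described in docs/retry_logic.md reduce to a small decision table, which can be sketched as follows. This is an illustration under stated assumptions rather than the dispatcher's code: `needs_retry` and `status_after_retry` are hypothetical names, the string values `"completed"` and `"succeeded"` are taken from the Azure states listed in the old `_map_azure_state` docstring, and the counter handling in `status_after_retry` is only a guess at what `_apply_retry_limit` does, based on the sequence diagram. Ages are measured from `triggered_time`, as in the diff.

```python
from typing import Optional

# Thresholds copied from the constants table.
SUCCEEDED_GRACE_HOURS = 1
MAX_ACTIVE_HOURS = 8
API_UNREACHABLE_HOURS = 6
MAX_RETRIES = 3


def needs_retry(azure_state: Optional[dict], run_id, age_hours: Optional[float]) -> bool:
    """Hypothetical helper: should this active entry go through the retry path?"""
    def older_than(limit: float) -> bool:
        # Without a parsable triggered_time no time-based guard can fire.
        return age_hours is not None and age_hours > limit

    # Scenarios 5 and 6: API unreachable, or no ci_run_id to query with.
    if not run_id or azure_state is None:
        return older_than(API_UNREACHABLE_HOURS)
    if azure_state.get("state") == "completed":
        if azure_state.get("result") == "succeeded":
            # Scenario 3: wait out the grace period for the Finished write-back.
            return older_than(SUCCEEDED_GRACE_HOURS)
        return True  # Scenarios 1 and 2: failed or canceled
    # Scenario 4: inProgress or canceling beyond the absolute limit.
    return older_than(MAX_ACTIVE_HOURS)


def status_after_retry(entry: dict) -> str:
    """Hypothetical stand-in for _apply_retry_limit: bump the counter, cap at the limit."""
    entry["retry_count"] = entry.get("retry_count", 0) + 1
    return "Failed" if entry["retry_count"] >= MAX_RETRIES else "Pending"


entry = {"model": "org/demo"}
history = [status_after_retry(entry) for _ in range(3)]
```

With these assumptions the third consecutive failure moves the entry to `Failed`, which matches the `retry_count >= 3` branch in the diagrams.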

src/ci_dispatcher.py CHANGED

@@ -14,10 +14,14 @@ Concurrency control:
 Active status reconciliation:
     Each scan cycle queries the Azure DevOps Runs API for entries that have a
-    ``ci_run_id``.  The real pipeline state is mapped back to local status so
-    that failed/canceled runs are detected immediately instead of waiting for a
-    fixed timeout.  A short fallback timeout (``_STALE_FALLBACK_HOURS``) still
-    exists for edge cases where the Azure API is unreachable.
 """
 import base64
@@ -39,9 +43,23 @@ _AZP_BRANCH = "main"
 # Statuses that count towards active concurrency slots
 _ACTIVE_STATUSES = frozenset({"Running", "Waiting", "Triggered"})
-# Fallback: if Azure API is unreachable for this long, reset entry to Pending.
-# This is a safety net, NOT the primary reconciliation mechanism.
-_STALE_FALLBACK_HOURS = 6
 # Maximum number of pipeline retries before marking an entry as Failed.
 _MAX_RETRIES = 3
@@ -188,13 +206,32 @@ class CIDispatcher:
         return running, triggered, pending
     def _find_pending(self) -> list[tuple[str, dict]]:
-        """Return ``[(status_file_path, entry_dict), ...]`` sorted oldest-first."""
         pending = []
         for fpath, data in self._iter_status_files():
             if (
                 data.get("status") == "Pending"
                 and data.get("script") in ("auto_quant", "auto_eval")
             ):
                 pending.append((fpath, data))
         pending.sort(key=lambda x: x[1].get("submitted_time", ""))
         return pending
@@ -327,17 +364,16 @@ class CIDispatcher:
     def _reconcile_active_runs(self):
         """Query Azure for all active entries and sync local status to reality.
-        For each status file in Triggered/Waiting/Running that carries a
-        ``ci_run_id``, we call the Azure Runs API to get the real pipeline state
-        and map it back to a local status update.
-        Fallback: entries whose Azure status cannot be determined AND whose
-        ``triggered_time`` exceeds ``_STALE_FALLBACK_HOURS`` are reset to
-        Pending as a safety net.
         """
         active_entries: list[tuple[str, dict]] = []
         for fpath, data in self._iter_status_files():
-            if data.get("status") in _ACTIVE_STATUSES and data.get("ci_run_id"):
                 active_entries.append((fpath, data))
         if not active_entries:
@@ -346,51 +382,90 @@ class CIDispatcher:
         updates: list[tuple[str, dict]] = []  # entries that need git write-back
         for fpath, data in active_entries:
-            run_id = data["ci_run_id"]
             model = data.get("model", "?")
             old_status = data["status"]
             azure_state = self._query_azure_run(run_id)
             if azure_state is None:
                 # API unreachable — apply fallback timeout
-                new_status = self._fallback_stale_check(data)
-                if new_status and new_status != old_status:
-                    new_status = self._apply_retry_limit(data, new_status, model)
-                    data["status"] = new_status
-                    if new_status == "Pending":
-                        data.pop("ci_run_id", None)
-                        data.pop("triggered_time", None)
-                    elif new_status == "Failed":
-                        data.pop("ci_run_id", None)
-                        data.pop("triggered_time", None)
                     updates.append((fpath, data))
                     logger.warning(
-                        "[dispatcher] Fallback: %s → %s (Azure unreachable)",
-                        model, new_status,
                     )
                 continue
-            # Map Azure state → local status
-            new_status = self._map_azure_state(azure_state)
-            if new_status == old_status:
-                continue  # nothing changed
-            # Apply retry limit when resetting to Pending
-            if new_status == "Pending":
-                new_status = self._apply_retry_limit(data, new_status, model)
-            logger.info(
-                "[dispatcher] Reconcile %s: %s → %s (azure: state=%s, result=%s)",
-                model, old_status, new_status,
-                azure_state.get("state"), azure_state.get("result"),
-            )
-            data["status"] = new_status
-            if new_status in ("Pending", "Failed"):
-                data.pop("ci_run_id", None)
-                data.pop("triggered_time", None)
-            updates.append((fpath, data))
         if not updates:
             return
@@ -410,10 +485,26 @@ class CIDispatcher:
                     self._repo.index.commit(
                         f"[dispatcher] Reconcile {len(staged)} entries from Azure status"
                     )
-                    self._repo.remotes.origin.push(self._repo.active_branch.name)
-                    logger.info(
-                        "[dispatcher] Pushed reconciled status for %d entries", len(staged)
-                    )
             except Exception:
                 logger.error(
                     "[dispatcher] Failed to push reconciled status updates", exc_info=True
@@ -444,49 +535,32 @@ class CIDispatcher:
             return None
     @staticmethod
-    def _map_azure_state(azure_state: dict) -> str:
-        """Map Azure pipeline run state/result to local status string.
-        Azure states:
-            state: "inProgress" | "completed" | "canceling"
-            result (when completed): "succeeded" | "failed" | "canceled"
-        Local mapping:
-            inProgress       → Running  (CI is actively executing)
-            completed+succeeded → Running  (keep Running; CI template will write Finished)
-            completed+failed    → Pending  (reset for retry)
-            completed+canceled  → Pending  (reset for retry)
-            canceling           → Running  (still active, wait for completion)
-        """
-        state = azure_state.get("state", "")
-        result = azure_state.get("result", "")
-        if state == _AZP_STATE_COMPLETED:
-            if result == _AZP_RESULT_SUCCEEDED:
-                # Don't touch — the CI pipeline itself writes Finished via
-                # git-status-template.yml.  Keep as Running so dispatcher
-                # doesn't re-trigger.
-                return "Running"
-            # failed or canceled → reset to Pending for retry
-            return "Pending"
-        # inProgress / canceling → mark as Running
-        return "Running"
-    @staticmethod
-    def _fallback_stale_check(data: dict) -> str | None:
-        """Return new status if stale beyond fallback threshold, else None."""
         tstr = data.get("triggered_time", "")
         if not tstr:
             return None
         try:
-            triggered_at = datetime.strptime(tstr, "%Y-%m-%dT%H:%M:%SZ")
         except ValueError:
             return None
-        threshold = datetime.utcnow() - timedelta(hours=_STALE_FALLBACK_HOURS)
-        if triggered_at < threshold:
-            return "Pending"
-        return None
     @staticmethod
     def _apply_retry_limit(data: dict, candidate_status: str, model: str) -> str:

 Active status reconciliation:
     Each scan cycle queries the Azure DevOps Runs API for entries that have a
+    ``ci_run_id``.  The real pipeline state is mapped back to local status.
+    Time-based guards cover ALL failure modes:
+    1. Azure ``completed+failed/canceled`` → retry (immediate)
+    2. Azure ``completed+succeeded`` but no write-back after grace period → retry
+    3. Azure ``inProgress`` beyond max runtime → retry (pipeline hung)
+    4. Azure API unreachable beyond fallback timeout → retry
+    5. All retries go through ``_MAX_RETRIES`` limit → ``Failed``
 """
 import base64
 # Statuses that count towards active concurrency slots
 _ACTIVE_STATUSES = frozenset({"Running", "Waiting", "Triggered"})
+# ── Timeout & retry constants ────────────────────────────────────────────
+# After Azure reports succeeded, wait this long for CI to write back Finished.
+# If status is still Running after this, assume write-back failed and retry.
+_SUCCEEDED_GRACE_HOURS = 1
+# Absolute max hours any entry can stay in Running/Triggered/Waiting.
+# Covers pipelines that hang indefinitely (Azure shows inProgress forever).
+_MAX_ACTIVE_HOURS = 8
+# Fallback when Azure API is unreachable — if triggered_time exceeds this,
+# assume something went wrong and reset.
+_API_UNREACHABLE_HOURS = 6
+# Cooldown: minimum hours between retries of the same entry.
+# Prevents rapid-fire retries for deterministic failures (OOM, bad model).
+_RETRY_COOLDOWN_HOURS = 0.5  # 30 minutes
 # Maximum number of pipeline retries before marking an entry as Failed.
 _MAX_RETRIES = 3
         return running, triggered, pending
     def _find_pending(self) -> list[tuple[str, dict]]:
+        """Return ``[(status_file_path, entry_dict), ...]`` sorted oldest-first.
+        Entries that have been retried recently (within ``_RETRY_COOLDOWN_HOURS``)
+        are excluded so deterministic failures don't burn through retries instantly.
+        """
         pending = []
         for fpath, data in self._iter_status_files():
             if (
                 data.get("status") == "Pending"
                 and data.get("script") in ("auto_quant", "auto_eval")
             ):
+                # Enforce cooldown for retried entries
+                if data.get("retry_count", 0) > 0:
+                    lft = data.get("last_failed_time", "")
+                    if lft:
+                        try:
+                            failed_at = datetime.strptime(lft, "%Y-%m-%dT%H:%M:%SZ")
+                            if self._hours_since(failed_at) < _RETRY_COOLDOWN_HOURS:
+                                logger.debug(
+                                    "[dispatcher] %s: cooldown (%.0f min remaining)",
+                                    data.get("model", "?"),
+                                    (_RETRY_COOLDOWN_HOURS - self._hours_since(failed_at)) * 60,
+                                )
+                                continue
+                        except ValueError:
+                            pass  # bad timestamp, proceed with dispatch
                 pending.append((fpath, data))
         pending.sort(key=lambda x: x[1].get("submitted_time", ""))
         return pending
     def _reconcile_active_runs(self):
         """Query Azure for all active entries and sync local status to reality.
+        Handles ALL failure scenarios:
+        1. Azure completed+failed/canceled → retry immediately
+        2. Azure completed+succeeded but no Finished write-back → retry after grace period
+        3. Azure inProgress beyond max runtime → retry (pipeline hung)
+        4. Azure API unreachable beyond fallback timeout → retry
+        5. Entries without ci_run_id stuck in active state → retry after timeout
         """
         active_entries: list[tuple[str, dict]] = []
         for fpath, data in self._iter_status_files():
+            if data.get("status") in _ACTIVE_STATUSES:
                 active_entries.append((fpath, data))
         if not active_entries:
         updates: list[tuple[str, dict]] = []  # entries that need git write-back
         for fpath, data in active_entries:
+            run_id = data.get("ci_run_id")
             model = data.get("model", "?")
             old_status = data["status"]
+            triggered_at = self._parse_triggered_time(data)
+            # ── No ci_run_id (e.g. Triggered but API call never returned) ──
+            if not run_id:
+                if triggered_at and self._hours_since(triggered_at) > _API_UNREACHABLE_HOURS:
+                    new_status = self._apply_retry_limit(data, "Pending", model)
+                    self._reset_active_fields(data, new_status)
+                    updates.append((fpath, data))
+                    logger.warning(
+                        "[dispatcher] %s: no ci_run_id, stale %s → %s",
+                        model, old_status, new_status,
+                    )
+                continue
+            # ── Query Azure ──
             azure_state = self._query_azure_run(run_id)
             if azure_state is None:
                 # API unreachable — apply fallback timeout
+                if triggered_at and self._hours_since(triggered_at) > _API_UNREACHABLE_HOURS:
+                    new_status = self._apply_retry_limit(data, "Pending", model)
+                    self._reset_active_fields(data, new_status)
                     updates.append((fpath, data))
                     logger.warning(
+                        "[dispatcher] Fallback: %s → %s (Azure unreachable for %.1fh)",
+                        model, new_status, self._hours_since(triggered_at),
                     )
                 continue
+            state = azure_state.get("state", "")
+            result = azure_state.get("result", "")
+            # ── Azure: completed ──
+            if state == _AZP_STATE_COMPLETED:
+                if result == _AZP_RESULT_SUCCEEDED:
+                    # CI pipeline succeeded. Normally upload_results_github.py
+                    # writes back Finished. Give it a grace period.
+                    if triggered_at and self._hours_since(triggered_at) > _SUCCEEDED_GRACE_HOURS:
+                        # Grace period expired — write-back likely failed
+                        new_status = self._apply_retry_limit(data, "Pending", model)
+                        self._reset_active_fields(data, new_status)
+                        updates.append((fpath, data))
+                        logger.warning(
+                            "[dispatcher] %s: Azure succeeded but no Finished write-back "
+                            "after %.1fh — %s → %s",
+                            model, self._hours_since(triggered_at), old_status, new_status,
+                        )
+                    elif old_status != "Running":
+                        # Within grace period, ensure status is Running
+                        data["status"] = "Running"
+                        updates.append((fpath, data))
+                else:
+                    # failed or canceled → retry immediately
+                    new_status = self._apply_retry_limit(data, "Pending", model)
+                    self._reset_active_fields(data, new_status)
+                    updates.append((fpath, data))
+                    logger.info(
+                        "[dispatcher] %s: Azure %s → %s → %s",
+                        model, result, old_status, new_status,
+                    )
+                continue
+            # ── Azure: inProgress / canceling ──
+            # Check absolute max runtime
+            if triggered_at and self._hours_since(triggered_at) > _MAX_ACTIVE_HOURS:
+                new_status = self._apply_retry_limit(data, "Pending", model)
+                self._reset_active_fields(data, new_status)
+                updates.append((fpath, data))
+                logger.warning(
+                    "[dispatcher] %s: exceeded max runtime (%.1fh > %dh) — %s → %s",
+                    model, self._hours_since(triggered_at), _MAX_ACTIVE_HOURS,
+                    old_status, new_status,
+                )
+            elif old_status != "Running":
+                # Azure shows inProgress, update local to Running
+                data["status"] = "Running"
+                updates.append((fpath, data))
+                logger.info(
+                    "[dispatcher] %s: Azure inProgress — %s → Running",
+                    model, old_status,
+                )
         if not updates:
             return
                     self._repo.index.commit(
                         f"[dispatcher] Reconcile {len(staged)} entries from Azure status"
                     )
+                    # Push with retry: CI may have pushed concurrently
+                    branch = self._repo.active_branch.name
+                    for attempt in range(3):
+                        try:
+                            self._repo.remotes.origin.push(branch)
+                            logger.info(
+                                "[dispatcher] Pushed reconciled status for %d entries",
+                                len(staged),
+                            )
+                            break
+                        except Exception:
+                            if attempt < 2:
+                                logger.warning(
+                                    "[dispatcher] Push conflict (attempt %d/3), "
+                                    "pulling and retrying",
+                                    attempt + 1,
+                                )
+                                self._repo.remotes.origin.pull(branch)
+                            else:
+                                raise
             except Exception:
                 logger.error(
                     "[dispatcher] Failed to push reconciled status updates", exc_info=True
             return None
     @staticmethod
+    def _parse_triggered_time(data: dict) -> datetime | None:
+        """Parse triggered_time string to datetime, or None."""
         tstr = data.get("triggered_time", "")
         if not tstr:
             return None
         try:
+            return datetime.strptime(tstr, "%Y-%m-%dT%H:%M:%SZ")
         except ValueError:
             return None
+    @staticmethod
+    def _hours_since(dt: datetime) -> float:
+        """Return hours elapsed since the given UTC datetime."""
+        return (datetime.utcnow() - dt).total_seconds() / 3600
+    @staticmethod
+    def _reset_active_fields(data: dict, new_status: str):
+        """Set new status and clean up active-state fields."""
+        data["status"] = new_status
+        if new_status in ("Pending", "Failed"):
+            data.pop("ci_run_id", None)
+            data.pop("triggered_time", None)
+            # Record when this reset happened for cooldown enforcement
+            data["last_failed_time"] = (
+                datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
+            )
     @staticmethod
     def _apply_retry_limit(data: dict, candidate_status: str, model: str) -> str:
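The cooldown filter added to `_find_pending` can be exercised in isolation with a sketch like the one below. It assumes the same timestamp format as the diff (`%Y-%m-%dT%H:%M:%SZ`) and passes the current time in explicitly so the behavior is deterministic. The names `in_cooldown` and `dispatchable` are hypothetical and do not exist in `src/ci_dispatcher.py`.

```python
from datetime import datetime

RETRY_COOLDOWN_HOURS = 0.5  # 30 minutes, as in the diff
TIME_FMT = "%Y-%m-%dT%H:%M:%SZ"


def in_cooldown(entry: dict, now: datetime) -> bool:
    """True if the entry was retried less than RETRY_COOLDOWN_HOURS ago."""
    if entry.get("retry_count", 0) <= 0:
        return False  # never retried, so no cooldown applies
    stamp = entry.get("last_failed_time", "")
    if not stamp:
        return False
    try:
        failed_at = datetime.strptime(stamp, TIME_FMT)
    except ValueError:
        return False  # bad timestamp: dispatch anyway, the same choice the diff makes
    return (now - failed_at).total_seconds() / 3600 < RETRY_COOLDOWN_HOURS


def dispatchable(entries: list, now: datetime) -> list:
    """Drop entries still cooling down, then order oldest submission first."""
    ready = [e for e in entries if not in_cooldown(e, now)]
    return sorted(ready, key=lambda e: e.get("submitted_time", ""))


now = datetime(2026, 4, 1, 12, 0, 0)
entries = [
    {"model": "a", "submitted_time": "2026-04-01T10:00:00Z",
     "retry_count": 1, "last_failed_time": "2026-04-01T11:45:00Z"},
    {"model": "b", "submitted_time": "2026-04-01T09:00:00Z",
     "retry_count": 1, "last_failed_time": "2026-04-01T11:30:00Z"},
    {"model": "c", "submitted_time": "2026-04-01T08:00:00Z"},
    {"model": "d", "submitted_time": "2026-04-01T07:00:00Z",
     "retry_count": 2, "last_failed_time": "garbage"},
]
order = [e["model"] for e in dispatchable(entries, now)]
```

Because the comparison is a strict `<`, an entry whose failure is exactly 30 minutes old (model `b` above) is already eligible, while one that failed 15 minutes ago (model `a`) is skipped.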

src/queue_eta.py CHANGED

@@ -132,8 +132,9 @@ def compute_single_eta(
                 elif status == "Pending" and script in ("auto_quant", "auto_eval"):
                     pending_count += 1
-    # The new submission will be at position = pending_count + 1
-    queue_pos = pending_count + 1
     task_hours = estimate_task_hours(model_params)
     if active_count > 0:

                 elif status == "Pending" and script in ("auto_quant", "auto_eval"):
                     pending_count += 1
+    # The model's own status file has already been written by _upload_to_hub,
+    # so it is already included in pending_count.
+    queue_pos = max(pending_count, 1)
     task_hours = estimate_task_hours(model_params)
     if active_count > 0:
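The effect of the `queue_pos` change in src/queue_eta.py is easiest to see with numbers. The short sketch below assumes an idle system, a concurrency of 2 and a 3h task, and compares the old and new expressions for the second model submitted in a row. The helper `rounds` is hypothetical and only mirrors the ceiling term of the ETA formula.

```python
import math


def rounds(queue_pos: int, concurrency: int = 2) -> int:
    """Number of scheduling rounds before this position gets a slot."""
    return math.ceil(queue_pos / concurrency)


TASK_HOURS = 3
# Two Pending files are on disk, including the one just written for this submission.
pending_count = 2

old_queue_pos = pending_count + 1      # previous code: counts the submission twice
new_queue_pos = max(pending_count, 1)  # this commit: the file is already counted

old_eta = rounds(old_queue_pos) * TASK_HOURS
new_eta = rounds(new_queue_pos) * TASK_HOURS
```

Under these assumptions the old expression reports 6h for a model that can start immediately in the second slot, while the new one reports 3h.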

tests/test_ci_dispatcher.py CHANGED

@@ -30,12 +30,15 @@ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 from src.ci_dispatcher import (
     CIDispatcher,
     _ACTIVE_STATUSES,
     _AZP_RESULT_CANCELED,
     _AZP_RESULT_FAILED,
     _AZP_RESULT_SUCCEEDED,
     _AZP_STATE_COMPLETED,
     _MAX_RETRIES,
-    _STALE_FALLBACK_HOURS,
 )
@@ -98,52 +101,162 @@ def _mock_azure_response(status_code=200, body=None):
 # ═══════════════════════════════════════════════════════════════════════
-# 1. _map_azure_state
 # ═══════════════════════════════════════════════════════════════════════
-class TestMapAzureState:
-    """Azure (state, result) → local status string."""
-    @pytest.mark.parametrize("state, result, expected", [
-        ("inProgress", "",          "Running"),
-        ("canceling",  "",          "Running"),
-        ("completed",  "succeeded", "Running"),   # CI template writes Finished
-        ("completed",  "failed",    "Pending"),
-        ("completed",  "canceled",  "Pending"),
-        ("unknown",    "",          "Running"),   # fallback for unknown
-    ])
-    def test_mapping(self, state, result, expected):
-        assert CIDispatcher._map_azure_state({"state": state, "result": result}) == expected
 # ═══════════════════════════════════════════════════════════════════════
-# 2. _fallback_stale_check
 # ═══════════════════════════════════════════════════════════════════════
-class TestFallbackStaleCheck:
-    """Fallback when Azure API is unreachable."""
-    def test_no_triggered_time(self):
-        assert CIDispatcher._fallback_stale_check({"status": "Triggered"}) is None
-    def test_recent_entry_stays(self):
-        data = {"status": "Triggered", "triggered_time": _old_str(1)}
-        assert CIDispatcher._fallback_stale_check(data) is None
-    def test_exactly_at_threshold_stays(self):
-        data = {"status": "Triggered", "triggered_time": _old_str(_STALE_FALLBACK_HOURS)}
-        # at boundary — not strictly less-than
-        result = CIDispatcher._fallback_stale_check(data)
-        # might be None or Pending depending on seconds; either is acceptable
-        assert result in (None, "Pending")
-    def test_stale_entry_resets(self):
-        data = {"status": "Triggered", "triggered_time": _old_str(_STALE_FALLBACK_HOURS + 2)}
-        assert CIDispatcher._fallback_stale_check(data) == "Pending"
-    def test_bad_timestamp(self):
-        data = {"status": "Triggered", "triggered_time": "garbage"}
-        assert CIDispatcher._fallback_stale_check(data) is None
 # ═══════════════════════════════════════════════════════════════════════
@@ -303,7 +416,7 @@ class TestReconcileActiveRuns:
     def test_unreachable_stale_entry_resets(self):
         fp = _write_status(self._status_dir, "org", "h.json", {
             "model": "org/h", "status": "Triggered", "script": "auto_eval",
-            "ci_run_id": 90, "triggered_time": _old_str(_STALE_FALLBACK_HOURS + 2),
         })
         d = self._dispatcher()
         with patch("src.ci_dispatcher.http_requests.get", side_effect=ConnectionError):
@@ -313,12 +426,13 @@ class TestReconcileActiveRuns:
         assert "ci_run_id" not in data
         assert "triggered_time" not in data
-    # -- Entries without ci_run_id are skipped ------------------------------
-    def test_pending_without_run_id_untouched(self):
-        fp = _write_status(self._status_dir, "org", "i.json", {
-            "model": "org/i", "status": "Pending", "script": "auto_quant",
-            "submitted_time": "2026-04-01T00:00:00Z",
         })
         d = self._dispatcher()
         with patch("src.ci_dispatcher.http_requests.get") as mock_get:
@@ -488,6 +602,41 @@ class TestFindPending:
         d._repo.working_dir = self._tmpdir
         assert d._find_pending() == []
 # ═══════════════════════════════════════════════════════════════════════
 # 7. _batch_update_status
@@ -744,7 +893,7 @@ class TestRetryLimitIntegration:
     def test_fallback_stale_also_counts_retry(self):
         fp = _write_status(self._status_dir, "org", "f.json", {
             "model": "org/f", "status": "Triggered", "script": "auto_quant",
-            "ci_run_id": 50, "triggered_time": _old_str(_STALE_FALLBACK_HOURS + 2),
             "retry_count": 2,
         })
         d = self._dispatcher()

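@@ -30,12 +30,15 @@ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 from src.ci_dispatcher import (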
     CIDispatcher,
     _ACTIVE_STATUSES,
+    _API_UNREACHABLE_HOURS,
     _AZP_RESULT_CANCELED,
     _AZP_RESULT_FAILED,
     _AZP_RESULT_SUCCEEDED,
     _AZP_STATE_COMPLETED,
+    _MAX_ACTIVE_HOURS,
     _MAX_RETRIES,
+    _RETRY_COOLDOWN_HOURS,
+    _SUCCEEDED_GRACE_HOURS,
 )
@@ -98,52 +101,162 @@ def _mock_azure_response(status_code=200, body=None):
 # ═══════════════════════════════════════════════════════════════════════
+# 1. _parse_triggered_time / _hours_since / _reset_active_fields
 # ═══════════════════════════════════════════════════════════════════════
+class TestHelpers:
+    """Test the static helper methods on CIDispatcher."""
+    def test_parse_triggered_time_valid(self):
+        data = {"triggered_time": "2026-04-01T12:00:00Z"}
+        result = CIDispatcher._parse_triggered_time(data)
+        assert result == datetime(2026, 4, 1, 12, 0, 0)
+    def test_parse_triggered_time_missing(self):
+        assert CIDispatcher._parse_triggered_time({}) is None
+    def test_parse_triggered_time_garbage(self):
+        assert CIDispatcher._parse_triggered_time({"triggered_time": "garbage"}) is None
+    def test_hours_since(self):
+        two_hours_ago = datetime.utcnow() - timedelta(hours=2)
+        h = CIDispatcher._hours_since(two_hours_ago)
+        assert 1.9 < h < 2.1
+    def test_reset_active_fields_pending(self):
+        data = {"status": "Running", "ci_run_id": 42, "triggered_time": "2026-04-01T00:00:00Z", "model": "a/b"}
+        CIDispatcher._reset_active_fields(data, "Pending")
+        assert data["status"] == "Pending"
+        assert "ci_run_id" not in data
+        assert "triggered_time" not in data
+        assert "last_failed_time" in data  # records when reset happened
+        assert data["model"] == "a/b"  # preserved
+    def test_reset_active_fields_failed(self):
+        data = {"status": "Running", "ci_run_id": 42, "triggered_time": "x"}
+        CIDispatcher._reset_active_fields(data, "Failed")
+        assert data["status"] == "Failed"
+        assert "ci_run_id" not in data
+        assert "last_failed_time" in data
+    def test_reset_active_fields_running_keeps_fields(self):
+        data = {"status": "Triggered", "ci_run_id": 42, "triggered_time": "x"}
+        CIDispatcher._reset_active_fields(data, "Running")
+        assert data["status"] == "Running"
+        assert data["ci_run_id"] == 42  # not removed
 # ═══════════════════════════════════════════════════════════════════════
+# 2. Reconcile: succeeded grace period & max active timeout
 # ═══════════════════════════════════════════════════════════════════════
+class TestReconcileTimeouts:
+    """Test the time-based guards added to _reconcile_active_runs."""
+    def setup_method(self):
+        self._tmpdir = tempfile.mkdtemp(prefix="test_timeout_")
+        self._status_dir = os.path.join(self._tmpdir, "status")
+        os.makedirs(self._status_dir)
+    def _dispatcher(self, **kw):
+        d = _make_dispatcher(self._status_dir, **kw)
+        d._repo.working_dir = self._tmpdir
+        return d
+    def test_succeeded_within_grace_period_stays_running(self):
+        """Azure succeeded recently — stay Running, wait for write-back."""
+        fp = _write_status(self._status_dir, "org", "a.json", {
+            "model": "org/a", "status": "Running", "script": "auto_quant",
+            "ci_run_id": 10, "triggered_time": _now_str(),
+        })
+        d = self._dispatcher()
+        resp = _mock_azure_response(200, {"state": "completed", "result": "succeeded"})
+        with patch("src.ci_dispatcher.http_requests.get", return_value=resp):
+            d._reconcile_active_runs()
+        assert _read_status(fp)["status"] == "Running"
+        assert _read_status(fp)["ci_run_id"] == 10
+    def test_succeeded_past_grace_period_resets(self):
+        """Azure succeeded long ago but no Finished write-back — retry."""
+        fp = _write_status(self._status_dir, "org", "b.json", {
+            "model": "org/b", "status": "Running", "script": "auto_quant",
+            "ci_run_id": 20, "triggered_time": _old_str(_SUCCEEDED_GRACE_HOURS + 1),
+        })
+        d = self._dispatcher()
+        resp = _mock_azure_response(200, {"state": "completed", "result": "succeeded"})
+        with patch("src.ci_dispatcher.http_requests.get", return_value=resp):
+            d._reconcile_active_runs()
+        data = _read_status(fp)
+        assert data["status"] == "Pending"
+        assert "ci_run_id" not in data
+        assert data["retry_count"] == 1
+    def test_in_progress_within_max_active_stays(self):
+        """Azure inProgress within max active hours — keep Running."""
+        fp = _write_status(self._status_dir, "org", "c.json", {
+            "model": "org/c", "status": "Running", "script": "auto_quant",
+            "ci_run_id": 30, "triggered_time": _old_str(2),
+        })
+        d = self._dispatcher()
+        resp = _mock_azure_response(200, {"state": "inProgress", "result": ""})
+        with patch("src.ci_dispatcher.http_requests.get", return_value=resp):
+            d._reconcile_active_runs()
+        assert _read_status(fp)["status"] == "Running"
+    def test_in_progress_past_max_active_resets(self):
+        """Azure inProgress beyond max active hours — pipeline hung, retry."""
+        fp = _write_status(self._status_dir, "org", "d.json", {
+            "model": "org/d", "status": "Running", "script": "auto_quant",
+            "ci_run_id": 40, "triggered_time": _old_str(_MAX_ACTIVE_HOURS + 1),
+        })
+        d = self._dispatcher()
+        resp = _mock_azure_response(200, {"state": "inProgress", "result": ""})
+        with patch("src.ci_dispatcher.http_requests.get", return_value=resp):
+            d._reconcile_active_runs()
+        data = _read_status(fp)
+        assert data["status"] == "Pending"
+        assert data["retry_count"] == 1
+    def test_no_ci_run_id_stale_resets(self):
+        """Entry stuck in Triggered without ci_run_id — API call never returned."""
+        fp = _write_status(self._status_dir, "org", "e.json", {
+            "model": "org/e", "status": "Triggered", "script": "auto_quant",
+            "triggered_time": _old_str(_API_UNREACHABLE_HOURS + 1),
+        })
+        d = self._dispatcher()
+        with patch("src.ci_dispatcher.http_requests.get") as mock_get:
+            d._reconcile_active_runs()
+            mock_get.assert_not_called()  # no ci_run_id, no API call
+        data = _read_status(fp)
+        assert data["status"] == "Pending"
+        assert data["retry_count"] == 1
+    def test_no_ci_run_id_recent_stays(self):
+        """Entry recently Triggered without ci_run_id — still within timeout."""
+        fp = _write_status(self._status_dir, "org", "f.json", {
+            "model": "org/f", "status": "Triggered", "script": "auto_quant",
+            "triggered_time": _now_str(),
+        })
+        d = self._dispatcher()
+        with patch("src.ci_dispatcher.http_requests.get") as mock_get:
+            d._reconcile_active_runs()
+            mock_get.assert_not_called()
+        assert _read_status(fp)["status"] == "Triggered"
+    def test_succeeded_grace_period_exhausts_retries(self):
+        """Succeeded but stuck after multiple retries → Failed."""
+        fp = _write_status(self._status_dir, "org", "g.json", {
+            "model": "org/g", "status": "Running", "script": "auto_quant",
+            "ci_run_id": 70, "triggered_time": _old_str(_SUCCEEDED_GRACE_HOURS + 1),
+            "retry_count": 2,
+        })
+        d = self._dispatcher()
+        resp = _mock_azure_response(200, {"state": "completed", "result": "succeeded"})
+        with patch("src.ci_dispatcher.http_requests.get", return_value=resp):
+            d._reconcile_active_runs()
+        data = _read_status(fp)
+        assert data["status"] == "Failed"
+        assert data["retry_count"] == 3
 # ═══════════════════════════════════════════════════════════════════════
@@ -303,7 +416,7 @@ class TestReconcileActiveRuns:
     def test_unreachable_stale_entry_resets(self):
         fp = _write_status(self._status_dir, "org", "h.json", {
             "model": "org/h", "status": "Triggered", "script": "auto_eval",
+            "ci_run_id": 90, "triggered_time": _old_str(_API_UNREACHABLE_HOURS + 2),
         })
         d = self._dispatcher()
         with patch("src.ci_dispatcher.http_requests.get", side_effect=ConnectionError):
@@ -313,12 +426,13 @@ class TestReconcileActiveRuns:
         assert "ci_run_id" not in data
         assert "triggered_time" not in data
+    # -- Entries without ci_run_id: active ones get timeout check -----------
+    def test_active_without_run_id_gets_timeout_check(self):
+        """Active entry with no ci_run_id but stale → resets to Pending."""
+        fp = _write_status(self._status_dir, "org", "norun.json", {
+            "model": "org/norun", "status": "Triggered", "script": "auto_quant",
+            "triggered_time": _old_str(_API_UNREACHABLE_HOURS + 1),
         })
         d = self._dispatcher()
         with patch("src.ci_dispatcher.http_requests.get") as mock_get:
@@ -488,6 +602,41 @@ class TestFindPending:
         d._repo.working_dir = self._tmpdir
         assert d._find_pending() == []
+    def test_cooldown_excludes_recent_retry(self):
+        """Retried entry within cooldown period is NOT picked up."""
+        _write_status(self._status_dir, "a", "retry.json", {
+            "model": "a/retry", "status": "Pending", "script": "auto_quant",
+            "submitted_time": "2026-04-01T00:00:00Z",
+            "retry_count": 1,
+            "last_failed_time": _now_str(),
+        })
+        d = _make_dispatcher(self._status_dir)
+        d._repo.working_dir = self._tmpdir
+        assert d._find_pending() == []
+    def test_cooldown_allows_old_retry(self):
+        """Retried entry past cooldown period IS picked up."""
+        _write_status(self._status_dir, "a", "retry.json", {
+            "model": "a/retry", "status": "Pending", "script": "auto_quant",
+            "submitted_time": "2026-04-01T00:00:00Z",
+            "retry_count": 1,
+            "last_failed_time": _old_str(_RETRY_COOLDOWN_HOURS + 1),
+        })
+        d = _make_dispatcher(self._status_dir)
+        d._repo.working_dir = self._tmpdir
+        pending = d._find_pending()
+        assert len(pending) == 1
+    def test_first_time_pending_has_no_cooldown(self):
+        """Fresh Pending entry (no retry_count) is always picked up."""
+        _write_status(self._status_dir, "a", "fresh.json", {
+            "model": "a/fresh", "status": "Pending", "script": "auto_quant",
+            "submitted_time": "2026-04-01T00:00:00Z",
+        })
+        d = _make_dispatcher(self._status_dir)
+        d._repo.working_dir = self._tmpdir
+        assert len(d._find_pending()) == 1
 # ═══════════════════════════════════════════════════════════════════════
 # 7. _batch_update_status
@@ -744,7 +893,7 @@ class TestRetryLimitIntegration:
     def test_fallback_stale_also_counts_retry(self):
         fp = _write_status(self._status_dir, "org", "f.json", {
             "model": "org/f", "status": "Triggered", "script": "auto_quant",
+            "ci_run_id": 50, "triggered_time": _old_str(_API_UNREACHABLE_HOURS + 2),
             "retry_count": 2,
         })
         d = self._dispatcher()