Space: Dev-the-dev91/instruction-fine-tuning-budget

Zeke Long committed on Apr 11

Commit 0d4e1b9 · 1 parent: ecec64b

Updating to address the bottleneck in the poll for next round

Files changed (6)

README.md +109 -12
aggregator_client.py +72 -15
app.py +114 -82
notebook/distributed_finetuning_peft_async.ipynb +2 -1
tests/test_aggregator_client.py +14 -0
tests/test_app.py +19 -3

README.md CHANGED

@@ -55,7 +55,7 @@ AGGREGATOR_URL = "https://your-username-your-space.hf.space"
 NODE_SECRET = "my-super-secret-123"  # must match the Space secret
 ```
-Use the Space’s **direct app host** as `AGGREGATOR_URL`: **`https://<owner>-<space-name>.hf.space`** (open your Space → use the **App** tab URL, or copy from the Space card). It is **not** `huggingface.co/spaces/user/repo`, **not** `https://<owner>/<space>.hf.space` (a slash makes the client try to resolve `<owner>` as the hostname and DNS fails), and **not** a model repo id. The client accepts the host **with or without** `https://` (missing scheme defaults to `https://`) and **rewrites** the common `owner/space.hf.space` typo into `owner-space.hf.space`. API paths are `/submit`, `/status`, `/health`, `/reset`.
 Then call the aggregator client:
@@ -80,17 +80,9 @@ When the last node completes a round, the client may receive **`merge_failed`**
 Nodes do **not** need `MODEL_REPO_ID` — only the aggregator uses it to download/upload adapter weights.
-#### 3. Local testing
-Export the variables before running the server:
-```bash
-export HF_TOKEN="hf_..."
-export MODEL_REPO_ID="your-username/your-model-repo"
-export NODE_SECRET="local_test_secret"
-```
-Without these, the app defaults to empty strings (merge is skipped) and `"local_test_secret"` for the node secret.
 ## Operator notes
@@ -98,6 +90,7 @@ Without these, the app defaults to empty strings (merge is skipped) and `"local_
 - **Secrets:** Never commit `HF_TOKEN`, `NODE_SECRET`, or tokens in git remotes. Use Space **Repository secrets** and a local env or credential helper.
 - **Reset:** `POST /reset` with JSON `{"secret_key": "<ADMIN_SECRET or NODE_SECRET>"}` clears round state to 1. If `ADMIN_SECRET` is set on the Space, use that; otherwise use `NODE_SECRET`.
 - **Protected status:** When `STATUS_READ_SECRET` is set, pass the same value as header `X-Status-Secret` (see `aggregator_client.check_aggregator` / `poll_for_next_round` argument `status_secret`, and notebook `CONFIG["status_read_secret"]`).
 - **Rate limits:** `POST /submit` is limited per client IP (first `X-Forwarded-For` hop when present). Override with **`RATE_LIMIT_SUBMIT_MAX`** (default 120) and **`RATE_LIMIT_SUBMIT_WINDOW_SEC`** (default 60).
 - **Logs:** Set **`LOG_LEVEL`** (e.g. `DEBUG`, `INFO`, `WARNING`) for the `peft.aggregator` logger on stdout.
 - **Public Space:** `GET /status` is world-readable on a public Space; use a private Space if round visibility matters.
@@ -109,14 +102,118 @@ Without these, the app defaults to empty strings (merge is skipped) and `"local_
 - **Nodes:** 3 x Google Colab free T4 GPU
 - **Aggregation:** FedAvg over adapter states
-## Local Testing
 ```bash
 pip install -r requirements.txt
 pip install torch --index-url https://download.pytorch.org/whl/cpu
 uvicorn app:app --host 0.0.0.0 --port 7860 --reload
 ```
 **Note:** `requirements.txt` caps **FastAPI** and **Starlette** below versions that ship **Starlette 1.x**. Gradio **4.44.x** is incompatible with that stack (the Space would return **500** on `GET /` with Jinja `unhashable type: 'dict'`). Upgrade Gradio before raising those caps.
 Deadline: April 20, 2026 · Lead target April 13

 NODE_SECRET = "my-super-secret-123"  # must match the Space secret
 ```
+Use the Space’s **direct app host** as `AGGREGATOR_URL`: **`https://<owner>-<space-name>.hf.space`** (open your Space → use the **App** tab URL, or copy from the Space card). It is **not** `huggingface.co/spaces/user/repo`, **not** `https://<owner>/<space>.hf.space` (a slash makes the client try to resolve `<owner>` as the hostname and DNS fails), and **not** a model repo id. The client accepts the host **with or without** `https://` (missing scheme defaults to `https://`), fixes **`https:host`** typos (missing slashes after the scheme), and **rewrites** the common `owner/space.hf.space` mistake into `owner-space.hf.space`. API paths are `/submit`, `/status`, `/health`, `/reset`.
 Then call the aggregator client:
 Nodes do **not** need `MODEL_REPO_ID` — only the aggregator uses it to download/upload adapter weights.
+#### 3. Local or Space runtime env
+For **local** runs, export `HF_TOKEN`, `MODEL_REPO_ID`, and `NODE_SECRET` (or rely on defaults described in **Testing** below). On the **Hugging Face Space**, set the same keys as **Repository secrets**.
 ## Operator notes
 - **Secrets:** Never commit `HF_TOKEN`, `NODE_SECRET`, or tokens in git remotes. Use Space **Repository secrets** and a local env or credential helper.
 - **Reset:** `POST /reset` with JSON `{"secret_key": "<ADMIN_SECRET or NODE_SECRET>"}` clears round state to 1. If `ADMIN_SECRET` is set on the Space, use that; otherwise use `NODE_SECRET`.
 - **Protected status:** When `STATUS_READ_SECRET` is set, pass the same value as header `X-Status-Secret` (see `aggregator_client.check_aggregator` / `poll_for_next_round` argument `status_secret`, and notebook `CONFIG["status_read_secret"]`).
+- **401 Unauthorized:** (1) **`POST /submit`** — JSON `secret_key` must match the Space **`NODE_SECRET`** (same string in every Colab `CONFIG["node_secret"]`). (2) **`GET /status`** — if the Space defines **`STATUS_READ_SECRET`**, clients must send that value (notebook `CONFIG["status_read_secret"]`, or clear `STATUS_READ_SECRET` on the Space if you do not need it). **`GET /health`** has no secret. A **private** Hugging Face Space can also return 401 at the edge before your app runs — open the Space in the browser while logged in, or check Space visibility settings.
 - **Rate limits:** `POST /submit` is limited per client IP (first `X-Forwarded-For` hop when present). Override with **`RATE_LIMIT_SUBMIT_MAX`** (default 120) and **`RATE_LIMIT_SUBMIT_WINDOW_SEC`** (default 60).
 - **Logs:** Set **`LOG_LEVEL`** (e.g. `DEBUG`, `INFO`, `WARNING`) for the `peft.aggregator` logger on stdout.
 - **Public Space:** `GET /status` is world-readable on a public Space; use a private Space if round visibility matters.
 - **Nodes:** 3 x Google Colab free T4 GPU
 - **Aggregation:** FedAvg over adapter states
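
As a rough illustration of what "FedAvg over adapter states" means, the sketch below takes the unweighted mean of matching parameters across nodes. It is not the Space's `fedavg_merge` (its body is not part of this diff); the `fedavg` name and the use of plain lists in place of tensors are assumptions made to keep the example dependency-free.

```python
from typing import Dict, List

AdapterState = Dict[str, List[float]]  # hypothetical stand-in for a tensor state dict


def fedavg(states: List[AdapterState]) -> AdapterState:
    """Return the element-wise, unweighted mean of several adapter states."""
    if not states:
        raise ValueError("no adapter states to average")
    keys = states[0].keys()
    for other in states[1:]:
        if other.keys() != keys:
            raise ValueError("adapter states have different parameter names")
    n = len(states)
    # zip(*...) walks the same position of each node's parameter in lockstep.
    return {
        name: [sum(values) / n for values in zip(*(s[name] for s in states))]
        for name in keys
    }


merged = fedavg([
    {"lora_A": [1.0, 2.0]},
    {"lora_A": [3.0, 4.0]},
    {"lora_A": [5.0, 6.0]},
])
print(merged)
```

A weighted variant (for example by `steps_completed`) would replace the plain mean; whether the aggregator weights nodes is not stated here.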
+## Testing
+### Automated tests (CI / laptop)
+From the repo root, use a virtualenv with **Python 3.10+** (3.12 is fine). Install dependencies and run the suite:
+```bash
+python -m venv .venv
+source .venv/bin/activate   # Windows: .venv\Scripts\activate
+pip install -r requirements.txt
+pip install pytest httpx "gradio==4.44.1"
+pip install torch --index-url https://download.pytorch.org/whl/cpu
+pytest tests/ -q
+```
+Tests exercise **FastAPI** routes (`/health`, `/status`, `/submit`, `/reset`) via `TestClient`. They set **`HF_TOKEN`** and **`MODEL_REPO_ID`** empty so **FedAvg is skipped** and no Hub network calls are required. **`aggregator_client`** URL normalization is covered in `tests/test_aggregator_client.py`.
+### Run locally (same stack as the Space)
 ```bash
+export NODE_SECRET="local_test_secret"
+# Optional: real FedAvg on Hub (otherwise merge is skipped when the third node submits)
+export HF_TOKEN=""
+export MODEL_REPO_ID=""
 pip install -r requirements.txt
 pip install torch --index-url https://download.pytorch.org/whl/cpu
+pip install "gradio==4.44.1"
 uvicorn app:app --host 0.0.0.0 --port 7860 --reload
 ```
+Open **`http://127.0.0.1:7860/`** for the Gradio dashboard. Smoke-test JSON endpoints:
+```bash
+BASE="http://127.0.0.1:7860"
+curl -sS "$BASE/health"
+curl -sS "$BASE/status"
+curl -sS -X POST "$BASE/submit" -H "Content-Type: application/json" \
+  -d "{\"node_id\":\"node_a\",\"secret_key\":\"local_test_secret\"}"
+```
+If the Space uses **`STATUS_READ_SECRET`**, mirror that locally:
+```bash
+curl -sS "$BASE/status" -H "X-Status-Secret: your-secret"
+```
 **Note:** `requirements.txt` caps **FastAPI** and **Starlette** below versions that ship **Starlette 1.x**. Gradio **4.44.x** is incompatible with that stack (the Space would return **500** on `GET /` with Jinja `unhashable type: 'dict'`). Upgrade Gradio before raising those caps.
+### Docker (parity with the HF Space)
+```bash
+docker build -t peft-aggregator .
+docker run --rm -p 7860:7860 \
+  -e NODE_SECRET="local_test_secret" \
+  -e HF_TOKEN="" \
+  -e MODEL_REPO_ID="" \
+  peft-aggregator
+```
+Then use the same **`curl`** examples with **`BASE=http://127.0.0.1:7860`**.
+### Test the deployed Hugging Face Space
+Use your Space **App** URL (hyphenated **`.hf.space`** host). Set **`BASE`** and **`SECRET`** to match **Repository secrets** on the Space.
+```bash
+BASE="https://YOUR_OWNER-YOUR_SPACE_NAME.hf.space"
+SECRET="your-node-secret-from-space-settings"
+curl -sS "$BASE/health"
+```
+**`GET /status`** — if **`STATUS_READ_SECRET`** is set on the Space, add the header; otherwise a public Space returns JSON without auth:
+```bash
+curl -sS "$BASE/status"
+# or: curl -sS "$BASE/status" -H "X-Status-Secret: $STATUS_READ_SECRET"
+```
+**`POST /submit`** — any node can submit independently; the first responses are **`"status":"submitted"`** with **`remaining`** until all three IDs have submitted for the current round. Omit **`round_num`** for a quick smoke test, or set **`round_num`** to the value returned by **`/status`** as **`current_round`** (mismatch → **409**).
+```bash
+for id in node_a node_b node_c; do
+  curl -sS -X POST "$BASE/submit" -H "Content-Type: application/json" \
+    -d "{\"node_id\":\"$id\",\"secret_key\":\"$SECRET\",\"avg_loss\":1.0,\"steps_completed\":10}"
+  echo
+done
+```
+After the third submit you should see **`"status":"round_complete"`** (with **`HF_TOKEN`/`MODEL_REPO_ID`** configured and valid adapter files on the Hub) or a message that merge was **skipped** / **`merge_failed`** if Hub setup is incomplete — both outcomes confirm the Space is running the aggregation logic.
+**`POST /reset`** — use **`ADMIN_SECRET`** if the Space defines it, else **`NODE_SECRET`**:
+```bash
+curl -sS -X POST "$BASE/reset" -H "Content-Type: application/json" \
+  -d "{\"secret_key\":\"$SECRET\"}"
+```
+**Python** — the same checks with **`aggregator_client`**:
+```python
+from aggregator_client import check_aggregator, health_aggregator, notify_aggregator
+BASE = "https://YOUR_OWNER-YOUR_SPACE_NAME.hf.space"
+SECRET = "your-node-secret"
+health_aggregator(BASE)
+check_aggregator(BASE, status_secret=None)  # or status_secret="..." if configured
+notify_aggregator(BASE, "node_a", SECRET, round_num=1)
+```
+**Private Space:** you may need to be logged into Hugging Face in a browser to open **`/`**; API calls from Colab or scripts still use **`NODE_SECRET`** on **`/submit`** and **`X-Status-Secret`** on **`/status`** when applicable — they do not use your HF login cookie.
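
In the `app.py` changes of this commit, the final `POST /submit` of a round answers with `"status": "merging"` and the merge outcome is then reported through the `merging` and `merge_error` fields of `GET /status`. A minimal sketch of how a script might interpret one `/status` payload, in the same order `poll_for_next_round` checks it, is shown here; `classify_status` and its return labels are hypothetical names, not part of `aggregator_client`.

```python
def classify_status(status: dict, finished_round: int) -> str:
    """Map one /status payload to a coarse outcome for the round just finished."""
    if status.get("merge_error"):
        return "failed"      # background FedAvg reported an error
    if status.get("current_round", 0) > finished_round:
        return "advanced"    # aggregator moved on; safe to start the next round
    if status.get("merging"):
        return "merging"     # all nodes submitted, FedAvg still running
    return "waiting"         # some nodes have not submitted yet


print(classify_status({"current_round": 1, "merging": True}, finished_round=1))
```

Keeping the decision in a pure function like this makes it easy to unit-test without a running Space.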
 Deadline: April 20, 2026 · Lead target April 13

aggregator_client.py CHANGED

@@ -15,20 +15,63 @@ from urllib.parse import urlsplit, urlunsplit
 import requests
 def _normalize_aggregator_base_url(url: str) -> str:
     """Strip whitespace/slashes and ensure an HTTP(S) scheme for requests.
     Colab configs often omit ``https://``, which causes requests to raise
     ``MissingSchema``.
-    Also fixes a common mistake: pasting ``https://OWNER/SPACE.hf.space`` (slash)
-    instead of the real host ``https://OWNER-SPACE.hf.space``. Otherwise the HTTP
-    client treats ``OWNER`` as the hostname and fails DNS.
     """
     base = url.strip().rstrip("/")
     if not base:
         raise ValueError("aggregator_url is empty")
-    if "://" not in base:
         base = "https://" + base.lstrip("/")
     parts = urlsplit(base)
@@ -109,7 +152,7 @@ def notify_aggregator(
             "(typically https://YOUR_SPACE_NAME.hf.space), not the huggingface.co/spaces "
             "HTML page URL. Path must be exactly /submit."
         )
-    response.raise_for_status()
     data = response.json()
     if data.get("status") == "merge_failed":
         raise AggregatorMergeFailed(data)
@@ -119,7 +162,7 @@ def notify_aggregator(
 def poll_for_next_round(
     aggregator_url: str,
     current_round: int,
-    poll_interval: int = 30,
     max_wait: int = 1800,
     status_secret: str | None = None,
 ) -> dict:
@@ -129,7 +172,7 @@ def poll_for_next_round(
         aggregator_url: Base URL of the aggregator Space (with or without
             ``https://``; missing scheme defaults to ``https://``).
         current_round: The round we just finished.
-        poll_interval: Seconds between status checks.
         max_wait: Maximum seconds to wait before raising TimeoutError.
         status_secret: If the Space sets STATUS_READ_SECRET, pass it here
             (sent as X-Status-Secret on GET /status).
@@ -139,6 +182,7 @@ def poll_for_next_round(
     Raises:
         TimeoutError: If max_wait is exceeded.
     """
     base = _normalize_aggregator_base_url(aggregator_url)
     url = f"{base}/status"
@@ -148,19 +192,32 @@ def poll_for_next_round(
     while elapsed < max_wait:
         try:
             resp = requests.get(url, timeout=15, headers=headers)
-            resp.raise_for_status()
             status = resp.json()
             agg_round = status.get("current_round", 0)
             if agg_round > current_round:
                 print(f"[poll] Aggregator advanced to round {agg_round}")
                 return status
         except requests.RequestException as e:
             print(f"[poll] Request error: {e}")
-        print(
-            f"[poll] Waiting for round {current_round + 1} "
-            f"({elapsed}/{max_wait}s elapsed)..."
-        )
         time.sleep(poll_interval)
         elapsed += poll_interval
@@ -183,7 +240,7 @@ def check_aggregator(
         timeout=timeout,
         headers=_status_headers(status_secret),
     )
-    response.raise_for_status()
     return response.json()
@@ -199,7 +256,7 @@ def reset_aggregator(
         json={"secret_key": secret_key},
         timeout=timeout,
     )
-    response.raise_for_status()
     return response.json()
@@ -207,5 +264,5 @@ def health_aggregator(aggregator_url: str, timeout: int = 10) -> dict:
     """Liveness probe via GET /health; does not depend on training state."""
     url = f"{_normalize_aggregator_base_url(aggregator_url)}/health"
     response = requests.get(url, timeout=timeout)
-    response.raise_for_status()
     return response.json()

 import requests
+def _detail_from_response(response: requests.Response) -> str:
+    try:
+        data = response.json()
+        if isinstance(data, dict):
+            d = data.get("detail")
+            if d is not None:
+                return str(d)  # list-valued details stringify the same way
+    except Exception:
+        pass
+    text = (response.text or "").strip()
+    if text:
+        return text[:800]
+    return response.reason or ""
+def _raise_for_aggregator_response(response: requests.Response, *, what: str) -> None:
+    """Raise HTTPError with server ``detail`` and hints for common operator mistakes."""
+    if response.status_code < 400:
+        return
+    detail = _detail_from_response(response)
+    msg = f"{what}: HTTP {response.status_code}"
+    if detail:
+        msg += f" — {detail}"
+    if response.status_code == 401:
+        msg += (
+            ". Hint: POST /submit needs JSON ``secret_key`` equal to the Space "
+            "``NODE_SECRET``. GET /status needs header ``X-Status-Secret`` when the "
+            "Space sets ``STATUS_READ_SECRET`` — pass ``status_secret=...`` from "
+            "``poll_for_next_round`` / ``check_aggregator``, or set "
+            "``CONFIG['status_read_secret']`` in the notebook to match the Space."
+        )
+    raise requests.HTTPError(msg, response=response)
 def _normalize_aggregator_base_url(url: str) -> str:
     """Strip whitespace/slashes and ensure an HTTP(S) scheme for requests.
     Colab configs often omit ``https://``, which causes requests to raise
     ``MissingSchema``.
+    Also fixes: ``https:host`` (missing ``//`` after the scheme); pasting
+    ``https://OWNER/SPACE.hf.space`` (slash) instead of ``https://OWNER-SPACE.hf.space``
+    (otherwise the client treats ``OWNER`` as the hostname and DNS fails).
     """
     base = url.strip().rstrip("/")
     if not base:
         raise ValueError("aggregator_url is empty")
+    low = base.lower()
+    # ``https:host`` / ``http:host`` (missing ``//``) does not contain ``://``;
+    # blindly prefixing ``https://`` would produce ``https://https:host`` and
+    # break urllib3 (InvalidURL).
+    if low.startswith("https:") and not low.startswith("https://"):
+        base = "https://" + base[6:].lstrip("/")
+    elif low.startswith("http:") and not low.startswith("http://"):
+        base = "http://" + base[5:].lstrip("/")
+    elif "://" not in base:
         base = "https://" + base.lstrip("/")
     parts = urlsplit(base)
             "(typically https://YOUR_SPACE_NAME.hf.space), not the huggingface.co/spaces "
             "HTML page URL. Path must be exactly /submit."
         )
+    _raise_for_aggregator_response(response, what="POST /submit")
     data = response.json()
     if data.get("status") == "merge_failed":
         raise AggregatorMergeFailed(data)
 def poll_for_next_round(
     aggregator_url: str,
     current_round: int,
+    poll_interval: int = 10,
     max_wait: int = 1800,
     status_secret: str | None = None,
 ) -> dict:
         aggregator_url: Base URL of the aggregator Space (with or without
             ``https://``; missing scheme defaults to ``https://``).
         current_round: The round we just finished.
+        poll_interval: Seconds between status checks (default 10).
         max_wait: Maximum seconds to wait before raising TimeoutError.
         status_secret: If the Space sets STATUS_READ_SECRET, pass it here
             (sent as X-Status-Secret on GET /status).
     Raises:
         TimeoutError: If max_wait is exceeded.
+        AggregatorMergeFailed: If the aggregator reports a merge error.
     """
     base = _normalize_aggregator_base_url(aggregator_url)
     url = f"{base}/status"
     while elapsed < max_wait:
         try:
             resp = requests.get(url, timeout=15, headers=headers)
+            _raise_for_aggregator_response(resp, what="GET /status")
             status = resp.json()
+            # Check for merge failure
+            merge_error = status.get("merge_error")
+            if merge_error:
+                raise AggregatorMergeFailed({"merge_result": merge_error})
             agg_round = status.get("current_round", 0)
             if agg_round > current_round:
                 print(f"[poll] Aggregator advanced to round {agg_round}")
                 return status
+            if status.get("merging"):
+                print(
+                    f"[poll] FedAvg merge in progress "
+                    f"({elapsed}/{max_wait}s elapsed)..."
+                )
+            else:
+                print(
+                    f"[poll] Waiting for round {current_round + 1} "
+                    f"({elapsed}/{max_wait}s elapsed)..."
+                )
         except requests.RequestException as e:
             print(f"[poll] Request error: {e}")
         time.sleep(poll_interval)
         elapsed += poll_interval
         timeout=timeout,
         headers=_status_headers(status_secret),
     )
+    _raise_for_aggregator_response(response, what="GET /status")
     return response.json()
         json={"secret_key": secret_key},
         timeout=timeout,
     )
+    _raise_for_aggregator_response(response, what="POST /reset")
     return response.json()
     """Liveness probe via GET /health; does not depend on training state."""
     url = f"{_normalize_aggregator_base_url(aggregator_url)}/health"
     response = requests.get(url, timeout=timeout)
+    _raise_for_aggregator_response(response, what="GET /health")
     return response.json()
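
The scheme handling added to `_normalize_aggregator_base_url` can be tried in isolation. The sketch below reproduces only the branch logic visible in this diff (missing scheme, `https:host`, `http:host`); the real function continues with `urlsplit` and the `OWNER/SPACE` rewrite, which are not shown here. `fix_scheme` is a hypothetical name used for illustration.

```python
def fix_scheme(url: str) -> str:
    """Ensure the URL starts with http:// or https://, repairing 'https:host' typos."""
    base = url.strip().rstrip("/")
    if not base:
        raise ValueError("aggregator_url is empty")
    low = base.lower()
    if low.startswith("https:") and not low.startswith("https://"):
        # Drop the 6-character 'https:' prefix and rebuild it with the slashes.
        base = "https://" + base[6:].lstrip("/")
    elif low.startswith("http:") and not low.startswith("http://"):
        base = "http://" + base[5:].lstrip("/")
    elif "://" not in base:
        # No scheme at all: default to HTTPS.
        base = "https://" + base.lstrip("/")
    return base


for raw in ("example.hf.space/", "https:example.hf.space", "http:127.0.0.1:7860"):
    print(raw, "->", fix_scheme(raw))
```

Checking the typo branches before the generic `"://"` test is what prevents `https:host` from turning into `https://https:host`.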

app.py CHANGED

@@ -15,6 +15,7 @@ import logging
 import math
 import os
 import json
 import time
 import datetime
@@ -84,8 +85,12 @@ state = {
     "last_update": None,
     "node_metrics": {},       # {node_id: {loss, step, timestamp, ...}}
     "activity_log": [],       # recent events for the activity feed
 }
 # Per-client timestamps (monotonic) for POST /submit rate limiting
 _submit_rate_buckets: dict[str, list[float]] = {}
@@ -212,6 +217,45 @@ def fedavg_merge() -> tuple[str, bool]:
     )
 # ---------------------------------------------------------------------------
 # FastAPI endpoints
 # ---------------------------------------------------------------------------
@@ -294,6 +338,8 @@ def get_status(request: Request):
             if n not in state["submitted_nodes"]
         ],
         "last_update": state["last_update"],
     }
@@ -326,106 +372,90 @@ def submit_node(req: SubmitRequest):
             ),
         )
-    if req.node_id in state["submitted_nodes"]:
-        return {
-            "status": "already_submitted",
-            "current_round": state["current_round"],
-            "submitted_nodes": state["submitted_nodes"],
-        }
-    state["submitted_nodes"].append(req.node_id)
-    state["last_update"] = _timestamp()
-    agg_log.info(
-        "submit accepted node_id=%s round=%s progress=%s/%s",
-        req.node_id,
-        state["current_round"],
-        len(state["submitted_nodes"]),
-        len(CONFIG["expected_nodes"]),
-    )
-    # Store node metrics
-    state["node_metrics"][req.node_id] = {
-        "avg_loss": req.avg_loss,
-        "steps_completed": req.steps_completed,
-        "round": state["current_round"],
-        "submitted_at": _timestamp(),
-    }
-    # Log activity
-    loss_suffix = (
-        f" (loss: {req.avg_loss:.4f})" if req.avg_loss is not None else ""
-    )
-    _log_activity(
-        f"{req.node_id} submitted round {state['current_round']}{loss_suffix}"
-    )
-    # Check if all nodes have submitted
-    if set(state["submitted_nodes"]) == set(CONFIG["expected_nodes"]):
         _log_activity(
-            f"All nodes submitted — starting FedAvg for round {state['current_round']}"
         )
-        merge_result, merge_ok = fedavg_merge()
-        # Capture per-node losses and steps in history
-        round_metrics = {
-            nid: state["node_metrics"].get(nid, {}).get("avg_loss")
-            for nid in CONFIG["expected_nodes"]
-        }
-        round_steps = {
-            nid: state["node_metrics"].get(nid, {}).get("steps_completed")
-            for nid in CONFIG["expected_nodes"]
-        }
-        if merge_ok:
-            state["history"].append({
-                "round": state["current_round"],
-                "completed_at": _timestamp(),
-                "merge_result": merge_result,
-                "node_losses": round_metrics,
-                "node_steps": round_steps,
-            })
-            _log_activity(f"Round {state['current_round']} FedAvg complete")
-            agg_log.info(
-                "fedavg_complete round=%s new_round=%s",
-                state["current_round"],
-                state["current_round"] + 1,
             )
-            state["current_round"] += 1
-            state["submitted_nodes"] = []
-            state["node_metrics"] = {}
             return {
-                "status": "round_complete",
-                "merge_result": merge_result,
-                "new_round": state["current_round"],
             }
-        tail = merge_result if len(merge_result) <= 400 else merge_result[:400] + "..."
-        agg_log.warning(
-            "fedavg_failed round=%s detail=%s",
-            state["current_round"],
-            tail,
-        )
-        _log_activity(f"FedAvg failed (round {state['current_round']}): {merge_result}")
-        state["submitted_nodes"] = []
         return {
-            "status": "merge_failed",
-            "merge_result": merge_result,
             "current_round": state["current_round"],
         }
-    return {
-        "status": "submitted",
-        "current_round": state["current_round"],
-        "submitted_nodes": state["submitted_nodes"],
-        "remaining": [
-            n for n in CONFIG["expected_nodes"]
-            if n not in state["submitted_nodes"]
-        ],
-    }
 @app.post("/reset")
 def reset_state(req: ResetRequest):
@@ -440,6 +470,8 @@ def reset_state(req: ResetRequest):
     state["last_update"] = _timestamp()
     state["node_metrics"] = {}
     state["activity_log"] = []
     _log_activity("State reset to round 1")

 import math
 import os
 import json
+import threading
 import time
 import datetime
     "last_update": None,
     "node_metrics": {},       # {node_id: {loss, step, timestamp, ...}}
     "activity_log": [],       # recent events for the activity feed
+    "merging": False,         # True while FedAvg is running in background
+    "merge_error": None,      # set if background merge failed
 }
+_state_lock = threading.Lock()
 # Per-client timestamps (monotonic) for POST /submit rate limiting
 _submit_rate_buckets: dict[str, list[float]] = {}
     )
+def _background_merge(
+    merge_round: int,
+    round_metrics: dict,
+    round_steps: dict,
+) -> None:
+    """Run FedAvg in a background thread, then advance the round or record failure."""
+    try:
+        merge_result, merge_ok = fedavg_merge()
+    except Exception as exc:
+        merge_result, merge_ok = f"FedAvg exception: {exc}", False
+    with _state_lock:
+        if merge_ok:
+            state["history"].append({
+                "round": merge_round,
+                "completed_at": _timestamp(),
+                "merge_result": merge_result,
+                "node_losses": round_metrics,
+                "node_steps": round_steps,
+            })
+            _log_activity(f"Round {merge_round} FedAvg complete")
+            agg_log.info(
+                "fedavg_complete round=%s new_round=%s",
+                merge_round,
+                merge_round + 1,
+            )
+            state["current_round"] = merge_round + 1
+            state["submitted_nodes"] = []
+            state["node_metrics"] = {}
+        else:
+            tail = merge_result if len(merge_result) <= 400 else merge_result[:400] + "..."
+            agg_log.warning("fedavg_failed round=%s detail=%s", merge_round, tail)
+            _log_activity(f"FedAvg failed (round {merge_round}): {merge_result}")
+            state["merge_error"] = merge_result
+            state["submitted_nodes"] = []
+        state["merging"] = False
 # ---------------------------------------------------------------------------
 # FastAPI endpoints
 # ---------------------------------------------------------------------------
             if n not in state["submitted_nodes"]
         ],
         "last_update": state["last_update"],
+        "merging": state["merging"],
+        "merge_error": state["merge_error"],
     }
             ),
         )
+    with _state_lock:
+        if req.node_id in state["submitted_nodes"]:
+            return {
+                "status": "already_submitted",
+                "current_round": state["current_round"],
+                "submitted_nodes": state["submitted_nodes"],
+            }
+        if state["merging"]:
+            return {
+                "status": "merging",
+                "current_round": state["current_round"],
+                "submitted_nodes": state["submitted_nodes"],
+            }
+        state["submitted_nodes"].append(req.node_id)
+        state["last_update"] = _timestamp()
+        agg_log.info(
+            "submit accepted node_id=%s round=%s progress=%s/%s",
+            req.node_id,
+            state["current_round"],
+            len(state["submitted_nodes"]),
+            len(CONFIG["expected_nodes"]),
+        )
+        # Store node metrics
+        state["node_metrics"][req.node_id] = {
+            "avg_loss": req.avg_loss,
+            "steps_completed": req.steps_completed,
+            "round": state["current_round"],
+            "submitted_at": _timestamp(),
+        }
+        # Log activity
+        loss_suffix = (
+            f" (loss: {req.avg_loss:.4f})" if req.avg_loss is not None else ""
+        )
         _log_activity(
+            f"{req.node_id} submitted round {state['current_round']}{loss_suffix}"
         )
+        # Check if all nodes have submitted
+        all_submitted = set(state["submitted_nodes"]) == set(CONFIG["expected_nodes"])
+        if all_submitted:
+            state["merging"] = True
+            state["merge_error"] = None
+            _log_activity(
+                f"All nodes submitted — starting FedAvg for round {state['current_round']}"
             )
+            # Capture metrics under lock before spawning thread
+            round_metrics = {
+                nid: state["node_metrics"].get(nid, {}).get("avg_loss")
+                for nid in CONFIG["expected_nodes"]
+            }
+            round_steps = {
+                nid: state["node_metrics"].get(nid, {}).get("steps_completed")
+                for nid in CONFIG["expected_nodes"]
+            }
+            merge_round = state["current_round"]
+            thread = threading.Thread(
+                target=_background_merge,
+                args=(merge_round, round_metrics, round_steps),
+                daemon=True,
+            )
+            thread.start()
             return {
+                "status": "merging",
+                "current_round": state["current_round"],
+                "submitted_nodes": state["submitted_nodes"],
             }
         return {
+            "status": "submitted",
             "current_round": state["current_round"],
+            "submitted_nodes": state["submitted_nodes"],
+            "remaining": [
+                n for n in CONFIG["expected_nodes"]
+                if n not in state["submitted_nodes"]
+            ],
         }
 @app.post("/reset")
 def reset_state(req: ResetRequest):
     state["last_update"] = _timestamp()
     state["node_metrics"] = {}
     state["activity_log"] = []
+    state["merging"] = False
+    state["merge_error"] = None
     _log_activity("State reset to round 1")
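
The pattern introduced above (flip a `merging` flag under a lock, run the slow merge in a daemon thread, then record success or failure under the same lock) can be sketched without FastAPI. This is a simplified stand-in, not the Space's code: `start_merge`, `run_merge`, and the trimmed-down `state` dict are illustrative names, and `merge_fn` plays the role of `fedavg_merge` by returning a `(message, ok)` tuple.

```python
import threading

state = {"current_round": 1, "merging": False, "merge_error": None}
lock = threading.Lock()


def run_merge(merge_fn, merge_round: int) -> None:
    """Run the merge outside the lock, then publish the result under it."""
    try:
        result, ok = merge_fn()
    except Exception as exc:  # keep the worker from dying silently
        result, ok = f"merge exception: {exc}", False
    with lock:
        if ok:
            state["current_round"] = merge_round + 1
        else:
            state["merge_error"] = result
        state["merging"] = False


def start_merge(merge_fn):
    """Start a background merge unless one is already running."""
    with lock:
        if state["merging"]:
            return None
        state["merging"] = True
        state["merge_error"] = None
        merge_round = state["current_round"]
    worker = threading.Thread(target=run_merge, args=(merge_fn, merge_round), daemon=True)
    worker.start()
    return worker


worker = start_merge(lambda: ("merged", True))
worker.join()
print(state)
```

Because the request handler returns as soon as the thread starts, clients learn the outcome by polling rather than from the submit response.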

notebook/distributed_finetuning_peft_async.ipynb CHANGED

@@ -102,7 +102,8 @@
     "    # Direct Space app host: https://YOURNAME-YOURSPACENAME.hf.space (from Space \"App\" URL).\n",
     "    # Not huggingface.co/spaces/... — use one hostname with hyphens, e.g.\n",
     "    # https://OWNER-SPACENAME.hf.space NOT https://OWNER/SPACENAME.hf.space (slash breaks DNS).\n",
-    "    # You may omit https://; aggregator_client fixes missing scheme and the OWNER/SPACE mistake.\n",
     "    \"aggregator_url\": \"https://your-username-peft-aggregator.hf.space\",\n",
     "    \"node_secret\": \"choose_any_shared_secret_string\",\n",
     "    # Optional: if the Space sets STATUS_READ_SECRET, put the same value here (for polling /status).\n",

     "    # Direct Space app host: https://YOURNAME-YOURSPACENAME.hf.space (from Space \"App\" URL).\n",
     "    # Not huggingface.co/spaces/... — use one hostname with hyphens, e.g.\n",
     "    # https://OWNER-SPACENAME.hf.space NOT https://OWNER/SPACENAME.hf.space (slash breaks DNS).\n",
+    "    # Use https:// (two slashes), not https: (typo). You may omit the scheme entirely;\n",
+    "    # aggregator_client fixes missing scheme, https:host, and OWNER/SPACE slash mistakes.\n",
     "    \"aggregator_url\": \"https://your-username-peft-aggregator.hf.space\",\n",
     "    \"node_secret\": \"choose_any_shared_secret_string\",\n",
     "    # Optional: if the Space sets STATUS_READ_SECRET, put the same value here (for polling /status).\n",

tests/test_aggregator_client.py CHANGED

@@ -12,6 +12,20 @@ def test_normalize_aggregator_base_url_adds_https():
     assert ac._normalize_aggregator_base_url("  example.hf.space/  ") == "https://example.hf.space"
+def test_normalize_aggregator_base_url_fixes_https_missing_slashes():
+    """https:host (typo) must not become https://https:host."""
+    assert ac._normalize_aggregator_base_url(
+        "https:instruction-fine-tuning-budget.hf.space"
+    ) == ("https://instruction-fine-tuning-budget.hf.space")
+    assert ac._normalize_aggregator_base_url(
+        "HTTPS:instruction-fine-tuning-budget.hf.space/"
+    ) == ("https://instruction-fine-tuning-budget.hf.space")
+def test_normalize_aggregator_base_url_fixes_http_missing_slashes():
+    assert ac._normalize_aggregator_base_url("http:127.0.0.1:7860") == "http://127.0.0.1:7860"
 def test_normalize_aggregator_base_url_preserves_explicit_scheme():
     assert ac._normalize_aggregator_base_url("http://127.0.0.1:7860") == "http://127.0.0.1:7860"
     assert ac._normalize_aggregator_base_url("https://x.hf.space/") == "https://x.hf.space"
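
The cases above pin down the behaviour expected from `_normalize_aggregator_base_url`, but this diff does not show its body. As a rough illustration only, here is a minimal sketch of a normalizer that satisfies the same cases. The name `normalize_base_url_sketch` and the regex approach are assumptions made for this example, not the code in `aggregator_client.py`; the `owner/space.hf.space` rewrite follows the README description.

```python
import re

# Hypothetical helper, not the project's implementation.
# Matches "http:" or "https:" followed by zero or more slashes, any letter case.
_SCHEME_RE = re.compile(r"^(https?):/*", re.IGNORECASE)

# Matches the "owner/space.hf.space" typo (a slash where a hyphen belongs).
_OWNER_SLASH_RE = re.compile(r"^([^/.:]+)/([^/]+\.hf\.space)$")


def normalize_base_url_sketch(raw: str) -> str:
    """Return a base URL with an explicit scheme and no trailing slash."""
    url = raw.strip().rstrip("/")
    match = _SCHEME_RE.match(url)
    if match:
        # Keep the scheme the caller gave, repairing "https:host" typos.
        scheme = match.group(1).lower()
        rest = url[match.end():]
    else:
        # A missing scheme defaults to https.
        scheme = "https"
        rest = url
    # Rewrite "owner/space.hf.space" into "owner-space.hf.space".
    rest = _OWNER_SLASH_RE.sub(r"\1-\2", rest)
    return f"{scheme}://{rest}"
```

Stripping the scheme first and re-attaching it at the end is what keeps `https:host` from turning into `https://https:host`, the failure mode named in the test docstring.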

tests/test_app.py CHANGED Viewed

@@ -6,6 +6,8 @@ Uses empty HF credentials so FedAvg skips Hub I/O (no network required).
 from __future__ import annotations
+import time
 import pytest
 from fastapi.testclient import TestClient
@@ -35,6 +37,8 @@ def client(monkeypatch):
     app_module.state["last_update"] = None
     app_module.state["node_metrics"] = {}
     app_module.state["activity_log"] = []
+    app_module.state["merging"] = False
+    app_module.state["merge_error"] = None
     with TestClient(app_module.app) as c:
         yield c, app_module
@@ -148,6 +152,14 @@ def test_submit_unknown_node(client):
     assert r.status_code == 400
+def _wait_for_merge(app_module, timeout=5):
+    """Wait until the background merge thread completes."""
+    deadline = time.monotonic() + timeout
+    while app_module.state["merging"] and time.monotonic() < deadline:
+        time.sleep(0.05)
+    assert not app_module.state["merging"], "merge did not complete in time"
 def test_full_round_merge_skipped_no_hf(client):
     tc, m = client
     secret = m.CONFIG["node_secret"]
@@ -168,9 +180,11 @@ def test_full_round_merge_skipped_no_hf(client):
         if nid != "node_c":
             assert body["status"] == "submitted"
         else:
-            assert body["status"] == "round_complete"
-            assert "Skipped merge" in body["merge_result"]
-            assert body["new_round"] == 2
+            # Last node triggers async merge — returns "merging"
+            assert body["status"] == "merging"
+    # Wait for the background merge thread to finish
+    _wait_for_merge(m)
     st = tc.get("/status").json()
     assert st["current_round"] == 2
@@ -269,6 +283,8 @@ def test_gradio_markdown_helpers_no_crash(client):
     for nid in ("node_a", "node_b", "node_c"):
         tc.post("/submit", json={"node_id": nid, "secret_key": secret})
+    _wait_for_merge(m)
     m._loss_history_md()
     m._merged_adapters_md()
     m._activity_log_md()
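
The updated tests expect the last `/submit` to return `"merging"` right away, with a background thread finishing the merge and clearing `state["merging"]`. The server-side change lives in `app.py` and is not visible in this part of the diff, so the following is only a sketch of that general pattern under stated assumptions. `start_merge`, `_merge_worker`, the lock, and the module-level `state` dict here are hypothetical names for illustration; only the `merging`, `merge_error`, and `current_round` keys come from the tests.

```python
import threading
import time

# Hypothetical stand-in for the app's shared state; key names mirror the tests.
state = {"merging": False, "merge_error": None, "current_round": 1}
_lock = threading.Lock()


def _merge_worker(merge_fn):
    """Run the merge off the request thread and record the outcome."""
    try:
        merge_fn()
        with _lock:
            state["current_round"] += 1
    except Exception as exc:  # surface the failure through state for pollers
        with _lock:
            state["merge_error"] = str(exc)
    finally:
        # Clear the flag last, so a waiter that sees merging == False
        # also sees the updated round or the recorded error.
        with _lock:
            state["merging"] = False


def start_merge(merge_fn):
    """Mark the merge as running, start it in the background, return at once."""
    with _lock:
        if not state["merging"]:
            state["merging"] = True
            state["merge_error"] = None
            threading.Thread(
                target=_merge_worker, args=(merge_fn,), daemon=True
            ).start()
    return "merging"


if __name__ == "__main__":
    print(start_merge(lambda: time.sleep(0.1)))  # returns before the merge ends
    while state["merging"]:
        time.sleep(0.05)
    print(state["current_round"])
```

Clearing the flag in `finally` is what lets a polling helper like `_wait_for_merge` stop on either success or failure; the caller then checks `merge_error` to tell the two apart.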