# HF Space Build Error Log — rubentuesday/vocal-mirror

This file is committed alongside every fix so the repo retains full context of what broke and why.

---

## Iteration 1 — 2026-04-11
**Stage:** CONFIG_ERROR  
**Error:** `No candidate PyTorch version found for ZeroGPU`  
**Root cause:** `requirements.txt` pinned `torch==2.5.1+cu121` and `torchaudio==2.5.1+cu121` with `--extra-index-url https://download.pytorch.org/whl/cu121`. ZeroGPU manages its own CUDA PyTorch installation and rejects spaces that pin a `+cu121`-suffixed variant — it fails at config parse time before any package install.  
**Fix applied:**  
- Removed `--extra-index-url https://download.pytorch.org/whl/cu121` from `requirements.txt`  
- Removed `torch==2.5.1+cu121` and `torchaudio==2.5.1+cu121` from `requirements.txt` (ZeroGPU provides these)  
- Changed `gradio>=5.0.0,<6.0` → `gradio==4.44.1` in `requirements.txt` (project rule: pin to 4.44.1)  
- Changed `sdk_version: 5.0.0` → `sdk_version: 4.44.1` in `README.md` YAML frontmatter  
**Result:** FAIL. The CONFIG_ERROR was resolved, but the Gradio downgrade caused a new RUNTIME_ERROR (see Iteration 2). Pinning Gradio 4.44.1 was the wrong choice and was reverted.
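
The two patterns this iteration removed can be checked for before pushing. The following is a minimal sketch, not part of the repo: the helper name `find_zerogpu_conflicts` and both regular expressions are assumptions made for illustration, and they cover only the cases this log describes (CUDA-suffixed torch pins and the PyTorch extra index).

```python
import re

# Hypothetical pre-push check: flag requirements.txt lines that this log
# identified as ZeroGPU-incompatible.
CUDA_PIN = re.compile(
    r"^(torch|torchaudio|torchvision)\s*==\s*[\w.]+\+cu\d+", re.IGNORECASE
)
PYTORCH_INDEX = re.compile(
    r"^--extra-index-url\s+https://download\.pytorch\.org/", re.IGNORECASE
)


def find_zerogpu_conflicts(requirements_text: str) -> list[str]:
    """Return the offending lines, in file order."""
    conflicts = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if CUDA_PIN.match(line) or PYTORCH_INDEX.match(line):
            conflicts.append(line)
    return conflicts


# The original requirements.txt content, as described in this iteration.
sample = """\
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
gradio>=5.0.0,<6.0
"""
conflicts = find_zerogpu_conflicts(sample)
print(conflicts)
```

An unsuffixed pin such as `torchaudio==2.5.1` is deliberately not flagged by this sketch.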

---

## Iteration 2 — 2026-04-11
**Stage:** RUNTIME_ERROR  
**Error 1 (first):** `TypeError: unhashable type: 'dict'` in `jinja2/utils.py` — Gradio 4.x Jinja template cache bug  
**Error 2:** `ValueError: When localhost is not accessible, a shareable link must be created. Please set share=True` — Gradio 4.x requires share=True on remote hosts  
**Root cause:** Downgrading to `gradio==4.44.1` reintroduced two known Gradio 4.x bugs. The commit history already shows `c0a2ea8` explicitly upgraded to 5.x to fix the Jinja crash. Both errors are 4.x-only issues fixed in 5.x. The "pin to 4.44.1" instruction in the task brief was outdated.  
**Fix applied:**  
- Reverted `requirements.txt`: `gradio==4.44.1` → `gradio>=5.0.0,<6.0`  
- Reverted `README.md`: `sdk_version: 4.44.1` → `sdk_version: 5.0.0`  
**Result:** PASS — Space reached RUNNING stage. `/health_hf` returns 308 (route missing). Fixed in Iteration 3.

---

## Iteration 3 — 2026-04-11
**Stage:** RUNNING but `/health_hf` returns 308 Permanent Redirect (no such route)  
**Root cause:** `app.py` only has `demo.launch()` with no custom routes. Gradio 5.x redirects unknown paths to `/`.  
**Fix applied:** Switched from `demo.launch()` to `gr.mount_gradio_app()` pattern:
- Added `FastAPI` app with `@app.get("/health_hf")` returning `{"status": "ok"}`
- Replaced `demo.launch()` with `app = gr.mount_gradio_app(app, demo, path="/")`
- `@spaces.GPU` decorator still handles ZeroGPU GPU allocation independently  
**Result:** FAIL. RUNTIME_ERROR with exit code 0: `gr.mount_gradio_app()` returns immediately and nothing blocks the process. Fixed in Iteration 4.

---

## Iteration 4 — 2026-04-11
**Stage:** RUNTIME_ERROR: `Exit code: 0. Reason: ` (clean exit, process didn't stay alive)  
**Root cause:** `gr.mount_gradio_app()` returns an ASGI app object but doesn't start a server. Without `demo.launch()` blocking, `app.py` runs to completion and exits.  
**Fix applied:** Added `uvicorn.run(app, host="0.0.0.0", port=7860)` after the mount call to start the ASGI server and block the process.  
**Result:** FAIL. RUNTIME_ERROR: "No @spaces.GPU function detected during startup". `uvicorn.run()` bypasses the `spaces.zero.gradio` launch wrapper that scans for GPU functions, and ZeroGPU requires `demo.launch()`. Fixed in Iteration 5.

---

## Iteration 5 — 2026-04-11
**Stage:** RUNTIME_ERROR: `No @spaces.GPU function detected during startup`  
**Root cause:** `gr.mount_gradio_app()` + `uvicorn.run()` bypasses the `spaces.zero.gradio` interceptor of `demo.launch()`. ZeroGPU scans for `@spaces.GPU`-decorated functions inside that interceptor; because the interceptor is never called, the GPU functions are never registered.  
**Fix applied:** Reverted to bare `demo.launch()`. Added `/health_hf` by monkey-patching `gradio.routes.App.create_app` to inject the route into the Gradio FastAPI app at creation time, before ZeroGPU starts the server.  
**Result:** FAIL — "Application unable to start for an unknown reason". The `create_app.__func__` access likely failed (AttributeError or TypeError) in Gradio 5.x, crashing startup silently. Fixed in Iteration 6.

---

## Iteration 6 — 2026-04-11
**Stage:** RUNTIME_ERROR — "Application unable to start for an unknown reason"  
**Root cause:** Monkey-patching `gradio.routes.App.create_app.__func__` crashed at import/startup time in Gradio 5.x. The `__func__` access pattern assumes `create_app` is a classmethod — if the signature or descriptor changed in 5.x, this raises AttributeError and kills the process before any server starts.  
**Fix applied:** Replaced monkey-patch with a daemon thread that polls `demo.server` (set by Gradio after `demo.launch()` initializes the server) and injects `/health_hf` once available. `demo.launch()` stays bare — ZeroGPU detection works normally. Thread is a no-op if injection fails.  
**Result:** FAIL — Space is RUNNING but `/health_hf` still returns 308. `demo.server` is never set in the polling thread's context (ZeroGPU runs the real server in a GPU worker, not the same process). Fixed in Iteration 7.

---

## Iteration 7 — 2026-04-11
**Stage:** RUNNING but `/health_hf` still returns 308  
**Root cause:** In ZeroGPU, the actual Gradio server runs in a separate GPU worker process. `demo.server` is never set in the main process, so the daemon thread's poll always fails and the route is never injected.  
**Fix applied:** Use `demo.launch(prevent_thread_lock=True)` — the spaces interceptor still detects `@spaces.GPU` functions, then starts the server in a background thread in the same process and returns. After `launch()` returns, `demo.server.app` is accessible and we add `/health_hf`. Main thread blocked via `threading.Event().wait()` (avoids relying on `demo.block_thread()` existing in Gradio 5.x).  
**Result:** FAIL — `AttributeError: 'Server' object has no attribute 'app'`. Gradio 5.x's `Server` wraps uvicorn — the FastAPI app lives at `server.config.app`, not `server.app`. Fixed in Iteration 8.

---

## Iteration 8 — 2026-04-11
**Stage:** RUNTIME_ERROR — `AttributeError: 'Server' object has no attribute 'app'`  
**Root cause:** `demo.server` is a Gradio `Server` (wrapping uvicorn). In uvicorn, the ASGI app is stored in `server.config.app` (the `Config` object passed at construction), not directly on `server.app`.  
**Fix applied:** Changed `demo.server.app.get(...)` → `demo.server.config.app.get(...)`.  
**Result:** FAIL — Space RUNNING but `/health_hf` still 308. `demo.server.config.app.get()` adds route AFTER Gradio's catch-all `/{path_name:path}` is already registered. FastAPI matches routes in insertion order — catch-all added first wins. Fixed in Iteration 9.
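
The ordering problem can be reproduced without Gradio. The sketch below is a toy first-match-wins router, not Starlette's actual implementation; `ToyRouter` and the handler names are made up for illustration, and the pattern `/.*` only stands in for Gradio's `/{path_name:path}` catch-all.

```python
import re


class ToyRouter:
    """Toy router that returns the first registered route matching the path."""

    def __init__(self):
        # (compiled pattern, handler name) pairs, kept in registration order.
        self.routes = []

    def add(self, pattern: str, name: str) -> None:
        self.routes.append((re.compile(pattern), name))

    def match(self, path: str) -> str:
        for pattern, name in self.routes:
            if pattern.fullmatch(path):
                return name
        return "404"


# Catch-all registered first, specific route appended later (the failing case).
router = ToyRouter()
router.add(r"/.*", "gradio_catch_all")
router.add(r"/health_hf", "health")
late = router.match("/health_hf")

# Same routes with the specific one registered first.
reordered = ToyRouter()
reordered.add(r"/health_hf", "health")
reordered.add(r"/.*", "gradio_catch_all")
early = reordered.match("/health_hf")

print(late, early)
```

In the first router the catch-all wins even though a more specific route exists, which matches the 308 seen here.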

---

## Iteration 9 — 2026-04-11
**Stage:** RUNNING but `/health_hf` returns 308  
**Root cause:** Adding `@app.get("/health_hf")` after `create_app` appends the route AFTER Gradio's catch-all `/{path_name:path}`. FastAPI/Starlette matches routes in registration order — the catch-all was registered first and intercepts everything, including `/health_hf`.  
**Fix applied:** Use Starlette middleware (`BaseHTTPMiddleware`) patched into Gradio's `create_app`. Middleware runs BEFORE any route matching, so `/health_hf` is intercepted before the catch-all. Reverted to bare `demo.launch()` (ZeroGPU works). Entire patch wrapped in `try/except` so failures are silent and don't prevent startup.  
**Result:** PASS ✓ — Space RUNNING, `GET /health_hf` → `{"status":"ok"}` HTTP 200. All done after 9 iterations.
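
Why middleware sidesteps route order can be shown with the ASGI protocol alone. This is a minimal sketch under assumptions: it uses a hand-written pure ASGI middleware class rather than the `BaseHTTPMiddleware` subclass used in `app.py`, so that it runs without Starlette installed, and `catch_all_app` is a stand-in for Gradio's app that redirects everything to `/`.

```python
import asyncio
import json


class HealthMiddleware:
    """Pure ASGI middleware: answers GET /health_hf itself, forwards the rest."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        is_health = (
            scope["type"] == "http"
            and scope["method"] == "GET"
            and scope["path"] == "/health_hf"
        )
        if is_health:
            body = json.dumps({"status": "ok"}).encode()
            await send({
                "type": "http.response.start",
                "status": 200,
                "headers": [(b"content-type", b"application/json")],
            })
            await send({"type": "http.response.body", "body": body})
            return  # the wrapped app (and its routes) never sees this request
        await self.app(scope, receive, send)


async def catch_all_app(scope, receive, send):
    """Stand-in for the wrapped app: redirects every path to '/'."""
    await send({"type": "http.response.start", "status": 308,
                "headers": [(b"location", b"/")]})
    await send({"type": "http.response.body", "body": b""})


async def call(app, path):
    """Drive an ASGI app with one fake GET request; return (status, body)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent[0]["status"], sent[1]["body"]


wrapped = HealthMiddleware(catch_all_app)
health = asyncio.run(call(wrapped, "/health_hf"))
other = asyncio.run(call(wrapped, "/anything"))
print(health, other)
```

The middleware wraps the whole app, so it sees each request before any route table is consulted; registration order inside the wrapped app no longer matters for `/health_hf`.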

---

## Iteration 10 — 2026-04-12
**Stage:** RUNNING but "Run Benchmark" throws OSError  
**Error:** `OSError: Could not load this library: /usr/local/lib/python3.10/site-packages/torchaudio/lib/_torchaudio.abi3.so`  
**Root cause (via runtime logs):** `qwen-tts` depends on `torchaudio`. `pip install qwen-tts` upgraded `torchaudio` to the latest PyPI release which was compiled against CUDA 13 (`libcudart.so.13`). ZeroGPU A10G runs CUDA 12, so `libcudart.so.13` is not present. Full import chain: `from qwen_tts import Qwen3TTSModel` → `speech_vq.py` → `import torchaudio.compliance.kaldi` → `torchaudio/__init__.py` → `torchaudio._extension` → `torch.ops.load_library("_torchaudio.abi3.so")` → `OSError: libcudart.so.13`.  
**Fix applied:** Pinned `torchaudio==2.5.1` in `requirements.txt` BEFORE the `qwen-tts` line. torchaudio 2.5.1 (Nov 2024) was compiled against CUDA 12, and the pin prevents pip from upgrading to a CUDA 13 build. `kaldi.fbank()` (the only torchaudio function qwen-tts calls from this path) is a CPU-only DSP operation, so no GPU is needed.  
**Result:** PASS ✓ — Space RUNNING with new SHA 990b408, `/health_hf` → 200. Benchmark fix deployed.
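
A quick way to confirm which version pip actually resolved is to read the installed package metadata at startup. The sketch below is an assumption about how one might do that, not code from `app.py`; `installed_version` and `pin_is_respected` are hypothetical helper names, and only the standard library's `importlib.metadata` is used.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_is_respected(package: str, expected: str) -> bool:
    """True only when the package is installed at exactly the expected version."""
    return installed_version(package) == expected


# Prints whatever is installed in the current environment, or None.
print("torchaudio:", installed_version("torchaudio"))
print("pin respected:", pin_is_respected("torchaudio", "2.5.1"))
```

Logging this once at startup makes a silent upgrade by a transitive dependency visible in the runtime logs before the first import fails.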

---

## Iteration 11 — 2026-04-12
**Stage:** RUNNING (benchmark redesign, not a build error)  
**Change:** Replaced static `np.zeros` reference + arbitrary test text with a live microphone enrollment simulation. New UI: user records one of the 3 frontend enrollment phrases via Gradio `Audio` input, benchmark clones their voice and synthesizes an AI response ("Great job! Now let's keep the conversation going. How was your day?"), returns RTF result + playable audio output. Mirrors the actual frontend UX: enroll → clone → hear AI response.  
**Files changed:** `app.py` only.  
**Result:** FAIL — space RUNNING, `/health_hf` 200, but Gradio API returns 500 Internal Server Error. UI loads but "Start →" button fails. See Iteration 12.

---

## Iteration 12 — 2026-04-13
**Stage:** RUNNING but Gradio API `/gradio_api/info` returns 500 Internal Server Error  
**Error:** `File "/usr/local/lib/python3.10/site-packages/gradio_client/utils.py", line 967, in _json_schema_to_python_type` — crash during API schema generation  
**Root cause (initial diagnosis, via runtime logs):** Gradio generates a JSON schema for all function signatures when serving `/gradio_api/info`. The `gpu_chat_turn` function had type hints `ref: np.ndarray, history: list, turn_count: int, l1: str, l2: str`. The working theory was that `_json_schema_to_python_type` in `gradio_client/utils.py` cannot convert `numpy.ndarray` into a JSON schema and crashes in the list comprehension at lines 967-968 while building property descriptions. The crash propagates through Starlette's middleware stack, resulting in a 500 on every request (including the frontend's queue/event polling calls).  
**Fix applied:**  
- Removed all type hints from `gpu_enroll_and_greet` and `gpu_chat_turn` signatures, on the assumption that Gradio's schema generator only inspects annotated parameters  
- Changed `gpu_enroll_and_greet` to return `ref.tolist()` (plain Python list) instead of `np.ndarray` — keeps State JSON-serializable  
- Changed `gpu_chat_turn` to accept `ref_list` (plain list) and convert to `np.ndarray` internally via `np.array(ref_list, dtype=np.float32)` before passing to `synthesize()`  
- No changes to callbacks — `on_enroll` stores whatever the function returns; `on_send` passes it through unchanged  
**Files changed:** `app.py` only.  
**Result:** FAIL. The same crash persists: removing the `np.ndarray` type hints did not resolve it. The root cause was the dict-valued `gr.State` itself, not the function signature. See Iteration 13.

---

## Iteration 13 — 2026-04-13
**Stage:** RUNNING but `/gradio_api/info` still returns 500  
**Error:** `TypeError: argument of type 'bool' is not iterable` at `gradio_client/utils.py:882 → get_type → if "const" in schema`  
**Root cause:** Removing the `np.ndarray` type hints in Iteration 12 did not fix the crash. The actual source is `gr.State({"l1": "en", "l2": "es", "ref": None, "history": [], "turn_count": 0})`. When Gradio generates the API schema for this State, it calls `_json_schema_to_python_type` on the dict schema. The dict's JSON Schema representation has `additionalProperties: True`; JSON Schema allows a boolean there, and it arrives as a Python `bool`. The schema generator then evaluates `if "const" in schema` with `schema` set to that bool, which raises `TypeError: argument of type 'bool' is not iterable`. This happens in `gradio_client/utils.py` at line 882 regardless of function type hints; it is triggered by the State value itself.  
**Fix applied:** Replaced single `gr.State(dict)` with **5 flat, primitive `gr.State` objects**:
- `state_l1 = gr.State("en")` — string, safe
- `state_l2 = gr.State("es")` — string, safe  
- `state_ref = gr.State([])` — empty list (no numpy), safe
- `state_history = gr.State([])` — list of dicts (plain JSON), safe
- `state_turn_count = gr.State(0)` — int, safe

All callbacks updated to accept/return these flat states. `ref_list` (a Python list) is passed as `state_ref` and converted to `np.ndarray` inside `gpu_chat_turn` only. Full `app.py` rewrite.  
**Files changed:** `app.py` only.  
**Result:** PASS ✓ — Space RUNNING, Gradio UI fully functional (language select → enrollment → chat → wall at turn 7), `/health_hf` → 200. See Iterations 14-16 (2026-04-13, session 2) for the subsequent full-backend migration.
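
The failing check can be reproduced in isolation. The functions below are simplified stand-ins written for this log, not `gradio_client`'s real code: `get_type_unguarded` mimics only the `"const" in schema` membership test, and `get_type_guarded` shows the kind of boolean guard that would avoid the crash.

```python
def get_type_unguarded(schema):
    """Mimic the failing check: a membership test on whatever was passed in."""
    if "const" in schema:  # raises TypeError when schema is True or False
        return "const"
    return schema.get("type", "any")


def get_type_guarded(schema):
    """Same check, but a boolean schema is treated as 'any value allowed'."""
    if isinstance(schema, bool):
        return "any"
    return get_type_unguarded(schema)


# Shape of the schema described above for a dict-valued State.
state_schema = {"type": "object", "additionalProperties": True}

try:
    get_type_unguarded(state_schema["additionalProperties"])
    crashed = False
except TypeError:
    crashed = True

print(crashed)
print(get_type_guarded(True), get_type_guarded({"type": "string"}))
```

Since the library code could not be patched from inside the Space, avoiding dict-valued State (as done in this iteration) removes the boolean sub-schema altogether.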

---

## Iteration 14 — 2026-04-13 (session 2)
**Stage:** Full backend migration attempt — `gr.mount_gradio_app()` approach  
**Goal:** Serve FastAPI REST API (all `/session/*` endpoints) alongside Gradio UI so the Vercel React frontend can talk directly to the HF Space instead of Railway.  
**Approach:** Replaced `demo.launch()` with `app = gr.mount_gradio_app(api, demo, path="/ui")` where `api` is a standalone `FastAPI()` instance with all endpoints defined as routes.  
**Error:** `RUNTIME_ERROR` — Space exits with code 0 (clean exit).  
**Root cause:** HF Spaces with `sdk: gradio` require `demo.launch()` to start and block the server. `gr.mount_gradio_app()` returns an ASGI app object but does not start a server — same as Iteration 4 (the process runs to completion and exits immediately).  
**Fix applied:** See Iteration 15.

---

## Iteration 15 — 2026-04-13 (session 2)
**Stage:** RUNTIME_ERROR — Space exits code 0 after `gr.mount_gradio_app()`  
**Approach:** Switched to `include_router()` pattern: patched `gradio.routes.App.create_app` to call `gapp.include_router(_vmr)` (adding all API routes to Gradio's internal FastAPI app), then ended with `demo.launch()` to keep the process alive.  
**Error:** `GET /health` → `HTTP 308 Permanent Redirect` (location: `/`). All API routes return 308.  
**Root cause:** Gradio 5.x registers a catch-all SPA route `/{path_name:path}` during `create_app`. FastAPI matches routes in insertion order — the catch-all is registered first (inside Gradio's own `create_app` logic), so any routes added afterward via `include_router()` are never matched. Every unknown path gets 308-redirected to `/` before our routes are evaluated.  
**Key lesson:** `include_router()` appends routes AFTER the catch-all — they will never be reached in Gradio 5.x.  
**Fix applied:** See Iteration 16.

---

## Iteration 16 — 2026-04-13 (session 2)
**Stage:** RUNNING but all API routes return 308 via `include_router()`  
**Root cause:** Same Gradio 5.x SPA catch-all issue as Iteration 9 but for custom routes instead of `/health_hf`. `include_router()` is append-only and cannot insert before the catch-all.  
**Fix applied:** Implemented all REST API endpoints as a single `BaseHTTPMiddleware` subclass (`_VocalMirrorAPI`) with regex-based path dispatch. Middleware runs BEFORE any route matching (same pattern that fixed `/health_hf` in Iteration 9). `demo.launch()` stays bare. Session state lives in an in-memory dict, audio files are written to `/tmp/`, enrollment runs in a background thread, and `gpu_tts()` is called from the async context via the event loop's `run_in_executor()`.  
**Result:** PASS ✓  
- `GET /health` → `{"status":"ok"}` HTTP 200 ✓  
- `GET /vm-config` → `{"wall_turn_count":7}` ✓  (named `/vm-config` to avoid shadowing Gradio's own `/config`)  
- `POST /session/start` → returns session_id + word_list ✓  
- `GET /session/{id}/wall_status` → `{"show_wall":false,"turn_count":0}` ✓  
- Gradio product UI at `/` still fully functional ✓  
Space is RUNNING on zero-a10g with SHA ef1ba6a.
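
The regex dispatch and the executor hand-off used in this iteration can be sketched independently of Starlette. Everything below is illustrative rather than the contents of `_VocalMirrorAPI`: the `SESSIONS` store, the `/speak/...` route, `slow_tts` (a placeholder for a blocking call such as `gpu_tts()`), and the rule that the wall shows once `turn_count` reaches the configured limit are all assumptions.

```python
import asyncio
import re
import time

WALL_TURN_COUNT = 7  # value taken from the /vm-config response above
SESSIONS = {"abc123": {"turn_count": 0}}  # hypothetical in-memory session store


def slow_tts(text):
    """Placeholder for a blocking synthesis call; sleeps instead of synthesizing."""
    time.sleep(0.01)
    return f"audio-for:{text}"


async def health(match):
    return 200, {"status": "ok"}


async def vm_config(match):
    return 200, {"wall_turn_count": WALL_TURN_COUNT}


async def wall_status(match):
    session = SESSIONS.get(match.group("sid"))
    if session is None:
        return 404, {"error": "unknown session"}
    turns = session["turn_count"]
    return 200, {"show_wall": turns >= WALL_TURN_COUNT, "turn_count": turns}


async def speak(match):
    # Run the blocking function in the default thread pool so the event loop
    # stays free to serve other requests.
    loop = asyncio.get_running_loop()
    audio = await loop.run_in_executor(None, slow_tts, match.group("text"))
    return 200, {"audio": audio}


# (method, compiled path pattern, handler), checked in order.
ROUTES = [
    ("GET", re.compile(r"/health"), health),
    ("GET", re.compile(r"/vm-config"), vm_config),
    ("GET", re.compile(r"/session/(?P<sid>[^/]+)/wall_status"), wall_status),
    ("GET", re.compile(r"/speak/(?P<text>[^/]+)"), speak),  # hypothetical route
]


async def dispatch(method, path):
    """Return (status, payload) for a handled path, or None to fall through."""
    for route_method, pattern, handler in ROUTES:
        match = pattern.fullmatch(path)
        if route_method == method and match:
            return await handler(match)
    return None  # not an API path: the wrapped app would handle it


print(asyncio.run(dispatch("GET", "/session/abc123/wall_status")))
print(asyncio.run(dispatch("GET", "/speak/hello")))
print(asyncio.run(dispatch("GET", "/")))
```

Returning `None` for unmatched paths is what lets the middleware pass everything else through to the Gradio UI at `/`.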