# HF Space Build Error Log: rubentuesday/vocal-mirror

This file is committed alongside every fix so the repo retains full context of what broke and why.

---

## Iteration 1: 2026-04-11

**Stage:** CONFIG_ERROR

**Error:** `No candidate PyTorch version found for ZeroGPU`

**Root cause:** `requirements.txt` pinned `torch==2.5.1+cu121` and `torchaudio==2.5.1+cu121` with `--extra-index-url https://download.pytorch.org/whl/cu121`. ZeroGPU manages its own CUDA PyTorch installation and rejects Spaces that pin a `+cu121`-suffixed variant. It fails at config parse time, before any package is installed.

**Fix applied:**
- Removed `--extra-index-url https://download.pytorch.org/whl/cu121` from `requirements.txt`
- Removed `torch==2.5.1+cu121` and `torchaudio==2.5.1+cu121` from `requirements.txt` (ZeroGPU provides these)
- Changed `gradio>=5.0.0,<6.0` → `gradio==4.44.1` in `requirements.txt` (project rule: pin to 4.44.1)
- Changed `sdk_version: 5.0.0` → `sdk_version: 4.44.1` in `README.md` YAML frontmatter

**Result:** FAIL. CONFIG_ERROR resolved, but the change caused a new RUNTIME_ERROR (see Iteration 2). Gradio 4.44.1 was the wrong choice and was reverted.
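
The root cause above can be caught before pushing. The following is a minimal sketch of a pre-flight check, not part of the Space itself: `find_zerogpu_conflicts` is a hypothetical helper, and it only looks for the two patterns named in this entry (a CUDA-suffixed `torch`/`torchaudio` pin and the PyTorch wheel index).

```python
import re

# Matches a pinned requirement whose version carries a CUDA local-version
# suffix, e.g. "torch==2.5.1+cu121". Only torch and torchaudio are checked
# because those are the packages named in this log entry.
CUDA_PIN = re.compile(r"^\s*(torch|torchaudio)\s*==\s*[\w.]+\+cu\d+", re.IGNORECASE)


def find_zerogpu_conflicts(requirements_text: str) -> list[str]:
    """Return requirement lines that pin a CUDA-suffixed torch build or add
    the PyTorch wheel index (hypothetical pre-flight helper)."""
    conflicts = []
    for line in requirements_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if CUDA_PIN.match(stripped):
            conflicts.append(stripped)
        elif stripped.startswith("--extra-index-url") and "download.pytorch.org" in stripped:
            conflicts.append(stripped)
    return conflicts


before = """\
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
gradio>=5.0.0,<6.0
"""
print(find_zerogpu_conflicts(before))
```

Run against the original `requirements.txt` contents, the helper flags the index line and both pins and leaves the `gradio` line alone.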

---

## Iteration 2: 2026-04-11

**Stage:** RUNTIME_ERROR

**Error 1 (first):** `TypeError: unhashable type: 'dict'` in `jinja2/utils.py` (Gradio 4.x Jinja template cache bug)

**Error 2:** `ValueError: When localhost is not accessible, a shareable link must be created. Please set share=True` (Gradio 4.x requires `share=True` on remote hosts)

**Root cause:** Downgrading to `gradio==4.44.1` reintroduced two known Gradio 4.x bugs. The commit history already shows that `c0a2ea8` explicitly upgraded to 5.x to fix the Jinja crash. Both errors are 4.x-only issues fixed in 5.x. The "pin to 4.44.1" instruction in the task brief was outdated.

**Fix applied:**
- Reverted `requirements.txt`: `gradio==4.44.1` → `gradio>=5.0.0,<6.0`
- Reverted `README.md`: `sdk_version: 4.44.1` → `sdk_version: 5.0.0`

**Result:** PASS. Space reached RUNNING stage. `/health_hf` returns 308 (route missing); addressed starting in Iteration 3.

---

## Iteration 3: 2026-04-11

**Stage:** RUNNING but `/health_hf` returns 308 Permanent Redirect (no such route)

**Root cause:** `app.py` only has `demo.launch()` with no custom routes. Gradio 5.x redirects unknown paths to `/`.

**Fix applied:** Switched from `demo.launch()` to the `gr.mount_gradio_app()` pattern:
- Added `FastAPI` app with `@app.get("/health_hf")` returning `{"status": "ok"}`
- Replaced `demo.launch()` with `app = gr.mount_gradio_app(app, demo, path="/")`
- `@spaces.GPU` decorator still handles ZeroGPU GPU allocation independently

**Result:** FAIL. RUNTIME_ERROR, exit code 0. `gr.mount_gradio_app()` returns immediately; nothing blocks the process. Fixed in Iteration 4.

---

## Iteration 4: 2026-04-11

**Stage:** RUNTIME_ERROR: `Exit code: 0. Reason: ` (clean exit, process didn't stay alive)

**Root cause:** `gr.mount_gradio_app()` returns an ASGI app object but doesn't start a server. Without `demo.launch()` blocking, `app.py` runs to completion and exits.

**Fix applied:** Added `uvicorn.run(app, host="0.0.0.0", port=7860)` after the mount call to start the ASGI server and block the process.

**Result:** FAIL. RUNTIME_ERROR "No @spaces.GPU function detected during startup". `uvicorn.run()` bypasses the `spaces.zero.gradio` launch wrapper that scans for GPU functions. ZeroGPU requires `demo.launch()`. Fixed in Iteration 5.

---

## Iteration 5: 2026-04-11

**Stage:** RUNTIME_ERROR: `No @spaces.GPU function detected during startup`

**Root cause:** `gr.mount_gradio_app()` + `uvicorn.run()` bypasses the `spaces.zero.gradio` interceptor of `demo.launch()`. ZeroGPU scans for `@spaces.GPU`-decorated functions inside that interceptor. Because the interceptor is never called, the GPU functions aren't registered.

**Fix applied:** Reverted to bare `demo.launch()`. Added `/health_hf` by monkey-patching `gradio.routes.App.create_app` to inject the route into the Gradio FastAPI app at creation time, before ZeroGPU starts the server.

**Result:** FAIL. "Application unable to start for an unknown reason". The `create_app.__func__` access likely failed (AttributeError or TypeError) in Gradio 5.x, crashing startup silently. Fixed in Iteration 6.

---

## Iteration 6: 2026-04-11

**Stage:** RUNTIME_ERROR: "Application unable to start for an unknown reason"

**Root cause:** Monkey-patching `gradio.routes.App.create_app.__func__` crashed at import/startup time in Gradio 5.x. The `__func__` access pattern assumes `create_app` is a classmethod. If the signature or descriptor changed in 5.x, this raises AttributeError and kills the process before any server starts.

**Fix applied:** Replaced the monkey-patch with a daemon thread that polls `demo.server` (set by Gradio after `demo.launch()` initializes the server) and injects `/health_hf` once available. `demo.launch()` stays bare, so ZeroGPU detection works normally. The thread is a no-op if injection fails.

**Result:** FAIL. Space is RUNNING but `/health_hf` still returns 308. `demo.server` is never set in the polling thread's context (ZeroGPU runs the real server in a GPU worker, not the same process). Next attempt in Iteration 7.
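
The polling approach from this iteration can be sketched without Gradio. This is an illustrative stand-in only: `inject_when_ready` and `FakeDemo` are hypothetical names, and the real thread polled `demo.server` rather than a plain attribute on a dummy object.

```python
import threading
import time
from typing import Any, Callable


def inject_when_ready(
    holder: Any,
    attr: str,
    inject: Callable[[Any], None],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> threading.Thread:
    """Poll `holder.<attr>` from a daemon thread and call `inject` once it is
    set. If the attribute never appears before `timeout`, do nothing."""

    def _poll() -> None:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            target = getattr(holder, attr, None)
            if target is not None:
                try:
                    inject(target)
                except Exception:
                    pass  # injection is best-effort; never crash the app
                return
            time.sleep(interval)
        # Timed out: the attribute was never set in this process.

    thread = threading.Thread(target=_poll, daemon=True)
    thread.start()
    return thread


class FakeDemo:
    """Hypothetical stand-in for the Gradio Blocks object."""
    server = None


seen = []
demo = FakeDemo()
worker = inject_when_ready(demo, "server", seen.append, timeout=1.0)
demo.server = "server-object"  # simulates launch() populating the attribute
worker.join()
print(seen)
```

The timeout branch is the one this iteration actually hit: when the attribute is never populated in the polling process, the thread exits quietly and the route is never added.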

---

## Iteration 7: 2026-04-11

**Stage:** RUNNING but `/health_hf` still returns 308

**Root cause:** In ZeroGPU, the actual Gradio server runs in a separate GPU worker process. `demo.server` is never set in the main process, so the daemon thread's poll always fails and the route is never injected.

**Fix applied:** Use `demo.launch(prevent_thread_lock=True)`: the spaces interceptor still detects `@spaces.GPU` functions, then starts the server in a background thread in the same process and returns. After `launch()` returns, `demo.server.app` is accessible and we add `/health_hf`. The main thread is blocked via `threading.Event().wait()` (avoids relying on `demo.block_thread()` existing in Gradio 5.x).

**Result:** FAIL. `AttributeError: 'Server' object has no attribute 'app'`. Gradio 5.x's `Server` wraps uvicorn, and the FastAPI app lives at `server.config.app`, not `server.app`. Fixed in Iteration 8.

---

## Iteration 8: 2026-04-11

**Stage:** RUNTIME_ERROR: `AttributeError: 'Server' object has no attribute 'app'`

**Root cause:** `demo.server` is a Gradio `Server` (wrapping uvicorn). In uvicorn, the ASGI app is stored in `server.config.app` (the `Config` object passed at construction), not directly on `server.app`.

**Fix applied:** Changed `demo.server.app.get(...)` → `demo.server.config.app.get(...)`.

**Result:** FAIL. Space RUNNING but `/health_hf` still returns 308. `demo.server.config.app.get()` adds the route AFTER Gradio's catch-all `/{path_name:path}` is already registered. FastAPI matches routes in insertion order, so the catch-all added first wins. Fixed in Iteration 9.

---

## Iteration 9: 2026-04-11

**Stage:** RUNNING but `/health_hf` returns 308

**Root cause:** Adding `@app.get("/health_hf")` after `create_app` appends the route AFTER Gradio's catch-all `/{path_name:path}`. FastAPI/Starlette matches routes in registration order, so the catch-all, registered first, intercepts everything, including `/health_hf`.

**Fix applied:** Use Starlette middleware (`BaseHTTPMiddleware`) patched into Gradio's `create_app`. Middleware runs BEFORE any route matching, so `/health_hf` is intercepted before the catch-all. Reverted to bare `demo.launch()` (ZeroGPU works). The entire patch is wrapped in `try/except` so failures are silent and don't prevent startup.

**Result:** PASS ✅ Space RUNNING, `GET /health_hf` → `{"status":"ok"}` HTTP 200. All done after 9 iterations.
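
Why middleware wins where an appended route loses can be shown with a framework-free sketch. This is not the Space's code, which used Starlette's `BaseHTTPMiddleware` patched into `create_app`; here `catch_all_app`, `HealthMiddleware` and `request` are hypothetical names, and the inner app simply redirects every path the way the catch-all does.

```python
import asyncio
import json


async def catch_all_app(scope, receive, send):
    """Stand-in for an app whose catch-all route redirects every path to `/`."""
    await send({
        "type": "http.response.start",
        "status": 308,
        "headers": [(b"location", b"/")],
    })
    await send({"type": "http.response.body", "body": b""})


class HealthMiddleware:
    """Pure ASGI middleware: answers GET /health_hf itself and forwards
    everything else to the wrapped app."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if (
            scope["type"] == "http"
            and scope["method"] == "GET"
            and scope["path"] == "/health_hf"
        ):
            body = json.dumps({"status": "ok"}).encode()
            await send({
                "type": "http.response.start",
                "status": 200,
                "headers": [(b"content-type", b"application/json")],
            })
            await send({"type": "http.response.body", "body": body})
            return  # the inner app's router never sees this request
        await self.app(scope, receive, send)


def request(app, path):
    """Drive an ASGI app with a fake GET request; return (status, body)."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    asyncio.run(app(scope, receive, send))
    return sent[0]["status"], sent[1]["body"]


wrapped = HealthMiddleware(catch_all_app)
print(request(wrapped, "/health_hf"))
print(request(wrapped, "/anything-else"))
```

Because the wrapper sits outside the app, the order in which the inner app registered its routes no longer matters for the intercepted path.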

---

## Iteration 10: 2026-04-12

**Stage:** RUNNING but "Run Benchmark" throws OSError

**Error:** `OSError: Could not load this library: /usr/local/lib/python3.10/site-packages/torchaudio/lib/_torchaudio.abi3.so`

**Root cause (via runtime logs):** `qwen-tts` depends on `torchaudio`. `pip install qwen-tts` upgraded `torchaudio` to the latest PyPI release, which was compiled against CUDA 13 (`libcudart.so.13`). ZeroGPU A10G runs CUDA 12, so `libcudart.so.13` is not present. Full import chain: `from qwen_tts import Qwen3TTSModel` → `speech_vq.py` → `import torchaudio.compliance.kaldi` → `torchaudio/__init__.py` → `torchaudio._extension` → `torch.ops.load_library("_torchaudio.abi3.so")` → `OSError: libcudart.so.13`.

**Fix applied:** Pinned `torchaudio==2.5.1` in `requirements.txt` BEFORE the `qwen-tts` line. torchaudio 2.5.1 (Nov 2024) was compiled against CUDA 12, and the pin prevents pip from upgrading to a CUDA 13 build. `kaldi.fbank()` (the only torchaudio function qwen-tts calls from this path) is a CPU-only DSP operation, so no GPU is needed.

**Result:** PASS ✅ Space RUNNING with new SHA 990b408, `/health_hf` → 200. Benchmark fix deployed.

---

## Iteration 11: 2026-04-12

**Stage:** RUNNING; benchmark redesign (not a build error)

**Change:** Replaced the static `np.zeros` reference and arbitrary test text with a live microphone enrollment simulation. New UI: the user records one of the 3 frontend enrollment phrases via a Gradio `Audio` input; the benchmark clones their voice, synthesizes an AI response ("Great job! Now let's keep the conversation going. How was your day?"), and returns the RTF result plus playable audio output. This mirrors the actual frontend UX: enroll → clone → hear AI response.

**Files changed:** `app.py` only.

**Result:** FAIL. Space RUNNING, `/health_hf` 200, but the Gradio API returns 500 Internal Server Error. The UI loads but the "Start →" button fails. See Iteration 12.

---

## Iteration 12: 2026-04-13

**Stage:** RUNNING but Gradio API `/gradio_api/info` returns 500 Internal Server Error

**Error:** `File "/usr/local/lib/python3.10/site-packages/gradio_client/utils.py", line 967, in _json_schema_to_python_type` (crash during API schema generation)

**Root cause (via runtime logs):** Gradio generates a JSON schema for all function signatures when serving `/gradio_api/info`. The `gpu_chat_turn` function had type hints `ref: np.ndarray, history: list, turn_count: int, l1: str, l2: str`. `gradio_client`'s `_json_schema_to_python_type` cannot serialize `numpy.ndarray` into a JSON schema; it crashes on the list comprehension at lines 967-968 while building property descriptions. This crash propagates through Starlette's middleware stack, resulting in a 500 on every request (including the frontend's queue/event polling calls).

**Fix applied:**
- Removed all type hints from the `gpu_enroll_and_greet` and `gpu_chat_turn` signatures (reasoning: Gradio's schema generator only inspects annotated parameters)
- Changed `gpu_enroll_and_greet` to return `ref.tolist()` (plain Python list) instead of `np.ndarray`, which keeps State JSON-serializable
- Changed `gpu_chat_turn` to accept `ref_list` (plain list) and convert it to `np.ndarray` internally via `np.array(ref_list, dtype=np.float32)` before passing it to `synthesize()`
- No changes to callbacks: `on_enroll` stores whatever the function returns; `on_send` passes it through unchanged

**Files changed:** `app.py` only.

**Result:** FAIL. The same crash persists. Removing the `np.ndarray` type hints did not resolve it. The root cause was actually the `gr.State(dict)` itself, not the function signature. See Iteration 13.
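
Although it did not cure the crash, the list round trip introduced here is kept in Iteration 13, so it is worth seeing in isolation. A minimal sketch, assuming a small made-up `ref` array in place of the real enrollment audio:

```python
import json

import numpy as np

# Hypothetical reference audio; the real array comes from the enrollment step.
ref = np.array([0.0, 0.25, -0.5], dtype=np.float32)

# What gets stored in state: a plain Python list of floats.
ref_list = ref.tolist()
json.dumps(ref_list)  # succeeds; json.dumps(ref) would raise TypeError

# What the GPU function rebuilds before synthesis.
restored = np.array(ref_list, dtype=np.float32)
print(restored.dtype, np.array_equal(restored, ref))
```

Converting through `tolist()` and back with an explicit `dtype=np.float32` preserves the float32 values exactly, at the cost of a larger in-memory state for long recordings.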

---

## Iteration 13: 2026-04-13

**Stage:** RUNNING but `/gradio_api/info` still returns 500

**Error:** `TypeError: argument of type 'bool' is not iterable` at `gradio_client/utils.py:882 → get_type → if "const" in schema`

**Root cause:** Removing the `np.ndarray` type hints in Iteration 12 did not fix the crash. The actual source is `gr.State({"l1": "en", "l2": "es", "ref": None, "history": [], "turn_count": 0})`. When Gradio generates the API schema for this State, it calls `_json_schema_to_python_type` on the dict schema. The dict's JSON Schema representation has `additionalProperties: True` (a Python bool, as the JSON Schema spec allows). The schema generator then evaluates `if "const" in schema` where `schema` is that Python bool `True`, causing `TypeError: argument of type 'bool' is not iterable`. This happens in `gradio_client/utils.py` at line 882 regardless of function type hints; it is triggered by the State type itself.

**Fix applied:** Replaced the single `gr.State(dict)` with **5 flat, primitive `gr.State` objects**:
- `state_l1 = gr.State("en")`: string, safe
- `state_l2 = gr.State("es")`: string, safe
- `state_ref = gr.State([])`: empty list (no numpy), safe
- `state_history = gr.State([])`: list of dicts (plain JSON), safe
- `state_turn_count = gr.State(0)`: int, safe

All callbacks were updated to accept and return these flat states. `ref_list` (a Python list) is passed as `state_ref` and converted to `np.ndarray` inside `gpu_chat_turn` only. Full `app.py` rewrite.

**Files changed:** `app.py` only.

**Result:** PASS ✅ Space RUNNING, Gradio UI fully functional (language select → enrollment → chat → wall at turn 7), `/health_hf` → 200. See Iterations 14-16 (2026-04-13, session 2) for the subsequent full-backend migration.
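
The failure mode described in the root cause can be reproduced with a few lines of plain Python. The walkers below are simplified stand-ins written for illustration, not `gradio_client`'s actual code; they only show how a recursive schema walker that assumes every sub-schema is a dict breaks on `additionalProperties: True`.

```python
def naive_type(schema):
    """Simplified schema walker that assumes every sub-schema is a dict."""
    if "const" in schema:  # raises TypeError when schema is the bool True
        return repr(schema["const"])
    if schema.get("type") == "object":
        extra = schema.get("additionalProperties")
        if extra is not None:
            return f"dict[str, {naive_type(extra)}]"
        return "dict"
    return schema.get("type", "Any")


def guarded_type(schema):
    """Same walker, but handles boolean schemas before touching them."""
    if isinstance(schema, bool):
        # In JSON Schema, `true` accepts anything and `false` accepts nothing.
        return "Any" if schema else "Never"
    if "const" in schema:
        return repr(schema["const"])
    if schema.get("type") == "object":
        extra = schema.get("additionalProperties")
        if extra is not None:
            return f"dict[str, {guarded_type(extra)}]"
        return "dict"
    return schema.get("type", "Any")


dict_state_schema = {"type": "object", "additionalProperties": True}

try:
    naive_type(dict_state_schema)
except TypeError as exc:
    print("naive walker:", exc)

print("guarded walker:", guarded_type(dict_state_schema))
```

Flattening the state into primitives, as this iteration did, sidesteps the problem from the application side without depending on how the library walks schemas.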

---

## Iteration 14: 2026-04-13 (session 2)

**Stage:** Full backend migration attempt using the `gr.mount_gradio_app()` approach

**Goal:** Serve the FastAPI REST API (all `/session/*` endpoints) alongside the Gradio UI so the Vercel React frontend can talk directly to the HF Space instead of Railway.

**Approach:** Replaced `demo.launch()` with `app = gr.mount_gradio_app(api, demo, path="/ui")` where `api` is a standalone `FastAPI()` instance with all endpoints defined as routes.

**Error:** `RUNTIME_ERROR`: Space exits with code 0 (clean exit).

**Root cause:** HF Spaces with `sdk: gradio` require `demo.launch()` to start and block the server. `gr.mount_gradio_app()` returns an ASGI app object but does not start a server. This is the same failure as Iteration 4 (the process runs to completion and exits immediately).

**Fix applied:** See Iteration 15.

---

## Iteration 15: 2026-04-13 (session 2)

**Stage:** RUNTIME_ERROR: Space exits with code 0 after `gr.mount_gradio_app()`

**Approach:** Switched to the `include_router()` pattern: patched `gradio.routes.App.create_app` to call `gapp.include_router(_vmr)` (adding all API routes to Gradio's internal FastAPI app), then ended with `demo.launch()` to keep the process alive.

**Error:** `GET /health` → `HTTP 308 Permanent Redirect` (location: `/`). All API routes return 308.

**Root cause:** Gradio 5.x registers a catch-all SPA route `/{path_name:path}` during `create_app`. FastAPI matches routes in insertion order. The catch-all is registered first (inside Gradio's own `create_app` logic), so any routes added afterward via `include_router()` are never matched. Every unknown path gets 308-redirected to `/` before our routes are evaluated.

**Key lesson:** `include_router()` appends routes AFTER the catch-all, so they will never be reached in Gradio 5.x.

**Fix applied:** See Iteration 16.
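
The key lesson is a general property of first-match routing and can be demonstrated with a toy router. This sketch is not FastAPI or Gradio code; `add_route`, `insert_route` and `resolve` are hypothetical helpers that only model "first registered pattern wins".

```python
import re

# Each route is (compiled pattern, handler name). Matching is first-wins,
# mirroring the insertion-order behaviour described above.
routes = []


def add_route(pattern, name):
    """Append a route at the end of the table, like include_router()."""
    routes.append((re.compile(pattern), name))


def insert_route(index, pattern, name):
    """Place a route at a specific position in the table."""
    routes.insert(index, (re.compile(pattern), name))


def resolve(path):
    """Return the name of the first route whose pattern fully matches."""
    for pattern, name in routes:
        if pattern.fullmatch(path):
            return name
    return None


add_route(r"/", "index")
add_route(r"/.*", "spa_catch_all")  # registered by the framework first
add_route(r"/health", "health")     # appended later

print(resolve("/health"))  # the catch-all shadows the appended route

insert_route(1, r"/health", "health")  # placed ahead of the catch-all
print(resolve("/health"))
```

Reordering is shown here only to make the ordering dependence visible; the log's actual fix avoided touching the route table and intercepted requests in middleware instead.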

---

## Iteration 16: 2026-04-13 (session 2)

**Stage:** RUNNING but all API routes return 308 via `include_router()`

**Root cause:** Same Gradio 5.x SPA catch-all issue as Iteration 9, but for custom routes instead of `/health_hf`. `include_router()` is append-only and cannot insert before the catch-all.

**Fix applied:** Implemented all REST API endpoints as a single `BaseHTTPMiddleware` subclass (`_VocalMirrorAPI`) with regex-based path dispatch. Middleware runs BEFORE any route matching (the same pattern that fixed `/health_hf` in Iteration 9). `demo.launch()` stays bare. Session state lives in an in-memory dict, audio in `/tmp/`, enrollment runs in a background thread, and `loop.run_in_executor()` is used to call `gpu_tts()` from async context.

**Result:** PASS ✅
- `GET /health` → `{"status":"ok"}` HTTP 200 ✅
- `GET /vm-config` → `{"wall_turn_count":7}` ✅ (named `/vm-config` to avoid shadowing Gradio's own `/config`)
- `POST /session/start` → returns session_id + word_list ✅
- `GET /session/{id}/wall_status` → `{"show_wall":false,"turn_count":0}` ✅
- Gradio product UI at `/` still fully functional ✅

Space is RUNNING on zero-a10g with SHA ef1ba6a.
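
The dispatch core of such a middleware might look roughly like the sketch below. It is an approximation under stated assumptions, not the contents of `_VocalMirrorAPI`: the `/session/{id}/speak` endpoint, the `sessions` store, the `show_wall` rule (`turn_count >= 7`) and the body of `gpu_tts` are all hypothetical, and the HTTP layer is omitted so the example runs with the standard library alone.

```python
import asyncio
import re
import time

WALL_TURN_COUNT = 7  # value reported by /vm-config in the log
sessions = {"abc": {"turn_count": 0}}  # hypothetical in-memory session store


def gpu_tts(text):
    """Blocking stand-in for the real GPU synthesis call."""
    time.sleep(0.01)
    return f"audio for: {text}"


async def health(match):
    return 200, {"status": "ok"}


async def wall_status(match):
    session = sessions.get(match.group("sid"))
    if session is None:
        return 404, {"error": "unknown session"}
    turns = session["turn_count"]
    # Assumed rule: the wall appears once the turn count reaches the limit.
    return 200, {"show_wall": turns >= WALL_TURN_COUNT, "turn_count": turns}


async def speak(match):
    # Run the blocking call in the default thread pool so the event loop
    # stays free to serve other requests.
    loop = asyncio.get_running_loop()
    audio = await loop.run_in_executor(None, gpu_tts, "hello")
    return 200, {"audio": audio}


# (method, compiled path pattern, handler); the first full match wins.
ROUTES = [
    ("GET", re.compile(r"/health"), health),
    ("GET", re.compile(r"/session/(?P<sid>[^/]+)/wall_status"), wall_status),
    ("POST", re.compile(r"/session/(?P<sid>[^/]+)/speak"), speak),
]


async def dispatch(method, path):
    """Return (status, payload), or None to let the wrapped app handle it."""
    for route_method, pattern, handler in ROUTES:
        match = pattern.fullmatch(path)
        if match and method == route_method:
            return await handler(match)
    return None


print(asyncio.run(dispatch("GET", "/session/abc/wall_status")))
print(asyncio.run(dispatch("POST", "/session/abc/speak")))
print(asyncio.run(dispatch("GET", "/")))
```

Returning `None` for unmatched paths is the piece that keeps the Gradio UI working: the middleware would pass those requests through to the wrapped app unchanged.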