rubentuesday

docs: log iterations 14-16 (full backend migration, middleware fix)

71b7030 about 2 months ago

HF Space Build Error Log — rubentuesday/vocal-mirror

This file is committed alongside every fix so the repo retains full context of what broke and why.

Iteration 1 — 2026-04-11

Stage: CONFIG_ERROR
Error: No candidate PyTorch version found for ZeroGPU
Root cause: requirements.txt pinned torch==2.5.1+cu121 and torchaudio==2.5.1+cu121 with --extra-index-url https://download.pytorch.org/whl/cu121. ZeroGPU manages its own CUDA PyTorch installation and rejects spaces that pin a +cu121-suffixed variant — it fails at config parse time before any package install.
Fix applied:

Removed --extra-index-url https://download.pytorch.org/whl/cu121 from requirements.txt
Removed torch==2.5.1+cu121 and torchaudio==2.5.1+cu121 from requirements.txt (ZeroGPU provides these)
Changed gradio>=5.0.0,<6.0 → gradio==4.44.1 in requirements.txt (project rule: pin to 4.44.1)
Changed sdk_version: 5.0.0 → sdk_version: 4.44.1 in README.md YAML frontmatter
Result: FAIL — CONFIG_ERROR resolved, but the Gradio downgrade caused a new RUNTIME_ERROR (see Iteration 2). Pinning Gradio 4.44.1 was the wrong choice and was reverted.
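
A pre-commit check along the following lines could catch this class of CONFIG_ERROR before pushing. It is only a sketch: `find_zerogpu_conflicts` and its regex are hypothetical helpers written for this log, and the rule they encode (no `+cuXXX` torch pins, no PyTorch wheel index) is taken from the root cause above rather than from any official ZeroGPU validator.

```python
import re

# Hypothetical rule, derived from the root cause above: ZeroGPU rejects
# torch-family pins that carry a local "+cuXXX" build suffix.
LOCAL_CUDA_PIN = re.compile(
    r"^(torch|torchaudio|torchvision)==[^\s#]*\+cu\d+", re.IGNORECASE
)


def find_zerogpu_conflicts(requirements_text):
    """Return (line_number, line) pairs that pin a CUDA-suffixed torch build
    or add the PyTorch wheel index."""
    conflicts = []
    for number, raw in enumerate(requirements_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("--extra-index-url") and "download.pytorch.org" in line:
            conflicts.append((number, line))
        elif LOCAL_CUDA_PIN.match(line):
            conflicts.append((number, line))
    return conflicts


# The requirements.txt content that triggered Iteration 1.
sample = """\
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
gradio>=5.0.0,<6.0
"""

for number, line in find_zerogpu_conflicts(sample):
    print(f"line {number}: {line}")
```

Run against the original file, the sketch flags the first three lines and leaves the gradio line alone.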

Iteration 2 — 2026-04-11

Stage: RUNTIME_ERROR
Error 1 (first): TypeError: unhashable type: 'dict' in jinja2/utils.py — Gradio 4.x Jinja template cache bug
Error 2: ValueError: When localhost is not accessible, a shareable link must be created. Please set share=True — Gradio 4.x requires share=True on remote hosts
Root cause: Downgrading to gradio==4.44.1 reintroduced two known Gradio 4.x bugs. The commit history already shows c0a2ea8 explicitly upgraded to 5.x to fix the Jinja crash. Both errors are 4.x-only issues fixed in 5.x. The "pin to 4.44.1" instruction in the task brief was outdated.
Fix applied:

Reverted requirements.txt: gradio==4.44.1 → gradio>=5.0.0,<6.0
Reverted README.md: sdk_version: 4.44.1 → sdk_version: 5.0.0
Result: PASS — Space reached RUNNING stage. /health_hf returns 308 (route missing). Fixed in Iteration 3.

Iteration 3 — 2026-04-11

Stage: RUNNING but /health_hf returns 308 Permanent Redirect (no such route)
Root cause: app.py only has demo.launch() with no custom routes. Gradio 5.x redirects unknown paths to /.
Fix applied: Switched from demo.launch() to gr.mount_gradio_app() pattern:

Added FastAPI app with @app.get("/health_hf") returning {"status": "ok"}
Replaced demo.launch() with app = gr.mount_gradio_app(app, demo, path="/")
@spaces.GPU decorator still handles ZeroGPU GPU allocation independently
Result: FAIL — RUNTIME_ERROR exit code 0. gr.mount_gradio_app() returns immediately; nothing blocks the process. Fixed in Iteration 4.

Iteration 4 — 2026-04-11

Stage: RUNTIME_ERROR — Exit code: 0. Reason: (clean exit, process didn't stay alive)
Root cause: gr.mount_gradio_app() returns an ASGI app object but doesn't start a server. Without demo.launch() blocking, app.py runs to completion and exits.
Fix applied: Added uvicorn.run(app, host="0.0.0.0", port=7860) after the mount call to start the ASGI server and block the process.
Result: FAIL — RUNTIME_ERROR "No @spaces.GPU function detected during startup". uvicorn.run() bypasses the spaces.zero.gradio launch wrapper that scans for GPU functions. ZeroGPU requires demo.launch(). Fixed in Iteration 5.

Iteration 5 — 2026-04-11

Stage: RUNTIME_ERROR — No @spaces.GPU function detected during startup
Root cause: gr.mount_gradio_app() + uvicorn.run() bypasses the spaces.zero.gradio interceptor of demo.launch(). ZeroGPU scans for @spaces.GPU decorated functions inside that interceptor; because the interceptor is never called, the GPU functions are never registered.
Fix applied: Reverted to bare demo.launch(). Added /health_hf by monkey-patching gradio.routes.App.create_app to inject the route into the Gradio FastAPI app at creation time, before ZeroGPU starts the server.
Result: FAIL — "Application unable to start for an unknown reason". The create_app.__func__ access likely failed (AttributeError or TypeError) in Gradio 5.x, crashing startup silently. Fixed in Iteration 6.

Iteration 6 — 2026-04-11

Stage: RUNTIME_ERROR — "Application unable to start for an unknown reason"
Root cause: Monkey-patching gradio.routes.App.create_app.__func__ crashed at import/startup time in Gradio 5.x. The __func__ access pattern assumes create_app is a classmethod — if the signature or descriptor changed in 5.x, this raises AttributeError and kills the process before any server starts.
Fix applied: Replaced monkey-patch with a daemon thread that polls demo.server (set by Gradio after demo.launch() initializes the server) and injects /health_hf once available. demo.launch() stays bare — ZeroGPU detection works normally. Thread is a no-op if injection fails.
Result: FAIL — Space is RUNNING but /health_hf still returns 308. demo.server is never set in the polling thread's context (ZeroGPU runs the real server in a GPU worker, not the same process). Fixed in Iteration 7.
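
The polling approach from this iteration can be sketched with the standard library alone. Everything here is illustrative: `inject_when_ready` is a hypothetical helper, and `fake_demo` is a stand-in object, not a Gradio `Blocks` instance. The timeout branch models what happened in practice, where the attribute was never set and the thread gave up without injecting anything.

```python
import threading
import time
from types import SimpleNamespace


def inject_when_ready(target, attr, inject, timeout=5.0, interval=0.05):
    """Poll target.<attr> from a daemon thread and call inject(value) once it
    is set. Gives up silently after `timeout` seconds."""

    def _poll():
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            value = getattr(target, attr, None)
            if value is not None:
                try:
                    inject(value)
                except Exception:
                    pass  # a failed injection must not take the app down
                return
            time.sleep(interval)

    thread = threading.Thread(target=_poll, daemon=True)
    thread.start()
    return thread


# Stand-in for the demo object; `server` starts unset, as before launch().
fake_demo = SimpleNamespace(server=None)
seen = []
worker = inject_when_ready(fake_demo, "server", seen.append, timeout=2.0)
fake_demo.server = "server-object"  # simulates launch() setting the attribute
worker.join(timeout=2.0)
print(seen)
```

Because the thread is a daemon and swallows its own errors, it cannot block shutdown or crash startup, which matches the "no-op if injection fails" behaviour described in the fix.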

Iteration 7 — 2026-04-11

Stage: RUNNING but /health_hf still returns 308
Root cause: In ZeroGPU, the actual Gradio server runs in a separate GPU worker process. demo.server is never set in the main process, so the daemon thread's poll always fails and the route is never injected.
Fix applied: Use demo.launch(prevent_thread_lock=True) — the spaces interceptor still detects @spaces.GPU functions, then starts the server in a background thread in the same process and returns. After launch() returns, demo.server.app is accessible and we add /health_hf. Main thread blocked via threading.Event().wait() (avoids relying on demo.block_thread() existing in Gradio 5.x).
Result: FAIL — AttributeError: 'Server' object has no attribute 'app'. Gradio 5.x's Server wraps uvicorn — the FastAPI app lives at server.config.app, not server.app. Fixed in Iteration 8.

Iteration 8 — 2026-04-11

Stage: RUNTIME_ERROR — AttributeError: 'Server' object has no attribute 'app'
Root cause: demo.server is a Gradio Server (wrapping uvicorn). In uvicorn, the ASGI app is stored in server.config.app (the Config object passed at construction), not directly on server.app.
Fix applied: Changed demo.server.app.get(...) → demo.server.config.app.get(...).
Result: FAIL — Space RUNNING but /health_hf still 308. demo.server.config.app.get() adds the route AFTER Gradio's catch-all /{path_name:path} is already registered. FastAPI matches routes in insertion order, so the catch-all, added first, wins. Fixed in Iteration 9.
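
The ordering effect behind this failure can be shown with a toy first-match router. This is a simplified model written for this log, not Starlette's router: `first_match` and the two route tuples are hypothetical, and real path matching is more involved than `re.fullmatch`.

```python
import re


def first_match(routes, path):
    """Return the name of the first route whose pattern matches `path`."""
    for pattern, name in routes:
        if re.fullmatch(pattern, path):
            return name
    return None


catch_all = (r"/.*", "catch_all")   # plays the role of /{path_name:path}
health = (r"/health_hf", "health")

appended_late = [catch_all, health]      # health added after the catch-all
registered_first = [health, catch_all]   # health added before the catch-all

print(first_match(appended_late, "/health_hf"))
print(first_match(registered_first, "/health_hf"))
```

With the catch-all first, the health route is never reached, which is the situation `demo.server.config.app.get(...)` produced.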

Iteration 9 — 2026-04-11

Stage: RUNNING but /health_hf returns 308
Root cause: Adding @app.get("/health_hf") after create_app appends the route AFTER Gradio's catch-all /{path_name:path}. FastAPI/Starlette matches routes in registration order — the catch-all was registered first and intercepts everything, including /health_hf.
Fix applied: Use Starlette middleware (BaseHTTPMiddleware) patched into Gradio's create_app. Middleware runs BEFORE any route matching, so /health_hf is intercepted before the catch-all. Reverted to bare demo.launch() (ZeroGPU works). Entire patch wrapped in try/except so failures are silent and don't prevent startup.
Result: PASS ✓ — Space RUNNING, GET /health_hf → {"status":"ok"} HTTP 200. All done after 9 iterations.
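
The idea that middleware sees a request before any route does can be sketched without Gradio or Starlette. The example below uses a plain ASGI wrapper instead of the `BaseHTTPMiddleware` subclass used in the actual fix, purely to stay dependency-free. `HealthCheckMiddleware`, `catch_all_app`, and `request_status` are hypothetical names, and `catch_all_app` only imitates the 308 redirect described above.

```python
import asyncio
import json


class HealthCheckMiddleware:
    """Plain ASGI wrapper: answers GET /health_hf itself, forwards the rest."""

    def __init__(self, app, path="/health_hf"):
        self.app = app
        self.path = path

    async def __call__(self, scope, receive, send):
        is_health = (
            scope["type"] == "http"
            and scope["method"] == "GET"
            and scope["path"] == self.path
        )
        if not is_health:
            await self.app(scope, receive, send)  # normal routing happens here
            return
        body = json.dumps({"status": "ok"}).encode()
        headers = [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode()),
        ]
        await send({"type": "http.response.start", "status": 200, "headers": headers})
        await send({"type": "http.response.body", "body": body})


async def catch_all_app(scope, receive, send):
    """Stand-in for an app whose catch-all redirects every path to /."""
    await send({"type": "http.response.start", "status": 308,
                "headers": [(b"location", b"/")]})
    await send({"type": "http.response.body", "body": b""})


async def request_status(app, path):
    """Drive one GET through an ASGI app and return (status, body)."""
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return messages[0]["status"], messages[1]["body"]


wrapped = HealthCheckMiddleware(catch_all_app)
print(asyncio.run(request_status(wrapped, "/health_hf")))
print(asyncio.run(request_status(wrapped, "/anything-else")))
```

The wrapped app answers the health path with a 200 while every other path still reaches the inner app, so route registration order no longer matters for the intercepted path.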

Iteration 10 — 2026-04-12

Stage: RUNNING but "Run Benchmark" throws OSError
Error: OSError: Could not load this library: /usr/local/lib/python3.10/site-packages/torchaudio/lib/_torchaudio.abi3.so
Root cause (via runtime logs): qwen-tts depends on torchaudio. pip install qwen-tts upgraded torchaudio to the latest PyPI release which was compiled against CUDA 13 (libcudart.so.13). ZeroGPU A10G runs CUDA 12, so libcudart.so.13 is not present. Full import chain: from qwen_tts import Qwen3TTSModel → speech_vq.py → import torchaudio.compliance.kaldi → torchaudio/__init__.py → torchaudio._extension → torch.ops.load_library("_torchaudio.abi3.so") → OSError: libcudart.so.13.
Fix applied: Pinned torchaudio==2.5.1 in requirements.txt BEFORE the qwen-tts line. torchaudio 2.5.1 (Nov 2024) was compiled against CUDA 12, and the pin prevents pip from upgrading to a CUDA 13 build. kaldi.fbank() (the only torchaudio function qwen-tts calls from this path) is a CPU-only DSP operation, so no GPU is needed.
Result: PASS ✓ — Space RUNNING with new SHA 990b408, /health_hf → 200. Benchmark fix deployed.
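
Since the breakage came from a dependency silently pulling in a newer torchaudio, logging the resolved versions at startup may make a repeat easier to spot in the runtime logs. The sketch below uses only `importlib.metadata` from the standard library; `installed_versions` is a hypothetical helper and the package list is just the set named in this iteration.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["torch", "torchaudio", "qwen-tts"]).items():
    print(f"{name}: {version or 'not installed'}")
```

This reads package metadata only, so it does not import torchaudio and cannot trigger the OSError itself.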

Iteration 11 — 2026-04-12

Stage: RUNNING — benchmark redesign (not a build error)
Change: Replaced static np.zeros reference + arbitrary test text with a live microphone enrollment simulation. New UI: user records one of the 3 frontend enrollment phrases via Gradio Audio input, benchmark clones their voice and synthesizes an AI response ("Great job! Now let's keep the conversation going. How was your day?"), returns RTF result + playable audio output. Mirrors the actual frontend UX: enroll → clone → hear AI response.
Files changed: app.py only.
Result: FAIL — space RUNNING, /health_hf 200, but Gradio API returns 500 Internal Server Error. UI loads but "Start →" button fails. See Iteration 12.

Iteration 12 — 2026-04-13

Stage: RUNNING but Gradio API /gradio_api/info returns 500 Internal Server Error
Error: File "/usr/local/lib/python3.10/site-packages/gradio_client/utils.py", line 967, in _json_schema_to_python_type — crash during API schema generation
Root cause (via runtime logs): Gradio generates a JSON schema for all function signatures when serving /gradio_api/info. The gpu_chat_turn function had type hints ref: np.ndarray, history: list, turn_count: int, l1: str, l2: str. gradio_client's _json_schema_to_python_type cannot convert numpy.ndarray into a JSON schema type; it crashes in the list comprehension at lines 967–968 while building property descriptions. The exception propagates through Starlette's middleware stack and produces a 500 on every request (including the frontend's queue/event polling calls).
Fix applied:

Removed all type hints from gpu_enroll_and_greet and gpu_chat_turn signatures — Gradio's schema generator only inspects annotated parameters
Changed gpu_enroll_and_greet to return ref.tolist() (plain Python list) instead of np.ndarray — keeps State JSON-serializable
Changed gpu_chat_turn to accept ref_list (plain list) and convert to np.ndarray internally via np.array(ref_list, dtype=np.float32) before passing to synthesize()
No changes to callbacks — on_enroll stores whatever the function returns; on_send passes it through unchanged
Files changed: app.py only.
Result: FAIL — same crash persists. Removing np.ndarray type hints did not resolve it. Root cause was actually the gr.State(dict) itself, not the function signature. See Iteration 13.
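
A quick way to confirm that a value intended for session state holds only plain JSON types is to round it through `json.dumps`. This is a minimal sketch: `is_json_safe` is a hypothetical helper, the float list stands in for the output of `ref.tolist()`, and a `bytearray` stands in for a non-JSON object such as an ndarray so the example needs no NumPy.

```python
import json


def is_json_safe(value):
    """True if `value` survives json.dumps, i.e. holds only plain JSON types."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


samples = [0.0, 0.25, -0.5]  # stand-in for ref.tolist()

print(is_json_safe(samples))
print(is_json_safe({"ref": samples, "history": [], "turn_count": 0}))
print(is_json_safe(bytearray(b"\x00\x01")))  # stand-in for a non-JSON object
```

Passing this check says nothing about how Gradio builds its API schema for the value, which is why it would not have caught the crash in this iteration.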

Iteration 13 — 2026-04-13

Stage: RUNNING but /gradio_api/info still returns 500
Error: TypeError: argument of type 'bool' is not iterable at gradio_client/utils.py:882 → get_type → if "const" in schema
Root cause: Removing np.ndarray type hints in Iteration 12 did not fix the crash. The actual source is gr.State({"l1": "en", "l2": "es", "ref": None, "history": [], "turn_count": 0}). When Gradio generates the API schema for this State, it calls _json_schema_to_python_type on the dict schema. The dict's JSON Schema representation has additionalProperties: True (a Python bool, per JSON Schema spec). The schema generator then does if "const" in schema where schema is already a Python bool True, causing TypeError: argument of type 'bool' is not iterable. This happens in gradio_client/utils.py at line 882 regardless of function type hints — it's triggered by the State type itself.
Fix applied: Replaced single gr.State(dict) with 5 flat, primitive gr.State objects:

state_l1 = gr.State("en") — string, safe
state_l2 = gr.State("es") — string, safe
state_ref = gr.State([]) — empty list (no numpy), safe
state_history = gr.State([]) — list of dicts (plain JSON), safe
state_turn_count = gr.State(0) — int, safe
All callbacks updated to accept/return these flat states. ref_list (a Python list) is passed as state_ref and converted to np.ndarray inside gpu_chat_turn only. Full app.py rewrite.
Files changed: app.py only.
Result: PASS ✓ — Space RUNNING, Gradio UI fully functional (language select → enrollment → chat → wall at turn 7), /health_hf → 200. See session 2026-04-13 for subsequent full-backend migration.
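
The TypeError in this iteration's root cause comes from running a membership test on a boolean schema, and it can be reproduced in isolation. The functions below are a toy written for this log, not gradio_client's code: `broken_get_type` only mirrors the `if "const" in schema` check quoted above, and `safe_get_type` shows one possible guard, assuming boolean schemas should be treated as "accept anything" or "accept nothing".

```python
def broken_get_type(schema):
    """Mirrors the failing check: assumes every schema node is a dict."""
    if "const" in schema:  # raises TypeError when schema is True or False
        return "const"
    return schema.get("type", "any")


def safe_get_type(schema):
    """Same check, but handles boolean schemas before the membership test."""
    if isinstance(schema, bool):
        return "any" if schema else "never"
    if "const" in schema:
        return "const"
    return schema.get("type", "any")


# Shape of the schema the log describes for a dict-valued State.
dict_schema = {"type": "object", "additionalProperties": True}

try:
    broken_get_type(dict_schema["additionalProperties"])
except TypeError as exc:
    print(f"TypeError: {exc}")

print(safe_get_type(dict_schema["additionalProperties"]))
print(safe_get_type(dict_schema))
```

Flattening the state into primitive values, as done in the fix, avoids the boolean `additionalProperties` node entirely instead of guarding against it.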

Iteration 14 — 2026-04-13 (session 2)

Stage: Full backend migration attempt — gr.mount_gradio_app() approach
Goal: Serve FastAPI REST API (all /session/* endpoints) alongside Gradio UI so the Vercel React frontend can talk directly to the HF Space instead of Railway.
Approach: Replaced demo.launch() with app = gr.mount_gradio_app(api, demo, path="/ui") where api is a standalone FastAPI() instance with all endpoints defined as routes.
Error: RUNTIME_ERROR — Space exits with code 0 (clean exit).
Root cause: HF Spaces with sdk: gradio require demo.launch() to start and block the server. gr.mount_gradio_app() returns an ASGI app object but does not start a server — same as Iteration 4 (the process runs to completion and exits immediately).
Fix applied: See Iteration 15.

Iteration 15 — 2026-04-13 (session 2)

Stage: RUNTIME_ERROR — Space exits code 0 after gr.mount_gradio_app()
Approach: Switched to include_router() pattern: patched gradio.routes.App.create_app to call gapp.include_router(_vmr) (adding all API routes to Gradio's internal FastAPI app), then ended with demo.launch() to keep the process alive.
Error: GET /health → HTTP 308 Permanent Redirect (location: /). All API routes return 308.
Root cause: Gradio 5.x registers a catch-all SPA route /{path_name:path} during create_app. FastAPI matches routes in insertion order — the catch-all is registered first (inside Gradio's own create_app logic), so any routes added afterward via include_router() are never matched. Every unknown path gets 308-redirected to / before our routes are evaluated.
Key lesson: include_router() appends routes AFTER the catch-all — they will never be reached in Gradio 5.x.
Fix applied: See Iteration 16.

Iteration 16 — 2026-04-13 (session 2)

Stage: RUNNING but all API routes return 308 via include_router()
Root cause: Same Gradio 5.x SPA catch-all issue as Iteration 9 but for custom routes instead of /health_hf. include_router() is append-only and cannot insert before the catch-all.
Fix applied: Implemented all REST API endpoints as a single BaseHTTPMiddleware subclass (_VocalMirrorAPI) with regex-based path dispatch. Middleware runs BEFORE any route matching (same pattern that fixed /health_hf in Iteration 9). demo.launch() stays bare. Session state lives in an in-memory dict, audio files are written to /tmp/, enrollment runs in a background thread, and gpu_tts() is called from the async context via the event loop's run_in_executor().
Result: PASS ✓

GET /health → {"status":"ok"} HTTP 200 ✓
GET /vm-config → {"wall_turn_count":7} ✓ (named /vm-config to avoid shadowing Gradio's own /config)
POST /session/start → returns session_id + word_list ✓
GET /session/{id}/wall_status → {"show_wall":false,"turn_count":0} ✓
Gradio product UI at / still fully functional ✓
Space is RUNNING on zero-a10g with SHA ef1ba6a.
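
The regex dispatch and executor hand-off described in this iteration might look roughly like the sketch below. It is not the code of `_VocalMirrorAPI`: the route table, handler names, `slow_tts`, the `/session/{id}/speak` endpoint, and the `turns >= WALL_TURN_COUNT` rule are all assumptions made for illustration. A middleware would call `dispatch` first and forward the request to the wrapped app whenever it returns `None`.

```python
import asyncio
import re
import time

SESSIONS = {"abc": {"turn_count": 0}}  # in-memory session store (illustrative)
WALL_TURN_COUNT = 7


def slow_tts(text):
    """Stand-in for a blocking GPU call."""
    time.sleep(0.01)
    return f"audio({text})"


async def health(match):
    return 200, {"status": "ok"}


async def wall_status(match):
    session = SESSIONS.get(match.group("sid"))
    if session is None:
        return 404, {"error": "unknown session"}
    turns = session["turn_count"]
    # Assumed rule: the wall appears once the turn count reaches the limit.
    return 200, {"show_wall": turns >= WALL_TURN_COUNT, "turn_count": turns}


async def speak(match):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the loop stays free.
    audio = await loop.run_in_executor(None, slow_tts, "hello")
    return 200, {"audio": audio}


ROUTES = [
    ("GET", re.compile(r"^/health$"), health),
    ("GET", re.compile(r"^/session/(?P<sid>[^/]+)/wall_status$"), wall_status),
    ("POST", re.compile(r"^/session/(?P<sid>[^/]+)/speak$"), speak),  # hypothetical
]


async def dispatch(method, path):
    """Return (status, payload) for a known route, or None to fall through."""
    for route_method, pattern, handler in ROUTES:
        match = pattern.match(path)
        if match and method == route_method:
            return await handler(match)
    return None  # caller forwards the request to the wrapped app


print(asyncio.run(dispatch("GET", "/session/abc/wall_status")))
print(asyncio.run(dispatch("POST", "/session/abc/speak")))
print(asyncio.run(dispatch("GET", "/")))
```

Returning `None` for unmatched paths keeps the Gradio UI at / untouched, since only requests the table recognises are answered by the middleware.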