hetchyy Claude Opus 4.6 committed on
Commit
6984b50
·
1 Parent(s): a6f747e

Fix SDK worker_init CUDA poisoning: reset immediately instead of 300s cooldown

Browse files

When ZeroGPU SDK's worker_init fails (quota exhaustion causes torch._C._cuda_init()
to fail), it poisons torch.cuda._initialized at the process level. The SDK wraps
this as gradio.Error(title="ZeroGPU worker error") with the original CUDA message
stripped, bypassing our pattern matching. Now detect these SDK worker errors via
e.title and reset CUDA state immediately — no lock is held and no CUDA context is
active at worker_init time, so other users can retry GPU fresh without waiting 300s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (1) hide show
  1. src/core/zero_gpu.py +22 -0
src/core/zero_gpu.py CHANGED
@@ -366,6 +366,13 @@ def gpu_with_fallback(duration=60):
366
  err_lower = str(e).lower()
367
  is_cuda_error = any(p in err_lower for p in _CUDA_ERROR_PATTERNS)
368
 
 
 
 
 
 
 
 
369
  if is_cuda_error:
370
  print(f"[GPU] CUDA error, falling back to CPU: {e}")
371
  _mark_cuda_unhealthy()
@@ -377,6 +384,21 @@ def gpu_with_fallback(duration=60):
377
  pass
378
  return func(*args, **kwargs)
379
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
380
  is_timeout = (
381
  'timeout' in err_lower
382
  or 'duration' in err_lower
 
366
  err_lower = str(e).lower()
367
  is_cuda_error = any(p in err_lower for p in _CUDA_ERROR_PATTERNS)
368
 
369
+ # SDK wraps worker_init failures as gradio.Error(title="ZeroGPU worker error")
370
+ # with message = just the exception class name. Original CUDA message is lost.
371
+ is_sdk_worker_error = False
372
+ if not is_cuda_error:
373
+ err_title = getattr(e, 'title', '') or ''
374
+ is_sdk_worker_error = 'worker' in err_title.lower() and 'error' in err_title.lower()
375
+
376
  if is_cuda_error:
377
  print(f"[GPU] CUDA error, falling back to CPU: {e}")
378
  _mark_cuda_unhealthy()
 
384
  pass
385
  return func(*args, **kwargs)
386
 
387
+ if is_sdk_worker_error:
388
+ # worker_init failed (torch._C._cuda_init() poisoned the process).
389
+ # Reset immediately — no lock is held, no CUDA context is active.
390
+ # Other users can retry GPU fresh; their worker_init gets a new GPU.
391
+ # Do NOT call _mark_cuda_unhealthy() — that blocks ALL users for 300s.
392
+ print(f"[GPU] SDK worker error, resetting CUDA state: {e}")
393
+ _try_reset_cuda_state()
394
+ _request_state.gpu_quota_exhausted = True
395
+ try:
396
+ import gradio as gr
397
+ gr.Warning("GPU temporarily unavailable — using CPU (slower).")
398
+ except Exception:
399
+ pass
400
+ return func(*args, **kwargs)
401
+
402
  is_timeout = (
403
  'timeout' in err_lower
404
  or 'duration' in err_lower