Suggestions for stable HuggingFace ZeroGPU deployment (Fixing CUDA flash-attn error and batch size timeout)
Hi,
Thanks for this amazing project! I deployed the Space on HuggingFace ZeroGPU and encountered a couple of issues that might be helpful to address for stable deployment:
Flash Attention CUDA Error: On ZeroGPU, the allocated GPU can change from one session to the next, so the pre-compiled flash-attn wheel often crashes with "no kernel image is available for execution" when the architectures it was built for do not match the allocated GPU. Forcing a fallback to PyTorch's native Scaled Dot-Product Attention (SDPA) resolves this and keeps the Space stable across all allocated GPU models.
Batch Size Timeout: The default batch size of 2 often exceeds the 120-second ZeroGPU execution limit, leading to aborted tasks. Making batch_size_input interactive (and defaulting to 1 on Space deployments) allows the pipeline to complete successfully in a single session and enables the Save & Resume mechanism to work properly when needed.
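For the flash-attn point, the fallback could look roughly like the sketch below. This is only an illustration and not the project's actual loading code: choose_attention_backend is a hypothetical helper, and I am assuming that ZeroGPU Spaces expose a SPACES_ZERO_GPU environment variable and that the model accepts an attention-implementation string such as "sdpa" or "flash_attention_2" (the values transformers uses for attn_implementation).

```python
import importlib.util
import os


def choose_attention_backend(environ=None, flash_attn_available=None):
    """Return "sdpa" on ZeroGPU or when flash-attn is missing, else "flash_attention_2".

    Hypothetical helper. SPACES_ZERO_GPU is assumed to be set on ZeroGPU Spaces;
    adjust the check if the deployment signals ZeroGPU differently.
    """
    env = os.environ if environ is None else environ

    # Only probe for the package when the caller did not tell us explicitly.
    if flash_attn_available is None:
        flash_attn_available = importlib.util.find_spec("flash_attn") is not None

    on_zerogpu = env.get("SPACES_ZERO_GPU", "").lower() in ("1", "true")

    # A pre-compiled flash-attn wheel may not match the GPU that ZeroGPU
    # allocates, so prefer PyTorch's native SDPA there.
    if on_zerogpu or not flash_attn_available:
        return "sdpa"
    return "flash_attention_2"
```

If the model is loaded through transformers, the returned string could be passed as attn_implementation to from_pretrained; other loaders would need their own switch.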
Perhaps adding a graceful fallback in the handler or environment checks would make HF Space deployments much more robust.
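As a rough idea of what such an environment check could look like for the batch size, here is a minimal sketch. default_batch_size is a hypothetical helper, and it assumes the app runs with the SPACE_ID environment variable set when hosted on a HuggingFace Space; the values 2 and 1 are the current default and the suggested Space default from above.

```python
import os


def default_batch_size(environ=None, local_default=2, space_default=1):
    """Pick a smaller default batch size when running inside a HF Space.

    Hypothetical helper. Assumes SPACE_ID is present (and non-empty) in the
    environment on Space deployments.
    """
    env = os.environ if environ is None else environ
    on_space = bool(env.get("SPACE_ID"))
    return space_default if on_space else local_default
```

The result could then be used as the initial value of batch_size_input, with the component marked interactive so users can still raise it on faster hardware.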
Best regards!