Suggestions for stable HuggingFace ZeroGPU deployment (Fixing CUDA flash-attn error and batch size timeout)

#18
by ezmarynoori - opened

Hi,

Thanks for this amazing project! I deployed the Space on HuggingFace ZeroGPU and encountered a couple of issues that might be helpful to address for stable deployment:

  1. Flash Attention CUDA Error: Because ZeroGPU allocates GPU nodes dynamically, the pre-compiled flash-attn wheel often crashes with "no kernel image is available for execution" when the architectures it was built for do not match the allocated GPU. Forcing a fallback to PyTorch's native Scaled Dot-Product Attention (SDPA) resolved this for me and ran stably on every GPU model I was allocated.

  2. Batch Size Timeout: With the default batch size of 2, a run often exceeds the 120-second ZeroGPU execution limit and the task is aborted. Making batch_size_input interactive (and defaulting it to 1 on Space deployments) lets the pipeline complete within a single session, and lets the Save & Resume mechanism work properly when it is needed.
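
For the first point, here is a minimal sketch of how the attention backend could be chosen at startup. It assumes the model is loaded through an API that accepts an attention-implementation string such as "sdpa" or "flash_attention_2" (as transformers' `attn_implementation` argument does); I have not checked how this project wires it up. `pick_attn_implementation` and the `ATTN_IMPLEMENTATION` override variable are hypothetical names, and `SPACE_ID` is the variable HF Spaces sets inside a running Space.

```python
import importlib.util
import os


def pick_attn_implementation(env=None) -> str:
    """Choose an attention backend name, preferring SDPA on HF Spaces.

    Hypothetical helper, not part of the project.
    """
    env = os.environ if env is None else env

    # Hypothetical manual override, e.g. ATTN_IMPLEMENTATION=eager.
    override = env.get("ATTN_IMPLEMENTATION")
    if override:
        return override

    # On a Space the GPU model can differ between sessions, so a
    # pre-compiled flash-attn wheel may not match it. Use SDPA there.
    if env.get("SPACE_ID"):
        return "sdpa"

    # Elsewhere, use flash-attn only if the package is importable.
    if importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    return "flash_attention_2"
```

The returned string would then be passed to whatever loads the model. Note that the "no kernel image" error only appears when a kernel is launched, so checking that flash-attn imports is not enough on its own; that is why the sketch skips it on Spaces entirely.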

Perhaps adding a graceful fallback in the handler, or a few environment checks, would make HF Space deployments much more robust.
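
As an illustration of such an environment check for the batch size, something like the sketch below might work. `default_batch_size` is a hypothetical helper, the values 1 and 2 are the ones mentioned above, and `SPACE_ID` is the variable HF Spaces sets inside a running Space.

```python
import os

LOCAL_DEFAULT_BATCH_SIZE = 2   # current default, as described above
SPACE_DEFAULT_BATCH_SIZE = 1   # suggested default on Space deployments


def default_batch_size(env=None) -> int:
    """Return a smaller default batch size when running on an HF Space.

    Hypothetical helper, not part of the project.
    """
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        return SPACE_DEFAULT_BATCH_SIZE
    return LOCAL_DEFAULT_BATCH_SIZE
```

The result could be used as the initial value of batch_size_input, with the component left interactive so users can still raise it when their runs fit within the time limit.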

Best regards!
