Apply for a GPU community grant: Personal project
#1 opened by turtle170
Hi @hysts,
I am requesting a ZeroGPU grant for this space: turtle170/ZeroEngine.
Project Focus
ZeroEngine demonstrates high-efficiency LLM orchestration. I have already optimized the inference kernels to run GGUF models on the base 2 vCPU tier, but hardware is now the primary bottleneck for the user experience.
Hardware Justification
While the 2 vCPU build works, the inference speed and queue times limit its utility as a community tool.
- Current Limitation: Capped at small 1B-3B models with significant latency (currently seeing 1-4 TPS on Unsloth BF16 Llama 3.2 1B).
- ZeroGPU Goal: Upgrading will allow ZeroEngine to support 7B+ models with fast inference (predicted 50-200 TPS on a 7B Q4_K_M model), turning it into a flagship GGUF runner for the Hub.
Technical Responsibility
I have built this engine to be a "polite neighbor" on shared hardware:
- Aggressive Cleanup: ZeroEngine uses a 20-second inactivity session killer that combines Python garbage collection with a dedicated model-unloading routine, so VRAM is released immediately after a session ends. The inactive user is also returned to the queue to free up space for other users (see the sketch after this list).
- Optimization: Background tokenization and KV-cache priming are already implemented in a separate background handler (turtle170/ZeroEngine-Backend) to minimize active GPU residency time, so the Space only occupies a GPU slice during active generation.
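For context, here is a minimal sketch of the cleanup routine described above. The 20-second threshold matches the post; the session registry, `unload_model` helper, and the optional `torch.cuda.empty_cache()` call are illustrative assumptions, not the exact ZeroEngine implementation:

```python
import gc
import time
import threading

# Hypothetical session registry: session_id -> (model handle, last activity timestamp)
SESSIONS: dict[str, tuple[object, float]] = {}
INACTIVITY_LIMIT_S = 20  # the 20s inactivity threshold described above

def unload_model(model) -> None:
    """Drop references to the model so its memory can be reclaimed."""
    del model
    gc.collect()  # Python garbage collection pass
    try:
        import torch
        torch.cuda.empty_cache()  # release cached VRAM back to the driver, if torch is present
    except ImportError:
        pass

def session_killer() -> None:
    """Background loop: evict sessions idle for longer than INACTIVITY_LIMIT_S."""
    while True:
        now = time.monotonic()
        for sid, (model, last_seen) in list(SESSIONS.items()):
            if now - last_seen > INACTIVITY_LIMIT_S:
                SESSIONS.pop(sid, None)  # kick the idle user back into the queue
                unload_model(model)      # free VRAM immediately
        time.sleep(1)

threading.Thread(target=session_killer, daemon=True).start()
```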
The Space is fully refactored for Gradio 6.5.0 and is ready for immediate @spaces.GPU deployment (a minimal sketch of the wiring is below). This grant will allow us to provide the community with a high-performance, zero-config way to explore GGUF models at scale.
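The sketch below shows the standard ZeroGPU pattern the Space would use: only the decorated function holds a GPU slice, and only for the duration of the call. The `run_zeroengine` helper and the `duration` value are placeholders, not the actual ZeroEngine code:

```python
import gradio as gr
import spaces  # available on ZeroGPU Spaces

def run_zeroengine(prompt: str) -> str:
    # Placeholder for the actual ZeroEngine GGUF inference path
    return f"(generated continuation of: {prompt})"

@spaces.GPU(duration=60)  # a GPU slice is held only while this call runs
def generate(prompt: str) -> str:
    return run_zeroengine(prompt)

demo = gr.Interface(fn=generate, inputs="text", outputs="text")
demo.launch()
```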
Thank you for your consideration!