---
title: ZeroEngine V0.2
emoji: π
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 6.5.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: 3.11
hf_oauth: true
hf_oauth_scopes:
  - read-repos
  - email
---
# 🛰️ ZeroEngine V0.2
ZeroEngine is a high-efficiency inference platform designed to push the limits of low-tier hardware. It demonstrates that with aggressive optimization, even a standard 2 vCPU instance can provide a responsive LLM experience.
## π Key Features
- Zero-Config GGUF Loading: Scan and boot any compatible repository directly from the Hub.
- Ghost Cache System: Background tokenization and KV-cache priming for near-instant execution.
- Resource Stewardship: Integrated "Inactivity Session Killer" and 3-pass GC to ensure high availability on shared hardware.
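The resource-stewardship pass described above can be sketched with Python's standard `gc` module. This is a minimal illustration, not the app's actual implementation; only the "3-pass" count comes from the feature list, and the function name is hypothetical:

```python
import gc

def three_pass_gc() -> int:
    """Run three full garbage-collection passes and return the total
    number of objects collected. Repeated passes help because objects
    in reference cycles may only become collectable after an earlier
    pass has freed whatever was keeping them alive."""
    collected = 0
    for _ in range(3):
        collected += gc.collect()
    return collected
```

A pass like this would typically run after a session is rotated out, before the vCPU slot is handed to the next user.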
## 🛠️ Usage
- Target Repo: Enter a Hugging Face model repository (e.g., `unsloth/Llama-3.2-1B-GGUF`). Note: on current 2 vCPU hardware, models larger than 4B parameters are not recommended.
- Scan: Click SCAN to fetch the available `.gguf` quants.
- Select Quant: Choose your preferred file. (Recommendation: `Q4_K_M` for the optimal balance of speed and logic.)
- Initialize: Click BOOT to load the model into the kernel.
- Execute: Start chatting. The engine pre-processes your input into tensors while you type.
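The SCAN step boils down to filtering a repository's file listing for GGUF quants. A minimal sketch of that filter (the file list here is hardcoded for illustration; in the app it would come from the Hub API, e.g. `huggingface_hub.list_repo_files`):

```python
def scan_gguf_quants(repo_files: list[str]) -> list[str]:
    """Return the .gguf files from a repo listing, sorted by name."""
    return sorted(f for f in repo_files if f.endswith(".gguf"))

# Illustrative listing; real filenames come from the target repo.
files = [
    "README.md",
    "Llama-3.2-1B.Q4_K_M.gguf",
    "Llama-3.2-1B.Q8_0.gguf",
    "config.json",
]
print(scan_gguf_quants(files))
# → ['Llama-3.2-1B.Q4_K_M.gguf', 'Llama-3.2-1B.Q8_0.gguf']
```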
## ⚙️ Current Limitations
- Concurrency: To maintain performance, vCPU slots are strictly managed. If the system is full, you will be placed in a queue.
- Inactivity Timeout: Users are automatically rotated out of the active slot after 20 seconds of inactivity to free resources for the community.
- Hardware Bottleneck: On the base 2 vCPU tier, expect 1-5 tokens per second (TPS) for BF16 models and 6-12 TPS for optimized quants.
## 🏗️ Technical Stack
- Inference: `llama-cpp-python`
- Frontend: Gradio 6.5.0
- Telemetry: Custom JSON-based resource monitoring
- License: Apache 2.0
ZeroEngine is a personal open-source project dedicated to making LLM inference accessible on minimal hardware.