Update README.md

README.md
python_version: 3.11
---

# 🛰️ ZeroEngine V0.1

**ZeroEngine** is a high-efficiency inference platform designed to push the limits of low-tier hardware. It demonstrates that, with aggressive optimization, even a standard 2 vCPU instance can provide a responsive LLM experience.

## 🚀 Key Features

- **Zero-Config GGUF Loading:** Scan and boot any compatible repository directly from the Hub.
- **Ghost Cache System:** Background tokenization and KV-cache priming for near-instant execution (see the sketch after this list).
- **Resource Stewardship:** Integrated "Inactivity Session Killer" and 3-pass GC to ensure high availability on shared hardware.
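
For intuition, here is a hedged sketch of the Ghost Cache idea: tokenizing the user's draft input on a background thread so the prompt is already in tensor form when they hit send. Everything here (`ghost_tokenize`, the cache dict, the model path) is an illustrative assumption built on the public `llama-cpp-python` API, not ZeroEngine's actual implementation:

```python
# Illustrative sketch only: pre-tokenize a draft prompt in the background
# so the expensive prompt-processing pass starts before the user presses
# send. `llama_cpp.Llama.tokenize` is a real API; the rest is hypothetical.
import threading

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=2048, n_threads=2)

_token_cache: dict[str, list[int]] = {}
_cache_lock = threading.Lock()

def ghost_tokenize(draft: str) -> None:
    """Tokenize a draft prompt and stash the result for later reuse."""
    tokens = llm.tokenize(draft.encode("utf-8"))
    with _cache_lock:
        _token_cache[draft] = tokens

def on_user_typing(draft: str) -> None:
    # Fire-and-forget: by the time the user hits send, the token ids
    # are already waiting in the cache.
    threading.Thread(target=ghost_tokenize, args=(draft,), daemon=True).start()
```

KV-cache priming would additionally run the model's prefill pass over those tokens; that step is omitted here because it depends on engine internals.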

## 🛠️ Usage

1. **Target Repo:** Enter a Hugging Face model repository (e.g., `unsloth/Llama-3.2-1B-GGUF`).
   - *Note: On current 2 vCPU hardware, models >4B parameters are not recommended.*
2. **Scan:** Click **SCAN** to fetch the available `.gguf` quants.
3. **Select Quant:** Choose your preferred file (recommendation: `Q4_K_M` for the best balance of speed and reasoning quality).
4. **Initialize:** Click **BOOT** to load the model into the kernel (a sketch of the scan-and-boot flow follows this list).
5. **Execute:** Start chatting. The engine pre-processes your input into tensors while you type.
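
For the curious, here is a minimal sketch of what **SCAN** and **BOOT** roughly correspond to. `list_repo_files` and `Llama.from_pretrained` are real `huggingface_hub` / `llama-cpp-python` calls; the helper names and parameter values are assumptions for the example, not ZeroEngine's actual internals:

```python
# Illustrative sketch of the SCAN -> BOOT flow.
from huggingface_hub import list_repo_files
from llama_cpp import Llama

def scan_quants(repo_id: str) -> list[str]:
    """SCAN: list every .gguf file in a Hub repository."""
    return [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

def boot(repo_id: str, filename: str) -> Llama:
    """BOOT: download the chosen quant and load it for CPU inference."""
    return Llama.from_pretrained(
        repo_id=repo_id,
        filename=filename,
        n_ctx=2048,    # keep context modest on low-tier hardware
        n_threads=2,   # match the 2 vCPU instance
    )

quants = scan_quants("unsloth/Llama-3.2-1B-GGUF")
llm = boot("unsloth/Llama-3.2-1B-GGUF", next(q for q in quants if "Q4_K_M" in q))
```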

## ⚖️ Current Limitations

- **Concurrency:** To maintain performance, vCPU slots are strictly managed. If the system is full, you will be placed in a queue.
- **Inactivity Timeout:** Users are automatically rotated out of the active slot after **20 seconds of inactivity** to free resources for the community (see the watchdog sketch after this list).
- **Hardware Bottleneck:** On the base 2 vCPU tier, expect 1-5 TPS for BF16 models and 6-12 TPS for optimized quants.
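
As a rough illustration of how such a timeout can be enforced, here is a hedged sketch of an inactivity watchdog. The class name, callback wiring, and the pairing with a 3-pass GC are assumptions for illustration, not ZeroEngine's actual code:

```python
# Hedged sketch of an "Inactivity Session Killer": a watchdog that frees
# the active slot after 20 idle seconds. All names here are illustrative.
import gc
import threading

INACTIVITY_TIMEOUT = 20.0  # seconds, per the limitation above

class SessionWatchdog:
    def __init__(self, on_evict):
        self._on_evict = on_evict          # callback that releases the slot
        self._timer: threading.Timer | None = None

    def touch(self) -> None:
        """Call on every user interaction to reset the countdown."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(INACTIVITY_TIMEOUT, self._evict)
        self._timer.daemon = True
        self._timer.start()

    def _evict(self) -> None:
        self._on_evict()
        for _ in range(3):   # echoes the "3-pass GC" from Key Features
            gc.collect()
```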

## 🏗️ Technical Stack

- **Inference:** `llama-cpp-python`
- **Frontend:** `Gradio 6.5.0`
- **Telemetry:** Custom JSON-based resource monitoring (see the sketch after this list)
- **License:** Apache 2.0
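
To make the telemetry bullet concrete, here is a minimal sketch of a JSON resource snapshot, assuming `psutil` is available. The payload fields are an illustrative guess at what "resource monitoring" might record, not ZeroEngine's actual schema:

```python
# Hedged sketch of JSON-based resource telemetry.
import json
import time

import psutil  # pip install psutil

def snapshot() -> str:
    """Serialize current CPU/RAM usage as one JSON telemetry record."""
    return json.dumps({
        "ts": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        "ram_used_mb": psutil.virtual_memory().used // (1024 * 1024),
    })

print(snapshot())  # e.g. {"ts": ..., "cpu_percent": 37.5, "ram_used_mb": 912}
```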

---

*ZeroEngine is a personal open-source project dedicated to making LLM inference accessible on minimal hardware.*