Update README.md

README.md
python_version: 3.11
---

# 🛰️ ZeroEngine V0.1

**ZeroEngine** is a high-efficiency inference platform designed to push the limits of low-tier hardware. It demonstrates that, with aggressive optimization, even a standard 2 vCPU instance can provide a responsive LLM experience.

## 🚀 Key Features

- **Zero-Config GGUF Loading:** Scan and boot any compatible repository directly from the Hub.
- **Ghost Cache System:** Background tokenization and KV-cache priming for near-instant execution (see the sketch after this list).
- **Resource Stewardship:** Integrated "Inactivity Session Killer" and 3-pass GC to ensure high availability on shared hardware.
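
For intuition, here is a hedged sketch of the Ghost Cache idea: tokenizing the user's draft input on a background thread so the prompt is already in tensor form when they hit send. Everything here (`ghost_tokenize`, the cache dict, the model path) is an illustrative assumption built on the public `llama-cpp-python` API, not ZeroEngine's actual implementation:

```python
# Illustrative sketch only: pre-tokenize a draft prompt in the background
# so the expensive prompt-processing pass starts before the user presses
# send. `llama_cpp.Llama.tokenize` is a real API; the rest is hypothetical.
import threading

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=2048, n_threads=2)

_token_cache: dict[str, list[int]] = {}
_cache_lock = threading.Lock()

def ghost_tokenize(draft: str) -> None:
    """Tokenize a draft prompt and stash the result for later reuse."""
    tokens = llm.tokenize(draft.encode("utf-8"))
    with _cache_lock:
        _token_cache[draft] = tokens

def on_user_typing(draft: str) -> None:
    # Fire-and-forget: by the time the user hits send, the token ids
    # are already waiting in the cache.
    threading.Thread(target=ghost_tokenize, args=(draft,), daemon=True).start()
```

KV-cache priming would additionally run the model's prefill pass over those tokens; that step is omitted here because it depends on engine internals.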

## 🛠️ Usage

1. **Target Repo:** Enter a Hugging Face model repository (e.g., `unsloth/Llama-3.2-1B-GGUF`).
   - *Note: On current 2 vCPU hardware, models >4B parameters are not recommended.*
2. **Scan:** Click **SCAN** to fetch the available `.gguf` quants.
3. **Select Quant:** Choose your preferred file (recommendation: `Q4_K_M` for the best balance of speed and reasoning quality).
4. **Initialize:** Click **BOOT** to load the model into the kernel (a sketch of the scan-and-boot flow follows this list).
5. **Execute:** Start chatting. The engine pre-processes your input into tensors while you type.
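
For the curious, here is a minimal sketch of what **SCAN** and **BOOT** roughly correspond to. `list_repo_files` and `Llama.from_pretrained` are real `huggingface_hub` / `llama-cpp-python` calls; the helper names and parameter values are assumptions for the example, not ZeroEngine's actual internals:

```python
# Illustrative sketch of the SCAN -> BOOT flow.
from huggingface_hub import list_repo_files
from llama_cpp import Llama

def scan_quants(repo_id: str) -> list[str]:
    """SCAN: list every .gguf file in a Hub repository."""
    return [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

def boot(repo_id: str, filename: str) -> Llama:
    """BOOT: download the chosen quant and load it for CPU inference."""
    return Llama.from_pretrained(
        repo_id=repo_id,
        filename=filename,
        n_ctx=2048,    # keep context modest on low-tier hardware
        n_threads=2,   # match the 2 vCPU instance
    )

quants = scan_quants("unsloth/Llama-3.2-1B-GGUF")
llm = boot("unsloth/Llama-3.2-1B-GGUF", next(q for q in quants if "Q4_K_M" in q))
```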

## ⚖️ Current Limitations

- **Concurrency:** To maintain performance, vCPU slots are strictly managed. If the system is full, you will be placed in a queue.
- **Inactivity Timeout:** Users are automatically rotated out of the active slot after **20 seconds of inactivity** to free resources for the community (see the watchdog sketch after this list).
- **Hardware Bottleneck:** On the base 2 vCPU tier, expect 1-5 TPS for BF16 models and 6-12 TPS for optimized quants.
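
As a rough illustration of how such a timeout can be enforced, here is a hedged sketch of an inactivity watchdog. The class name, callback wiring, and the pairing with a 3-pass GC are assumptions for illustration, not ZeroEngine's actual code:

```python
# Hedged sketch of an "Inactivity Session Killer": a watchdog that frees
# the active slot after 20 idle seconds. All names here are illustrative.
import gc
import threading

INACTIVITY_TIMEOUT = 20.0  # seconds, per the limitation above

class SessionWatchdog:
    def __init__(self, on_evict):
        self._on_evict = on_evict          # callback that releases the slot
        self._timer: threading.Timer | None = None

    def touch(self) -> None:
        """Call on every user interaction to reset the countdown."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(INACTIVITY_TIMEOUT, self._evict)
        self._timer.daemon = True
        self._timer.start()

    def _evict(self) -> None:
        self._on_evict()
        for _ in range(3):   # echoes the "3-pass GC" from Key Features
            gc.collect()
```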

## 🏗️ Technical Stack

- **Inference:** `llama-cpp-python`
- **Frontend:** `Gradio 6.5.0`
- **Telemetry:** Custom JSON-based resource monitoring (see the sketch after this list)
- **License:** Apache 2.0
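
To make the telemetry bullet concrete, here is a minimal sketch of a JSON resource snapshot, assuming `psutil` is available. The payload fields are an illustrative guess at what "resource monitoring" might record, not ZeroEngine's actual schema:

```python
# Hedged sketch of JSON-based resource telemetry.
import json
import time

import psutil  # pip install psutil

def snapshot() -> str:
    """Serialize current CPU/RAM usage as one JSON telemetry record."""
    return json.dumps({
        "ts": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        "ram_used_mb": psutil.virtual_memory().used // (1024 * 1024),
    })

print(snapshot())  # e.g. {"ts": ..., "cpu_percent": 37.5, "ram_used_mb": 912}
```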

---

*ZeroEngine is a personal open-source project dedicated to making LLM inference accessible on minimal hardware.*