turtle170 committed
Commit 5ed5fed · verified · 1 parent: 1a132e5

Update README.md

Files changed (1): README.md (+27 −12)
README.md CHANGED
@@ -11,17 +11,32 @@ license: apache-2.0
  python_version: 3.11
  ---
 
- ## Overview:
- ZeroEngine is designed to demonstrate how low-tier hardware like the 2 vCPU instance provided by HF can run various models with ease.
 
- ## Usage
- 1. Enter your model repo (e.g. unsloth/gemma-3-1b-it-GGUF) [CAUTION: since ZeroEngine is running on low-tier hardware, It cannot run big models >4B.]
- 2. Click 'SCAN' to get all the .gguf files of that repo.
- 3. Click your preferred file (Q4_K_M has the best performance, at about 6-12 Tokens Per Second.)
- 4. Select 'BOOT' to load your model.
- 5. Start chatting! The engine automatically pre-processes your enquiry into a tensor, speeding up everything.
 
- ## Limitations
- 1. You might have to queue, as only 2 vCPUs are available. As we prioritise performance, a vCPU is assigned to a active user. There may be 2 active users at the same time. After one of the idles for >=20 seconds, they will be automatically kicked into the queue, freeing a slot.
- 2. As the engine runs on low-tier hardware, expect 1-5 TPS on BF16 models.
- 3. As the engine uses a shared template, some models like Gemma 3 would not work.
+ # 🛰️ ZeroEngine V0.1
+ **ZeroEngine** is a high-efficiency inference platform designed to push the limits of low-tier hardware. It demonstrates that with aggressive optimization, even a standard 2 vCPU instance can provide a responsive LLM experience.
+ 
+ ## 🚀 Key Features
+ - **Zero-Config GGUF Loading:** Scan and boot any compatible repository directly from the Hub.
+ - **Ghost Cache System:** Background tokenization and KV-cache priming for near-instant execution.
+ - **Resource Stewardship:** Integrated "Inactivity Session Killer" and 3-pass GC to ensure high availability on shared hardware.
+ 
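The "Ghost Cache System" is not documented in detail here; one plausible reading is that prompt text is tokenized on a background thread while the user is still typing, so generation can begin immediately on submit. A minimal sketch of that idea, with all names hypothetical and a whitespace split standing in for real GGUF tokenization:

```python
import threading


class GhostCache:
    """Hypothetical sketch: pre-tokenize prompt text in the background
    so generation can start without waiting on tokenization.
    A whitespace split stands in for a real llama.cpp tokenizer."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def _tokenize(self, text):
        # Stand-in for actual GGUF/llama.cpp tokenization.
        return text.split()

    def prime(self, text):
        """Tokenize `text` on a worker thread while the user is typing."""
        def work():
            tokens = self._tokenize(text)
            with self._lock:
                self._cache[text] = tokens
        worker = threading.Thread(target=work, daemon=True)
        worker.start()
        return worker

    def get(self, text):
        """Return primed tokens, or tokenize on the spot on a cache miss."""
        with self._lock:
            if text in self._cache:
                return self._cache[text]
        return self._tokenize(text)
```

On submit, `get()` either returns the already-primed tokens or falls back to synchronous tokenization, so correctness never depends on the background thread having finished.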
+ ## 🛠️ Usage
+ 1. **Target Repo:** Enter a Hugging Face model repository (e.g., `unsloth/Llama-3.2-1B-GGUF`).
+    - *Note: On the current 2 vCPU hardware, models larger than 4B parameters are not recommended.*
+ 2. **Scan:** Click **SCAN** to fetch the available `.gguf` quants.
+ 3. **Select Quant:** Choose your preferred file (recommendation: `Q4_K_M` for the best balance of speed and output quality).
+ 4. **Initialize:** Click **BOOT** to load the model into the kernel.
+ 5. **Execute:** Start chatting. The engine pre-processes your input into tensors while you type.
+ 
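The SCAN step presumably lists the repository's files and keeps only the GGUF quants. A hedged sketch of that filtering (the `scan_gguf` helper is hypothetical; `huggingface_hub.list_repo_files` is a real API that would supply the listing in the live app):

```python
def scan_gguf(files):
    """Hypothetical helper: keep only .gguf quant files from a repo listing."""
    return sorted(f for f in files if f.lower().endswith(".gguf"))


# In the real app the listing would come from the Hub, e.g.:
#   from huggingface_hub import list_repo_files
#   files = list_repo_files("unsloth/Llama-3.2-1B-GGUF")
files = [
    "README.md",
    "Llama-3.2-1B-Q4_K_M.gguf",
    "Llama-3.2-1B-Q8_0.gguf",
    ".gitattributes",
]
print(scan_gguf(files))
```

Filtering locally on the file listing keeps SCAN cheap: no model bytes are downloaded until BOOT is clicked.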
+ ## ⚖️ Current Limitations
+ - **Concurrency:** To maintain performance, vCPU slots are strictly managed. If the system is full, you will be placed in a queue.
+ - **Inactivity Timeout:** Users are automatically rotated out of the active slot after **20 seconds of inactivity** to free resources for the community.
+ - **Hardware Bottleneck:** On the base 2 vCPU tier, expect 1-5 TPS for BF16 models and 6-12 TPS for optimized quants.
+ 
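The slot/queue policy above (two active slots, 20-second inactivity rotation) could be implemented with per-user last-activity timestamps. A minimal sketch with an injected clock for testability; all names are hypothetical, not ZeroEngine's actual code:

```python
import collections

IDLE_LIMIT = 20.0  # seconds of inactivity before a slot is reclaimed
MAX_SLOTS = 2      # one active slot per vCPU


class SlotManager:
    """Hypothetical sketch of the slot/queue policy described above."""

    def __init__(self, clock):
        self.clock = clock            # callable returning the current time
        self.active = {}              # user -> last-activity timestamp
        self.queue = collections.deque()

    def touch(self, user):
        """Record activity; admit the user if a slot is free, else queue them."""
        now = self.clock()
        if user in self.active or len(self.active) < MAX_SLOTS:
            self.active[user] = now
        elif user not in self.queue:
            self.queue.append(user)

    def reap(self):
        """Kick users idle >= IDLE_LIMIT into the queue, then promote waiters."""
        now = self.clock()
        for user, last in list(self.active.items()):
            if now - last >= IDLE_LIMIT:
                del self.active[user]
                self.queue.append(user)
        while self.queue and len(self.active) < MAX_SLOTS:
            self.active[self.queue.popleft()] = now
```

Injecting the clock keeps the policy deterministic under test; in production `time.monotonic` would be passed in and `reap()` called periodically.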
+ ## 🏗️ Technical Stack
+ - **Inference:** `llama-cpp-python`
+ - **Frontend:** `Gradio 6.5.0`
+ - **Telemetry:** Custom JSON-based resource monitoring
+ - **License:** Apache 2.0
+ 
+ ---
+ *ZeroEngine is a personal open-source project dedicated to making LLM inference accessible on minimal hardware.*