---
title: ZeroEngine V0.2
emoji: πŸš€
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 6.5.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: 3.11
hf_oauth: true
hf_oauth_scopes:
- read-repos
- email
---

# πŸ›°οΈ ZeroEngine V0.1
**ZeroEngine** is a high-efficiency inference platform designed to push the limits of low-tier hardware. It demonstrates that with aggressive optimization, even a standard 2 vCPU instance can provide a responsive LLM experience.

## πŸš€ Key Features
- **Zero-Config GGUF Loading:** Scan and boot any compatible repository directly from the Hub.
- **Ghost Cache System:** Background tokenization and KV-cache priming for a near-instant first response (see the sketch below).
- **Resource Stewardship:** Integrated "Inactivity Session Killer" and 3-pass GC to ensure high availability on shared hardware.
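
The "ghost cache" idea amounts to tokenizing and evaluating a prompt prefix before the user hits send. Below is a minimal sketch of that pattern in `llama-cpp-python`, not ZeroEngine's actual code; the model path and `SYSTEM_PREFIX` are placeholders:

```python
import threading

from llama_cpp import Llama

# Placeholder path; any local GGUF file works here.
llm = Llama(model_path="model.Q4_K_M.gguf", n_ctx=2048)

SYSTEM_PREFIX = b"You are a helpful assistant.\n"  # placeholder prompt prefix

def prime_kv_cache() -> None:
    tokens = llm.tokenize(SYSTEM_PREFIX)  # background tokenization
    llm.eval(tokens)                      # evaluate the prefix, warming the KV cache

# Prime in the background so the first real request starts from a warm cache.
threading.Thread(target=prime_kv_cache, daemon=True).start()
```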

## πŸ› οΈ Usage
1. **Target Repo:** Enter a Hugging Face model repository (e.g., `unsloth/Llama-3.2-1B-GGUF`).
   - *Note: on the current 2 vCPU hardware, models larger than ~4B parameters are not recommended.*
2. **Scan:** Click **SCAN** to fetch the available `.gguf` quants (see the sketch after this list).
3. **Select Quant:** Choose your preferred file. `Q4_K_M` is recommended as the best balance of speed and output quality.
4. **Initialize:** Click **BOOT** to load the model into the kernel.
5. **Execute:** Start chatting. The engine tokenizes your input in the background while you type.
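
The scan-and-boot flow can be approximated with `huggingface_hub` and `llama-cpp-python`. This is a minimal sketch under those assumptions, not ZeroEngine's internals; `repo_id` is just the example from step 1:

```python
from huggingface_hub import HfApi
from llama_cpp import Llama

repo_id = "unsloth/Llama-3.2-1B-GGUF"

# SCAN: list the .gguf quants available in the repo.
quants = [f for f in HfApi().list_repo_files(repo_id) if f.endswith(".gguf")]
print(quants)

# BOOT: download the chosen quant and load it.
llm = Llama.from_pretrained(repo_id=repo_id, filename="*Q4_K_M.gguf", n_ctx=2048)

# EXECUTE: run a single chat turn.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```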

## βš–οΈ Current Limitations
- **Concurrency:** To maintain performance, vCPU slots are strictly managed. If the system is full, you will be placed in a queue.
- **Inactivity Timeout:** Users are automatically rotated out of the active slot after **20 seconds of inactivity** to free resources for the community (a simplified version is sketched below).
- **Hardware Bottleneck:** On the base 2 vCPU tier, expect roughly 1-5 tokens/s for BF16 models and 6-12 tokens/s for optimized quants.
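
The inactivity rotation reduces to a watchdog comparing a monotonic clock against the last interaction. A simplified sketch of the idea, not the actual session manager; `free_slot` is a hypothetical eviction callback:

```python
import threading
import time

INACTIVITY_LIMIT_S = 20.0  # matches the documented 20-second timeout

class SessionSlot:
    """Frees a vCPU slot once its user has been idle too long."""

    def __init__(self, free_slot) -> None:
        self._free_slot = free_slot  # hypothetical callback that evicts the user
        self._last_seen = time.monotonic()
        threading.Thread(target=self._watchdog, daemon=True).start()

    def touch(self) -> None:
        """Call on every user interaction to reset the idle timer."""
        self._last_seen = time.monotonic()

    def _watchdog(self) -> None:
        while True:
            time.sleep(1.0)
            if time.monotonic() - self._last_seen > INACTIVITY_LIMIT_S:
                self._free_slot()
                return
```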

## πŸ—οΈ Technical Stack
- **Inference:** `llama-cpp-python`
- **Frontend:** `Gradio 6.5.0`
- **Telemetry:** Custom JSON-based resource monitoring (sketched below)
- **License:** Apache 2.0
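
A JSON resource snapshot along these lines is straightforward to produce with `psutil` (an assumption; the README does not name the monitoring library):

```python
import json
import time

import psutil  # assumed dependency, not confirmed by the README

def resource_snapshot() -> str:
    """Serialize current CPU and RAM usage as one JSON telemetry record."""
    return json.dumps({
        "ts": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        "ram_percent": psutil.virtual_memory().percent,
    })

print(resource_snapshot())
```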

---
*ZeroEngine is a personal open-source project dedicated to making LLM inference accessible on minimal hardware.*