turtle170 committed · Commit 1a132e5 · verified · 1 Parent(s): 7ca413a

Update README.md

Files changed (1): README.md (+13 -6)
README.md CHANGED
@@ -11,10 +11,17 @@ license: apache-2.0
 python_version: 3.11
 ---
 
-# ZeroEngine V0.1 (Kernel)
-High-performance inference engine for 2-vCPU / 16GB RAM constraints.
 
-## Optimizations
-- **KV-Cache Stitching**: Asynchronous pre-evaluation of queue inputs.
-- **Hard Partitioning**: Dedicated core assignment per concurrent user.
-- **Memory Mapping**: weights mapped via `mmap` to preserve RAM for context.
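The `mmap` optimization in the removed list can be sketched with Python's standard library; this is an illustrative sketch, not ZeroEngine's actual loader, and the weights path is hypothetical:

```python
import mmap

def map_weights(path):
    """Map a weights file read-only. Pages are loaded lazily by the OS,
    so physical RAM stays free for KV-cache/context until bytes are touched."""
    with open(path, "rb") as f:
        # length=0 maps the entire file
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Because the mapping is read-only and demand-paged, the OS can also share and evict weight pages under memory pressure, which matters on a 16GB instance.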
+## Overview
+ZeroEngine demonstrates how low-tier hardware, such as the 2-vCPU instance provided by Hugging Face, can run a variety of models with ease.
 
+## Usage
+1. Enter your model repo (e.g. unsloth/gemma-3-1b-it-GGUF). [CAUTION: since ZeroEngine runs on low-tier hardware, it cannot run large models (>4B parameters).]
+2. Click 'SCAN' to list all the .gguf files in that repo.
+3. Click your preferred file (Q4_K_M gives the best performance, at about 6-12 tokens per second).
+4. Select 'BOOT' to load your model.
+5. Start chatting! The engine automatically pre-processes your query into tensors, speeding everything up.
+
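The SCAN step above (enumerating a repo's .gguf files) could be sketched with the `huggingface_hub` API. The `pick_gguf_files` helper and its quant-preference ordering are illustrative assumptions, not ZeroEngine's actual code:

```python
# Preference order for quantization tags (Q4_K_M first, per the usage notes).
# This ordering is an assumption for illustration only.
PREFERRED_QUANTS = ["Q4_K_M", "Q4_K_S", "Q5_K_M", "Q8_0"]

def pick_gguf_files(files):
    """Keep only .gguf files, ordered by quant preference (unknown quants last)."""
    def rank(name):
        for i, tag in enumerate(PREFERRED_QUANTS):
            if tag.lower() in name.lower():
                return i
        return len(PREFERRED_QUANTS)
    return sorted((f for f in files if f.endswith(".gguf")), key=rank)

def scan_repo(repo_id):
    """List the .gguf files in a Hugging Face model repo (network call)."""
    from huggingface_hub import HfApi  # lazy import; only needed for SCAN
    return pick_gguf_files(HfApi().list_repo_files(repo_id))
```

For example, `scan_repo("unsloth/gemma-3-1b-it-GGUF")` would surface the Q4_K_M file first, matching step 3 above.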
+## Limitations
+1. You may have to queue, as only 2 vCPUs are available. To prioritise performance, each vCPU is dedicated to one active user, so at most 2 users can be active at a time. Once a user has been idle for >=20 seconds, they are automatically moved back into the queue, freeing a slot.
+2. As the engine runs on low-tier hardware, expect 1-5 TPS on BF16 models.
+3. As the engine uses a shared chat template, some models, such as Gemma 3, will not work.
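The two-slot scheduling in limitation 1 could be sketched roughly as follows. The class and method names are hypothetical; only the 2-slot and 20-second-idle policy comes from the text above:

```python
import time
from collections import deque

IDLE_TIMEOUT = 20.0  # seconds of inactivity before a user is kicked to the queue
MAX_SLOTS = 2        # one slot per vCPU

class SlotManager:
    """Assigns up to MAX_SLOTS active users; idle users are kicked to the queue."""

    def __init__(self):
        self.active = {}      # user -> last-activity timestamp
        self.queue = deque()  # waiting users, FIFO

    def request(self, user, now=None):
        """User asks for (or refreshes) a slot; True if active, False if queued."""
        now = time.monotonic() if now is None else now
        self._evict_idle(now)
        if user in self.active:
            self.active[user] = now  # refresh activity timestamp
            return True
        if len(self.active) < MAX_SLOTS:
            self.active[user] = now
            return True
        if user not in self.queue:
            self.queue.append(user)
        return False

    def _evict_idle(self, now):
        """Kick users idle for >= IDLE_TIMEOUT, then promote waiters into free slots."""
        for user, last in list(self.active.items()):
            if now - last >= IDLE_TIMEOUT:
                del self.active[user]
                self.queue.append(user)
        while self.queue and len(self.active) < MAX_SLOTS:
            self.active[self.queue.popleft()] = now
```

With both slots busy, a third user is queued; once an active user idles past the timeout, the front of the queue is promoted into the freed slot.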