---
license: apache-2.0
tags:
- gguf
- text-generation
- edge-ai
- local-first
- micro-llm
- rag
model_creator: MLM8372984732947
model_name: Echo-Mini-22M-F16
pipeline_tag: text-generation
language:
- en
---
# 🚀 Echo-Mini (22M Parameters - F16 GGUF)
**Echo-Mini** is an ultra-lightweight micro-transformer model designed for low-power edge computing, local-first environments, and embedded system integration.
Unlike large cloud-hosted LLMs, Echo-Mini packs its weights, vocabulary, and tokenizer into a portable **~44MB footprint**, making it a practical foundation for private, low-latency on-device text tasks.
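The ~44MB figure is consistent with simple arithmetic: F16 stores each parameter in 2 bytes. The sketch below is only a back-of-the-envelope estimate. It assumes exactly 22 million parameters (taken from the model name), and `estimateWeightBytes` is a hypothetical helper, not part of any library. The real GGUF file also carries metadata and tokenizer data, so its size will differ slightly.

```typescript
// Hypothetical helper: raw weight size, ignoring GGUF metadata and tokenizer data.
function estimateWeightBytes(parameterCount: number, bytesPerParameter: number): number {
    return parameterCount * bytesPerParameter;
}

// Assumption: 22,000,000 parameters, each stored as a 2-byte Float16 value.
const f16Bytes = estimateWeightBytes(22_000_000, 2);
const f16Megabytes = f16Bytes / 1_000_000;

console.log(`${f16Megabytes} MB`); // 44 MB
```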
## ✨ Key Features
* **Zero Cloud Dependency:** Runs 100% locally on standard consumer devices, mobile processors, and edge systems.
* **Extreme Performance:** Reaches 300+ tokens/second on consumer CPUs with no GPU required. Actual throughput depends on the CPU, thread count, and context length.
* **Pristine Precision:** Stored as unquantized **Float16 (F16)** to avoid the formatting collapse, word-smashing, and attention loops that aggressive quantization often causes in models this small.
* **Self-Contained Architecture:** The GGUF container packages the weights, architectural metadata, and tokenizer configuration into a single file.
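
To check the tokens/second figure on your own hardware, a small meter like the one below can be dropped into any generation loop. This is a minimal sketch: `createThroughputMeter` is a hypothetical helper, and the injectable clock exists only so the logic can be exercised without a model.

```typescript
// Hypothetical helper: counts tokens and reports tokens per second.
// The clock is injectable so the meter can be tested deterministically.
function createThroughputMeter(now: () => number = () => Date.now()) {
    const startedAtMs = now();
    let lastTokenAtMs = startedAtMs;
    let tokenCount = 0;

    return {
        // Call once for every generated token.
        recordToken(): void {
            tokenCount += 1;
            lastTokenAtMs = now();
        },
        // Tokens generated per second between creation and the last token.
        tokensPerSecond(): number {
            const elapsedMs = lastTokenAtMs - startedAtMs;
            return elapsedMs > 0 ? (tokenCount * 1000) / elapsedMs : 0;
        }
    };
}

// Usage with a simulated clock: 300 tokens, one every 2 ms.
let fakeTimeMs = 0;
const meter = createThroughputMeter(() => fakeTimeMs);
for (let i = 0; i < 300; i++) {
    fakeTimeMs += 2;
    meter.recordToken();
}
console.log(meter.tokensPerSecond()); // 500
```

With a real model, create the meter just before the `for await` loop in the Quickstart and call `recordToken()` once per iteration. A meter created before generation starts also counts prompt evaluation time, so short runs will under-report steady-state speed.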
---
## 🧠 System Prompt Modes (The "Three Brains")
Echo-Mini selects its behavior from the mode tag on the `System` line of the prompt. For the best inference quality, set the mode explicitly before the user input.
### 1. `[CHAT]` or `[STORY]` Mode
Optimized for general conversation, textual interactions, or narrative generation.
```text
System: [CHAT]
User: Write a story about a girl cleaning up her toys.
Assistant:
```
### 2. `[CODE]` Mode
Switches to syntax-focused generation. Best suited to short snippets such as simple formatting, loops, and basic script structure.
```text
System: [CODE]
User: Write a python while loop to count to 10.
Assistant:
```
### 3. `[FACT]` or `[RAG]` Mode
Designed for context-grounded text extraction (Retrieval-Augmented Generation). Use this mode when piping external files, telemetry logs, or hardware documentation directly into the context window.
```text
System: [FACT]
Context: The vehicle requires 205/55 R19 tires for optimal performance.
User: What size tires do I need?
Assistant:
```
> ⚠️ **CRITICAL TOKENIZER WARNING:** Ensure your prompt ends exactly on the colon (`Assistant:`) with **no trailing space**. A space after the colon shifts the sub-word token boundaries, which can cause missing spaces or run-together words in the output.
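
The three prompt layouts and the trailing-space rule can be enforced in one place with a small builder. The sketch below is illustrative only: `EchoMode` and `buildPrompt` are hypothetical names, and it assumes each field fits on a single line, as in the examples above.

```typescript
// Hypothetical type covering the mode tags listed in this section.
type EchoMode = "CHAT" | "STORY" | "CODE" | "FACT" | "RAG";

// Builds a prompt that always ends exactly on "Assistant:" with no trailing space.
function buildPrompt(mode: EchoMode, userText: string, contextText?: string): string {
    const lines = [`System: [${mode}]`];

    // The Context line is only emitted when grounding text is supplied.
    if (contextText !== undefined && contextText.trim() !== "") {
        lines.push(`Context: ${contextText.trim()}`);
    }

    lines.push(`User: ${userText.trim()}`);
    lines.push("Assistant:");
    return lines.join("\n");
}

const factPrompt = buildPrompt(
    "FACT",
    "What size tires do I need? ",
    "The vehicle requires 205/55 R19 tires for optimal performance."
);
console.log(JSON.stringify(factPrompt));
```

Trimming the inputs and appending `Assistant:` last means a stray space in user input cannot leak to the end of the prompt.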
---
## 💻 Quickstart Implementation (Node.js / TypeScript)
You can run this model locally using `node-llama-cpp`. The example below assumes the v3 API (`getLlama`, `loadModel`, `createContext`, `getSequence`) in an ES module project. For clean streaming output, use a **sliding-window text decoder**: decode the full response buffer on every step and print only the new suffix, so that spaces at token boundaries are reconstructed correctly.
```typescript
import path from "path";
import {fileURLToPath} from "url";
import {getLlama, type Token} from "node-llama-cpp";

// node-llama-cpp v3 is ESM-only, so derive __dirname from import.meta.url
const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "model-f16.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

// Step 1: Format prompt strictly without a trailing space. Choose your Mode!
const prompt = "System: [CODE]\nUser: Write a python print statement.\nAssistant:";
const tokens = model.tokenize(prompt);

// Step 2: Inject BOS token if missing from sequence start
const bos = model.tokens.bos;
const finalTokens = bos == null || tokens[0] === bos ? tokens : [bos, ...tokens];

const responseTokens: Token[] = [];
let printedLength = 0;
console.log("Assistant stream started:\n");

// No repeatPenalty is passed, which retains natural structural text pacing
for await (const token of sequence.evaluate(finalTokens, {
    temperature: 0.7,
    topP: 0.95,
    topK: 50
})) {
    if (token === model.tokens.eos) break;
    responseTokens.push(token);

    // Dynamic window decoding prevents token boundary space stripping
    const fullText = model.detokenize(responseTokens);
    const textChunk = fullText.slice(printedLength);
    printedLength = fullText.length;
    process.stdout.write(textChunk);
}
```
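
To see why the sliding-window approach matters, the decoding step can be isolated from the model. The sketch below uses a toy detokenizer that, like many sub-word detokenizers, drops the leading space of whatever sequence it is given. Both `createStreamDecoder` and `toyDetokenize` are hypothetical illustrations and not part of `node-llama-cpp`.

```typescript
type Detokenize<T> = (tokens: readonly T[]) => string;

// Hypothetical helper: re-decodes the whole buffer and returns only the new suffix.
function createStreamDecoder<T>(detokenize: Detokenize<T>) {
    const buffer: T[] = [];
    let printedLength = 0;

    return function decodeNext(token: T): string {
        buffer.push(token);
        const fullText = detokenize(buffer);
        const chunk = fullText.slice(printedLength);
        printedLength = fullText.length;
        return chunk;
    };
}

// Toy detokenizer (assumption): "_" marks a word-leading space, and the
// single leading space of the decoded sequence is stripped.
const toyDetokenize: Detokenize<string> = (pieces) =>
    pieces.join("").replace(/_/g, " ").replace(/^ /, "");

const pieces = ["_Hello", "_world", "!"];

// Decoding each token on its own loses the space between words.
const perToken = pieces.map((piece) => toyDetokenize([piece])).join("");

// Decoding the growing buffer keeps it.
const decodeNext = createStreamDecoder(toyDetokenize);
const windowed = pieces.map((piece) => decodeNext(piece)).join("");

console.log(perToken); // Helloworld!
console.log(windowed); // Hello world!
```

Re-decoding the full buffer costs more work per token as the response grows, which is usually acceptable for the short outputs a model of this size produces.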
---
## 🎯 Intended Use Cases
* **Embedded Software & Robotics:** Native voice/text command parsing on low-spec hardware setups (e.g., Raspberry Pi controllers, microcontrollers, offline robotics).
* **On-Device Private Assistants:** Powering custom local input tools (such as privacy-focused Android IME keyboards) requiring absolute data isolation.
* **Micro-RAG Architectures:** Querying offline system manual files or parsing real-time configuration contexts directly at the edge.
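
As an illustration of the Micro-RAG use case, the sketch below picks the manual chunk that shares the most words with the question and places it on the `Context:` line of a `[FACT]` prompt. The keyword-overlap scoring is a deliberately naive stand-in chosen for this example, not a method shipped with the model, and `pickBestChunk`, `buildFactPrompt`, and the second sample chunk are hypothetical.

```typescript
// Lowercased, de-duplicated alphanumeric words of a text.
function uniqueWords(text: string): string[] {
    const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    return words.filter((word, index) => words.indexOf(word) === index);
}

// Naive retrieval (assumption): score each chunk by words shared with the question.
function pickBestChunk(question: string, chunks: readonly string[]): string | undefined {
    const questionWords = uniqueWords(question);
    let best: string | undefined;
    let bestScore = 0;

    for (const chunk of chunks) {
        const score = uniqueWords(chunk)
            .filter((word) => questionWords.indexOf(word) !== -1).length;
        if (score > bestScore) {
            bestScore = score;
            best = chunk;
        }
    }
    return best;
}

// Builds a [FACT] prompt that ends exactly on "Assistant:".
function buildFactPrompt(question: string, chunks: readonly string[]): string {
    const context = pickBestChunk(question, chunks) ?? "";
    return `System: [FACT]\nContext: ${context}\nUser: ${question}\nAssistant:`;
}

const manualChunks = [
    "The vehicle requires 205/55 R19 tires for optimal performance.",
    "Engine oil should be changed at regular service intervals." // made-up sample text
];
console.log(buildFactPrompt("What size tires do I need?", manualChunks));
```

Common words such as "the" or "do" can produce misleading matches with this scoring, so a real deployment would likely filter stop words or use a proper retrieval index.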
## 📄 License
This model and its weights are released under the **Apache 2.0 License**. You are free to modify, distribute, and embed them in proprietary and commercial products, subject to the license's attribution and notice requirements.