---
license: apache-2.0
tags:
- gguf
- text-generation
- edge-ai
- local-first
- micro-llm
- rag
model_creator: MLM8372984732947
model_name: Echo-Mini-22M-F16
pipeline_tag: text-generation
language:
- en
---

# 🚀 Echo-Mini (22M Parameters - F16 GGUF)

**Echo-Mini** is a lightweight 22M-parameter micro-transformer built for low-power edge computing, local-first environments, and embedded systems.

Unlike large cloud-hosted LLMs, Echo-Mini packs its weights, vocabulary, and tokenizer into a portable **~44MB file**, making it a practical foundation for private, low-latency on-device text tasks.
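
The ~44MB figure follows from simple arithmetic: F16 stores each weight in 2 bytes. The sketch below assumes exactly 22,000,000 parameters and ignores GGUF metadata and tokenizer data, which add a little on top; `estimateWeightBytes` is a hypothetical helper, not part of any library.

```typescript
// Hypothetical helper (not part of the model's tooling): estimates raw weight
// storage from a parameter count and the number of bytes used per parameter.
function estimateWeightBytes(paramCount: number, bytesPerParam: number): number {
    return paramCount * bytesPerParam;
}

// Assumption: "22M" is treated as exactly 22,000,000 parameters.
const f16Bytes = estimateWeightBytes(22_000_000, 2); // F16 = 2 bytes per weight
const f16Megabytes = f16Bytes / 1_000_000;           // decimal megabytes

console.log(`${f16Megabytes} MB`); // prints "44 MB"
```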

## ✨ Key Features

* **Zero Cloud Dependency:** Runs 100% locally on standard consumer devices, mobile processors, and edge systems.
* **Extreme Performance:** Generates 300+ tokens/second on consumer CPUs, with no GPU required.
* **Pristine Precision:** Stored in unquantized **Float16 (F16)** to avoid the formatting collapse, word-smashing, and attention loops often seen in quantized models of this size.
* **Self-Contained Architecture:** The GGUF container packages all architectural metadata and tokenizer configurations into a single, unified binary.
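
Throughput depends on the host CPU, so it can be worth measuring on your own hardware. The following is a minimal sketch, assuming the tokens arrive as any async iterable; `measureTokensPerSecond` and its injectable clock are hypothetical names and are not tied to a specific inference library.

```typescript
// Hypothetical helper: counts the items produced by any async token stream
// and reports tokens per second. The clock is injectable to ease testing.
async function measureTokensPerSecond(
    stream: AsyncIterable<unknown>,
    now: () => number = () => performance.now()
): Promise<{tokens: number, seconds: number, tokensPerSecond: number}> {
    const start = now();
    let tokens = 0;
    for await (const _token of stream) tokens++;
    const seconds = (now() - start) / 1000;
    return {
        tokens,
        seconds,
        tokensPerSecond: seconds > 0 ? tokens / seconds : Infinity
    };
}
```

Passing the generation loop's token stream to this function gives a rate you can compare against the figure above.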

---

## 🧠 System Prompt Modes (The "Three Brains")

Echo-Mini changes its behavior based on the mode tag given on the `System:` line of the prompt. For the best output quality, set the mode explicitly before the user input.

### 1. `[CHAT]` or `[STORY]` Mode

Optimized for general conversation, textual interactions, or narrative generation.

```text
System: [CHAT]
User: Write a story about a girl cleaning up her toys.
Assistant:
```

### 2. `[CODE]` Mode

Triggers syntax-focused generation. Best suited to short snippets such as loops, simple formatting, and basic script structures.

```text
System: [CODE]
User: Write a python while loop to count to 10.
Assistant:
```

### 3. `[FACT]` or `[RAG]` Mode

Designed for context-grounded text extraction (Retrieval-Augmented Generation). Use this mode when piping external files, telemetry logs, or hardware documentation directly into the context window.

```text
System: [FACT]
Context: The vehicle requires 205/55 R19 tires for optimal performance.
User: What size tires do I need?
Assistant:
```
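
The three layouts above share one shape, so they can be assembled by a small helper. This is a minimal sketch: `buildPrompt` and `EchoMode` are hypothetical names that do not ship with the model, and the only behavior assumed is the line layout shown in the examples. The helper always ends the prompt exactly on `Assistant:` with nothing after the colon.

```typescript
// Mode tags taken from the section headings above.
type EchoMode = "CHAT" | "STORY" | "CODE" | "FACT" | "RAG";

// Hypothetical helper (not shipped with the model): builds a prompt in the
// System/Context/User/Assistant layout and always ends it on "Assistant:".
function buildPrompt(mode: EchoMode, user: string, context?: string): string {
    const lines = [`System: [${mode}]`];
    if (context !== undefined && context.trim() !== "") {
        lines.push(`Context: ${context.trim()}`);
    }
    lines.push(`User: ${user.trim()}`);
    lines.push("Assistant:"); // no trailing space or newline after the colon
    return lines.join("\n");
}

const factPrompt = buildPrompt(
    "FACT",
    "What size tires do I need?",
    "The vehicle requires 205/55 R19 tires for optimal performance."
);
console.log(JSON.stringify(factPrompt));
```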

> ⚠️ **CRITICAL TOKENIZER WARNING:** Ensure your prompt ends exactly on the colon (`Assistant:`) with **no trailing space**. A space after the colon shifts the sub-word token boundaries, which leads to missing spaces between words or words run together in the output.

---

## 💻 Quickstart Implementation (Node.js / TypeScript)

You can run this model locally using `node-llama-cpp`. For clean streaming output, use a **sliding-window text decoder**: re-decode the accumulated response tokens on each step and print only the new characters, so that spaces at token boundaries are preserved.

```typescript
// Written against the node-llama-cpp v3 API (ES module, top-level await).
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, type Token} from "node-llama-cpp";

// __dirname is not defined in ES modules, so derive it from import.meta.url
const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "model-f16.gguf")
});

const context = await model.createContext();
const sequence = context.getSequence();

// Step 1: Format prompt strictly without a trailing space. Choose your Mode!
const prompt = `System: [CODE]\nUser: Write a python print statement.\nAssistant:`;
const tokens = model.tokenize(prompt);

// Step 2: Inject BOS token if the model defines one and it is missing from the start
const bos = model.tokens.bos;
const finalTokens = bos == null || tokens[0] === bos ? tokens : [bos, ...tokens];

const responseTokens: Token[] = [];
let printedLength = 0;

console.log("Assistant stream started:\n");

// repeatPenalty is left unset to retain natural structural text pacing
for await (const token of sequence.evaluate(finalTokens, {
    temperature: 0.7,
    topP: 0.95,
    topK: 50
})) {
    if (token === model.tokens.eos) break;
    responseTokens.push(token);

    // Re-decode the whole response and print only the new characters;
    // this prevents token boundary space stripping
    const fullText = model.detokenize(responseTokens);
    const textChunk = fullText.slice(printedLength);
    printedLength = fullText.length;

    process.stdout.write(textChunk);
}
```
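
To see why the decoder re-decodes the whole response instead of decoding each token alone, the sketch below runs both approaches against a stand-in detokenizer. It assumes a SentencePiece-style vocabulary in which `▁` marks a leading word space and the space at the very start of the text is dropped; whether Echo-Mini's tokenizer behaves exactly this way is an assumption. `fakeDetokenize` and `IncrementalDecoder` are hypothetical names.

```typescript
// Stand-in detokenizer for illustration only: "▁" marks a leading word space,
// and a single space at the very start of the decoded text is dropped.
function fakeDetokenize(pieces: string[]): string {
    return pieces.join("").replace(/▁/g, " ").replace(/^ /, "");
}

// Hypothetical helper: re-decodes the whole response on every step and
// returns only the characters that have not been emitted yet.
class IncrementalDecoder<T> {
    private tokens: T[] = [];
    private emitted = 0;
    private detokenize: (tokens: T[]) => string;

    constructor(detokenize: (tokens: T[]) => string) {
        this.detokenize = detokenize;
    }

    push(token: T): string {
        this.tokens.push(token);
        const full = this.detokenize(this.tokens);
        const chunk = full.slice(this.emitted);
        this.emitted = full.length;
        return chunk;
    }
}

const pieces = ["▁print", "(", "\"hi\"", ")", "▁#", "▁done"];

// Decoding each token on its own loses the spaces between words.
const naive = pieces.map((p) => fakeDetokenize([p])).join("");

// Decoding the growing sequence and emitting the difference keeps them.
const decoder = new IncrementalDecoder<string>(fakeDetokenize);
const streamed = pieces.map((p) => decoder.push(p)).join("");

console.log(naive);    // print("hi")#done
console.log(streamed); // print("hi") # done
```

Re-decoding the full sequence costs more work per token as the response grows, which is usually acceptable for the short outputs a model of this size produces.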

---

## 🎯 Intended Use Cases

* **Embedded Software & Robotics:** Native voice/text command parsing on low-spec hardware setups (e.g., Raspberry Pi controllers, microcontrollers, offline robotics).
* **On-Device Private Assistants:** Powering custom local input tools (such as privacy-focused Android IME keyboards) requiring absolute data isolation.
* **Micro-RAG Architectures:** Querying offline system manual files or parsing real-time configuration contexts directly at the edge.
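
As an illustration of the micro-RAG case, the sketch below picks the most relevant paragraph from an offline manual using a naive keyword-overlap score and places it in a `[FACT]` prompt. All function names are hypothetical, the scoring method is a simple stand-in rather than anything the model requires, and the second manual paragraph is made-up sample text.

```typescript
// Hypothetical helper: splits a manual into paragraphs on blank lines.
function splitIntoChunks(manual: string): string[] {
    return manual.split(/\n\s*\n/).map((c) => c.trim()).filter((c) => c.length > 0);
}

// Lowercased set of words, used for a naive keyword-overlap score.
function wordSet(text: string): Set<string> {
    return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Returns the chunk that shares the most distinct words with the question.
function pickBestChunk(chunks: string[], question: string): string {
    const queryWords = wordSet(question);
    let best = "";
    let bestScore = -1;
    for (const chunk of chunks) {
        const chunkWords = wordSet(chunk);
        let score = 0;
        for (const word of queryWords) {
            if (chunkWords.has(word)) score++;
        }
        if (score > bestScore) {
            best = chunk;
            bestScore = score;
        }
    }
    return best;
}

// The first paragraph comes from the [FACT] example; the second is made up.
const manual = [
    "The vehicle requires 205/55 R19 tires for optimal performance.",
    "Engine oil should be replaced at every scheduled service."
].join("\n\n");

const question = "What size tires do I need?";
const bestChunk = pickBestChunk(splitIntoChunks(manual), question);
const ragPrompt = `System: [FACT]\nContext: ${bestChunk}\nUser: ${question}\nAssistant:`;

console.log(ragPrompt);
```

Keeping the retrieved context to a single short paragraph suits a model of this size; a real deployment would likely swap the keyword score for a stronger retrieval method.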

## 📄 License

This model and its compiled weights are open-sourced under the **Apache 2.0 License**. You are free to modify, distribute, and embed this architecture within proprietary and commercial products.