---
license: apache-2.0
tags:
- gguf
- text-generation
- edge-ai
- local-first
- micro-llm
- rag
model_creator: MLM8372984732947
model_name: Echo-Mini-22M-F16
pipeline_tag: text-generation
language:
- en
---

# 🚀 Echo-Mini (22M Parameters - F16 GGUF)

**Echo-Mini** is an ultra-lightweight micro-transformer model designed for low-power edge computing, local-first environments, and embedded system integration. Unlike large cloud-hosted LLMs, Echo-Mini packs its vocabulary, tokenizer, and attention weights into a portable **~44MB footprint** (22M parameters at 2 bytes each in F16), making it a compact foundation for private, low-latency on-device text tasks.

## ✨ Key Features

* **Zero Cloud Dependency:** Runs 100% locally on standard consumer devices, mobile processors, and edge systems.
* **Extreme Performance:** Reaches 300+ tokens/second on consumer CPUs, with no GPU required.
* **Pristine Precision:** Stored in unquantized **Float16 (F16)** to avoid the formatting collapse, word-smashing, and attention loops that often appear in quantized variants of very small models.
* **Self-Contained Architecture:** The GGUF container packages all architectural metadata and tokenizer configuration into a single binary file.

---

## 🧠 System Prompt Modes (The "Three Brains")

Echo-Mini conditions its behavior on the mode tag given in the `System` line of the prompt. For the best output quality, state the mode explicitly before the user input.

### 1. `[CHAT]` or `[STORY]` Mode

Optimized for general conversation and narrative generation.

```text
System: [CHAT]
User: Write a story about a girl cleaning up her toys.
Assistant:
```

### 2. `[CODE]` Mode

Triggers syntax-focused generation. Best suited to simple code formatting, loops, and short scripts.

```text
System: [CODE]
User: Write a python while loop to count to 10.
Assistant:
```

### 3. `[FACT]` or `[RAG]` Mode

Designed for context-grounded text extraction (Retrieval-Augmented Generation). Use this mode when piping external files, telemetry logs, or hardware documentation directly into the context window.

```text
System: [FACT]
Context: The vehicle requires 205/55 R19 tires for optimal performance.
User: What size tires do I need?
Assistant:
```

> ⚠️ **CRITICAL TOKENIZER WARNING:** Ensure your prompt ends exactly on the colon (`Assistant:`) with **no trailing space**. If a space is left after the colon, the sub-word tokenizer will misalign, leading to missing spaces between words or merged words.

---

## 💻 Quickstart Implementation (Node.js / TypeScript)

You can run this model locally using `node-llama-cpp`. The example below is written against the `node-llama-cpp` v3 API (`getLlama`, `loadModel`, `getSequence`) and must run as an ES module because it uses top-level `await`. For clean streaming output, it decodes the accumulated response tokens on every step and prints only the newly added suffix, which preserves the spaces that sit on token boundaries.

```typescript
import path from "path";
import {fileURLToPath} from "url";
import {getLlama, type Token} from "node-llama-cpp";

// ES modules have no __dirname, so derive it from the module URL.
const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "model-f16.gguf")
});
const context = await model.createContext();
const sequence = context.getSequence();

// Step 1: Format the prompt strictly without a trailing space. Choose your mode!
const prompt = `System: [CODE]\nUser: Write a python print statement.\nAssistant:`;
const tokens = model.tokenize(prompt);

// Step 2: Inject the BOS token if it is missing from the sequence start
const bos = model.tokens.bos;
const finalTokens = (bos == null || tokens[0] === bos)
    ? tokens
    : [bos, ...tokens];

const responseTokens: Token[] = [];
let printedLength = 0;

console.log("Assistant stream started:\n");

// No repeat penalty is passed, to retain natural structural text pacing.
for await (const token of sequence.evaluate(finalTokens, {
    temperature: 0.7,
    topP: 0.95,
    topK: 50
})) {
    if (token === model.tokens.eos) break;

    responseTokens.push(token);

    // Re-decode the whole response and print only the new suffix, so that
    // spaces on token boundaries are not stripped.
    const fullText = model.detokenize(responseTokens);
    const textChunk = fullText.slice(printedLength);
    printedLength = fullText.length;

    process.stdout.write(textChunk);
}
```

---

## 🎯 Intended Use Cases

* **Embedded Software & Robotics:** Native voice/text command parsing on low-spec hardware (e.g., Raspberry Pi controllers, microcontrollers, offline robotics).
* **On-Device Private Assistants:** Powering custom local input tools (such as privacy-focused Android IME keyboards) that require full data isolation.
* **Micro-RAG Architectures:** Querying offline system manuals or parsing real-time configuration contexts directly at the edge.

## 📄 License

This model and its compiled weights are released under the **Apache 2.0 License**. You are free to modify, distribute, and embed them within proprietary and commercial products, subject to the license's attribution and notice requirements.
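
## 🧩 Prompt Builder Sketch

The prompt modes and the trailing-space warning above can be enforced in one place with a small helper. The following is a minimal sketch, not part of Echo-Mini or `node-llama-cpp`: the names `EchoMode`, `PromptParts`, and `buildPrompt` are hypothetical, and it assumes that each prompt field sits on its own line, with `Context:` placed between `System:` and `User:` as in the `[FACT]` example.

```typescript
// Hypothetical helper for illustration only; not shipped with the model.
type EchoMode = "CHAT" | "STORY" | "CODE" | "FACT" | "RAG";

interface PromptParts {
    mode: EchoMode;
    user: string;
    context?: string; // optional grounding text, as used in the [FACT] example
}

function buildPrompt({mode, user, context}: PromptParts): string {
    const lines: string[] = [`System: [${mode}]`];

    // Assumption: the Context line goes between System and User.
    if (context !== undefined && context.trim() !== "") {
        lines.push(`Context: ${context.trim()}`);
    }

    lines.push(`User: ${user.trim()}`);

    // The prompt must end exactly on the colon, with no trailing space.
    lines.push("Assistant:");

    return lines.join("\n");
}

const demo = buildPrompt({
    mode: "FACT",
    context: "The vehicle requires 205/55 R19 tires for optimal performance.",
    user: "What size tires do I need?"
});
console.log(JSON.stringify(demo));
```

The returned string can be passed straight to the tokenizer. Trimming the user and context fields is a design choice of this sketch: it keeps stray whitespace in caller input from reaching the tokenizer, while the final line is a fixed literal so the prompt always ends on `Assistant:`.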