| license: apache-2.0 | |
| tags: | |
| - gguf | |
| - text-generation | |
| - edge-ai | |
| - local-first | |
| - micro-llm | |
| - rag | |
| model_creator: MLM8372984732947 | |
| model_name: Echo-Mini-22M-F16 | |
| pipeline_tag: text-generation | |
| language: | |
| - en | |
| # Echo-Mini (22M Parameters - F16 GGUF) | |
| **Echo-Mini** is a 22M-parameter micro-transformer model designed for low-power edge computing, local-first environments, and embedded systems. | |
| Unlike large cloud-hosted LLMs, Echo-Mini packs its weights, vocabulary, and tokenizer into a single portable file of **~44MB**, making it a practical foundation for private, low-latency on-device text tasks. | |
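The ~44MB figure is consistent with the parameter count and the F16 storage width: 22 million weights at 2 bytes each. A minimal sketch of that arithmetic follows; the function name `estimateWeightBytes` is illustrative, and the estimate ignores GGUF metadata and tokenizer data, which this card does not break down.

```typescript
// Rough size estimate for an unquantized weight file.
// Assumption: every parameter is stored as a 2-byte Float16 value;
// GGUF metadata and tokenizer data are not counted.
function estimateWeightBytes(parameterCount: number, bytesPerParameter: number): number {
    return parameterCount * bytesPerParameter;
}

const weightBytes = estimateWeightBytes(22_000_000, 2);
const weightMegabytes = weightBytes / 1_000_000; // decimal megabytes
console.log(`${weightMegabytes} MB`); // prints "44 MB"
```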
| ## ✨ Key Features | |
| * **Zero Cloud Dependency:** Runs 100% locally on standard consumer devices, mobile processors, and edge systems. | |
| * **Extreme Performance:** Reaches 300+ tokens/second on consumer CPUs, with no GPU required. | |
| * **Pristine Precision:** Stored as unquantized **Float16 (F16)** to avoid the formatting collapse, word-smashing, and attention loops that are common in heavily quantized variants of very small models. | |
| * **Self-Contained Architecture:** The GGUF container packages all architectural metadata and tokenizer configurations into a single, unified binary. | |
| --- | |
| ## 🧠 System Prompt Modes (The "Three Brains") | |
| Echo-Mini changes its behavior based on the mode tag on the `System` line of the prompt. For the best output quality, set the mode explicitly before the user input. | |
| ### 1. `[CHAT]` or `[STORY]` Mode | |
| Optimized for general conversation, textual interactions, or narrative generation. | |
| ```text | |
| System: [CHAT] | |
| User: Write a story about a girl cleaning up her toys. | |
| Assistant: | |
| ``` | |
| ### 2. `[CODE]` Mode | |
| Triggers syntax-focused generation. Best suited to simple code formatting, loops, and short scripts. | |
| ```text | |
| System: [CODE] | |
| User: Write a python while loop to count to 10. | |
| Assistant: | |
| ``` | |
| ### 3. `[FACT]` or `[RAG]` Mode | |
| Designed for context-grounded text extraction (Retrieval-Augmented Generation). Use this mode when piping external files, telemetry logs, or hardware documentation directly into the context window. | |
| ```text | |
| System: [FACT] | |
| Context: The vehicle requires 205/55 R19 tires for optimal performance. | |
| User: What size tires do I need? | |
| Assistant: | |
| ``` | |
| > ⚠️ **CRITICAL TOKENIZER WARNING:** End the prompt exactly on the colon (`Assistant:`) with **no trailing space**. A space left after the colon misaligns the sub-word tokenizer, which leads to missing spaces between words or merged words. | |
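One way to avoid the trailing-space problem is to assemble every prompt through a single helper. The sketch below is only an illustration of the layout shown in the three mode examples: `EchoMode` and `buildPrompt` are hypothetical names, not part of any library, and the optional `Context:` line follows the `[FACT]` example.

```typescript
// Hypothetical helper: builds a prompt in the System/User/Assistant
// layout used by the mode examples in this card.
type EchoMode = "CHAT" | "STORY" | "CODE" | "FACT" | "RAG";

function buildPrompt(mode: EchoMode, userMessage: string, context?: string): string {
    const lines: string[] = [`System: [${mode}]`];
    // The Context line is only emitted when grounding text is supplied.
    if (context !== undefined && context.trim() !== "") {
        lines.push(`Context: ${context.trim()}`);
    }
    lines.push(`User: ${userMessage.trim()}`);
    // End exactly on the colon: no trailing space or newline.
    lines.push("Assistant:");
    return lines.join("\n");
}

const factPrompt = buildPrompt(
    "FACT",
    "What size tires do I need?",
    "The vehicle requires 205/55 R19 tires for optimal performance."
);
console.log(JSON.stringify(factPrompt));
```

Because the final line is a fixed literal, stray whitespace in the user message or context cannot end up after the colon.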
| --- | |
| ## 💻 Quickstart Implementation (Node.js / TypeScript) | |
| You can run this model locally using `node-llama-cpp`. For clean streaming output, use a **sliding-window text decoder**: detokenize the whole response on every step and print only the new suffix, so that spaces at token boundaries are reconstructed correctly. | |
| ```typescript | |
| // Assumes node-llama-cpp v3, which is ESM-only and loads models through getLlama(). | |
| import {getLlama, type Token} from "node-llama-cpp"; | |
| import path from "path"; | |
| import {fileURLToPath} from "url"; | |
| // __dirname is not defined in ES modules, so derive it from import.meta.url. | |
| const __dirname = path.dirname(fileURLToPath(import.meta.url)); | |
| const llama = await getLlama(); | |
| const model = await llama.loadModel({ | |
|     modelPath: path.join(__dirname, "model-f16.gguf") | |
| }); | |
| const context = await model.createContext(); | |
| const sequence = context.getSequence(); | |
| // Step 1: Format prompt strictly without a trailing space. Choose your Mode! | |
| const prompt = `System: [CODE]\nUser: Write a python print statement.\nAssistant:`; | |
| const tokens = model.tokenize(prompt); | |
| // Step 2: Inject BOS token if missing from sequence start | |
| const bos = model.tokens.bos; | |
| const finalTokens = bos == null || tokens[0] === bos ? tokens : [bos, ...tokens]; | |
| const responseTokens: Token[] = []; | |
| let printedLength = 0; | |
| console.log("Assistant stream started:\n"); | |
| // No repeatPenalty option is passed, to retain natural structural text pacing. | |
| for await (const token of sequence.evaluate(finalTokens, { | |
|     temperature: 0.7, | |
|     topP: 0.95, | |
|     topK: 50 | |
| })) { | |
|     if (token === model.tokens.eos) break; | |
|     responseTokens.push(token); | |
|     // Re-decode the whole response so spaces at token boundaries are not stripped | |
|     const fullText = model.detokenize(responseTokens); | |
|     const textChunk = fullText.slice(printedLength); | |
|     printedLength = fullText.length; | |
|     process.stdout.write(textChunk); | |
| } | |
| ``` | |
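The suffix-printing step in the loop above can be isolated from the model and tested on its own. The sketch below uses a stand-in detokenizer, `fakeDetokenize`, which is an assumption made for illustration: it treats `_` as a leading-space marker and drops the space at the very start of a sequence. It is not how `node-llama-cpp` detokenizes, but it reproduces the kind of boundary-space loss the warning describes.

```typescript
// Stand-in detokenizer (assumption, for illustration only): "_" marks a
// leading space, and a space at the very start of the text is dropped.
function fakeDetokenize(pieces: string[]): string {
    return pieces.join("").replace(/_/g, " ").replace(/^ /, "");
}

// Returns a function that accepts one token piece at a time and yields
// only the text that has not been emitted yet.
function createIncrementalDecoder(
    detokenize: (pieces: string[]) => string
): (piece: string) => string {
    const pieces: string[] = [];
    let printedLength = 0;
    return (piece: string): string => {
        pieces.push(piece);
        const fullText = detokenize(pieces); // decode everything so far
        const chunk = fullText.slice(printedLength); // keep the new suffix
        printedLength = fullText.length;
        return chunk;
    };
}

const samplePieces = ["_print", "(", "\"hi\"", ")", "_#", "_done"];

// Decoding each token on its own loses the spaces at token boundaries.
const naiveText = samplePieces.map((p) => fakeDetokenize([p])).join("");

// Decoding the growing sequence and emitting only the suffix keeps them.
const pushPiece = createIncrementalDecoder(fakeDetokenize);
const streamedText = samplePieces.map((p) => pushPiece(p)).join("");

console.log(naiveText);    // print("hi")#done
console.log(streamedText); // print("hi") # done
```

Re-decoding the whole response on every token costs more as the output grows, which is a reasonable trade for short generations.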
| --- | |
| ## 🎯 Intended Use Cases | |
| * **Embedded Software & Robotics:** Native voice/text command parsing on low-spec hardware setups (e.g., Raspberry Pi controllers, microcontrollers, offline robotics). | |
| * **On-Device Private Assistants:** Powering custom local input tools (such as privacy-focused Android IME keyboards) requiring absolute data isolation. | |
| * **Micro-RAG Architectures:** Querying offline system manual files or parsing real-time configuration contexts directly at the edge. | |
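As an illustration of the Micro-RAG use case, the sketch below picks one passage from an offline manual and places it in the `[FACT]` prompt layout described earlier in this card. The retrieval step is an assumption: a simple word-overlap score stands in for whatever retriever a real system would use, and `toWordSet`, `pickPassage`, and the sample manual lines are hypothetical.

```typescript
// Lowercases the text and collects its distinct alphanumeric words.
function toWordSet(text: string): Set<string> {
    return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Hypothetical retrieval step: returns the passage sharing the most
// distinct words with the query, or undefined when nothing overlaps.
function pickPassage(query: string, passages: string[]): string | undefined {
    const queryWords = toWordSet(query);
    let best: string | undefined;
    let bestScore = 0;
    for (const passage of passages) {
        const passageWords = toWordSet(passage);
        let score = 0;
        queryWords.forEach((word) => {
            if (passageWords.has(word)) score++;
        });
        if (score > bestScore) {
            bestScore = score;
            best = passage;
        }
    }
    return best;
}

const manualLines = [
    "Check the oil level every 5,000 km.",
    "The vehicle requires 205/55 R19 tires for optimal performance.",
    "Replace the cabin air filter once a year.",
];
const question = "What size tires do I need?";
const retrieved = pickPassage(question, manualLines) ?? "";
// The prompt ends on the colon with no trailing space.
const ragPrompt = `System: [FACT]\nContext: ${retrieved}\nUser: ${question}\nAssistant:`;
console.log(ragPrompt);
```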
| ## License | |
| This model and its compiled weights are released under the **Apache 2.0 License**. Subject to the license terms, you are free to modify, distribute, and embed this model within proprietary and commercial products. | |