Text Generation
LiteRT-LM
English
custom
hermes-edge
mobile-ai
on-device
ios
iphone-16
apple-neural-engine
deepseek
dspark
speculative-decoding
hermes-agent
tool-calling
raven-ecosystem
Instructions to use bclermo/hermes-edge with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- LiteRT-LM
How to use bclermo/hermes-edge with LiteRT-LM:
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM) # and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter). # For platform-specific integration guides, please refer to the official developer website: # https://ai.google.dev/edge/litert-lm # To try LiteRT-LM, the easiest way is to use our CLI tool. # 1. Install the LiteRT-LM CLI tool: pip install litert-lm # 2. Download and run this model locally: # See: https://ai.google.dev/edge/litert-lm/cli litert-lm run \ --from-huggingface-repo=bclermo/hermes-edge \ model.litertlm \ --prompt="Write me a poem"
- Notebooks
- Google Colab
- Kaggle
π¦ Hermes Edge
On-device AI agent for iPhone 16 + Android β fully offline via LiteRT-LM.
π± Install on iPhone 16 (1 Tap)
https://huggingface.co/bclermo/hermes-edge/resolve/main/dist/hermes-mobile-270m-int4.litertlm
- Open Google AI Edge Gallery app on your iPhone 16
- Tap Import Model
- Paste the URL above
- The model auto-downloads and runs on A18 Pro Neural Engine
Requirements: iOS 18.2+, iPhone 16/16 Pro, LiteRT-LM runtime (bundled with Gallery).
π§ Architecture
Hermes Edge combines three advanced AI techniques:
1. DeepSeek-Style Reasoning
Chain-of-thought reasoning inspired by DeepSeek-R1 and DeepSeek-V4:
- Internal reasoning in
<think>...</think>tags - Step-by-step problem decomposition
- Self-verification of intermediate results
- Compatible with tool calling within reasoning traces
2. Hermes Tool Calling
NousResearch-compatible function calling format:
<tool_call>{"name": "calculator", "arguments": {"expr": "2+2"}}</tool_call>
<tool_response>{"name": "calculator", "content": "4"}</tool_response>
3. DSpark Speculative Decoding
Inspired by DeepSeek's DSpark framework β a lightweight draft model predicts K=4 tokens ahead, verified in a single pass by the main model. Up to 2.5Γ speedup with identical output quality (lossless).
π Performance (iPhone 16 Pro β A18 Pro)
| Model Variant | Speed | RAM | Size | DSpark Speedup |
|---|---|---|---|---|
| 270M INT4 | ~55 tok/s | ~180 MB | 180 MB | 2.1Γ |
| 500M INT4 | ~40 tok/s | ~320 MB | 320 MB | 2.3Γ |
| 1B INT4 | ~25 tok/s | ~650 MB | 650 MB | 2.5Γ |
π§ Build Your Own Model
# Install
pip install litert-torch torch transformers sentencepiece
# Convert any HuggingFace model to .litertlm
litert-torch export_hf \
--model=Qwen/Qwen2.5-0.5B-Instruct \
--output_dir=./dist \
--quantization=dynamic_wi4_afp32 \
--cache_length=2048 \
--prefill_lengths=32
Or use the Makefile:
make convert-270m # Qwen2.5-0.5B β 270M INT4
make convert-500m # Qwen2.5-1.5B β 500M INT4
make convert-1b # Qwen3-0.6B β 1B INT4
π Quick Start
from hermes.litert_model import LiteRTModel
from hermes.agent import HermesAgent, AgentConfig
from hermes.chat_template import build_prompt, Message
model = LiteRTModel("dist/hermes-mobile-270m-int4.litertlm")
model.load()
agent = HermesAgent(model, config=AgentConfig(use_reasoning=True, use_speculative_decoding=True))
response = agent.run("What is 15% of 80?")
print(response)
# <think>Let me calculate 15% of 80...
# 10% of 80 = 8, 5% of 80 = 4, so 15% = 8 + 4 = 12</think>
# 15% of 80 is 12.
π§© Components
| Module | Description |
|---|---|
hermes/litert_model.py |
LiteRT-LM runtime wrapper (Python) |
hermes/agent.py |
Agent loop: reasoning β tools β response |
hermes/config.py |
Model architecture configuration |
hermes/chat_template.py |
ChatML + tool calling format |
scripts/convert_hf_to_litertlm.py |
HF β .litertlm converter |
scripts/deepseek_reasoning_template.py |
DeepSeek-style reasoning templates |
scripts/hermes_tool_format.py |
Hermes tool calling format |
scripts/dspark_draft.py |
DSpark-inspired speculative decoding |
hf-space/app.py |
Gradio demo Space |
π Requirements
- Python 3.11+
- LiteRT-LM runtime (for inference)
- litert-torch (for conversion)
- torch + transformers + sentencepiece
π License
Apache 2.0 β see LICENSE.
Hermes Edge Β· Built on Raven AI Ecosystem Β· Barry Clerjuste