Instructions for using Rapid42/gemma-4-E2B-it-MLX with libraries and local apps.
How to use Rapid42/gemma-4-E2B-it-MLX with MLX:

```python
# Make sure mlx-vlm is installed:
# pip install --upgrade mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("Rapid42/gemma-4-E2B-it-MLX")
config = load_config("Rapid42/gemma-4-E2B-it-MLX")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
```
How to use Rapid42/gemma-4-E2B-it-MLX with Pi:
Start the MLX server:

```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Rapid42/gemma-4-E2B-it-MLX"
```
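Because `mlx_lm.server` speaks the OpenAI chat-completions protocol, any HTTP client can talk to it, not just a coding agent. A minimal standard-library sketch, assuming the server listens on `http://localhost:8080/v1` (the address used in the Pi config on this card) and accepts the usual `model`, `messages` and `max_tokens` fields; `build_chat_request`, `extract_reply` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: default mlx_lm.server address, matching the Pi config on this card.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "Rapid42/gemma-4-E2B-it-MLX"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(resp.read())
```

With the server running, `chat("Summarize MLX in one sentence.")` returns the reply text. Keeping request building separate from sending makes the payload easy to inspect without a live server.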
Configure the model in Pi:

```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Rapid42/gemma-4-E2B-it-MLX" }
      ]
    }
  }
}
```

Run Pi:

```bash
# Start Pi in your project directory:
pi
```
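If `~/.pi/agent/models.json` already lists other providers, pasting the JSON snippet over the file would drop them. A minimal sketch of merging the `mlx-lm` entry in instead, assuming the file layout shown in the Pi instructions; `merge_mlx_provider` and `update_config_file` are hypothetical helpers, not part of Pi.

```python
import json
from pathlib import Path

# Path and schema follow the Pi instructions on this card; the helpers are hypothetical.
CONFIG_PATH = Path.home() / ".pi" / "agent" / "models.json"

MLX_PROVIDER = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "Rapid42/gemma-4-E2B-it-MLX"}],
}


def merge_mlx_provider(config: dict) -> dict:
    """Return a copy of config with the "mlx-lm" provider set, keeping other providers.

    An existing "mlx-lm" entry is replaced wholesale.
    """
    merged = json.loads(json.dumps(config))  # deep copy of JSON-shaped data
    providers = merged.setdefault("providers", {})
    providers["mlx-lm"] = json.loads(json.dumps(MLX_PROVIDER))
    return merged


def update_config_file(path: Path = CONFIG_PATH) -> None:
    """Read models.json if present, merge in the provider, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_mlx_provider(existing), indent=2) + "\n")
```

Calling `update_config_file()` once before starting Pi leaves any other configured providers untouched.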
How to use Rapid42/gemma-4-E2B-it-MLX with Hermes Agent:
Start the MLX server:

```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Rapid42/gemma-4-E2B-it-MLX"
```
Configure Hermes:

```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Rapid42/gemma-4-E2B-it-MLX
```

Run Hermes:

```bash
hermes
```
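Hermes talks to the local MLX server, so it helps to confirm the server is answering before launching the agent. A minimal sketch, assuming `mlx_lm.server` exposes the OpenAI-style `GET /v1/models` route under the base URL configured above; `models_url` and `server_is_up` are hypothetical helpers.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Join an OpenAI-style base URL (e.g. http://127.0.0.1:8080/v1) with /models."""
    return base_url.rstrip("/") + "/models"


def server_is_up(base_url: str = "http://127.0.0.1:8080/v1", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/models answers with a JSON body."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            json.loads(resp.read())
            return True
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors (URLError);
        # ValueError covers malformed URLs and non-JSON bodies.
        return False
```

Run `server_is_up()` after starting `mlx_lm.server`; a `False` result usually means the server is still loading or is on a different port.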
Rapid42/gemma-4-E2B-it-MLX
Gemma 4 (~1B, E2B ultra-efficient variant) — MLX format for Apple Silicon, instruction-tuned
Converted and optimized by Rapid42 — engineering tools for fast pipelines.
What This Is
This is Gemma 4 E2B — Google DeepMind's ultra-compact multimodal Gemma 4 variant (~1B parameters) in MLX format for native Apple Silicon inference. Instruction-tuned (-it) for chat and task-following.
The E2B is the smallest model in the Gemma 4 family, prioritising speed and a small memory footprint over raw capability. It still accepts image input, so you get multimodal support at a sub-2B size in MLX format.
- Parameters: ~1B (E2B = Efficient 2B-class, actual ~1B)
- Modality: Text + Image input → Text output
- Format: MLX (Apple Silicon native)
- Base model: google/gemma-4-it
- License: Apache 2.0
Hardware Requirements
| Device | Approx. memory used by the model | Experience |
|---|---|---|
| Any M-series Mac (8GB+) | ~1.5GB | ✅ Runs on everything |
| M1 MacBook Air (8GB) | ~1.5GB | ✅ Extremely fast |
| iPhone / iPad (via MLX) | ~1.5GB | ✅ On-device capable |
| M3 Max | ~1.5GB | ✅ Near-instant, even alongside other apps |
One of the lightest multimodal models you can run locally, with load times under 2 seconds.
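As a rough check on the memory column in the table above, quantized weight storage is parameters × bits per weight ÷ 8. A minimal sketch, assuming this card's ~1B parameter count and 4-bit quantization, plus MLX-style group-wise quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits: int = 4, group_size: int = 64) -> float:
    """Rough size of group-wise quantized weights, in gigabytes (1 GB = 1e9 bytes).

    Assumes each group of `group_size` weights also stores one 16-bit scale and
    one 16-bit bias, as in MLX's default affine quantization.
    """
    overhead_bits = 2 * 16 / group_size  # scale + bias, spread over the group
    total_bits = num_params * (bits + overhead_bits)
    return total_bits / 8 / 1e9


# ~1B parameters (this card's figure) at 4 bits with groups of 64 weights:
# 4 + 32/64 = 4.5 bits per weight
print(f"{weight_memory_gb(1e9):.2f} GB")  # prints "0.56 GB"
```

The result covers weights only. KV cache, activations and any layers kept at higher precision add to it, so read it as a lower bound rather than a prediction of the table's figure.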
Quick Start
```bash
pip install mlx-lm
```
Text chat:
```python
from mlx_lm import load, generate

model, tokenizer = load("Rapid42/gemma-4-E2B-it-MLX")

messages = [{"role": "user", "content": "Summarize this in one paragraph: [paste text]"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(response)
```
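For a multi-turn conversation, each reply is appended to `messages` before the next `apply_chat_template` call. A small model benefits from a short history, so a minimal sketch of a trimming helper follows, assuming the chat template expects the history to start on a user turn and alternate user/assistant; `append_turn` and `max_messages` are hypothetical, and the helper does not call MLX itself.

```python
def append_turn(messages: list[dict], role: str, content: str, max_messages: int = 8) -> list[dict]:
    """Return a new history with one message appended, keeping only the newest max_messages."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    history = messages + [{"role": role, "content": content}]
    trimmed = history[-max_messages:]
    # Drop leading assistant turns so the history starts with a user message.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


# Usage with the Quick Start objects (model, tokenizer, generate):
# messages = append_turn([], "user", "Hi!")
# prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
# reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)
# messages = append_turn(messages, "assistant", reply)
```

Returning a new list instead of mutating the input makes it easy to keep the full transcript elsewhere while sending only the trimmed window to the model.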
CLI (fastest way to chat):

```bash
mlx_lm.chat --model Rapid42/gemma-4-E2B-it-MLX
```
With image input (via MLX-VLM):

```bash
pip install mlx-vlm
python -m mlx_vlm.generate \
  --model Rapid42/gemma-4-E2B-it-MLX \
  --prompt "What's in this image?" \
  --image /path/to/image.jpg
```
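To describe a folder of images with the same CLI, you can call it once per file. A minimal sketch, assuming `mlx-vlm` is installed in the current Python environment and accepts the flags shown in the command above; `build_command`, `describe_folder` and the suffix list are hypothetical choices.

```python
import subprocess
import sys
from pathlib import Path

MODEL_ID = "Rapid42/gemma-4-E2B-it-MLX"
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # hypothetical filter


def build_command(image: Path, prompt: str = "What's in this image?") -> list[str]:
    """Build the mlx_vlm.generate command line for a single image."""
    return [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", MODEL_ID,
        "--prompt", prompt,
        "--image", str(image),
    ]


def describe_folder(folder: Path) -> dict[str, str]:
    """Run the CLI once per image and collect its stdout keyed by filename."""
    results = {}
    for image in sorted(folder.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # check=True raises CalledProcessError if the CLI exits with an error.
            proc = subprocess.run(build_command(image), capture_output=True, text=True, check=True)
            results[image.name] = proc.stdout.strip()
    return results
```

Each call reloads the model, and the captured stdout may include more than the answer text, so for large batches the Python API at the top of this card is likely the better fit.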
E2B vs E4B — Which to Use?
| Use Case | E2B (~1B) | E4B (~8B) |
|---|---|---|
| Quick summaries, short answers | ✅ Fast | ✅ More accurate |
| Complex reasoning | ❌ Limited | ✅ Much better |
| Always-on background assistant | ✅ Ideal | ⚠️ Uses more RAM |
| Image understanding | ✅ Basic | ✅ Strong |
| On-device mobile | ✅ Yes | ⚠️ Tight |
| Code generation | ⚠️ Simple only | ✅ Good |
Rule of thumb: Use E2B when you need speed and low overhead. Use E4B when you need quality.
Gemma 4 License
Apache 2.0. Full details: ai.google.dev/gemma/docs/gemma_4_license
Authors: Google DeepMind
About Rapid42
Rapid42 builds fast, precise engineering tools — from VFX pipeline utilities to optimized ML model distributions.
→ rapid42.com · ExrToPsd · Level Careers
Quantization: 4-bit