# Lemer (MLX Q4) — Gemma 4 E2B + LEK
On-device default MLX 4-bit quantised build of lemer — Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via mlx-vlm's native quantisation (affine mode, group size 64). Full multimodal support preserved (text, image, audio). Effective rate: 6.851 bits per weight average (embeddings and sensitive layers kept at higher precision). This is the default on-device variant — smallest footprint, fastest inference, best for consumer Apple Silicon.
Other formats in the Lemma family:
| Repo | Format | Size | Use case |
|---|---|---|---|
| lthn/lemer | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo — everything in one place |
| lthn/lemer-mlx-bf16 | MLX BF16 | 10.2 GB | Full-precision reference |
| lthn/lemer-mlx-q8 | MLX Q8 | 5.9 GB | Near-lossless quantised |
| lthn/lemer-mlx | MLX Q4 | 4.1 GB | You are here — on-device default |
| LetheanNetwork/lemer | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |
## What This Is
The Lethean Ethical Kernel (LEK) has been merged directly into the text attention projections (100 q/k/v/o_proj layers) of Gemma 4 E2B via LoRA finetune, then folded into the base weights. The vision tower and audio tower are preserved unmodified from Google's upstream — LEK only shifts text reasoning.
This variant is MLX Q4 quantised from the merged model — the smallest, fastest multimodal Lemma variant suitable for on-device inference on consumer Apple Silicon. Single safetensor file, ~4.1 GB. Quantisation is 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on M3 Ultra at 145+ tokens/sec generation via mlx-lm; vision inference tested against COCO sample images via mlx-vlm with accurate descriptions.
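The effective bits-per-weight figure is simply a size-weighted average across quantised and full-precision tensors. A minimal sketch with hypothetical parameter counts (the real per-tensor split lives in the safetensors metadata, not shown here):

```python
# Size-weighted average bits per weight across a mix of quantised and
# full-precision tensors. The parameter counts below are made up, chosen
# only to illustrate why a "4-bit" model averages out higher than 4.
def effective_bits(tensors):
    total_params = sum(n for n, _ in tensors)
    total_bits = sum(n * bits for n, bits in tensors)
    return total_bits / total_params

# (param_count, bits_per_weight) — illustrative, not the real layout
tensors = [
    (3_000_000_000, 4),   # attention/MLP weights at Q4
    (1_800_000_000, 16),  # embeddings and sensitive layers kept at BF16
]

print(round(effective_bits(tensors), 3))  # → 8.5 with these made-up sizes
```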
Use this variant when:
- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference with acceptable quality
- Memory budget is limited (~5 GB runtime peak)
For higher fidelity, use lemer-mlx-q8 at 5.9 GB or lemer-mlx-bf16 at 10.2 GB.
## Quick Start
### mlx-lm (text)

```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```
### mlx-vlm (vision + audio multimodal)

```bash
uv tool install mlx-vlm
```

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
### mlx-vlm server (OpenAI-compatible API)

```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```
Then any OpenAI-compatible client can hit http://localhost:8080/v1/chat/completions. Works with LM Studio, pi-coding-agent, OpenWebUI, and any other OpenAI-API-compatible client.
Note: use `mlx_vlm.server` (not `mlx_lm.server`) because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.
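As a sketch, a chat-completions request against the local server can be built with only the Python standard library. The endpoint and model name assume the server command shown above; actually sending the request requires the server to be running, so this only constructs it:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the local mlx_vlm server.
# URL and model name assume the `mlx_vlm.server` invocation shown above.
payload = {
    "model": "lthn/lemer-mlx",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 1.0,
    "top_p": 0.95,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```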
## Recommended Sampling
Per Google's Gemma 4 model card, use these settings across all use cases. Gemma 4 is calibrated for `temperature=1.0`; greedy decoding (`temperature=0`) is NOT recommended and will measurably underperform.
| Parameter | Value |
|---|---|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
These values are already set in `generation_config.json`.
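For reference, the relevant fields would look something like this (a sketch; the shipped `generation_config.json` may contain additional keys):

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64
}
```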
## Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 E2B |
| Format | MLX Q4 (affine quantisation) |
| Quantisation bits | 4 (6.851 bits/weight average including full-precision layers) |
| Quantisation group size | 64 |
| Parameters | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| Layers | 35 text decoder layers |
| Context Length | 128K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, Image, Audio |
| Vision Encoder | ~150M params (preserved unmodified from Google) |
| Audio Encoder | ~300M params (preserved unmodified from Google) |
| Weight file | Single model.safetensors (~4.1 GB) |
| LEK delta | LoRA rank 8 merged into 100 text attention projections, then quantised |
| Quantisation source | lthn/lemer-mlx-bf16 via mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64) |
| Base fork | LetheanNetwork/lemer (unmodified Google fork) |
| Licence | EUPL-1.2 |
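As a sanity check, the ~4.1 GB weight file follows directly from the figures in the table above: 5.1B total parameters at an average of 6.851 bits per weight.

```python
# Rough file-size check: total params × average bits per weight, in GiB.
params = 5.1e9           # total parameters (from the table above)
bits_per_weight = 6.851  # effective average including full-precision layers

size_bytes = params * bits_per_weight / 8
size_gib = size_bytes / 1024**3

print(round(size_gib, 2))  # → 4.07, consistent with the ~4.1 GB safetensors file
```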
## Performance Notes
Verified on M3 Ultra (96 GB):
- mlx-lm generation: ~145 tokens/sec on text-only inference
- Peak runtime memory: ~3.4 GB (ample headroom for context growth)
- Vision inference: correct multi-object scene description on COCO test images
Should run comfortably on M1/M2/M3/M4 Air (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.
## Full Model Card
Detailed documentation — Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap — lives on the main repo:
→ lthn/lemer
## About Lethean
Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped EUPL-1.2 so the ethical layer stays in the open.
- Website: lthn.ai
- GitHub: LetheanNetwork
- Axioms (public domain): Snider/ai-ethics
- Licence: EUPL-1.2