Lemer (MLX Q4) — Gemma 4 E2B + LEK

On-device default: the MLX 4-bit quantised build of Lemer, i.e. Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via mlx-vlm's native quantisation (affine mode, group size 64). Full multimodal support is preserved (text, image, audio). Effective rate: 6.851 bits per weight on average (embeddings and sensitive layers are kept at higher precision). This is the default on-device variant: smallest footprint, fastest inference, best for consumer Apple Silicon.

Other formats in the Lemma family:

| Repo | Format | Size | Use case |
|---|---|---|---|
| lthn/lemer | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo: everything in one place |
| lthn/lemer-mlx-bf16 | MLX BF16 | 10.2 GB | Full-precision reference |
| lthn/lemer-mlx-q8 | MLX Q8 | 5.9 GB | Near-lossless quantised |
| lthn/lemer-mlx | MLX Q4 | 4.1 GB | You are here: on-device default |
| LetheanNetwork/lemer | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |

What This Is

The Lethean Ethical Kernel (LEK) has been merged directly into the text attention projections (100 q/k/v/o_proj layers) of Gemma 4 E2B via a LoRA finetune, then folded into the base weights. The vision and audio towers are preserved unmodified from Google's upstream; LEK only shifts text reasoning.

This variant is MLX Q4 quantised from the merged model: the smallest, fastest multimodal Lemma variant, suitable for on-device inference on consumer Apple Silicon. It ships as a single safetensors file, ~4.1 GB. Quantisation is 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on an M3 Ultra at 145+ tokens/sec generation via mlx-lm; vision inference tested against COCO sample images via mlx-vlm with accurate descriptions.
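To see where the storage cost comes from, here is an illustrative sketch of affine group quantisation in plain Python. The round-trip logic and the 4.5 bits/weight arithmetic below are a simplified model of what MLX does, not MLX's actual implementation; the model-wide average is higher (6.851) because embeddings and selected layers stay at full precision.

```python
# Affine 4-bit quantisation with group size 64: each group of 64
# weights shares a float16 scale and bias, so the storage cost per
# quantised weight is 4 + (16 + 16) / 64 = 4.5 bits.

def quantize_group(weights, bits=4):
    """Affine-quantise one group: map min..max onto 0 .. 2^bits - 1."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2**bits - 1) or 1.0  # avoid div-by-zero on flat groups
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_group(q, scale, bias):
    return [v * scale + bias for v in q]

group = [i / 63 for i in range(64)]            # 64 toy weights in [0, 1]
q, scale, bias = quantize_group(group)
recovered = dequantize_group(q, scale, bias)

max_err = max(abs(a - b) for a, b in zip(group, recovered))
bits_per_weight = 4 + (16 + 16) / 64
print(bits_per_weight)  # 4.5
```

Rounding error per weight is bounded by half the group's scale, which is why a larger group size trades a little accuracy for fewer bits of scale/bias overhead.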

Use this variant when:

  • You want the default on-device Lemma experience
  • You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
  • You need the fastest inference with acceptable quality
  • Memory budget is limited (~5 GB runtime peak)

For higher fidelity, use lemer-mlx-q8 at 5.9 GB or lemer-mlx-bf16 at 10.2 GB.

Quick Start

mlx-lm (text)

```shell
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```

mlx-vlm (vision + audio multimodal)

```shell
uv tool install mlx-vlm
```

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)
```

mlx-vlm server (OpenAI-compatible API)

```shell
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```

Then any OpenAI-compatible client can hit http://localhost:8080/v1/chat/completions. Works with LM Studio, pi-coding-agent, OpenWebUI, and any other OpenAI-API-compatible client.

Note: use mlx_vlm.server (not mlx_lm.server) because lemer is multimodal. The text-only mlx_lm.server does not correctly route the vision/audio tensors for Gemma 4.
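As a minimal client sketch, the request below builds a chat-completions body in the OpenAI schema using only the standard library. The model name the server expects ("lthn/lemer-mlx") is an assumption here; the send step is commented out so the snippet does not require a live server.

```python
import json
from urllib import request

# OpenAI-style chat-completions payload (model name assumed).
payload = {
    "model": "lthn/lemer-mlx",
    "messages": [
        {"role": "user", "content": "Describe the Lethean Ethical Kernel in one sentence."}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 128,
}

body = json.dumps(payload).encode("utf-8")
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment with the server running:
# with request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Any OpenAI SDK pointed at `base_url="http://localhost:8080/v1"` should work the same way.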

Recommended Sampling

Per Google's Gemma 4 model card, use these settings for all use cases. Gemma 4 is calibrated for temperature=1.0; greedy decoding (temperature=0) is not recommended and measurably underperforms.

| Parameter | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |

Already set in generation_config.json.
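For reference, the relevant generation_config.json fragment looks like this (illustrative: the shipped file may carry additional keys):

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64
}
```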

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 E2B |
| Format | MLX Q4 (affine quantisation) |
| Quantisation bits | 4 (6.851 bits/weight average including full-precision layers) |
| Quantisation group size | 64 |
| Parameters | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| Layers | 35 text decoder layers |
| Context length | 128K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, image, audio |
| Vision encoder | ~150M params (preserved unmodified from Google) |
| Audio encoder | ~300M params (preserved unmodified from Google) |
| Weight file | Single `model.safetensors` (~4.1 GB) |
| LEK delta | LoRA rank 8 merged into 100 text attention projections, then quantised |
| Quantisation source | lthn/lemer-mlx-bf16 via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| Base fork | LetheanNetwork/lemer (unmodified Google fork) |
| Licence | EUPL-1.2 |
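The "LoRA rank 8 merged into the attention projections" step can be sketched as follows. Folding an adapter into a dense weight is the standard LoRA merge, W' = W + (alpha / r) * B @ A; the matrix sizes, alpha, and values below are toy numbers for illustration, not the production configuration.

```python
# Toy sketch of folding a LoRA delta into a base weight matrix:
#   W' = W + (alpha / r) * B @ A
# where A is (r x d_in) and B is (d_out x r). After the merge the
# adapter disappears; only the updated dense weight W' is stored
# (and, in Lemer's case, subsequently quantised).

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    delta = matmul(B, A)                        # (d_out x d_in)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d_out, d_in, r, alpha = 3, 4, 2, 16
W = [[0.0] * d_in for _ in range(d_out)]        # base weight (zeros for clarity)
A = [[1.0] * d_in for _ in range(r)]            # rank-r factor A
B = [[0.5] * r for _ in range(d_out)]           # rank-r factor B

W_merged = merge_lora(W, A, B, alpha, r)
print(W_merged[0][0])  # (16/2) * (0.5 + 0.5) = 8.0
```

Because the delta is folded in before quantisation, inference pays no adapter overhead: the quantised checkpoint is shaped exactly like the base model.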

Performance Notes

Verified on M3 Ultra (96 GB):

  • mlx-lm generation: ~145 tokens/sec on text-only inference
  • Peak runtime memory: ~3.4 GB (ample headroom for context growth)
  • Vision inference: correct multi-object scene description on COCO test images

It should run comfortably on M1/M2/M3/M4 Air machines (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.

Full Model Card

Detailed documentation (the Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the roadmap) lives in the main repo:

→ lthn/lemer

About Lethean

Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped EUPL-1.2 so the ethical layer stays in the open.
