Lemer (MLX Q8) — Gemma 4 E2B + LEK

Near-lossless MLX 8-bit quantised build of lemer — Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 8 bits per weight via mlx-vlm's native quantisation (affine mode, group size 64). Full multimodal support preserved (text, image, audio). Effective rate: 9.257 bits per weight average (embeddings and sensitive layers kept at higher precision).

Other formats in the Lemma family:

| Repo | Format | Size | Use case |
|------|--------|------|----------|
| lthn/lemer | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo: everything in one place |
| lthn/lemer-mlx-bf16 | MLX BF16 | 10.2 GB | Full-precision reference |
| lthn/lemer-mlx-q8 | MLX Q8 | 5.9 GB | You are here: near-lossless quantised |
| lthn/lemer-mlx | MLX Q4 | 4.1 GB | On-device default |
| LetheanNetwork/lemer | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |

What This Is

The Lethean Ethical Kernel (LEK) has been merged directly into the text attention projections (100 q/k/v/o_proj layers) of Gemma 4 E2B via LoRA finetune, then folded into the base weights. The vision tower and audio tower are preserved unmodified from Google's upstream — LEK only shifts text reasoning.
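Folding a LoRA delta into base weights is a one-time matrix update, after which the adapter no longer exists as a separate artifact. A minimal sketch, with toy dimensions and an illustrative `alpha` (only the rank-8 shape matches the adapters described on this card):

```python
import numpy as np

def merge_lora(W, A, B, alpha=16.0):
    """Fold a LoRA adapter into a base projection: W' = W + (alpha / r) * B @ A."""
    r = A.shape[0]                      # LoRA rank (8 for the LEK adapters)
    return W + (alpha / r) * (B @ A)

d = 64                                  # toy hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))             # base attention projection
A = rng.normal(size=(8, d)) * 0.01      # rank-8 down-projection
B = rng.normal(size=(d, 8)) * 0.01      # rank-8 up-projection

W_merged = merge_lora(W, A, B)

# After merging, a forward pass needs only W_merged; the low-rank factors disappear.
x = np.ones(d)
assert np.allclose(W_merged @ x, W @ x + 2.0 * (B @ (A @ x)))
```

This is why the vision and audio towers come through untouched: only the 100 text q/k/v/o projection matrices receive a delta.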

This variant is MLX Q8 quantised from the merged model, shipped as two safetensors shards totalling ~5.9 GB. Quantisation is 8 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 9.257 bits/weight average). Quality is near-lossless versus the BF16 reference; generation is indistinguishable for most tasks.
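Affine group quantisation stores, for each group of 64 weights, the 8-bit integer codes plus a per-group scale and offset. The sketch below is an illustrative re-implementation of that scheme, not MLX's actual kernel:

```python
import numpy as np

def quantize_group(w, bits=8):
    """Affine-quantise one group of weights to `bits`-bit integer codes."""
    w = np.asarray(w, dtype=np.float64)
    lo, hi = w.min(), w.max()
    levels = (1 << bits) - 1                      # 255 levels for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((w - lo) / scale).astype(np.int64)
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate weights from codes + per-group scale/offset."""
    return codes * scale + lo

rng = np.random.default_rng(0)
group = rng.normal(0.0, 0.02, size=64)            # one group of 64 weights
codes, scale, lo = quantize_group(group)
recon = dequantize_group(codes, scale, lo)

# Rounding to the nearest level bounds the error at half a quantisation step.
max_err = np.abs(group - recon).max()
assert max_err <= scale / 2 + 1e-12
```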

Use this variant when:

  • You want higher fidelity than Q4 but don't need full BF16
  • You have ~7 GB of runtime memory available
  • You're running comparative evals against other Q8 models
  • You want the best speed/quality tradeoff for multimodal inference

For lighter on-device usage, use lemer-mlx (Q4) at 4.1 GB. For reference-grade fidelity, use lemer-mlx-bf16.
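The ~5.9 GB shard total follows directly from the average bit rate. A quick sanity check, using the parameter count from this card and assuming (for illustration) a simple two-tier 8-bit / 16-bit split:

```python
total_params = 5.1e9   # total parameters, per the model card
avg_bits = 9.257       # effective bits per weight across quantised + full-precision layers

# Weight bytes implied by the average rate: ~5.9 GB, matching the shard total.
weight_bytes = total_params * avg_bits / 8
print(f"weights: {weight_bytes / 1e9:.1f} GB")

# Back out what fraction of weights stayed at 16-bit precision
# (assumes only two tiers, 8-bit and 16-bit; the real layout may be finer-grained).
frac_16bit = (avg_bits - 8) / (16 - 8)
print(f"~{frac_16bit:.1%} of weights kept at 16 bits")
```

Runtime memory adds the KV cache and activations on top of the weights, which is why ~7 GB is the practical floor rather than 5.9 GB.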

Quick Start

mlx-lm (text)

uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx-q8
mlx_lm.generate --model lthn/lemer-mlx-q8 --prompt "Hello, how are you?"

mlx-vlm (vision + audio multimodal)

uv tool install mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the quantised weights and the multimodal processor
model, processor = load("lthn/lemer-mlx-q8")
config = load_config("lthn/lemer-mlx-q8")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."

# Wrap the prompt in the model's chat template, declaring one image slot
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)

mlx-vlm server (OpenAI-compatible API)

mlx_vlm.server --model lthn/lemer-mlx-q8 --port 8080

Note: use mlx_vlm.server (not mlx_lm.server) because lemer is multimodal. The text-only mlx_lm.server does not correctly route the vision/audio tensors for Gemma 4.
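Once the server is up, any OpenAI-style client can talk to it. A minimal standard-library sketch; the /v1/chat/completions path and payload shape follow the OpenAI chat convention the server advertises, so adjust if your mlx-vlm version differs:

```python
import json

def build_chat_request(prompt, image_url=None, model="lthn/lemer-mlx-q8"):
    """Build an OpenAI-style chat completion payload, optionally with an image."""
    content = [{"type": "text", "text": prompt}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 1.0,   # Gemma 4 calibration, per the sampling section below
        "top_p": 0.95,
    }

payload = build_chat_request(
    "Describe this image.",
    "http://images.cocodataset.org/val2017/000000039769.jpg",
)
print(json.dumps(payload, indent=2))

# To send against a running server on port 8080:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])
```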

Recommended Sampling

Per Google's Gemma 4 model card. Gemma 4 is calibrated for temperature=1.0 — greedy / temperature=0 is NOT recommended and will measurably underperform.

| Parameter | Value |
|-----------|-------|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |

Already set in generation_config.json.
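For runtimes where you wire these up by hand, the three parameters compose as sketched below. This is an illustrative implementation of standard temperature / top-k / top-p (nucleus) filtering, not mlx-lm's internal sampler:

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return sampling probabilities after temperature, top-k, then top-p filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())          # stable softmax
    probs /= probs.sum()

    # top-k: keep only the k most likely tokens
    order = np.argsort(probs)[::-1]
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:top_k]] = True

    # top-p: further trim to the smallest prefix covering p probability mass
    cum = np.cumsum(probs[order])
    n_nucleus = max(1, int(np.searchsorted(cum, top_p)) + 1)
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[:n_nucleus]] = True
    keep &= mask

    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()

# Low-probability tail tokens are zeroed; the survivors are renormalised.
probs = filter_logits([2.0, 1.0, 0.5, -1.0, -3.0], top_k=3, top_p=0.95)
```

With temperature=1.0 the logits pass through unscaled, which is why greedy decoding (collapsing everything onto the argmax) deviates from how Gemma 4 was calibrated.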

Model Details

| Property | Value |
|----------|-------|
| Architecture | Gemma 4 E2B |
| Format | MLX Q8 (affine quantisation) |
| Quantisation bits | 8 (9.257 bits/weight average including full-precision layers) |
| Quantisation group size | 64 |
| Parameters | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| Layers | 35 text decoder layers |
| Context length | 128K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, image, audio |
| Vision encoder | ~150M params (preserved unmodified from Google) |
| Audio encoder | ~300M params (preserved unmodified from Google) |
| Weight files | 2 shards (model-00001-of-00002.safetensors, …, 5.9 GB total) |
| LEK delta | LoRA rank 8 merged into 100 text attention projections, then quantised |
| Quantisation source | lthn/lemer-mlx-bf16 via mlx_vlm.convert(quantize=True, q_bits=8, q_group_size=64) |
| Base fork | LetheanNetwork/lemer (unmodified Google fork) |
| Licence | EUPL-1.2 |

Full Model Card

Detailed documentation (Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap) lives in the main repo:

→ lthn/lemer

About Lethean

Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped EUPL-1.2 so the ethical layer stays in the open.
