---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- lthn/lemer
base_model_relation: quantized
tags:
- gemma4
- lemma
- mlx
- 4bit
- apple-silicon
- multimodal
- on-device
- conversational
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
---

# Lemer (MLX Q4) — Gemma 4 E2B + LEK

**On-device default MLX 4-bit quantised build of [lemer](https://huggingface.co/lthn/lemer)** — Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via `mlx-vlm`'s native quantisation (affine mode, group size 64). Full multimodal support is preserved (text, image, audio). Effective rate: **6.851 bits per weight on average** (embeddings and sensitive layers are kept at higher precision).

This is the **default on-device variant** — smallest footprint, fastest inference, best suited to consumer Apple Silicon.

**Other formats in the Lemma family:**

| Repo | Format | Size | Use case |
|---|---|---|---|
| [lthn/lemer](https://huggingface.co/lthn/lemer) | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo — everything in one place |
| [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) | MLX BF16 | 10.2 GB | Full-precision reference |
| [lthn/lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) | MLX Q8 | 5.9 GB | Near-lossless quantised |
| [lthn/lemer-mlx](https://huggingface.co/lthn/lemer-mlx) | MLX Q4 | **4.1 GB** | **You are here** — on-device default |
| [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |

## What This Is

The **Lethean Ethical Kernel (LEK)** has been merged directly into the text attention projections (100 `q/k/v/o_proj` layers) of Gemma 4 E2B via a LoRA finetune, then folded into the base weights. The vision tower and audio tower are preserved unmodified from Google's upstream — LEK only shifts text reasoning.
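The affine scheme works per group of 64 weights: 4-bit codes are stored alongside a 16-bit scale and 16-bit bias, for a per-group cost of (64 × 4 + 16 + 16) / 64 = 4.5 bits/weight; the layers kept at higher precision then pull the model-wide average up to the reported 6.851. A minimal standalone sketch of the idea (pure Python, illustrative only; not the actual MLX kernel):

```python
import random

GROUP = 64          # q_group_size
BITS = 4            # q_bits
LEVELS = 2**BITS - 1  # 15 quantisation levels

def quantise_group(w):
    """Affine-quantise one group: map [min, max] onto integer codes 0..15."""
    lo, hi = min(w), max(w)
    scale = (hi - lo) / LEVELS or 1.0
    codes = [round((x - lo) / scale) for x in w]
    return codes, scale, lo

def dequantise_group(codes, scale, bias):
    """Reconstruct approximate weights from codes + per-group scale/bias."""
    return [bias + scale * c for c in codes]

random.seed(0)
w = [random.uniform(-1.0, 1.0) for _ in range(GROUP)]
codes, scale, bias = quantise_group(w)
w_hat = dequantise_group(codes, scale, bias)

# Round-trip error is bounded by half a quantisation step.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))

# Storage: 64 x 4-bit codes plus a 16-bit scale and 16-bit bias per group.
bits_per_weight = (GROUP * BITS + 16 + 16) / GROUP  # 4.5
```

The 4.5 bits/weight figure applies only to the quantised attention/MLP groups; averaging in the embeddings and other layers held at higher precision yields the model-wide effective rate.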
This variant is **MLX Q4 quantised from the merged model** — the smallest, fastest multimodal Lemma variant, suitable for on-device inference on consumer Apple Silicon. Single safetensors file, ~4.1 GB. Quantisation uses 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average).

Verified on M3 Ultra at **145+ tokens/sec generation** via `mlx-lm`; vision inference tested against COCO sample images via `mlx-vlm`, with accurate descriptions.

Use this variant when:

- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference with acceptable quality
- Memory budget is limited (~5 GB runtime peak)

For higher fidelity, use [lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) at 5.9 GB or [lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) at 10.2 GB.

## Quick Start

### mlx-lm (text)

```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```

### mlx-vlm (vision + audio multimodal)

```bash
uv tool install mlx-vlm
```

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)
```

### mlx-vlm server (OpenAI-compatible API)

```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```

Then any OpenAI-compatible client can hit `http://localhost:8080/v1/chat/completions`. Works with LM Studio, pi-coding-agent, OpenWebUI, and any other OpenAI-API-compatible client.
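For reference, a raw request against that endpoint can be built with the standard library alone. This is a hedged sketch: it assumes the server follows the usual OpenAI chat-completions schema, and the sampling values mirror the model's `generation_config.json` defaults.

```python
import json
from urllib import request

# Chat-completions request body for the local mlx_vlm.server endpoint.
payload = {
    "model": "lthn/lemer-mlx",
    "messages": [
        {"role": "user", "content": "Describe the EUPL-1.2 in one sentence."}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 256,
}

req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server from the section above running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```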
> **Note**: use `mlx_vlm.server` (not `mlx_lm.server`), because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.

## Recommended Sampling

Per Google's [Gemma 4 model card](https://huggingface.co/google/gemma-4-E2B-it), use these settings across all use cases. **Gemma 4 is calibrated for `temperature=1.0` — greedy decoding (temperature=0) is NOT recommended and will measurably underperform.**

| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |

These are already set in `generation_config.json`.

## Model Details

| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E2B |
| **Format** | MLX Q4 (affine quantisation) |
| **Quantisation bits** | 4 (6.851 bits/weight average including full-precision layers) |
| **Quantisation group size** | 64 |
| **Parameters** | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| **Layers** | 35 text decoder layers |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Vision Encoder** | ~150M params (preserved unmodified from Google) |
| **Audio Encoder** | ~300M params (preserved unmodified from Google) |
| **Weight file** | Single `model.safetensors` (~4.1 GB) |
| **LEK delta** | LoRA rank 8 merged into 100 text attention projections, then quantised |
| **Quantisation source** | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| **Base fork** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) (unmodified Google fork) |
| **Licence** | EUPL-1.2 |

## Performance Notes

Verified on M3 Ultra (96 GB):

- **mlx-lm generation**: ~145 tokens/sec on text-only inference
- **Peak runtime memory**: ~3.4 GB (ample headroom for context growth)
- **Vision inference**: correct multi-object scene description on COCO test images

Should run comfortably on M1/M2/M3/M4 Air (8 GB RAM) for text
inference, and on Pro/Max/Ultra variants for full multimodal workloads.

## Full Model Card

Detailed documentation — the Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap — lives on the main repo:

**→ [lthn/lemer](https://huggingface.co/lthn/lemer)**

## About Lethean

[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project — a training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped under EUPL-1.2 so the ethical layer stays in the open.

- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Axioms (public domain): [Snider/ai-ethics](https://github.com/Snider/ai-ethics)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)