---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- lthn/lemer
base_model_relation: quantized
tags:
- gemma4
- lemma
- mlx
- 4bit
- apple-silicon
- multimodal
- on-device
- conversational
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
---
<!--
This content is subject to the European Union Public Licence (EUPL-1.2).
For full licence details, please refer to: https://huggingface.co/lthn/lemer-mlx/tree/main/LICENSE
Origin URL: https://huggingface.co/lthn/lemer-mlx/tree/main
-->
# Lemer (MLX Q4): Gemma 4 E2B + LEK
|
|
**On-device default MLX 4-bit quantised build of [lemer](https://huggingface.co/lthn/lemer)**: Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via `mlx-vlm`'s native quantisation (affine mode, group size 64). Full multimodal support is preserved (text, image, audio). Effective rate: **6.851 bits per weight on average** (embeddings and sensitive layers are kept at higher precision). This is the **default on-device variant**: smallest footprint, fastest inference, best suited to consumer Apple Silicon.
|
|
**Other formats in the Lemma family:**
|
|
| | Repo | Format | Size | Use case | |
| |---|---|---|---| |
| [lthn/lemer](https://huggingface.co/lthn/lemer) | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo: everything in one place |
| | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) | MLX BF16 | 10.2 GB | Full-precision reference | |
| | [lthn/lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) | MLX Q8 | 5.9 GB | Near-lossless quantised | |
| [lthn/lemer-mlx](https://huggingface.co/lthn/lemer-mlx) | MLX Q4 | **4.1 GB** | **You are here**: on-device default |
| | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK | |
|
|
## What This Is
|
|
The **Lethean Ethical Kernel (LEK)** has been merged directly into the text attention projections (100 `q/k/v/o_proj` layers) of Gemma 4 E2B via a LoRA finetune, then folded into the base weights. The vision and audio towers are preserved unmodified from Google's upstream; LEK only shifts text reasoning.
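The fold-in step can be sketched in miniature: a LoRA adapter stores a low-rank delta `B @ A` that, once training is done, is added into the base projection weight so inference needs no extra matmul. The following is a toy pure-Python illustration under assumed shapes (4×4 matrices, rank 2, alpha 4 — not the real rank-8 adapter or its scaling):

```python
# Toy illustration of folding a LoRA delta into a base weight matrix:
# W_merged = W + (alpha / r) * (B @ A). After the fold, a plain matmul
# with W_merged reproduces base-plus-adapter behaviour.

def matmul(X, Y):
    """Plain list-of-lists matrix multiply."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y, scale=1.0):
    """Elementwise X + scale * Y."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r, alpha = 4, 2, 4                                        # toy hidden size, rank, alpha
W = [[float(i == j) for j in range(d)] for i in range(d)]    # base weight (identity, d x d)
B = [[0.1 * (i + 1) for _ in range(r)] for i in range(d)]    # LoRA down matrix, d x r
A = [[0.2 for _ in range(d)] for _ in range(r)]              # LoRA up matrix, r x d

delta = matmul(B, A)
W_merged = add(W, delta, scale=alpha / r)

# Applying the merged weight equals base output plus scaled adapter output.
x = [[1.0, 2.0, 3.0, 4.0]]                                   # 1 x d activation
lhs = matmul(x, W_merged)
rhs = add(matmul(x, W), matmul(x, delta), scale=alpha / r)
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs[0], rhs[0]))
```

Because the delta is folded in ahead of time, the merged checkpoint is a drop-in replacement for the base weights, which is what makes quantising it afterwards straightforward.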
|
|
This variant is **MLX Q4, quantised from the merged model**: the smallest, fastest multimodal Lemma variant, suited to on-device inference on consumer Apple Silicon. It ships as a single safetensors file of ~4.1 GB. Quantisation uses 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on an M3 Ultra at **145+ tokens/sec generation** via `mlx-lm`; vision inference was tested against COCO sample images via `mlx-vlm` and produced accurate descriptions.
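The 6.851 bits/weight figure follows from simple accounting. With affine 4-bit quantisation at group size 64, each group of 64 weights stores 64 four-bit values plus a per-group scale and bias. Assuming 16-bit scale and bias (and that unquantised layers stay at 16 bits per weight — assumptions, not the repo's actual manifest), the sketch below back-solves for the fraction of parameters that must remain at full precision to land on the reported average:

```python
# Effective bits-per-weight for affine 4-bit quantisation, group size 64.
# Assumed (not from the repo manifest): one fp16 scale and one fp16 bias
# per group, and unquantised layers stored at 16 bits per weight.

q_bits, group_size = 4, 64
scale_bias_bits = 16 + 16                       # fp16 scale + fp16 bias per group

bpw_quantised = q_bits + scale_bias_bits / group_size
print(bpw_quantised)                            # 4.5 bits per quantised weight

# If a fraction f of parameters stays at 16-bit, the average is
#   avg = (1 - f) * bpw_quantised + f * 16.  Solve for f given 6.851:
avg_reported = 6.851
f = (avg_reported - bpw_quantised) / (16 - bpw_quantised)
print(round(f, 3))                              # 0.204: roughly a fifth at full precision
```

This is consistent with the card's note that embeddings and selected sensitive layers are kept at higher precision while attention/MLP weights take the 4-bit path.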
|
|
Use this variant when:
- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference at acceptable quality
- Memory budget is limited (~5 GB runtime peak)
|
|
For higher fidelity, use [lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) at 5.9 GB or [lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) at 10.2 GB.
|
|
## Quick Start
|
|
### mlx-lm (text)
|
|
```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```
|
|
### mlx-vlm (vision + audio multimodal)
|
|
```bash
uv tool install mlx-vlm
```
|
|
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
|
|
### mlx-vlm server (OpenAI-compatible API)
|
|
```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```
|
|
Any OpenAI-compatible client can then hit `http://localhost:8080/v1/chat/completions`; this includes LM Studio, pi-coding-agent, OpenWebUI, and similar tools.
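As a sketch of what such a client sends, the helper below builds a standard `/v1/chat/completions` request body using the sampling values this card recommends for Gemma 4 (the payload shape is the OpenAI chat API; the `urllib` call is commented out so the snippet stands alone without a running server):

```python
# Build an OpenAI-style chat completion request for the local mlx_vlm server.
# The temperature/top_p values are the Gemma 4 recommendations from this card;
# max_tokens is an arbitrary example value.
import json

def build_chat_request(prompt: str, model: str = "lthn/lemer-mlx") -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 256,
    }

payload = build_chat_request("Describe the Lethean Ethical Kernel in one sentence.")
body = json.dumps(payload).encode()

# With the server above running on port 8080, POST it (uncomment to use):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions", data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```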
|
|
> **Note**: use `mlx_vlm.server` (not `mlx_lm.server`), because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.

## Recommended Sampling

Per Google's [Gemma 4 model card](https://huggingface.co/google/gemma-4-E2B-it), use these settings across all use cases. **Gemma 4 is calibrated for `temperature=1.0`; greedy decoding (temperature 0) is NOT recommended and will measurably underperform.**
| | Parameter | Value | |
| |-----------|-------| |
| | `temperature` | 1.0 | |
| | `top_p` | 0.95 | |
| | `top_k` | 64 | |
|
These are already set in `generation_config.json`.
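To confirm the defaults on a downloaded copy, you can read the file back with the standard library. The snippet below writes and re-reads a minimal `generation_config.json` with those fields (the key names are the usual Hugging Face generation-config keys; in practice you would open the file inside your local model snapshot directory instead):

```python
# Round-trip a minimal generation_config.json and check the sampling defaults.
# With a real download, open <snapshot_dir>/generation_config.json instead.
import json
import os
import tempfile

config = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "generation_config.json")
    with open(path, "w") as fh:
        json.dump(config, fh)
    with open(path) as fh:
        loaded = json.load(fh)

assert loaded["temperature"] == 1.0 and loaded["top_k"] == 64
print(loaded)
```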
|
|
## Model Details
|
|
| | Property | Value | |
| |----------|-------| |
| | **Architecture** | Gemma 4 E2B | |
| | **Format** | MLX Q4 (affine quantisation) | |
| | **Quantisation bits** | 4 (6.851 bits/weight average including full-precision layers) | |
| | **Quantisation group size** | 64 | |
| | **Parameters** | 5.1B total, 2.3B effective (Per-Layer Embeddings) | |
| | **Layers** | 35 text decoder layers | |
| | **Context Length** | 128K tokens | |
| | **Vocabulary** | 262K tokens | |
| | **Modalities** | Text, Image, Audio | |
| | **Vision Encoder** | ~150M params (preserved unmodified from Google) | |
| | **Audio Encoder** | ~300M params (preserved unmodified from Google) | |
| | **Weight file** | Single `model.safetensors` (~4.1 GB) | |
| | **LEK delta** | LoRA rank 8 merged into 100 text attention projections, then quantised | |
| | **Quantisation source** | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` | |
| | **Base fork** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) (unmodified Google fork) | |
| | **Licence** | EUPL-1.2 | |
|
|
## Performance Notes
|
|
Verified on an M3 Ultra (96 GB):
- **mlx-lm generation**: ~145 tokens/sec on text-only inference
- **Peak runtime memory**: ~3.4 GB (ample headroom for context growth)
- **Vision inference**: correct multi-object scene description on COCO test images
|
|
It should run comfortably on an M1/M2/M3/M4 Air (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.
|
|
## Full Model Card
|
|
Detailed documentation (the Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the roadmap) lives on the main repo:
|
|
**→ [lthn/lemer](https://huggingface.co/lthn/lemer)**
|
|
## About Lethean
|
|
[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project: a training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped under EUPL-1.2 so the ethical layer stays in the open.
|
|
- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Axioms (public domain): [Snider/ai-ethics](https://github.com/Snider/ai-ethics)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)
|
|