File size: 7,194 Bytes
662a530 0e862eb 662a530 0e862eb 662a530 0e862eb 662a530 0e862eb 662a530 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | ---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- lthn/lemer
base_model_relation: quantized
tags:
- gemma4
- lemma
- mlx
- 4bit
- apple-silicon
- multimodal
- on-device
- conversational
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
---
<!--
This content is subject to the European Union Public Licence (EUPL-1.2).
For full licence details, please refer to: https://huggingface.co/lthn/lemer-mlx/tree/main/LICENSE
Origin URL: https://huggingface.co/lthn/lemer-mlx/tree/main
-->
# Lemer (MLX Q4) β Gemma 4 E2B + LEK
**On-device default MLX 4-bit quantised build of [lemer](https://huggingface.co/lthn/lemer)** β Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via `mlx-vlm`'s native quantisation (affine mode, group size 64). Full multimodal support preserved (text, image, audio). Effective rate: **6.851 bits per weight average** (embeddings and sensitive layers kept at higher precision). This is the **default on-device variant** β smallest footprint, fastest inference, best for consumer Apple Silicon.
**Other formats in the Lemma family:**
| Repo | Format | Size | Use case |
|---|---|---|---|
| [lthn/lemer](https://huggingface.co/lthn/lemer) | HF + GGUF + MLX Q4 bundled | 3β9 GB per variant | Main consumer repo β everything in one place |
| [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) | MLX BF16 | 10.2 GB | Full-precision reference |
| [lthn/lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) | MLX Q8 | 5.9 GB | Near-lossless quantised |
| [lthn/lemer-mlx](https://huggingface.co/lthn/lemer-mlx) | MLX Q4 | **4.1 GB** | **You are here** β on-device default |
| [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |
## What This Is
The **Lethean Ethical Kernel (LEK)** has been merged directly into the text attention projections (100 `q/k/v/o_proj` layers) of Gemma 4 E2B via LoRA finetune, then folded into the base weights. The vision tower and audio tower are preserved unmodified from Google's upstream β LEK only shifts text reasoning.
This variant is **MLX Q4 quantised from the merged model** β the smallest, fastest multimodal Lemma variant suitable for on-device inference on consumer Apple Silicon. Single safetensor file, ~4.1 GB. Quantisation is 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on M3 Ultra at **145+ tokens/sec generation** via `mlx-lm`; vision inference tested against COCO sample images via `mlx-vlm` with accurate descriptions.
Use this variant when:
- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference with acceptable quality
- Memory budget is limited (~5 GB runtime peak)
For higher fidelity, use [lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) at 5.9 GB or [lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) at 10.2 GB.
## Quick Start
### mlx-lm (text)
```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```
### mlx-vlm (vision + audio multimodal)
```bash
uv tool install mlx-vlm
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=1
)
output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
### mlx-vlm server (OpenAI-compatible API)
```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```
Then any OpenAI-compatible client can hit `http://localhost:8080/v1/chat/completions`. Works with LM Studio, pi-coding-agent, OpenWebUI, and any other OpenAI-API-compatible client.
> **Note**: use `mlx_vlm.server` (not `mlx_lm.server`) because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.
## Recommended Sampling
Per Google's [Gemma 4 model card](https://huggingface.co/google/gemma-4-E2B-it), use these across all use cases. **Gemma 4 is calibrated for `temperature=1.0` β greedy / temperature=0 is NOT recommended and will measurably underperform.**
| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
Already set in `generation_config.json`.
## Model Details
| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E2B |
| **Format** | MLX Q4 (affine quantisation) |
| **Quantisation bits** | 4 (6.851 bits/weight average including full-precision layers) |
| **Quantisation group size** | 64 |
| **Parameters** | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| **Layers** | 35 text decoder layers |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Vision Encoder** | ~150M params (preserved unmodified from Google) |
| **Audio Encoder** | ~300M params (preserved unmodified from Google) |
| **Weight file** | Single `model.safetensors` (~4.1 GB) |
| **LEK delta** | LoRA rank 8 merged into 100 text attention projections, then quantised |
| **Quantisation source** | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| **Base fork** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) (unmodified Google fork) |
| **Licence** | EUPL-1.2 |
## Performance Notes
Verified on M3 Ultra (96 GB):
- **mlx-lm generation**: ~145 tokens/sec on text-only inference
- **Peak runtime memory**: ~3.4 GB (ample headroom for context growth)
- **Vision inference**: correct multi-object scene description on COCO test images
Should run comfortably on M1/M2/M3/M4 Air (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.
## Full Model Card
Detailed documentation β Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap β lives on the main repo:
**β [lthn/lemer](https://huggingface.co/lthn/lemer)**
## About Lethean
[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project β training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped EUPL-1.2 so the ethical layer stays in the open.
- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Axioms (public domain): [Snider/ai-ethics](https://github.com/Snider/ai-ethics)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)
|