---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- lthn/lemer
base_model_relation: quantized
tags:
- gemma4
- lemma
- mlx
- 4bit
- apple-silicon
- multimodal
- on-device
- conversational
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
---
<!--
This content is subject to the European Union Public Licence (EUPL-1.2).
For full licence details, please refer to: https://huggingface.co/lthn/lemer-mlx/tree/main/LICENSE
Origin URL: https://huggingface.co/lthn/lemer-mlx/tree/main
-->
# Lemer (MLX Q4): Gemma 4 E2B + LEK
**On-device default MLX 4-bit quantised build of [lemer](https://huggingface.co/lthn/lemer)**: Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via `mlx-vlm`'s native quantisation (affine mode, group size 64). Full multimodal support is preserved (text, image, audio). Effective rate: **6.851 bits per weight on average** (embeddings and sensitive layers are kept at higher precision). This is the **default on-device variant**: smallest footprint, fastest inference, best suited to consumer Apple Silicon.
**Other formats in the Lemma family:**
| Repo | Format | Size | Use case |
|---|---|---|---|
| [lthn/lemer](https://huggingface.co/lthn/lemer) | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo: everything in one place |
| [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) | MLX BF16 | 10.2 GB | Full-precision reference |
| [lthn/lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) | MLX Q8 | 5.9 GB | Near-lossless quantised |
| [lthn/lemer-mlx](https://huggingface.co/lthn/lemer-mlx) | MLX Q4 | **4.1 GB** | **You are here**: on-device default |
| [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |
## What This Is
The **Lethean Ethical Kernel (LEK)** has been merged directly into the text attention projections (100 `q/k/v/o_proj` layers) of Gemma 4 E2B via a LoRA finetune, then folded into the base weights. The vision tower and audio tower are preserved unmodified from Google's upstream; LEK only shifts text reasoning.
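"Folded into the base weights" is the standard LoRA merge: the low-rank update `B @ A`, scaled by `alpha / r`, is added to each projection matrix once, so inference carries no adapter overhead. A minimal numpy sketch; the hidden size and `alpha` are illustrative assumptions, and only the rank (8) comes from this card:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size (illustrative), LoRA rank (per this card)
alpha = 16                        # LoRA scaling numerator (assumed, not stated here)

W = rng.standard_normal((d, d))   # a base q/k/v/o projection weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection

# Folding: add the scaled low-rank delta once, permanently.
W_merged = W + (alpha / r) * (B @ A)

# The merged matrix behaves exactly like base + adapter applied separately.
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

After folding, the checkpoint is a plain dense model, which is what gets quantised in the step below.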
This variant is **MLX Q4, quantised from the merged model**: the smallest, fastest multimodal Lemma variant, suitable for on-device inference on consumer Apple Silicon. A single safetensors file, ~4.1 GB. Quantisation is 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on an M3 Ultra at **145+ tokens/sec generation** via `mlx-lm`; vision inference tested against COCO sample images via `mlx-vlm` with accurate descriptions.
Use this variant when:
- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference with acceptable quality
- Memory budget is limited (~5 GB runtime peak)
For higher fidelity, use [lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) at 5.9 GB or [lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) at 10.2 GB.
## Quick Start
### mlx-lm (text)
```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```
### mlx-vlm (vision + audio multimodal)
```bash
uv tool install mlx-vlm
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)
output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
### mlx-vlm server (OpenAI-compatible API)
```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```
Any OpenAI-compatible client can then hit `http://localhost:8080/v1/chat/completions`, including LM Studio, pi-coding-agent, and OpenWebUI.
> **Note**: use `mlx_vlm.server` (not `mlx_lm.server`) because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.
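The request body is the standard chat-completions shape. A minimal Python sketch that builds one (the sampling values are this card's recommended defaults; how you send it, e.g. `curl` or an HTTP client library, is up to your stack):

```python
import json

def chat_payload(prompt: str, model: str = "lthn/lemer-mlx") -> dict:
    """Build a chat-completions request body for the local mlx_vlm server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # this card's recommended sampling defaults
        "top_p": 0.95,
    }

body = chat_payload("Summarise the EUPL-1.2 in one sentence.")
print(json.dumps(body, indent=2))
```

POST the JSON to `http://localhost:8080/v1/chat/completions` with a `Content-Type: application/json` header.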
## Recommended Sampling
Per Google's [Gemma 4 model card](https://huggingface.co/google/gemma-4-E2B-it), use these settings across all use cases. **Gemma 4 is calibrated for `temperature=1.0`; greedy decoding (`temperature=0`) is NOT recommended and will measurably underperform.**
| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
These values are already set in `generation_config.json`.
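For reference, the table maps onto the sampling keys of `generation_config.json`. A fragment sketch (the shipped file contains additional fields, and `do_sample` is an assumption about its exact contents):

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64
}
```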
## Model Details
| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E2B |
| **Format** | MLX Q4 (affine quantisation) |
| **Quantisation bits** | 4 (6.851 bits/weight average including full-precision layers) |
| **Quantisation group size** | 64 |
| **Parameters** | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| **Layers** | 35 text decoder layers |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Vision Encoder** | ~150M params (preserved unmodified from Google) |
| **Audio Encoder** | ~300M params (preserved unmodified from Google) |
| **Weight file** | Single `model.safetensors` (~4.1 GB) |
| **LEK delta** | LoRA rank 8 merged into 100 text attention projections, then quantised |
| **Quantisation source** | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| **Base fork** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) (unmodified Google fork) |
| **Licence** | EUPL-1.2 |
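As a sanity check, the weight-file size follows from the table's parameter count and average bit-rate (a rough estimate that reads the quoted ~4.1 GB as binary gigabytes and ignores file metadata overhead):

```python
params = 5.1e9      # total parameters (from the table)
avg_bits = 6.851    # average bits per weight, including higher-precision layers

size_bytes = params * avg_bits / 8
size_gib = size_bytes / 2**30
print(f"~{size_gib:.2f} GiB")   # ~4.07 GiB, close to the quoted ~4.1 GB file
```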
## Performance Notes
Verified on M3 Ultra (96 GB):
- **mlx-lm generation**: ~145 tokens/sec on text-only inference
- **Peak runtime memory**: ~3.4 GB (ample headroom for context growth)
- **Vision inference**: correct multi-object scene description on COCO test images
Should run comfortably on M1/M2/M3/M4 Air (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.
## Full Model Card
Detailed documentation (Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap) lives on the main repo:
**β†’ [lthn/lemer](https://huggingface.co/lthn/lemer)**
## About Lethean
[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project: a training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped under EUPL-1.2 so the ethical layer stays in the open.
- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Axioms (public domain): [Snider/ai-ethics](https://github.com/Snider/ai-ethics)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)