---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- lthn/lemer
base_model_relation: quantized
tags:
- gemma4
- lemma
- mlx
- 4bit
- apple-silicon
- multimodal
- on-device
- conversational
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
---
<!--
This content is subject to the European Union Public Licence (EUPL-1.2).
For full licence details, please refer to: https://huggingface.co/lthn/lemer-mlx/tree/main/LICENSE
Origin URL: https://huggingface.co/lthn/lemer-mlx/tree/main
-->
# Lemer (MLX Q4) — Gemma 4 E2B + LEK

**Default on-device MLX 4-bit quantised build of [lemer](https://huggingface.co/lthn/lemer)** — Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via `mlx-vlm`'s native quantisation (affine mode, group size 64). Full multimodal support is preserved (text, image, audio). Effective rate: **6.851 bits per weight on average** (embeddings and sensitive layers are kept at higher precision). This is the **default on-device variant**: smallest footprint, fastest inference, best suited to consumer Apple Silicon.

**Other formats in the Lemma family:**

| Repo | Format | Size | Use case |
|---|---|---|---|
| [lthn/lemer](https://huggingface.co/lthn/lemer) | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo — everything in one place |
| [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) | MLX BF16 | 10.2 GB | Full-precision reference |
| [lthn/lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) | MLX Q8 | 5.9 GB | Near-lossless quantised |
| [lthn/lemer-mlx](https://huggingface.co/lthn/lemer-mlx) | MLX Q4 | **4.1 GB** | **You are here** — on-device default |
| [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |

## What This Is

The **Lethean Ethical Kernel (LEK)** has been merged directly into the text attention projections (100 `q/k/v/o_proj` layers) of Gemma 4 E2B via a LoRA finetune, then folded into the base weights. The vision and audio towers are preserved unmodified from Google's upstream — LEK only shifts text reasoning.
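
Folding a LoRA delta into base weights is a one-time matrix update; a minimal pure-Python sketch with toy shapes (illustrative only — the real merge runs per projection over the 100 `q/k/v/o_proj` matrices at rank 8):

```python
def matmul(A, B):
    """Naive matrix multiply: A is m x k, B is k x n."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def merge_lora(W, A, B, alpha, rank):
    """Fold a LoRA delta into a base weight matrix: W' = W + (alpha / rank) * B @ A.

    W: d_out x d_in base projection; B: d_out x r; A: r x d_in.
    After merging, the adapter is gone -- inference uses only W'.
    """
    scale = alpha / rank
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: 2x3 base weight, rank-1 adapter
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]       # d_out x r
A = [[0.1, 0.2, 0.3]]    # r x d_in
W_merged = merge_lora(W, A, B, alpha=1.0, rank=1)
```

Once merged, the result is an ordinary dense weight matrix, which is why the LEK delta survives quantisation like any other weight.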

This variant is **MLX Q4, quantised from the merged model** — the smallest, fastest multimodal Lemma variant, suited to on-device inference on consumer Apple Silicon. A single safetensors file, ~4.1 GB. Quantisation is 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on an M3 Ultra at **145+ tokens/sec generation** via `mlx-lm`; vision inference tested against COCO sample images via `mlx-vlm`, with accurate descriptions.

Use this variant when:
- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference with acceptable quality
- Memory budget is limited (~5 GB runtime peak)

For higher fidelity, use [lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) at 5.9 GB or [lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) at 10.2 GB.

## Quick Start

### mlx-lm (text)

```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```

### mlx-vlm (vision + audio multimodal)

```bash
uv tool install mlx-vlm
```

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")

image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

output = generate(model, processor, formatted_prompt, image)
print(output.text)
```

### mlx-vlm server (OpenAI-compatible API)

```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```

Any OpenAI-compatible client can then hit `http://localhost:8080/v1/chat/completions`: LM Studio, pi-coding-agent, OpenWebUI, and similar tools all work.

> **Note**: use `mlx_vlm.server` (not `mlx_lm.server`) because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.
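
A minimal stdlib-only client sketch against the server above (endpoint and port per the command shown; the model name is passed through as-is — adjust both if you changed them):

```python
import json
import urllib.request

def build_chat_request(prompt, model="lthn/lemer-mlx",
                       temperature=1.0, top_p=0.95):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }

def chat(prompt, base_url="http://localhost:8080/v1"):
    """POST to the server's OpenAI-compatible chat completions endpoint."""
    payload = build_chat_request(prompt)
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server to be running):
# print(chat("Hello, how are you?"))
```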

## Recommended Sampling

Per Google's [Gemma 4 model card](https://huggingface.co/google/gemma-4-E2B-it), use these settings across all use cases. **Gemma 4 is calibrated for `temperature=1.0` — greedy decoding (temperature=0) is NOT recommended and will measurably underperform.**

| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |

These defaults are already set in `generation_config.json`.
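
For reference, the corresponding `generation_config.json` fields look roughly like this (field names follow the Hugging Face `GenerationConfig` schema; check the shipped file for the authoritative values):

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64
}
```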

## Model Details

| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E2B |
| **Format** | MLX Q4 (affine quantisation) |
| **Quantisation bits** | 4 (6.851 bits/weight average including full-precision layers) |
| **Quantisation group size** | 64 |
| **Parameters** | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| **Layers** | 35 text decoder layers |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Vision Encoder** | ~150M params (preserved unmodified from Google) |
| **Audio Encoder** | ~300M params (preserved unmodified from Google) |
| **Weight file** | Single `model.safetensors` (~4.1 GB) |
| **LEK delta** | LoRA rank 8 merged into 100 text attention projections, then quantised |
| **Quantisation source** | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| **Base fork** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) (unmodified Google fork) |
| **Licence** | EUPL-1.2 |
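
Affine quantisation with group size 64 maps each run of 64 weights to small integers with a per-group scale and bias. A pure-Python sketch of the idea (illustrative only, not MLX's actual kernel):

```python
import random

def quantize_group(weights, bits=4):
    """Affine-quantise one group of weights to `bits`-bit integers.

    Returns (qvals, scale, bias) such that w is approximately q * scale + bias.
    """
    levels = (1 << bits) - 1                # 15 distinct steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0       # guard constant groups (hi == lo)
    bias = lo
    qvals = [min(levels, max(0, round((w - bias) / scale))) for w in weights]
    return qvals, scale, bias

def dequantize_group(qvals, scale, bias):
    return [q * scale + bias for q in qvals]

# Quantise a toy group of 64 weights (matching this build's q_group_size=64)
random.seed(0)
group = [random.uniform(-1, 1) for _ in range(64)]
q, s, b = quantize_group(group)
recon = dequantize_group(q, s, b)
max_err = max(abs(w - r) for w, r in zip(group, recon))
print(max_err <= s / 2 + 1e-9)  # True: error bounded by half a quantisation step
```

The per-group scale and bias are stored in higher precision alongside the 4-bit values, which — together with the full-precision embeddings and selected layers — is why the effective average lands at 6.851 bits/weight rather than 4.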

## Performance Notes

Verified on M3 Ultra (96 GB):
- **mlx-lm generation**: ~145 tokens/sec on text-only inference
- **Peak runtime memory**: ~3.4 GB (ample headroom for context growth)
- **Vision inference**: correct multi-object scene description on COCO test images

Should run comfortably on M1/M2/M3/M4 Air (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.

## Full Model Card

Detailed documentation — Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the roadmap — lives on the main repo:

**→ [lthn/lemer](https://huggingface.co/lthn/lemer)**

## About Lethean

[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project — a training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped under EUPL-1.2 so the ethical layer stays in the open.

- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Axioms (public domain): [Snider/ai-ethics](https://github.com/Snider/ai-ethics)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)