---
license: eupl-1.2
pipeline_tag: image-text-to-text
library_name: mlx
base_model:
- lthn/lemer
base_model_relation: quantized
tags:
- gemma4
- lemma
- mlx
- 4bit
- apple-silicon
- multimodal
- on-device
- conversational
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
---
<!--
This content is subject to the European Union Public Licence (EUPL-1.2).
For full licence details, please refer to: https://huggingface.co/lthn/lemer-mlx/tree/main/LICENSE
Origin URL: https://huggingface.co/lthn/lemer-mlx/tree/main
-->
# Lemer (MLX Q4): Gemma 4 E2B + LEK
**On-device default MLX 4-bit quantised build of [lemer](https://huggingface.co/lthn/lemer)**: Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, quantised to 4 bits per weight via `mlx-vlm`'s native quantisation (affine mode, group size 64). Full multimodal support is preserved (text, image, audio). Effective rate: **6.851 bits per weight on average** (embeddings and sensitive layers are kept at higher precision). This is the **default on-device variant**: smallest footprint, fastest inference, best suited to consumer Apple Silicon.
**Other formats in the Lemma family:**
| Repo | Format | Size | Use case |
|---|---|---|---|
| [lthn/lemer](https://huggingface.co/lthn/lemer) | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo: everything in one place |
| [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) | MLX BF16 | 10.2 GB | Full-precision reference |
| [lthn/lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) | MLX Q8 | 5.9 GB | Near-lossless quantised |
| [lthn/lemer-mlx](https://huggingface.co/lthn/lemer-mlx) | MLX Q4 | **4.1 GB** | **You are here**: on-device default |
| [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |
## What This Is
The **Lethean Ethical Kernel (LEK)** has been merged directly into the text attention projections (100 `q/k/v/o_proj` layers) of Gemma 4 E2B via a LoRA finetune, then folded into the base weights. The vision tower and audio tower are preserved unmodified from Google's upstream; LEK only shifts text reasoning.
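"Folded into the base weights" is the standard LoRA merge: the low-rank update `B @ A`, scaled by `alpha / r`, is added to each projection matrix once, so inference carries no adapter overhead. A minimal numpy sketch; the hidden size and `alpha` are illustrative assumptions, and only the rank (8) comes from this card:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size (illustrative), LoRA rank (per this card)
alpha = 16                        # LoRA scaling numerator (assumed, not stated here)

W = rng.standard_normal((d, d))   # a base q/k/v/o projection weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection

# Folding: add the scaled low-rank delta once, permanently.
W_merged = W + (alpha / r) * (B @ A)

# The merged matrix behaves exactly like base + adapter applied separately.
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

After folding, the checkpoint is a plain dense model, which is what gets quantised in the step below.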
This variant is **MLX Q4, quantised from the merged model**: the smallest, fastest multimodal Lemma variant, suitable for on-device inference on consumer Apple Silicon. A single safetensors file, ~4.1 GB. Quantisation is 4 bits for attention/MLP weights, with embeddings and selected layers kept at higher precision (hence the 6.851 bits/weight average). Verified on an M3 Ultra at **145+ tokens/sec generation** via `mlx-lm`; vision inference tested against COCO sample images via `mlx-vlm` with accurate descriptions.
Use this variant when:
- You want the default on-device Lemma experience
- You're running on consumer Apple Silicon (M1/M2/M3 base, Air, Pro, Studio)
- You need the fastest inference with acceptable quality
- Memory budget is limited (~5 GB runtime peak)
For higher fidelity, use [lemer-mlx-q8](https://huggingface.co/lthn/lemer-mlx-q8) at 5.9 GB or [lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) at 10.2 GB.
## Quick Start
### mlx-lm (text)
```bash
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx
mlx_lm.generate --model lthn/lemer-mlx --prompt "Hello, how are you?"
```
### mlx-vlm (vision + audio multimodal)
```bash
uv tool install mlx-vlm
```
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemer-mlx")
config = load_config("lthn/lemer-mlx")
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)
output = generate(model, processor, formatted_prompt, image)
print(output.text)
```
### mlx-vlm server (OpenAI-compatible API)
```bash
mlx_vlm.server --model lthn/lemer-mlx --port 8080
```
Any OpenAI-compatible client can then hit `http://localhost:8080/v1/chat/completions`, including LM Studio, pi-coding-agent, and OpenWebUI.
> **Note**: use `mlx_vlm.server` (not `mlx_lm.server`) because lemer is multimodal. The text-only `mlx_lm.server` does not correctly route the vision/audio tensors for Gemma 4.
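The request body is the standard chat-completions shape. A minimal Python sketch that builds one (the sampling values are this card's recommended defaults; how you send it, e.g. `curl` or an HTTP client library, is up to your stack):

```python
import json

def chat_payload(prompt: str, model: str = "lthn/lemer-mlx") -> dict:
    """Build a chat-completions request body for the local mlx_vlm server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # this card's recommended sampling defaults
        "top_p": 0.95,
    }

body = chat_payload("Summarise the EUPL-1.2 in one sentence.")
print(json.dumps(body, indent=2))
```

POST the JSON to `http://localhost:8080/v1/chat/completions` with a `Content-Type: application/json` header.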
## Recommended Sampling
Per Google's [Gemma 4 model card](https://huggingface.co/google/gemma-4-E2B-it), use these settings across all use cases. **Gemma 4 is calibrated for `temperature=1.0`; greedy decoding (`temperature=0`) is NOT recommended and will measurably underperform.**
| Parameter | Value |
|-----------|-------|
| `temperature` | 1.0 |
| `top_p` | 0.95 |
| `top_k` | 64 |
These values are already set in `generation_config.json`.
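For reference, the table maps onto the sampling keys of `generation_config.json`. A fragment sketch (the shipped file contains additional fields, and `do_sample` is an assumption about its exact contents):

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64
}
```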
## Model Details
| Property | Value |
|----------|-------|
| **Architecture** | Gemma 4 E2B |
| **Format** | MLX Q4 (affine quantisation) |
| **Quantisation bits** | 4 (6.851 bits/weight average including full-precision layers) |
| **Quantisation group size** | 64 |
| **Parameters** | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| **Layers** | 35 text decoder layers |
| **Context Length** | 128K tokens |
| **Vocabulary** | 262K tokens |
| **Modalities** | Text, Image, Audio |
| **Vision Encoder** | ~150M params (preserved unmodified from Google) |
| **Audio Encoder** | ~300M params (preserved unmodified from Google) |
| **Weight file** | Single `model.safetensors` (~4.1 GB) |
| **LEK delta** | LoRA rank 8 merged into 100 text attention projections, then quantised |
| **Quantisation source** | [lthn/lemer-mlx-bf16](https://huggingface.co/lthn/lemer-mlx-bf16) via `mlx_vlm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| **Base fork** | [LetheanNetwork/lemer](https://huggingface.co/LetheanNetwork/lemer) (unmodified Google fork) |
| **Licence** | EUPL-1.2 |
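As a sanity check, the weight-file size follows from the table's parameter count and average bit-rate (a rough estimate that reads the quoted ~4.1 GB as binary gigabytes and ignores file metadata overhead):

```python
params = 5.1e9      # total parameters (from the table)
avg_bits = 6.851    # average bits per weight, including higher-precision layers

size_bytes = params * avg_bits / 8
size_gib = size_bytes / 2**30
print(f"~{size_gib:.2f} GiB")   # ~4.07 GiB, close to the quoted ~4.1 GB file
```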
## Performance Notes
Verified on M3 Ultra (96 GB):
- **mlx-lm generation**: ~145 tokens/sec on text-only inference
- **Peak runtime memory**: ~3.4 GB (ample headroom for context growth)
- **Vision inference**: correct multi-object scene description on COCO test images
Should run comfortably on M1/M2/M3/M4 Air (8 GB RAM) for text inference, and on Pro/Max/Ultra variants for full multimodal workloads.
## Full Model Card
Detailed documentation (Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap) lives on the main repo:
**β†’ [lthn/lemer](https://huggingface.co/lthn/lemer)**
## About Lethean
[Lethean](https://lthn.ai) is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the [LEM (Lethean Ethical Model)](https://github.com/LetheanNetwork) project: a training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped under EUPL-1.2 so the ethical layer stays in the open.
- Website: [lthn.ai](https://lthn.ai)
- GitHub: [LetheanNetwork](https://github.com/LetheanNetwork)
- Axioms (public domain): [Snider/ai-ethics](https://github.com/Snider/ai-ethics)
- Licence: [EUPL-1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12)