---
model-index:
- name: >-
LFM2-8B-A1B — MLX (Apple Silicon), **8-bit** (with guidance on MoE + RAM
planning)
results: []
language:
- en
tags:
- mlx
- apple-silicon
- liquidai
- lfm2
- moe
- transformer
- long-context
- instruct
- quantized
- 8bit
- Mixture of Experts
- coding
pipeline_tag: text-generation
library_name: mlx
license: other
license_name: lfm1.0
license_link: LICENSE
base_model:
- LiquidAI/LFM2-8B-A1B
---
# LFM2-8B-A1B — **MLX 8-bit** (Apple Silicon)
**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)
**Upstream model:** [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B)
**This repo (MLX 8-bit):** `mlx-community/LFM2-8B-A1B-8bit-MLX`
This repository provides an **Apple-Silicon-optimized MLX build** of **LFM2-8B-A1B** at **8-bit** quantization for fast, on-device inference.
---
## 🔎 What is LFM2-8B-A1B?
- **Architecture:** Mixture-of-Experts (**MoE**) Transformer.
- **Size:** ~**8B total parameters** with **~1B active** per token (the “A1B” suffix commonly denotes *~1B active params*).
- **Why MoE?** During generation, only a subset of experts is **activated per token**, reducing **compute per token** while keeping a larger total parameter pool for expressivity.
> **Important memory note (single-device inference):**
> Although *compute per token* benefits from MoE (fewer **active** parameters), **the full set of experts still resides in memory** for typical single-GPU/CPU deployments. In practice this means **RAM usage scales with total parameters**, not with the smaller *active* count.
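The compute/memory split above can be made concrete with a back-of-envelope calculation. The figures below are illustrative, taken directly from the card's ~8B total / ~1B active claim:

```python
# Back-of-envelope: MoE compute vs. memory for LFM2-8B-A1B.
# Numbers are the card's approximate figures, not measured values.

TOTAL_PARAMS = 8e9    # all experts resident in memory
ACTIVE_PARAMS = 1e9   # parameters activated per token ("A1B")

# Per-token FLOPs scale with ACTIVE params: ~1/8 of a dense 8B model.
compute_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Weight memory scales with TOTAL params: every expert stays loaded.
memory_fraction = 1.0

print(f"Per-token compute: ~{compute_fraction:.1%} of a dense 8B model")
print(f"Weight memory:     {memory_fraction:.0%} (all experts resident)")
```

In short: MoE buys roughly an 8x reduction in per-token compute here, but zero reduction in resident weight memory on a single device.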
---
## 📦 What’s in this MLX build
- `config.json` (MLX), `mlx_model*.safetensors` (**8-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)
Target platform: **macOS** on **Apple Silicon (M-series)** using **Metal/MPS**.
---
## ✅ Intended use
- General **instruction-following**, chat, and summarization
- **RAG** back-ends and long-context workflows on device
- **Function-calling / structured outputs** with schema-style prompts
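For the structured-outputs use case, one common pattern is to embed a JSON schema directly in the prompt. The wrapper text and schema below are a hypothetical sketch, not an official LFM2 prompt format:

```python
import json

def build_schema_prompt(instruction: str, schema: dict) -> str:
    """Wrap a task instruction and a JSON schema into a schema-style prompt."""
    return (
        f"{instruction}\n\n"
        "Respond ONLY with JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}"
    )

# Illustrative schema for a structured summary.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "bullets": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "bullets"],
}

prompt = build_schema_prompt("Summarize the attached release notes.", schema)
```

The resulting `prompt` string can then be passed to the model (e.g., via the CLI in the Quickstart below); parse the reply with `json.loads` and validate it against the schema before trusting the fields.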
## ⚠️ Limitations
- Even at 8-bit, **long contexts** (KV-cache) can dominate memory at high `max_tokens` or large batch sizes.
- As with any quantization, small regressions vs FP16 can appear on intricate math/code or edge-formatting.
---
## 🔢 RAM planning (8-bit, MoE, MLX)
In the absence of measurements on your specific machine, below are **practical planning numbers** derived from first principles plus experience with MLX and similar MoE models. Treat them as **starting points** and validate on your hardware.
### Rule-of-thumb components
- **Weights:** `~ total_params × 1 byte` (8-bit). For 8B params → **~8.0 GB** baseline.
- **Runtime overhead:** MLX graph + tensors + metadata → **~0.5–1.0 GB** typical.
- **KV cache:** grows with **context_length × layers × heads × dtype**; often **1–3+ GB** for long contexts.
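The three components can be combined into a rough estimator. The layer/head/dim values below are **placeholders** for illustration, not the actual LFM2-8B-A1B configuration; substitute the real values from the upstream `config.json`:

```python
def estimate_peak_ram_gb(
    total_params: float = 8e9,      # MoE: all experts resident (see note above)
    bytes_per_weight: float = 1.0,  # 8-bit quantization
    overhead_gb: float = 0.75,      # MLX runtime, midpoint of the 0.5-1.0 GB range
    context_len: int = 8192,
    n_layers: int = 32,             # PLACEHOLDER: check the upstream config.json
    n_kv_heads: int = 8,            # PLACEHOLDER
    head_dim: int = 128,            # PLACEHOLDER
    kv_bytes: int = 2,              # fp16 KV cache
) -> float:
    """Rough peak-RAM estimate: weights + runtime overhead + KV cache."""
    weights_gb = total_params * bytes_per_weight / 1e9
    # K and V per layer: context x kv_heads x head_dim x dtype bytes, x2 for K+V.
    kv_gb = 2 * context_len * n_layers * n_kv_heads * head_dim * kv_bytes / 1e9
    return weights_gb + overhead_gb + kv_gb

print(f"~{estimate_peak_ram_gb():.1f} GB at 8k context")
```

With these placeholder values the 8k-context estimate lands near 10 GB, in the same neighborhood as the table below; the real attention configuration will shift the KV-cache term.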
### Indicative peak RAM (single text prompt, batch = 1)
| Context window | Estimated peak RAM |
|---|---:|
| **4k tokens** | **~9.5–10.5 GB** |
| **8k tokens** | **~10.5–11.8 GB** |
| **16k tokens** | **~12.0–14.0 GB** |
> These ranges assume **8-bit** weights, **A1B MoE** (all experts resident), batch size = 1, and standard generation settings.
> On lower windows (≤2k), you may see **~9–10 GB**. Larger windows or batches will increase KV-cache and peak RAM.
---
## 🧭 Choosing precision for LFM2-8B-A1B
While this card is **8-bit**, teams often want a consistent lineup. If you later produce 6/5/4/3/2-bit MLX builds, here’s a practical guide (RAM figures are **indicative** for an 8B MoE LM; your results depend on context/batch):
| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to choose |
|---|---:|:---:|---|---|
| **4-bit** | ~7–8 GB | 🔥🔥🔥 | Better detail retention | If 3-bit drops too much fidelity |
| **6-bit** | ~9–10.5 GB | 🔥🔥 | Near-max MLX quality | If you want accuracy under quant |
| **8-bit** *(this repo)* | **~9.5–12+ GB** | 🔥🔥 | **Highest** quality among quant tiers | When RAM allows and you want the most faithful outputs |
> **MoE caveat:** MoE **reduces compute per token**, but unless experts are **paged/partitioned** across devices and loaded on demand, **memory** still follows **total parameters**. On a single Mac, plan RAM as if the *whole 8B* parameter set is resident.
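The table above can be reduced to a simple selection rule. This is a hypothetical helper using the indicative RAM thresholds from the table as rough planning cutoffs, not benchmarks:

```python
def pick_variant(available_ram_gb: float) -> str:
    """Map available unified memory to a quantization tier.

    Thresholds follow the indicative table above (batch = 1, moderate context);
    leave headroom for the OS and other apps.
    """
    if available_ram_gb >= 12.0:
        return "8-bit"   # highest quality among quant tiers
    if available_ram_gb >= 10.5:
        return "6-bit"   # near-max MLX quality
    if available_ram_gb >= 8.0:
        return "4-bit"   # better detail retention than lower bits
    return "smaller-model"  # consider a smaller model or shorter context
```

For example, a 16 GB M-series Mac comfortably fits the 8-bit build at moderate context, while an 8 GB machine should target 4-bit or a smaller model.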
---
## 🚀 Quickstart (CLI — MLX)
**Deterministic generation**
```bash
python -m mlx_lm.generate \
  --model mlx-community/LFM2-8B-A1B-8bit-MLX \
  --prompt "Summarize the following in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```

MLX targets Apple's Metal backend automatically on M-series Macs; no device flag is needed. `--temp 0.0` with a fixed `--seed` gives deterministic output.