---
model-index:
- name: LFM2-8B-A1B — MLX 8-bit (Apple Silicon)
  results: []
language:
- en
tags:
- mlx
- apple-silicon
- liquidai
- lfm2
- moe
- transformer
- long-context
- instruct
- quantized
- 8bit
- mixture-of-experts
- coding
pipeline_tag: text-generation
library_name: mlx
license: other
license_name: lfm1.0
license_link: LICENSE
base_model:
- LiquidAI/LFM2-8B-A1B
---

# LFM2-8B-A1B — **MLX 8-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)  
**Upstream model:** [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B)  
**This repo (MLX 8-bit):** `mlx-community/LFM2-8B-A1B-8bit-MLX`

This repository provides an **Apple-Silicon-optimized MLX build** of **LFM2-8B-A1B** at **8-bit** quantization for fast, on-device inference.

---

## 🔎 What is LFM2-8B-A1B?

- **Architecture:** Mixture-of-Experts (**MoE**) Transformer.  
- **Size:** ~**8B total parameters** with **~1B active** per token (the “A1B” suffix commonly denotes *~1B active params*).  
- **Why MoE?** During generation, only a subset of experts is **activated per token**, reducing **compute per token** while keeping a larger total parameter pool for expressivity.

> **Important memory note (single-device inference):**  
> Although *compute per token* benefits from MoE (fewer **active** parameters), **the full set of experts still resides in memory** for typical single-GPU/CPU deployments. In practice this means **RAM usage scales with total parameters**, not with the smaller *active* count.

---

## 📦 What’s in this MLX build

- `config.json` (MLX), `mlx_model*.safetensors` (**8-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target platform: **macOS** on **Apple Silicon (M-series)** using **Metal/MPS**.
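
If you want to inspect these files locally before loading, the snippet below is a minimal sketch using `huggingface_hub`; the download location is simply whatever `snapshot_download` returns.

```python
# Download the MLX build and list its files (sizes in MB).
# Requires `pip install huggingface_hub`; the repo id matches this model card.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="mlx-community/LFM2-8B-A1B-8bit-MLX")

for f in sorted(Path(local_dir).iterdir()):
    print(f"{f.name:45s} {f.stat().st_size / 1e6:8.1f} MB")
```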

---

## ✅ Intended use

- General **instruction-following**, chat, and summarization
- **RAG** back-ends and long-context workflows on device
- **Function-calling / structured outputs** with schema-style prompts (see the sketch below)
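
As a quick illustration of the structured-output use case, here is a minimal sketch using the `mlx_lm` Python API. The JSON "schema" and prompt wording are illustrative assumptions, not part of the upstream model's tooling.

```python
# Structured-output sketch with the mlx_lm Python API.
# The schema and prompt below are illustrative; adapt them to your task.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/LFM2-8B-A1B-8bit-MLX")

schema = '{"task": "string", "assignee": "string", "due": "YYYY-MM-DD", "priority": "low|medium|high"}'
prompt = (
    "Extract a task from the message below. Answer with JSON only, matching this schema: "
    f"{schema}\n\n"
    "Message: Remind me to send the quarterly report to Dana by Friday; it's urgent."
)

# Instruct checkpoints usually ship a chat template; apply it if present.
if getattr(tokenizer, "chat_template", None):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )

print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```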

## ⚠️ Limitations

- Even at 8-bit, the **KV cache for long contexts** can dominate memory at high `max_tokens` or large batch sizes.  
- As with any quantization, small regressions versus FP16 can appear on intricate math/code tasks or edge-case formatting.

---

## 🔢 RAM planning (8-bit, MoE, MLX)

In the absence of measurements on your specific machine, the figures below are **practical planning numbers** derived from first principles and experience with MLX and similar MoE models. Treat them as **starting points** and validate on your hardware; a rough estimator is sketched after the component list below.

### Rule-of-thumb components

- **Weights:** `~ total_params × 1 byte` (8-bit). For 8B params → **~8.0 GB** baseline.  
- **Runtime overhead:** MLX graph + tensors + metadata → **~0.5–1.0 GB** typical.  
- **KV cache:** grows with **context_length × layers × heads × dtype**; often **1–3+ GB** for long contexts.
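
A back-of-the-envelope version of this estimate in plain Python. The layer/head/dimension values are placeholders rather than numbers read from this repo's `config.json`, so substitute the real values before trusting the output:

```python
# Rough peak-RAM estimate for an 8-bit MoE checkpoint (all experts resident).
# Architecture numbers below are placeholders; read the real values from
# config.json (num_hidden_layers, num_key_value_heads, head_dim, ...).

def estimate_peak_ram_gb(
    total_params_billion: float = 8.0,  # total (not active) parameters
    bits_per_weight: int = 8,           # this repo's quantization
    context_tokens: int = 8192,         # prompt + generated tokens
    n_layers: int = 32,                 # placeholder
    n_kv_heads: int = 8,                # placeholder
    head_dim: int = 128,                # placeholder
    kv_bytes: int = 2,                  # fp16 KV cache
    overhead_gb: float = 0.75,          # MLX graph/runtime overhead (rough)
) -> float:
    weights_gb = total_params_billion * bits_per_weight / 8
    # K and V per layer: tokens × kv_heads × head_dim × bytes, ×2 for K and V
    kv_gb = 2 * context_tokens * n_layers * n_kv_heads * head_dim * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb

print(f"~{estimate_peak_ram_gb():.1f} GB estimated peak")  # ≈ 9.8 GB with these placeholders
```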

### Indicative peak RAM (single prompt, batch = 1)

| Context window | Estimated peak RAM |
|---|---:|
| **4k tokens** | **~9.5–10.5 GB** |
| **8k tokens** | **~10.5–11.8 GB** |
| **16k tokens** | **~12.0–14.0 GB** |

> These ranges assume **8-bit** weights, **A1B MoE** (all experts resident), batch size = 1, and standard generation settings.  
> On lower windows (≤2k), you may see **~9–10 GB**. Larger windows or batches will increase KV-cache and peak RAM.
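
To validate these ranges on your own machine, a short run with `verbose=True` is usually enough: recent `mlx_lm` versions print throughput and peak-memory statistics at the end of generation. A minimal sketch, assuming the same Python API used elsewhere on this card:

```python
# Sanity-check actual memory use on your Mac.
# With verbose=True, recent mlx_lm versions print tokens/sec and peak memory.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/LFM2-8B-A1B-8bit-MLX")
generate(
    model,
    tokenizer,
    prompt="Write a 200-word overview of why on-device inference matters.",
    max_tokens=256,
    verbose=True,
)
```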

---

## 🧭 Choosing precision for LFM2-8B-A1B 

While this card is **8-bit**, teams often want a consistent lineup. If you later produce 6/5/4/3/2-bit MLX builds, here’s a practical guide (RAM figures are **indicative** for an 8B MoE LM; your results depend on context/batch):

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to choose |
|---|---:|:---:|---|---|
| **4-bit** | ~7–8 GB | 🔥🔥🔥 | Better detail retention | If 3-bit drops too much fidelity |
| **6-bit** | ~9–10.5 GB | 🔥🔥 | Near-max MLX quality | If you want accuracy under quant |
| **8-bit** *(this repo)* | **~9.5–12+ GB** | 🔥🔥 | **Highest** quality among quant tiers | When RAM allows and you want the most faithful outputs |

> **MoE caveat:** MoE **reduces compute per token**, but unless experts are **paged/partitioned** across devices and loaded on demand, **memory** still follows **total parameters**. On a single Mac, plan RAM as if the *whole 8B* parameter set is resident.
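
If you do produce those additional bit-widths yourself, the usual route is `mlx_lm`'s convert utility. The sketch below uses argument names from recent `mlx_lm` releases (`hf_path`, `quantize`, `q_bits`, `q_group_size`); verify them against your installed version, and note that the output directory name is only an example.

```python
# Hedged sketch: converting the upstream checkpoint to another MLX bit-width.
# Argument names follow recent mlx_lm releases; check your installed version.
from mlx_lm import convert

convert(
    hf_path="LiquidAI/LFM2-8B-A1B",
    mlx_path="LFM2-8B-A1B-4bit-MLX",  # local output directory (example name)
    quantize=True,
    q_bits=4,          # pick 2/3/4/6/8 for the tier you want
    q_group_size=64,   # common default group size
)
```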

---

## 🚀 Quickstart (CLI — MLX)

**Deterministic generation** (MLX targets the Metal GPU automatically on Apple Silicon, so no device flag is needed)
```bash
python -m mlx_lm.generate \
  --model mlx-community/LFM2-8B-A1B-8bit-MLX \
  --prompt "Summarize the following in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0