---
license: fair-noncommercial-research-license
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
    - Student
    - Research Graduate
    - AI researcher
    - AI developer/engineer
    - Reporter
    - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
- en
library_name: transformers
tags:
- facebook
- meta
- pytorch
---

# MobileLLM-P1 Model Card
We are introducing MobileLLM-P1 (also referred to as MobileLLM-Pro), a 1B-parameter foundational language model in the MobileLLM series, designed to deliver high-quality, efficient on-device inference across a wide range of general language modeling tasks. <br>
We open-source two variants of the model: a **pre-trained base model** along with **quantized checkpoints** for CPU and accelerator inference, as well as an **instruction-tuned version** that shows competitive performance against models in this size range on tasks such as tool calling, question answering, rewriting, and summarization.

<p align="center">🤗 &nbsp;&nbsp;<a href="https://huggingface.co/spaces/akhaliq/MobileLLM-Pro">Chat with MobileLLM-Pro</a></p>

## Key Features
- **Strong Pre-training Performance:** MobileLLM-Pro base achieves impressive pre-training results, outperforming Gemma 3 1B and Llama 3.2 1B by 5.7% and 7.9% on average, respectively, on reasoning, knowledge, and long-context retrieval benchmarks. This performance is achieved by pre-training on fewer than 2T fully open-source tokens.
- **128k Context Window:** The model supports up to 128k tokens, enabling long-context understanding for applications such as document summarization and information retrieval, implicitly learned from a large teacher model.
- **Efficient Long-Context Inference:** By interleaving local and global attention layers at a 3:1 ratio with a 512-token local attention window, MobileLLM-Pro reduces prefill latency by 1.8x* and lowers the KV cache size from 117MB to 40MB* compared to fully global attention, enabling faster and more memory-efficient inference (a rough estimate of these figures is sketched after this list). (*Assuming 8k context length)
- **Near Lossless int4 Quantization:** We provide int4 quantization-ready checkpoints for our pre-trained model with less than 1.3% quality degradation compared to floating point baselines:
    - CPU: int4 weights (group size 32), int8 dynamic activations, int8 KV cache, with only 0.4% regression.
    - Accelerators: int4 per-channel weights, with only 1.3% quality regression.
- **Instruction Fine-Tuned Model:** We provide a competitive instruction fine-tuned (IFT) model specializing in use-cases such as tool calling, question answering, rewriting and summarization.
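
The KV cache figures above can be roughly reproduced from the model configuration. The sketch below is a back-of-the-envelope estimate, not the exact accounting used in this card: it assumes an int8 KV cache, that roughly one in four of the 30 layers uses global attention, and a head dimension of 64 (1280 dims / 20 heads).

```python
# Rough KV cache estimate at 8k context (assumptions: int8 cache, 8 global layers).
n_layers, n_kv_heads, head_dim = 30, 4, 64
bytes_per_elem = 1          # int8 KV cache (assumed)
context, local_window = 8 * 1024, 512

def kv_bytes(tokens, layers):
    # K and V caches: 2 (K and V) x layers x tokens x kv_heads x head_dim x bytes
    return 2 * layers * tokens * n_kv_heads * head_dim * bytes_per_elem

fully_global = kv_bytes(context, n_layers)
n_global = 8                # assumed rounding of the 3:1 local:global interleave
lga = kv_bytes(context, n_global) + kv_bytes(local_window, n_layers - n_global)

print(f"fully global: {fully_global / 1e6:.0f} MB")   # ~126 MB (card reports 117MB)
print(f"local-global: {lga / 1e6:.0f} MB")            # ~39 MB  (card reports 40MB)
```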

MobileLLM-Pro sets a new standard for efficient, high-quality on-device language modeling. We invite the community to explore, evaluate, and build upon this model.

## Model Information
**Layers:** 30<br>
**Attention Heads:** 20<br>
**KV Heads:** 4<br>
**Dimension:** 1280<br>
**Hidden Dimension:** 6144<br>
**Vocabulary Size:** 202,048<br>
**Total Parameters:** 1,084M (1.08B)

**Input Modality:** Text<br>
**Output Modality:** Text<br>
**Languages:** English<br>

**Training Method:** Knowledge Distillation<br>
**Context Length:** 128k tokens<br>
**Teacher Model:** [Llama 4-Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E)<br>
**Loss Function:** KL Divergence<br>
**Quantization:** 16-bit, 4-bit<br>
**Other Features:** Shared Embeddings, Local-Global Attention
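
As a rough sanity check, the parameter count above is consistent with a standard Llama-style decoder built from these dimensions. The sketch below assumes a gated (SwiGLU-style) MLP, no biases, and input/output embeddings shared as stated above; apart from the shared embeddings, these architectural details are assumptions, not statements from this card.

```python
# Back-of-the-envelope parameter count from the configuration above.
dim, hidden, n_layers = 1280, 6144, 30
n_heads, n_kv_heads, vocab = 20, 4, 202_048
head_dim = dim // n_heads            # 64
kv_dim = n_kv_heads * head_dim       # 256

attn = 2 * dim * dim + 2 * dim * kv_dim   # q/o projections + k/v projections
mlp = 3 * dim * hidden                    # gate, up, and down projections (gated MLP)
per_layer = attn + mlp                    # ~27.5M, ignoring small norm parameters
embeddings = vocab * dim                  # counted once: input/output embeddings shared

total = n_layers * per_layer + embeddings
print(f"{total / 1e6:.0f}M parameters")   # ~1084M, matching the 1,084M reported above
```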

**Model Developer:** Meta Reality Labs <br>
**Model Release Date:**  October 2025 <br>
**License:** MobileLLM-Pro is FAIR NC licensed

## Results
### Base Pretrained Model
| Benchmark       | **P1 (FP)**   | **P1&#32; (Q-CPU)** | **P1 (Q-Acc)** | **Gemma 3 1B** | **Llama 3.2 1B** |
|-----------------|---------------|---------------------|----------------|----------------|------------------|
| HellaSwag       | **67.11%**    | 64.89%              | 65.10%         | 62.30%         | 65.69%           |
| BoolQ           | **76.24%**    | **77.49%**          | **76.36%**     | 63.20%         | 62.51%           |
| PIQA            | **76.55%**    | **76.66%**          | **75.52%**     | 73.80%         | 75.14%           |
| SocialIQA       | **50.87%**    | **51.18%**          | **50.05%**     | 48.90%         | 45.60%           |
| TriviaQA        | **39.85%**    | 37.26%              | 36.42%         | 39.80%         | 23.81%           |
| NatQ            | **15.76%**    | **15.43%**          | **13.19%**     | 9.48%          | 5.48%            |
| ARC-c           | **52.62%**    | **52.45%**          | **51.24%**     | 38.40%         | 38.28%           |
| ARC-e           | **76.28%**    | **76.58%**          | **75.73%**     | 73.00%         | 63.47%           |
| WinoGrande      | **62.83%**    | **62.43%**          | **61.96%**     | 58.20%         | 61.09%           |
| OBQA            | **43.60%**    | **44.20%**          | **40.40%**     |                | 37.20%           |
| NIH             | **100.00%**   | 96.44%              | **98.67%**     |                |                  |

FP = Full precision, bf16<br>
Q-CPU = int4, group-wise quantized (for CPU)<br>
Q-Acc = int4, channel-wise quantized (for Accelerators (ANE&HTP))

### Instruction Tuned Model
| Benchmark     | **P1 (IFT)** | **Gemma 3 1B (IFT)** | **Llama 3.2 1B (IFT)** |
|---------------|--------------|----------------------|------------------------|
| MMLU          | 44.8%        | 29.9%                | **49.3%**              |
| IFEval        | 62.0%        | **80.2%**            | 59.5%                  |
| MBPP          | **46.8%**    | 35.2%                | 39.6%                  |
| HumanEval     | **59.8%**    | 41.5%                | 37.8%                  |
| ARC-C         | **62.7%**    |                      | 59.4%                  |
| HellaSwag     | **58.4%**    |                      | 41.2%                  |
| BFCL v2       | **29.4%**    |                      | 25.7%                  |
| Open Rewrite  | **51.0%**    |                      | 41.6%                  |
| TLDR9+        | **16.8%**    |                      | **16.8%**              |

## Training Data

We constructed our datamix by selecting publicly available datasets that cover a range of domains. Using data-specific simulation runs, each dataset's contribution to the training process was carefully balanced by assigning it a specific sampling weight. These weights remained consistent throughout the base model pretraining and were informed by the extended work of [Automixer](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FbR5cAMAAAAJ&sortby=pubdate&citation_for_view=FbR5cAMAAAAJ:cFHS6HbyZ2cC) and additional ablation studies. <br>
The pre-training datamix primarily consists of a large educational web dataset, which makes up the vast majority of the training data. Smaller but significant portions come from coding data, mathematics, Wikipedia, scientific papers, Q&A forums, and algebraic content. In total, the datamix includes approximately 1.5 billion rows and 1.64 trillion tokens. <br>
For our instruction fine-tuning datamix, we focus on data diversity from existing open-source fine-tuning corpora. Specifically, we combine datasets for general instruction tuning with the chat, science, safety, coding, and math domains. For our final DPO phase, we rely on completely synthetic datasets.
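
The fixed sampling weights can be pictured as a categorical distribution over source datasets that stays constant for the entire pretraining run. The snippet below is purely illustrative: the dataset names and weights are hypothetical placeholders, not the actual MobileLLM-Pro mixture.

```python
import random

# Hypothetical datamix: names and weights are placeholders, not the real mixture.
datamix = {
    "educational_web": 0.70,
    "code": 0.10,
    "math": 0.08,
    "wikipedia": 0.05,
    "scientific_papers": 0.04,
    "qa_forums": 0.03,
}

def next_source(rng: random.Random) -> str:
    """Pick the source dataset of the next training document using the fixed weights."""
    names, weights = zip(*datamix.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([next_source(rng) for _ in range(5)])
```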

## Training Process
### Pretraining
Our general pre-training process contains three distinct phases using logit-based knowledge distillation from the [Llama 4-Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E) model and a novel model merging paradigm: 

**Phase 1 (KD)**: Language Learning – Learn general language skills from high-quality, well-balanced pre-training data <br>
**Phase 2 (KD)**: Long-context awareness – Extend the model context-length to 128k tokens using implicit positional distillation from the teacher model <br>
**Phase 3 (KD)**: Domain abilities – Acquire domain understanding through annealing of multiple models in parallel and merging the specialist models, resulting in improvements across a diverse range of domains
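
All three KD phases use the logit-based distillation objective mentioned above: a KL divergence between the teacher's and the student's next-token distributions. The sketch below illustrates that objective; the temperature and the exact reduction are assumptions, not details taken from this card.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions.

    Both logit tensors have shape [batch, seq_len, vocab]; the teacher logits come
    from the (frozen) Llama 4-Scout teacher, the student logits from MobileLLM-Pro.
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten batch and sequence so the loss is averaged over all token positions;
    # the temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(
        s_log_probs.flatten(0, 1), t_probs.flatten(0, 1), reduction="batchmean"
    ) * temperature ** 2
```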

![image](https://cdn-uploads.huggingface.co/production/uploads/68c1aa07c02e455d06f93a42/DpI3Yk1fxWA789N76fvjr.png)

On top of the three pre-training phases, we add a fourth phase of Quantization-Aware Training (QAT) for our 4-bit quantized model checkpoint.

### Instruction Fine-Tuning
We split the instruction fine-tuning stage into three distinct phases combining SFT and DPO methods:

**Phase 1 (SFT)**: Learn general instruction-following with a focus on data diversity <br>
**Phase 2 (SFT)**: Re-weight the Phase 1 data by domain to address its shortcomings (e.g., upsample code data to improve logical reasoning) <br>
**Phase 3 (SFT + DPO)**: Train and align the model for safety and self-identification

![image](https://cdn-uploads.huggingface.co/production/uploads/68c1aa07c02e455d06f93a42/wBAO_0Bu3dnCn8R2K9HXD.png)

## Quantization

![image/png](https://cdn-uploads.huggingface.co/production/uploads/68c1aa07c02e455d06f93a42/NJ_d8jyeVwkLIp9kwZRtR.png)

We apply Quantization Aware Training (QAT) to our baseline and instruction fine-tuned models, yielding quantization-ready checkpoints that can either be directly converted to integer datatype (with minimal quality loss) or used for QAT on additional data. We release two quantization-ready checkpoints:

- **4-bit groupwise weight quantization** with block size 32, 8-bit dynamic activations, and 8-bit kv-cache quantizations — optimized for CPU/GPU backends ([xnnpack](https://docs.pytorch.org/executorch/0.5/native-delegates-executorch-xnnpack-delegate.html)).
- **4-bit channelwise quantization** without activation quantization and 8-bit kv-cache quantizations — designed for edge hardware accelerators such as Apple Neural Engine ([ANE](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html)) and Qualcomm’s Hexagon Tensor Processor ([HTP](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_guidelines_int4_weights.html)).

Our QAT approach incorporates long-context awareness (up to 128k tokens) and self-knowledge distillation using the full-precision teacher model. We compared the QAT-trained model to a standard round-to-nearest Post-Training Quantization (PTQ) baseline. In the groupwise pre-training setting, we observe a 34% (absolute) regression in average benchmark score when using PTQ and only a 1.5% (absolute) regression for QAT. For instruction fine-tuning, we observe less than 1% average regression using QAT.
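
For reference, the round-to-nearest PTQ baseline discussed above corresponds roughly to applying the integer configuration directly to the full-precision checkpoint, with no quantization-aware training step. The sketch below reuses the same torchao config that appears in the QAT recipe further down; treat it as an illustration of the baseline, not as a recommended deployment path.

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, Int8DynamicActivationIntxWeightConfig
from torchao.quantization.granularity import PerGroup

# Round-to-nearest PTQ baseline: quantize the pre-trained weights directly,
# without the QAT prepare/train steps shown in the "How to use" section below.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/MobileLLM-Pro", subfolder="base", trust_remote_code=True
)
ptq_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
)
quantize_(model, ptq_config)  # by default this applies to the linear layers
```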

## How to use
### Full precision:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

login(token="<HF_TOKEN>")
MODEL_ID = "facebook/MobileLLM-Pro"

def generate(user_input: str, model, tokenizer, chat: bool) -> str:
    if chat:
        user_input = [{"role": "user", "content": user_input}]
        inputs = tokenizer.apply_chat_template(
            user_input, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
    else:
        inputs = tokenizer(user_input, return_tensors="pt")["input_ids"].to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def main():
    version = "instruct"  # "base" | "instruct"
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    prompt = "Why are open-source on-device language models great?"
    result = generate(prompt, model, tokenizer, chat=(version == "instruct"))
    print(result)

if __name__ == "__main__":
    main()

```

### Quantize Checkpoints

#### 4-bit Groupwise Quantization

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    QATConfig,
    IntxFakeQuantizeConfig,
)

model_id = "facebook/MobileLLM-Pro"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="base",  # or "instruct"
    trust_remote_code=True,
)

# Prepare for QAT.
# 8-bit dynamic per-token quantization for activations
activation_config = IntxFakeQuantizeConfig(
    torch.int8, "per_token", is_symmetric=False,
)
# 4-bit symmetric weight quantization with group size 32
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
embedding_qat_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
quantize_(
    model,
    QATConfig(
        weight_config=embedding_qat_config, 
        step="prepare",
    ),
    embedding_filter_fn
)

# The model is now ready for Quantization-Aware Training (QAT), e.g.:
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",  # replace with your output path
    safe_serialization=False,
)

# Convert model after training
from torchao.quantization import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig
)
from torchao.quantization.granularity import PerGroup

qat_convert_config = QATConfig(
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    ),
    step="convert",
)
quantize_(model, qat_convert_config)
embedding_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerGroup(32)
)
quantize_(
    model,
    QATConfig(
        embedding_convert_config,
        step="convert"
    ),
    embedding_filter_fn
)

# Save model after convert
model.save_pretrained(
    save_directory="<quantized_model_directory>",  # replace with your output path
    safe_serialization=False
)
```

#### 4-bit Channelwise Quantization

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.qat import (
    initialize_fake_quantizers,
    IntxFakeQuantizeConfig,
    QATConfig,
)

model_id = "facebook/MobileLLM-Pro"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="base",  # or "instruct"
    trust_remote_code=True,
)

# 4-bit per-channel with range_learning=True for weights
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    granularity=PerAxis(0),
    is_symmetric=True,
    is_dynamic=False,
    range_learning=True,
)
qat_config = QATConfig(
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
quantize_(model, qat_config, embedding_filter_fn)

# Initialize the fake quantizers for range-learning
example_inputs = (torch.tensor([[1]], dtype=torch.long),)
initialize_fake_quantizers(model, example_inputs)


# The model is now ready for Quantization-Aware Training (QAT), e.g.:
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",  # replace with your output path
    safe_serialization=False,
)

# Convert model after training
from torchao.quantization import IntxWeightOnlyConfig

wt_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0)
)
qat_convert_config = QATConfig(
    wt_convert_config,
    step="convert",
)
quantize_(model, qat_convert_config)
quantize_(model, qat_convert_config, embedding_filter_fn)

# Save model after convert
model.save_pretrained(
    save_directory="<quantized_model_directory>",  # replace with your output path
    safe_serialization=False
)
```

## Latency benchmarking

Latency benchmarking was done on a Samsung Galaxy S25 CPU and on the Samsung Galaxy S24 Hexagon Tensor Processor (HTP). Models were exported to ExecuTorch with the XNNPACK backend (for CPU) and the HTP backend (for the accelerator). The CPU model with 4-bit groupwise quantization is 590MB. The table below reports CPU and HTP prefill latency for input prompt lengths of 2k, 4k, and 8k tokens, along with decode speed when generating 1k tokens.

| Model / Prompt length     | 2k     | 4k     | 8k    |
|---------------------------|--------|--------|-------|
| CPU Prefill Latency (s)   | 8.9    | 24.8   | 63.5  |
| CPU Decode Speed (tok/s)  | 33.6   | 24.8   | 19.7  |
| HTP Prefill Latency (s)   | 1.96   | 3.38   | 9.82  |
| HTP Decode Speed (tok/s)  | 31.60  | 28.95  | 22.77 |
| KV Cache Size (MB)        | 14     | 23     | 40    |


To validate the benefit of interleaved local-global attention (LGA), we benchmark models across different prompt lengths and measure the speed-up in prefill & decode relative to using global attention at every layer:

![image](https://cdn-uploads.huggingface.co/production/uploads/68c1aa07c02e455d06f93a42/_p8JT_Wtljwyp23TmKsTc.png)


## Citation

```bibtex
@misc{mobilellm_pro,
  title={MobileLLM-Pro Model Card},
  author={Patrick Huber*, Ernie Chang*, Wei Wen*, Igor Fedorov*, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi**, Adithya Sagar**},
  year={2025},
  month={October},
  url={https://huggingface.co/facebook/MobileLLM-Pro}
}
```

## Contact

Patrick Huber, Meta Inc, Reality Labs ([patrickhuber@meta.com](mailto:patrickhuber@meta.com))<br>
Ernie Chang, Meta Inc, Reality Labs ([erniecyc@meta.com](mailto:erniecyc@meta.com))<br>
Wei Wen, Meta Inc, Reality Labs ([wewen@meta.com](mailto:wewen@meta.com))<br>
Igor Fedorov, Meta Inc, Reality Labs ([ifedorov@meta.com](mailto:ifedorov@meta.com))<br>
Raghuraman Krishnamoorthi, Meta Inc, Reality Labs ([raghuraman@meta.com](mailto:raghuraman@meta.com))<br>
Adithya Sagar, Meta Inc, Reality Labs ([adithyasagar@meta.com](mailto:adithyasagar@meta.com))

## Acknowledgements

We want to thank the team involved in this project, especially: Kimish Patel, Andrew Or, Min Guo, Shen Xu, Brian Moran, Maho Takahashi, Claire Lesage, Rylan Conway, Karan Chadha, Matthew Grange, Tomasz Wołcyrz, Shiv Desai, Amarlin Anand, Joele Sires, Robert Carrillo, Francisc Bungiu, Jayden Yu, AJ Brush, Yang Li, Samuel Selvan, Anand Sharma, Peng Shan, Anand Dass, Abhishek Sharma

## License

MobileLLM-Pro is distributed under the [FAIR NC license](https://huggingface.co/facebook/MobileLLM-Pro/blob/main/LICENSE).