---
license: fair-noncommercial-research-license
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
- en
library_name: transformers
tags:
- facebook
- meta
- pytorch
---
# MobileLLM-P1 Model Card
We are introducing MobileLLM-P1, or MobileLLM-Pro, a 1B-parameter foundational language model in the MobileLLM series, designed to deliver high-quality, efficient on-device inference across a wide range of general language modeling tasks. <br>
We open-source two variants of the model: a **pre-trained base model** along with **quantized checkpoints** for CPU and accelerator inference, as well as an **instruction-tuned version** that shows competitive performance against models in this size range on tasks like tool calling, question answering, rewriting, and summarization.
<p align="center">🤗 <a href="https://huggingface.co/spaces/akhaliq/MobileLLM-Pro">Chat with MobileLLM-Pro</a></p>
## Key Features
- **Strong Pre-training Performance:** MobileLLM-Pro base achieves impressive pre-training results, outperforming Gemma 3 1B and Llama 3.2 1B by an average of 5.7% and 7.9%, respectively, on reasoning, knowledge, and long-context retrieval benchmarks. This performance is achieved by pre-training on fewer than 2T fully open-source tokens.
- **128k Context Window:** The model supports up to 128k tokens, enabling long-context understanding for applications such as document summarization and information retrieval, implicitly learned from a large teacher model.
- **Efficient Long-Context Inference:** Interleaving local and global attention layers at a 3:1 ratio with a 512-token local attention window, MobileLLM-Pro reduces prefill latency by 1.8x* and shrinks the KV cache from 117MB to 40MB* compared to fully global attention, enabling faster and more memory-efficient inference. (*Assuming 8k context length)
- **Near Lossless int4 Quantization:** We provide int4 quantization-ready checkpoints for our pre-trained model with at most 1.3% quality degradation compared to floating-point baselines:
- CPU: int4 weights (group size 32), int8 dynamic activations, int8 KV cache, with only 0.4% regression.
- Accelerators: int4 per-channel weights, with only 1.3% quality regression.
- **Instruction Fine-Tuned Model:** We provide a competitive instruction fine-tuned (IFT) model specializing in use-cases such as tool calling, question answering, rewriting and summarization.
MobileLLM-Pro sets a new standard for efficient, high-quality on-device language modeling. We invite the community to explore, evaluate, and build upon this model.
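The KV-cache savings quoted above can be sanity-checked from the model configuration (30 layers, 4 KV heads, head dimension 64). The sketch below is a back-of-envelope estimate, assuming an int8 KV cache and roughly 8 of the 30 layers using global attention under the 3:1 interleaving; the exact layer split and byte counts in the released model may differ.

```python
def kv_cache_bytes(ctx_len, n_layers_global, n_layers_local, local_window=512,
                   kv_heads=4, head_dim=64, bytes_per_elem=1):
    """Estimate KV cache size: two tensors (K and V) are stored per layer."""
    per_token = 2 * kv_heads * head_dim * bytes_per_elem
    global_part = n_layers_global * ctx_len * per_token
    local_part = n_layers_local * min(ctx_len, local_window) * per_token
    return global_part + local_part

MiB = 2 ** 20
# Fully global attention at 8k context
full = kv_cache_bytes(8192, n_layers_global=30, n_layers_local=0)
# 3:1 local:global interleaving (assumed ~8 global, ~22 local layers)
mixed = kv_cache_bytes(8192, n_layers_global=8, n_layers_local=22)
print(f"fully global: ~{full / MiB:.0f} MiB, interleaved: ~{mixed / MiB:.0f} MiB")
```

Under these assumptions the estimate lands near the reported numbers (about 120 MiB fully global vs. 38 MiB interleaved), illustrating why a small local window dominates the savings at long context.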
## Model Information
**Layers:** 30<br>
**Attention Heads:** 20<br>
**KV Heads:** 4<br>
**Dimension:** 1280<br>
**Hidden Dimension:** 6144<br>
**Vocabulary Size:** 202,048<br>
**Total Parameters:** 1,084M (1.08B)<br>
**Input Modality:** Text<br>
**Output Modality:** Text<br>
**Languages:** English<br>
**Training Method:** Knowledge Distillation<br>
**Context Length:** 128k tokens<br>
**Teacher Model:** [Llama 4-Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E)<br>
**Loss Function:** KL Divergence<br>
**Quantization:** 16-bit, 4-bit<br>
**Other Features:** Shared Embeddings, Local-Global Attention<br>
**Model Developer:** Meta Reality Labs <br>
**Model Release Date:** October 2025 <br>
**License:** MobileLLM-Pro is FAIR NC licensed
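The total parameter count can be roughly reproduced from the configuration above. The sketch below assumes a Llama-style decoder (grouped-query attention, SwiGLU feed-forward with three projection matrices, no biases, norm weights omitted) and a single embedding matrix shared between input and output, as the "Shared Embeddings" note suggests; these architectural details are assumptions for illustration.

```python
def approx_params(layers=30, dim=1280, hidden=6144, heads=20, kv_heads=4,
                  vocab=202_048):
    head_dim = dim // heads                  # 64
    embeddings = vocab * dim                 # shared input/output embedding
    attn = dim * dim                         # q_proj
    attn += 2 * dim * (kv_heads * head_dim)  # k_proj, v_proj (grouped-query)
    attn += dim * dim                        # o_proj
    ffn = 3 * dim * hidden                   # SwiGLU: gate, up, down projections
    return embeddings + layers * (attn + ffn)

print(f"~{approx_params() / 1e6:.0f}M parameters")
```

This comes out at roughly 1,084M, matching the reported total, with the shared embedding contributing about a quarter of the parameters.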
## Results
### Base Pretrained Model
| Benchmark | **P1 (FP)** | **P1  (Q-CPU)** | **P1 (Q-Acc)** | **Gemma 3 1B** | **Llama 3.2 1B** |
|-----------------|---------------|---------------------|----------------|----------------|------------------|
| HellaSwag | **67.11%** | 64.89% | 65.10% | 62.30% | 65.69% |
| BoolQ | **76.24%** | **77.49%** | **76.36%** | 63.20% | 62.51% |
| PIQA | **76.55%** | **76.66%** | **75.52%** | 73.80% | 75.14% |
| SocialIQA | **50.87%** | **51.18%** | **50.05%** | 48.90% | 45.60% |
| TriviaQA | **39.85%** | 37.26% | 36.42% | 39.80% | 23.81% |
| NatQ | **15.76%** | **15.43%** | **13.19%** | 9.48% | 5.48% |
| ARC-c | **52.62%** | **52.45%** | **51.24%** | 38.40% | 38.28% |
| ARC-e | **76.28%** | **76.58%** | **75.73%** | 73.00% | 63.47% |
| WinoGrande | **62.83%** | **62.43%** | **61.96%** | 58.20% | 61.09% |
| OBQA | **43.60%** | **44.20%** | **40.40%** | | 37.20% |
| NIH | **100.00%** | 96.44% | **98.67%** | | |
FP = full precision (bf16)<br>
Q-CPU = int4, group-wise quantized (for CPU)<br>
Q-Acc = int4, channel-wise quantized (for accelerators: ANE & HTP)
### Instruction Tuned Model
| Benchmark | **P1 (IFT)** | **Gemma 3 1B (IFT)** | **Llama 3.2 1B (IFT)** |
|---------------|--------------|----------------------|------------------------|
| MMLU | 44.8% | 29.9% | **49.3%** |
| IFEval | 62.0% | **80.2%** | 59.5% |
| MBPP | **46.8%** | 35.2% | 39.6% |
| HumanEval | **59.8%** | 41.5% | 37.8% |
| ARC-C | **62.7%** | | 59.4% |
| HellaSwag | **58.4%** | | 41.2% |
| BFCL v2 | **29.4%** | | 25.7% |
| Open Rewrite | **51.0%** | | 41.6% |
| TLDR9+ | **16.8%** | | **16.8%** |
## Training Data
We constructed our datamix by selecting publicly available datasets covering a range of domains. Each dataset's contribution to training was carefully balanced by assigning it a specific sampling weight, informed by data-specific simulation runs, the extended work of [Automixer](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FbR5cAMAAAAJ&sortby=pubdate&citation_for_view=FbR5cAMAAAAJ:cFHS6HbyZ2cC), and additional ablation studies; these weights remained fixed throughout base model pretraining. <br>
The pre-training datamix primarily consists of a large educational web dataset, which makes up the vast majority of the training data. Smaller but significant portions come from coding data, mathematics, Wikipedia, scientific papers, Q&A forums, and algebraic content. In total, the datamix includes approximately 1.5 billion rows and 1.64 trillion tokens. <br>
For our instruction fine-tuned data-mix, we focus on data diversity from existing open-source fine-tuning corpora. Specifically, we combine datasets for general instruction tuning with chat, science, safety, coding and math domains. For our final DPO phase, we rely on completely synthetic datasets.
## Training Process
### Pretraining
Our general pre-training process contains three distinct phases using logit-based knowledge distillation from the [Llama 4-Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E) model and a novel model merging paradigm:
**Phase 1 (KD)**: Language Learning – Learn general language skills from high-quality, well balanced pre-training data <br>
**Phase 2 (KD)**: Long-context awareness – Extend the model context-length to 128k tokens using implicit positional distillation from the teacher model <br>
**Phase 3 (KD)**: Domain abilities – Acquire domain understanding through annealing of multiple models in parallel and merging the specialist models, resulting in improvements across a diverse range of domains

On top of the three pre-training phases, we add a fourth phase of Quantization-Aware Training (QAT) for our 4-bit quantized model checkpoint.
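The model information above lists KL divergence as the distillation loss. A minimal sketch of such a logit-based distillation loss in PyTorch is shown below; the temperature and reduction are illustrative assumptions, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits / t, dim=-1)
    # kl_div expects student log-probs as input and the teacher as target
    return F.kl_div(student_logp, teacher_logp, log_target=True,
                    reduction="batchmean") * (t * t)

logits = torch.randn(2, 8, 128)  # (batch, sequence, vocab)
loss = distillation_loss(logits, logits)  # zero when distributions match
```

In logit-based distillation, the student is trained against the teacher's full output distribution rather than one-hot labels, which is how the long-context and domain behavior of Llama 4-Scout can be transferred implicitly.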
### Instruction Fine-Tuning
We split the instruction fine-tuning stage into three distinct phases combining SFT and DPO methods:
**Phase 1 (SFT)**: Learn general instruction-following with a focus on data diversity <br>
**Phase 2 (SFT)**: Domain-weight the Phase 1 data given its shortcomings (e.g. upsample code data to improve logical reasoning) <br>
**Phase 3 (SFT + DPO)**: Train and align the model for safety and self-identification

## Quantization

We apply Quantization Aware Training (QAT) to our baseline and instruction fine-tuned models, yielding quantization-ready checkpoints that can either be directly converted to integer datatype (with minimal quality loss) or used for QAT on additional data. We release two quantization-ready checkpoints:
- **4-bit groupwise weight quantization** with group size 32, 8-bit dynamic activations, and 8-bit KV-cache quantization, optimized for CPU/GPU backends ([XNNPACK](https://docs.pytorch.org/executorch/0.5/native-delegates-executorch-xnnpack-delegate.html)).
- **4-bit channelwise weight quantization** without activation quantization and with 8-bit KV-cache quantization, designed for edge hardware accelerators such as the Apple Neural Engine ([ANE](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html)) and Qualcomm’s Hexagon Tensor Processor ([HTP](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_guidelines_int4_weights.html)).
Our QAT approach incorporates long-context awareness (up to 128k tokens) and self-knowledge distillation using the full-precision teacher model. We compared the QAT-trained model to a standard round-to-nearest Post-Training Quantization (PTQ) baseline. In the groupwise pre-training setting, we observe a 34% (absolute) regression in average benchmark score when using PTQ and only a 1.5% (absolute) regression for QAT. For instruction fine-tuning, we observe less than 1% average regression using QAT.
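For intuition on the round-to-nearest PTQ baseline mentioned above, here is a minimal sketch of symmetric int4 groupwise weight quantization in plain PyTorch; it is a generic illustration of the technique, not the torchao implementation used for the released checkpoints.

```python
import torch

def rtn_int4_groupwise(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Round-to-nearest symmetric int4 quantize-dequantize, one scale per group."""
    groups = w.reshape(-1, group_size)
    # Map each group's absolute maximum onto the int4 level 7
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7
    q = torch.clamp(torch.round(groups / scale), -8, 7)  # int4 range [-8, 7]
    return (q * scale).reshape(w.shape)

w = torch.randn(64, 32)
w_deq = rtn_int4_groupwise(w)
max_err = (w - w_deq).abs().max()  # bounded by half the largest group scale
```

Pure round-to-nearest introduces a per-weight error of up to half a quantization step; QAT instead lets the network adapt its weights to the quantization grid during training, which is why the regression shrinks from 34% to 1.5% in the groupwise setting.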
## How to use
### Full precision:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

login(token="<HF_TOKEN>")

MODEL_ID = "facebook/MobileLLM-Pro"


def generate(user_input: str, model, tokenizer, chat: bool) -> str:
    if chat:
        messages = [{"role": "user", "content": user_input}]
        inputs = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
    else:
        inputs = tokenizer(user_input, return_tensors="pt")["input_ids"].to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def main():
    version = "instruct"  # "base" | "instruct"
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    prompt = "Why are open-source on-device language models great?"
    result = generate(prompt, model, tokenizer, chat=(version == "instruct"))
    print(result)


if __name__ == "__main__":
    main()
```
### Quantize Checkpoints
#### 4-bit Groupwise Quantization
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    QATConfig,
    IntxFakeQuantizeConfig,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Prepare for QAT.
# 8-bit dynamic per-token fake quantization for activations
activation_config = IntxFakeQuantizeConfig(
    torch.int8, "per_token", is_symmetric=False,
)
# 4-bit symmetric fake quantization with group size 32 for weights
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# Also fake-quantize the (shared) embeddings, weight-only
embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
embedding_qat_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
quantize_(
    model,
    QATConfig(
        weight_config=embedding_qat_config,
        step="prepare",
    ),
    embedding_filter_fn,
)

# The model is now ready for Quantization-Aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",
    safe_serialization=False,
)

# Convert the model after training
from torchao.quantization import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
)
from torchao.quantization.granularity import PerGroup

qat_convert_config = QATConfig(
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    ),
    step="convert",
)
quantize_(model, qat_convert_config)

embedding_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerGroup(32),
)
quantize_(
    model,
    QATConfig(
        embedding_convert_config,
        step="convert",
    ),
    embedding_filter_fn,
)

# Save the model after conversion
model.save_pretrained(
    save_directory="<quantized_model_directory>",
    safe_serialization=False,
)
```
#### 4-bit Channelwise Quantization
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.qat import (
    initialize_fake_quantizers,
    IntxFakeQuantizeConfig,
    QATConfig,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# 4-bit per-channel fake quantization with range_learning=True for weights
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    granularity=PerAxis(0),
    is_symmetric=True,
    is_dynamic=False,
    range_learning=True,
)
qat_config = QATConfig(
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
quantize_(model, qat_config, embedding_filter_fn)

# Initialize the fake quantizers for range learning
example_inputs = (torch.tensor([[1]], dtype=torch.long),)
initialize_fake_quantizers(model, example_inputs)

# The model is now ready for Quantization-Aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",
    safe_serialization=False,
)

# Convert the model after training
from torchao.quantization import IntxWeightOnlyConfig

wt_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0),
)
qat_convert_config = QATConfig(
    wt_convert_config,
    step="convert",
)
quantize_(model, qat_convert_config)
quantize_(model, qat_convert_config, embedding_filter_fn)

# Save the model after conversion
model.save_pretrained(
    save_directory="<quantized_model_directory>",
    safe_serialization=False,
)
```
## Latency benchmarking
Latency benchmarking was done on a Samsung Galaxy S25 CPU and a Samsung Galaxy S24 Hexagon Tensor Processor (HTP). Models were exported to ExecuTorch with the XNNPACK backend (for CPU) and the HTP backend (for the accelerator). The CPU model with 4-bit groupwise quantization is 590MB. The CPU and HTP prefill latencies for input prompt lengths of 2k, 4k and 8k, along with decode speeds for generating 1k tokens, are shown in the following table.
| Model / Prompt length | 2k | 4k | 8k |
|---------------------------|--------|--------|-------|
| CPU Prefill Latency (s) | 8.9 | 24.8 | 63.5 |
| CPU Decode Speed (tok/s) | 33.6 | 24.8 | 19.7 |
| HTP Prefill Latency (s) | 1.96 | 3.38 | 9.82 |
| HTP Decode Speed (tok/s) | 31.60 | 28.95 | 22.77 |
| KV Cache Size (MB) | 14 | 23 | 40 |
To validate the benefit of interleaved local-global attention (LGA), we benchmarked models across different prompt lengths and measured the prefill and decode speed-up relative to using global attention in every layer.
## Citation
```
@misc{mobilellm_pro,
  title={MobileLLM-Pro Model Card},
  author={Patrick Huber*, Ernie Chang*, Wei Wen*, Igor Fedorov*, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi**, Adithya Sagar**},
  year={2025},
  month={October},
  url={https://huggingface.co/facebook/MobileLLM-Pro}
}
```
## Contact
Patrick Huber, Meta Inc, Reality Labs ([patrickhuber@meta.com](mailto:patrickhuber@meta.com))<br>
Ernie Chang, Meta Inc, Reality Labs ([erniecyc@meta.com](mailto:erniecyc@meta.com))<br>
Wei Wen, Meta Inc, Reality Labs ([wewen@meta.com](mailto:wewen@meta.com))<br>
Igor Fedorov, Meta Inc, Reality Labs ([ifedorov@meta.com](mailto:ifedorov@meta.com))<br>
Raghuraman Krishnamoorthi, Meta Inc, Reality Labs ([raghuraman@meta.com](mailto:raghuraman@meta.com))<br>
Adithya Sagar, Meta Inc, Reality Labs (adithyasagar@meta.com)
## Acknowledgements
We want to thank the team involved in this project, especially: Kimish Patel, Andrew Or, Min Guo, Shen Xu, Brian Moran, Maho Takahashi, Claire Lesage, Rylan Conway, Karan Chadha, Matthew Grange, Tomasz Wołcyrz, Shiv Desai, Amarlin Anand, Joele Sires, Robert Carrillo, Francisc Bungiu, Jayden Yu, AJ Brush, Yang Li, Samuel Selvan, Anand Sharma, Peng Shan, Anand Dass, Abhishek Sharma
## License
MobileLLM-Pro is distributed under the [FAIR NC license](https://huggingface.co/facebook/MobileLLM-Pro/blob/main/LICENSE)