---
license: fair-noncommercial-research-license
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the [Meta Privacy
  Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
language:
- en
library_name: transformers
tags:
- facebook
- meta
- pytorch
---
# MobileLLM-P1 Model Card
We are introducing MobileLLM-P1, or MobileLLM-Pro, a 1B-parameter foundational language model in the MobileLLM series, designed to deliver high-quality, efficient on-device inference across a wide range of general language modeling tasks. <br>
We open-source two variants of the model: a **pre-trained base model** along with **quantized checkpoints** for CPU and accelerator inference, as well as an **instruction-tuned version** that shows competitive performance against models in this size range on tasks like tool calling, question answering, rewriting, and summarization.
<p align="center">🤗 <a href="https://huggingface.co/spaces/akhaliq/MobileLLM-Pro">Chat with MobileLLM-Pro</a></p>
## Key Features
- **Strong Pre-training Performance:** MobileLLM-Pro base achieves impressive pre-training results, outperforming Gemma 3 1B and Llama 3.2 1B by an average of 5.7% and 7.9%, respectively, on reasoning, knowledge, and long-context retrieval benchmarks. This performance is achieved by pre-training on fewer than 2T fully open-source tokens.
- **128k Context Window:** The model supports up to 128k tokens, enabling long-context understanding for applications such as document summarization and information retrieval, implicitly learned from a large teacher model.
- **Efficient Long-Context Inference:** Interleaving local and global attention layers at a 3:1 ratio with a 512-token local attention window, MobileLLM-Pro reduces prefill latency by 1.8x* and shrinks the KV cache from 117MB to 40MB* compared to fully global attention, enabling faster and more memory-efficient inference. (*Assuming 8k context length)
- **Near Lossless int4 Quantization:** We provide int4 quantization-ready checkpoints for our pre-trained model with at most 1.3% quality degradation compared to floating-point baselines:
- CPU: int4 weights (group size 32), int8 dynamic activations, int8 KV cache, with only 0.4% regression.
- Accelerators: int4 per-channel weights, with only 1.3% quality regression.
- **Instruction Fine-Tuned Model:** We provide a competitive instruction fine-tuned (IFT) model specializing in use-cases such as tool calling, question answering, rewriting and summarization.
MobileLLM-Pro sets a new standard for efficient, high-quality on-device language modeling. We invite the community to explore, evaluate, and build upon this model.
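The KV-cache savings quoted above can be sanity-checked from the model configuration (30 layers, 4 KV heads, head dimension 64). The sketch below is a back-of-envelope estimate, assuming an int8 KV cache and roughly 8 of the 30 layers using global attention under the 3:1 interleaving; the exact layer split and byte counts in the released model may differ.

```python
def kv_cache_bytes(ctx_len, n_layers_global, n_layers_local, local_window=512,
                   kv_heads=4, head_dim=64, bytes_per_elem=1):
    """Estimate KV cache size: two tensors (K and V) are stored per layer."""
    per_token = 2 * kv_heads * head_dim * bytes_per_elem
    global_part = n_layers_global * ctx_len * per_token
    local_part = n_layers_local * min(ctx_len, local_window) * per_token
    return global_part + local_part

MiB = 2 ** 20
# Fully global attention at 8k context
full = kv_cache_bytes(8192, n_layers_global=30, n_layers_local=0)
# 3:1 local:global interleaving (assumed ~8 global, ~22 local layers)
mixed = kv_cache_bytes(8192, n_layers_global=8, n_layers_local=22)
print(f"fully global: ~{full / MiB:.0f} MiB, interleaved: ~{mixed / MiB:.0f} MiB")
```

Under these assumptions the estimate lands near the reported numbers (about 120 MiB fully global vs. 38 MiB interleaved), illustrating why a small local window dominates the savings at long context.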
## Model Information
**Layers:** 30<br>
**Attention Heads:** 20<br>
**KV Heads:** 4<br>
**Dimension:** 1280<br>
**Hidden Dimension:** 6144<br>
**Vocabulary Size:** 202,048<br>
**Total Parameters:** 1,084M (1.08B)<br>
**Input Modality:** Text<br>
**Output Modality:** Text<br>
**Languages:** English<br>
**Training Method:** Knowledge Distillation<br>
**Context Length:** 128k tokens<br>
**Teacher Model:** [Llama 4-Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E)<br>
**Loss Function:** KL Divergence<br>
**Quantization:** 16-bit, 4-bit<br>
**Other Features:** Shared Embeddings, Local-Global Attention<br>
**Model Developer:** Meta Reality Labs <br>
**Model Release Date:** October 2025 <br>
**License:** MobileLLM-Pro is FAIR NC licensed
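The total parameter count can be roughly reproduced from the configuration above. The sketch below assumes a Llama-style decoder (grouped-query attention, SwiGLU feed-forward with three projection matrices, no biases, norm weights omitted) and a single embedding matrix shared between input and output, as the "Shared Embeddings" note suggests; these architectural details are assumptions for illustration.

```python
def approx_params(layers=30, dim=1280, hidden=6144, heads=20, kv_heads=4,
                  vocab=202_048):
    head_dim = dim // heads                  # 64
    embeddings = vocab * dim                 # shared input/output embedding
    attn = dim * dim                         # q_proj
    attn += 2 * dim * (kv_heads * head_dim)  # k_proj, v_proj (grouped-query)
    attn += dim * dim                        # o_proj
    ffn = 3 * dim * hidden                   # SwiGLU: gate, up, down projections
    return embeddings + layers * (attn + ffn)

print(f"~{approx_params() / 1e6:.0f}M parameters")
```

This comes out at roughly 1,084M, matching the reported total, with the shared embedding contributing about a quarter of the parameters.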
## Results
### Base Pretrained Model
| Benchmark | **P1 (FP)** | **P1  (Q-CPU)** | **P1 (Q-Acc)** | **Gemma 3 1B** | **Llama 3.2 1B** |
|-----------------|---------------|---------------------|----------------|----------------|------------------|
| HellaSwag | **67.11%** | 64.89% | 65.10% | 62.30% | 65.69% |
| BoolQ | **76.24%** | **77.49%** | **76.36%** | 63.20% | 62.51% |
| PIQA | **76.55%** | **76.66%** | **75.52%** | 73.80% | 75.14% |
| SocialIQA | **50.87%** | **51.18%** | **50.05%** | 48.90% | 45.60% |
| TriviaQA | **39.85%** | 37.26% | 36.42% | 39.80% | 23.81% |
| NatQ | **15.76%** | **15.43%** | **13.19%** | 9.48% | 5.48% |
| ARC-c | **52.62%** | **52.45%** | **51.24%** | 38.40% | 38.28% |
| ARC-e | **76.28%** | **76.58%** | **75.73%** | 73.00% | 63.47% |
| WinoGrande | **62.83%** | **62.43%** | **61.96%** | 58.20% | 61.09% |
| OBQA | **43.60%** | **44.20%** | **40.40%** | | 37.20% |
| NIH | **100.00%** | 96.44% | **98.67%** | | |
FP = full precision (bf16)<br>
Q-CPU = int4, group-wise quantized (for CPU)<br>
Q-Acc = int4, channel-wise quantized (for accelerators: ANE & HTP)
### Instruction Tuned Model
| Benchmark | **P1 (IFT)** | **Gemma 3 1B (IFT)** | **Llama 3.2 1B (IFT)** |
|---------------|--------------|----------------------|------------------------|
| MMLU | 44.8% | 29.9% | **49.3%** |
| IFEval | 62.0% | **80.2%** | 59.5% |
| MBPP | **46.8%** | 35.2% | 39.6% |
| HumanEval | **59.8%** | 41.5% | 37.8% |
| ARC-C | **62.7%** | | 59.4% |
| HellaSwag | **58.4%** | | 41.2% |
| BFCL v2 | **29.4%** | | 25.7% |
| Open Rewrite | **51.0%** | | 41.6% |
| TLDR9+ | **16.8%** | | **16.8%** |
## Training Data
We constructed our datamix by selecting publicly available datasets covering a range of domains. Each dataset's contribution to training was carefully balanced by assigning it a specific sampling weight, informed by data-specific simulation runs, the extended work of [Automixer](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FbR5cAMAAAAJ&sortby=pubdate&citation_for_view=FbR5cAMAAAAJ:cFHS6HbyZ2cC), and additional ablation studies; these weights remained fixed throughout base model pretraining. <br>
The pre-training datamix primarily consists of a large educational web dataset, which makes up the vast majority of the training data. Smaller but significant portions come from coding data, mathematics, Wikipedia, scientific papers, Q&A forums, and algebraic content. In total, the datamix includes approximately 1.5 billion rows and 1.64 trillion tokens. <br>
For our instruction fine-tuned data-mix, we focus on data diversity from existing open-source fine-tuning corpora. Specifically, we combine datasets for general instruction tuning with chat, science, safety, coding and math domains. For our final DPO phase, we rely on completely synthetic datasets.
## Training Process
### Pretraining
Our general pre-training process contains three distinct phases using logit-based knowledge distillation from the [Llama 4-Scout](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E) model and a novel model merging paradigm:
**Phase 1 (KD)**: Language Learning – Learn general language skills from high-quality, well balanced pre-training data <br>
**Phase 2 (KD)**: Long-context awareness – Extend the model context-length to 128k tokens using implicit positional distillation from the teacher model <br>
**Phase 3 (KD)**: Domain abilities – Acquire domain understanding through annealing of multiple models in parallel and merging the specialist models, resulting in improvements across a diverse range of domains

On top of the three pre-training phases, we add a fourth phase of Quantization-Aware Training (QAT) for our 4-bit quantized model checkpoint.
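The model information above lists KL divergence as the distillation loss. A minimal sketch of such a logit-based distillation loss in PyTorch is shown below; the temperature and reduction are illustrative assumptions, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits / t, dim=-1)
    # kl_div expects student log-probs as input and the teacher as target
    return F.kl_div(student_logp, teacher_logp, log_target=True,
                    reduction="batchmean") * (t * t)

logits = torch.randn(2, 8, 128)  # (batch, sequence, vocab)
loss = distillation_loss(logits, logits)  # zero when distributions match
```

In logit-based distillation, the student is trained against the teacher's full output distribution rather than one-hot labels, which is how the long-context and domain behavior of Llama 4-Scout can be transferred implicitly.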
### Instruction Fine-Tuning
We split the instruction fine-tuning stage into three distinct phases combining SFT and DPO methods:
**Phase 1 (SFT)**: Learn general instruction-following with a focus on data diversity <br>
**Phase 2 (SFT)**: Domain-weight the Phase 1 data given its shortcomings (e.g. upsample code data to improve logical reasoning) <br>
**Phase 3 (SFT + DPO)**: Train and align the model for safety and self-identification

## Quantization

We apply Quantization Aware Training (QAT) to our baseline and instruction fine-tuned models, yielding quantization-ready checkpoints that can either be directly converted to integer datatype (with minimal quality loss) or used for QAT on additional data. We release two quantization-ready checkpoints:
- **4-bit groupwise weight quantization** with group size 32, 8-bit dynamic activations, and 8-bit KV-cache quantization, optimized for CPU/GPU backends ([XNNPACK](https://docs.pytorch.org/executorch/0.5/native-delegates-executorch-xnnpack-delegate.html)).
- **4-bit channelwise weight quantization** without activation quantization and with 8-bit KV-cache quantization, designed for edge hardware accelerators such as the Apple Neural Engine ([ANE](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html)) and Qualcomm’s Hexagon Tensor Processor ([HTP](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/htp_guidelines_int4_weights.html)).
Our QAT approach incorporates long-context awareness (up to 128k tokens) and self-knowledge distillation using the full-precision teacher model. We compared the QAT-trained model to a standard round-to-nearest Post-Training Quantization (PTQ) baseline. In the groupwise pre-training setting, we observe a 34% (absolute) regression in average benchmark score when using PTQ and only a 1.5% (absolute) regression for QAT. For instruction fine-tuning, we observe less than 1% average regression using QAT.
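For intuition on the round-to-nearest PTQ baseline mentioned above, here is a minimal sketch of symmetric int4 groupwise weight quantization in plain PyTorch; it is a generic illustration of the technique, not the torchao implementation used for the released checkpoints.

```python
import torch

def rtn_int4_groupwise(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Round-to-nearest symmetric int4 quantize-dequantize, one scale per group."""
    groups = w.reshape(-1, group_size)
    # Map each group's absolute maximum onto the int4 level 7
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7
    q = torch.clamp(torch.round(groups / scale), -8, 7)  # int4 range [-8, 7]
    return (q * scale).reshape(w.shape)

w = torch.randn(64, 32)
w_deq = rtn_int4_groupwise(w)
max_err = (w - w_deq).abs().max()  # bounded by half the largest group scale
```

Pure round-to-nearest introduces a per-weight error of up to half a quantization step; QAT instead lets the network adapt its weights to the quantization grid during training, which is why the regression shrinks from 34% to 1.5% in the groupwise setting.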
## How to use
### Full precision:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

login(token="<HF_TOKEN>")

MODEL_ID = "facebook/MobileLLM-Pro"


def generate(user_input: str, model, tokenizer, chat: bool) -> str:
    if chat:
        messages = [{"role": "user", "content": user_input}]
        inputs = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
    else:
        inputs = tokenizer(user_input, return_tensors="pt")["input_ids"].to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def main():
    version = "instruct"  # "base" | "instruct"
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, subfolder=version
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    prompt = "Why are open-source on-device language models great?"
    result = generate(prompt, model, tokenizer, chat=(version == "instruct"))
    print(result)


if __name__ == "__main__":
    main()
```
### Quantize Checkpoints
#### 4-bit Groupwise Quantization
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    QATConfig,
    IntxFakeQuantizeConfig,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Prepare for QAT.
# 8-bit dynamic per-token fake quantization for activations
activation_config = IntxFakeQuantizeConfig(
    torch.int8, "per_token", is_symmetric=False,
)
# 4-bit symmetric fake quantization with group size 32 for weights
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# Also fake-quantize the (shared) embeddings, weight-only
embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
embedding_qat_config = IntxFakeQuantizeConfig(
    torch.int4,
    group_size=32,
    is_symmetric=True,
    is_dynamic=True,
)
quantize_(
    model,
    QATConfig(
        weight_config=embedding_qat_config,
        step="prepare",
    ),
    embedding_filter_fn,
)

# The model is now ready for Quantization-Aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",
    safe_serialization=False,
)

# Convert the model after training
from torchao.quantization import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
)
from torchao.quantization.granularity import PerGroup

qat_convert_config = QATConfig(
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    ),
    step="convert",
)
quantize_(model, qat_convert_config)

embedding_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerGroup(32),
)
quantize_(
    model,
    QATConfig(
        embedding_convert_config,
        step="convert",
    ),
    embedding_filter_fn,
)

# Save the model after conversion
model.save_pretrained(
    save_directory="<quantized_model_directory>",
    safe_serialization=False,
)
```
#### 4-bit Channelwise Quantization
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.qat import (
    initialize_fake_quantizers,
    IntxFakeQuantizeConfig,
    QATConfig,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# 4-bit per-channel fake quantization with range_learning=True for weights
weight_config = IntxFakeQuantizeConfig(
    torch.int4,
    granularity=PerAxis(0),
    is_symmetric=True,
    is_dynamic=False,
    range_learning=True,
)
qat_config = QATConfig(
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

embedding_filter_fn = lambda m, fqn: isinstance(m, torch.nn.Embedding)
quantize_(model, qat_config, embedding_filter_fn)

# Initialize the fake quantizers for range learning
example_inputs = (torch.tensor([[1]], dtype=torch.long),)
initialize_fake_quantizers(model, example_inputs)

# The model is now ready for Quantization-Aware Training (QAT)
# trainer.train()
model.save_pretrained(
    save_directory="<QAT_save_directory>",
    safe_serialization=False,
)

# Convert the model after training
from torchao.quantization import IntxWeightOnlyConfig

wt_convert_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0),
)
qat_convert_config = QATConfig(
    wt_convert_config,
    step="convert",
)
quantize_(model, qat_convert_config)
quantize_(model, qat_convert_config, embedding_filter_fn)

# Save the model after conversion
model.save_pretrained(
    save_directory="<quantized_model_directory>",
    safe_serialization=False,
)
```
## Latency benchmarking
Latency benchmarking was done on a Samsung Galaxy S25 CPU and a Samsung Galaxy S24 Hexagon Tensor Processor (HTP). Models were exported to ExecuTorch with the XNNPACK backend (for CPU) and the HTP backend (for the accelerator). The CPU model with 4-bit groupwise quantization is 590MB. The CPU and HTP prefill latencies for input prompt lengths of 2k, 4k and 8k, along with decode speeds for generating 1k tokens, are shown in the following table.
| Model / Prompt length | 2k | 4k | 8k |
|---------------------------|--------|--------|-------|
| CPU Prefill Latency (s) | 8.9 | 24.8 | 63.5 |
| CPU Decode Speed (tok/s) | 33.6 | 24.8 | 19.7 |
| HTP Prefill Latency (s) | 1.96 | 3.38 | 9.82 |
| HTP Decode Speed (tok/s) | 31.60 | 28.95 | 22.77 |
| KV Cache Size (MB) | 14 | 23 | 40 |
To validate the benefit of interleaved local-global attention (LGA), we benchmarked models across different prompt lengths and measured the prefill and decode speed-up relative to using global attention in every layer.
## Citation
```
@misc{mobilellm_pro,
  title={MobileLLM-Pro Model Card},
  author={Patrick Huber*, Ernie Chang*, Wei Wen*, Igor Fedorov*, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi**, Adithya Sagar**},
  year={2025},
  month={October},
  url={https://huggingface.co/facebook/MobileLLM-Pro}
}
```
## Contact
Patrick Huber, Meta Inc, Reality Labs ([patrickhuber@meta.com](mailto:patrickhuber@meta.com))<br>
Ernie Chang, Meta Inc, Reality Labs ([erniecyc@meta.com](mailto:erniecyc@meta.com))<br>
Wei Wen, Meta Inc, Reality Labs ([wewen@meta.com](mailto:wewen@meta.com))<br>
Igor Fedorov, Meta Inc, Reality Labs ([ifedorov@meta.com](mailto:ifedorov@meta.com))<br>
Raghuraman Krishnamoorthi, Meta Inc, Reality Labs ([raghuraman@meta.com](mailto:raghuraman@meta.com))<br>
Adithya Sagar, Meta Inc, Reality Labs (adithyasagar@meta.com)
## Acknowledgements
We want to thank the team involved in this project, especially: Kimish Patel, Andrew Or, Min Guo, Shen Xu, Brian Moran, Maho Takahashi, Claire Lesage, Rylan Conway, Karan Chadha, Matthew Grange, Tomasz Wołcyrz, Shiv Desai, Amarlin Anand, Joele Sires, Robert Carrillo, Francisc Bungiu, Jayden Yu, AJ Brush, Yang Li, Samuel Selvan, Anand Sharma, Peng Shan, Anand Dass, Abhishek Sharma
## License
MobileLLM-Pro is distributed under the [FAIR NC license](https://huggingface.co/facebook/MobileLLM-Pro/blob/main/LICENSE)