---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

Phi-4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq) ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. (The provided pte file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for doing this in iOS. On iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The script below finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Do the following before running it:

1. `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets), and set `max_steps`.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=
max_steps=
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5

TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun \
    --nproc-per-node $ngpu \
    --rdzv-id $SEED \
    --rdzv-backend c10d \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path microsoft/Phi-4-mini-instruct \
    --bf16 true \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name $dataset_name \
    --dataloader_num_workers 4 \
    --max_length 4096 \
    --max_steps $max_steps \
    --report_to tensorboard \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_bits 4 \
    --embed_pat '(lm_head|embed_tokens)'
```

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.
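For context on what the script does under the hood: PARQ implements QAT entirely in the optimizer, so the model's modules are untouched, and a `QuantOptimizer` wraps a base optimizer to anneal selected weights toward the quantization grid. Below is a minimal sketch of that pattern, adapted from the PARQ README; the toy model, parameter split, and annealing schedule are illustrative and are not the exact settings used by `qat_sft.py`.

```py
import torch
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

# Toy model for illustration; qat_sft.py applies this to Phi-4-mini-instruct.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)

# Split params into those to quantize (2-bit, like this model's linears)
# and those left in full precision.
params_quant = [p for p in model.parameters() if p.dim() > 1]
params_no_quant = [p for p in model.parameters() if p.dim() <= 1]
param_groups = [
    {"params": params_quant, "quant_bits": 2},
    {"params": params_no_quant},
]

base_optimizer = torch.optim.AdamW(param_groups, lr=3e-5)

# ProxPARQ anneals weights onto the quantization grid between
# anneal_start and anneal_end steps (illustrative schedule).
prox_map = ProxPARQ(anneal_start=0, anneal_end=1000)
optimizer = QuantOptimizer(base_optimizer, UnifQuantizer(), prox_map)

# Training then proceeds as usual.
loss = model(torch.randn(2, 64)).sum()
loss.backward()
optimizer.step()
```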
## Generation from Quantized Model

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
SAVE_DIR = "checkpoints/phi-4-mini-2wei-4emb-<SEED>"  # your --output_dir from the QAT run
model_path = f"{SAVE_DIR}"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=$SAVE_DIR,dtype=auto \
    --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
    --output_path ${SAVE_DIR}/eval_results.json \
    --batch_size auto \
    --trust_remote_code
```

Note: exact numbers may vary slightly depending on the batch size chosen on your machine.

| Task | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |
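If you would rather drive the harness from Python than from the CLI, lm-evaluation-harness also exposes a `simple_evaluate` entry point. The sketch below is a rough equivalent of the command above under that assumption; the checkpoint path is a placeholder, and only the CLI invocation above has been verified against this model.

```py
import lm_eval

# Placeholder path; point this at your $SAVE_DIR checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/phi-4-mini-2wei-4emb-<SEED>,dtype=auto",
    tasks=["arc_easy", "arc_challenge", "boolq"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```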
# Exporting to ExecuTorch

⚠️ Note: these instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment. To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:

```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```

(The above commands work on an Arm-based Mac. On Arm-based Linux, define the following environment variables before pip installing third-party/ao: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)

Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend. (Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup, or Python will use the local paths in the repo instead of the installed packages.)

```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download the model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to an ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
    --model "phi_4_mini" \
    --checkpoint pytorch_model_converted.bin \
    --params phi_4_mini_config.json \
    --output_name phi4_model_2bit.pte \
    -kv \
    --use_sdpa_with_kv_cache \
    --use-torchao-kernels \
    --max_context_length 1024 \
    --max_seq_length 256 \
    --dtype fp32 \
    --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (optional) Upload pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte
```

Once you have the *.pte file, you can run it inside of our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).
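Before wiring the file into an app, you can optionally sanity-check that it loads with ExecuTorch's Python runtime bindings. This is a load-only sketch that assumes the `phi4_model_2bit.pte` produced above; it does not run LLM inference, which needs the tokenizer and KV-cache handling that the demo app provides.

```py
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("phi4_model_2bit.pte")

# A successful load that exposes a "forward" method is a quick smoke test.
print("Methods in program:", program.method_names)
```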