---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

Phi-4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq) ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. (The provided pte file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for doing this in iOS. On iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The script below finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Do the following before running it:

1. `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets), and set `max_steps`.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=
max_steps=
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5

TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun \
    --nproc-per-node $ngpu \
    --rdzv-id $SEED \
    --rdzv-backend c10d \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path microsoft/Phi-4-mini-instruct \
    --bf16 true \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name $dataset_name \
    --dataloader_num_workers 4 \
    --max_length 4096 \
    --max_steps $max_steps \
    --report_to tensorboard \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_bits 4 \
    --embed_pat '(lm_head|embed_tokens)'
```

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.
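For context on what the script does under the hood: PARQ implements QAT entirely in the optimizer, so the model's modules are untouched, and a `QuantOptimizer` wraps a base optimizer to anneal selected weights toward the quantization grid. Below is a minimal sketch of that pattern, adapted from the PARQ README; the toy model, parameter split, and annealing schedule are illustrative and are not the exact settings used by `qat_sft.py`.

```py
import torch
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

# Toy model for illustration; qat_sft.py applies this to Phi-4-mini-instruct.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)

# Split params into those to quantize (2-bit, like this model's linears)
# and those left in full precision.
params_quant = [p for p in model.parameters() if p.dim() > 1]
params_no_quant = [p for p in model.parameters() if p.dim() <= 1]
param_groups = [
    {"params": params_quant, "quant_bits": 2},
    {"params": params_no_quant},
]

base_optimizer = torch.optim.AdamW(param_groups, lr=3e-5)

# ProxPARQ anneals weights onto the quantization grid between
# anneal_start and anneal_end steps (illustrative schedule).
prox_map = ProxPARQ(anneal_start=0, anneal_end=1000)
optimizer = QuantOptimizer(base_optimizer, UnifQuantizer(), prox_map)

# Training then proceeds as usual.
loss = model(torch.randn(2, 64)).sum()
loss.backward()
optimizer.step()
```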
## Generation from Quantized Model

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
SAVE_DIR = "checkpoints/phi-4-mini-2wei-4emb-<SEED>"  # your --output_dir from the QAT run
model_path = f"{SAVE_DIR}"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=$SAVE_DIR,dtype=auto \
    --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
    --output_path ${SAVE_DIR}/eval_results.json \
    --batch_size auto \
    --trust_remote_code
```

Note: exact numbers may vary slightly depending on the batch size chosen on your machine.

| Task | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |
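If you would rather drive the harness from Python than from the CLI, lm-evaluation-harness also exposes a `simple_evaluate` entry point. The sketch below is a rough equivalent of the command above under that assumption; the checkpoint path is a placeholder, and only the CLI invocation above has been verified against this model.

```py
import lm_eval

# Placeholder path; point this at your $SAVE_DIR checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/phi-4-mini-2wei-4emb-<SEED>,dtype=auto",
    tasks=["arc_easy", "arc_challenge", "boolq"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```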
# Exporting to ExecuTorch

⚠️ Note: these instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment. To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:

```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```

(The above commands work on an Arm-based Mac. On Arm-based Linux, define the following environment variables before pip installing third-party/ao: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)

Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend. (Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup, or Python will use the local paths in the repo instead of the installed packages.)

```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download the model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to an ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
    --model "phi_4_mini" \
    --checkpoint pytorch_model_converted.bin \
    --params phi_4_mini_config.json \
    --output_name phi4_model_2bit.pte \
    -kv \
    --use_sdpa_with_kv_cache \
    --use-torchao-kernels \
    --max_context_length 1024 \
    --max_seq_length 256 \
    --dtype fp32 \
    --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (optional) Upload pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte
```

Once you have the *.pte file, you can run it inside of our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).
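Before wiring the file into an app, you can optionally sanity-check that it loads with ExecuTorch's Python runtime bindings. This is a load-only sketch that assumes the `phi4_model_2bit.pte` produced above; it does not run LLM inference, which needs the tokenizer and KV-cache handling that the demo app provides.

```py
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("phi4_model_2bit.pte")

# A successful load that exposes a "forward" method is a quick smoke test.
print("Methods in program:", program.method_names)
```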