---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

Phi-4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq) ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
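
For orientation, this layout roughly corresponds to the torchao post-training configs sketched below. This is an illustrative sketch only: the released checkpoint is produced with QAT as described in [Quantization Recipe](#quantization-recipe), and the config classes and arguments reflect our reading of the current torchao API, not the exact recipe used here.

```py
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
)

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct", dtype="auto")

# 8-bit dynamic activations + 2-bit weights for linear layers, per-row granularity
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int2,
    weight_granularity=PerAxis(0),
)
# 4-bit weight-only quantization for embeddings, per-row granularity
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0),
)

# quantize_ rewrites modules in place; the filter restricts a config to embeddings
quantize_(model, embedding_config, filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding))
quantize_(model, linear_config)
```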

We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. (The provided pte file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for doing this on iOS. On iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

# Quantization Recipe

First, install `uv` by following the instructions at https://docs.astral.sh/uv/getting-started/installation, then set up the environment:

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The script below finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Before running it:

1. Download the training script: `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. In the script below, set `dataset_name` to your desired dataset from the [Hugging Face datasets hub](https://huggingface.co/datasets) and set `max_steps` to the desired number of training steps.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5
TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun \
    --nproc-per-node $ngpu \
    --rdzv-id $SEED \
    --rdzv-backend c10d \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path microsoft/Phi-4-mini-instruct \
    --bf16 true \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name $dataset_name \
    --dataloader_num_workers 4 \
    --max_length 4096 \
    --max_steps $max_steps \
    --report_to tensorboard \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_bits 4 \
    --embed_pat '(lm_head|embed_tokens)'
```
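
For context, PARQ applies QAT entirely through the optimizer: a `QuantOptimizer` wraps a base optimizer and progressively pulls the selected weights onto a quantized grid via a proximal mapping. The sketch below is adapted from the PARQ README; the parameter split, learning rate, and annealing schedule are illustrative stand-ins, not the exact settings used in `qat_sft.py`.

```py
import re

import torch
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

# model and max_steps as in the training script above.
# Split parameters: 2-bit linear weights vs. everything else
# (qat_sft.py additionally quantizes embeddings to 4 bits).
params_quant, params_no_quant = [], []
for name, p in model.named_parameters():
    (params_quant if re.search(r"proj\.weight$", name) else params_no_quant).append(p)

param_groups = [
    {"params": params_quant, "quant_bits": 2},
    {"params": params_no_quant},
]
base_optimizer = torch.optim.AdamW(param_groups, lr=3e-5)

# UnifQuantizer defines the uniform 2-bit grid; ProxPARQ anneals weights onto it
optimizer = QuantOptimizer(
    base_optimizer,
    UnifQuantizer(),
    ProxPARQ(anneal_start=0, anneal_end=max_steps),
)
# optimizer is then used as a drop-in replacement in the training loop
```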

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.

## Generation from Quantized Model

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
model_path = "<SAVE_DIR>"  # the output_dir used in the QAT script above
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```
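
To try the released checkpoint rather than your own finetune, point `model_path` at this repo instead (this assumes the uploaded weights load through `transformers` as above):

```py
model_path = "pytorch/Phi-4-mini-instruct-parq-2w-4e-shared"
```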

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command used to produce the table below:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=$SAVE_DIR,dtype=auto \
    --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
    --output_path ${SAVE_DIR}/eval_results.json \
    --batch_size auto \
    --trust_remote_code
```

Note: exact numbers may vary slightly based on your machine's chosen batch size.

| | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |

# Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment.

To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:

```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```

(The above commands work on Arm-based Mac. On Arm-based Linux, define the following environment variables before running `pip install third-party/ao`: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)

Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend. (Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup; otherwise Python will use the local paths in the repo instead of the installed packages.)

```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to an ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
    --model "phi_4_mini" \
    --checkpoint pytorch_model_converted.bin \
    --params phi_4_mini_config.json \
    --output_name phi4_model_2bit.pte \
    -kv \
    --use_sdpa_with_kv_cache \
    --use-torchao-kernels \
    --max_context_length 1024 \
    --max_seq_length 256 \
    --dtype fp32 \
    --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (optional) Upload the pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte
```

Once you have the `.pte` file, you can run it in our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).