---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

Phi-4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq) ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
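
For orientation, this layout roughly corresponds to the torchao post-training configs sketched below. This is an illustrative sketch only: the released checkpoint is produced with QAT as described in [Quantization Recipe](#quantization-recipe), and the config classes and arguments reflect our reading of the current torchao API, not the exact recipe used here.

```py
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
)

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct", dtype="auto")

# 8-bit dynamic activations + 2-bit weights for linear layers, per-row granularity
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int2,
    weight_granularity=PerAxis(0),
)
# 4-bit weight-only quantization for embeddings, per-row granularity
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    granularity=PerAxis(0),
)

# quantize_ rewrites modules in place; the filter restricts a config to embeddings
quantize_(model, embedding_config, filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding))
quantize_(model, linear_config)
```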

We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. (The provided pte file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for doing this on iOS. On iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

# Quantization Recipe

First, install `uv` by following the instructions at https://docs.astral.sh/uv/getting-started/installation, then set up the environment:

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The script below finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Before running it:

1. Download the training script: `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. In the script below, set `dataset_name` to your desired dataset from the [Hugging Face datasets hub](https://huggingface.co/datasets) and set `max_steps` to the desired number of training steps.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5
TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun \
    --nproc-per-node $ngpu \
    --rdzv-id $SEED \
    --rdzv-backend c10d \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path microsoft/Phi-4-mini-instruct \
    --bf16 true \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name $dataset_name \
    --dataloader_num_workers 4 \
    --max_length 4096 \
    --max_steps $max_steps \
    --report_to tensorboard \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_bits 4 \
    --embed_pat '(lm_head|embed_tokens)'
```
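
For context, PARQ applies QAT entirely through the optimizer: a `QuantOptimizer` wraps a base optimizer and progressively pulls the selected weights onto a quantized grid via a proximal mapping. The sketch below is adapted from the PARQ README; the parameter split, learning rate, and annealing schedule are illustrative stand-ins, not the exact settings used in `qat_sft.py`.

```py
import re

import torch
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

# model and max_steps as in the training script above.
# Split parameters: 2-bit linear weights vs. everything else
# (qat_sft.py additionally quantizes embeddings to 4 bits).
params_quant, params_no_quant = [], []
for name, p in model.named_parameters():
    (params_quant if re.search(r"proj\.weight$", name) else params_no_quant).append(p)

param_groups = [
    {"params": params_quant, "quant_bits": 2},
    {"params": params_no_quant},
]
base_optimizer = torch.optim.AdamW(param_groups, lr=3e-5)

# UnifQuantizer defines the uniform 2-bit grid; ProxPARQ anneals weights onto it
optimizer = QuantOptimizer(
    base_optimizer,
    UnifQuantizer(),
    ProxPARQ(anneal_start=0, anneal_end=max_steps),
)
# optimizer is then used as a drop-in replacement in the training loop
```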

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.

## Generation from Quantized Model

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
model_path = "<SAVE_DIR>"  # the output_dir used in the QAT script above
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```
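
To try the released checkpoint rather than your own finetune, point `model_path` at this repo instead (this assumes the uploaded weights load through `transformers` as above):

```py
model_path = "pytorch/Phi-4-mini-instruct-parq-2w-4e-shared"
```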

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command used to produce the table below:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=$SAVE_DIR,dtype=auto \
    --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
    --output_path ${SAVE_DIR}/eval_results.json \
    --batch_size auto \
    --trust_remote_code
```

Note: exact numbers may vary slightly based on your machine's chosen batch size.

| | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |

# Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment.

To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:

```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```

(The above commands work on Arm-based Mac. On Arm-based Linux, define the following environment variables before running `pip install third-party/ao`: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)

Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend. (Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup; otherwise Python will use the local paths in the repo instead of the installed packages.)

```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to an ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
    --model "phi_4_mini" \
    --checkpoint pytorch_model_converted.bin \
    --params phi_4_mini_config.json \
    --output_name phi4_model_2bit.pte \
    -kv \
    --use_sdpa_with_kv_cache \
    --use-torchao-kernels \
    --max_context_length 1024 \
    --max_seq_length 256 \
    --dtype fp32 \
    --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (optional) Upload the pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte
```

Once you have the `.pte` file, you can run it in our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).