pipeline_tag: text-generation
---

Phi4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the quantized pte file for direct use in ExecuTorch. (The provided pte file is exported with a max_context_length of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The pte file can be run with ExecuTorch on a mobile phone. See the [instructions](https://docs.pytorch.org/executorch/0.7/llm/llama-demo-ios.html) for doing this on iOS. On iPhone 15 Pro, the model runs at 27 tokens/sec and uses 1453 MB of memory.
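As a rough illustration of why 2-bit weights suit memory-constrained phones: four 2-bit values fit in a single byte. The sketch below is a hypothetical packing scheme for illustration only, not ExecuTorch's actual storage format.

```python
def pack_2bit(vals):
    """Pack symmetric 2-bit values in [-2, 1] into bytes, 4 per byte.

    Each value is offset to the unsigned range [0, 3] and placed in
    successive 2-bit slots of a byte, least-significant bits first.
    """
    assert len(vals) % 4 == 0
    out = []
    for i in range(0, len(vals), 4):
        b = 0
        for j, v in enumerate(vals[i:i + 4]):
            b |= (v + 2) << (2 * j)  # 2 bits per value
        out.append(b)
    return bytes(out)

# Four weights collapse into one byte: a 16x reduction vs. fp32.
print(pack_2bit([-2, -1, 0, 1]).hex())  # → e4
```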
# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets), and set `max_steps` to your desired number of training steps.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
...
TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
...
--dataset_name $dataset_name \
--dataloader_num_workers 4 \
--max_length 4096 \
--max_steps $max_steps \
--report_to tensorboard \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
...
--embed_pat '(lm_head|embed_tokens)'
```
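For reference, the launch settings above imply a global batch size of 64, assuming the training script combines them with the standard data-parallel times gradient-accumulation formula (an assumption, since the full script is not shown):

```python
# Effective global batch size implied by the launch settings above,
# assuming samples-per-step = ngpu * device_batch_size * grad_accum_steps.
ngpu = 8
device_batch_size = 4
grad_accum_steps = 2
global_batch_size = ngpu * device_batch_size * grad_accum_steps
print(global_batch_size)  # → 64
```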

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.

## Generation from Quantized Model

```py
import os

from huggingface_hub import whoami, get_token
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
model_path = f"{SAVE_DIR}"
...
print(output_text)
```
# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:
```bash
lm_eval \
...
--batch_size auto \
--trust_remote_code
```
Note: exact numbers may vary slightly based on your machine's chosen batch size.

| | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |