pipeline_tag: text-generation
---

Phi4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the quantized pte file for direct use in ExecuTorch. (The provided pte file is exported with a max_context_length of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The pte file can be run with ExecuTorch on a mobile phone. See the [instructions](https://docs.pytorch.org/executorch/0.7/llm/llama-demo-ios.html) for doing this on iOS. On iPhone 15 Pro, the model runs at 27 tokens/sec and uses 1453 MB of memory.
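As a rough illustration of why 2-bit weights suit memory-constrained phones: four 2-bit values fit in a single byte. The sketch below is a hypothetical packing scheme for illustration only, not ExecuTorch's actual storage format.

```python
def pack_2bit(vals):
    """Pack symmetric 2-bit values in [-2, 1] into bytes, 4 per byte.

    Each value is offset to the unsigned range [0, 3] and placed in
    successive 2-bit slots of a byte, least-significant bits first.
    """
    assert len(vals) % 4 == 0
    out = []
    for i in range(0, len(vals), 4):
        b = 0
        for j, v in enumerate(vals[i:i + 4]):
            b |= (v + 2) << (2 * j)  # 2 bits per value
        out.append(b)
    return bytes(out)

# Four weights collapse into one byte: a 16x reduction vs. fp32.
print(pack_2bit([-2, -1, 0, 1]).hex())  # → e4
```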
# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets), and set `max_steps` to your desired number of training steps.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
...
TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
...
--dataset_name $dataset_name \
--dataloader_num_workers 4 \
--max_length 4096 \
--max_steps $max_steps \
--report_to tensorboard \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
...
--embed_pat '(lm_head|embed_tokens)'
```
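For reference, the launch settings above imply a global batch size of 64, assuming the training script combines them with the standard data-parallel times gradient-accumulation formula (an assumption, since the full script is not shown):

```python
# Effective global batch size implied by the launch settings above,
# assuming samples-per-step = ngpu * device_batch_size * grad_accum_steps.
ngpu = 8
device_batch_size = 4
grad_accum_steps = 2
global_batch_size = ngpu * device_batch_size * grad_accum_steps
print(global_batch_size)  # → 64
```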

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.

## Generation from Quantized Model

```py
import os

from huggingface_hub import whoami, get_token
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
model_path = f"{SAVE_DIR}"
...
print(output_text)
```
# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:
```bash
lm_eval \
...
--batch_size auto \
--trust_remote_code
```
Note: exact numbers may vary slightly based on your machine's chosen batch size.

| | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |