lvj committed on
Commit 106ffe2 · verified · 1 Parent(s): dab5c66

Upload README.md with huggingface_hub

Files changed (1): README.md (+18 −11)
README.md CHANGED
@@ -17,6 +17,14 @@ base_model:
pipeline_tag: text-generation
---

# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation
@@ -30,9 +38,7 @@ uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126

## QAT Finetuning with PARQ

- We apply QAT with a torchao optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq).
-
- The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets).

```bash
source ~/.uv-hf/bin/activate
@@ -40,6 +46,8 @@ source ~/.uv-hf/bin/activate
SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

ngpu=8
device_batch_size=4
grad_accum_steps=2
@@ -60,9 +68,8 @@ TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
--dataset_name $dataset_name \
--dataloader_num_workers 4 \
--max_length 4096 \
- --save_total_limit 1 \
--report_to tensorboard \
- --logging_steps 2 \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
@@ -74,17 +81,15 @@ TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
--embed_pat '(lm_head|embed_tokens)'
```

## Generation from Quantized Model

```py
import os

from huggingface_hub import whoami, get_token
- from transformers import (
-     AutoModelForCausalLM,
-     AutoTokenizer,
-     set_seed,
- )

set_seed(0)
model_path = f"{SAVE_DIR}"
@@ -113,6 +118,8 @@ print(output_text)

# Model Quality

Evaluation command for below table:
```bash
lm_eval \
@@ -123,7 +130,7 @@ lm_eval \
--batch_size auto \
--trust_remote_code
```
- Note that exact numbers may vary slightly based on your machine's chosen batch size.

| | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
 
pipeline_tag: text-generation
---

+ Phi-4-mini is quantized by the PyTorch team using an algorithm in torchao called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The model has 2-bit weight linears, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
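To make "2-bit weight linears" concrete: each weight row is mapped to one of 2² = 4 integer levels with a per-row scale. The following is a hedged stdlib sketch of that arithmetic only (the function name and the toy values are illustrative, not the torchao/PARQ implementation):

```python
# Illustrative round-trip for symmetric per-row low-bit quantization.
# NOT the torchao/PARQ code -- just the arithmetic behind "2-bit weights".

def quantize_row(row, bits):
    """Fake-quantize one weight row with a per-row scale."""
    qmax = 2 ** (bits - 1) - 1        # 2-bit -> integer levels -2..1
    qmin = -(2 ** (bits - 1))
    scale = max(abs(v) for v in row) / qmax
    codes = [min(max(round(v / scale), qmin), qmax) for v in row]
    dequant = [c * scale for c in codes]
    return codes, dequant

codes, deq = quantize_row([0.31, -0.07, 0.88, -0.52], bits=2)
print(codes)  # -> [0, 0, 1, -1]; every code fits in 2 bits
```

The same scheme with `bits=4` gives the 16 levels used for the embeddings.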
+ We provide the quantized pte for direct use in ExecuTorch. (The provided pte file is exported with a max_context_length of 1024. To change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

+ # Running in a Mobile App

+ The pte file can be run with ExecuTorch on a mobile phone. See the [instructions](https://docs.pytorch.org/executorch/0.7/llm/llama-demo-ios.html) for doing this on iOS. On an iPhone 15 Pro, the model runs at 27 tokens/sec and uses 1453 MB of memory.
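For intuition, the reported decode speed converts to per-token latency as follows (plain arithmetic on the number above, not a new measurement):

```python
# Per-token decode latency implied by the reported 27 tokens/sec.
tokens_per_sec = 27
ms_per_token = 1000 / tokens_per_sec
print(round(ms_per_token, 1))  # -> 37.0 ms per generated token
```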
# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation

## QAT Finetuning with PARQ

+ We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets), and set `max_steps` as well.
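PARQ applies quantization through the optimizer; the underlying QAT mechanic is fake quantization with a straight-through gradient, where the forward pass sees quantized weights while the update lands on a latent full-precision copy. A hedged toy sketch of one such step (the function names and quadratic toy loss are illustrative assumptions, not PARQ's API):

```python
# Toy QAT step with a straight-through estimator (STE).
# Illustrative only: PARQ does this inside an optimizer; these names
# and the toy loss below are assumptions, not torchao's API.

def fake_quant(w, scale):
    """Snap w to the 2-bit grid {-2, -1, 0, 1} * scale."""
    return min(max(round(w / scale), -2), 1) * scale

def qat_step(w_latent, scale, grad_at, lr=0.1):
    w_q = fake_quant(w_latent, scale)   # forward pass uses the quantized weight
    g = grad_at(w_q)                    # gradient evaluated at w_q ...
    return w_latent - lr * g            # ... applied to the latent weight (STE)

grad = lambda w: 2 * (w - 0.8)          # toy loss L(w) = (w - 0.8)**2
w = 0.0
for _ in range(30):
    w = qat_step(w, scale=0.5, grad_at=grad)
print(fake_quant(w, 0.5))  # -> 0.5, the nearest 2-bit grid point to 0.8
```

At export time only the quantized weights are kept, so training this way keeps the final low-bit model close to what the forward pass already saw.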
 
 
```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

+ dataset_name=<TODO>
+ max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2

--dataset_name $dataset_name \
--dataloader_num_workers 4 \
--max_length 4096 \
+ --max_steps $max_steps \
--report_to tensorboard \

--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \

--embed_pat '(lm_head|embed_tokens)'
```
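The parallelism settings above imply an effective global batch size; a quick sanity check (plain arithmetic, not part of the training script):

```python
# Effective global batch size from the launch settings above.
ngpu, device_batch_size, grad_accum_steps = 8, 4, 2
global_batch = ngpu * device_batch_size * grad_accum_steps
print(global_batch)  # -> 64 sequences per optimizer step
```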

+ To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.

## Generation from Quantized Model

```py
import os

from huggingface_hub import whoami, get_token
+ from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
model_path = f"{SAVE_DIR}"
 
# Model Quality

+ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:
```bash
lm_eval \

--batch_size auto \
--trust_remote_code
```
+ Note: exact numbers may vary slightly based on your machine's chosen batch size.

| | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |