# LoftQ: LoRA-fine-tuning-aware Quantization

## Introduction

Given a pre-trained weight W, LoftQ finds a quantized LoRA initialization: a quantized backbone Q and low-rank adapters A and B, chosen jointly so that Q + AB^T stays close to W.
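For intuition, the method alternates between quantizing the current residual and refitting the adapters via truncated SVD (Algorithm 1 in the paper). Below is a minimal NumPy sketch of one such alternating step; the `quantize` argument is an assumed stand-in for a real quantize-dequantize routine (e.g., simulated NF4), not part of this repo.

```python
import numpy as np

def loftq_step(W, A, B, quantize, rank):
    """One alternating step of LoftQ (sketch only).

    `quantize` is an assumption for illustration: it maps a float matrix
    to its quantized-then-dequantized counterpart.
    """
    # Quantize the part of W that the current adapters do not capture.
    Q = quantize(W - A @ B.T)
    # Refit the adapters to the new quantization residual via truncated SVD,
    # so that Q + A @ B.T approximates W.
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    A = U[:, :rank] * np.sqrt(S[:rank])
    B = Vt[:rank, :].T * np.sqrt(S[:rank])
    return Q, A, B
```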
## Quick Start

Steps:

1. Apply LoftQ to a full-precision pre-trained weight and save.
2. Load the LoftQ initialization and train.

For step 1, we have provided off-the-shelf LoftQ initializations (see the [supported model list](#appendix-off-the-shelf-model-list))
in [Huggingface Hub LoftQ](https://huggingface.co/LoftQ).
If you want to do it yourself, jump to [LoftQ DIY](#loftq-diy).

For step 2, below is an example of loading a 4-bit Mistral-7B model with 64-rank LoRA adapters from Huggingface Hub.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Mistral-7B-v0.1-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
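As a quick sanity check, you can verify that only the LoRA adapter weights are marked trainable:

```python
# Only the LoRA adapters should be trainable; the 4-bit backbone stays frozen.
peft_model.print_trainable_parameters()
```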
## LoftQ DIY

### Apply LoftQ and save

We provide [quantize_save_load.py](quantize_save_load.py) as an example to apply LoftQ with
different bits (`--bits`), ranks (`--rank`), and alternating steps (`--iter`, a hyper-parameter of LoftQ; see Algorithm 1 in the [LoftQ paper](https://huggingface.co/papers/2310.08659)). Currently, this example supports
`llama-2`, `falcon`, `mistral`, `bart`, `t5`, `deberta`, `bert`, and `roberta`.

Below is an example of obtaining a 4-bit LLAMA-2-7b model with 16-rank LoRA adapters by 5 alternating steps.
```sh
SAVE_DIR="model_zoo/loftq/"
python quantize_save_load.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --token HF_TOKEN \
    --bits 4 \
    --iter 5 \
    --rank 16 \
    --save_dir $SAVE_DIR
```

Here `--model_name_or_path` is the high-precision model id on the Huggingface Hub, and `--token` is your Huggingface token, needed if the model is gated or private (e.g., llama-2).
The above command creates the model directory under `$SAVE_DIR`.
Specifically, the model directory is named as

`MODEL_DIR = SAVE_DIR + f"{args.model_name_or_path.split('/')[-1]}-{args.bits}bits-{args.rank}rank"`

In this example, `MODEL_DIR="model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"`. The quantized backbone is stored directly in `$MODEL_DIR`,
and the LoRA adapters are in the sub-folder `$MODEL_DIR/loftq_init`.
### Load and train

Similar to loading from Huggingface Hub, we only need to change `MODEL_ID` to `MODEL_DIR`.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_DIR = "model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_DIR,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
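After training, the adapters can be saved the usual PEFT way; the output path below is just an example, not one used by this repo:

```python
# Saves only the small LoRA adapter weights, not the quantized backbone.
peft_model.save_pretrained("model_zoo/loftq/Llama-2-7b-hf-4bit-16rank-trained")
```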
## LoftQ Fine-tuning

We also provide an example of fine-tuning a LoftQ-initialized model on GSM8K.
We load the quantized backbone and LoRA adapters from the [LoftQ Huggingface hub](https://huggingface.co/LoftQ).
```sh
python train_gsm8k_llama.py \
    --model_name_or_path LoftQ/Llama-2-13b-hf-4bit-64rank \
    --output_dir exp_results/gsm8k/llama-2-13b/bit4-rank64/lr1e-4 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 100 \
    --seed 202 \
    --dataset_name gsm8k \
    --dataset_config main \
    --pad_to_max_length \
    --max_source_length 128 \
    --max_target_length 256 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --with_tracking \
    --report_to tensorboard
```
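Once fine-tuning finishes, the trained adapter can be loaded back onto the same quantized backbone for evaluation. The adapter path below is illustrative and depends on how the training script saves its outputs; point it at wherever your run actually stored the adapter.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-13b-hf-4bit-64rank",
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
)
# Illustrative path: the adapter produced by the training run above.
peft_model = PeftModel.from_pretrained(base_model, "exp_results/gsm8k/llama-2-13b/bit4-rank64/lr1e-4")
peft_model.eval()
```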
## Appendix: Off-the-shelf Model List

| Model Name  | Bits | Ranks |
| ----------- | ---- | ----- |
| LLAMA-2-7b  | 4    | 64    |
| LLAMA-2-13b | 4    | 64    |
| LLAMA-2-70b | 4    | 64    |
| Mistral     | 4    | 64    |
| Mistral     | 4    | 32    |
| BART-large  | 4    | 8     |
| BART-large  | 4    | 16    |
| BART-large  | 4    | 32    |
| BART-large  | 2    | 8     |
## In-place application of LoftQ initialization

PEFT provides a convenience function `replace_lora_weights_loftq` to apply LoftQ initialization in-place to the quantized model. Check out [this notebook](https://github.com/huggingface/peft/blob/main/examples/loftq_finetuning/LoftQ_weight_replacement.ipynb) for an example.
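For reference, a minimal sketch of that workflow, assuming a model whose checkpoint is stored as safetensors (the model id and LoRA settings below are placeholders, not prescriptions): start from an ordinary 4-bit LoRA setup, then replace the freshly initialized adapter weights with LoftQ ones in-place.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # placeholder; requires a safetensors checkpoint

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
peft_model = get_peft_model(base_model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=16))
# Replace the default LoRA initialization with LoftQ initialization, in-place.
replace_lora_weights_loftq(peft_model)
```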