---
license: mit
datasets:
- wikipedia
---

# BitLinear-phi-1.5

BitLinear-phi-1.5 is a model trained using part of the method described in [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764).

### Notice: our BitLinear layer applies only 1-bit quantization to the weights
### Other components from the paper (RMSNorm, activation quantization) are discarded.
Rationale: the paper's major contribution is a workable binary weight quantization; we do not mix it with the other components, so the main contribution can be evaluated in isolation.
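
For intuition, here is a minimal sketch of such a weight-only BitLinear: weights are binarized with a sign function and an absmean scale, and a straight-through estimator keeps the latent full-precision weights trainable. This is an illustration of the idea only; the actual layer in our repo may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BitLinear(nn.Linear):
    """Sketch of a weight-only 1-bit linear layer (illustrative, not the repo's exact code)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean()        # per-tensor scaling factor
        w_q = torch.sign(w) * scale   # binarize weights to {-scale, +scale}
        # Straight-through estimator: the forward pass uses the quantized weight,
        # the backward pass routes gradients to the latent full-precision weight.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)  # activations stay unquantized
```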

The model structure is from [phi-1.5](https://huggingface.co/microsoft/phi-1_5), with all linear layers except lm_head replaced with our custom BitLinear layer.
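
Conceptually, the replacement walks the module tree and swaps every nn.Linear (except lm_head) for a BitLinear that inherits the existing parameters, analogous to `keep_param=True` in the repo. A minimal sketch, assuming the BitLinear class above; the repo's `replace_linear_in_hf` does this for you:

```python
import torch.nn as nn


def replace_linear(module: nn.Module, skip: str = "lm_head") -> None:
    """Recursively swap nn.Linear submodules for BitLinear, skipping lm_head."""
    for name, child in module.named_children():
        if name == skip:
            continue
        if isinstance(child, nn.Linear):
            bit = BitLinear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            bit.weight.data.copy_(child.weight.data)  # keep the existing weights
            if child.bias is not None:
                bit.bias.data.copy_(child.bias.data)
            setattr(module, name, bit)
        else:
            replace_linear(child, skip)
```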

It was trained on a small subset of the [wikipedia](https://huggingface.co/datasets/wikipedia) dataset, for research validation purposes only.

```python
from datasets import load_dataset

# English Wikipedia dump, first 100k articles only
dataset = load_dataset("wikipedia", "20220301.en")
dataset = dataset['train'].select(range(int(1e5)))
```
Please note that the kernel is not yet optimized for 1-bit matrices.

The model was trained on a single RTX 3090 (24 GB) for 16 hours.

## For faster (3x) inference, see https://github.com/Mrw33554432/Bitlinear4HF and install the custom kernel
## For the training code, see https://github.com/Mrw33554432/Bitlinear4HF

The training code should be compatible with most LLMs on Hugging Face.

Starting training from pretrained weights (ordinary full-precision models) will not work due to gradient explosion.

## Sample inference code (slow)

```python
import torch
from replace_hf import replace_linear_in_hf
from transformers import AutoModelForCausalLM, AutoTokenizer


def quick_test(model, tokenizer, prompt: str):
    # Encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt")

    # Generate up to 64 tokens
    outputs = model.generate(inputs, max_length=64)

    # Decode and print the completion
    print(tokenizer.decode(outputs[0]))


torch.set_default_device("cuda")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Mrw33554432/bitLinear-phi-1.5", trust_remote_code=True, torch_dtype=torch.float16
)

print(model)
# Replace Linear layers with BitLinear
replace_linear_in_hf(model, keep_param=True)
print(model)

quick_test(model, tokenizer, prompt="Tom is the")
```