---
license: unknown
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
---

# Mistral LoRA - BitNet 1.58 Q&A Expert

This is a LoRA adapter fine-tuned from [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on a custom Q&A dataset derived from the paper **"The Era of 1-bit LLMs" (BitNet b1.58)**.
## Model Details

- Base model: `mistralai/Mistral-7B-Instruct-v0.2`
- Fine-tuning: LoRA via PEFT on a 4-bit quantized base model (bitsandbytes)
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Rank: 8, Alpha: 16, Dropout: 0.05
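
For reference, below is a minimal sketch of how an adapter with these settings could be configured with `peft` and `bitsandbytes`. The actual training script is not part of this repo, and details such as the `nf4` quantization type and compute dtype are assumptions, not confirmed settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit via bitsandbytes (nf4 / compute dtype are assumed here).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# LoRA adapter matching the settings listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```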

## Dataset

Q&A pairs were auto-generated from the BitNet b1.58 paper; each instruction asks about architectural and performance details of 1-bit LLMs.
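
For illustration, each pair can be rendered into the same instruction/response template used in the Usage section below. The field names and the specific Q&A pair here are hypothetical:

```python
# Hypothetical example of one generated Q&A pair and its rendered training prompt.
example = {
    "instruction": "How does BitNet b1.58 represent its weights?",
    "response": "Each weight is ternary, taking one of the values -1, 0, or +1.",
}

prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Response:\n{example['response']}"
)
print(prompt)
```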

## Before vs. After Comparison

| Question | Base Model Output | Fine-tuned Model Output |
|----------|-------------------|-------------------------|
| What is a 1-bit LLM? | ❌ Talks about hardware cache lines | ✅ Correctly defines quantized LLM |
| How does BitNet b1.58 differ from standard 1-bit models? | ❌ Talks about legacy networking | ✅ Talks about ternary weights (-1, 0, 1) |
| At what size does it outperform FP16? | ❌ Refers to wrong paper | ✅ Refers to performance table |
| Why is it more memory/latency efficient? | ❌ Talks about DHT routing | ✅ Highlights no FP multiplication |
| Edge deployment and hardware design? | ❌ Talks about old protocols | ✅ References new 1-bit hardware potential |

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the base model and attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "ogflash/mistral-lora-qa-1bit")
tokenizer = AutoTokenizer.from_pretrained("ogflash/mistral-lora-qa-1bit")

# Prompt in the same instruction/response format used during fine-tuning.
prompt = "### Instruction:\nWhat is a 1-bit LLM?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
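
Optionally, the adapter weights can be merged into the base model for standalone inference. This is a sketch using PEFT's `merge_and_unload`; the output directory name is only an example.

```python
# Optional: merge the LoRA weights into the base model and save the result.
merged = model.merge_and_unload()
merged.save_pretrained("mistral-7b-bitnet-qa-merged")   # example output path
tokenizer.save_pretrained("mistral-7b-bitnet-qa-merged")
```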