---
license: unknown
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
---

# Mistral LoRA - BitNet 1.58 Q&A Expert

This is a LoRA adapter fine-tuned from [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) on a custom Q&A dataset derived from the paper **"The Era of 1-bit LLMs" (BitNet b1.58)**.
## Model Details

- Base model: `mistralai/Mistral-7B-Instruct-v0.2`
- Fine-tuning: LoRA via PEFT on a 4-bit quantized base model (bitsandbytes)
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Rank: 8, Alpha: 16, Dropout: 0.05
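
For reference, below is a minimal sketch of how an adapter with these settings could be configured with `peft` and `bitsandbytes`. The actual training script is not part of this repo, and details such as the `nf4` quantization type and compute dtype are assumptions, not confirmed settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit via bitsandbytes (nf4 / compute dtype are assumed here).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# LoRA adapter matching the settings listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```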

## Dataset

Q&A pairs were auto-generated from the BitNet b1.58 paper; each instruction asks about architectural and performance details of 1-bit LLMs.
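
For illustration, each pair can be rendered into the same instruction/response template used in the Usage section below. The field names and the specific Q&A pair here are hypothetical:

```python
# Hypothetical example of one generated Q&A pair and its rendered training prompt.
example = {
    "instruction": "How does BitNet b1.58 represent its weights?",
    "response": "Each weight is ternary, taking one of the values -1, 0, or +1.",
}

prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Response:\n{example['response']}"
)
print(prompt)
```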

## Before vs. After Comparison

| Question | Base Model Output | Fine-tuned Model Output |
|----------|-------------------|-------------------------|
| What is a 1-bit LLM? | ❌ Talks about hardware cache lines | ✅ Correctly defines quantized LLM |
| How does BitNet b1.58 differ from standard 1-bit models? | ❌ Talks about legacy networking | ✅ Talks about ternary weights (-1, 0, 1) |
| At what size does it outperform FP16? | ❌ Refers to wrong paper | ✅ Refers to performance table |
| Why is it more memory/latency efficient? | ❌ Talks about DHT routing | ✅ Highlights no FP multiplication |
| Edge deployment and hardware design? | ❌ Talks about old protocols | ✅ References new 1-bit hardware potential |

## Usage

```python
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the base model and attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "ogflash/mistral-lora-qa-1bit")
tokenizer = AutoTokenizer.from_pretrained("ogflash/mistral-lora-qa-1bit")

# Prompt in the same instruction/response format used during fine-tuning.
prompt = "### Instruction:\nWhat is a 1-bit LLM?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
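
Optionally, the adapter weights can be merged into the base model for standalone inference. This is a sketch using PEFT's `merge_and_unload`; the output directory name is only an example.

```python
# Optional: merge the LoRA weights into the base model and save the result.
merged = model.merge_and_unload()
merged.save_pretrained("mistral-7b-bitnet-qa-merged")   # example output path
tokenizer.save_pretrained("mistral-7b-bitnet-qa-merged")
```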