# Quantized Qwen Models (Educational Project)
## Introduction
This repository contains **quantized versions** of the [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) model.
The purpose of this project is **educational**: to demonstrate how to:
* Load a large model with Hugging Face `transformers`
* Apply **8-bit** and **4-bit** quantization using `bitsandbytes`
* Compare model memory footprints, generated outputs, and perplexity
* Save & upload quantized models to Hugging Face Hub for sharing
⚠️ **Disclaimer**:
These models are shared **only for learning purposes**. They are not optimized for production or guaranteed to be safe for downstream applications.
---
## Repository Structure
The repo includes three subfolders, each with its own model weights and tokenizer:
```
my-quantized-qwen/
├── qwen-base/   # Original FP16 Qwen3-1.7B
├── qwen-8bit/   # Quantized to 8-bit using bitsandbytes
└── qwen-4bit/   # Quantized to 4-bit (NF4, double quantization)
```
---
## How It Was Done
The workflow was executed in **Kaggle Notebooks** with GPU acceleration.
1. **Load Base Model**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"
# Load the original weights in FP16 (the reference copy kept in the repo) and move them to the GPU
base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
2. **Quantize to 8-bit & 4-bit**
```python
from transformers import BitsAndBytesConfig

# 8-bit quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_8bit,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# 4-bit quantization (NF4 with double quantization)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)
# Note: .to("cuda") is not supported for bitsandbytes-quantized models;
# device_map="auto" places the quantized weights on the GPU instead.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,
    device_map="auto",
)
```
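One quick way to see what quantization buys is to compare the in-memory size of the three loaded models. Below is a small sketch using `get_memory_footprint()` from `transformers`; exact numbers depend on the library versions in use.
```python
# Report the in-memory size of each variant (get_memory_footprint() returns bytes)
for name, m in [("base fp16", base_model), ("8-bit", model_8bit), ("4-bit", model_4bit)]:
    print(f"{name}: {m.get_memory_footprint() / 1e9:.2f} GB")
```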
3. **Save Models**
```python
model_4bit.save_pretrained("models/qwen-4bit")
model_8bit.save_pretrained("models/qwen-8bit")
base_model.save_pretrained("models/qwen-base")
tokenizer.save_pretrained("models/qwen-base")
tokenizer.save_pretrained("models/qwen-8bit")
tokenizer.save_pretrained("models/qwen-4bit")
```
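As an optional sanity check (not part of the numbered workflow above), the saved 4-bit checkpoint can be reloaded from disk; `save_pretrained` stores the quantization config alongside the weights, so on recent `transformers`/`bitsandbytes` versions it comes back as a quantized model.
```python
# Reload the saved 4-bit checkpoint; it is restored as a bitsandbytes 4-bit model
reloaded = AutoModelForCausalLM.from_pretrained("models/qwen-4bit", device_map="auto")
print(f"reloaded 4-bit: {reloaded.get_memory_footprint() / 1e9:.2f} GB")
```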
4. **Upload to Hugging Face Hub**
```python
from huggingface_hub import HfApi

api = HfApi(token="YOUR_HF_TOKEN")  # or authenticate once via `huggingface-cli login`

# Create the target repo if it does not exist yet, then push the whole folder
api.create_repo(repo_id="username/my-quantized-qwen", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="models",
    repo_id="username/my-quantized-qwen",
    repo_type="model",
)
```
---
## Observations
* **Base FP16 Model**: the reference point, with the largest memory footprint and the best output fidelity
* **8-bit Quantized**: roughly half the FP16 footprint, with little noticeable quality loss in casual generation
* **4-bit Quantized**: roughly a quarter of the FP16 footprint; quality degrades further, but the model remains usable for chat and demo purposes
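The notebook compared the variants on perplexity as well as raw size. Below is a minimal sketch of that kind of measurement; the `perplexity` helper and the sample sentence are illustrative, not the exact evaluation code from the notebook.
```python
import torch

def perplexity(model, tokenizer, text):
    # exp of the mean next-token cross-entropy the model assigns to `text`
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

sample = "Quantization trades a little accuracy for a much smaller memory footprint."
for name, m in [("base", base_model), ("8bit", model_8bit), ("4bit", model_4bit)]:
    print(name, round(perplexity(m, tokenizer, sample), 2))
```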
---
## Usage
Example of loading the 4-bit quantized model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantized checkpoints live in subfolders of the repo, so pass `subfolder`
# instead of appending the folder name to the repo id.
repo_id = "username/my-quantized-qwen"
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="qwen-4bit", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="qwen-4bit")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
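Qwen3-1.7B is an instruction-tuned model, so for dialogue-style prompts it is usually better to go through the tokenizer's chat template instead of a raw string. A short sketch (the message content is just an example):
```python
messages = [{"role": "user", "content": "Explain 4-bit quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```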
---
## License & Disclaimer
* Original model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
* This quantized version is redistributed under the **same license** as the original.
* ⚠️ **For educational and research purposes only. Not intended for production use.**