# Quantized Qwen Models (Educational Project)
## Introduction
This repository contains **quantized versions** of the [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) model.
The purpose of this project is **educational**: to demonstrate how to:
* Load a large model with Hugging Face `transformers`
* Apply **8-bit** and **4-bit** quantization using `bitsandbytes`
* Compare model memory footprints, generated outputs, and perplexity
* Save & upload quantized models to Hugging Face Hub for sharing
⚠️ **Disclaimer**:
These models are shared **only for learning purposes**. They are not optimized for production or guaranteed to be safe for downstream applications.
---
## Repository Structure
The repo includes three subfolders, each with its own model weights and tokenizer:
```
my-quantized-qwen/
├── qwen-base/   # Original FP16 Qwen3-1.7B
├── qwen-8bit/   # Quantized to 8-bit using bitsandbytes
└── qwen-4bit/   # Quantized to 4-bit (NF4, double quantization)
```
---
## How It Was Done
The workflow was executed in **Kaggle Notebooks** with GPU acceleration.
1. **Load Base Model**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"
# Load the original weights in FP16 (the reference copy kept in the repo) and move them to the GPU
base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
2. **Quantize to 8-bit & 4-bit**
```python
from transformers import BitsAndBytesConfig

# 8-bit quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_8bit,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# 4-bit quantization (NF4 with double quantization)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)
# Note: .to("cuda") is not supported for bitsandbytes-quantized models;
# device_map="auto" places the quantized weights on the GPU instead.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,
    device_map="auto",
)
```
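One quick way to see what quantization buys is to compare the in-memory size of the three loaded models. Below is a small sketch using `get_memory_footprint()` from `transformers`; exact numbers depend on the library versions in use.
```python
# Report the in-memory size of each variant (get_memory_footprint() returns bytes)
for name, m in [("base fp16", base_model), ("8-bit", model_8bit), ("4-bit", model_4bit)]:
    print(f"{name}: {m.get_memory_footprint() / 1e9:.2f} GB")
```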
3. **Save Models**
```python
model_4bit.save_pretrained("models/qwen-4bit")
model_8bit.save_pretrained("models/qwen-8bit")
base_model.save_pretrained("models/qwen-base")
tokenizer.save_pretrained("models/qwen-base")
tokenizer.save_pretrained("models/qwen-8bit")
tokenizer.save_pretrained("models/qwen-4bit")
```
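As an optional sanity check (not part of the numbered workflow above), the saved 4-bit checkpoint can be reloaded from disk; `save_pretrained` stores the quantization config alongside the weights, so on recent `transformers`/`bitsandbytes` versions it comes back as a quantized model.
```python
# Reload the saved 4-bit checkpoint; it is restored as a bitsandbytes 4-bit model
reloaded = AutoModelForCausalLM.from_pretrained("models/qwen-4bit", device_map="auto")
print(f"reloaded 4-bit: {reloaded.get_memory_footprint() / 1e9:.2f} GB")
```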
4. **Upload to Hugging Face Hub**
```python
from huggingface_hub import HfApi

api = HfApi(token="YOUR_HF_TOKEN")  # or authenticate once via `huggingface-cli login`

# Create the target repo if it does not exist yet, then push the whole folder
api.create_repo(repo_id="username/my-quantized-qwen", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="models",
    repo_id="username/my-quantized-qwen",
    repo_type="model",
)
```
---
## Observations
* **Base FP16 Model**: the reference point, with the largest memory footprint and the best output fidelity
* **8-bit Quantized**: roughly half the FP16 footprint, with little noticeable quality loss in casual generation
* **4-bit Quantized**: roughly a quarter of the FP16 footprint; quality degrades further, but the model remains usable for chat and demo purposes
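The notebook compared the variants on perplexity as well as raw size. Below is a minimal sketch of that kind of measurement; the `perplexity` helper and the sample sentence are illustrative, not the exact evaluation code from the notebook.
```python
import torch

def perplexity(model, tokenizer, text):
    # exp of the mean next-token cross-entropy the model assigns to `text`
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

sample = "Quantization trades a little accuracy for a much smaller memory footprint."
for name, m in [("base", base_model), ("8bit", model_8bit), ("4bit", model_4bit)]:
    print(name, round(perplexity(m, tokenizer, sample), 2))
```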
---
## Usage
Example of loading the 4-bit quantized model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantized checkpoints live in subfolders of the repo, so pass `subfolder`
# instead of appending the folder name to the repo id.
repo_id = "username/my-quantized-qwen"
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="qwen-4bit", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="qwen-4bit")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
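Qwen3-1.7B is an instruction-tuned model, so for dialogue-style prompts it is usually better to go through the tokenizer's chat template instead of a raw string. A short sketch (the message content is just an example):
```python
messages = [{"role": "user", "content": "Explain 4-bit quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```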
---
## License & Disclaimer
* Original model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
* This quantized version is redistributed under the **same license** as the original.
* ⚠️ **For educational and research purposes only. Not intended for production use.**