# Quantized Qwen Models (Educational Project)

## 📌 Introduction

This repository contains **quantized versions** of the [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) model.
The purpose of this project is **educational**: to demonstrate how to

* Load a large model with Hugging Face `transformers`
* Apply **8-bit** and **4-bit** quantization using `bitsandbytes`
* Compare model memory footprints, generated outputs, and perplexity (see the sketch below)
* Save & upload quantized models to Hugging Face Hub for sharing
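
As a minimal illustration of the perplexity part of that comparison (the `model` and `tokenizer` variables and the sample text are placeholders for whichever variant is being evaluated):

```python
import torch

def perplexity(model, tokenizer, text="Quantization trades precision for memory."):
    # Perplexity = exp(mean token-level cross-entropy loss)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```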

⚠️ **Disclaimer**:
These models are shared **only for learning purposes**. They are not optimized for production or guaranteed to be safe for downstream applications.

---

## 📂 Repository Structure

The repo includes three subfolders, each with its own model weights and tokenizer:

```
my-quantized-qwen/
├── base/  → Original FP16 Qwen3-1.7B
├── 8bit/  → Quantized to 8-bit using bitsandbytes
└── 4bit/  → Quantized to 4-bit (NF4, double quantization)
```

---

## ⚙️ How It Was Done

The workflow was executed in **Kaggle Notebooks** with GPU acceleration.

1. **Load Base Model**

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"

# Load in FP16 so the "base" model matches the FP16 baseline described below
base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

2. **Quantize to 8-bit & 4-bit**

```python
from transformers import BitsAndBytesConfig

# 8-bit: weights stored as int8
config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_8bit,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# 4-bit: NF4 with double quantization, FP16 compute
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)
# Note: bitsandbytes-quantized models cannot be moved with .to("cuda");
# device_map places them on the GPU at load time instead.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,
    device_map="auto",
)
```
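
For background: NF4 ("4-bit NormalFloat") is the data type introduced with QLoRA, designed for weights that are approximately normally distributed, and double quantization additionally quantizes the quantization constants themselves, saving roughly 0.4 bits per parameter with negligible quality impact.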

3. **Save Models**

```python
# Folder names match the repository structure shown above
base_model.save_pretrained("models/base")
model_8bit.save_pretrained("models/8bit")
model_4bit.save_pretrained("models/4bit")

for folder in ("models/base", "models/8bit", "models/4bit"):
    tokenizer.save_pretrained(folder)
```

4. **Upload to Hugging Face Hub**

```python
from huggingface_hub import HfApi

api = HfApi(token="YOUR_HF_TOKEN")

# Create the target repo first (no-op if it already exists)
api.create_repo(repo_id="username/my-quantized-qwen", exist_ok=True)

api.upload_folder(
    folder_path="models",
    repo_id="username/my-quantized-qwen",
    repo_type="model",
)
```
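
A practical note: rather than hardcoding a token in the notebook, `huggingface_hub` can also pick up credentials from `huggingface_hub.login()` or the `HF_TOKEN` environment variable.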

---

## 📊 Observations

* **Base FP16 Model**: largest memory footprint (about 2 bytes per weight), best fidelity
* **8-bit Quantized**: roughly half the FP16 footprint, minimal quality loss
* **4-bit Quantized**: roughly a quarter of the FP16 footprint, still functional for chat/demo use (see the sketch below to measure this on your own run)
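
These are rough expectations rather than measured benchmarks; a quick sketch to check the actual numbers, reusing the `base_model`, `model_8bit`, and `model_4bit` objects from the steps above:

```python
# Compare weight memory as reported by transformers (bytes → GB)
for name, m in [("base", base_model), ("8bit", model_8bit), ("4bit", model_4bit)]:
    print(f"{name}: {m.get_memory_footprint() / 1e9:.2f} GB")
```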

---

## 🚀 Usage

Example of loading the 4-bit quantized model (the quantization config is saved with the checkpoint, so it loads quantized automatically):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/my-quantized-qwen"

# The 4-bit weights live in the 4bit/ subfolder of the repo
model = AutoModelForCausalLM.from_pretrained(
    repo_id, subfolder="4bit", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="4bit")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
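
Qwen3-1.7B is an instruction-tuned model that ships with a chat template, so generation usually works better when the prompt goes through `tokenizer.apply_chat_template`; a short sketch (the prompt text is illustrative):

```python
# Format a user turn with the model's built-in chat template
messages = [{"role": "user", "content": "Explain quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```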

---

## 📜 License & Disclaimer

* Original model: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
* This quantized version is redistributed under the **same license** as the original.
* ⚠️ **For educational and research purposes only. Not intended for production use.**
