# Qwen3-Embedding-0.6B-Q8
This repository contains an 8-bit quantized version of Qwen3-Embedding-0.6B, produced with bitsandbytes. The goal of the quantization is to reduce the disk and runtime memory footprint while preserving embedding quality.

Quantization was performed with `BitsAndBytesConfig(load_in_8bit=True)`, following the currently recommended Hugging Face API (the `quantization_config` argument of `from_pretrained`).
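For reference, here is a minimal sketch of how such an export can be produced with the current API. The exact script was not published with this card; the base model ID, `device_map`, and output directory below are assumptions:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Load the FP16 base model with bitsandbytes 8-bit quantization enabled.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-Embedding-0.6B",     # assumed upstream base model
    quantization_config=bnb_config,
    device_map="auto",               # 8-bit loading expects a CUDA device
    torch_dtype=torch.float16,       # compute dtype for non-quantized ops
)

# Recent transformers/bitsandbytes releases can serialize the int8
# weights directly to safetensors.
model.save_pretrained("Qwen3-Embedding-0.6B-Q8_0-Safetensors")
```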
## Quantization Details
- Method: bitsandbytes 8-bit (LLM.int8())
- Library: bitsandbytes (bnb)
- Format: safetensors
- Precision:
  - Weights: int8
  - Activations / compute: FP16 or BF16, depending on hardware
- Intended use: low-memory inference on CUDA GPUs
The quantized weights are stored in the safetensors file and mapped back onto bitsandbytes' 8-bit linear layers (`Linear8bitLt`) when the model is loaded, as sketched below.
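A quick way to confirm this is to inspect the loaded module tree. A minimal sketch, assuming `model` has already been loaded as in the usage section below:

```python
import bitsandbytes as bnb

# Count the linear layers that were replaced by bitsandbytes' 8-bit kernel.
int8_layers = [
    name for name, module in model.named_modules()
    if isinstance(module, bnb.nn.Linear8bitLt)
]
print(f"Found {len(int8_layers)} Linear8bitLt layers")
```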
## Transformers Usage

```python
import torch
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds the last real token
    # for every sequence in the batch.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # Otherwise, gather each sequence's last non-padding position.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,            # enable 8-bit (LLM.int8) quantization
    llm_int8_threshold=6.0,       # default outlier threshold; safe value
    llm_int8_has_fp16_weight=False,
)
model_id = "ManiKumarAdapala/Qwen3-Embedding-0.6B-Q8_0-Safetensors"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModel.from_pretrained(model_id, quantization_config=bnb_config)

max_length = 1024  # truncation length for this example

input_texts = [
    "Gravity is a force that attracts two bodies towards each other. "
    "It gives weight to physical objects and is responsible for the "
    "movement of planets around the sun."
]
# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
# Move inputs onto the same device as the model
batch_dict = batch_dict.to(model.device)

with torch.no_grad():
    outputs = model(**batch_dict)

embeddings_ = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])

# Normalize embeddings to unit length
embeddings = F.normalize(embeddings_, p=2, dim=1)
print(embeddings, embeddings.shape)  # (batch_size, 1024)
```
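After L2 normalization, a dot product between embeddings equals cosine similarity, so retrieval scoring reduces to a matrix product. A short follow-up sketch reusing the objects defined above (the query and document strings are illustrative):

```python
# Hypothetical retrieval example: score one query against two documents.
texts = [
    "What force keeps planets in orbit around the sun?",               # query
    "Gravity is a force that attracts two bodies towards each other.",
    "Photosynthesis converts sunlight into chemical energy.",
]
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=max_length, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**batch)
emb = F.normalize(
    last_token_pool(out.last_hidden_state, batch["attention_mask"]), p=2, dim=1
)

scores = emb[0] @ emb[1:].T  # cosine similarity via dot product
print(scores)                # the gravity sentence should score higher
```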
## Memory & Performance
Approximate size comparison:
| Model Type | Disk Size | Runtime RAM |
|---|---|---|
| FP16 | ~1.2 GB | ~2.3 GB |
| 8-bit (this model) | ~600 MB | ~0.9–1.1 GB |
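These numbers are approximate. To measure the runtime figure on your own hardware, Transformers exposes a helper on the loaded model (a quick sketch; results vary by device and library versions):

```python
# Reports the in-memory size of the model's parameters and buffers, in bytes.
size_mb = model.get_memory_footprint() / 1024**2
print(f"Model memory footprint: {size_mb:.0f} MB")
```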
## Citation

```bibtex
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```
## Disclaimer
I am not the creator or original owner of the Qwen/Qwen3 models. This repository provides a quantized version strictly for compatibility and deployment. All rights to the underlying models remain with the original authors. This repository adheres to the same license and usage terms as the upstream (base) model. Please review the original license for details on permissions and limitations.