# Qwen3-Embedding-0.6B-Q8
This repository contains an 8-bit quantized version of Qwen3-Embedding-0.6B, produced with bitsandbytes. The goal of the quantization is to reduce the disk and runtime memory footprint while preserving embedding quality.

Quantization was performed with `BitsAndBytesConfig(load_in_8bit=True)`, following the currently recommended Hugging Face API (the `quantization_config` argument of `from_pretrained`).
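For reference, here is a minimal sketch of how such an export can be produced with the current API. The exact script was not published with this card; the base model ID, `device_map`, and output directory below are assumptions:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# Load the FP16 base model with bitsandbytes 8-bit quantization enabled.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-Embedding-0.6B",     # assumed upstream base model
    quantization_config=bnb_config,
    device_map="auto",               # 8-bit loading expects a CUDA device
    torch_dtype=torch.float16,       # compute dtype for non-quantized ops
)

# Recent transformers/bitsandbytes releases can serialize the int8
# weights directly to safetensors.
model.save_pretrained("Qwen3-Embedding-0.6B-Q8_0-Safetensors")
```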
## Quantization Details
- Method: bitsandbytes 8-bit (LLM.int8())
- Library: bitsandbytes (bnb)
- Format: safetensors
- Precision:
  - Weights: int8
  - Activations / compute: FP16 or BF16, depending on hardware
- Intended use: low-memory inference on CUDA GPUs
The quantized weights are stored in the safetensors file and mapped back onto bitsandbytes' 8-bit linear layers (`Linear8bitLt`) when the model is loaded, as sketched below.
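A quick way to confirm this is to inspect the loaded module tree. A minimal sketch, assuming `model` has already been loaded as in the usage section below:

```python
import bitsandbytes as bnb

# Count the linear layers that were replaced by bitsandbytes' 8-bit kernel.
int8_layers = [
    name for name, module in model.named_modules()
    if isinstance(module, bnb.nn.Linear8bitLt)
]
print(f"Found {len(int8_layers)} Linear8bitLt layers")
```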
## Transformers Usage

```python
import torch
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds the last real token
    # for every sequence in the batch.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # Otherwise, gather each sequence's last non-padding position.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,            # enable 8-bit (LLM.int8) quantization
    llm_int8_threshold=6.0,       # default outlier threshold; safe value
    llm_int8_has_fp16_weight=False,
)
model_id = "ManiKumarAdapala/Qwen3-Embedding-0.6B-Q8_0-Safetensors"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModel.from_pretrained(model_id, quantization_config=bnb_config)

max_length = 1024  # truncation length for this example

input_texts = [
    "Gravity is a force that attracts two bodies towards each other. "
    "It gives weight to physical objects and is responsible for the "
    "movement of planets around the sun."
]
# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
# Move inputs onto the same device as the model
batch_dict = batch_dict.to(model.device)

with torch.no_grad():
    outputs = model(**batch_dict)

embeddings_ = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])

# Normalize embeddings to unit length
embeddings = F.normalize(embeddings_, p=2, dim=1)
print(embeddings, embeddings.shape)  # (batch_size, 1024)
```
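After L2 normalization, a dot product between embeddings equals cosine similarity, so retrieval scoring reduces to a matrix product. A short follow-up sketch reusing the objects defined above (the query and document strings are illustrative):

```python
# Hypothetical retrieval example: score one query against two documents.
texts = [
    "What force keeps planets in orbit around the sun?",               # query
    "Gravity is a force that attracts two bodies towards each other.",
    "Photosynthesis converts sunlight into chemical energy.",
]
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=max_length, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**batch)
emb = F.normalize(
    last_token_pool(out.last_hidden_state, batch["attention_mask"]), p=2, dim=1
)

scores = emb[0] @ emb[1:].T  # cosine similarity via dot product
print(scores)                # the gravity sentence should score higher
```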
## Memory & Performance
Approximate size comparison:
| Model Type | Disk Size | Runtime RAM |
|---|---|---|
| FP16 | ~1.2 GB | ~2.3 GB |
| 8-bit (this model) | ~600 MB | ~0.9–1.1 GB |
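These numbers are approximate. To measure the runtime figure on your own hardware, Transformers exposes a helper on the loaded model (a quick sketch; results vary by device and library versions):

```python
# Reports the in-memory size of the model's parameters and buffers, in bytes.
size_mb = model.get_memory_footprint() / 1024**2
print(f"Model memory footprint: {size_mb:.0f} MB")
```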
## Citation

```bibtex
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```
## Disclaimer
I am not the creator or original owner of the Qwen/Qwen3 models. This repository provides a quantized version strictly for compatibility and deployment. All rights to the underlying models remain with the original authors. This repository adheres to the same license and usage terms as the upstream (base) model. Please review the original license for details on permissions and limitations.