Qwen3-Embedding-0.6B-Q8

This repository contains an 8-bit quantized version of Qwen3-Embedding-0.6B using bitsandbytes.
The goal of this quantization is to reduce memory and disk footprint while preserving embedding quality.

Quantization was performed with BitsAndBytesConfig(load_in_8bit=True) following the current recommended Hugging Face API (quantization_config argument).

Quantization Details

  • Method: BitsAndBytes 8-bit (LLM.int8())
  • Library: bitsandbytes (bnb)
  • Format: safetensors
  • Precision:
    • Weights: 8-bit int8
    • Activations / compute: FP16 or BF16 depending on hardware
  • Intended use: Low-memory inference on GPU or CPU

The quantized weights are stored in bitsandbytes's 8-bit linear-layer format and serialized into the safetensors file.
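To build intuition for what 8-bit storage means, here is a minimal, self-contained sketch of row-wise absmax int8 quantization, the basic scaling scheme underlying LLM.int8()-style weight storage. This is an illustration only, not the actual bitsandbytes implementation (which additionally handles outlier columns in higher precision):

```python
def quantize_absmax_int8(row):
    """Row-wise absmax quantization: map floats to int8 range [-127, 127].

    Illustrative sketch only; the real bitsandbytes kernels also split out
    outlier columns and run them in 16-bit precision.
    """
    absmax = max(abs(x) for x in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [round(x / scale) for x in row]  # int8 codes
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes and the row scale."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax_int8(row)
approx = dequantize(q, scale)
```

Each row stores one float scale plus one byte per weight, which is where the roughly 2x size reduction relative to FP16 comes from.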

Transformers Usage

import torch
from torch import Tensor
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,             # enable 8-bit (LLM.int8()) quantization
    llm_int8_threshold=6.0,        # default outlier threshold; safe value
    llm_int8_has_fp16_weight=False
)

model_id = "ManiKumarAdapala/Qwen3-Embedding-0.6B-Q8_0-Safetensors"

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')
model = AutoModel.from_pretrained(model_id, quantization_config=bnb_config)

max_length = 1024  # maximum token length for tokenization

input_texts = "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict = batch_dict.to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings_ = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings_, p=2, dim=1)

print(embeddings, len(embeddings[0]))
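Because the embeddings above are L2-normalized, comparing two of them reduces to a dot product. The following standalone sketch (plain Python, with hypothetical example vectors rather than real model outputs) shows the scoring step you would typically run after extracting embeddings:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm, mirroring F.normalize(..., p=2)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    # For unit-norm vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2-D stand-ins for real embedding vectors:
query = l2_normalize([3.0, 4.0])       # -> [0.6, 0.8]
doc = l2_normalize([4.0, 3.0])         # -> [0.8, 0.6]
score = cosine_similarity(query, doc)  # 0.6*0.8 + 0.8*0.6 = 0.96
```

With real model outputs you would compute the same score as `embeddings @ embeddings.T` on the normalized tensor.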

Memory & Performance

Approximate size comparison:

| Model Type         | Disk Size | Runtime RAM |
|--------------------|-----------|-------------|
| FP16               | ~1.2 GB   | ~2.3 GB     |
| 8-bit (this model) | ~600 MB   | ~0.9–1.1 GB |
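The disk figures in the table follow from simple arithmetic on weight storage; a back-of-envelope sketch (the 0.6B parameter count is approximate, and runtime RAM is higher than the weight size because of activations and framework overhead):

```python
PARAMS = 0.6e9  # ~0.6B parameters (approximate)

def weight_size_gb(params, bytes_per_param):
    """Weight storage only; excludes activations and runtime overhead."""
    return params * bytes_per_param / 1e9

fp16_gb = weight_size_gb(PARAMS, 2)  # 2 bytes/param -> ~1.2 GB
int8_gb = weight_size_gb(PARAMS, 1)  # 1 byte/param  -> ~0.6 GB
```

For a loaded model, `model.get_memory_footprint()` in Transformers reports the actual figure, which will exceed this estimate.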

Citation

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}

Disclaimer:
I am not the creator or original owner of the Qwen/Qwen3 models. This repository provides a quantized version strictly for compatibility and deployment. All rights to the underlying models remain with the original authors. This repository adheres to the same license and usage terms as the upstream (base) model. Please review the original license for details on permissions and limitations.
