Qwen3-Embedding-8B-GGUF (Q8_0 8-bit Fixed Metadata)

This repository contains a fixed version of the Qwen3-Embedding-8B (Q8_0) model: a state-of-the-art, 8-billion-parameter, decoder-only embedding model that currently ranks #1 on the MTEB Multilingual Leaderboard.

πŸ›  Why this version exists (The Fix)

The original GGUF conversions of this model often omitted the metadata flags required for Last Token Pooling. This caused "SEP token" warnings and ingestion-queue freezes in tools like AnythingLLM and LM Studio.

Fixed by Gemini CLI: the missing header keys were identified, and the metadata patch implemented, using Gemini CLI. Because the standard GGUF conversion logic does not set the flags required by Qwen3's architecture, Gemini CLI was used to script a direct patch to the GGUF KV pairs.

Changes made to the GGUF header via Gemini CLI:

  1. Fixed tokenizer.ggml.add_eos_token: Set to true.
  2. Added tokenizer.ggml.seperator_token_id: Set to 151643. (The "seperator" spelling matches the key name llama.cpp expects, so it is intentional.)
  3. Added tokenizer.ggml.add_sep_token: Set to true.
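As a hedged illustration (not the actual patch script), the three KV pairs above can be checked after download. The helper below is hypothetical: it takes a plain dict of header metadata, as produced by any GGUF reader, and reports keys that are missing or wrong.

```python
# Hypothetical helper: check that the three patched KV pairs are present
# in a dict of GGUF header metadata (as read by any GGUF parser).
EXPECTED_PATCHES = {
    "tokenizer.ggml.add_eos_token": True,
    "tokenizer.ggml.seperator_token_id": 151643,  # llama.cpp's key spelling
    "tokenizer.ggml.add_sep_token": True,
}

def verify_patches(header_kv: dict) -> list[str]:
    """Return the keys that are missing or carry the wrong value."""
    return [
        key for key, expected in EXPECTED_PATCHES.items()
        if header_kv.get(key) != expected
    ]

# Example: a header that is still missing one flag
problems = verify_patches({
    "tokenizer.ggml.add_eos_token": True,
    "tokenizer.ggml.seperator_token_id": 151643,
})
print(problems)  # ['tokenizer.ggml.add_sep_token']
```

A GGUF with all three patches applied yields an empty list.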

πŸ“Š Technical Specifications

  • Parameters: 8.2 Billion
  • Max Context: 32,768 tokens
  • Native Dimension: 4096 (Supports Matryoshka/MRL truncation from 32 to 4096)
  • Instruction Aware: Yes (Optimized for Instruct-based retrieval)
  • Languages: 100+ (Multilingual, Cross-lingual, and Code retrieval)
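The Matryoshka (MRL) truncation noted above works by keeping a prefix of the embedding and re-normalizing. This is a generic MRL sketch in plain Python, not code from the Qwen release; the function name is ours.

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and L2-renormalize (MRL truncation)."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Toy 8-dim stand-in for a 4096-dim embedding, truncated to 4 dims
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.1, 0.1, 0.1]
small = truncate_embedding(full, 4)
print(len(small))  # 4
```

The same call would take a real 4096-dim vector down to any supported size between 32 and 4096.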

πŸ’‘ Usage Tip: Instruction-Based Retrieval

This model is Instruction-Aware. For the best results in RAG or search scenarios, use the following format for your queries:

  • For Queries: Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: [Your Question Here]
  • For Documents: Just provide the raw text (No instruction needed).
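The asymmetric formatting above can be wrapped in a small helper. The instruction string is the one from this tip; the function names are our own convenience.

```python
TASK = "Given a web search query, retrieve relevant passages that answer the query"

def format_query(query: str, task: str = TASK) -> str:
    """Prefix a search query with the retrieval instruction."""
    return f"Instruct: {task}\nQuery: {query}"

def format_document(text: str) -> str:
    """Documents are embedded as raw text; no instruction needed."""
    return text

print(format_query("What is last-token pooling?"))
```

Embed the formatted query and the raw documents with the same model, then rank documents by cosine similarity.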

βš–οΈ Performance & Integrity

  • MTEB Multilingual Score: 70.58 (#1 Ranking)
  • Integrity: 398/398 tensors match the original Qwen Team weights. No re-quantization was performed; only the metadata header was patched so that runners expecting BERT-style SEP metadata load the model cleanly.

πŸš€ Usage in LM Studio / AnythingLLM

  1. Download the .gguf file.
  2. LM Studio: Select in the Embeddings tab. It will now load without the "SEP" warning.
  3. AnythingLLM: Select LM Studio as your provider. The ingestion queue will now process large folders of documents without stopping.
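If you prefer to call LM Studio's local server directly, it exposes an OpenAI-compatible /v1/embeddings endpoint (port 1234 by default). The sketch below only builds the request object without sending it; the model identifier is an assumption, so match it to what LM Studio displays for the loaded model.

```python
import json
import urllib.request

def embeddings_request(texts, model="qwen3-embedding-8b",
                       base_url="http://localhost:1234/v1"):
    """Build an OpenAI-compatible /v1/embeddings request (not sent here)."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = embeddings_request(["Instruct: ...\nQuery: hello"])
print(req.full_url)  # http://localhost:1234/v1/embeddings
```

With LM Studio running, `urllib.request.urlopen(req)` returns a JSON body whose `data[0].embedding` field holds the vector.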

πŸ“œ Citation

If you use this model, please cite the original work:

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}