# Static Query Encoder — distilled from Qwen3-Embedding-8B
A ~38MB query encoder (10M params) that produces 4096-dim embeddings aligned to Qwen/Qwen3-Embedding-8B's embedding space, enabling asymmetric retrieval where documents are encoded once with the massive 7.6B-param teacher and queries are encoded at inference with this tiny model.
| | Student (this model) | Teacher |
|---|---|---|
| Params | 10M | 7,567M |
| Size | 38 MB | ~15 GB |
| Architecture | EmbeddingBag + MLP | Qwen3 Transformer |
| Latency | 0.39ms (CPU) | ~100ms (GPU) |
| Throughput | 2,552 q/s (CPU) | ~100 q/s (GPU) |
| Output dim | 4096 | 4096 |
## Architecture

```text
BERT tokenizer (30,522 vocab)
  → EmbeddingBag(30522 × 256, mean pooling)   # 7.8M params ≈ 30 MB
  → Linear(256 → 512) + GELU                  # 0.13M params ≈ 0.5 MB
  → Linear(512 → 4096)                        # 2.1M params ≈ 8 MB
  → L2 Normalize
  → 4096-dim unit vector (same space as Qwen3-Embedding-8B)
```
No transformer layers. No attention. Just an embedding bag lookup + 2-layer MLP. Sub-millisecond inference on CPU.
## Quick Start

```python
import json

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Model class
class StaticQueryEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=256, hidden_dim=512, output_dim=4096, padding_idx=0):
        super().__init__()
        self.eb = nn.EmbeddingBag(vocab_size, embedding_dim, mode="mean", padding_idx=padding_idx)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, input_ids, offsets=None):
        # 2-D input: padded [batch, seq_len]; 1-D input: flat ids + offsets
        x = self.eb(input_ids) if input_ids.dim() == 2 else self.eb(input_ids, offsets)
        return F.normalize(self.mlp(x), p=2, dim=-1)

# Load
repo = "erikkaum/static-qwen3-query-encoder"
tok = AutoTokenizer.from_pretrained(repo)
with open(hf_hub_download(repo, "config.json")) as f:
    cfg = json.load(f)
model = StaticQueryEncoder(cfg["vocab_size"], cfg["embedding_dim"], cfg["hidden_dim"], cfg["output_dim"])
model.load_state_dict(torch.load(hf_hub_download(repo, "model.pt"), map_location="cpu", weights_only=True))
model.eval()

# Encode queries
ids = tok(["what is machine learning?"], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]
with torch.no_grad():
    query_emb = model(ids)  # [1, 4096], L2-normalized
```
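To check the latency and throughput numbers on your own hardware, here is a minimal CPU micro-benchmark sketch reusing `model` and `tok` from above (single query, batch size 1; the exact figures depend on your machine):

```python
import time

ids = tok(["what is machine learning?"], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]

with torch.no_grad():
    for _ in range(10):  # warmup
        model(ids)
    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        model(ids)
    elapsed = time.perf_counter() - start

print(f"{elapsed / n * 1000:.3f} ms/query ({n / elapsed:.0f} q/s)")
```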
## Asymmetric Retrieval Pattern

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# === INDEX TIME (run once) ===
doc_model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
documents = ["Machine learning is...", "Photosynthesis is...", ...]
doc_embs = doc_model.encode(documents, normalize_embeddings=True)
# Store doc_embs in your vector database

# === QUERY TIME (run per query, sub-millisecond) ===
# Reuses `tok` and `model` from the Quick Start above
query = "how does machine learning work?"
ids = tok([query], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]
with torch.no_grad():
    q_emb = model(ids).numpy()

# Retrieve by dot product (both vectors are L2-normalized)
scores = q_emb @ doc_embs.T
top_k = scores[0].argsort()[-10:][::-1]
```
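At scale, the same dot-product search can be served from any inner-product vector index. A minimal sketch with FAISS, assuming `faiss-cpu` is installed (any vector database that supports inner-product or cosine search works the same way):

```python
import faiss

# Cosine similarity == inner product, since all vectors are L2-normalized
index = faiss.IndexFlatIP(4096)
index.add(doc_embs.astype(np.float32))  # index the teacher document embeddings once

scores, indices = index.search(q_emb.astype(np.float32), 10)  # top-10 docs for the student query
```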
## Evaluation Results
Tested on a set of 5 query-document pairs + 3 distractor documents (8 total docs):
| Metric | Value |
|---|---|
| Retrieval Accuracy (Top-1) | 5/5 = 100% |
| Query Encoding Latency | 0.39ms (CPU) |
| Eval Cosine Similarity | 0.5169 |
| Eval L2 Distance | 0.9788 |
The model returns the matching document as the top result for every test query, even though the documents were encoded by the roughly 760× larger Qwen3-Embedding-8B teacher.
## Training Details

### Method
Based on LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations (ACL 2026).
Key idea: Pre-compute teacher query embeddings offline, then train the student to match them using ℓ₂ + cosine alignment loss. No teacher model is loaded during training.
### Loss Function

```python
l2_loss = torch.norm(student_emb - teacher_emb, p=2, dim=-1).mean()
cos_loss = (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
total_loss = l2_loss + 0.5 * cos_loss
```
### Training Data

- Teacher embeddings: 40,000 query embeddings cached from Qwen/Qwen3-Embedding-8B via the Inference API
- Sources, from the LightOn embedding datasets:
  - `lightonai/embeddings-fine-tuning` queries: MSMARCO (30K), NQ (20K), HotpotQA (10K), FiQA (5.5K), FEVER (10K), SQuADv2 (10K), TriviaQA (10K)
  - `lightonai/embeddings-pre-training-curated` queries: AGNews (5K), AltLex (5K), Amazon QA (5K), CC-News (5K)
- Cached embeddings dataset: erikkaum/qwen3-8b-query-embeddings-lighton
### Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW (lr=1e-3, wd=0.01) |
| Scheduler | Linear warmup (5%) + decay |
| Batch size | 256 |
| Epochs | 10 |
| Max sequence length | 128 tokens |
| Gradient clipping | 1.0 |
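For reference, a minimal sketch of how these settings fit together in the distillation loop. The data loading is an illustrative assumption (padded `query_ids` and cached `teacher_embs` tensors), not the exact training script; only the hyperparameter values come from the table above:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Assumed inputs: query_ids [N, 128] padded token ids, teacher_embs [N, 4096] cached teacher vectors
loader = DataLoader(TensorDataset(query_ids, teacher_embs), batch_size=256, shuffle=True)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
total_steps = len(loader) * 10
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=int(0.05 * total_steps), num_training_steps=total_steps)

model.train()
for epoch in range(10):
    for ids, teacher_emb in loader:
        student_emb = model(ids)  # forward() L2-normalizes
        l2_loss = torch.norm(student_emb - teacher_emb, p=2, dim=-1).mean()
        cos_loss = (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
        loss = l2_loss + 0.5 * cos_loss
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        sched.step()
```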
### Training Curve

```text
Epoch  1/10 | Train Loss: 1.5016 | Eval L2: 1.0812 | Eval CosSim: 0.4134
Epoch  2/10 | Train Loss: 1.3370 | Eval L2: 1.0387 | Eval CosSim: 0.4577
Epoch  3/10 | Train Loss: 1.2867 | Eval L2: 1.0196 | Eval CosSim: 0.4770
Epoch  5/10 | Train Loss: 1.2336 | Eval L2: 1.0014 | Eval CosSim: 0.4948
Epoch  8/10 | Train Loss: 1.2010 | Eval L2: 0.9817 | Eval CosSim: 0.5142
Epoch 10/10 | Train Loss: 1.1892 | Eval L2: 0.9788 | Eval CosSim: 0.5169
```
## How It Works
This model implements asymmetric retrieval — a retrieval paradigm where the query encoder and document encoder are different models that produce embeddings in the same vector space.
Documents are encoded once at index time using the full Qwen3-Embedding-8B model (7.6B params). This is expensive but done only once.
Queries are encoded at search time using this tiny model (10M params). The EmbeddingBag + MLP architecture requires no attention computation, achieving sub-millisecond latency.
The student was trained to produce embeddings that match the teacher's query embeddings via ℓ₂ distillation. Since the teacher's query and document embeddings live in the same space, the student's query embeddings can be directly compared with the teacher's document embeddings.
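Because both encoders target the same space, alignment is easy to sanity-check: encode the same query with both models and compare rankings over the same teacher-encoded documents. A sketch reusing `model`, `tok`, `doc_model`, and `doc_embs` from the sections above:

```python
query = "how does machine learning work?"

# Teacher encoding of the query (slow, for comparison only)
t_emb = doc_model.encode([query], normalize_embeddings=True)

# Student encoding of the same query (fast path)
ids = tok([query], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]
with torch.no_grad():
    s_emb = model(ids).numpy()

# Well-aligned encoders should produce similar document rankings
print("teacher ranking:", (t_emb @ doc_embs.T)[0].argsort()[::-1][:5])
print("student ranking:", (s_emb @ doc_embs.T)[0].argsort()[::-1][:5])
```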
## Limitations & Future Work

- The v1 model was trained on 40K queries — more training data (the full 115K+ queries, or larger samples from the LightOn pre-training dataset) would likely improve cosine alignment.
- The static embedding architecture ignores word order — the model relies on bag-of-words semantics. This works well for keyword-style queries but may struggle where word order is critical (see the sketch after this list).
- No MRL/Matryoshka support yet — the model always outputs full 4096-dim vectors. Adding Matryoshka training could enable flexible dimensionality reduction.
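To see the word-order limitation concretely: mean pooling depends only on the multiset of tokens, so any permutation of the same tokens maps to (numerically) the same embedding. A sketch reusing `model` and `tok` from the Quick Start:

```python
a = tok(["man bites dog"], return_tensors="pt")["input_ids"]
b = tok(["dog bites man"], return_tensors="pt")["input_ids"]
with torch.no_grad():
    sim = F.cosine_similarity(model(a), model(b)).item()
print(sim)  # ~1.0: same bag of tokens, same embedding, regardless of order
```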
## Citation

```bibtex
@misc{leaf2025,
  title={LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations},
  author={Robin Vujanic and Thomas Rueckstiess},
  year={2025},
  eprint={2509.12539},
  archivePrefix={arXiv},
}
```