# Static Query Encoder — distilled from Qwen3-Embedding-8B
A ~38MB query encoder (10M params) that produces 4096-dim embeddings aligned to Qwen/Qwen3-Embedding-8B's embedding space, enabling asymmetric retrieval where documents are encoded once with the massive 7.6B-param teacher and queries are encoded at inference with this tiny model.
| | Student (this model) | Teacher |
|---|---|---|
| Params | 10M | 7,567M |
| Size | 38 MB | ~15 GB |
| Architecture | EmbeddingBag + MLP | Qwen3 Transformer |
| Latency | 0.39ms (CPU) | ~100ms (GPU) |
| Throughput | 2,552 q/s (CPU) | ~100 q/s (GPU) |
| Output dim | 4096 | 4096 |
## Architecture

```text
BERT tokenizer (30,522 vocab)
  → EmbeddingBag(30522 × 256, mean pooling)   # 7.8M params ≈ 30 MB
  → Linear(256 → 512) + GELU                  # 0.13M params ≈ 0.5 MB
  → Linear(512 → 4096)                        # 2.1M params ≈ 8 MB
  → L2 Normalize
  → 4096-dim unit vector (same space as Qwen3-Embedding-8B)
```
No transformer layers. No attention. Just an embedding bag lookup + 2-layer MLP. Sub-millisecond inference on CPU.
## Quick Start

```python
import json

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Model class
class StaticQueryEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=256, hidden_dim=512, output_dim=4096, padding_idx=0):
        super().__init__()
        self.eb = nn.EmbeddingBag(vocab_size, embedding_dim, mode="mean", padding_idx=padding_idx)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, input_ids, offsets=None):
        # 2-D input: padded [batch, seq_len]; 1-D input: flat ids + offsets
        x = self.eb(input_ids) if input_ids.dim() == 2 else self.eb(input_ids, offsets)
        return F.normalize(self.mlp(x), p=2, dim=-1)

# Load
repo = "erikkaum/static-qwen3-query-encoder"
tok = AutoTokenizer.from_pretrained(repo)
with open(hf_hub_download(repo, "config.json")) as f:
    cfg = json.load(f)
model = StaticQueryEncoder(cfg["vocab_size"], cfg["embedding_dim"], cfg["hidden_dim"], cfg["output_dim"])
model.load_state_dict(torch.load(hf_hub_download(repo, "model.pt"), map_location="cpu", weights_only=True))
model.eval()

# Encode queries
ids = tok(["what is machine learning?"], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]
with torch.no_grad():
    query_emb = model(ids)  # [1, 4096], L2-normalized
```
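To check the latency and throughput numbers on your own hardware, here is a minimal CPU micro-benchmark sketch reusing `model` and `tok` from above (single query, batch size 1; the exact figures depend on your machine):

```python
import time

ids = tok(["what is machine learning?"], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]

with torch.no_grad():
    for _ in range(10):  # warmup
        model(ids)
    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        model(ids)
    elapsed = time.perf_counter() - start

print(f"{elapsed / n * 1000:.3f} ms/query ({n / elapsed:.0f} q/s)")
```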
## Asymmetric Retrieval Pattern

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# === INDEX TIME (run once) ===
doc_model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
documents = ["Machine learning is...", "Photosynthesis is...", ...]
doc_embs = doc_model.encode(documents, normalize_embeddings=True)
# Store doc_embs in your vector database

# === QUERY TIME (run per query, sub-millisecond) ===
# Reuses `tok` and `model` from the Quick Start above
query = "how does machine learning work?"
ids = tok([query], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]
with torch.no_grad():
    q_emb = model(ids).numpy()

# Retrieve by dot product (both vectors are L2-normalized)
scores = q_emb @ doc_embs.T
top_k = scores[0].argsort()[-10:][::-1]
```
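At scale, the same dot-product search can be served from any inner-product vector index. A minimal sketch with FAISS, assuming `faiss-cpu` is installed (any vector database that supports inner-product or cosine search works the same way):

```python
import faiss

# Cosine similarity == inner product, since all vectors are L2-normalized
index = faiss.IndexFlatIP(4096)
index.add(doc_embs.astype(np.float32))  # index the teacher document embeddings once

scores, indices = index.search(q_emb.astype(np.float32), 10)  # top-10 docs for the student query
```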
## Evaluation Results
Tested on a set of 5 query-document pairs + 3 distractor documents (8 total docs):
| Metric | Value |
|---|---|
| Retrieval Accuracy (Top-1) | 5/5 = 100% |
| Query Encoding Latency | 0.39ms (CPU) |
| Eval Cosine Similarity | 0.5169 |
| Eval L2 Distance | 0.9788 |
The model returns the matching document as the top result for every test query, even though the documents were encoded by the roughly 760× larger Qwen3-Embedding-8B teacher.
## Training Details

### Method
Based on LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations (ACL 2026).
Key idea: Pre-compute teacher query embeddings offline, then train the student to match them using ℓ₂ + cosine alignment loss. No teacher model is loaded during training.
### Loss Function

```python
l2_loss = torch.norm(student_emb - teacher_emb, p=2, dim=-1).mean()
cos_loss = (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
total_loss = l2_loss + 0.5 * cos_loss
```
### Training Data

- Teacher embeddings: 40,000 query embeddings cached from Qwen/Qwen3-Embedding-8B via the Inference API
- Sources, from the LightOn embedding datasets:
  - `lightonai/embeddings-fine-tuning` queries: MSMARCO (30K), NQ (20K), HotpotQA (10K), FiQA (5.5K), FEVER (10K), SQuADv2 (10K), TriviaQA (10K)
  - `lightonai/embeddings-pre-training-curated` queries: AGNews (5K), AltLex (5K), Amazon QA (5K), CC-News (5K)
- Cached embeddings dataset: erikkaum/qwen3-8b-query-embeddings-lighton
### Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW (lr=1e-3, wd=0.01) |
| Scheduler | Linear warmup (5%) + decay |
| Batch size | 256 |
| Epochs | 10 |
| Max sequence length | 128 tokens |
| Gradient clipping | 1.0 |
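For reference, a minimal sketch of how these settings fit together in the distillation loop. The data loading is an illustrative assumption (padded `query_ids` and cached `teacher_embs` tensors), not the exact training script; only the hyperparameter values come from the table above:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Assumed inputs: query_ids [N, 128] padded token ids, teacher_embs [N, 4096] cached teacher vectors
loader = DataLoader(TensorDataset(query_ids, teacher_embs), batch_size=256, shuffle=True)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
total_steps = len(loader) * 10
sched = get_linear_schedule_with_warmup(opt, num_warmup_steps=int(0.05 * total_steps), num_training_steps=total_steps)

model.train()
for epoch in range(10):
    for ids, teacher_emb in loader:
        student_emb = model(ids)  # forward() L2-normalizes
        l2_loss = torch.norm(student_emb - teacher_emb, p=2, dim=-1).mean()
        cos_loss = (1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
        loss = l2_loss + 0.5 * cos_loss
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        sched.step()
```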
### Training Curve

```text
Epoch  1/10 | Train Loss: 1.5016 | Eval L2: 1.0812 | Eval CosSim: 0.4134
Epoch  2/10 | Train Loss: 1.3370 | Eval L2: 1.0387 | Eval CosSim: 0.4577
Epoch  3/10 | Train Loss: 1.2867 | Eval L2: 1.0196 | Eval CosSim: 0.4770
Epoch  5/10 | Train Loss: 1.2336 | Eval L2: 1.0014 | Eval CosSim: 0.4948
Epoch  8/10 | Train Loss: 1.2010 | Eval L2: 0.9817 | Eval CosSim: 0.5142
Epoch 10/10 | Train Loss: 1.1892 | Eval L2: 0.9788 | Eval CosSim: 0.5169
```
## How It Works
This model implements asymmetric retrieval — a retrieval paradigm where the query encoder and document encoder are different models that produce embeddings in the same vector space.
Documents are encoded once at index time using the full Qwen3-Embedding-8B model (7.6B params). This is expensive but done only once.
Queries are encoded at search time using this tiny model (10M params). The EmbeddingBag + MLP architecture requires no attention computation, achieving sub-millisecond latency.
The student was trained to produce embeddings that match the teacher's query embeddings via ℓ₂ distillation. Since the teacher's query and document embeddings live in the same space, the student's query embeddings can be directly compared with the teacher's document embeddings.
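Because both encoders target the same space, alignment is easy to sanity-check: encode the same query with both models and compare rankings over the same teacher-encoded documents. A sketch reusing `model`, `tok`, `doc_model`, and `doc_embs` from the sections above:

```python
query = "how does machine learning work?"

# Teacher encoding of the query (slow, for comparison only)
t_emb = doc_model.encode([query], normalize_embeddings=True)

# Student encoding of the same query (fast path)
ids = tok([query], return_tensors="pt", truncation=True, padding=True, max_length=128)["input_ids"]
with torch.no_grad():
    s_emb = model(ids).numpy()

# Well-aligned encoders should produce similar document rankings
print("teacher ranking:", (t_emb @ doc_embs.T)[0].argsort()[::-1][:5])
print("student ranking:", (s_emb @ doc_embs.T)[0].argsort()[::-1][:5])
```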
## Limitations & Future Work

- The v1 model was trained on 40K queries — more training data (the full 115K+ queries, or larger samples from the LightOn pre-training dataset) would likely improve cosine alignment.
- The static embedding architecture ignores word order — the model relies on bag-of-words semantics. This works well for keyword-style queries but may struggle where word order is critical (see the sketch after this list).
- No MRL/Matryoshka support yet — the model always outputs full 4096-dim vectors. Adding Matryoshka training could enable flexible dimensionality reduction.
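To see the word-order limitation concretely: mean pooling depends only on the multiset of tokens, so any permutation of the same tokens maps to (numerically) the same embedding. A sketch reusing `model` and `tok` from the Quick Start:

```python
a = tok(["man bites dog"], return_tensors="pt")["input_ids"]
b = tok(["dog bites man"], return_tensors="pt")["input_ids"]
with torch.no_grad():
    sim = F.cosine_similarity(model(a), model(b)).item()
print(sim)  # ~1.0: same bag of tokens, same embedding, regardless of order
```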
## Citation

```bibtex
@misc{leaf2025,
  title={LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations},
  author={Robin Vujanic and Thomas Rueckstiess},
  year={2025},
  eprint={2509.12539},
  archivePrefix={arXiv},
}
```