# QMD Query Expansion 1.7B
A Qwen3-1.7B model fine-tuned for query expansion in hybrid search (RAG) systems. It expands user queries into retrieval-optimized variations for both sparse (BM25) and dense (vector) search.

Repository: [github.com/tobi/qmd](https://github.com/tobi/qmd)
## What This Model Does

Given a search query, the model generates 7 expansions:
- 1 hyde: A hypothetical document snippet (50-200 chars) that would answer the query
- 3 lex: Keyword phrases (2-5 words) optimized for BM25/sparse search
- 3 vec: Natural language sentences (15-30 words) for vector/dense search
This improves recall in hybrid retrieval systems by matching both exact keywords and semantic meaning.
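In a hybrid retrieval pipeline, each expansion can be issued as its own sub-query and the per-query result lists fused, for example with reciprocal rank fusion (RRF). A minimal sketch with stand-in ranked lists (real lists would come from your BM25 and vector indexes; `rrf_fuse` is illustrative, not part of the repo):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Stand-in results: one ranked list per expansion
# (lex expansions via BM25, vec expansions and the hyde snippet via ANN search).
results_per_expansion = [
    ["doc3", "doc1", "doc7"],   # lex: "postgresql jsonb gin index"
    ["doc1", "doc3", "doc2"],   # vec: "How do I create efficient GIN indexes..."
    ["doc1", "doc9"],           # hyde snippet used as a dense query
]
print(rrf_fuse(results_per_expansion))  # doc1 ranks first: near the top of all three lists
```

RRF needs no score normalization across BM25 and vector backends, which is why it is a common fusion choice in hybrid setups.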
## Prompt Format

**Critical:** use this exact format. The model was trained on this specific template.

```
Expand this search query:
<query>
```

Example input:

```
Expand this search query:
postgresql jsonb indexing
```
Example output:

```
hyde: PostgreSQL JSONB supports GIN indexes for fast key lookups and containment queries with @> operator.
lex: postgresql jsonb gin index
lex: postgres json indexing strategies
lex: jsonb index optimization postgresql
vec: How do I create efficient GIN indexes on JSONB columns in PostgreSQL?
vec: Best practices for indexing JSON data in PostgreSQL databases.
vec: Performance comparison of GIN vs BTREE indexes for JSONB fields.
```
## Usage

### With vLLM (Recommended)

```bash
# Start the server
vllm serve tobil/qmd-query-expansion-1.7B --port 8000

# Query it
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tobil/qmd-query-expansion-1.7B",
    "messages": [{"role": "user", "content": "Expand this search query:\npostgresql jsonb indexing"}],
    "temperature": 0.7,
    "max_tokens": 400
  }' | jq -r '.choices[0].message.content'
```
### With Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tobil/qmd-query-expansion-1.7B")
tokenizer = AutoTokenizer.from_pretrained("tobil/qmd-query-expansion-1.7B")

messages = [{"role": "user", "content": "Expand this search query:\nReact hooks tutorial"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With llama.cpp (GGUF)

```bash
# Download the GGUF (Q8_0 quantized, 2.1GB)
huggingface-cli download tobil/qmd-query-expansion-1.7B qmd-query-expansion-1.7B-Q8_0.gguf

# Run
./llama-cli -m qmd-query-expansion-1.7B-Q8_0.gguf \
  -p "Expand this search query:\nkubernetes vs docker" \
  --temp 0.7 -n 400
```
## Output Parsing

The model emits one expansion per line. Parse with:

```python
import re

def parse_expansions(text: str) -> list[dict]:
    """Parse line-based expansion output into structured form."""
    expansions = []
    # Remove thinking tags if present (Qwen3 feature)
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    for line in text.strip().split('\n'):
        line = line.strip()
        match = re.match(r'^(hyde|lex|vec)\s*:\s*(.+)$', line, re.IGNORECASE)
        if match:
            expansions.append({
                "type": match.group(1).lower(),
                "value": match.group(2).strip()
            })
    return expansions

# Example
output = """hyde: PostgreSQL JSONB supports GIN indexes for fast queries.
lex: postgresql jsonb gin index
lex: postgres json indexing
lex: jsonb optimization
vec: How to create GIN indexes on JSONB columns?
vec: Best practices for PostgreSQL JSON indexing.
vec: JSONB vs JSON performance comparison."""

expansions = parse_expansions(output)
# [{"type": "hyde", "value": "PostgreSQL JSONB supports..."}, ...]
```
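Since a 1.7B model occasionally drops or duplicates lines, it is worth validating the parsed output against the expected 1 hyde / 3 lex / 3 vec shape before using it. A sketch of such a check (`validate_expansions` is a hypothetical helper, not part of the repo):

```python
from collections import Counter

EXPECTED_COUNTS = {"hyde": 1, "lex": 3, "vec": 3}

def validate_expansions(expansions: list[dict]) -> bool:
    """True iff the parsed output has exactly 1 hyde, 3 lex, and 3 vec entries."""
    counts = Counter(e["type"] for e in expansions)
    return counts == EXPECTED_COUNTS

good = (
    [{"type": "hyde", "value": "A hypothetical answer snippet."}]
    + [{"type": "lex", "value": f"keyword phrase {i}"} for i in range(3)]
    + [{"type": "vec", "value": f"A natural language sentence {i}."} for i in range(3)]
)
print(validate_expansions(good))        # True
print(validate_expansions(good[:-1]))   # False: only 2 vec entries
```

On a failed check you might retry generation (temperature sampling gives a different output each time) or fall back to searching with the raw query alone.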
## Training Details

**Method:** GEPA distillation

- **Teacher model:** GPT-4o-mini with a GEPA-optimized prompt
- **Prompt optimization:** DSPy's GEPA (Genetic-Pareto) optimizer automatically evolved the teacher prompt over 34 iterations to reach 87.7% on our scoring metric
- **Distillation:** generated 500+ high-quality training examples from the teacher
- **Student training:** SFT with LoRA on Qwen3-1.7B, 3 epochs
## Key Learnings

### 1. Hyde-First Ordering Matters

Generating the hypothetical document (hyde) first provides context that improves lex and vec quality. The hyde acts as an "anchor" that grounds the subsequent expansions.

✅ Good (hyde first, then lex uses the hyde context):

```
hyde: Kubernetes orchestrates containers at scale with auto-scaling...
lex: kubernetes container orchestration   # informed by hyde
```

❌ Bad (lex without context):

```
lex: container management   # too generic
```
### 2. Entity Preservation Is Critical

Named entities (brands, products, technical terms) must appear in every lex expansion. Missing entities tank BM25 recall.

Query: "iPhone 15 vs Samsung S24"

✅ Good lex:
- "iPhone 15 Samsung S24 comparison"
- "iPhone 15 vs Samsung S24 specs"
- "Samsung S24 iPhone 15 camera"

❌ Bad lex:
- "smartphone comparison" (missing entities!)
- "phone camera review" (missing entities!)
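This rule is mechanical enough to check in code. A sketch of such a check, assuming the query's entities are already known (in practice they might come from an NER pass or a curated list; `entities_preserved` is illustrative, not the actual eval code):

```python
def entities_preserved(entities: list[str], lex_expansions: list[str]) -> bool:
    """True iff every entity appears (case-insensitively) in every lex expansion."""
    return all(
        all(entity.lower() in lex.lower() for entity in entities)
        for lex in lex_expansions
    )

entities = ["iPhone 15", "Samsung S24"]
print(entities_preserved(entities, ["iPhone 15 Samsung S24 comparison"]))  # True
print(entities_preserved(entities, ["smartphone comparison"]))             # False
```

A substring match is a crude proxy (it misses inflections and reorderings of multi-word entities), but it catches the common failure mode shown above.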
### 3. Simple Prompts Win for Small Models

The teacher used a complex DSPy signature format with structured sections, but the small model performed better with the simple training format:

✅ Use this (matches training): `"Expand this search query:\n{query}"`

❌ Not this (DSPy signature format): `"## Inputs\n### query\n{query}\n## Generated Outputs..."`

Complex prompts caused the small model to "leak" instruction fragments into its outputs.
### 4. Line Format > JSON for Small Models

Small models struggle with reliable JSON generation; a line-based format is more robust:

✅ Reliable:

```
hyde: Some text here
lex: keyword phrase
vec: A full sentence.
```

❌ Unreliable for a 1.7B model:

```
[{"type": "hyde", "value": "..."}, ...]
```
### 5. GEPA Prompt Evolution

GEPA automatically discovered these improvements to the teacher prompt:
- Explicit examples for edge cases (ambiguous queries like "pin")
- Emphasis on entity preservation with concrete failure cases
- Factual grounding examples (Louvre hours, GPS navigation steps)
- Score targets ("aim for 78-84%") to calibrate quality
## Training Configuration

```yaml
base_model: Qwen/Qwen3-1.7B
method: SFT with LoRA
lora_r: 64
lora_alpha: 128
learning_rate: 2e-4
epochs: 3
batch_size: 4
gradient_accumulation: 4
warmup_ratio: 0.1
scheduler: cosine
```
## Metrics
| Metric | Value |
|---|---|
| Final Loss | 0.64 |
| Token Accuracy | 84.7% |
| Eval Score Range | 80-96% |
| Training Time | ~7 min (RTX 4090) |
## Scoring Rubric

Our evaluation metric scores expansions on:
- Structure (7 items: 1 hyde, 3 lex, 3 vec)
- Entity Preservation (all query entities in every lex)
- No Verbatim Echo (lex shouldn't just repeat the query)
- Hyde Quality (50-200 chars, informative)
- Vec Quality (15-30 words, semantic variation)
- Hyde-Lex-Vec Coherence (lex/vec should build on hyde)
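Several of these checks can be automated directly on the parsed output. A rough illustration, with thresholds taken from the rubric above (`rubric_checks` is a hypothetical helper, not the actual eval code; entity and coherence checks are omitted since they need external entity lists or embeddings):

```python
def rubric_checks(query: str, expansions: list[dict]) -> dict[str, bool]:
    """Run the mechanical subset of the scoring rubric on parsed expansions."""
    by_type = {t: [e["value"] for e in expansions if e["type"] == t]
               for t in ("hyde", "lex", "vec")}
    return {
        # Structure: exactly 1 hyde, 3 lex, 3 vec
        "structure": [len(by_type[t]) for t in ("hyde", "lex", "vec")] == [1, 3, 3],
        # No verbatim echo: no lex line is just the query repeated
        "no_verbatim_echo": all(l.lower() != query.lower() for l in by_type["lex"]),
        # Hyde quality: 50-200 characters
        "hyde_length_ok": all(50 <= len(h) <= 200 for h in by_type["hyde"]),
        # Vec quality: 15-30 words
        "vec_length_ok": all(15 <= len(v.split()) <= 30 for v in by_type["vec"]),
    }
```

Per-check booleans like these can then be weighted into the single eval score however the metric prescribes.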
## Limitations
- Trained on English queries only
- May hallucinate facts in hyde (use for retrieval, not as ground truth)
- Optimized for general knowledge queries; domain-specific queries may need domain-adapted models
- Qwen3's `<think>` tags sometimes appear (strip them in post-processing)
## Files

### Safetensors (for transformers/vLLM)

- `model.safetensors` - full-precision weights (4.1GB)

### GGUF Quantizations (for llama.cpp/Ollama)
| Quant | Size | BPW | Eval Score | Use Case |
|---|---|---|---|---|
| Q8_0 | 2.1GB | 8.5 | 87% | Max quality |
| Q6_K | 1.6GB | 6.6 | 89% | Good balance |
| Q5_K_M | 1.4GB | 5.7 | 89% | Balanced |
| Q4_K_M | 1.2GB | 4.8 | 92% | Recommended |
| Q4_0 | 1.2GB | 4.5 | 95% | Smallest |
**Results:** all quantizations perform well on this structured generation task. The eval scores show no meaningful degradation even at Q4_0: generating hyde/lex/vec expansions is simple enough that aggressive quantization doesn't hurt. Q4_K_M is recommended as the best size/quality tradeoff.
## Citation

```bibtex
@misc{qmd-query-expansion,
  title={QMD Query Expansion Model},
  author={Shopify},
  year={2025},
  url={https://github.com/tobi/qmd}
}
```
## License

Apache 2.0