# QMD Query Expansion 1.7B
A Qwen3-1.7B model fine-tuned for query expansion in hybrid search (RAG) systems. It expands user queries into retrieval-optimized variations for both sparse (BM25) and dense (vector) search.

Repository: [github.com/tobi/qmd](https://github.com/tobi/qmd)
## What This Model Does

Given a search query, the model generates 7 expansions:
- 1 hyde: A hypothetical document snippet (50-200 chars) that would answer the query
- 3 lex: Keyword phrases (2-5 words) optimized for BM25/sparse search
- 3 vec: Natural language sentences (15-30 words) for vector/dense search
This improves recall in hybrid retrieval systems by matching both exact keywords and semantic meaning.
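In a hybrid retrieval pipeline, each expansion can be issued as its own sub-query and the per-query result lists fused, for example with reciprocal rank fusion (RRF). A minimal sketch with stand-in ranked lists (real lists would come from your BM25 and vector indexes; `rrf_fuse` is illustrative, not part of the repo):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Stand-in results: one ranked list per expansion
# (lex expansions via BM25, vec expansions and the hyde snippet via ANN search).
results_per_expansion = [
    ["doc3", "doc1", "doc7"],   # lex: "postgresql jsonb gin index"
    ["doc1", "doc3", "doc2"],   # vec: "How do I create efficient GIN indexes..."
    ["doc1", "doc9"],           # hyde snippet used as a dense query
]
print(rrf_fuse(results_per_expansion))  # doc1 ranks first: near the top of all three lists
```

RRF needs no score normalization across BM25 and vector backends, which is why it is a common fusion choice in hybrid setups.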
## Prompt Format

**Critical:** use this exact format. The model was trained on this specific template.

```
Expand this search query:
<query>
```

Example input:

```
Expand this search query:
postgresql jsonb indexing
```
Example output:

```
hyde: PostgreSQL JSONB supports GIN indexes for fast key lookups and containment queries with @> operator.
lex: postgresql jsonb gin index
lex: postgres json indexing strategies
lex: jsonb index optimization postgresql
vec: How do I create efficient GIN indexes on JSONB columns in PostgreSQL?
vec: Best practices for indexing JSON data in PostgreSQL databases.
vec: Performance comparison of GIN vs BTREE indexes for JSONB fields.
```
## Usage

### With vLLM (Recommended)

```bash
# Start the server
vllm serve tobil/qmd-query-expansion-1.7B --port 8000

# Query it
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tobil/qmd-query-expansion-1.7B",
    "messages": [{"role": "user", "content": "Expand this search query:\npostgresql jsonb indexing"}],
    "temperature": 0.7,
    "max_tokens": 400
  }' | jq -r '.choices[0].message.content'
```
### With Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tobil/qmd-query-expansion-1.7B")
tokenizer = AutoTokenizer.from_pretrained("tobil/qmd-query-expansion-1.7B")

messages = [{"role": "user", "content": "Expand this search query:\nReact hooks tutorial"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With llama.cpp (GGUF)

```bash
# Download the GGUF (Q8_0 quantized, 2.1GB)
huggingface-cli download tobil/qmd-query-expansion-1.7B qmd-query-expansion-1.7B-Q8_0.gguf

# Run
./llama-cli -m qmd-query-expansion-1.7B-Q8_0.gguf \
  -p "Expand this search query:\nkubernetes vs docker" \
  --temp 0.7 -n 400
```
## Output Parsing

The model emits one expansion per line. Parse with:

```python
import re

def parse_expansions(text: str) -> list[dict]:
    """Parse line-based expansion output into structured form."""
    expansions = []
    # Remove thinking tags if present (Qwen3 feature)
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    for line in text.strip().split('\n'):
        line = line.strip()
        match = re.match(r'^(hyde|lex|vec)\s*:\s*(.+)$', line, re.IGNORECASE)
        if match:
            expansions.append({
                "type": match.group(1).lower(),
                "value": match.group(2).strip()
            })
    return expansions

# Example
output = """hyde: PostgreSQL JSONB supports GIN indexes for fast queries.
lex: postgresql jsonb gin index
lex: postgres json indexing
lex: jsonb optimization
vec: How to create GIN indexes on JSONB columns?
vec: Best practices for PostgreSQL JSON indexing.
vec: JSONB vs JSON performance comparison."""

expansions = parse_expansions(output)
# [{"type": "hyde", "value": "PostgreSQL JSONB supports..."}, ...]
```
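Since a 1.7B model occasionally drops or duplicates lines, it is worth validating the parsed output against the expected 1 hyde / 3 lex / 3 vec shape before using it. A sketch of such a check (`validate_expansions` is a hypothetical helper, not part of the repo):

```python
from collections import Counter

EXPECTED_COUNTS = {"hyde": 1, "lex": 3, "vec": 3}

def validate_expansions(expansions: list[dict]) -> bool:
    """True iff the parsed output has exactly 1 hyde, 3 lex, and 3 vec entries."""
    counts = Counter(e["type"] for e in expansions)
    return counts == EXPECTED_COUNTS

good = (
    [{"type": "hyde", "value": "A hypothetical answer snippet."}]
    + [{"type": "lex", "value": f"keyword phrase {i}"} for i in range(3)]
    + [{"type": "vec", "value": f"A natural language sentence {i}."} for i in range(3)]
)
print(validate_expansions(good))        # True
print(validate_expansions(good[:-1]))   # False: only 2 vec entries
```

On a failed check you might retry generation (temperature sampling gives a different output each time) or fall back to searching with the raw query alone.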
## Training Details

**Method:** GEPA distillation

- **Teacher model:** GPT-4o-mini with a GEPA-optimized prompt
- **Prompt optimization:** DSPy's GEPA (Genetic-Pareto) optimizer automatically evolved the teacher prompt over 34 iterations to reach 87.7% on our scoring metric
- **Distillation:** generated 500+ high-quality training examples from the teacher
- **Student training:** SFT with LoRA on Qwen3-1.7B, 3 epochs
## Key Learnings

### 1. Hyde-First Ordering Matters

Generating the hypothetical document (hyde) first provides context that improves lex and vec quality. The hyde acts as an "anchor" that grounds the subsequent expansions.

✅ Good (hyde first, then lex uses the hyde context):

```
hyde: Kubernetes orchestrates containers at scale with auto-scaling...
lex: kubernetes container orchestration   # informed by hyde
```

❌ Bad (lex without context):

```
lex: container management   # too generic
```
### 2. Entity Preservation Is Critical

Named entities (brands, products, technical terms) must appear in every lex expansion. Missing entities tank BM25 recall.

Query: "iPhone 15 vs Samsung S24"

✅ Good lex:
- "iPhone 15 Samsung S24 comparison"
- "iPhone 15 vs Samsung S24 specs"
- "Samsung S24 iPhone 15 camera"

❌ Bad lex:
- "smartphone comparison" (missing entities!)
- "phone camera review" (missing entities!)
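This rule is mechanical enough to check in code. A sketch of such a check, assuming the query's entities are already known (in practice they might come from an NER pass or a curated list; `entities_preserved` is illustrative, not the actual eval code):

```python
def entities_preserved(entities: list[str], lex_expansions: list[str]) -> bool:
    """True iff every entity appears (case-insensitively) in every lex expansion."""
    return all(
        all(entity.lower() in lex.lower() for entity in entities)
        for lex in lex_expansions
    )

entities = ["iPhone 15", "Samsung S24"]
print(entities_preserved(entities, ["iPhone 15 Samsung S24 comparison"]))  # True
print(entities_preserved(entities, ["smartphone comparison"]))             # False
```

A substring match is a crude proxy (it misses inflections and reorderings of multi-word entities), but it catches the common failure mode shown above.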
### 3. Simple Prompts Win for Small Models

The teacher used a complex DSPy signature format with structured sections, but the small model performed better with the simple training format:

✅ Use this (matches training): `"Expand this search query:\n{query}"`

❌ Not this (DSPy signature format): `"## Inputs\n### query\n{query}\n## Generated Outputs..."`

Complex prompts caused the small model to "leak" instruction fragments into its outputs.
### 4. Line Format > JSON for Small Models

Small models struggle with reliable JSON generation; a line-based format is more robust:

✅ Reliable:

```
hyde: Some text here
lex: keyword phrase
vec: A full sentence.
```

❌ Unreliable for a 1.7B model:

```
[{"type": "hyde", "value": "..."}, ...]
```
### 5. GEPA Prompt Evolution

GEPA automatically discovered these improvements to the teacher prompt:
- Explicit examples for edge cases (ambiguous queries like "pin")
- Emphasis on entity preservation with concrete failure cases
- Factual grounding examples (Louvre hours, GPS navigation steps)
- Score targets ("aim for 78-84%") to calibrate quality
## Training Configuration

```yaml
base_model: Qwen/Qwen3-1.7B
method: SFT with LoRA
lora_r: 64
lora_alpha: 128
learning_rate: 2e-4
epochs: 3
batch_size: 4
gradient_accumulation: 4
warmup_ratio: 0.1
scheduler: cosine
```
## Metrics
| Metric | Value |
|---|---|
| Final Loss | 0.64 |
| Token Accuracy | 84.7% |
| Eval Score Range | 80-96% |
| Training Time | ~7 min (RTX 4090) |
## Scoring Rubric

Our evaluation metric scores expansions on:
- Structure (7 items: 1 hyde, 3 lex, 3 vec)
- Entity Preservation (all query entities in every lex)
- No Verbatim Echo (lex shouldn't just repeat the query)
- Hyde Quality (50-200 chars, informative)
- Vec Quality (15-30 words, semantic variation)
- Hyde-Lex-Vec Coherence (lex/vec should build on hyde)
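Several of these checks can be automated directly on the parsed output. A rough illustration, with thresholds taken from the rubric above (`rubric_checks` is a hypothetical helper, not the actual eval code; entity and coherence checks are omitted since they need external entity lists or embeddings):

```python
def rubric_checks(query: str, expansions: list[dict]) -> dict[str, bool]:
    """Run the mechanical subset of the scoring rubric on parsed expansions."""
    by_type = {t: [e["value"] for e in expansions if e["type"] == t]
               for t in ("hyde", "lex", "vec")}
    return {
        # Structure: exactly 1 hyde, 3 lex, 3 vec
        "structure": [len(by_type[t]) for t in ("hyde", "lex", "vec")] == [1, 3, 3],
        # No verbatim echo: no lex line is just the query repeated
        "no_verbatim_echo": all(l.lower() != query.lower() for l in by_type["lex"]),
        # Hyde quality: 50-200 characters
        "hyde_length_ok": all(50 <= len(h) <= 200 for h in by_type["hyde"]),
        # Vec quality: 15-30 words
        "vec_length_ok": all(15 <= len(v.split()) <= 30 for v in by_type["vec"]),
    }
```

Per-check booleans like these can then be weighted into the single eval score however the metric prescribes.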
## Limitations
- Trained on English queries only
- May hallucinate facts in hyde (use for retrieval, not as ground truth)
- Optimized for general knowledge queries; domain-specific queries may need domain-adapted models
- Qwen3's `<think>` tags sometimes appear (strip them in post-processing)
## Files

### Safetensors (for transformers/vLLM)

- `model.safetensors` - full-precision weights (4.1GB)

### GGUF Quantizations (for llama.cpp/Ollama)
| Quant | Size | BPW | Eval Score | Use Case |
|---|---|---|---|---|
| Q8_0 | 2.1GB | 8.5 | 87% | Max quality |
| Q6_K | 1.6GB | 6.6 | 89% | Good balance |
| Q5_K_M | 1.4GB | 5.7 | 89% | Balanced |
| Q4_K_M | 1.2GB | 4.8 | 92% | Recommended |
| Q4_0 | 1.2GB | 4.5 | 95% | Smallest |
**Results:** all quantizations perform well on this structured generation task. The eval scores show no meaningful degradation even at Q4_0: generating hyde/lex/vec expansions is simple enough that aggressive quantization doesn't hurt. Q4_K_M is recommended as the best size/quality tradeoff.
## Citation

```bibtex
@misc{qmd-query-expansion,
  title={QMD Query Expansion Model},
  author={Shopify},
  year={2025},
  url={https://github.com/tobi/qmd}
}
```
## License

Apache 2.0