QMD Query Expansion: LFM2-1.2B GGUF

GGUF quantization of a fine-tuned LiquidAI/LFM2-1.2B for structured query expansion in qmd, a local-first document search engine.

Fine-tuned adapter: OrcsRise/qmd-query-expansion-lfm2-sft

Quantizations

| File | Quantization | Size | Use Case |
| --- | --- | --- | --- |
| qmd-query-expansion-lfm2-q8_0.gguf | Q8_0 | 1.19 GB | Recommended: near-original quality |

Quick Start with qmd

```bash
# Set as your qmd query expansion model
export QMD_GEN_MODEL="hf:OrcsRise/qmd-query-expansion-lfm2-gguf/qmd-query-expansion-lfm2-q8_0.gguf"

# Add to ~/.zshrc or ~/.bashrc for persistence
echo 'export QMD_GEN_MODEL="hf:OrcsRise/qmd-query-expansion-lfm2-gguf/qmd-query-expansion-lfm2-q8_0.gguf"' >> ~/.zshrc

# qmd auto-downloads the GGUF on first use
qmd query "your search query"
```

The model is automatically downloaded to ~/.cache/qmd/models/ on first run.
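
qmd handles the download for you, but if you want to fetch the file manually (say, to inspect it or pre-seed the cache), the standard huggingface-cli download command works. The destination directory below is just an example:

```bash
# Download the Q8_0 file from this repo (destination directory is arbitrary)
huggingface-cli download OrcsRise/qmd-query-expansion-lfm2-gguf \
  qmd-query-expansion-lfm2-q8_0.gguf \
  --local-dir ~/.cache/qmd/models/
```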

What This Model Does

Given a short search query, the model generates structured expansions in three formats for hybrid search:

| Prefix | Purpose | Example |
| --- | --- | --- |
| lex: | Lexical keywords for BM25/FTS5 search | lex: docker container timeout settings |
| vec: | Natural language for vector similarity search | vec: how to configure docker container timeout |
| hyde: | Hypothetical document for HyDE retrieval | hyde: Docker containers can be configured with timeout settings using the --stop-timeout flag... |
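
Taken together, a single completion stacks all three expansions for one query. The block below is illustrative, assembled from the examples above; real output varies with sampling:

```
lex: docker container timeout settings
vec: how to configure docker container timeout
hyde: Docker containers can be configured with timeout settings using the --stop-timeout flag...
```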

Why LFM2 over Qwen3?

This is an alternative to qmd's default Qwen3-1.7B query expansion model, added in qmd v1.0.7.

|  | LFM2-1.2B (this) | Qwen3-1.7B (default) |
| --- | --- | --- |
| Parameters | 1.2B | 1.7B |
| Architecture | Hybrid (convolutions + attention) | Standard transformer |
| Decode/prefill speed | ~2x faster | Baseline |
| Q8_0 size | 1.19 GB | ~1.7 GB |
| Best for | On-device, latency-sensitive | Maximum quality |

LFM2's hybrid architecture makes it ideal for on-device inference where latency and memory matter more than marginal quality differences.

Training

  • Method: SFT with LoRA (rank 16, alpha 32)
  • Dataset: tobil/qmd-query-expansion-train (5,157 examples)
  • LoRA targets: q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3
  • Epochs: 5
  • Hardware: NVIDIA Tesla T4 (Google Colab, free tier)
  • Training time: ~2.5 hours

See the SFT adapter card for full training details.

Recommended Generation Parameters

| Parameter | Value |
| --- | --- |
| Temperature | 0.3 |
| min_p | 0.15 |
| Repetition penalty | 1.05 |
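
If you run the model outside qmd, these settings map directly onto llama-cli's sampling flags. A minimal sketch; the raw prompt below is an assumption, since qmd applies its own prompt template, which this example does not reproduce:

```bash
# Run the GGUF with the recommended sampling settings (llama.cpp b5921+)
llama-cli -m qmd-query-expansion-lfm2-q8_0.gguf \
  --temp 0.3 \
  --min-p 0.15 \
  --repeat-penalty 1.05 \
  -p "docker container timeout"
```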

Compatibility

This GGUF works with any inference engine that supports the LFM2 architecture:

  • qmd (via node-llama-cpp), the primary use case
  • llama.cpp (b5921+)
  • Ollama
  • LM Studio
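
For Ollama, one approach is a minimal Modelfile over a locally downloaded copy of the GGUF, with the recommended sampling settings baked in (the model name qmd-expand is arbitrary):

```bash
# Create a local Ollama model from the downloaded GGUF
cat > Modelfile <<'EOF'
FROM ./qmd-query-expansion-lfm2-q8_0.gguf
PARAMETER temperature 0.3
PARAMETER min_p 0.15
PARAMETER repeat_penalty 1.05
EOF

ollama create qmd-expand -f Modelfile
ollama run qmd-expand "docker container timeout"
```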

Related

  • Fine-tuned adapter: OrcsRise/qmd-query-expansion-lfm2-sft
  • Training dataset: tobil/qmd-query-expansion-train
  • Base model: LiquidAI/LFM2-1.2B

License

Apache 2.0, the same license as the base LFM2-1.2B model.
