Supreeth/searchlm-nl2bm25-sft
Text Generation • 3B • Updated • 62
NL2BM25: teaching Qwen2.5-3B to generate Tantivy boolean queries via SFT + GRPO. Covers reward hacking (GRPO v1) and the shaped-reward fix (GRPO v2).
Note: SFT v1, warm-start (4,999 examples)
Note: SFT v2, quality-filtered (1,751 examples, ndcg > 0)
Note: GRPO v1, reward hacking / specification gaming
Note: GRPO v2, shaped reward, best retrieval scores
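
The SFT v2 note says its examples were kept only when ndcg > 0. The collection does not show how that filter was computed, so the following is a minimal sketch under stated assumptions: the record fields `retrieved` and `qrels`, the cutoff `k=10`, and the linear-gain nDCG formula are all hypothetical choices made for illustration, not the author's pipeline.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: document ids in the order the generated query retrieved them.
    qrels: dict mapping document id -> graded relevance (0 if absent).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def filter_examples(examples, k=10):
    """Keep only examples whose generated query retrieved something relevant."""
    return [ex for ex in examples if ndcg_at_k(ex["retrieved"], ex["qrels"], k) > 0]


# Hypothetical records: field names and values are illustrative only.
examples = [
    {"bm25_query": "+solar +panel", "retrieved": ["d1", "d2"], "qrels": {"d1": 1}},
    {"bm25_query": "+foo", "retrieved": ["d9"], "qrels": {"d3": 1}},
]
kept = filter_examples(examples)
```

Under these assumptions the second record is dropped, because none of its retrieved documents appear in its relevance judgments.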