Supreeth's Collections

SearchLM

NL2BM25: teaching Qwen2.5-3B to generate Tantivy boolean queries via SFT + GRPO. Covers reward hacking (GRPO v1) and the shaped-reward fix (GRPO v2).