SearchLM Collection NL2BM25: teaching Qwen2.5-3B to generate Tantivy boolean queries via SFT + GRPO. Covers reward hacking (GRPO v1) and the shaped-reward fix (GRPO v2). • 4 items • Updated 4 days ago
Sleeping RL VeriRL — Verilog RTL Design Environment 🔬 Step through a VeriRL environment by sending actions and viewing the results