or_v2_rl_v0.1 – OpenResearcher RL checkpoint (step 62)
RLOO-trained checkpoint from the launch11 / v1 run (experiment name
fullparam_8node_v1_0419-1502) at global_step_62, based on the
OpenResearcher SFT init (sft_qwen35_35b/checkpoint-1560-fused).
Training
- Algorithm: RLOO (GRPO-style rollout with baseline = group mean)
- Rollout: multi-turn, tool-calling (browser.search / browser.open) against an offline FAISS service over ~14.9M docs
- Data: midpass1k (the pass-rate 0.33–0.67 subset of the RL training set)
- Nodes: 8 × 8 × A100 (64 GPUs total, TP=4)
- Turn cap: 100 assistant turns, with a 24,576-token response ceiling
- Loss: loss_agg_mode=token-mean, clip_ratio_high=0.28, kl_loss_coef=1e-4
- Scoring: four-bucket correctness reward (correct+searched / correct+no-search / wrong+searched / wrong+no-search)
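The RLOO baseline above can be sketched in a few lines: each rollout's reward is compared against the mean reward of the other rollouts sampled for the same prompt (the leave-one-out variant of the group mean). This is an illustrative sketch, not the actual training code.

```python
import numpy as np

def rloo_advantages(group_rewards):
    """RLOO advantage: each rollout's reward minus the mean reward of
    the *other* rollouts in its group (leave-one-out baseline)."""
    r = np.asarray(group_rewards, dtype=float)
    n = r.size
    baseline = (r.sum() - r) / (n - 1)  # mean of the remaining n-1 rollouts
    return r - baseline

# Four rollouts for one prompt; two scored correct (reward 1.0).
# Correct rollouts get a positive advantage, wrong ones a negative one.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline excludes the current sample, it stays an unbiased estimate of the expected reward even with small group sizes.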
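The browser.search tool in the rollout is backed by an offline dense index. A minimal stand-in for such a service (brute-force inner-product search, here with NumPy rather than FAISS, over random placeholder embeddings instead of the ~14.9M-doc corpus) looks like:

```python
import numpy as np

class OfflineSearch:
    """Toy stand-in for the offline retrieval service behind
    browser.search. A real deployment would use a FAISS index;
    the documents and embeddings here are placeholders."""

    def __init__(self, doc_vectors, docs):
        self.doc_vectors = np.asarray(doc_vectors, dtype=float)
        self.docs = docs

    def search(self, query_vector, k=3):
        # Inner-product scoring, highest-scoring k documents first.
        scores = self.doc_vectors @ np.asarray(query_vector, dtype=float)
        top = np.argsort(-scores)[:k]
        return [(self.docs[i], float(scores[i])) for i in top]

rng = np.random.default_rng(0)
svc = OfflineSearch(rng.normal(size=(100, 8)),
                    [f"doc{i}" for i in range(100)])
hits = svc.search(rng.normal(size=8), k=3)
```

Serving the index as a separate process keeps retrieval latency off the GPU workers during rollout.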
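The four-bucket scoring rule reduces to a lookup on (correct, searched). The bucket structure follows the card, but the numeric reward values are not stated there, so the magnitudes below are hypothetical placeholders:

```python
def four_bucket_reward(correct: bool, searched: bool) -> float:
    """Four-bucket correctness reward. Bucket structure per the card;
    the numeric values are illustrative placeholders only."""
    if correct and searched:
        return 1.0    # correct + searched: full reward
    if correct:
        return 0.5    # correct + no-search: answered from memory alone
    if searched:
        return 0.0    # wrong + searched: at least attempted retrieval
    return -0.5       # wrong + no-search: worst bucket
```

Splitting correctness by search behavior lets the reward discourage the model from guessing answers without consulting the tools.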
See docs/04242026_rl_step62_eval_analysis.md in the repo for the
GAIA-level trajectory audit that motivated the v0.2+ reward-shaping work.
Intended use
Research checkpoint. Multi-turn deep-research agent with browser tools; requires a compatible search service and the OpenResearcher agent loop for inference-time deployment.
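The inference-time deployment described above (a multi-turn loop issuing browser.search / browser.open tool calls under the 100-turn cap) can be sketched as follows; the message schema and the model/tool interfaces are assumptions for illustration, not the actual OpenResearcher API.

```python
MAX_TURNS = 100  # assistant-turn cap, matching the training setup

def run_agent(question, model, tools):
    """Minimal multi-turn tool-calling loop. `model` maps a message list
    to {"content": str, "tool_call": (name, args) or None}; `tools` maps
    tool names (e.g. "browser.search") to callables."""
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        reply = model(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool_call"):
            name, args = reply["tool_call"]
            result = tools[name](**args)  # dispatch to the search service
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]  # no tool call means a final answer
    return None  # hit the turn cap without answering
```

The loop terminates either when the model emits a turn with no tool call (its final answer) or when the turn cap is exhausted.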