or_v2_rl_v0.1: OpenResearcher RL checkpoint (step 62)

RLOO-trained checkpoint from the launch11 / v1 run (experiment name fullparam_8node_v1_0419-1502) at global_step_62, based on the OpenResearcher SFT init (sft_qwen35_35b/checkpoint-1560-fused).

Training

  • Algorithm: RLOO (GRPO-style rollouts with a group-mean baseline; see the objective sketch after this list)
  • Rollout: multi-turn tool-calling (browser.search / browser.open) against an offline FAISS service over ~14.9M docs (a service sketch appears under Intended use)
  • Data: midpass1k, the pass-rate 0.33–0.67 subset of the RL training set
  • Hardware: 8 nodes × 8 A100 (64 GPUs total, TP=4)
  • Caps: 100 assistant turns; 24,576-token response ceiling
  • Loss: loss_agg_mode=token-mean, clip_ratio_high=0.28, kl_loss_coef=1e-4
  • Scoring: four-bucket correctness reward (correct+searched / correct+no-search / wrong+searched / wrong+no-search); see the reward-table sketch below
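
The settings above imply a clipped policy-gradient loss over group-relative advantages. Below is a minimal sketch, assuming the standard leave-one-out baseline for RLOO, a symmetric lower clip bound (only clip_ratio_high=0.28 is given), and a naive per-token KL penalty; shapes and masking are illustrative, not the actual OpenResearcher trainer code.

```python
# Sketch only: group-relative advantages + token-mean clipped loss.
# clip_high=0.28, kl_coef=1e-4, and token-mean aggregation come from the
# config above; clip_low and the KL estimator are assumptions.
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out baseline over one prompt's rollout group ([group_size])."""
    n = rewards.numel()
    baseline = (rewards.sum() - rewards) / (n - 1)  # mean of the *other* rollouts
    return rewards - baseline

def clipped_pg_loss(logp_new, logp_old, logp_ref, advantages, mask,
                    clip_low=0.2, clip_high=0.28, kl_coef=1e-4):
    """Token-level clipped PG loss with loss_agg_mode=token-mean.

    logp_* and mask: [batch, seq]; advantages: [batch] (one scalar per rollout).
    """
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(-1)                   # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_low, 1 + clip_high) * adv
    loss = -torch.minimum(unclipped, clipped)
    loss = loss + kl_coef * (logp_new - logp_ref)    # naive per-token KL penalty
    # token-mean: average over all unmasked response tokens in the batch
    return (loss * mask).sum() / mask.sum()
```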
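
The four buckets read naturally as a lookup table over (correct?, searched?). A sketch with placeholder magnitudes, since the card does not state the actual bucket values:

```python
# Hypothetical bucket values; only the four bucket *names* come from the card.
REWARD_TABLE = {
    (True,  True):   1.0,   # correct + searched
    (True,  False):  0.5,   # correct + no-search
    (False, True):   0.0,   # wrong   + searched
    (False, False): -0.5,   # wrong   + no-search
}

def bucket_reward(is_correct: bool, used_search: bool) -> float:
    """Map a finished trajectory to its correctness/search bucket."""
    return REWARD_TABLE[(is_correct, used_search)]
```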

See docs/04242026_rl_step62_eval_analysis.md in the repo for the GAIA-level trajectory audit that motivated the v0.2+ reward-shaping work.

Intended use

Research checkpoint. Multi-turn deep-research agent with browser tools; requires a compatible search service and the OpenResearcher agent loop for inference-time deployment.
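
For reference, the inference-time loop looks roughly like the following. This is a sketch under assumptions: model.chat, the reply fields, and the service methods are hypothetical names, not the actual OpenResearcher agent-loop API; only the tool names and the turn/token caps come from this card.

```python
# Hypothetical agent loop; browser.search / browser.open and the caps are
# from the card, everything else (model.chat, reply fields) is assumed.
import json

def run_agent(model, search_service, question: str,
              max_turns: int = 100, max_new_tokens: int = 24576):
    """Alternate generation and tool execution until a tool-free final answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = model.chat(messages, max_new_tokens=max_new_tokens)
        messages.append({"role": "assistant", "content": reply.text})
        if not reply.tool_calls:                     # no tool call => final answer
            return reply.text
        for call in reply.tool_calls:
            args = json.loads(call.arguments)
            if call.name == "browser.search":
                result = search_service.search(args["query"], top_k=10)
            elif call.name == "browser.open":
                result = search_service.open(args["doc_id"])
            else:
                result = {"error": f"unknown tool: {call.name}"}
            messages.append({"role": "tool", "name": call.name,
                             "content": json.dumps(result)})
    return None  # hit the 100-turn cap without a final answer
```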
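
The search service itself can be a thin wrapper over the prebuilt FAISS index. Again a sketch: faiss.read_index and index.search are real FAISS calls, but the embedding function, paths, and snippet format are assumptions.

```python
# Hypothetical FAISS-backed service for browser.search / browser.open.
import faiss
import numpy as np

class FaissSearchService:
    def __init__(self, index_path: str, docs: list[str], embed_fn):
        self.index = faiss.read_index(index_path)  # prebuilt offline index (~14.9M docs)
        self.docs = docs                           # doc_id -> full text
        self.embed_fn = embed_fn                   # query -> float32 vector (assumed)

    def search(self, query: str, top_k: int = 10) -> list[dict]:
        q = np.asarray(self.embed_fn(query), dtype=np.float32).reshape(1, -1)
        scores, ids = self.index.search(q, top_k)
        return [{"doc_id": int(i), "score": float(s),
                 "snippet": self.docs[int(i)][:200]}
                for s, i in zip(scores[0], ids[0])]

    def open(self, doc_id: int) -> str:
        return self.docs[int(doc_id)]
```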

Checkpoint files

  • Format: safetensors
  • Model size: 35B params
  • Tensor type: BF16