or_v2_rl_v0.1 – OpenResearcher RL checkpoint (step 62)
RLOO-trained checkpoint from the launch11 / v1 run (experiment name
fullparam_8node_v1_0419-1502) at global_step_62, based on the
OpenResearcher SFT init (sft_qwen35_35b/checkpoint-1560-fused).
Training
- Algorithm: RLOO (GRPO-style rollout with baseline = group mean)
- Rollout: multi-turn, tool-calling (browser.search / browser.open) against an offline FAISS service over ~14.9M docs
- Data: midpass1k (the pass-rate 0.33–0.67 subset of the RL training set)
- Nodes: 8 × 8 × A100 (64 GPUs total, TP=4)
- Turn cap: 100 assistant turns, with a 24,576-token response ceiling
- Loss: loss_agg_mode=token-mean, clip_ratio_high=0.28, kl_loss_coef=1e-4
- Scoring: four-bucket correctness reward (correct+searched / correct+no-search / wrong+searched / wrong+no-search)
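The RLOO baseline above can be sketched in a few lines: each rollout's reward is compared against the mean reward of the other rollouts sampled for the same prompt (the leave-one-out variant of the group mean). This is an illustrative sketch, not the actual training code.

```python
import numpy as np

def rloo_advantages(group_rewards):
    """RLOO advantage: each rollout's reward minus the mean reward of
    the *other* rollouts in its group (leave-one-out baseline)."""
    r = np.asarray(group_rewards, dtype=float)
    n = r.size
    baseline = (r.sum() - r) / (n - 1)  # mean of the remaining n-1 rollouts
    return r - baseline

# Four rollouts for one prompt; two scored correct (reward 1.0).
# Correct rollouts get a positive advantage, wrong ones a negative one.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline excludes the current sample, it stays an unbiased estimate of the expected reward even with small group sizes.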
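The browser.search tool in the rollout is backed by an offline dense index. A minimal stand-in for such a service (brute-force inner-product search, here with NumPy rather than FAISS, over random placeholder embeddings instead of the ~14.9M-doc corpus) looks like:

```python
import numpy as np

class OfflineSearch:
    """Toy stand-in for the offline retrieval service behind
    browser.search. A real deployment would use a FAISS index;
    the documents and embeddings here are placeholders."""

    def __init__(self, doc_vectors, docs):
        self.doc_vectors = np.asarray(doc_vectors, dtype=float)
        self.docs = docs

    def search(self, query_vector, k=3):
        # Inner-product scoring, highest-scoring k documents first.
        scores = self.doc_vectors @ np.asarray(query_vector, dtype=float)
        top = np.argsort(-scores)[:k]
        return [(self.docs[i], float(scores[i])) for i in top]

rng = np.random.default_rng(0)
svc = OfflineSearch(rng.normal(size=(100, 8)),
                    [f"doc{i}" for i in range(100)])
hits = svc.search(rng.normal(size=8), k=3)
```

Serving the index as a separate process keeps retrieval latency off the GPU workers during rollout.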
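The four-bucket scoring rule reduces to a lookup on (correct, searched). The bucket structure follows the card, but the numeric reward values are not stated there, so the magnitudes below are hypothetical placeholders:

```python
def four_bucket_reward(correct: bool, searched: bool) -> float:
    """Four-bucket correctness reward. Bucket structure per the card;
    the numeric values are illustrative placeholders only."""
    if correct and searched:
        return 1.0    # correct + searched: full reward
    if correct:
        return 0.5    # correct + no-search: answered from memory alone
    if searched:
        return 0.0    # wrong + searched: at least attempted retrieval
    return -0.5       # wrong + no-search: worst bucket
```

Splitting correctness by search behavior lets the reward discourage the model from guessing answers without consulting the tools.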
See docs/04242026_rl_step62_eval_analysis.md in the repo for the
GAIA-level trajectory audit that motivated the v0.2+ reward-shaping work.
Intended use
Research checkpoint. Multi-turn deep-research agent with browser tools; requires a compatible search service and the OpenResearcher agent loop for inference-time deployment.
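The inference-time deployment described above (a multi-turn loop issuing browser.search / browser.open tool calls under the 100-turn cap) can be sketched as follows; the message schema and the model/tool interfaces are assumptions for illustration, not the actual OpenResearcher API.

```python
MAX_TURNS = 100  # assistant-turn cap, matching the training setup

def run_agent(question, model, tools):
    """Minimal multi-turn tool-calling loop. `model` maps a message list
    to {"content": str, "tool_call": (name, args) or None}; `tools` maps
    tool names (e.g. "browser.search") to callables."""
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        reply = model(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool_call"):
            name, args = reply["tool_call"]
            result = tools[name](**args)  # dispatch to the search service
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]  # no tool call means a final answer
    return None  # hit the turn cap without answering
```

The loop terminates either when the model emits a turn with no tool call (its final answer) or when the turn cap is exhausted.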