For full details, see the ORBIT paper. This is v0.1 of the ORBIT-4B model, fine-tuned for 165 GRPO steps.


Orbit-4B (v0.1)

This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) open search agent built on the Qwen3-4B base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained with the verl-tool framework and a live DDGS-based retriever.

Training was conducted for 165 GRPO steps on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (orbit-ai/orbit) in a 1:1:1 ratio.


Model Details

| Property | Value |
| --- | --- |
| Architecture | Qwen3-4B |
| Base checkpoint | Qwen/Qwen3-4B |
| Training algorithm | GRPO |
| Training steps | 165 |
| Training framework | verl-tool |
| Hardware | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| Parallelism | FSDP (param + optimizer offload) |
| Rollout mode | Async (vLLM v1) |

Training Dataset

The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:

| Task | Source | Type |
| --- | --- | --- |
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult, multi-hop reasoning queries |

- **Dataset names:** PeterJinGo/nq_hotpotqa_train, orbit-ai/orbit-20k
- **Train batch size:** 256 samples per step (n = 8 rollouts per sample → 2048 trajectories/step)
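
The 1:1:1 mix can be sketched as round-robin interleaving over the three sources so each task contributes equally per epoch. This is a minimal illustration, not verl-tool's actual data loader; the field names are hypothetical.

```python
def mix_equal(*sources):
    """Round-robin interleave example streams so each task contributes equally."""
    for group in zip(*sources):  # zip stops at the shortest source
        yield from group

# Toy stand-ins for the three tasks:
nq = [{"task": "nq", "q": f"nq-{i}"} for i in range(3)]
hotpot = [{"task": "hotpotqa", "q": f"hp-{i}"} for i in range(3)]
orbit = [{"task": "orbit", "q": f"ob-{i}"} for i in range(3)]

mixed = list(mix_equal(nq, hotpot, orbit))
counts = {t: sum(1 for ex in mixed if ex["task"] == t)
          for t in ("nq", "hotpotqa", "orbit")}
```

Because `zip` truncates to the shortest source, the effective ratio stays 1:1:1 even when the raw datasets differ in size.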


Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Batch size | 256 |
| Rollouts per sample (GRPO n) | 8 |
| Learning rate | 1e-6 |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | low_var_kl |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |
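
With n = 8 rollouts per prompt, GRPO computes advantages group-relative: each rollout's reward is standardized against the mean and standard deviation of its own group. A minimal sketch of that computation (not verl-tool's implementation):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, n = 8 rollouts scored by exact match (0/1):
adv = grpo_advantages([1, 0, 0, 1, 0, 0, 0, 0])
```

Successful rollouts get positive advantage, failed ones negative, and the group mean is zero, so no separate value network is needed.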

Tool Configuration

The model was trained with live web search via a DDGS-based retrieval server:

| Setting | Value |
| --- | --- |
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | google, brave, bing, wikipedia, grokipedia |
| Top-k documents per query | 5 |
| Backend strategy | Parallel fan-out: all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | `</search>`, `</answer>` |
| Observations masked in loss | Yes (`mask_observations=True`) |
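
The parallel fan-out strategy can be sketched as follows. This is an illustrative stub, not the retriever's actual code; the backend callables and result schema (`url`/`text` dicts) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(query, backends, topk=5):
    """Query every backend in parallel, then merge and deduplicate results by URL."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        result_lists = pool.map(lambda backend: backend(query), backends)
    seen, merged = set(), []
    for results in result_lists:  # map preserves backend order
        for doc in results:
            if doc["url"] not in seen:
                seen.add(doc["url"])
                merged.append(doc)
    return merged[:topk]

# Stub backends standing in for google/brave/bing/wikipedia/grokipedia:
b1 = lambda q: [{"url": "a", "text": "A"}, {"url": "b", "text": "B"}]
b2 = lambda q: [{"url": "b", "text": "B"}, {"url": "c", "text": "C"}]
docs = fan_out("plasma share of blood", [b1, b2])
```

Fanning out and deduplicating keeps total latency close to the slowest single backend rather than the sum of all of them.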

The retriever server runs as a FastAPI service. At each agent turn the model issues a `<search> query </search>` action; the tool server retrieves results and returns them as `<information>…</information>` observations. The trajectory ends when the model emits `<answer> … </answer>` or the turn budget is exhausted.
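
The turn loop above can be sketched as follows, with the generator and retriever stubbed out. The tag names come from the card; everything else (function names, stubs) is hypothetical.

```python
import re

def run_agent(generate, search, question, max_turns=5):
    """Multi-turn loop: emit <search> actions until <answer> or the turn budget runs out."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        action = generate(transcript)  # generation stops at </search> or </answer>
        transcript += action
        m = re.search(r"<answer>(.*?)</answer>", action, re.DOTALL)
        if m:
            return m.group(1).strip()
        m = re.search(r"<search>(.*?)</search>", action, re.DOTALL)
        if m:
            docs = search(m.group(1).strip())
            transcript += f"<information>{docs}</information>"
    return None  # turn budget exhausted without an answer

# Stubs: one search turn, then an answer.
turns = iter(["<search> plasma share of blood </search>",
              "<answer> about 55% </answer>"])
answer = run_agent(lambda t: next(turns),
                   lambda q: "Plasma is ~55% of blood volume.",
                   "What percentage of blood is plasma?")
```

The `</search>` and `</answer>` action stop tokens from the tool configuration are what let generation pause at exactly these tag boundaries.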


Training Infrastructure

| Setting | Value |
| --- | --- |
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB (`$TMPDIR`, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no `NCCL_P2P_DISABLE`) |
| vLLM version | v1 (`VLLM_USE_V1=1`) |
| Checkpoint frequency | Every 5 steps |

Reward Model

Reward manager: `search_r1_qa_em`

Reward is computed as exact match (EM) between the model's `<answer>` span and the reference target string. Multi-turn rollouts are scored on the final answer only; intermediate search turns receive no reward. Observation tokens are masked out of the KL and policy-gradient losses.
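
Exact-match rewards in the Search-R1 style typically normalize both strings (lowercasing, stripping punctuation and articles) before comparing. A sketch of that scoring, with the normalization details assumed rather than taken from the actual reward manager:

```python
import re
import string

def normalize(s):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_reward(prediction, golds):
    """Return 1.0 if the normalized prediction matches any reference, else 0.0."""
    return float(any(normalize(prediction) == normalize(g) for g in golds))
```

Because the reward is binary and assigned only to the final answer, the group-relative advantages do all the credit assignment across the intermediate search turns.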


Usage

This model is designed to be used with a running tool server that handles `<search>` actions. Without a live retriever, inference falls back to the model's parametric knowledge.

With verl-tool (recommended)

```bash
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate

# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```

Direct inference (parametric knowledge only)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
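
To get a clean final answer out of the decoded text, you can pull the last `<answer>…</answer>` span with a small helper. This is a generic sketch, not part of the model's tooling:

```python
import re

def extract_answer(text):
    """Return the contents of the last <answer>…</answer> span, or None if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return matches[-1].strip() if matches else None

sample = "<think>Plasma is the liquid part.</think><answer> about 55% </answer>"
```

Taking the last span guards against the model mentioning the tags inside its `<think>` reasoning before committing to a final answer.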

Intended Use & Limitations

- **Intended use:** Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- **Language:** English only.
- **Search dependency:** Peak performance requires a live web search backend. Without search, the model still reasons from parametric knowledge, but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- Not intended for production deployment without additional safety filtering.

Citation

If you use this model or the training methodology, please cite:
