For full details, check out the ORBIT paper here. This is v0.1 of the ORBIT-4B model, fine-tuned for 165 GRPO steps.
Orbit-4B (v0.1)
This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) expert open search agent built on the Qwen3-4B base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained with the verl-tool framework and a live DDGS-based retriever.
Training was conducted for 165 GRPO steps on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (orbit-ai/orbit) in a 1:1:1 ratio.
Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3-4B |
| Base checkpoint | Qwen/Qwen3-4B |
| Training algorithm | GRPO |
| Training steps | 165 |
| Training framework | verl-tool |
| Hardware | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| Parallelism | FSDP (param + optimizer offload) |
| Rollout mode | Async (vLLM v1) |
Training Dataset
The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:
| Task | Category | Question type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult and multi-hop reasoning queries |
Dataset names: PeterJinGo/nq_hotpotqa_train, orbit-ai/orbit-20k
Train batch size: 256 samples per step (n=8 rollouts per sample → 2048 trajectories/step)
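The 1:1:1 mixture and the rollout fan-out can be sketched as follows. This is a minimal illustration, not the actual verl-tool data loader; `mix_equal` is a hypothetical helper.

```python
import random

def mix_equal(nq, hotpot, orbit, seed=0):
    """Interleave three task pools at a 1:1:1 ratio, then shuffle.

    Truncates to the smallest pool so every task contributes equally.
    """
    rng = random.Random(seed)
    n = min(len(nq), len(hotpot), len(orbit))
    mixed = [ex for triple in zip(nq[:n], hotpot[:n], orbit[:n]) for ex in triple]
    rng.shuffle(mixed)
    return mixed

# Each training step samples 256 questions and generates 8 GRPO rollouts
# per question, so one step produces 256 * 8 = 2048 trajectories.
batch_size, rollouts_per_sample = 256, 8
trajectories_per_step = batch_size * rollouts_per_sample
```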
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Learning rate | 1e-6 |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| GRPO rollouts per sample (n) | 8 |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | low_var_kl |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |
Tool Configuration
The model was trained with live web search via a DDGS-based retrieval server:
| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | google, brave, bing, wikipedia, grokipedia |
| Top-k documents per query | 5 |
| Backend strategy | Parallel fan-out — all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | </search>, </answer> |
| Observations masked in loss | Yes (mask_observations=True) |
The retriever server runs as a FastAPI service. At each agent turn the model issues a <search> query </search> action; the tool server retrieves results and returns them as <information>…</information> observations. The trajectory ends when the model emits <answer> … </answer> or the turn budget is exhausted.
Training Infrastructure
| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB ($TMPDIR, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no NCCL_P2P_DISABLE) |
| vLLM version | v1 (VLLM_USE_V1=1) |
| Checkpoint frequency | Every 5 steps |
Reward Model
Reward manager: search_r1_qa_em
Reward is computed as exact-match (EM) between the model's <answer> and the reference target string. Multi-turn rollouts are scored at the final answer only; intermediate search turns receive no intermediate reward. Observations are masked from the KL and policy gradient loss.
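An exact-match reward in this spirit can be sketched as below. This is a hedged illustration modeled on common QA-EM normalization (lowercasing, punctuation and article stripping); the actual search_r1_qa_em manager may normalize differently.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_reward(response, targets):
    """Score only the final <answer>; intermediate search turns get nothing."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0  # no final answer emitted within the turn budget
    pred = normalize(m.group(1))
    return 1.0 if any(pred == normalize(t) for t in targets) else 0.0
```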
Usage
This model is designed to be used with a running tool server that handles <search> actions. Inference without a live retriever will fall back to the model's parametric knowledge.
With verl-tool (recommended)
```shell
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate

# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```
Direct inference (parametric knowledge only)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
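To use the model as trained, generation has to be wrapped in the multi-turn loop the tool server implements. A minimal sketch follows; `generate_until` and `retrieve` are hypothetical stand-ins for model generation (stopping on `</search>`/`</answer>`) and the DDGS retriever, not real APIs of this repository.

```python
import re

def run_agent(prompt, generate_until, retrieve, max_turns=5):
    """Alternate model turns and retrieval until <answer> or turn budget."""
    transcript = prompt
    for _ in range(max_turns):
        chunk = generate_until(transcript, stop=["</search>", "</answer>"])
        transcript += chunk
        answer = re.search(r"<answer>(.*?)</answer>", chunk, re.DOTALL)
        if answer:
            return answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if query:
            docs = retrieve(query.group(1).strip())
            # Observations are injected as <information> and masked from the loss
            transcript += f"<information>{docs}</information>"
    return None  # turn budget exhausted without a final answer
```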
Intended Use & Limitations
- Intended use: Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- Language: English only.
- Search dependency: Peak performance requires a live web search backend. Without search, the model still reasons using parametric knowledge but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- Not intended for production deployment without additional safety filtering.
Citation
If you use this model or the training methodology, please cite: