For full details, check out the ORBIT paper here. This is v0.1 of the ORBIT-4B model, fine-tuned for 165 GRPO steps.
Orbit-4B (v0.1)
This is the RL-trained checkpoint of Orbit-4B, a small (4B-parameter) expert open search agent built on the Qwen3-4B base model and fine-tuned with GRPO to use web search as a tool for multi-turn question answering. It was trained with the verl-tool framework and a live DDGS-based retriever.
Training was conducted for 165 GRPO steps on a mixed dataset of Natural Questions (NQ), HotpotQA, and our ORBIT dataset (orbit-ai/orbit) in a 1:1:1 ratio.
Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3-4B |
| Base checkpoint | Qwen/Qwen3-4B |
| Training algorithm | GRPO |
| Training steps | 165 |
| Training framework | verl-tool |
| Hardware | 4 × H100 SXM5 (80 GB HBM3), NVLink |
| Parallelism | FSDP (param + optimizer offload) |
| Rollout mode | Async (vLLM v1) |
Training Dataset
The model was trained on a mixed retrieval QA dataset with equal sampling across three tasks:
| Task | Category | Question type |
|---|---|---|
| Natural Questions (NQ) | Open-domain QA | Single-hop factoid |
| HotpotQA | Multi-hop QA | 2-hop reasoning |
| ORBIT | Multi-hop QA | Difficult and multi-hop reasoning queries |
Dataset names: PeterJinGo/nq_hotpotqa_train, orbit-ai/orbit-20k
Train batch size: 256 samples per step (n=8 rollouts per sample → 2048 trajectories/step)
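The 1:1:1 mixture and the rollout fan-out can be sketched as follows. This is a minimal illustration, not the actual verl-tool data loader; `mix_equal` is a hypothetical helper.

```python
import random

def mix_equal(nq, hotpot, orbit, seed=0):
    """Interleave three task pools at a 1:1:1 ratio, then shuffle.

    Truncates to the smallest pool so every task contributes equally.
    """
    rng = random.Random(seed)
    n = min(len(nq), len(hotpot), len(orbit))
    mixed = [ex for triple in zip(nq[:n], hotpot[:n], orbit[:n]) for ex in triple]
    rng.shuffle(mixed)
    return mixed

# Each training step samples 256 questions and generates 8 GRPO rollouts
# per question, so one step produces 256 * 8 = 2048 trajectories.
batch_size, rollouts_per_sample = 256, 8
trajectories_per_step = batch_size * rollouts_per_sample
```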
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch size | 256 |
| Learning rate | 1e-6 |
| LR warmup steps | 10 |
| LR warmup ratio | 0.285 |
| Optimizer | AdamW |
| GRPO rollouts per sample (n) | 8 |
| PPO mini-batch size | 32 |
| PPO micro-batch size per GPU | 1 |
| Temperature | 1.0 |
| Top-p | 1.0 |
| Top-k | disabled (−1) |
| KL loss coefficient | 0.0 |
| KL loss type | low_var_kl |
| Entropy coefficient | 0.0 |
| Max prompt length | 2048 tokens |
| Max response length | 8192 tokens |
| Max action length | 2048 tokens |
| Max observation length | 1024 tokens |
| Max turns | 5 |
| Max concurrent trajectories | 32 |
| GPU memory utilization (vLLM) | 0.6 |
| vLLM max model length | 8192 |
| Sequence parallelism | 1 (disabled) |
| FSDP size | −1 (full sharding) |
Tool Configuration
The model was trained with live web search via a DDGS-based retrieval server:
| Setting | Value |
|---|---|
| Retriever | DDGS (Dux Distributed Global Search) |
| Search backends | google, brave, bing, wikipedia, grokipedia |
| Top-k documents per query | 5 |
| Backend strategy | Parallel fan-out — all backends queried simultaneously, results merged and deduplicated |
| Per-backend HTTP timeout | 10 s |
| Tool server workers | 4 |
| Action stop tokens | </search>, </answer> |
| Observations masked in loss | Yes (mask_observations=True) |
The retriever server runs as a FastAPI service. At each agent turn the model issues a <search> query </search> action; the tool server retrieves results and returns them as <information>…</information> observations. The trajectory ends when the model emits <answer> … </answer> or the turn budget is exhausted.
Training Infrastructure
| Setting | Value |
|---|---|
| Nodes | 1 × H100 node (g-series) |
| GPUs per node | 4 × H100 SXM5 80 GB |
| CPUs per node | 48 (allocated) |
| System memory | 256 GB |
| Local scratch (SSD) | 200 GB ($TMPDIR, used for Triton/Ray caches) |
| NCCL | NVLink P2P enabled (no NCCL_P2P_DISABLE) |
| vLLM version | v1 (VLLM_USE_V1=1) |
| Checkpoint frequency | Every 5 steps |
Reward Model
Reward manager: search_r1_qa_em
Reward is computed as exact-match (EM) between the model's <answer> and the reference target string. Multi-turn rollouts are scored at the final answer only; intermediate search turns receive no intermediate reward. Observations are masked from the KL and policy gradient loss.
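An exact-match reward in this spirit can be sketched as below. This is a hedged illustration modeled on common QA-EM normalization (lowercasing, punctuation and article stripping); the actual search_r1_qa_em manager may normalize differently.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_reward(response, targets):
    """Score only the final <answer>; intermediate search turns get nothing."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0  # no final answer emitted within the turn budget
    pred = normalize(m.group(1))
    return 1.0 if any(pred == normalize(t) for t in targets) else 0.0
```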
Usage
This model is designed to be used with a running tool server that handles <search> actions. Inference without a live retriever will fall back to the model's parametric knowledge.
With verl-tool (recommended)
```shell
git clone https://github.com/TIGER-AI-Lab/verl-tool
cd verl-tool
uv sync
source .venv/bin/activate

# Start the DDGS retriever
python ddgs_retrieval_optimized.py --port 8280 --topk 5 \
    --backend "google,brave,bing,wikipedia,grokipedia"

# Start the tool server
python -m verl_tool.servers.serve \
    --host 0.0.0.0 --port 30500 \
    --tool_type search_retrieval \
    --workers_per_tool 4
```
Direct inference (parametric knowledge only)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "orbit-ai/orbit-4b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = (
    "Answer the given question. Please break down the question, using it to plan "
    "a potential solution trajectory. You must conduct reasoning inside <think> and "
    "</think> first, then you may use tools to gather information. "
    "For search, use <search> query </search>. "
    "Provide your final answer with <answer> answer </answer>.\n\n"
    "Question: What percentage of blood is made up of plasma?"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
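To use the model as trained, generation has to be wrapped in the multi-turn loop the tool server implements. A minimal sketch follows; `generate_until` and `retrieve` are hypothetical stand-ins for model generation (stopping on `</search>`/`</answer>`) and the DDGS retriever, not real APIs of this repository.

```python
import re

def run_agent(prompt, generate_until, retrieve, max_turns=5):
    """Alternate model turns and retrieval until <answer> or turn budget."""
    transcript = prompt
    for _ in range(max_turns):
        chunk = generate_until(transcript, stop=["</search>", "</answer>"])
        transcript += chunk
        answer = re.search(r"<answer>(.*?)</answer>", chunk, re.DOTALL)
        if answer:
            return answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if query:
            docs = retrieve(query.group(1).strip())
            # Observations are injected as <information> and masked from the loss
            transcript += f"<information>{docs}</information>"
    return None  # turn budget exhausted without a final answer
```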
Intended Use & Limitations
- Intended use: Research into multi-turn retrieval-augmented reasoning and RL-based tool-use training.
- Language: English only.
- Search dependency: Peak performance requires a live web search backend. Without search, the model still reasons using parametric knowledge but accuracy degrades on fine-grained entity questions (particularly InfoSeek).
- Not intended for production deployment without additional safety filtering.
Citation
If you use this model or the training methodology, please cite: