RyWorld VLN — Stage 1 Discrete (step 15000)

Vision-language navigation policy built on InternVL3.5-1B with a separate StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training set and evaluated on the official coarse/val_unseen split (835 episodes) using the vlnverse_emr evaluation framework.

Headline result

On the full VLNVerse coarse/val_unseen (835 episodes) with stop_threshold = 0.95:

| Metric | Value |
|---|---|
| Success Rate (SR) | 51.14% |
| SPL | 49.22% |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
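
SPL follows the standard success-weighted-by-path-length definition (Anderson et al., 2018). A minimal reference implementation for readers reproducing numbers from the eval artifacts (the values above come from vlnverse_emr, not this snippet):

```python
def spl(successes, shortest_lengths, agent_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    success indicator, l_i the shortest-path length, and p_i the agent's
    actual path length for episode i."""
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, agent_lengths)
    ) / len(successes)
```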

Comparison vs VLNVerse paper baselines

Reproduced inside the official vlnverse_emr framework on the same coarse/val_unseen split. Baseline numbers from VLNVerse paper (arXiv:2512.19021, Table 3):

| Method | SR ↑ | SPL ↑ | Δ SR / Δ SPL vs RyWorld |
|---|---|---|---|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | −8.69 / −10.33 |
| RyWorld @ thr=0.95 (this model) | 51.14% | 49.22% | — |

Architecture

Inputs:

- RGB 256×256 (Isaac live or pre-rendered training video)
- Instruction text (formal variant)
- Proprio history, N=8 keyframes of body-frame deltas [dx, dy, cos(dθ), sin(dθ)]
- Previous action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):

- Discrete head xattn: 4-way CE (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Stop head xattn: BCE with pos_weight on the soft target stop_proximity = exp(−d/τ), with τ = 4.33 ≈ 3/ln 2 so the target is exactly 0.5 at the success_radius of 3 m

Backbone: OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024); vision tower frozen, LoRA r=8 on the language model, mlp1 trainable.
Connector: ProprioProjector (continuous proprio → 1024-d embedding).

Detailed architecture & training in docs/RYWORLD_ARCHITECTURE.md of the source repo.
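
As a quick check of the τ alignment, here is a minimal sketch of the stop target (the function and argument names are hypothetical; the actual implementation lives in the source repo):

```python
import math

SUCCESS_RADIUS_M = 3.0
TAU = SUCCESS_RADIUS_M / math.log(2)  # ≈ 4.33

def stop_proximity(distance_to_goal_m: float) -> float:
    """Soft BCE target for the StopHead: 1.0 at the goal, exactly 0.5
    at the 3 m success radius, decaying exponentially with distance."""
    return math.exp(-distance_to_goal_m / TAU)
```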

Per-segment performance

SR broken down by reference path length (shortest_path_length):

| Path length (m) | n | SR | NE (m) |
|---|---|---|---|
| [0, 5) | 151 | 66.9% | 2.55 |
| [5, 8) | 360 | 59.4% | 3.07 |
| [8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |

The drop on long paths (>12 m) is the dominant remaining gap (the [18, 30) bucket holds only 2 episodes, so its SR is not meaningful); addressing it likely requires either training-time long-horizon planning supervision or a larger forward_distance per high-level action. The breakdown can be recomputed as sketched below.
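
A sketch of that recomputation from the per_episode.csv the eval pipeline writes (see "How to use"); the column names `shortest_path_length`, `success`, and `nav_error` are assumptions here, so check the CSV header first:

```python
import pandas as pd

df = pd.read_csv("eval_results/per_episode.csv")
bins = [0, 5, 8, 12, 18, 30]  # same buckets as the table above
df["bucket"] = pd.cut(df["shortest_path_length"], bins, right=False)
print(
    df.groupby("bucket", observed=True).agg(
        n=("success", "size"),
        SR=("success", "mean"),
        NE_m=("nav_error", "mean"),
    )
)
```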

Stop head behavior (151,740 chunk-positions)

| Statistic | Value |
|---|---|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| Path A fire (argmax == Stop, natural) | 2.51% |
| Path B fire (threshold override) | 0.68% |
| No stop | 96.81% |

stop_threshold = 0.95 was selected via a four-point sweep (0.88 / 0.92 / 0.95 / 0.97) on a 30-episode smoke subset before the full run. Higher thresholds (0.97 and above) cause overshoot regressions on the long-path segment, so 0.95 is the empirical sweet spot.
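
The two fire paths in the table correspond to the following decision rule. This is a minimal sketch with hypothetical names; the STOP_DEBUG lines in server.log.gz record which path actually fired:

```python
STOP = 0  # discrete action index for Stop
STOP_THRESHOLD = 0.95

def decide_action(action_probs, stop_prob):
    """Path A: the discrete head itself picks Stop (natural stop).
    Path B: the StopHead overrides a non-Stop argmax above the threshold."""
    action = max(range(len(action_probs)), key=action_probs.__getitem__)
    if action == STOP:
        return STOP  # path A (natural)
    if stop_prob > STOP_THRESHOLD:
        return STOP  # path B (threshold override)
    return action    # no stop
```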

How to use

1. Load the checkpoint

```python
import sys

import torch
from omegaconf import OmegaConf

sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Base config plus the production overlay (later files win on conflicts).
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```
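
Note on strict_model=False: it tolerates key mismatches between the checkpoint and the freshly built model. Plausibly the frozen vision tower is taken from the base InternVL weights rather than stored in the checkpoint, but check load_vlm_checkpoint for the exact behavior.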

2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse  # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
  --ckpt ckpt_step0015000_final.pt \
  --tag eval_replicate \
  --stop-thr 0.95
```

See scripts/eval/run_eval_structured.sh for the eval pipeline; it records meta.json and per_episode.csv, and appends a run record to docs/eval_ledger.jsonl.
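
To compare runs across a threshold sweep, the ledger can be scanned directly. A sketch only: the field names (`tag`, `sr`, `spl`) are assumptions, so inspect one line of the JSONL for the actual schema.

```python
import json

with open("docs/eval_ledger.jsonl") as f:
    for line in f:
        rec = json.loads(line)  # one eval run per line
        print(rec.get("tag"), rec.get("sr"), rec.get("spl"))
```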

Training data

- VLNVerse coarse + fine train (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256×256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4× A100 80 GB with chunk_size=4 multi-step CE supervision plus StopHead BCE (pos_weight=5.0, τ=4.33); the combined objective is sketched below.
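
A schematic of that objective under assumed tensor shapes (a sketch, not the author's training code; the real loop lives in the source repo):

```python
import torch
import torch.nn.functional as F

def stage1_loss(action_logits, action_labels, stop_logits, goal_dist_m,
                tau=4.33, pos_weight=5.0):
    """Assumed shapes: action_logits (B, 4, 4), action_labels (B, 4),
    stop_logits (B, 4), goal_dist_m (B, 4) -- one entry per chunk position."""
    # 4-way CE over every chunk position.
    ce = F.cross_entropy(action_logits.flatten(0, 1), action_labels.flatten())
    # StopHead BCE against the soft target exp(-d/tau).
    stop_target = torch.exp(-goal_dist_m / tau)
    bce = F.binary_cross_entropy_with_logits(
        stop_logits, stop_target, pos_weight=torch.tensor(pos_weight)
    )
    return ce + bce
```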

Files in this repo

| File | Description |
|---|---|
| ckpt_step0015000_final.pt | Main checkpoint (2.81 GB) |
| stage1_discrete.yaml | Base training config |
| a100_4gpu_discrete.yaml | Production overlay (4× A100) |
| h1_ryworld_cfg_vlnverse_coarse_val_unseen.py | Eval config (vlnverse_emr) |
| eval_results/ | Full eval artifacts: per-shard meta.json, per_episode.csv, server.log.gz with STOP_DEBUG |
| EVAL_SUMMARY.md | One-page summary of headline metrics |

Citation

```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying benchmark paper:

```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```

License

Apache-2.0 (model weights & code).

Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the VLNVerse repo for details).
