RyWorld VLN — Stage 1 Discrete (step 15000)

Vision-language navigation policy built on InternVL3.5-1B with a separate StopHead and ProprioProjector. Trained on the VLNVerse coarse/fine training set and evaluated on the official coarse/val_unseen split (835 episodes) using the vlnverse_emr evaluation framework.

Headline result

On the full VLNVerse coarse/val_unseen (835 episodes) with stop_threshold = 0.95:

| Metric | Value |
|---|---|
| Success Rate (SR) | 51.14% |
| SPL | 49.22% |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
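
SPL follows the standard success-weighted-by-path-length definition (Anderson et al., 2018). A minimal reference implementation for readers reproducing numbers from the eval artifacts (the values above come from vlnverse_emr, not this snippet):

```python
def spl(successes, shortest_lengths, agent_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    success indicator, l_i the shortest-path length, and p_i the agent's
    actual path length for episode i."""
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, agent_lengths)
    ) / len(successes)
```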

Comparison vs VLNVerse paper baselines

Reproduced inside the official vlnverse_emr framework on the same coarse/val_unseen split. Baseline numbers from VLNVerse paper (arXiv:2512.19021, Table 3):

| Method | SR ↑ | SPL ↑ | Δ SR / Δ SPL vs RyWorld |
|---|---|---|---|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | −8.69 / −10.33 |
| RyWorld @ thr=0.95 (this model) | 51.14% | 49.22% | — |

Architecture

Inputs:

- RGB 256×256 (Isaac live or pre-rendered training video)
- Instruction text (formal variant)
- Proprio history, N=8 keyframes of body-frame deltas [dx, dy, cos(dθ), sin(dθ)]
- Previous action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):

- Discrete head xattn: 4-way CE (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Stop head xattn: BCE with pos_weight on the soft target stop_proximity = exp(−d/τ), with τ = 4.33 ≈ 3/ln 2 so the target is exactly 0.5 at the success_radius of 3 m

Backbone: OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024); vision tower frozen, LoRA r=8 on the language model, mlp1 trainable.
Connector: ProprioProjector (continuous proprio → 1024-d embedding).

Detailed architecture & training in docs/RYWORLD_ARCHITECTURE.md of the source repo.
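
As a quick check of the τ alignment, here is a minimal sketch of the stop target (the function and argument names are hypothetical; the actual implementation lives in the source repo):

```python
import math

SUCCESS_RADIUS_M = 3.0
TAU = SUCCESS_RADIUS_M / math.log(2)  # ≈ 4.33

def stop_proximity(distance_to_goal_m: float) -> float:
    """Soft BCE target for the StopHead: 1.0 at the goal, exactly 0.5
    at the 3 m success radius, decaying exponentially with distance."""
    return math.exp(-distance_to_goal_m / TAU)
```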

Per-segment performance

SR broken down by reference path length (shortest_path_length):

| Path length (m) | n | SR | NE (m) |
|---|---|---|---|
| [0, 5) | 151 | 66.9% | 2.55 |
| [5, 8) | 360 | 59.4% | 3.07 |
| [8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |

The drop on long paths (>12 m) is the dominant remaining gap (the [18, 30) bucket holds only 2 episodes, so its SR is not meaningful); addressing it likely requires either training-time long-horizon planning supervision or a larger forward_distance per high-level action. The breakdown can be recomputed as sketched below.
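
A sketch of that recomputation from the per_episode.csv the eval pipeline writes (see "How to use"); the column names `shortest_path_length`, `success`, and `nav_error` are assumptions here, so check the CSV header first:

```python
import pandas as pd

df = pd.read_csv("eval_results/per_episode.csv")
bins = [0, 5, 8, 12, 18, 30]  # same buckets as the table above
df["bucket"] = pd.cut(df["shortest_path_length"], bins, right=False)
print(
    df.groupby("bucket", observed=True).agg(
        n=("success", "size"),
        SR=("success", "mean"),
        NE_m=("nav_error", "mean"),
    )
)
```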

Stop head behavior (151,740 chunk-positions)

| Statistic | Value |
|---|---|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| Path A fire (argmax == Stop, natural) | 2.51% |
| Path B fire (threshold override) | 0.68% |
| No stop | 96.81% |

stop_threshold = 0.95 was selected via a four-point sweep (0.88 / 0.92 / 0.95 / 0.97) on a 30-episode smoke subset before the full run. Higher thresholds (0.97 and above) cause overshoot regressions on the long-path segment, so 0.95 is the empirical sweet spot.
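
The two fire paths in the table correspond to the following decision rule. This is a minimal sketch with hypothetical names; the STOP_DEBUG lines in server.log.gz record which path actually fired:

```python
STOP = 0  # discrete action index for Stop
STOP_THRESHOLD = 0.95

def decide_action(action_probs, stop_prob):
    """Path A: the discrete head itself picks Stop (natural stop).
    Path B: the StopHead overrides a non-Stop argmax above the threshold."""
    action = max(range(len(action_probs)), key=action_probs.__getitem__)
    if action == STOP:
        return STOP  # path A (natural)
    if stop_prob > STOP_THRESHOLD:
        return STOP  # path B (threshold override)
    return action    # no stop
```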

How to use

1. Load the checkpoint

```python
import sys

import torch
from omegaconf import OmegaConf

sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")
from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Base config plus the production overlay (later files win on conflicts).
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```
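
Note on strict_model=False: it tolerates key mismatches between the checkpoint and the freshly built model. Plausibly the frozen vision tower is taken from the base InternVL weights rather than stored in the checkpoint, but check load_vlm_checkpoint for the exact behavior.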

2. Evaluate on VLNVerse

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse  # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES

bash scripts/eval/run_eval_structured.sh \
  --ckpt ckpt_step0015000_final.pt \
  --tag eval_replicate \
  --stop-thr 0.95
```

See scripts/eval/run_eval_structured.sh for the eval pipeline; it records meta.json and per_episode.csv, and appends a run record to docs/eval_ledger.jsonl.
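
To compare runs across a threshold sweep, the ledger can be scanned directly. A sketch only: the field names (`tag`, `sr`, `spl`) are assumptions, so inspect one line of the JSONL for the actual schema.

```python
import json

with open("docs/eval_ledger.jsonl") as f:
    for line in f:
        rec = json.loads(line)  # one eval run per line
        print(rec.get("tag"), rec.get("sr"), rec.get("spl"))
```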

Training data

- VLNVerse coarse + fine train (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256×256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text

Trained on 4× A100 80 GB with chunk_size=4 multi-step CE supervision plus StopHead BCE (pos_weight=5.0, τ=4.33); the combined objective is sketched below.
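
A schematic of that objective under assumed tensor shapes (a sketch, not the author's training code; the real loop lives in the source repo):

```python
import torch
import torch.nn.functional as F

def stage1_loss(action_logits, action_labels, stop_logits, goal_dist_m,
                tau=4.33, pos_weight=5.0):
    """Assumed shapes: action_logits (B, 4, 4), action_labels (B, 4),
    stop_logits (B, 4), goal_dist_m (B, 4) -- one entry per chunk position."""
    # 4-way CE over every chunk position.
    ce = F.cross_entropy(action_logits.flatten(0, 1), action_labels.flatten())
    # StopHead BCE against the soft target exp(-d/tau).
    stop_target = torch.exp(-goal_dist_m / tau)
    bce = F.binary_cross_entropy_with_logits(
        stop_logits, stop_target, pos_weight=torch.tensor(pos_weight)
    )
    return ce + bce
```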

Files in this repo

| File | Description |
|---|---|
| ckpt_step0015000_final.pt | Main checkpoint (2.81 GB) |
| stage1_discrete.yaml | Base training config |
| a100_4gpu_discrete.yaml | Production overlay (4× A100) |
| h1_ryworld_cfg_vlnverse_coarse_val_unseen.py | Eval config (vlnverse_emr) |
| eval_results/ | Full eval artifacts: per-shard meta.json, per_episode.csv, server.log.gz with STOP_DEBUG |
| EVAL_SUMMARY.md | One-page summary of headline metrics |

Citation

```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```

If you use this model on the VLNVerse benchmark, please also cite the underlying benchmark paper:

```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```

License

Apache-2.0 (model weights & code).

Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the VLNVerse repo for details).
