# RyWorld VLN — Stage 1 Discrete (step 15000)
A vision-language navigation policy built on InternVL3.5-1B with separate
StopHead and ProprioProjector modules. Trained on the VLNVerse coarse/fine training
set; evaluated on the official coarse/val_unseen 835-episode benchmark using
the vlnverse_emr evaluation framework.
## Headline result

On the full VLNVerse coarse/val_unseen split (835 episodes) with `stop_threshold = 0.95`:
| Metric | Value |
|---|---|
| Success Rate (SR) | 51.14% |
| SPL | 49.22% |
| Oracle Success Rate (OSR) | 64.79% |
| Navigation Error (NE) | 3.727 m |
| nDTW | 0.9445 |
| Mean Trajectory Length | 6.121 m |
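SPL here follows the standard success-weighted-by-path-length definition from the VLN literature. A minimal sketch of computing it from per-episode records (the function and column names are illustrative assumptions, not this repo's exact schema):

```python
import numpy as np

def spl(success, shortest_len, agent_len):
    # SPL = mean over episodes of S_i * l_i / max(p_i, l_i),
    # where S_i is binary success, l_i the shortest-path length,
    # and p_i the length of the path the agent actually took.
    success, shortest_len, agent_len = map(np.asarray, (success, shortest_len, agent_len))
    return float(np.mean(success * shortest_len / np.maximum(agent_len, shortest_len)))
```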
## Comparison vs VLNVerse paper baselines

Reproduced inside the official vlnverse_emr framework on the same coarse/val_unseen
split. Baseline numbers are taken from the VLNVerse paper (arXiv:2512.19021, Table 3):
| Method | SR ↑ | SPL ↑ | Δ SR / Δ SPL vs RyWorld |
|---|---|---|---|
| CMA (VLN-CE) | 32.15% | 29.06% | −18.99 / −20.16 |
| Seq2Seq | 31.91% | 29.68% | −19.23 / −19.54 |
| HNR | 36.02% | 33.67% | −15.12 / −15.55 |
| RDP | 41.61% | 37.53% | −9.53 / −11.69 |
| GAMA (paper SOTA) | 42.45% | 38.89% | −8.69 / −10.33 |
| RyWorld @ thr=0.95 (this model) | 51.14% | 49.22% | — |
## Architecture
Inputs:
- RGB 256x256 (Isaac live or pre-rendered training video)
- Instruction text (formal variant)
- Proprio history, N=8 keyframes of body-frame deltas [dx, dy, cos(dtheta), sin(dtheta)] (see the sketch below)
- Previous action class history (decision-point keyframe selector)

Outputs (per chunk position, chunk_size=4):
- Discrete head xattn: 4-way CE (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Stop head xattn: BCE-with-pos_weight on the soft target stop_proximity = exp(-d/tau), with tau=4.33 aligned to success_radius=3 m
Backbone: OpenGVLab/InternVL3_5-1B (InternViT-300M + Qwen3-0.6B, d_model=1024,
vision tower frozen, LoRA r=8 on language, mlp1 trainable)
Connector: ProprioProjector (continuous proprio -> 1024-dim embedding)
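As an illustration of the proprio features listed above, here is a minimal sketch of the body-frame delta between consecutive keyframe poses (function and variable names are hypothetical, not the repo's actual API):

```python
import numpy as np

def body_frame_delta(pose_prev, pose_cur):
    """pose = (x, y, theta) in the world frame; returns [dx, dy, cos(dtheta), sin(dtheta)]
    with the translation expressed in the previous pose's body frame."""
    x0, y0, th0 = pose_prev
    x1, y1, th1 = pose_cur
    # Rotate the world-frame displacement into the previous body frame.
    dx_w, dy_w = x1 - x0, y1 - y0
    dx = np.cos(th0) * dx_w + np.sin(th0) * dy_w
    dy = -np.sin(th0) * dx_w + np.cos(th0) * dy_w
    dth = th1 - th0
    return np.array([dx, dy, np.cos(dth), np.sin(dth)])
```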
Detailed architecture and training notes are in `docs/RYWORLD_ARCHITECTURE.md` of the source repo.
## Per-segment performance

SR broken down by reference path length (shortest_path_length):
| Path length (m) | n | SR | NE (m) |
|---|---|---|---|
| [ 0, 5) | 151 | 66.9% | 2.55 |
| [ 5, 8) | 360 | 59.4% | 3.07 |
| [ 8, 12) | 226 | 42.9% | 4.33 |
| [12, 18) | 96 | 14.6% | 6.58 |
| [18, 30) | 2 | 50.0% | 4.55 |
The drop on long paths (>12 m) is the dominant remaining gap; closing it
likely requires either training-time long-horizon planning supervision or a
larger forward_distance per high-level action.
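The breakdown above can be reproduced from the per-episode eval artifacts. A minimal sketch with pandas, assuming per_episode.csv exposes success, navigation-error, and shortest_path_length columns (the exact column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("eval_results/per_episode.csv")
# Bin episodes by reference path length into the segments used above.
bins = [0, 5, 8, 12, 18, 30]
df["segment"] = pd.cut(df["shortest_path_length"], bins=bins, right=False)
print(df.groupby("segment", observed=True)
        .agg(n=("success", "size"), SR=("success", "mean"), NE=("nav_error", "mean")))
```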
## Stop head behavior (151,740 chunk positions)
| Statistic | Value |
|---|---|
| stop_prob median | 0.752 |
| stop_prob p90 | 0.897 |
| pathA fire (argmax==Stop, natural) | 2.51% |
| pathB fire (threshold override) | 0.68% |
| no-stop | 96.81% |
stop_threshold=0.95 was selected via a 4-point sweep (0.88/0.92/0.95/0.97) on
a 30-episode smoke subset before the full run. Higher thresholds (0.97+) cause
overshoot regressions on the long-path segment; 0.95 is the empirical sweet spot.
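A minimal sketch of the two-path stop rule described above (names are illustrative, not the repo's actual API):

```python
import torch

STOP, FORWARD, TURN_LEFT, TURN_RIGHT = 0, 1, 2, 3

def decide_action(action_logits: torch.Tensor, stop_prob: float,
                  stop_threshold: float = 0.95) -> int:
    """action_logits: (4,) logits from the discrete head; stop_prob: sigmoid of the stop head."""
    action = int(action_logits.argmax())
    if action == STOP:                   # path A: natural stop (argmax == Stop)
        return STOP
    if stop_prob >= stop_threshold:      # path B: threshold override
        return STOP
    return action
```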
## How to use
1. Load the checkpoint:

```python
import sys

import torch
from omegaconf import OmegaConf

sys.path.insert(0, "/path/to/ry-dynamics-vln-ryworld")

from ryworld.training.train_ryworld_vlm import build_model_from_yaml
from ryworld.training.ryworld_train_utils_vlm import load_vlm_checkpoint

# Merge the base training config with the production overlay.
cfg = OmegaConf.merge(
    OmegaConf.load("stage1_discrete.yaml"),
    OmegaConf.load("a100_4gpu_discrete.yaml"),
)
model = build_model_from_yaml(cfg, device=torch.device("cuda"))
load_vlm_checkpoint(model, None, "ckpt_step0015000_final.pt", strict_model=False)
model.eval()  # inference mode
```
2. Evaluate on VLNVerse:

```bash
cd /path/to/ry-dynamics-vln-ryworld
conda activate vlnverse  # Isaac Sim 4.5 + torch 2.7.1 + cu126
export OMNI_KIT_ACCEPT_EULA=YES
bash scripts/eval/run_eval_structured.sh \
    --ckpt ckpt_step0015000_final.pt \
    --tag eval_replicate \
    --stop-thr 0.95
```

See `scripts/eval/run_eval_structured.sh` for the eval pipeline; it records meta.json and per_episode.csv and appends to `docs/eval_ledger.jsonl`.
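The ledger is JSON Lines, so each recorded run can be loaded directly; a minimal sketch (field names are assumptions):

```python
import json

with open("docs/eval_ledger.jsonl") as f:
    runs = [json.loads(line) for line in f if line.strip()]
# e.g. print the tag and headline metric of each recorded run
for run in runs:
    print(run.get("tag"), run.get("sr"))
```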
## Training data
- VLNVerse coarse + fine train (~12,000 trajectories, 33 indoor scenes)
- Pre-rendered RGB videos at 256x256 (10 fps)
- Discrete action labels (0=Stop / 1=Forward / 2=TurnLeft / 3=TurnRight)
- Formal-variant instruction text
Trained on 4x A100 80 GB with chunk_size=4 multi-step CE supervision +
StopHead BCE (pos_weight=5.0, tau=4.33).
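A minimal sketch of the StopHead loss described above; note that tau = 4.33 ≈ 3 / ln 2, so the soft target passes 0.5 exactly at the 3 m success radius (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def stop_loss(stop_logits: torch.Tensor, dist_to_goal: torch.Tensor,
              tau: float = 4.33, pos_weight: float = 5.0) -> torch.Tensor:
    # Soft target: stop_proximity = exp(-d / tau); ~0.5 at d = 3 m.
    target = torch.exp(-dist_to_goal / tau)
    return F.binary_cross_entropy_with_logits(
        stop_logits, target, pos_weight=torch.tensor(pos_weight))
```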
## Files in this repo

| File | Description |
|---|---|
| `ckpt_step0015000_final.pt` | Main checkpoint (2.81 GB) |
| `stage1_discrete.yaml` | Base training config |
| `a100_4gpu_discrete.yaml` | Production overlay (4x A100) |
| `h1_ryworld_cfg_vlnverse_coarse_val_unseen.py` | Eval config (vlnverse_emr) |
| `eval_results/` | Full eval artifacts: per-shard meta.json, per_episode.csv, server.log.gz with STOP_DEBUG |
| `EVAL_SUMMARY.md` | One-page summary of headline metrics |
## Citation

```bibtex
@misc{ryworld2026,
  title  = {RyWorld: Vision-Language Navigation with a Unified Multimodal World Model},
  author = {{wei.tao, RUYi Dynamics}},
  year   = {2026},
  url    = {https://huggingface.co/ruyidynamics/ryworld-vln-discrete}
}
```
If you use this model on the VLNVerse benchmark, please also cite the underlying benchmark paper:
```bibtex
@article{vlnverse2025,
  title   = {VLNVerse: A Large-Scale Extensible Benchmark for Vision-Language Navigation},
  author  = {Sihao Yu and Yuxuan Zhang and others},
  journal = {arXiv preprint arXiv:2512.19021},
  year    = {2025}
}
```
## License
Apache-2.0 (model weights & code).
Note: VLNVerse data and Isaac Sim assets retain their own licenses (see the VLNVerse repo for details).