UniVR: Thinking in Visual Space for Unified Visual Reasoning

[Figure: UniVR Overview]

🌐 Project Page  |  📄 Paper  |  💻 Code  |  📦 VR-X Dataset


Model Summary

UniVR is the first framework that simultaneously learns complex reasoning, fine-grained physical dynamics, and long-term planning from pure visual demonstrations, without relying on dense image-text pairs or task-specific heuristics.

Built on Emu3.5 (34B), UniVR uses a unified next-token prediction objective to directly generate visual reasoning traces given an image and instruction. Training employs a two-stage pipeline: supervised cold initialization on the VR-X dataset, followed by VR-GRPO reinforcement learning with complementary global and step-focal rewards.

| Feature | Detail |
|---|---|
| Architecture | Emu3.5 34B (VQ-VAE unified generative model) |
| Training | SFT (310k samples) → VR-GRPO RL (3k samples) |
| Visual Thinking | Native visual-space reasoning, no intermediate text chain |
| Benchmark | VR-X: 16 sources, 6 task categories, 1.8k evaluation samples |

Available Checkpoints

| Model | Description | Link |
|---|---|---|
| UniVR-34B-Planning | Optimized for long-horizon planning tasks (robotic manipulation, tool use, multi-step control) | maverickrzw/UniVR-34B-Planning |
| UniVR-34B-General | Full UniVR recipe with interleaved image-text data; suitable for general visual reasoning | maverickrzw/UniVR-34B-General |

Key Results

VR-X Benchmark

UniVR improves on the Emu3.5 baseline by up to 25.2 points (Robot) and by 18.4 points overall, and approaches Gemini 3 Pro + Nano Banana 2 (58.2 vs. 66.1 overall) with only 34B parameters.

| Method | Visual Thinking | Guidance | Robot | Editing | Spatial | Puzzle | Search | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| Gemini-3-pro + Nano Banana 2 | ✗ | 66.2 | 67.1 | 63.7 | 55.1 | 65.5 | 79.0 | 66.1 |
| GPT-5 + GPT-image-1.5 | ✗ | 68.2 | 64.1 | 58.0 | 49.3 | 64.0 | 77.4 | 63.5 |
| Emu3.5 34B | ✗ | 38.6 | 42.8 | 32.7 | 35.3 | 43.4 | 46.2 | 39.8 |
| UniVR 34B | ✓ | 59.5 | 68.0 | 48.5 | 46.5 | 62.2 | 64.3 | 58.2 |
| Δ vs. Emu3.5 | | ↑20.9 | ↑25.2 | ↑15.8 | ↑11.2 | ↑18.8 | ↑18.1 | ↑18.4 |
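The Δ row can be re-derived from the two 34B rows of the table above. The short Python sketch below does that arithmetic with the scores copied verbatim from the table. The helper names (`deltas`, `mean_score`) are illustrative and not part of the UniVR codebase. The Overall column happens to match the unweighted mean of the six categories to one decimal place; whether the authors aggregate this way is an assumption.

```python
# Scores copied from the VR-X table above, in column order.
CATEGORIES = ["Guidance", "Robot", "Editing", "Spatial", "Puzzle", "Search"]
EMU35 = [38.6, 42.8, 32.7, 35.3, 43.4, 46.2]
UNIVR = [59.5, 68.0, 48.5, 46.5, 62.2, 64.3]


def deltas(new, base):
    """Absolute per-category gain, rounded to one decimal like the table."""
    return [round(n - b, 1) for n, b in zip(new, base)]


def mean_score(scores):
    """Unweighted mean over categories (assumed aggregation, see lead-in)."""
    return round(sum(scores) / len(scores), 1)


gains = dict(zip(CATEGORIES, deltas(UNIVR, EMU35)))
best = max(gains, key=gains.get)  # category with the largest absolute gain

print(gains)
print(best, gains[best])
print(mean_score(EMU35), mean_score(UNIVR))
```

These are absolute score differences, not relative percentages, which is why the largest gain is best read as 25.2 points on the Robot category.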

Multimodal Understanding

Enhanced visual reasoning also improves scores on standard multimodal benchmarks, with no degradation of the base model's capabilities.

| Method | MMMU | MME(P) | MME(C) | MMBench | MathVista | MM-Vet |
|---|---|---|---|---|---|---|
| Emu3.5 | 0.292 | 781.1 | 324.6 | 0.183 | 41.7 | 28.0 |
| UniVR | 0.337 | 799.3 | 338.5 | 0.198 | 44.0 | 35.6 |
| Δ vs. Emu3.5 | ↑0.045 | ↑18.2 | ↑13.9 | ↑0.015 | ↑2.3 | ↑7.6 |

Quick Start

Installation

git clone https://github.com/MaverickRen/UniVR.git
cd UniVR
bash install.sh

Inference

cd UniVR_SFT

# Download checkpoint
huggingface-cli download maverickrzw/UniVR-34B-Planning --local-dir weights/UniVR-34B-Planning

# Download VisionTokenizer
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir weights/Emu3.5-VisionTokenizer

# Run inference
bash scripts/inference.sh
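
If you prefer to fetch the weights from Python rather than the CLI, the two download commands above can be sketched with `snapshot_download` from the `huggingface_hub` package. The repo IDs and target directories are copied from the commands above. The `CHECKPOINTS` mapping and the `download_all` helper are illustrative names, not part of the UniVR repository, and the injectable `download_fn` argument exists only so the logic can be checked without network access.

```python
from typing import Callable, Dict, Optional

# Repo IDs and local directories copied from the shell commands above.
CHECKPOINTS: Dict[str, str] = {
    "maverickrzw/UniVR-34B-Planning": "weights/UniVR-34B-Planning",
    "BAAI/Emu3.5-VisionTokenizer": "weights/Emu3.5-VisionTokenizer",
}


def download_all(download_fn: Optional[Callable[..., str]] = None) -> Dict[str, str]:
    """Download every repo in CHECKPOINTS and return {repo_id: local_path}.

    download_fn defaults to huggingface_hub.snapshot_download; passing a
    stub makes a dry run possible (hypothetical convenience, not a UniVR API).
    """
    if download_fn is None:
        # Imported lazily so a dry run does not require huggingface_hub.
        from huggingface_hub import snapshot_download
        download_fn = snapshot_download
    return {
        repo_id: download_fn(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in CHECKPOINTS.items()
    }
```

Calling `download_all()` with no arguments performs the real download, which is large for a 34B checkpoint. Run it from inside `UniVR_SFT` so the relative `weights/` paths match the ones used above.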

Configure configs/config.py to set model paths and prompts:

{
    "prompt": "Tie the red rope around the white gift box. Finish this task in 3 steps.",
    "reference_image": "path/to/first_frame.jpg",
}

Training

SFT (Cold Initialization):

cd UniVR_SFT
# LoRA (2 nodes × 8 GPUs)
bash scripts/train_sft_lora.sh
# Full parameter (4 nodes × 8 GPUs)
bash scripts/train_sft_full.sh

RL (VR-GRPO):

cd UniVR_RL
bash examples/emu3_grpo_lora.sh

Method: VR-GRPO

UniVR proposes VR-GRPO (Visual Reasoning GRPO), a reinforcement learning paradigm that combines:

  • Global Reward (R_g): A VLM evaluator assesses overall task completion and visual quality via pairwise comparison.
  • Step-Focal Reward (R_s): Identifies the most error-prone sub-steps by computing inter-trajectory CLIP-feature variance across rollout samples, then applies fine-grained VLM evaluation on critical windows.
  • Combined Reward: R_reason = R_g − λ|R_g − R_s|, enforcing both terminal correctness and procedural integrity.

This design prevents reward hacking in long-horizon tasks where global-only rewards overlook intermediate physical violations and logical gaps.
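
The reward components above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: rewards are treated as scalars in [0, 1], `lam=0.5` is a placeholder because the value of λ is not given here, and small hand-written vectors stand in for CLIP features. The function names are hypothetical. The step selector scores each step by the variance of its features across rollouts, averaged over feature dimensions, and returns the single highest-scoring step rather than a window.

```python
from statistics import pvariance
from typing import List, Sequence


def combined_reward(r_g: float, r_s: float, lam: float = 0.5) -> float:
    """R_reason = R_g - lam * |R_g - R_s| (lam=0.5 is a placeholder value)."""
    return r_g - lam * abs(r_g - r_s)


def most_divergent_step(features: Sequence[Sequence[Sequence[float]]]) -> int:
    """features[k][t] is the feature vector of rollout k at step t.

    Returns the step index whose features vary most across rollouts:
    population variance across rollouts, averaged over feature dimensions.
    """
    num_steps = len(features[0])
    scores: List[float] = []
    for t in range(num_steps):
        vectors = [rollout[t] for rollout in features]
        per_dim = [pvariance(values) for values in zip(*vectors)]
        scores.append(sum(per_dim) / len(per_dim))
    return max(range(num_steps), key=scores.__getitem__)


# Toy stand-ins for CLIP features: 3 rollouts, 3 steps, 2 dimensions.
# The rollouts agree on steps 0 and 2 and disagree on step 1.
rollouts = [
    [[0.0, 0.0], [1.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.0, 0.0], [-1.0, 0.0], [0.5, 0.5]],
]

print(most_divergent_step(rollouts))  # step where rollouts disagree most
print(combined_reward(1.0, 0.25))     # strong final result, weak critical step
print(combined_reward(0.75, 0.75))    # consistent final result and steps
```

With these placeholder values, the trajectory with a perfect global score but a weak critical step (0.625) ends up below the consistent trajectory (0.75), which is the behaviour the penalty term is meant to produce.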


Sample Outputs

[Sample output images: Tie a Knot | Hang Clothes | Draw]

Training Data

UniVR is trained on VR-X, a large-scale benchmark curated from 1.5M raw samples across 16 diverse sources:

| Category | Sources | Examples |
|---|---|---|
| Visual Guidance | EgoDex, Action100M, Epic-Kitchen, VideoCraftBench | Cooking, handcrafting, daily activities |
| Robot Manipulation | AgiBot, Droid, Bridge, ZebraCoT-Robot | Robotic grasping, tool use, multi-step control |
| Editing | ZebraCoT-Multiobject | Object manipulation, scene editing |
| Spatial Perception | ThinkMorph-Navigation, ZebraCoT-Embodiment | Navigation, spatial reasoning |
| Visual Search | VisualCoT, ThinkMorph-Search | Object localization, attention |
| Puzzle & Game | VRBench, Zebra-Jigsaw, ThinkMorph-VisPuzzle | Mazes, jigsaw, visual puzzles |

Download: maverickrzw/VR-X-SFT-RL


Citation

@article{ren2026univr,
  title={UniVR: Thinking in Visual Space for Unified Visual Reasoning},
  author={Zhongwei Ren and Yunchao Wei and Zhao Yao and Guixun Luo and Yao Zhao and Weibo Gong and Xiao Liu and Anran Wang and Xiangtai Li and Xiaojie Jin},
  year={2026},
}

License

This project is released under the Apache 2.0 License.

Acknowledgements

UniVR is built upon Emu3.5 and verl. We thank the authors for their excellent open-source contributions.
