UniVR: Thinking in Visual Space for Unified Visual Reasoning
Project Page | Paper | Code | VR-X Dataset
Model Summary
UniVR is the first framework that simultaneously learns complex reasoning, fine-grained physical dynamics, and long-term planning from pure visual demonstrations, without relying on dense image-text pairs or task-specific heuristics.
Built on Emu3.5 (34B), UniVR uses a unified next-token prediction objective to directly generate visual reasoning traces given an image and instruction. Training employs a two-stage pipeline: supervised cold initialization on the VR-X dataset, followed by VR-GRPO reinforcement learning with complementary global and step-focal rewards.
| Feature | Detail |
|---|---|
| Architecture | Emu3.5 34B (VQ-VAE unified generative model) |
| Training | SFT (310k samples) → VR-GRPO RL (3k samples) |
| Visual Thinking | Native visual-space reasoning, no intermediate text chain |
| Benchmark | VR-X: 16 sources, 6 task categories, 1.8k evaluation samples |
Available Checkpoints
| Model | Description | Link |
|---|---|---|
| UniVR-34B-Planning | Optimized for long-horizon planning tasks (robotic manipulation, tool use, multi-step control) | maverickrzw/UniVR-34B-Planning |
| UniVR-34B-General | Full UniVR recipe with interleaved image-text data; suitable for general visual reasoning | maverickrzw/UniVR-34B-General |
Key Results
VR-X Benchmark
UniVR improves on the Emu3.5 baseline by up to 25.2 points (Robot category) and approaches Gemini 3 Pro + Nano Banana 2 with only 34B parameters.
| Method | Visual Thinking | Guidance | Robot | Editing | Spatial | Puzzle | Search | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| Gemini-3-pro + Nano Banana 2 | β | 66.2 | 67.1 | 63.7 | 55.1 | 65.5 | 79.0 | 66.1 |
| GPT-5 + GPT-image-1.5 | β | 68.2 | 64.1 | 58.0 | 49.3 | 64.0 | 77.4 | 63.5 |
| Emu3.5 34B | β | 38.6 | 42.8 | 32.7 | 35.3 | 43.4 | 46.2 | 39.8 |
| UniVR 34B | β | 59.5 | 68.0 | 48.5 | 46.5 | 62.2 | 64.3 | 58.2 |
| Δ vs. Emu3.5 | | ↑20.9 | ↑25.2 | ↑15.8 | ↑11.2 | ↑18.8 | ↑18.1 | ↑18.4 |
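The gains in the last row of the table can be recomputed from the Emu3.5 and UniVR rows. The short sketch below does that and also checks that the Overall column matches an unweighted mean of the six categories. That averaging rule is an inference from the numbers, not something the card states, and the helper names are illustrative.

```python
# Scores copied from the VR-X table above (six task categories, then Overall).
CATEGORIES = ["Guidance", "Robot", "Editing", "Spatial", "Puzzle", "Search"]
EMU35 = {"scores": [38.6, 42.8, 32.7, 35.3, 43.4, 46.2], "overall": 39.8}
UNIVR = {"scores": [59.5, 68.0, 48.5, 46.5, 62.2, 64.3], "overall": 58.2}


def deltas(new, base):
    """Per-category absolute gain in points, rounded to one decimal."""
    return [round(n - b, 1) for n, b in zip(new, base)]


def mean_score(scores):
    """Unweighted mean over categories (assumed aggregation, see the text above)."""
    return round(sum(scores) / len(scores), 1)


gains = dict(zip(CATEGORIES, deltas(UNIVR["scores"], EMU35["scores"])))
largest = max(gains, key=gains.get)

if __name__ == "__main__":
    print(gains)
    print("largest gain:", largest, gains[largest])
    print("mean:", mean_score(UNIVR["scores"]), "reported:", UNIVR["overall"])
```

The gains are differences in benchmark points, so the largest one (Robot) is an absolute improvement rather than a relative percentage.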
Multimodal Understanding
Enhanced visual reasoning also improves results on standard multimodal benchmarks, with no degradation of the base model's capabilities.
| Method | MMMU | MME(P) | MME(C) | MMBench | MathVista | MM-Vet |
|---|---|---|---|---|---|---|
| Emu3.5 | 0.292 | 781.1 | 324.6 | 0.183 | 41.7 | 28.0 |
| UniVR | 0.337 | 799.3 | 338.5 | 0.198 | 44.0 | 35.6 |
| Δ vs. Emu3.5 | ↑0.045 | ↑18.2 | ↑13.9 | ↑0.015 | ↑2.3 | ↑7.6 |
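Because these benchmarks use different score scales (fractions for MMMU and MMBench, hundreds for MME), the absolute gains are hard to compare directly. As an illustrative sketch, the snippet below expresses each gain as a percentage of the Emu3.5 score. The relative figures are derived here from the table and are not reported in the card.

```python
# Values copied from the multimodal-understanding table above.
BENCHMARKS = ["MMMU", "MME(P)", "MME(C)", "MMBench", "MathVista", "MM-Vet"]
EMU35 = [0.292, 781.1, 324.6, 0.183, 41.7, 28.0]
UNIVR = [0.337, 799.3, 338.5, 0.198, 44.0, 35.6]


def relative_gain(new, base):
    """Gain expressed as a percentage of the baseline score."""
    return 100.0 * (new - base) / base


rel = {name: relative_gain(n, b) for name, n, b in zip(BENCHMARKS, UNIVR, EMU35)}

if __name__ == "__main__":
    for name, pct in rel.items():
        print(f"{name:10s} {pct:+.1f}%")
```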
Quick Start
Installation
git clone https://github.com/MaverickRen/UniVR.git
cd UniVR
bash install.sh
Inference
cd UniVR_SFT
# Download checkpoint
huggingface-cli download maverickrzw/UniVR-34B-Planning --local-dir weights/UniVR-34B-Planning
# Download VisionTokenizer
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir weights/Emu3.5-VisionTokenizer
# Run inference
bash scripts/inference.sh
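The two downloads can also be scripted from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package with the same repository IDs and target folders as the commands above; the `fetch_all` helper is an illustrative name and not part of the UniVR repository.

```python
from pathlib import Path

# Repository IDs and target folders taken from the shell commands above.
DOWNLOADS = {
    "maverickrzw/UniVR-34B-Planning": "weights/UniVR-34B-Planning",
    "BAAI/Emu3.5-VisionTokenizer": "weights/Emu3.5-VisionTokenizer",
}


def fetch_all(downloads=DOWNLOADS):
    """Download each repository snapshot into its local directory."""
    # Imported lazily so the mapping can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    paths = {}
    for repo_id, local_dir in downloads.items():
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        paths[repo_id] = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return paths


if __name__ == "__main__":
    for repo, path in fetch_all().items():
        print(repo, "->", path)
```

Run it from inside `UniVR_SFT` so the relative `weights/` paths match the ones used by the shell commands.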
Configure configs/config.py to set model paths and prompts:
{
"prompt": "Tie the red rope around the white gift box. Finish this task in 3 steps.",
"reference_image": "path/to/first_frame.jpg",
}
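Before launching a long inference run, it may help to sanity-check an entry like the one above. The sketch below is a hypothetical helper, not part of the UniVR code base. It assumes only the two keys shown in the example (`prompt` and `reference_image`) and makes no claim about the rest of `configs/config.py`.

```python
from pathlib import Path


def check_entry(entry):
    """Return a list of problems found in one prompt/reference-image entry."""
    problems = []
    prompt = entry.get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt is missing or empty")
    image = entry.get("reference_image")
    if not image:
        problems.append("reference_image is missing")
    elif not Path(image).is_file():
        problems.append(f"reference_image not found: {image}")
    return problems


if __name__ == "__main__":
    example = {
        "prompt": "Tie the red rope around the white gift box. Finish this task in 3 steps.",
        "reference_image": "path/to/first_frame.jpg",
    }
    for problem in check_entry(example):
        print("config problem:", problem)
```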
Training
SFT (Cold Initialization):
cd UniVR_SFT
# LoRA (2 nodes × 8 GPUs)
bash scripts/train_sft_lora.sh
# Full parameter (4 nodes × 8 GPUs)
bash scripts/train_sft_full.sh
RL (VR-GRPO):
cd UniVR_RL
bash examples/emu3_grpo_lora.sh
Method: VR-GRPO
UniVR proposes VR-GRPO (Visual Reasoning GRPO), a reinforcement learning paradigm that combines:
- Global Reward (R_g): A VLM evaluator assesses overall task completion and visual quality via pairwise comparison.
- Step-Focal Reward (R_s): Identifies the most error-prone sub-steps by computing inter-trajectory CLIP-feature variance across rollout samples, then applies fine-grained VLM evaluation on critical windows.
- Combined Reward: $R_{\text{reason}} = R_g - \lambda\,\lvert R_g - R_s\rvert$, enforcing both terminal correctness and procedural integrity.
This design prevents reward hacking in long-horizon tasks where global-only rewards overlook intermediate physical violations and logical gaps.
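The reward combination and the variance-based step selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the value of λ, the window width, and the feature shapes are placeholders; random arrays stand in for CLIP features; and the VLM evaluation that produces R_g and R_s is not modeled.

```python
import numpy as np


def combined_reward(r_g, r_s, lam=0.5):
    """R_reason = R_g - lam * |R_g - R_s|; lam=0.5 is a placeholder value."""
    return r_g - lam * abs(r_g - r_s)


def step_variance(features):
    """Inter-trajectory variance per step.

    features: array of shape (num_rollouts, num_steps, feature_dim),
    e.g. one image embedding per generated step of each rollout.
    """
    # Variance across rollouts for each feature dimension, averaged over dimensions.
    return features.var(axis=0).mean(axis=-1)


def focal_window(features, width=2):
    """Start and end (exclusive) of the contiguous window with the highest summed variance."""
    var = step_variance(features)
    width = min(width, len(var))
    sums = np.convolve(var, np.ones(width), mode="valid")
    start = int(np.argmax(sums))
    return start, start + width


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(scale=0.01, size=(4, 6, 8))  # 4 rollouts, 6 steps, 8-dim features
    feats[:, 3] += rng.normal(scale=1.0, size=(4, 8))  # rollouts disagree at step 3
    print("focal window:", focal_window(feats))
    print("agreeing rewards:", combined_reward(0.8, 0.8))
    print("disagreeing rewards:", combined_reward(0.8, 0.2))
```

When the global and step-focal rewards agree, the penalty term vanishes and the combined reward equals R_g. A trajectory that ends well but scores poorly on its critical window is pulled down in proportion to the gap.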
Sample Outputs
Sample output panels (images not shown): Tie a Knot, Hang Clothes, Draw.
Training Data
UniVR is trained on VR-X, a large-scale benchmark curated from 1.5M raw samples across 16 diverse sources:
| Category | Sources | Examples |
|---|---|---|
| Visual Guidance | EgoDex, Action100M, Epic-Kitchen, VideoCraftBench | Cooking, handcrafting, daily activities |
| Robot Manipulation | AgiBot, Droid, Bridge, ZebraCoT-Robot | Robotic grasping, tool use, multi-step control |
| Editing | ZebraCoT-Multiobject | Object manipulation, scene editing |
| Spatial Perception | ThinkMorph-Navigation, ZebraCoT-Embodiment | Navigation, spatial reasoning |
| Visual Search | VisualCoT, ThinkMorph-Search | Object localization, attention |
| Puzzle & Game | VRBench, Zebra-Jigsaw, ThinkMorph-VisPuzzle | Mazes, jigsaw, visual puzzles |
Download: maverickrzw/VR-X-SFT-RL
Citation
@article{ren2026univr,
title={UniVR: Thinking in Visual Space for Unified Visual Reasoning},
author={Zhongwei Ren and Yunchao Wei and Zhao Yao and Guixun Luo and Yao Zhao and Weibo Gong and Xiao Liu and Anran Wang and Xiangtai Li and Xiaojie Jin},
year={2026},
}
License
This project is released under the Apache 2.0 License.
Acknowledgements
UniVR is built upon Emu3.5 and verl. We thank the authors for their excellent open-source contributions.