Libraries

How to use maverickrzw/UniVR-34B-General with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="maverickrzw/UniVR-34B-General")

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("maverickrzw/UniVR-34B-General", dtype="auto")

Local Apps

vLLM

How to use maverickrzw/UniVR-34B-General with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "maverickrzw/UniVR-34B-General"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "maverickrzw/UniVR-34B-General",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
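
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `send` are illustrative and not part of vLLM, and the base URL assumes the server started above is listening on localhost:8000 (use port 30000 for the SGLang commands further down).

```python
import json
import urllib.request

MODEL_ID = "maverickrzw/UniVR-34B-General"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `send(build_completion_request("Once upon a time,"))` should return the JSON body that the curl command prints.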

Use Docker

docker model run hf.co/maverickrzw/UniVR-34B-General

SGLang

How to use maverickrzw/UniVR-34B-General with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "maverickrzw/UniVR-34B-General" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "maverickrzw/UniVR-34B-General",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "maverickrzw/UniVR-34B-General" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "maverickrzw/UniVR-34B-General",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use maverickrzw/UniVR-34B-General with Docker Model Runner:
```
docker model run hf.co/maverickrzw/UniVR-34B-General
```

UniVR: Thinking in Visual Space for Unified Visual Reasoning

[Figure: UniVR Overview]

🌐 Project Page | 📄 Paper | 💻 Code | 📦 VR-X Dataset

Model Summary

UniVR is the first framework that simultaneously learns complex reasoning, fine-grained physical dynamics, and long-term planning from pure visual demonstrations — without relying on dense image-text pairs or task-specific heuristics.

Built on Emu3.5 (34B), UniVR uses a unified next-token prediction objective to directly generate visual reasoning traces given an image and instruction. Training employs a two-stage pipeline: supervised cold initialization on the VR-X dataset, followed by VR-GRPO reinforcement learning with complementary global and step-focal rewards.

Feature	Detail
Architecture	Emu3.5 34B (VQ-VAE unified generative model)
Training	SFT (310k samples) → VR-GRPO RL (3k samples)
Visual Thinking	Native visual-space reasoning, no intermediate text chain
Benchmark	VR-X: 16 sources, 6 task categories, 1.8k evaluation samples

Available Checkpoints

Model	Description	Link
UniVR-34B-Planning	Optimized for long-horizon planning tasks (robotic manipulation, tool use, multi-step control)	maverickrzw/UniVR-34B-Planning
UniVR-34B-General	Full UniVR recipe with interleaved image-text data; suitable for general visual reasoning	maverickrzw/UniVR-34B-General

Key Results

VR-X Benchmark

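UniVR improves on the Emu3.5 baseline by up to 25.2 points (Robot) and by 18.4 points overall. With only 34B parameters it narrows the gap to Gemini-3-pro + Nano Banana 2 (58.2 vs. 66.1 overall) and exceeds it on the Robot category (68.0 vs. 67.1).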

Method	Visual Thinking	Guidance	Robot	Editing	Spatial	Puzzle	Search	Overall↑
Gemini-3-pro + Nano Banana 2	✗	66.2	67.1	63.7	55.1	65.5	79.0	66.1
GPT-5 + GPT-image-1.5	✗	68.2	64.1	58.0	49.3	64.0	77.4	63.5
Emu3.5 34B	✗	38.6	42.8	32.7	35.3	43.4	46.2	39.8
UniVR 34B	✓	59.5	68.0	48.5	46.5	62.2	64.3	58.2
Δ vs. Emu3.5		↑20.9	↑25.2	↑15.8	↑11.2	↑18.8	↑18.1	↑18.4
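
The Δ row reports absolute differences in score points, not relative percentages. As a small worked check, the sketch below recomputes it from the two 34B rows of the table and also derives the relative gain for comparison; the function name `gains` is illustrative.

```python
# Scores copied from the VR-X table above.
EMU35 = {"Guidance": 38.6, "Robot": 42.8, "Editing": 32.7, "Spatial": 35.3,
         "Puzzle": 43.4, "Search": 46.2, "Overall": 39.8}
UNIVR = {"Guidance": 59.5, "Robot": 68.0, "Editing": 48.5, "Spatial": 46.5,
         "Puzzle": 62.2, "Search": 64.3, "Overall": 58.2}


def gains(baseline, model):
    """Return {category: (absolute gain in points, relative gain in percent)}."""
    result = {}
    for category, base_score in baseline.items():
        absolute = model[category] - base_score
        relative = 100.0 * absolute / base_score
        result[category] = (round(absolute, 1), round(relative, 1))
    return result


for category, (points, percent) in gains(EMU35, UNIVR).items():
    print(f"{category:<9} +{points} points  (+{percent}% relative)")
```

The absolute gains match the Δ row; the relative gains are larger because the baseline scores are well below 100.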

Multimodal Understanding

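Visual reasoning training also improves scores on standard multimodal understanding benchmarks: UniVR matches or exceeds the Emu3.5 base model on all six benchmarks below, so the base model's capabilities are not degraded.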

Method	MMMU	MME(P)	MME(C)	MMBench	MathVista	MM-Vet
Emu3.5	0.292	781.1	324.6	0.183	41.7	28.0
UniVR	0.337	799.3	338.5	0.198	44.0	35.6
Δ vs. Emu3.5	↑0.045	↑18.2	↑13.9	↑0.015	↑2.3	↑7.6

Quick Start

Installation

git clone https://github.com/MaverickRen/UniVR.git
cd UniVR
bash install.sh

Inference

cd UniVR_SFT

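# Download checkpoint (this example fetches the Planning checkpoint;
# use maverickrzw/UniVR-34B-General and a matching --local-dir for the General model)
huggingface-cli download maverickrzw/UniVR-34B-Planning --local-dir weights/UniVR-34B-Planning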

# Download VisionTokenizer
huggingface-cli download BAAI/Emu3.5-VisionTokenizer --local-dir weights/Emu3.5-VisionTokenizer

# Run inference
bash scripts/inference.sh
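
The two downloads can also be scripted from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; `snapshot_download` is that library's function, while `download_plan` and `download_all` are illustrative helpers, and the `weights/` layout simply mirrors the CLI commands above.

```python
from pathlib import Path

TOKENIZER_REPO = "BAAI/Emu3.5-VisionTokenizer"


def download_plan(model_repo="maverickrzw/UniVR-34B-General", root="weights"):
    """Return (repo_id, local_dir) pairs for the checkpoint and the vision tokenizer."""
    repos = [model_repo, TOKENIZER_REPO]
    # Each repo is stored under <root>/<repo name without the owner prefix>.
    return [(repo, str(Path(root) / repo.split("/")[-1])) for repo in repos]


def download_all(plan):
    """Download every repo in the plan (large download; needs network access)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    for repo_id, local_dir in plan:
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all(download_plan())` fetches the General checkpoint; pass `model_repo="maverickrzw/UniVR-34B-Planning"` for the Planning one.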

Before running inference, edit configs/config.py to set the model paths and the prompt entries, for example:

{
    "prompt": "Tie the red rope around the white gift box. Finish this task in 3 steps.",
    "reference_image": "path/to/first_frame.jpg",
}

Training

SFT (Cold Initialization):

cd UniVR_SFT
# LoRA (2 nodes × 8 GPUs)
bash scripts/train_sft_lora.sh
# Full parameter (4 nodes × 8 GPUs)
bash scripts/train_sft_full.sh

RL (VR-GRPO):

cd UniVR_RL
bash examples/emu3_grpo_lora.sh

Method: VR-GRPO

UniVR proposes VR-GRPO (Visual Reasoning GRPO), a reinforcement learning paradigm that combines:

Global Reward (R_g): A VLM evaluator assesses overall task completion and visual quality via pairwise comparison.
Step-Focal Reward (R_s): Identifies the most error-prone sub-steps by computing inter-trajectory CLIP-feature variance across rollout samples, then applies fine-grained VLM evaluation on critical windows.
Combined Reward: R_reason = R_g − λ|R_g − R_s|, enforcing both terminal correctness and procedural integrity.

This design prevents reward hacking in long-horizon tasks where global-only rewards overlook intermediate physical violations and logical gaps.
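
The reward combination described above can be sketched in a few lines. This is an illustrative approximation, not the repository's implementation: plain Python lists stand in for CLIP features, the VLM evaluators are assumed to have already produced the scalars `r_global` and `r_step`, and the value `lam=0.5` and the window width are hypothetical.

```python
from statistics import pvariance


def combined_reward(r_global, r_step, lam=0.5):
    """R_reason = R_g - lam * |R_g - R_s| (lam=0.5 is a hypothetical setting)."""
    return r_global - lam * abs(r_global - r_step)


def step_variances(features):
    """Inter-trajectory variance per step.

    features[k][t] is the feature vector (list of floats) of rollout k at step t.
    Returns one value per step: the variance across rollouts, summed over dims.
    """
    num_steps = len(features[0])
    dim = len(features[0][0])
    variances = []
    for t in range(num_steps):
        total = 0.0
        for d in range(dim):
            total += pvariance([rollout[t][d] for rollout in features])
        variances.append(total)
    return variances


def critical_window(variances, width=2):
    """Return (start, end) of the contiguous window with the largest total variance."""
    starts = range(len(variances) - width + 1)
    best = max(starts, key=lambda s: sum(variances[s:s + width]))
    return best, best + width


# Three rollouts, four steps, 2-d features; the rollouts disagree only at step 2.
rollouts = [
    [[0.1, 0.1], [0.2, 0.2], [0.9, 0.0], [0.4, 0.4]],
    [[0.1, 0.1], [0.2, 0.2], [0.0, 0.9], [0.4, 0.4]],
    [[0.1, 0.1], [0.2, 0.2], [0.5, 0.5], [0.4, 0.4]],
]
window = critical_window(step_variances(rollouts), width=1)
print("critical window:", window)
print("reward:", combined_reward(r_global=0.9, r_step=0.3))
```

When the global and step-focal scores agree, the penalty term vanishes and the reward equals R_g; a rollout that looks correct at the end but scores poorly inside the critical window is penalized in proportion to the disagreement.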

Sample Outputs

[Sample output visualizations: Tie a Knot, Hang Clothes, Draw]

Training Data

UniVR is trained on VR-X, a large-scale dataset and benchmark curated from 1.5M raw samples across 16 diverse sources in 6 task categories:

Category	Sources	Examples
Visual Guidance	EgoDex, Action100M, Epic-Kitchen, VideoCraftBench	Cooking, handcrafting, daily activities
Robot Manipulation	AgiBot, Droid, Bridge, ZebraCoT-Robot	Robotic grasping, tool use, multi-step control
Editing	ZebraCoT-Multiobject	Object manipulation, scene editing
Spatial Perception	ThinkMorph-Navigation, ZebraCoT-Embodiment	Navigation, spatial reasoning
Visual Search	VisualCoT, ThinkMorph-Search	Object localization, attention
Puzzle & Game	VRBench, Zebra-Jigsaw, ThinkMorph-VisPuzzle	Mazes, jigsaw, visual puzzles

Download: maverickrzw/VR-X-SFT-RL

Citation

@article{ren2026univr,
  title={UniVR: Thinking in Visual Space for Unified Visual Reasoning},
  author={Zhongwei Ren and Yunchao Wei and Zhao Yao and Guixun Luo and Yao Zhao and Weibo Gong and Xiao Liu and Anran Wang and Xiangtai Li and Xiaojie Jin},
  year={2026},
}

License

This project is released under the Apache 2.0 License.

Acknowledgements

UniVR is built upon Emu3.5 and verl. We thank the authors for their excellent open-source contributions.

Safetensors

Model size: 34B params
Tensor type: BF16

Model tree for maverickrzw/UniVR-34B-General

Base model: BAAI/Emu3.5
Finetuned: this model