C2R model weights are released for non-commercial research and educational use only under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
By requesting access, you agree that you will use the model weights only for non-commercial research or educational purposes, will not use them for any commercial product or service, will not redistribute the original or modified weights, and will provide proper attribution when using this work.

C2R: Coarse-to-Real

Gonzalo Gomez-Nogales¹, Yicong Hong², Chongjian Ge², Peiye Zhuang³, Dan Casas¹, Yi Zhou³

¹Universidad Rey Juan Carlos ²Adobe Research ³Roblox

Model Summary

C2R (Coarse-to-Real) is a generative rendering framework that synthesizes realistic urban crowd videos from coarse 3D simulation videos. Given a text prompt and a coarse control video, C2R generates realistic videos while preserving the input scene layout, camera motion, and human trajectories.

The model is designed for controllable video generation from minimal 3D input. It uses a two-stage synthetic-real domain-hedging strategy: first learning a strong video generative prior from large-scale real footage, then introducing controllability through a small amount of paired synthetic coarse-to-fine data.

This Hugging Face repository contains the released C2R 14B model weights, including:

C2R DiT backbone checkpoint
C2R DINO adapter checkpoint

The inference code is available in the GitHub repository:

git clone https://github.com/GonzaloGNogales/coarse2real.git

Model Details

Model name: C2R: Coarse-to-Real
Task: Controllable video generation / generative rendering
Input: Text prompt + coarse 3D control video
Output: Realistic generated video
Backbone: Wan2.1 14B
Control features: DINOv3-based video features
Release type: Inference-only
License for weights: CC BY-NC-ND 4.0
Access: Gated access required

Repository Files

This model repository provides the C2R-specific checkpoints:

c2r-dit-backbone-14B.safetensors
c2r-dino-adapter.safetensors

The Wan2.1 14B base model is required separately and should be downloaded from:

Wan-AI/Wan2.1-T2V-14B

C2R uses the Wan2.1 14B base folder for the text encoder, VAE, and tokenizer assets.

Installation

Please use the official C2R inference codebase:

git clone https://github.com/GonzaloGNogales/coarse2real.git
cd coarse2real

conda env create -f c2r-setup.yml
conda activate coarse2real

The default environment includes the recommended runtime dependencies for inference.

Download Weights

First, download the Wan2.1 14B base weights:

mkdir -p models/wan
hf download Wan-AI/Wan2.1-T2V-14B \
  --local-dir models/wan

Expected Wan2.1 files include:

models/wan/models_t5_umt5-xxl-enc-bf16.pth
models/wan/Wan2.1_VAE.pth
models/wan/google/umt5-xxl/...

Then download the C2R DiT backbone:

mkdir -p models/pretrained_dit_backbone
hf download gonsaBRK/coarse2real c2r-dit-backbone-14B.safetensors \
  --local-dir models/pretrained_dit_backbone

Download the C2R DINO adapter:

mkdir -p models/dino_adapter
hf download gonsaBRK/coarse2real c2r-dino-adapter.safetensors \
  --local-dir models/dino_adapter
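
The same three downloads can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that an authenticated session is available for the gated C2R repository (for example a token in the `HF_TOKEN` environment variable). The `Download` class, the `DOWNLOADS` list, and the `fetch` helper are illustrative names, not part of the C2R codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Download:
    """One download target (hypothetical helper type, not from the C2R repo)."""
    repo_id: str
    local_dir: str
    filename: Optional[str] = None  # None means "fetch the whole repository"


# Mirrors the three `hf download` commands above.
DOWNLOADS: List[Download] = [
    Download("Wan-AI/Wan2.1-T2V-14B", "models/wan"),
    Download("gonsaBRK/coarse2real", "models/pretrained_dit_backbone",
             "c2r-dit-backbone-14B.safetensors"),
    Download("gonsaBRK/coarse2real", "models/dino_adapter",
             "c2r-dino-adapter.safetensors"),
]


def fetch(item: Download) -> str:
    """Download one entry and return the local path reported by the hub client."""
    # Imported lazily so the plan can be inspected without the package installed.
    from huggingface_hub import hf_hub_download, snapshot_download

    if item.filename is None:
        return snapshot_download(repo_id=item.repo_id, local_dir=item.local_dir)
    return hf_hub_download(repo_id=item.repo_id, filename=item.filename,
                           local_dir=item.local_dir)
```

Calling `fetch(item)` for each entry in `DOWNLOADS` should leave the files in the same folders as the shell commands. Nothing is downloaded until `fetch` is called.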

C2R also uses the DINOv3 backbone facebook/dinov3-vitb16-pretrain-lvd1689m for control-video features. For offline or cluster inference, download it locally:

mkdir -p models/dino/dinov3-vitb16-pretrain-lvd1689m
hf download facebook/dinov3-vitb16-pretrain-lvd1689m \
  --local-dir models/dino/dinov3-vitb16-pretrain-lvd1689m

Then set the local path in the inference config:

"dino_model_path": "models/dino/dinov3-vitb16-pretrain-lvd1689m"

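If you prefer to script the config change rather than edit the file by hand, a small helper along these lines may work. This is a sketch under the assumption that the inference config is a plain JSON object and that `dino_model_path` is a top-level key; `set_config_value` is a hypothetical helper, not part of the C2R codebase.

```python
import json
from pathlib import Path
from typing import Any


def set_config_value(config_path: str, key: str, value: Any) -> dict:
    """Set one top-level key in a JSON config file and return the updated dict."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(config, dict):
        raise ValueError(f"{config_path} does not contain a JSON object")
    config[key] = value
    # Rewrite the file in place with stable indentation.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_config_value("inference/config_1gpu.json", "dino_model_path", "models/dino/dinov3-vitb16-pretrain-lvd1689m")` would point the single-GPU config at the local DINOv3 folder. Check the actual config layout in the repository first, since the key may be nested.
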
Usage

C2R requires:

A text prompt
A coarse 3D control video
The C2R DiT backbone checkpoint
The C2R DINO adapter checkpoint
The Wan2.1 14B base model assets
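
Before launching inference, it can help to confirm that the downloaded assets are where the commands in this card place them. The sketch below assumes the default folder layout from the Download Weights section and that it is run from the root of the cloned repository; `REQUIRED_PATHS` and `missing_assets` are illustrative names, not part of the C2R codebase.

```python
from pathlib import Path
from typing import List

# Paths taken from the download commands in this card, relative to the repo root.
REQUIRED_PATHS = [
    "models/wan/models_t5_umt5-xxl-enc-bf16.pth",
    "models/wan/Wan2.1_VAE.pth",
    "models/wan/google/umt5-xxl",  # tokenizer folder; its contents are not checked
    "models/pretrained_dit_backbone/c2r-dit-backbone-14B.safetensors",
    "models/dino_adapter/c2r-dino-adapter.safetensors",
]


def missing_assets(root: str = ".") -> List[str]:
    """Return the required paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in REQUIRED_PATHS if not (base / p).exists()]
```

An empty list from `missing_assets()` means every expected file and folder was found. It does not verify file integrity.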

Prompts are read from:

inference/c2r-prompts.txt

Control videos are read from:

inference/control_videos

Supported control video extensions:

.mp4 .mov .mkv .avi .webm .m4v
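
To see which inputs a run would pick up, the two locations above can be inspected with a short script. This is a sketch only: matching extensions case-insensitively and treating each non-empty line of the prompt file as one prompt are assumptions, and this card does not state how prompts are matched to control videos. The function names are illustrative.

```python
from pathlib import Path
from typing import List

# Extensions listed as supported in this card.
SUPPORTED_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm", ".m4v"}


def list_control_videos(folder: str = "inference/control_videos") -> List[Path]:
    """Return files in `folder` with a supported extension, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def read_prompts(path: str = "inference/c2r-prompts.txt") -> List[str]:
    """Return the non-empty lines of the prompt file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```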

Run Inference

Single GPU:

bash inference/launch_1gpu.sh

USP (unified sequence parallel) multi-GPU, for splitting one generation across multiple GPUs:

bash inference/launch_multigpu_usp.sh

DP (data parallel) multi-GPU, for generating many results in parallel:

bash inference/launch_multigpu_dp.sh
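
The difference between the two multi-GPU modes can be pictured with a small sketch. In the DP mode each GPU works on its own subset of the jobs, whereas in the USP mode all GPUs cooperate on the same job. The round-robin assignment below is an assumption chosen for illustration; it is not taken from the C2R launch scripts, and `shard_round_robin` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_round_robin(jobs: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the jobs a given rank would process under round-robin data parallelism."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Rank r takes jobs r, r + world_size, r + 2 * world_size, ...
    return list(jobs[rank::world_size])
```

With 10 jobs and 4 GPUs, rank 1 would receive jobs 1, 5, and 9 under this scheme, so total throughput scales with the number of GPUs while each individual generation still has to fit on one device.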

You can also run a config directly:

python -m inference.run_inference --config inference/config_1gpu.json

or with torchrun:

torchrun --standalone --nproc_per_node=8 -m inference.run_inference \
  --config inference/config_multigpu_usp.json

Gradio Demo

The GitHub codebase also includes a local Gradio demo:

bash inference/launch_gradio.sh

By default, the demo binds to:

127.0.0.1:7860

For remote cluster usage, open an SSH tunnel from your local machine:

ssh -L 7860:127.0.0.1:7860 your_user@cluster-login-host

Then open:

http://127.0.0.1:7860
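
If the page does not load, a quick way to tell whether the tunnel or the demo is the problem is to test the local port directly. The sketch below uses only the Python standard library; `is_port_open` is an illustrative helper name, and the default host and port are the ones quoted above.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on your local machine, a `False` result suggests the SSH tunnel is not active. Run on the cluster node, it suggests the Gradio demo is not listening.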

Prompt Enhancement

C2R supports optional prompt enhancement:

"prompt_enhancement_mode": "enhanced"

This mode uses Qwen3 VLM/LLM models to describe the control video and fuse that information with the user prompt before generation. It may improve generation quality, but adds preprocessing time.

For fastest inference, use:

"prompt_enhancement_mode": "off"

Intended Use

This model is intended for:

Non-commercial research
Academic evaluation
Generative rendering research
Controllable video generation research
Computer graphics and simulation research
Testing coarse-to-real video synthesis from 3D simulation inputs

Out-of-Scope Use

The model weights are not intended for:

Commercial use
Redistribution of modified versions
Production deployment without additional validation
Generating misleading, harmful, or deceptive media
Use cases that violate the license terms of this model or any upstream dependency

Limitations

This is an inference-only research release. The generated videos may contain visual artifacts, temporal inconsistencies, inaccurate fine details, or deviations from the input prompt. Performance may vary depending on the quality, structure, and domain of the coarse control video.

The model is optimized for coarse 3D simulation videos of populated urban scenes. Results outside this domain may be less reliable.

License

The model weights in this repository are released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

Allowed:

Use for non-commercial research and education
Sharing the original work with proper attribution

Not allowed:

Commercial use
Redistribution of modified versions of the model weights

The inference code is released separately under the PolyForm Noncommercial License 1.0.0 in the GitHub repository.

Third-party dependencies and base models are subject to their own licenses.

Citation

If you use this work in academic research, please cite:

@misc{gomeznogales2026coarsetoreal,
  title         = {Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes},
  author        = {Gomez-Nogales, Gonzalo and Hong, Yicong and Ge, Chongjian and Zhuang, Peiye and Comino-Trinidad, Marc and Casas, Dan and Zhou, Yi},
  year          = {2026},
  eprint        = {2601.22301},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2601.22301},
  url           = {https://arxiv.org/abs/2601.22301}
}

Contact

For questions or collaborations, please contact:

Gonzalo Gomez-Nogales: gonzalo.gomez@urjc.es
Yi Zhou: yizhou@roblox.com, zhouyisjtu2012@gmail.com
