metadata
license: mit
language:
- en
tags:
- 3d-scene-generation
- indoor-scene
- vision-language
- reinforcement-learning
base_model: Qwen/Qwen2.5-VL-7B-Instruct
gated: auto
extra_gated_prompt: >-
By requesting access to SceneReVis-7B, you agree to the following terms: 1.
You will use this model only for academic research purposes. 2. You will not
redistribute the model weights without permission. 3. You will cite our paper
in any published work that uses this model.
extra_gated_fields:
Name: text
Affiliation: text
I want to use this model for:
type: select
options:
- Academic Research
- Education
- label: Commercial Use
value: commercial
- label: Other
value: other
I agree to use this model for non-commercial research only: checkbox
extra_gated_heading: Request access to SceneReVis-7B
extra_gated_description: >-
Please fill out the form below. Access will be granted automatically after
submission.
extra_gated_button_content: Submit & Get Access
SceneReVis-7B
SceneReVis-7B is a vision-language model fine-tuned for iterative 3D indoor scene generation and editing.
Model Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Training: SFT on SceneChain-12K + GRPO reinforcement learning with voxel-based physics rewards
- Architecture: Vision-Language Model with tool-calling capabilities
Usage
See the SceneReVis repository for inference instructions.
Citation
@article{zhao2026scenerevis,
title={SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL},
author={Yang Zhao and Shizhao Sun and Meisheng Zhang and Yingdong Shi and Xubo Yang and Jiang Bian},
journal={arXiv preprint arXiv:2602.09432},
year={2026}
}