Abstract
Map2World enables 3D world generation from user-defined segment maps with improved scale consistency and detail enhancement through a pipeline leveraging asset generator priors.
3D world generation is essential for applications such as immersive content creation and autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained to grid layouts and suffer from inconsistent object scale across the generated world. In this work, we introduce Map2World, a novel framework that, for the first time, enables 3D world generation conditioned on user-defined segment maps of arbitrary shape and scale, ensuring global scale consistency and flexibility across expansive environments. To further enhance quality, we propose a detail enhancer network that generates fine details of the world. By incorporating global structure information, the detail enhancer adds fine-grained details without compromising overall scene coherence. We design the entire pipeline to leverage strong priors from asset generators, achieving robust generalization across diverse domains, even with limited scene-generation training data. Extensive experiments demonstrate that our method significantly outperforms existing approaches in user controllability, scale consistency, and content coherence, enabling users to generate 3D worlds under more complex conditions.
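The abstract suggests a coarse-to-fine flow: per-segment asset synthesis at absolute scale, followed by globally conditioned refinement. The sketch below is a hypothetical reading of that pipeline; none of the names (`Segment`, `asset_generator.sample`, `detail_enhancer.refine`) come from the paper, and they only illustrate the structure the abstract implies.

```python
# Hypothetical sketch of the pipeline described above; all names are
# illustrative assumptions, not the paper's actual API.

from dataclasses import dataclass

@dataclass
class Segment:
    label: str       # semantic class, e.g. "road" or "building"
    polygon: list    # arbitrary-shape footprint in world coordinates
    scale_m: float   # absolute extent in meters, fixing global scale

def generate_world(segments, asset_generator, detail_enhancer):
    """Segment-conditioned, coarse-to-fine world generation."""
    coarse_scene = []
    for seg in segments:
        # Draw on the asset generator's priors per segment, placed at the
        # segment's absolute scale rather than on a fixed grid.
        coarse_scene.append(
            asset_generator.sample(label=seg.label,
                                   footprint=seg.polygon,
                                   scale=seg.scale_m))
    # The detail enhancer refines local geometry and texture while
    # conditioning on the global layout, keeping added detail coherent.
    return detail_enhancer.refine(coarse_scene, global_layout=segments)
```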
Community
The latent fusion across overlapping diffusion windows to preserve global consistency while conditioning on arbitrary segment maps is the part that sticks with me. I'd like to see an ablation where you disable the flow-transformed priors or switch to non-overlapping tiles, to quantify how much the coherence relies on the window choreography. The arxivlens breakdown helped me parse the method details, especially how the latent priors and the detail enhancer interplay. My worry is how sensitive the global coherence is to per-segment prompt quality: if prompts drift, does the detail enhancer still hold the scene together?
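To make the mechanism the comment refers to concrete, here is a minimal sketch of latent fusion across overlapping windows. This is a generic tiled-diffusion blending scheme under illustrative assumptions (feathered weight mask, 32x32 windows, stride 16), not the paper's verified implementation.

```python
# Minimal sketch: blend per-window latents into one canvas so that
# overlapping regions average smoothly instead of showing seams.

import numpy as np

def fuse_window_latents(latents, positions, canvas_hw, window_hw):
    """Weighted blend of per-window latents with a feathered mask."""
    H, W = canvas_hw
    h, w = window_hw
    C = latents[0].shape[0]
    canvas = np.zeros((C, H, W))
    weight = np.zeros((1, H, W))

    # Feathered mask: weights taper toward window edges, so pixels covered
    # by several windows favor the window whose center is closest.
    wy = np.minimum(np.arange(h) + 1, h - np.arange(h))[:, None]
    wx = np.minimum(np.arange(w) + 1, w - np.arange(w))[None, :]
    mask = (wy * wx).astype(float)[None]

    for lat, (y, x) in zip(latents, positions):
        canvas[:, y:y + h, x:x + w] += lat * mask
        weight[:, y:y + h, x:x + w] += mask

    return canvas / np.maximum(weight, 1e-8)

# Example: four 32x32 windows with stride 16 tiling a 48x48 latent canvas.
rng = np.random.default_rng(0)
lats = [rng.standard_normal((4, 32, 32)) for _ in range(4)]
pos = [(0, 0), (0, 16), (16, 0), (16, 16)]
fused = fuse_window_latents(lats, pos, canvas_hw=(48, 48), window_hw=(32, 32))
print(fused.shape)  # (4, 48, 48)
```

Switching to non-overlapping tiles, as the ablation the comment proposes, amounts to a stride equal to the window size, which removes the blended overlap regions entirely.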
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Extend3D: Town-Scale 3D Generation (2026)
- SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation (2026)
- Text-Image Conditioned 3D Generation (2026)
- Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation (2026)
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors (2026)
- Seen2Scene: Completing Realistic 3D Scenes with Visibility-Guided Flow (2026)
- FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow (2026)