Tags: Image-to-Text · Transformers · PyTorch · English · multimodal_jepa_world · multimodal · jepa · world-model · qwen3 · vision
How to use burnboom/Qwen3_world_model_test with Transformers:
```python
# Use a pipeline as a high-level helper
# Warning: the "image-to-text" pipeline type is no longer supported in
# transformers v5. Load the model directly (see below) or downgrade to
# v4.x with: pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="burnboom/Qwen3_world_model_test")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("burnboom/Qwen3_world_model_test", dtype="auto")
```
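A minimal invocation sketch, assuming transformers v4.x (where the `image-to-text` pipeline type still exists). The image URL is a placeholder, and since the predictors are untrained (see Status below), outputs are not yet meaningful:

```python
# Hypothetical usage of the pipeline built above (transformers v4.x only).
# The image URL is a placeholder, not a real asset.
result = pipe("https://example.com/frame.jpg")
print(result)  # image-to-text pipelines return e.g. [{"generated_text": "..."}]
```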
Qwen3-VL JEPA World Model
This is a multimodal world-model architecture based on the Joint-Embedding Predictive Architecture (JEPA). It fuses the reasoning capabilities of Qwen3-VL-4B-Thinking with the visual latent space of the Stable Diffusion VAE.
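For orientation, the sketch below shows the visual half of that fusion in isolation, not this repository's actual wiring: the Stable Diffusion VAE defines the latent space the world model predicts in. It assumes the `diffusers` and `torch` packages, and the input image is a dummy placeholder.

```python
import torch
from diffusers import AutoencoderKL

# Visual latent space: the Stable Diffusion v1.5 VAE maps an RGB image
# (B, 3, H, W) scaled to [-1, 1] into a compact (B, 4, H/8, W/8) latent.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)
image = torch.rand(1, 3, 512, 512) * 2 - 1  # dummy image in [-1, 1]
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample()
print(latent.shape)  # torch.Size([1, 4, 64, 64])
```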
🧠 Architecture
- Thinking Engine: Qwen/Qwen3-VL-4B-Thinking
- Visual Perception: runwayml/stable-diffusion-v1-5 (VAE)
- World Modeling: designed to predict the next latent state of a scene (see the predictor sketch after this list).
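The repository does not document the predictor's internals; the following is a minimal JEPA-style sketch of what a next-latent predictor over these components could look like. The class name, layer sizes, and the 2560-dim conditioning vector (assumed to come from the Qwen3-VL hidden states) are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Hypothetical JEPA-style predictor: given the current VAE latent and a
    conditioning vector from the language model, predict the next latent."""

    def __init__(self, latent_channels: int = 4, cond_dim: int = 2560, hidden: int = 128):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)   # LM embedding -> channel bias
        self.encode = nn.Conv2d(latent_channels, hidden, 3, padding=1)
        self.decode = nn.Conv2d(hidden, latent_channels, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, latent: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latent: (B, 4, h, w) VAE latent; cond: (B, cond_dim) from the LM.
        h = self.act(self.encode(latent))
        h = h + self.cond_proj(cond)[:, :, None, None]  # FiLM-style conditioning shift
        return latent + self.decode(h)                  # residual next-state prediction

predictor = LatentPredictor()  # randomly initialized, as in this repository
z_next = predictor(torch.randn(1, 4, 64, 64), torch.randn(1, 2560))
print(z_next.shape)  # torch.Size([1, 4, 64, 64])
```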
Status
This repository contains the structural fusion only: the predictors are randomly initialized and must be trained on sequential image data before the model can function as a world model.
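A hedged sketch of what that training could look like, reusing the hypothetical `vae` and `predictor` from the sketches above: encode consecutive frames with the frozen VAE and regress the predicted next latent. The optimizer, loss, and data layout here are assumptions, not the repository's training recipe.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(predictor.parameters(), lr=1e-4)

def train_step(frame_t, frame_t1, cond):
    # frame_t, frame_t1: consecutive (B, 3, H, W) frames scaled to [-1, 1];
    # cond: (B, 2560) conditioning vector (assumed to come from the LM).
    with torch.no_grad():  # keep the VAE frozen; only the predictor learns
        z_t = vae.encode(frame_t).latent_dist.sample()
        z_t1 = vae.encode(frame_t1).latent_dist.sample()
    loss = F.smooth_l1_loss(predictor(z_t, cond), z_t1)  # latent-space regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```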