Tags: Image-to-Text · Transformers · PyTorch · English · multimodal_jepa_world · multimodal · jepa · world-model · qwen3 · vision
How to use burnboom/Qwen3_world_model_test with Transformers:
```python
# Use a pipeline as a high-level helper
# Warning: the "image-to-text" pipeline type is no longer supported in
# transformers v5. Load the model directly (see below) or downgrade to
# v4.x with: pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="burnboom/Qwen3_world_model_test")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("burnboom/Qwen3_world_model_test", dtype="auto")
```
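A minimal invocation sketch, assuming transformers v4.x (where the `image-to-text` pipeline type still exists). The image URL is a placeholder, and since the predictors are untrained (see Status below), outputs are not yet meaningful:

```python
# Hypothetical usage of the pipeline built above (transformers v4.x only).
# The image URL is a placeholder, not a real asset.
result = pipe("https://example.com/frame.jpg")
print(result)  # image-to-text pipelines return e.g. [{"generated_text": "..."}]
```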
Qwen3-VL JEPA World Model
This is a multimodal world-model architecture based on the Joint-Embedding Predictive Architecture (JEPA). It fuses the reasoning capabilities of Qwen3-VL-4B-Thinking with the visual latent space of the Stable Diffusion VAE.
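For orientation, the sketch below shows the visual half of that fusion in isolation, not this repository's actual wiring: the Stable Diffusion VAE defines the latent space the world model predicts in. It assumes the `diffusers` and `torch` packages, and the input image is a dummy placeholder.

```python
import torch
from diffusers import AutoencoderKL

# Visual latent space: the Stable Diffusion v1.5 VAE maps an RGB image
# (B, 3, H, W) scaled to [-1, 1] into a compact (B, 4, H/8, W/8) latent.
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)
image = torch.rand(1, 3, 512, 512) * 2 - 1  # dummy image in [-1, 1]
with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample()
print(latent.shape)  # torch.Size([1, 4, 64, 64])
```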
🧠 Architecture
- Thinking Engine: Qwen/Qwen3-VL-4B-Thinking
- Visual Perception: runwayml/stable-diffusion-v1-5 (VAE)
- World Modeling: designed to predict the next latent state of a scene (see the predictor sketch after this list).
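The repository does not document the predictor's internals; the following is a minimal JEPA-style sketch of what a next-latent predictor over these components could look like. The class name, layer sizes, and the 2560-dim conditioning vector (assumed to come from the Qwen3-VL hidden states) are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Hypothetical JEPA-style predictor: given the current VAE latent and a
    conditioning vector from the language model, predict the next latent."""

    def __init__(self, latent_channels: int = 4, cond_dim: int = 2560, hidden: int = 128):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)   # LM embedding -> channel bias
        self.encode = nn.Conv2d(latent_channels, hidden, 3, padding=1)
        self.decode = nn.Conv2d(hidden, latent_channels, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, latent: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # latent: (B, 4, h, w) VAE latent; cond: (B, cond_dim) from the LM.
        h = self.act(self.encode(latent))
        h = h + self.cond_proj(cond)[:, :, None, None]  # FiLM-style conditioning shift
        return latent + self.decode(h)                  # residual next-state prediction

predictor = LatentPredictor()  # randomly initialized, as in this repository
z_next = predictor(torch.randn(1, 4, 64, 64), torch.randn(1, 2560))
print(z_next.shape)  # torch.Size([1, 4, 64, 64])
```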
Status
This repository contains the structural fusion only: the predictors are randomly initialized and must be trained on sequential image data before the model can function as a world model.
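A hedged sketch of what that training could look like, reusing the hypothetical `vae` and `predictor` from the sketches above: encode consecutive frames with the frozen VAE and regress the predicted next latent. The optimizer, loss, and data layout here are assumptions, not the repository's training recipe.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(predictor.parameters(), lr=1e-4)

def train_step(frame_t, frame_t1, cond):
    # frame_t, frame_t1: consecutive (B, 3, H, W) frames scaled to [-1, 1];
    # cond: (B, 2560) conditioning vector (assumed to come from the LM).
    with torch.no_grad():  # keep the VAE frozen; only the predictor learns
        z_t = vae.encode(frame_t).latent_dist.sample()
        z_t1 = vae.encode(frame_t1).latent_dist.sample()
    loss = F.smooth_l1_loss(predictor(z_t, cond), z_t1)  # latent-space regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```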