Update model card: add metadata, library tags, and repository links
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,15 +1,21 @@
 ---
 license: apache-2.0
 tags:
-
-
-
-
 ---

-# 4DThinker

-

 ## Model Structure
@@ -41,13 +47,13 @@ model/

 ## Special Tokens

-Three special tokens are added to the Qwen2.5-VL vocabulary:

 | Token | Description |
 |-------|-------------|
-| `<
-| `<
-| `<

 ## Usage
@@ -55,13 +61,25 @@ Three special tokens are added to the Qwen2.5-VL vocabulary:
 from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "
     torch_dtype="auto",
     device_map="auto"
 )
-processor = AutoProcessor.from_pretrained("
 ```

 ## License

-Apache License 2.0
@@ -1,15 +1,21 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: video-text-to-text
 tags:
+- 4DThinker
+- dynamic-spatial-reasoning
+- vision-language-model
+- latent-reasoning
 ---

+# 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

+[**Paper**](https://huggingface.co/papers/2605.05997) | [**Code**](https://github.com/zhangquanchen/4DThinker)
+
+4DThinker is a framework that enables Vision-Language Models (VLMs) to "think with 4D" through dynamic latent mental imagery, internally simulating how scenes evolve within the continuous hidden space. It addresses dynamic spatial reasoning from monocular video by grounding the model in dynamic visual semantics.
+
+This repository contains the trained model checkpoints from Qwen2.5-VL-3B for **4DThinker**.

 ## Model Structure
@@ -41,13 +47,13 @@ model/

 ## Special Tokens

+Three special tokens are added to the Qwen2.5-VL vocabulary to support latent imagery:

 | Token | Description |
 |-------|-------------|
+| `<|latent_pad|>` | Padding within latent sequences |
+| `<|latent_start|>` | Marks start of latent visual token block |
+| `<|latent_end|>` | Marks end of latent visual token block |

 ## Usage
@@ -55,13 +61,25 @@ Three special tokens are added to the Qwen2.5-VL vocabulary:
 from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

 model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "jankin123/4DThinker-3B",
+    subfolder="4drl",
     torch_dtype="auto",
     device_map="auto"
 )
+processor = AutoProcessor.from_pretrained("jankin123/4DThinker-3B", subfolder="4drl")
+```
+
+## Citation
+
+```bibtex
+@article{4dthinker,
+  title={4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding},
+  author={Zhang, Quanchen and others},
+  journal={arXiv preprint arXiv:2605.05997},
+  year={2026}
+}
 ```

 ## License

+Apache License 2.0
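
The Special Tokens table in this diff describes `<|latent_start|>` and `<|latent_end|>` as delimiters of a latent visual token block, with `<|latent_pad|>` as padding inside it. As a minimal sketch, assuming those markers survive decoding as literal strings (for example when special tokens are not skipped), the hypothetical helpers below remove latent blocks from decoded text and count the pads inside them. The helper names and the sample string are illustrative only; the model card does not confirm that the released checkpoints emit the markers this way.

```python
import re

# Marker strings copied from the model card's Special Tokens table.
LATENT_START = "<|latent_start|>"
LATENT_END = "<|latent_end|>"
LATENT_PAD = "<|latent_pad|>"

# Non-greedy match of one latent block, including both delimiters.
_LATENT_BLOCK = re.compile(
    re.escape(LATENT_START) + r".*?" + re.escape(LATENT_END),
    flags=re.DOTALL,
)


def strip_latent_blocks(text: str) -> str:
    """Remove latent blocks and stray pad markers from decoded text.

    Hypothetical helper: assumes the markers appear as literal strings.
    """
    without_blocks = _LATENT_BLOCK.sub("", text)
    without_pads = without_blocks.replace(LATENT_PAD, "")
    # Collapse the whitespace runs left behind by the removed spans.
    return " ".join(without_pads.split())


def count_latent_pads(text: str) -> int:
    """Count pad markers that sit inside delimited latent blocks."""
    return sum(block.count(LATENT_PAD) for block in _LATENT_BLOCK.findall(text))


# Made-up decoded output, used only to exercise the helpers.
decoded = (
    "The car moves left. "
    "<|latent_start|><|latent_pad|><|latent_pad|><|latent_end|> "
    "So the answer is B."
)
print(strip_latent_blocks(decoded))
print(count_latent_pads(decoded))
```

Keeping the marker strings as module-level constants makes it easy to adjust them if the tokenizer in the `4drl` subfolder spells them differently.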