Video-Text-to-Text
Transformers
Safetensors
English
llava_llama
multimodal
video-understanding
region-grounding
3d-reasoning
4d-reasoning
perceptual-distillation
nvila
vila
Instructions to use nvidia/4D-RGPT-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nvidia/4D-RGPT-8B with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/4D-RGPT-8B", dtype="auto")
- Notebooks
- Google Colab
- Kaggle
fix links
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Global
 Expected users are multimodal AI researchers, applied research teams, and developers studying video understanding, region grounding, 3D/4D reasoning, and physical AI. Representative use cases include region-level video question answering, model benchmarking, research on depth-and-time-aware MLLMs, and prototyping for domains such as robotics, autonomous driving, and industrial inspection.
 
 ### Release Date:
-Hugging Face [06/01/2026] via [https://huggingface.co/nvidia/4D-RGPT-8B]
+Hugging Face [06/01/2026] via [https://huggingface.co/nvidia/4D-RGPT-8B](https://huggingface.co/nvidia/4D-RGPT-8B).
 
 ## References(s):
 * Paper: https://arxiv.org/abs/2512.17012 <br>