How to use Neleac/SpaceTimeGPT with Transformers:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText

tokenizer = AutoTokenizer.from_pretrained("Neleac/SpaceTimeGPT")
model = AutoModelForImageTextToText.from_pretrained("Neleac/SpaceTimeGPT")
```
Update README.md
README.md
CHANGED
```diff
@@ -35,7 +35,7 @@ model-index:
 <p> (partial diagrams from <a href="https://arxiv.org/abs/2103.15691">1</a>, <a href="https://arxiv.org/abs/2102.05095">2</a>, <a href="https://arxiv.org/abs/1706.03762">3</a>) </p>
 </div>
 
-SpaceTimeGPT is a video description generation model capable of
+SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning. Given a video, eight frames are sampled and analyzed by the model. The output is a sentence description of the events that occurred in the video, generated using autoregression.
 
 ## Architecture and Training
 Vision Encoder: [timesformer-base-finetuned-k600](https://huggingface.co/facebook/timesformer-base-finetuned-k600) \
```
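The updated description says eight frames are sampled from each video, but it does not say how they are chosen. As a minimal sketch, assuming uniform sampling (one frame from the midpoint of each of eight equal segments), the index selection could look like the following. The function name `sample_frame_indices` and the midpoint strategy are illustrative assumptions, not taken from the model's code.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Return `num_samples` evenly spaced frame indices in [0, total_frames).

    Hypothetical helper: the README only states that eight frames are sampled;
    uniform midpoint sampling is an assumption made for illustration.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")

    # Split the clip into equal-length segments and take each segment's midpoint.
    segment = total_frames / num_samples
    return [
        min(total_frames - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]


if __name__ == "__main__":
    # An 80-frame clip yields one index from the middle of each 10-frame segment.
    print(sample_frame_indices(80))
```

For clips shorter than eight frames, this sketch repeats indices rather than failing, so the output always has the requested length.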