Neleac/SpaceTimeGPT

Video-Text-to-Text

vision-encoder-decoder

image-text-to-text

video-captioning


Neleac committed on Jan 21, 2025

Commit fe04cfc · verified · 1 Parent(s): 4fd7869

Update README.md

Files changed (1)

README.md +1 -1

@@ -35,7 +35,7 @@ model-index:
   <p> (partial diagrams from <a href="https://arxiv.org/abs/2103.15691">1</a>, <a href="https://arxiv.org/abs/2102.05095">2</a>, <a href="https://arxiv.org/abs/1706.03762">3</a>) </p>
 </div>
-SpaceTimeGPT is a video description generation model capable of both spatial and temporal reasoning. Given a video, eight frames are sampled and analyzed by the model. The output is a sentence description of the events that occured in the video, generated using autoregression.
+SpaceTimeGPT is a video description generation model capable of spatial and temporal reasoning. Given a video, eight frames are sampled and analyzed by the model. The output is a sentence description of the events that occured in the video, generated using autoregression.
 ## Architecture and Training
 Vision Encoder: [timesformer-base-finetuned-k600](https://huggingface.co/facebook/timesformer-base-finetuned-k600) \
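
The README text in this diff says that eight frames are sampled from the input video, but it does not say how. As a rough illustration only, the sketch below picks eight evenly spaced frame indices by taking the midpoint of each of eight equal segments. Uniform spacing and the helper name `sample_frame_indices` are assumptions made for this example, not something the model card confirms.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Return num_samples frame indices spread evenly across a clip.

    Hypothetical helper: uniform midpoint sampling is an assumption here,
    not the documented SpaceTimeGPT preprocessing.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")

    segment = total_frames / num_samples
    # Take the midpoint of each segment and clamp it to the last valid index.
    # Clips shorter than num_samples frames will repeat some indices.
    return [
        min(total_frames - 1, int((i + 0.5) * segment))
        for i in range(num_samples)
    ]


if __name__ == "__main__":
    print(sample_frame_indices(80))
```

The selected indices would then be used to read the corresponding frames with whatever video decoder is available before handing them to the vision encoder.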