Tarsier Model Card

Model details

Model type: Tarsier-34b is an open-source, large-scale video-language model designed to generate high-quality video descriptions and to provide strong general video understanding (SOTA results on 6 open benchmarks).

Model date: Tarsier-34b was trained in June 2024.

Paper or resources for more information:

License

NousResearch/Nous-Hermes-2-Yi-34B license.

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues

Intended use

Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

Tarsier uses a two-stage training strategy.

  • Stage-1: Multi-task Pre-training on 13M data
  • Stage-2: Multi-grained Instruction Tuning on 500K data

In both stages, we freeze the ViT and train all parameters of the projection layer and the LLM.
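The freezing scheme described above can be illustrated with a minimal sketch. This is not the Tarsier training code: the helper `freeze_by_prefix` and the parameter-name prefixes (`vision_tower.`, `projector.`, `language_model.`) are hypothetical names chosen for illustration. The helper works on any iterable of `(name, param)` pairs whose items expose a `requires_grad` attribute, which is the shape PyTorch's `model.named_parameters()` yields.

```python
from types import SimpleNamespace
from typing import Iterable, List, Sequence, Tuple


def freeze_by_prefix(
    named_params: Iterable[Tuple[str, object]],
    frozen_prefixes: Sequence[str],
) -> List[str]:
    """Freeze parameters whose name starts with any of `frozen_prefixes`.

    Every other parameter is marked trainable. Returns the names of the
    trainable parameters, in iteration order.
    """
    prefixes = tuple(frozen_prefixes)
    trainable = []
    for name, param in named_params:
        is_frozen = name.startswith(prefixes)
        param.requires_grad = not is_frozen
        if not is_frozen:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without a deep-learning framework.
# The names below are hypothetical, not Tarsier's actual module names.
demo_params = [
    ("vision_tower.blocks.0.weight", SimpleNamespace(requires_grad=True)),
    ("projector.linear.weight", SimpleNamespace(requires_grad=True)),
    ("language_model.layers.0.weight", SimpleNamespace(requires_grad=True)),
]

# Freeze the ViT; leave the projection layer and the LLM trainable.
trainable_names = freeze_by_prefix(demo_params, frozen_prefixes=("vision_tower.",))
print(trainable_names)
```

With a real PyTorch model, the same call would take `model.named_parameters()` as its first argument, with the prefixes adjusted to the model's actual module names.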

Evaluation dataset

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage

Model size: 35B params (Safetensors). Tensor type: F16.
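
As a rough sizing aid, the weight footprint follows from the parameter count and tensor type listed above, since F16 stores each parameter in 2 bytes. The sketch below does that arithmetic only. It assumes the rounded 35B figure and covers the weights alone, not activations or any inference-time caches, so actual memory needs will be higher than the roughly 70 GB it yields.

```python
# Back-of-the-envelope estimate of weight storage only.
PARAMS = 35e9          # "35B params" as listed in this card (rounded)
BYTES_PER_PARAM = 2    # F16 = 16 bits = 2 bytes per parameter

weight_bytes = PARAMS * BYTES_PER_PARAM
weight_gb = weight_bytes / 1e9       # decimal gigabytes
weight_gib = weight_bytes / 2**30    # binary gibibytes

print(f"~{weight_gb:.0f} GB (~{weight_gib:.1f} GiB) for the weights alone")
```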