Commit 958aa2a (parent: 1a75103): Update README.md

README.md (CHANGED)
pipeline_tag: visual-question-answering
---

# Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

This is the Hugging Face repo for storing pre-trained & fine-tuned checkpoints of our [Video-LLaMA](https://arxiv.org/abs/2306.02858), which is a multi-modal conversational large language model with video understanding capability.

## Vision-Language Branch
| Checkpoint | Link | Note |
|:------------|-------------|-------------|
| pretrain-vicuna7b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain_vicuna7b-v2.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
| pretrain-vicuna13b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-vicuna13b.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna13b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna13b-v2.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |
| pretrain-ziya13b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-ziya13b-zh.pth) | Pre-trained with the Chinese LLM [Ziya-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
| finetune-ziya13b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-ziya13b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
| pretrain-billa7b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain-billa7b-zh.pth) | Pre-trained with the Chinese LLM [BiLLA-7B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
| finetune-billa7b-zh | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-billa7b-zh.pth) | Fine-tuned on the machine-translated [VideoChat](https://github.com/OpenGVLab/Ask-Anything) instruction-following dataset (in Chinese) |
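Each `link` in the table resolves to a direct file URL following the Hub's `resolve/main` pattern. As an illustration (the helper below is ours, not part of the Video-LLaMA release), the same URLs can be rebuilt, or the files fetched, programmatically:

```python
# Illustrative helper (not part of this repo): build the direct-download URL
# for a checkpoint file, matching the `resolve/main` links in the table above.
REPO_ID = "DAMO-NLP-SG/Video-LLaMA-Series"

def resolve_url(filename: str) -> str:
    """Direct-download URL using the Hugging Face Hub's resolve/main scheme."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/{filename}"

url = resolve_url("finetune-vicuna7b-v2.pth")
print(url)  # https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth

# Equivalent cached download with the official client (requires `huggingface_hub`):
# from huggingface_hub import hf_hub_download
# local_path = hf_hub_download(repo_id=REPO_ID, filename="finetune-vicuna7b-v2.pth")
```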

## Audio-Language Branch
| Checkpoint | Link | Note |
|:------------|-------------|-------------|
| pretrain-vicuna7b | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/pretrain_vicuna7b_audiobranch.pth) | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
| finetune-vicuna7b-v2 | [link](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth) | Fine-tuned on the instruction-tuning data from [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA) and [VideoChat](https://github.com/OpenGVLab/Ask-Anything) |

## Usage
To launch the pre-trained Video-LLaMA on your own machine, please refer to our [GitHub repo](https://github.com/DAMO-NLP-SG/Video-LLaMA).
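Before wiring a downloaded checkpoint into the GitHub repo's configs, it can be useful to sanity-check what it contains. A minimal sketch, assuming the `.pth` files are ordinary `torch.save()` dictionaries (the nested `"model"` key is an assumption for illustration, not something this README documents):

```python
# Sketch only: list the parameter names stored in a .pth checkpoint.
# Assumption: the file is a regular torch.save() dict, possibly nested under "model".
import os
import tempfile

import torch

def checkpoint_keys(path: str) -> list[str]:
    """Return the sorted parameter names stored in a .pth checkpoint."""
    ckpt = torch.load(path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    return sorted(state_dict.keys())

# Self-contained demo on a dummy checkpoint; a real run would point `path`
# at e.g. a downloaded finetune-vicuna7b-v2.pth instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.pth")
    torch.save({"model": {"proj.weight": torch.zeros(2, 2)}}, path)
    names = checkpoint_keys(path)
print(names)  # ['proj.weight']
```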