Add link to project page and correct pipeline tag #2
opened by nielsr (HF Staff)

README.md CHANGED

@@ -1,15 +1,15 @@
 ---
-license: cc-by-nc-sa-4.0
+base_model:
+- lmms-lab/LLaVA-Video-7B-Qwen2
 datasets:
 - lmms-lab/LLaVA-Video-178K
 language:
 - en
+library_name: transformers
+license: cc-by-nc-sa-4.0
 metrics:
 - accuracy
-base_model:
-- lmms-lab/LLaVA-Video-7B-Qwen2
 pipeline_tag: video-text-to-text
-library_name: transformers
 tags:
 - Action
 - Video
@@ -81,12 +81,6 @@ model-index:
       value: 63.9
       name: accuracy
       verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: VideoMME (w-subs)
-      type: videomme
-    metrics:
     - type: accuracy
       value: 71.4
       name: accuracy
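Once the front matter is sorted and the duplicated model-index entry above is dropped, the card should round-trip through the `huggingface_hub` parser. A minimal check, sketched with a placeholder repo id (the PR itself does not name the hosting repo):

```python
# Minimal sketch: read back the corrected card metadata with huggingface_hub.
# The repo id below is a placeholder, not confirmed by this PR.
from huggingface_hub import ModelCard

card = ModelCard.load("AdaptiveMotorControlLab/LLaVAction-7B")  # placeholder repo id

print(card.data.pipeline_tag)  # expected: video-text-to-text
print(card.data.base_model)    # expected: ['lmms-lab/LLaVA-Video-7B-Qwen2']
print(card.data.license)       # expected: cc-by-nc-sa-4.0

# model-index entries parse into structured EvalResult objects
for res in card.data.eval_results or []:
    print(res.dataset_type, res.metric_type, res.metric_value, res.verified)
```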
@@ -109,7 +103,7 @@ model-index:
 
 <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author
 
-\[[arXiv Paper](arxiv.org/abs/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]
+\[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]
 
 </div>
 
@@ -118,7 +112,7 @@ The LLaVAction-7B model is trained on EPIC-KITCHENS-100-MQA, based on Qwen2 lang
 This model supports at most 64 frames.
 
 - **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/)
-- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/
+- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2503.18712)
 - **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
 - **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
 - **Languages**: English
@@ -185,7 +179,9 @@ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].c
 video = [video]
 conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
 time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
-question = DEFAULT_IMAGE_TOKEN + f"
+question = DEFAULT_IMAGE_TOKEN + f"
+{time_instruction}
+{perspective_prompt} {task_prompt}"
 
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
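The three lines added in the last hunk build the user prompt that the README later appends to the `qwen_1_5` conversation template. A standalone sketch of that step, with `DEFAULT_IMAGE_TOKEN` set to the `<image>` placeholder the LLaVA codebase uses, and hypothetical `perspective_prompt` / `task_prompt` values standing in for the README's surrounding definitions:

```python
# Standalone sketch of the question string the README builds before handing it
# to conv_templates; the values below are placeholders, not from this PR.
DEFAULT_IMAGE_TOKEN = "<image>"  # visual-input placeholder token in the LLaVA codebase

video_time = 12.34   # placeholder: clip duration in seconds
num_frames = 64      # the model supports at most 64 frames
perspective_prompt = "You are watching the video from an egocentric perspective."  # placeholder
task_prompt = "What action is being performed?"  # placeholder

time_instruction = (
    f"The video lasts for {video_time:.2f} seconds, and {num_frames} "
    "frames are uniformly sampled from it. "
)

# One way to write the hunk's three added lines as a single valid f-string:
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
print(question)
```

As the hunk's context lines show, the README then deep-copies `conv_templates["qwen_1_5"]` and appends `question` as the first user turn.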