Add link to project page and correct pipeline tag

#2
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +9 -13
README.md CHANGED
@@ -1,15 +1,15 @@
 ---
-license: cc-by-nc-sa-4.0
+base_model:
+- lmms-lab/LLaVA-Video-7B-Qwen2
 datasets:
 - lmms-lab/LLaVA-Video-178K
 language:
 - en
+library_name: transformers
+license: cc-by-nc-sa-4.0
 metrics:
 - accuracy
-base_model:
-- lmms-lab/LLaVA-Video-7B-Qwen2
 pipeline_tag: video-text-to-text
-library_name: transformers
 tags:
 - Action
 - Video
@@ -81,12 +81,6 @@ model-index:
       value: 63.9
       name: accuracy
       verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: VideoMME (w-subs)
-      type: videomme
-    metrics:
     - type: accuracy
       value: 71.4
       name: accuracy
@@ -109,7 +103,7 @@ model-index:
 
 <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author
 
-\[[arXiv Paper](arxiv.org/abs/2503.18712)\] &nbsp; \[[Project Page](https://mmathislab.github.io/llavaction/)\] &nbsp; \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] &nbsp;
+\[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] &nbsp; \[[Project Page](https://mmathislab.github.io/llavaction/)\] &nbsp; \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] &nbsp;
 
 </div>
 
@@ -118,7 +112,7 @@ The LLaVAction-7B model is trained on EPIC-KITCHENS-100-MQA, based on Qwen2 lang
 This model supports at most 64 frames.
 
 - **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/)
-- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/tbd)
+- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2503.18712)
 - **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
 - **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
 - **Languages**: English
@@ -185,7 +179,9 @@ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].c
 video = [video]
 conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
 time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
-question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
+question = DEFAULT_IMAGE_TOKEN + f"
+{time_instruction}
+{perspective_prompt} {task_prompt}"
 
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
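
For context, the `question` string assembled in the last hunk feeds a LLaVA-NeXT-style conversation template and generation call. The sketch below is not part of this diff: it assumes the standard LLaVA-NeXT helpers (`conv.get_prompt()`, `tokenizer_image_token`, `IMAGE_TOKEN_INDEX`, `modalities=["video"]`) and a `tokenizer`/`model` pair loaded earlier in the README example, and it assumes `question` comes from the single-line form with `\n` escapes, since a regular f-string cannot span physical lines in Python.

```python
# Sketch (assumed from the LLaVA-NeXT codebase, not part of this diff):
# how the `question`, `video`, and `conv_template` built in the README example
# are typically consumed for inference.
import copy

from llava.constants import IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)   # user turn: image token + time + task prompts
conv.append_message(conv.roles[1], None)       # empty assistant turn for the model to complete
prompt_question = conv.get_prompt()

# Replace the <image> placeholder with IMAGE_TOKEN_INDEX and tokenize the rest.
input_ids = tokenizer_image_token(
    prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

# Greedy decoding over the uniformly sampled video frames.
output_ids = model.generate(
    input_ids,
    images=video,              # list holding one (frames, C, H, W) tensor, as built above
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```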