Add link to project page and correct pipeline tag

#2
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +9 -13
README.md CHANGED
@@ -1,15 +1,15 @@
 ---
-license: cc-by-nc-sa-4.0
+base_model:
+- lmms-lab/LLaVA-Video-7B-Qwen2
 datasets:
 - lmms-lab/LLaVA-Video-178K
 language:
 - en
+library_name: transformers
+license: cc-by-nc-sa-4.0
 metrics:
 - accuracy
-base_model:
-- lmms-lab/LLaVA-Video-7B-Qwen2
 pipeline_tag: video-text-to-text
-library_name: transformers
 tags:
 - Action
 - Video
@@ -81,12 +81,6 @@ model-index:
       value: 63.9
       name: accuracy
       verified: true
-  - task:
-      type: multimodal
-    dataset:
-      name: VideoMME (w-subs)
-      type: videomme
-    metrics:
     - type: accuracy
       value: 71.4
       name: accuracy
@@ -109,7 +103,7 @@ model-index:
 
 <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author
 
-\[[arXiv Paper](arxiv.org/abs/2503.18712)\] &nbsp; \[[Project Page](https://mmathislab.github.io/llavaction/)\] &nbsp; \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] &nbsp;
+\[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] &nbsp; \[[Project Page](https://mmathislab.github.io/llavaction/)\] &nbsp; \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] &nbsp;
 
 </div>
 
@@ -118,7 +112,7 @@ The LLaVAction-7B model is trained on EPIC-KITCHENS-100-MQA, based on Qwen2 lang
 This model supports at most 64 frames.
 
 - **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/)
-- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/tbd)
+- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2503.18712)
 - **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
 - **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
 - **Languages**: English
@@ -185,7 +179,9 @@ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].c
 video = [video]
 conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
 time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
-question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
+question = DEFAULT_IMAGE_TOKEN + f"
+{time_instruction}
+{perspective_prompt} {task_prompt}"
 
 conv = copy.deepcopy(conv_templates[conv_template])
 conv.append_message(conv.roles[0], question)
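
For context, the `question` string assembled in the last hunk feeds a LLaVA-NeXT-style conversation template and generation call. The sketch below is not part of this diff: it assumes the standard LLaVA-NeXT helpers (`conv.get_prompt()`, `tokenizer_image_token`, `IMAGE_TOKEN_INDEX`, `modalities=["video"]`) and a `tokenizer`/`model` pair loaded earlier in the README example, and it assumes `question` comes from the single-line form with `\n` escapes, since a regular f-string cannot span physical lines in Python.

```python
# Sketch (assumed from the LLaVA-NeXT codebase, not part of this diff):
# how the `question`, `video`, and `conv_template` built in the README example
# are typically consumed for inference.
import copy

from llava.constants import IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)   # user turn: image token + time + task prompts
conv.append_message(conv.roles[1], None)       # empty assistant turn for the model to complete
prompt_question = conv.get_prompt()

# Replace the <image> placeholder with IMAGE_TOKEN_INDEX and tokenize the rest.
input_ids = tokenizer_image_token(
    prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

# Greedy decoding over the uniformly sampled video frames.
output_ids = model.generate(
    input_ids,
    images=video,              # list holding one (frames, C, H, W) tensor, as built above
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```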