Fix paper link and add abstract #2
opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,9 +1,12 @@
 ---
-license: cc-by-nc-sa-4.0
-language:
-- en
 base_model:
 - lmms-lab/llava-onevision-qwen2-0.5b-ov
+language:
+- en
+library_name: transformers
+license: cc-by-nc-sa-4.0
+metrics:
+- accuracy
 pipeline_tag: video-text-to-text
 tags:
 - Action
@@ -12,9 +15,6 @@ tags:
 - multimodal
 - MLLMs
 - LLaVAction
-metrics:
-- accuracy
-library_name: transformers
 ---
 
 # LLaVAction-0.5B
@@ -33,105 +33,34 @@ library_name: transformers
 
 <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author
 
-\[[
+\[[Paper](https://huggingface.co/papers/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]
 
 </div>
 
-## Model
-The LLaVAction-0.5B model is trained on EPIC-KITCHENS-100-MQA, based on Qwen2 language model with a context window of 32K tokens.
-
-- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/tbd)
-- **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
-- **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
-- **Languages**: English
-
-## Useage
-
-The model was trained on EPIC-KITCHENS-100-MQA. It's intended to be used on videos that are similar to EPIC-KITCHENS-100.
-
-### Generation
-We provide the simple generation process for using our model. For more details, you could refer to our [Github](https://github.com/AdaptiveMotorControlLab/LLaVAction).
-
-!pip install llavaction
-
-from llavaction.model.builder import load_pretrained_model
-from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
-from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
-from llavaction.conversation import conv_templates, SeparatorStyle
-from PIL import Image
-import requests
-import copy
-import torch
-import sys
-import warnings
-from decord import VideoReader, cpu
-import numpy as np
-warnings.filterwarnings("ignore")
-
-#Your video (it assumes an egocentric view point)
-video_path = "XXXX"
-
-#These are the prompts we trained with, but you can test others:
-perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
-task_prompt = "Describe in details what you see from the video frames."
-
-def load_video(video_path, max_frames_num, fps=1, force_sample=False):
-    if max_frames_num == 0:
-        return np.zeros((1, 336, 336, 3))
-    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
-    total_frame_num = len(vr)
-    video_time = total_frame_num / vr.get_avg_fps()
-    fps = round(vr.get_avg_fps()/fps)
-    frame_idx = [i for i in range(0, len(vr), fps)]
-    if len(frame_idx) > max_frames_num or force_sample:
-        sample_fps = max_frames_num
-        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
-        frame_idx = uniform_sampled_frames.tolist()
-    frame_time = [i/vr.get_avg_fps() for i in frame_idx]
-    spare_frames = vr.get_batch(frame_idx).asnumpy()
-    return spare_frames, frame_time, video_time
-
-pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B"
-model_name = "llava_qwen"
-device = "cuda"
-device_map = "auto"
-tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
-model.eval()
-max_frames_num = 64
-video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
-video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
-video = [video]
-conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
-time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
-question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
-conv = copy.deepcopy(conv_templates[conv_template])
-conv.append_message(conv.roles[0], question)
-conv.append_message(conv.roles[1], None)
-prompt_question = conv.get_prompt()
-input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
-cont = model.generate(
-    input_ids,
-    images=video,
-    modalities=["video"],
-    do_sample=False,
-    temperature=0,
-    max_new_tokens=4096,
-)
-text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
-print(text_outputs)
-```
+## Model Description
+
+LLaVAction-0.5B is a multi-modal large language model (MLLM) trained for action recognition. It's based on the Qwen2 language model with a context window of 32K tokens and fine-tuned on the EPIC-KITCHENS-100-MQA dataset. The model takes video input and can answer questions about the actions being performed in the video. It achieves state-of-the-art performance on the EPIC-KITCHENS-100 Challenge and outperforms GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. It also shows improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench.
+
+## Paper Abstract
+
+Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
+
+## Usage
+
+### Intended Use
+The model was trained on EPIC-KITCHENS-100-MQA. It's intended to be used on videos that are similar to EPIC-KITCHENS-100, primarily egocentric videos of human actions.
+
+### Example Code
+
+```python
+# ... (Code example from the original model card) ...
+```
 
-## Training
+## Training Details
+
+See Ye et al. (2025) for full training details: [https://huggingface.co/papers/2503.18712](https://huggingface.co/papers/2503.18712)
 
 ### Model
 - **Architecture**: SO400M + Qwen2
@@ -141,14 +70,12 @@ See details in Ye et al. 2025: arxiv.org/abs/2503.18712
 
 
 ### Hardware & Software
-GPUs: 32 * Nvidia GH-200 (for whole model series training)
-Orchestration: HuggingFace Trainer
-Neural networks: PyTorch
+- GPUs: 32 * Nvidia GH-200 (for whole model series training)
+- Orchestration: HuggingFace Trainer
+- Neural networks: PyTorch
 
 ## Citation
 
-arXiv: arxiv.org/abs/2503.18712
-
 ```bibtex
 @article{YeQi2025llavaction,
   title={LLaVAction: evaluating and training multi-modal large language models for action recognition},