Fix paper link and add abstract
This PR updates the model card by adding a link to the published paper and including the paper's abstract. It also removes redundant information from the model summary and improves the overall structure and conciseness of the model card based on the provided GitHub README.
README.md
CHANGED
@@ -1,9 +1,12 @@
 ---
-license: cc-by-nc-sa-4.0
-language:
-- en
 base_model:
 - lmms-lab/llava-onevision-qwen2-0.5b-ov
+language:
+- en
+library_name: transformers
+license: cc-by-nc-sa-4.0
+metrics:
+- accuracy
 pipeline_tag: video-text-to-text
 tags:
 - Action

@@ -12,9 +15,6 @@ tags:
 - multimodal
 - MLLMs
 - LLaVAction
-metrics:
-- accuracy
-library_name: transformers
 ---

 # LLaVAction-0.5B

@@ -33,105 +33,34 @@

 <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author

-\[[
+\[[Paper](https://huggingface.co/papers/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]

 </div>

-## Model
-The LLaVAction-0.5B model is trained on EPIC-KITCHENS-100-MQA, based on Qwen2 language model with a context window of 32K tokens.
-
-- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/tbd)
-- **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
-- **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
-- **Languages**: English
-
-## Useage
-
-The model was trained on EPIC-KITCHENS-100-MQA. It's intended to be used on videos that are similar to EPIC-KITCHENS-100.
-
-### Generation
-We provide the simple generation process for using our model. For more details, you could refer to our [Github](https://github.com/AdaptiveMotorControlLab/LLaVAction).
-
-!pip install llavaction
-
-from llavaction.model.builder import load_pretrained_model
-from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
-from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
-from llavaction.conversation import conv_templates, SeparatorStyle
-from PIL import Image
-import requests
-import copy
-import torch
-import sys
-import warnings
-from decord import VideoReader, cpu
-import numpy as np
-warnings.filterwarnings("ignore")
-
-#Your video (it assumes an egocentric view point)
-video_path = "XXXX"
-
-#These are the prompts we trained with, but you can test others:
-perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
-task_prompt = "Describe in details what you see from the video frames."
-
-def load_video(video_path, max_frames_num,fps=1,force_sample=False):
-    if max_frames_num == 0:
-        return np.zeros((1, 336, 336, 3))
-    vr = VideoReader(video_path, ctx=cpu(0),num_threads=1)
-    total_frame_num = len(vr)
-    video_time = total_frame_num / vr.get_avg_fps()
-    fps = round(vr.get_avg_fps()/fps)
-    frame_idx = [i for i in range(0, len(vr), fps)]
-    if len(frame_idx) > max_frames_num or force_sample:
-        sample_fps = max_frames_num
-        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
-        frame_idx = uniform_sampled_frames.tolist()
-        frame_time = [i/vr.get_avg_fps() for i in frame_idx]
-    spare_frames = vr.get_batch(frame_idx).asnumpy()
-    # import pdb;pdb.set_trace()
-    return spare_frames,frame_time,video_time
-
-pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B"
-model_name = "llava_qwen"
-device = "cuda"
-device_map = "auto"
-tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args
-model.eval()
-max_frames_num = 64
-video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
-video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
-video = [video]
-conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
-time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
-question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
-conv = copy.deepcopy(conv_templates[conv_template])
-conv.append_message(conv.roles[0], question)
-conv.append_message(conv.roles[1], None)
-prompt_question = conv.get_prompt()
-input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
-cont = model.generate(
-    input_ids,
-    images=video,
-    modalities= ["video"],
-    do_sample=False,
-    temperature=0,
-    max_new_tokens=4096,
-)
-text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
-print(text_outputs)
-```
-
-## Training
-
+## Model Description
+
+LLaVAction-0.5B is a multi-modal large language model (MLLM) trained for action recognition. It's based on the Qwen2 language model with a context window of 32K tokens and fine-tuned on the EPIC-KITCHENS-100-MQA dataset. The model takes video input and can answer questions about the actions being performed in the video. It achieves state-of-the-art performance on the EPIC-KITCHENS-100 Challenge and outperforms GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. It also shows improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench.
+
+## Paper Abstract
+
+Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
+
+## Usage
+
+### Intended Use
+The model was trained on EPIC-KITCHENS-100-MQA. It's intended to be used on videos that are similar to EPIC-KITCHENS-100, primarily egocentric videos of human actions.
+
+### Example Code
+
+```python
+# ... (Code example from the original model card) ...
+```
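+
+Below is a condensed sketch of that example, adapted from the generation script in the original card (removed above). It simplifies frame sampling to always take 64 uniformly spaced frames; `video_path` is a placeholder for your own egocentric clip, and `llavaction` (plus `decord` for video decoding) is assumed to be installed via `pip install llavaction`:
+
+```python
+import copy
+import numpy as np
+import torch
+from decord import VideoReader, cpu
+from llavaction.model.builder import load_pretrained_model
+from llavaction.mm_utils import tokenizer_image_token
+from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
+from llavaction.conversation import conv_templates
+
+video_path = "XXXX"  # path to your video (an egocentric viewpoint is assumed)
+max_frames_num = 64
+
+# Load the 0.5B checkpoint and its processors.
+tokenizer, model, image_processor, max_length = load_pretrained_model(
+    "MLAdaptiveIntelligence/LLaVAction-0.5B", None, "llava_qwen",
+    torch_dtype="bfloat16", device_map="auto")
+model.eval()
+
+# Uniformly sample frames and preprocess them into a bfloat16 tensor.
+vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
+frames = vr.get_batch(frame_idx).asnumpy()
+video_time = len(vr) / vr.get_avg_fps()
+video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
+
+# Build the prompt with the same instructions used during training.
+perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
+task_prompt = "Describe in details what you see from the video frames."
+time_instruction = f"The video lasts for {video_time:.2f} seconds, and {max_frames_num} frames are uniformly sampled from it. "
+question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
+conv = copy.deepcopy(conv_templates["qwen_1_5"])
+conv.append_message(conv.roles[0], question)
+conv.append_message(conv.roles[1], None)
+input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to("cuda")
+
+# Greedy decoding over the sampled frames.
+output = model.generate(input_ids, images=[video], modalities=["video"],
+                        do_sample=False, temperature=0, max_new_tokens=4096)
+print(tokenizer.batch_decode(output, skip_special_tokens=True)[0].strip())
+```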
+
+## Training Details
+
+See Ye et al. (2025) for full training details: [https://huggingface.co/papers/2503.18712](https://huggingface.co/papers/2503.18712)

 ### Model
 - **Architecture**: SO400M + Qwen2

@@ -141,14 +70,12 @@ See details in Ye et al. 2025: arxiv.org/abs/2503.18712


 ### Hardware & Software
-GPUs: 32 * Nvidia GH-200 (for whole model series training)
-Orchestration: HuggingFace Trainer
-Neural networks: PyTorch
+- GPUs: 32 * Nvidia GH-200 (for whole model series training)
+- Orchestration: HuggingFace Trainer
+- Neural networks: PyTorch

 ## Citation

-arXiv: arxiv.org/abs/2503.18712
-
 ```bibtex
 @article{YeQi2025llavaction,
   title={LLaVAction: evaluating and training multi-modal large language models for action recognition},