Fix paper link and add abstract

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +25 -98
README.md CHANGED
@@ -1,9 +1,12 @@
  ---
- license: cc-by-nc-sa-4.0
- language:
- - en
  base_model:
  - lmms-lab/llava-onevision-qwen2-0.5b-ov
  pipeline_tag: video-text-to-text
  tags:
  - Action
@@ -12,9 +15,6 @@ tags:
  - multimodal
  - MLLMs
  - LLaVAction
- metrics:
- - accuracy
- library_name: transformers
  ---

  # LLaVAction-0.5B
@@ -33,105 +33,34 @@ library_name: transformers

  <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author

- \[[arXiv Paper](arxiv.org/abs/2503.18712)\] &nbsp; \[[Project Page](https://mmathislab.github.io/llavaction/)\] &nbsp; \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] &nbsp;

  </div>

- ## Model Summary
- The LLaVAction-0.5B model is trained on EPIC-KITCHENS-100-MQA, based on Qwen2 language model with a context window of 32K tokens.

- - **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/)
- - **Paper**: For more details, please check our [paper](https://arxiv.org/abs/tbd)
- - **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
- - **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
- - **Languages**: English
- -
- ## Useage

- ### Intended use
- The model was trained on EPIC-KITCHENS-100-MQA. It's intended to be used on videos that are similar to EPIC-KITCHENS-100.

- ### Generation
- We provide the simple generation process for using our model. For more details, you could refer to our [Github](https://github.com/AdaptiveMotorControlLab/LLaVAction).

- ```python
- !pip install llavaction
-
- from llavaction.model.builder import load_pretrained_model
- from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
- from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
- from llavaction.conversation import conv_templates, SeparatorStyle
- from PIL import Image
- import requests
- import copy
- import torch
- import sys
- import warnings
- from decord import VideoReader, cpu
- import numpy as np
- warnings.filterwarnings("ignore")
-
- #Your video (it assumes an egocentric view point)
- video_path = "XXXX"
-
- #These are the prompts we trained with, but you can test others:
- perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
- task_prompt = "Describe in details what you see from the video frames."
-
-
- def load_video(video_path, max_frames_num, fps=1, force_sample=False):
-     if max_frames_num == 0:
-         return np.zeros((1, 336, 336, 3))
-     vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
-     total_frame_num = len(vr)
-     video_time = total_frame_num / vr.get_avg_fps()
-     fps = round(vr.get_avg_fps() / fps)
-     frame_idx = [i for i in range(0, len(vr), fps)]
-     if len(frame_idx) > max_frames_num or force_sample:
-         sample_fps = max_frames_num
-         uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
-         frame_idx = uniform_sampled_frames.tolist()
-     frame_time = [i / vr.get_avg_fps() for i in frame_idx]
-     spare_frames = vr.get_batch(frame_idx).asnumpy()
-     return spare_frames, frame_time, video_time
-
- pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B"
- model_name = "llava_qwen"
- device = "cuda"
- device_map = "auto"
- tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
- model.eval()
- max_frames_num = 64
- video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
- video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
- video = [video]
- conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
- time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
- question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"
- conv = copy.deepcopy(conv_templates[conv_template])
- conv.append_message(conv.roles[0], question)
- conv.append_message(conv.roles[1], None)
- prompt_question = conv.get_prompt()
- input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
- cont = model.generate(
-     input_ids,
-     images=video,
-     modalities=["video"],
-     do_sample=False,
-     temperature=0,
-     max_new_tokens=4096,
- )
- text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
- print(text_outputs)
- ```


- ## Training

- See details in Ye et al. 2025: arxiv.org/abs/2503.18712


  ### Model
  - **Architecture**: SO400M + Qwen2
@@ -141,14 +70,12 @@ See details in Ye et al. 2025: arxiv.org/abs/2503.18712

  ### Hardware & Software
- GPUs: 32 * Nvidia GH-200 (for whole model series training)
- Orchestration: HuggingFace Trainer
- Neural networks: PyTorch

  ## Citation

- arXiv: arxiv.org/abs/2503.18712
-
  ```bibtex
  @article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
 
  ---
  base_model:
  - lmms-lab/llava-onevision-qwen2-0.5b-ov
+ language:
+ - en
+ library_name: transformers
+ license: cc-by-nc-sa-4.0
+ metrics:
+ - accuracy
  pipeline_tag: video-text-to-text
  tags:
  - Action
  - multimodal
  - MLLMs
  - LLaVAction
  ---

  # LLaVAction-0.5B

  <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author

+ \[[Paper](https://huggingface.co/papers/2503.18712)\] &nbsp; \[[Project Page](https://mmathislab.github.io/llavaction/)\] &nbsp; \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] &nbsp;

  </div>

+ ## Model Description

+ LLaVAction-0.5B is a multi-modal large language model (MLLM) trained for action recognition. It's based on the Qwen2 language model with a context window of 32K tokens and fine-tuned on the EPIC-KITCHENS-100-MQA dataset. The model takes video input and can answer questions about the actions being performed in the video. It achieves state-of-the-art performance on the EPIC-KITCHENS-100 Challenge and outperforms GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. It also shows improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench.

+ ## Paper Abstract

+ Understanding human behavior requires measuring behavioral actions. Due to its complexity, behavior is best mapped onto a rich, semantic structure such as language. The recent development of multi-modal large language models (MLLMs) is a promising candidate for a wide range of action understanding tasks. In this work, we focus on evaluating and then improving MLLMs to perform action recognition. We reformulate EPIC-KITCHENS-100, one of the largest and most challenging egocentric action datasets, to the form of video multiple question answering (EPIC-KITCHENS-100-MQA). We show that when we sample difficult incorrect answers as distractors, leading MLLMs struggle to recognize the correct actions. We propose a series of methods that greatly improve the MLLMs' ability to perform action recognition, achieving state-of-the-art on both the EPIC-KITCHENS-100 validation set, as well as outperforming GPT-4o by 21 points in accuracy on EPIC-KITCHENS-100-MQA. Lastly, we show improvements on other action-related video benchmarks such as EgoSchema, PerceptionTest, LongVideoBench, VideoMME and MVBench, suggesting that MLLMs are a promising path forward for complex action tasks. Code and models are available at: https://github.com/AdaptiveMotorControlLab/LLaVAction.
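As a purely illustrative aside, a minimal sketch of how one annotated action segment could be rendered as a multiple-choice item in the spirit of EPIC-KITCHENS-100-MQA is shown below; the action labels, distractors, and option layout are hypothetical and not taken from the actual dataset.

```python
# Hypothetical illustration only: render one annotated action as a multiple-choice
# question with deliberately similar (hard) distractors, mirroring the MQA setup
# described in the abstract. The labels below are invented for illustration.
import random

ground_truth = "cut onion"
distractors = ["peel onion", "cut tomato", "wash onion"]  # hard, similar actions

options = distractors + [ground_truth]
random.shuffle(options)
letters = "ABCD"

prompt = "What action are you performing in the video?\n" + "\n".join(
    f"{letter}. {text}" for letter, text in zip(letters, options)
)
correct_letter = letters[options.index(ground_truth)]
print(prompt)
print("Correct option:", correct_letter)
```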

+ ## Usage

+ ### Intended Use
+ The model was trained on EPIC-KITCHENS-100-MQA. It's intended to be used on videos that are similar to EPIC-KITCHENS-100, primarily egocentric videos of human actions.

+ ### Example Code

+ ```python
+ # ... (Code example from the original model card) ...
+ ```
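For reference, here is a condensed sketch of the generation example from the original model card; it assumes the `llavaction` package API used there (see the full example in the removed section above), and the video path, prompts, and frame count are placeholders to adapt.

```python
# Condensed sketch of the original card's generation example; assumes the
# llavaction package API shown in that example. Video path and frame count
# are placeholders.
import copy

import numpy as np
import torch
from decord import VideoReader, cpu
from llavaction.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llavaction.conversation import conv_templates
from llavaction.mm_utils import tokenizer_image_token
from llavaction.model.builder import load_pretrained_model

video_path = "XXXX"  # your (egocentric) video
max_frames_num = 64

# Uniformly sample frames from the video.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
frames = vr.get_batch(frame_idx).asnumpy()
video_time = len(vr) / vr.get_avg_fps()

# Load the pretrained LLaVAction-0.5B checkpoint.
tokenizer, model, image_processor, _ = load_pretrained_model(
    "MLAdaptiveIntelligence/LLaVAction-0.5B", None, "llava_qwen",
    torch_dtype="bfloat16", device_map="auto")
model.eval()

video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video = [video.cuda().to(torch.bfloat16)]

# Prompts used during training (other prompts can be tested).
perspective_prompt = ("You are seeing this video from egocentric view and you are "
                      "the person. Your hands are sometimes interacting with objects. "
                      "What action are you doing?")
task_prompt = "Describe in details what you see from the video frames."
time_instruction = (f"The video lasts for {video_time:.2f} seconds, and "
                    f"{len(video[0])} frames are uniformly sampled from it. ")
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"

conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX,
                                  return_tensors="pt").unsqueeze(0).to("cuda")

output = model.generate(input_ids, images=video, modalities=["video"],
                        do_sample=False, temperature=0, max_new_tokens=4096)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0].strip())
```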
+
+ ## Training Details
+
+ See Ye et al. (2025) for full training details: [https://huggingface.co/papers/2503.18712](https://huggingface.co/papers/2503.18712)

  ### Model
  - **Architecture**: SO400M + Qwen2

  ### Hardware & Software
+ - GPUs: 32 * Nvidia GH-200 (for whole model series training)
+ - Orchestration: HuggingFace Trainer
+ - Neural networks: PyTorch

  ## Citation

  ```bibtex
  @article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},