sunidhitandel committed
Commit 532a110 · verified · 1 Parent(s): e99a54d

Update README.md

Files changed (1)
  1. README.md +465 -431
README.md CHANGED
@@ -1,431 +1,465 @@
- ---
- datasets:
- - lmms-lab/LLaVA-OneVision-Data
- language:
- - en
- - zh
- library_name: transformers
- license: apache-2.0
- metrics:
- - accuracy
- tags:
- - multimodal
- model-index:
- - name: llava-onevision-qwen-7b-ov
-   results:
-   - task:
-       type: multimodal
-     dataset:
-       name: AI2D
-       type: ai2d
-     metrics:
-     - type: accuracy
-       value: 81.4
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: ChartQA
-       type: chartqa
-     metrics:
-     - type: accuracy
-       value: 80.0
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: DocVQA
-       type: docvqa
-     metrics:
-     - type: accuracy
-       value: 90.2
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: InfoVQA
-       type: infovqa
-     metrics:
-     - type: accuracy
-       value: 70.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MathVerse
-       type: mathverse
-     metrics:
-     - type: accuracy
-       value: 26.2
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MathVista
-       type: mathvista
-     metrics:
-     - type: accuracy
-       value: 63.2
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MMBench
-       type: mmbench
-     metrics:
-     - type: accuracy
-       value: 80.8
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MME-Perception
-       type: mme-perception
-     metrics:
-     - type: score
-       value: 1580
-       name: score
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MME-Cognition
-       type: mme-cognition
-     metrics:
-     - type: score
-       value: 418
-       name: score
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MMMU
-       type: mmmu
-     metrics:
-     - type: accuracy
-       value: 48.8
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MMVet
-       type: mmvet
-     metrics:
-     - type: accuracy
-       value: 57.5
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MMStar
-       type: mmstar
-     metrics:
-     - type: accuracy
-       value: 61.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: Seed-Bench
-       type: seed-bench
-     metrics:
-     - type: accuracy
-       value: 75.4
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: Science-QA
-       type: science-qa
-     metrics:
-     - type: accuracy
-       value: 96.0
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: ImageDC
-       type: imagedc
-     metrics:
-     - type: accuracy
-       value: 88.9
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MMLBench
-       type: mmlbench
-     metrics:
-     - type: accuracy
-       value: 77.1
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: RealWorldQA
-       type: realworldqa
-     metrics:
-     - type: accuracy
-       value: 66.3
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: Vibe-Eval
-       type: vibe-eval
-     metrics:
-     - type: accuracy
-       value: 51.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: LLaVA-W
-       type: llava-w
-     metrics:
-     - type: accuracy
-       value: 90.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: LLaVA-Wilder
-       type: l-wilder
-     metrics:
-     - type: accuracy
-       value: 67.8
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: ActNet-QA
-       type: actnet-qa
-     metrics:
-     - type: accuracy
-       value: 56.6
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: EgoSchema
-       type: egoschema
-     metrics:
-     - type: accuracy
-       value: 60.1
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MLVU
-       type: mlvu
-     metrics:
-     - type: accuracy
-       value: 64.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: MVBench
-       type: mvbench
-     metrics:
-     - type: accuracy
-       value: 56.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: NextQA
-       type: nextqa
-     metrics:
-     - type: accuracy
-       value: 79.4
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: PercepTest
-       type: percepTest
-     metrics:
-     - type: accuracy
-       value: 49.7
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: SeedBench
-       type: seedbench
-     metrics:
-     - type: accuracy
-       value: 56.9
-       name: accuracy
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: VideoChatGPT
-       type: videochatgpt
-     metrics:
-     - type: score
-       value: 3.49
-       name: score
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: VideoDC
-       type: videodc
-     metrics:
-     - type: score
-       value: 3.75
-       name: score
-       verified: true
-   - task:
-       type: multimodal
-     dataset:
-       name: VideoMME
-       type: videomme
-     metrics:
-     - type: accuracy
-       value: 58.2
-       name: accuracy
-       verified: true
- ---
-
-
- # LLaVA-OneVision
-
- ![banner](https://i.postimg.cc/pL17YtG4/WX20240508-220230-2x.png)
-
- Play with the model on the [LLaVA OneVision Chat](https://llava-onevision.lmms-lab.com/).
-
- ## Table of Contents
-
- 1. [Model Summary](##model-summary)
- 2. [Use](##use)
- 3. [Limitations](##limitations)
- 4. [Training](##training)
- 5. [License](##license)
- 6. [Citation](##citation)
-
- ## Model Summary
-
- The LLaVA-OneVision models are 0.5/7/72B parameter models trained on [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), based on Qwen2 language model with a context window of 32K tokens.
-
- - **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
- - **Project Website:** [llava-onevision.lmms-lab.com](llava-onevision.lmms-lab.com)
- - **Paper:** [LLaVA-OneVision](arxiv.org/abs/2408.03326)
- - **Point of Contact:** [Bo Li](mailto:drluodian@gmail.com)
- - **Languages:** English, Chinese
-
-
- ## Use
-
- ### Intended use
-
- The model was trained on [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and have the ability to interact with images, multi-image and videos.
-
- **Feel free to share your generations in the Community tab!**
-
- ### Generation
-
- We provide the simple generation process for using our model. For more details, you could refer to [Github](https://github.com/LLaVA-VL/LLaVA-NeXT).
-
- ```python
- # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
- from llava.model.builder import load_pretrained_model
- from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
- from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
- from llava.conversation import conv_templates, SeparatorStyle
-
- from PIL import Image
- import requests
- import copy
- import torch
-
- import sys
- import warnings
-
- warnings.filterwarnings("ignore")
- pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
- model_name = "llava_qwen"
- device = "cuda"
- device_map = "auto"
- tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args
-
- model.eval()
-
- url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
- image = Image.open(requests.get(url, stream=True).raw)
- image_tensor = process_images([image], image_processor, model.config)
- image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
-
- conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
- question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
- conv = copy.deepcopy(conv_templates[conv_template])
- conv.append_message(conv.roles[0], question)
- conv.append_message(conv.roles[1], None)
- prompt_question = conv.get_prompt()
-
- input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
- image_sizes = [image.size]
-
-
- cont = model.generate(
-     input_ids,
-     images=image_tensor,
-     image_sizes=image_sizes,
-     do_sample=False,
-     temperature=0,
-     max_new_tokens=4096,
- )
- text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
- print(text_outputs)
- ```
-
- # Training
-
- ## Model
-
- - **Architecture:** SO400M + Qwen2
- - **Pretraining Stage:** LCS-558K, 1 epoch, projector
- - **Mid Stage:** A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- - **Final-Image Stage:** A mixture of 3.6M single-image data, 1 epoch, full model
- - **OneVision Stage:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- - **Precision:** bfloat16
-
- ## Hardware & Software
-
- - **GPUs:** 256 * Nvidia Tesla A100 (for whole model series training)
- - **Orchestration:** [Huggingface Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
-
- # Citation
- ```
- @article{li2024llavaonevision,
-   title={LLaVA-OneVision},
- }
- ```
+ ---
+ base_model: lmms-lab/llava-onevision-qwen2-7b-ov
+ datasets:
+ - lmms-lab/EgoIT-99K
+ - Ego4D
+ language:
+ - en
+ library_name: transformers
+ license: apache-2.0
+ metrics:
+ - accuracy
+ tags:
+ - multimodal
+ - finetuned
+ - egocentric-vision
+ - video-qa
+ model-index:
+ - name: hpml-egoqa-baseline
+   results:
+   - task:
+       type: multimodal
+     dataset:
+       name: AI2D
+       type: ai2d
+     metrics:
+     - type: accuracy
+       value: 81.4
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: ChartQA
+       type: chartqa
+     metrics:
+     - type: accuracy
+       value: 80.0
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: DocVQA
+       type: docvqa
+     metrics:
+     - type: accuracy
+       value: 90.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: InfoVQA
+       type: infovqa
+     metrics:
+     - type: accuracy
+       value: 70.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MathVerse
+       type: mathverse
+     metrics:
+     - type: accuracy
+       value: 26.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MathVista
+       type: mathvista
+     metrics:
+     - type: accuracy
+       value: 63.2
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MMBench
+       type: mmbench
+     metrics:
+     - type: accuracy
+       value: 80.8
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MME-Perception
+       type: mme-perception
+     metrics:
+     - type: score
+       value: 1580
+       name: score
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MME-Cognition
+       type: mme-cognition
+     metrics:
+     - type: score
+       value: 418
+       name: score
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MMMU
+       type: mmmu
+     metrics:
+     - type: accuracy
+       value: 48.8
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MMVet
+       type: mmvet
+     metrics:
+     - type: accuracy
+       value: 57.5
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MMStar
+       type: mmstar
+     metrics:
+     - type: accuracy
+       value: 61.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: Seed-Bench
+       type: seed-bench
+     metrics:
+     - type: accuracy
+       value: 75.4
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: Science-QA
+       type: science-qa
+     metrics:
+     - type: accuracy
+       value: 96.0
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: ImageDC
+       type: imagedc
+     metrics:
+     - type: accuracy
+       value: 88.9
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MMLBench
+       type: mmlbench
+     metrics:
+     - type: accuracy
+       value: 77.1
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: RealWorldQA
+       type: realworldqa
+     metrics:
+     - type: accuracy
+       value: 66.3
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: Vibe-Eval
+       type: vibe-eval
+     metrics:
+     - type: accuracy
+       value: 51.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: LLaVA-W
+       type: llava-w
+     metrics:
+     - type: accuracy
+       value: 90.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: LLaVA-Wilder
+       type: l-wilder
+     metrics:
+     - type: accuracy
+       value: 67.8
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: ActNet-QA
+       type: actnet-qa
+     metrics:
+     - type: accuracy
+       value: 56.6
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: EgoSchema
+       type: egoschema
+     metrics:
+     - type: accuracy
+       value: 60.1
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MLVU
+       type: mlvu
+     metrics:
+     - type: accuracy
+       value: 64.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: MVBench
+       type: mvbench
+     metrics:
+     - type: accuracy
+       value: 56.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: NextQA
+       type: nextqa
+     metrics:
+     - type: accuracy
+       value: 79.4
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: PercepTest
+       type: percepTest
+     metrics:
+     - type: accuracy
+       value: 49.7
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: SeedBench
+       type: seedbench
+     metrics:
+     - type: accuracy
+       value: 56.9
+       name: accuracy
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoChatGPT
+       type: videochatgpt
+     metrics:
+     - type: score
+       value: 3.49
+       name: score
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoDC
+       type: videodc
+     metrics:
+     - type: score
+       value: 3.75
+       name: score
+       verified: true
+   - task:
+       type: multimodal
+     dataset:
+       name: VideoMME
+       type: videomme
+     metrics:
+     - type: accuracy
+       value: 58.2
+       name: accuracy
+       verified: true
+ ---
+
+
+ # HPML-EgoQA-Baseline
+
+ This is a **finetuned** version of [LLaVA-OneVision-Qwen2-7B-OV](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov), adapted for egocentric vision-language tasks.
+
+ ## Table of Contents
+
+ 1. [Model Summary](#model-summary)
+ 2. [Use](#use)
+ 3. [Limitations](#limitations)
+ 4. [Training](#training)
+ 5. [License](#license)
+ 6. [Citation](#citation)
+
+ ## Model Summary
+
+ This model is a **finetuned** version of LLaVA-OneVision-Qwen2-7B-OV, trained on [EgoIT-99K](https://huggingface.co/datasets/lmms-lab/EgoIT-99K) and Ego4D-like data for egocentric video question answering. The base model is a 7B-parameter multimodal model built on the Qwen2 language model with a 32K-token context window, capable of understanding single-image, multi-image, and video inputs.
+
+ - **Base Model:** [lmms-lab/llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov)
+ - **Finetuning Dataset:** [EgoIT-99K](https://huggingface.co/datasets/lmms-lab/EgoIT-99K) and Ego4D
+ - **Languages:** English
+ - **Project:** HPML (High-Performance Machine Learning)
+ - **Team Members:** Sunidhi Tandel, Rahil Singhi, and team
+
+
+ ## Use
+
+ ### Intended use
+
+ This model is **finetuned** on [EgoIT-99K](https://huggingface.co/datasets/lmms-lab/EgoIT-99K) and Ego4D datasets for egocentric vision-language understanding, particularly video question answering from a first-person perspective. It inherits the base model's ability to interact with single-image, multi-image, and video inputs, with enhanced capabilities for egocentric video understanding.
+
+ **Feel free to share your generations in the Community tab!**
+
+ ### Generation
+
+ We provide a simple generation example below. For more details, refer to the [LLaVA-NeXT GitHub repository](https://github.com/LLaVA-VL/LLaVA-NeXT).
+
+ ```python
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
+ from llava.model.builder import load_pretrained_model
+ from llava.mm_utils import process_images, tokenizer_image_token
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
+ from llava.conversation import conv_templates
+
+ from PIL import Image
+ import requests
+ import copy
+ import torch
+
+ import warnings
+
+ warnings.filterwarnings("ignore")
+ pretrained = "sunidhitandel/hpml-egoqa-baseline"  # Finetuned model
+ model_name = "llava_qwen"
+ device = "cuda"
+ device_map = "auto"
+ # Pass any extra llava_model_args as keyword arguments here.
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
+
+ model.eval()
+
+ # Load an example image.
+ url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
+ image = Image.open(requests.get(url, stream=True).raw)
+ image_tensor = process_images([image], image_processor, model.config)
+ image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
+
+ conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
+ question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
+ conv = copy.deepcopy(conv_templates[conv_template])
+ conv.append_message(conv.roles[0], question)
+ conv.append_message(conv.roles[1], None)
+ prompt_question = conv.get_prompt()
+
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
+ image_sizes = [image.size]
+
+ cont = model.generate(
+     input_ids,
+     images=image_tensor,
+     image_sizes=image_sizes,
+     do_sample=False,
+     temperature=0,
+     max_new_tokens=4096,
+ )
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
+ print(text_outputs)
+ ```
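+
+ Because this checkpoint targets egocentric video QA, you will typically want to feed a video clip rather than a single image. Below is a minimal video-inference sketch in the style of the LLaVA-NeXT video examples; it reuses the objects loaded above, assumes `decord` for frame sampling, and the clip path and question are placeholders.
+
+ ```python
+ # Minimal video-QA sketch (assumes the setup from the block above).
+ from decord import VideoReader, cpu
+ import numpy as np
+
+ def load_frames(video_path, num_frames=16):
+     # Uniformly sample num_frames RGB frames from the clip.
+     vr = VideoReader(video_path, ctx=cpu(0))
+     idx = np.linspace(0, len(vr) - 1, num_frames, dtype=int).tolist()
+     return vr.get_batch(idx).asnumpy()  # (num_frames, H, W, 3)
+
+ frames = load_frames("egocentric_clip.mp4")  # placeholder path
+ video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
+ video = video.to(dtype=torch.float16, device=device)
+
+ question = DEFAULT_IMAGE_TOKEN + "\nWhat is the camera wearer doing?"
+ conv = copy.deepcopy(conv_templates[conv_template])
+ conv.append_message(conv.roles[0], question)
+ conv.append_message(conv.roles[1], None)
+ input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
+
+ out = model.generate(
+     input_ids,
+     images=[video],         # one tensor holding the stacked frames
+     modalities=["video"],   # treat the frame stack as a single video
+     do_sample=False,
+     max_new_tokens=512,
+ )
+ print(tokenizer.batch_decode(out, skip_special_tokens=True))
+ ```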
+
+ # Training
+
+ ## Base Model
+
+ This model is finetuned from [LLaVA-OneVision-Qwen2-7B-OV](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov), which was trained in the following stages:
+ - **Architecture:** SO400M + Qwen2
+ - **Pretraining Stage:** LCS-558K, 1 epoch, projector
+ - **Mid Stage:** A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
+ - **Final-Image Stage:** A mixture of 3.6M single-image data, 1 epoch, full model
+ - **OneVision Stage:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
+
+ ## Finetuning
+
+ - **Base Model:** lmms-lab/llava-onevision-qwen2-7b-ov
+ - **Finetuning Dataset:** EgoIT-99K and Ego4D (egocentric video QA data)
+ - **Task:** Egocentric video question answering
+ - **Precision:** bfloat16
+ - **Method:** Full fine-tuning / LoRA (depending on configuration); a LoRA sketch follows this list
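+
+ As a concrete reference for the LoRA option, here is a minimal sketch using the `peft` library; the rank, scaling, and target modules below are illustrative assumptions, not the exact settings used to train this checkpoint.
+
+ ```python
+ # Hypothetical LoRA setup for the Qwen2 backbone; hyperparameters are illustrative only.
+ from peft import LoraConfig, get_peft_model
+
+ lora_config = LoraConfig(
+     r=16,                     # low-rank adapter dimension (assumed)
+     lora_alpha=32,            # LoRA scaling factor (assumed)
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen2 attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()  # only a small fraction of the 7B weights are trained
+ ```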
+
+ ## Hardware & Software
+
+ - **GPUs:** Nvidia A100 (for finetuning)
+ - **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
+ - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
+
+ # Citation
+
+ If you use this finetuned model, please cite both the base model and this work:
+
+ ```bibtex
+ @article{li2024llavaonevision,
+   title={LLaVA-OneVision: Easy Visual Task Transfer},
+   author={Li, Bo and others},
+   journal={arXiv preprint arXiv:2408.03326},
+   year={2024}
+ }
+
+ @misc{hpml-egoqa-baseline,
+   title={HPML-EgoQA-Baseline: Finetuned LLaVA-OneVision for Egocentric Video QA},
+   author={Tandel, Sunidhi and Singhi, Rahil and HPML Project Team},
+   year={2024},
+   howpublished={\url{https://huggingface.co/sunidhitandel/hpml-egoqa-baseline}},
+   note={HPML Project - High-Performance Machine Learning for Egocentric Vision}
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is part of the HPML (High-Performance Machine Learning) project. We thank the LLaVA-OneVision team for the base model, and the EgoIT-99K dataset contributors.
+
+ **Team Members:**
+ - Sunidhi Tandel
+ - Rahil Singhi
+ - HPML Project Team