lbourdois committed (verified)

Commit a2ae2ea · 1 Parent(s): f744d32

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that the README announces 29 languages, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1):
  1. README.md +587 -575
README.md CHANGED
@@ -1,576 +1,588 @@
- ---
- license: cc-by-nc-4.0
- pipeline_tag: image-text-to-text
- library_name: transformers
- base_model:
- - google/paligemma-3b-mix-448
- - Qwen/Qwen2.5-1.5B-Instruct
- - google/siglip-so400m-patch14-384
- base_model_relation: merge
- language:
- - multilingual
- tags:
- - eagle
- - VLM
- ---
+ ---
+ license: cc-by-nc-4.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ base_model:
+ - google/paligemma-3b-mix-448
+ - Qwen/Qwen2.5-1.5B-Instruct
+ - google/siglip-so400m-patch14-384
+ base_model_relation: merge
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ tags:
+ - eagle
+ - VLM
+ ---
+
29
+
30
+ # Eagle-2
31
+
32
+ [\[📂 GitHub\]](https://github.com/NVlabs/EAGLE) [\[📜 Eagle2 Tech Report\]](https://github.com/NVlabs/EAGLE/blob/main/Eagle2/Eagle2_report.pdf)
33
+ [\[🗨️ Chat Demo\]](http://eagle-vlm.xyz/) [\[🤗 HF Demo\]](TODO)
34
+ ## Introduction
35
+
36
+ We are thrilled to release our latest Eagle2 series Vision-Language Model. Open-source Vision-Language Models (VLMs) have made significant strides in narrowing the gap with proprietary models. However, critical details about data strategies and implementation are often missing, limiting reproducibility and innovation. In this project, we focus on VLM post-training from a data-centric perspective, sharing insights into building effective data strategies from scratch. By combining these strategies with robust training recipes and model design, we introduce Eagle2, a family of performant VLMs. Our work aims to empower the open-source community to develop competitive VLMs with transparent processes.
37
+
38
+
39
+
40
+ In this repo, we are open-sourcing Eagle2-2B, a lightweight model that achieves remarkable efficiency and speed while maintaining solid performance.
41
+
42
+
43
+
44
+
45
+
46
+
47
+
48
+
49
+ ## Model Zoo
50
+ We provide the following models:
51
+
52
+ | model name | LLM | Vision | Max Length| HF Link|
53
+ | ----------- | ------- |---------|-|-|
54
+ | Eagle2-1B | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | Siglip | 16K| [🤗 link](https://huggingface.co/NVIDIA/Eagle2-1B)|
55
+ | Eagle2-2B | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | Siglip | 16K| [🤗 link](https://huggingface.co/NVIDIA/Eagle2-2B)|
56
+ | Eagle2-9B | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Siglip+ConvNext | 16K| [🤗 link](https://huggingface.co/NVIDIA/Eagle2-9B)|
57
+
58
+ ## Benchmark Results
59
+ | Benchmark | InternVL2-2B | InternVL2.5-2B | InternVL2-4B |Qwen2-VL-2B| Eagle2-2B|
60
+ | :--------------------------: | :------------------: | :----------------: | :----------: |:----------: |:----------: |
61
+ | DocVQA<sub>test</sub> | 86.9 | 88.7 | 89.2 |90.1|88.0|
62
+ | ChartQA<sub>test</sub> | 76.2 | 79.2 | 81.5 |73.0|82.0|
63
+ | InfoVQA<sub>test</sub> | 58.9 | 60.9 | 67.0 |65.5|65.8|
64
+ | TextVQA<sub>val</sub> | 73.4 | 74.3 | 74.4 |79.7|79.1|
65
+ | OCRBench | 784 | 804 | 788 |809|818|
66
+ | MME<sub>sum</sub> | 1876.8 | 2138.2 | 2059.8 |1872.0 | 2109.8
67
+ | RealWorldQA | 57.3 | 60.1 | 60.7 |62.6|63.1|
68
+ | AI2D<sub>test</sub> | 74.1 | 74.9 | 74.7 | 78.9 |79.3|
69
+ | MMMU<sub>val</sub> | 36.3 | 43.6 | 47.9 |41.1|43.1|
70
+ | MMVet<sub>GPT-4-Turbo</sub> | 39.5 | 60.8 | 51.0 | 49.5|53.8|
71
+ | HallBench<sub>avg</sub> | 37.9 | 42.6 | 41.9 |41.7|45.8
72
+ | MathVista<sub>testmini</sub> | 46.3 | 51.3 | 58.6 |43.0|54.7|
73
+ | MMstar | 50.1 | 53.7 | 54.3|48.0|56.4|
74
+
75
+
76
+
+ ## Quick Start
+
+ We provide a [demo inference script](./demo.py) to help you quickly start using the model. We support different input types:
+ - pure text input
+ - single image input
+ - multiple image input
+ - video input
+
+ ### 0. Install the dependencies
+
+ ```bash
+ pip install transformers==4.37.2
+ pip install flash-attn
+ ```
+ **Note**: the latest version of transformers is not compatible with this model; please use transformers==4.37.2 as pinned above.
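+
+ If you want to fail fast on an incompatible environment, a minimal sketch like the following can be run before loading the model (the `EXPECTED` constant is just the version pinned above, not an official check from this repo):
+ ```python
+ # Optional sanity check: verify that the pinned transformers version is installed.
+ import transformers
+
+ EXPECTED = "4.37.2"  # the version pinned in the install step above
+ if transformers.__version__ != EXPECTED:
+     raise RuntimeError(
+         f"transformers=={EXPECTED} is required for this model, "
+         f"found {transformers.__version__}"
+     )
+ ```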
+
+ ### 1. Prepare the Model worker
+
+ <details>
+ <summary>Click to expand</summary>
+
+ ```python
+ """
+ A model worker executes the model.
+ Copied and modified from https://github.com/OpenGVLab/InternVL/blob/main/streamlit_demo/model_worker.py
+ """
+ # Importing torch before transformers can cause a `segmentation fault`,
+ # so transformers is imported first.
+ from transformers import AutoModel, AutoTokenizer, TextIteratorStreamer, AutoConfig
+
+ import argparse
+ import base64
+ import json
+ import os
+ import decord
+ import threading
+ import time
+ from io import BytesIO
+ from threading import Thread
+ import math
+ import requests
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ import numpy as np
+
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+ SIGLIP_MEAN = (0.5, 0.5, 0.5)
+ SIGLIP_STD = (0.5, 0.5, 0.5)
+
+
+ def get_seq_frames(total_num_frames, desired_num_frames=-1, stride=-1):
+     """
+     Calculate the indices of frames to extract from a video.
+
+     Parameters:
+         total_num_frames (int): Total number of frames in the video.
+         desired_num_frames (int): Desired number of frames to extract.
+         stride (int): Fixed sampling stride; mutually exclusive with desired_num_frames.
+
+     Returns:
+         list: List of indices of frames to extract.
+     """
+     # Exactly one of desired_num_frames and stride must be set.
+     assert (desired_num_frames > 0 or stride > 0) and not (desired_num_frames > 0 and stride > 0)
+
+     if stride > 0:
+         return list(range(0, total_num_frames, stride))
+
+     # Calculate the size of each segment from which a frame will be extracted
+     seg_size = float(total_num_frames - 1) / desired_num_frames
+
+     seq = []
+     for i in range(desired_num_frames):
+         # Calculate the start and end indices of each segment
+         start = int(np.round(seg_size * i))
+         end = int(np.round(seg_size * (i + 1)))
+
+         # Append the middle index of the segment to the list
+         seq.append((start + end) // 2)
+
+     return seq
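+
+ # Example (hypothetical numbers): with total_num_frames=100 and
+ # desired_num_frames=4, seg_size = 99 / 4 = 24.75, so the rounded segment
+ # boundaries are [0, 25, 50, 74, 99] and the returned midpoint indices
+ # are [12, 37, 62, 86].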
+
+ def build_video_prompt(meta_list, num_frames, time_position=False):
+     # If time_position is True, each frame's timestamp is included in the prompt.
+     # The TIME_POSITION environment variable, when set, overrides the argument.
+     time_position = os.environ.get("TIME_POSITION", time_position)
+     prefix = "This is a video:\n"
+     for i in range(num_frames):
+         if time_position:
+             frame_txt = f"Frame {i+1} sampled at {meta_list[i]:.2f} seconds: <image>\n"
+         else:
+             frame_txt = f"Frame {i+1}: <image>\n"
+         prefix += frame_txt
+     return prefix
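+
+ # For example, with time_position=True and two frames sampled at 0.0 s and
+ # 2.5 s, the returned prompt is:
+ #   This is a video:
+ #   Frame 1 sampled at 0.00 seconds: <image>
+ #   Frame 2 sampled at 2.50 seconds: <image>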
+
+ def load_video(video_path, num_frames=64, frame_cache_root=None):
+     if isinstance(video_path, str):
+         video = decord.VideoReader(video_path)
+     elif isinstance(video_path, dict):
+         raise NotImplementedError('dict inputs are not supported for "video_path"')
+     fps = video.get_avg_fps()
+     sampled_frames = get_seq_frames(len(video), num_frames)
+     sampled_timestamps = [i / fps for i in sampled_frames]
+     frames = video.get_batch(sampled_frames).asnumpy()
+     images = [Image.fromarray(frame) for frame in frames]
+
+     return images, build_video_prompt(sampled_timestamps, len(images), time_position=True)
+
+ def load_image(image):
+     if isinstance(image, str) and os.path.exists(image):
+         return Image.open(image)
+     elif isinstance(image, dict):
+         if 'disk_path' in image:
+             return Image.open(image['disk_path'])
+         elif 'base64' in image:
+             return Image.open(BytesIO(base64.b64decode(image['base64'])))
+         elif 'url' in image:
+             response = requests.get(image['url'])
+             return Image.open(BytesIO(response.content))
+         elif 'bytes' in image:
+             return Image.open(BytesIO(image['bytes']))
+         else:
+             raise ValueError(f'Invalid image: {image}')
+     else:
+         raise ValueError(f'Invalid image: {image}')
+
+ def build_transform(input_size, norm_type='imagenet'):
+     if norm_type == 'imagenet':
+         MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     elif norm_type == 'siglip':
+         MEAN, STD = SIGLIP_MEAN, SIGLIP_STD
+     else:
+         raise ValueError(f'Unknown norm_type: {norm_type}')
+
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     """
+     The previous version mainly focused on the aspect ratio;
+     we also consider the area ratio here.
+     """
+     best_factor = float('-inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         area_ratio = (ratio[0] * ratio[1] * image_size * image_size) / area
+         # Covering more than 60% of the original image area is enough,
+         # so the area term is capped at 0.6.
+         factor_based_on_area_n_ratio = min(area_ratio, 0.6) * \
+             min(target_aspect_ratio / aspect_ratio, aspect_ratio / target_aspect_ratio)
+
+         if factor_based_on_area_n_ratio > best_factor:
+             best_factor = factor_based_on_area_n_ratio
+             best_ratio = ratio
+
+     return best_ratio
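+
+ # Example (hypothetical numbers): for an 800x600 image (aspect ratio 4:3)
+ # with image_size=448 and the default max_num=6 grid candidates, the
+ # (3, 2) grid scores highest, so the image is tiled 3 columns x 2 rows.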
+
+
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     # enumerate the candidate tile grids (i columns x j rows)
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # find the closest aspect ratio to the target
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # calculate the target width and height
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # resize the image
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         # split the image
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
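+
+ # Continuing the example above: the 800x600 image is resized to the (3, 2)
+ # grid target of 1344x896 and cropped into six 448x448 tiles, plus one
+ # 448x448 thumbnail of the full image when use_thumbnail=True.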
+
+ def split_model(model_path, device):
+     device_map = {}
+     world_size = torch.cuda.device_count()
+     config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
+     num_layers = config.llm_config.num_hidden_layers
+
+     print('world_size', world_size)
+     # Spread the LLM layers over the other GPUs; the GPU given by `device`
+     # also hosts the vision tower, embeddings, and output head.
+     num_layers_per_gpu_ = math.floor(num_layers / (world_size - 1))
+     num_layers_per_gpu = [num_layers_per_gpu_] * world_size
+     num_layers_per_gpu[device] = num_layers - num_layers_per_gpu_ * (world_size - 1)
+     print(num_layers_per_gpu)
+     layer_cnt = 0
+     for i, num_layer in enumerate(num_layers_per_gpu):
+         for j in range(num_layer):
+             device_map[f'language_model.model.layers.{layer_cnt}'] = i
+             layer_cnt += 1
+     device_map['vision_model'] = device
+     device_map['mlp1'] = device
+     device_map['language_model.model.tok_embeddings'] = device
+     device_map['language_model.model.embed_tokens'] = device
+     device_map['language_model.output'] = device
+     device_map['language_model.model.norm'] = device
+     device_map['language_model.lm_head'] = device
+     device_map['language_model.model.rotary_emb'] = device
+     device_map[f'language_model.model.layers.{num_layers - 1}'] = device
+     return device_map
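+
+ # Example (hypothetical numbers): with 4 GPUs, a 48-layer LLM, and device=0,
+ # the layers are split 16/16/16 across GPUs 1-3, while GPU 0 hosts the vision
+ # tower, embeddings, output head, and (via the final override) the last layer.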
+
+ class ModelWorker:
+     def __init__(self, model_path, model_name,
+                  load_8bit, device):
+         if model_path.endswith('/'):
+             model_path = model_path[:-1]
+         if model_name is None:
+             model_paths = model_path.split('/')
+             if model_paths[-1].startswith('checkpoint-'):
+                 self.model_name = model_paths[-2] + '_' + model_paths[-1]
+             else:
+                 self.model_name = model_paths[-1]
+         else:
+             self.model_name = model_name
+
+         print(f'Loading the model {self.model_name}')
+
+         tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
+         # Drop the box/ref tags from the special-token list so they are kept in decoded text.
+         tokens_to_keep = ['<box>', '</box>', '<ref>', '</ref>']
+         tokenizer.additional_special_tokens = [item for item in tokenizer.additional_special_tokens if item not in tokens_to_keep]
+         self.tokenizer = tokenizer
+         config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
+         model_type = config.vision_config.model_type
+         self.device = torch.cuda.current_device()
+         if model_type == 'siglip_vision_model':
+             self.norm_type = 'siglip'
+         elif model_type == 'MOB':
+             self.norm_type = 'siglip'
+         else:
+             self.norm_type = 'imagenet'
+
+         # Only the largest checkpoints need to be split across multiple GPUs.
+         if any(x in model_path.lower() for x in ['34b']):
+             device_map = split_model(model_path, self.device)
+         else:
+             device_map = None
+         self.device_map = device_map
+
+         if device_map is not None:
+             self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
+                                                    low_cpu_mem_usage=True,
+                                                    device_map=device_map,
+                                                    trust_remote_code=True,
+                                                    load_in_8bit=load_8bit).eval()
+         else:
+             self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
+                                                    trust_remote_code=True,
+                                                    load_in_8bit=load_8bit).eval()
+
+         if not load_8bit and device_map is None:
+             self.model = self.model.to(device)
+         self.load_8bit = load_8bit
+
+         self.model_path = model_path
+         self.image_size = self.model.config.force_image_size
+         self.context_len = tokenizer.model_max_length
+         self.per_tile_len = 256
+
+     def reload_model(self):
+         del self.model
+         torch.cuda.empty_cache()
+         if self.device == 'auto':
+             os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
+             # This can make distributed deployment work properly
+             self.model = AutoModel.from_pretrained(
+                 self.model_path,
+                 load_in_8bit=self.load_8bit,
+                 torch_dtype=torch.bfloat16,
+                 device_map=self.device_map,
+                 trust_remote_code=True).eval()
+         else:
+             self.model = AutoModel.from_pretrained(
+                 self.model_path,
+                 load_in_8bit=self.load_8bit,
+                 torch_dtype=torch.bfloat16,
+                 trust_remote_code=True).eval()
+         if not self.load_8bit and not self.device == 'auto':
+             self.model = self.model.cuda()
+
+     @torch.inference_mode()
+     def generate(self, params):
+         system_message = params['prompt'][0]['content']
+         send_messages = params['prompt'][1:]
+         max_input_tiles = params['max_input_tiles']
+         temperature = params['temperature']
+         top_p = params['top_p']
+         max_new_tokens = params['max_new_tokens']
+         repetition_penalty = params['repetition_penalty']
+         video_frame_num = params.get('video_frame_num', 64)
+         do_sample = temperature > 0.0
+
+         global_image_cnt = 0
+         history, pil_images, max_input_tile_list = [], [], []
+         for message in send_messages:
+             if message['role'] == 'user':
+                 prefix = ''
+                 if 'image' in message:
+                     for image_data in message['image']:
+                         pil_images.append(load_image(image_data))
+                         prefix = prefix + f'<image {global_image_cnt + 1}><image>\n'
+                         global_image_cnt += 1
+                         max_input_tile_list.append(max_input_tiles)
+                 if 'video' in message:
+                     for video_data in message['video']:
+                         video_frames, tmp_prefix = load_video(video_data, num_frames=video_frame_num)
+                         pil_images.extend(video_frames)
+                         prefix = prefix + tmp_prefix
+                         global_image_cnt += len(video_frames)
+                         # Video frames are limited to one tile each.
+                         max_input_tile_list.extend([1] * len(video_frames))
+                 content = prefix + message['content']
+                 history.append([content, ])
+             else:
+                 history[-1].append(message['content'])
+         question, history = history[-1][0], history[:-1]
+
+         if global_image_cnt == 1:
+             question = question.replace('<image 1><image>\n', '<image>\n')
+             history = [[item[0].replace('<image 1><image>\n', '<image>\n'), item[1]] for item in history]
+
+         assert len(max_input_tile_list) == len(pil_images), 'The lengths of max_input_tile_list and pil_images should be the same.'
+
+         old_system_message = self.model.system_message
+         self.model.system_message = system_message
+
+         transform = build_transform(input_size=self.image_size, norm_type=self.norm_type)
+         if len(pil_images) > 0:
+             max_input_tiles_limited_by_context = params['max_input_tiles']
+             # Shrink the per-image tile budget until the visual tokens fit in the context window.
+             while True:
+                 image_tiles = []
+                 for current_max_input_tiles, pil_image in zip(max_input_tile_list, pil_images):
+                     if self.model.config.dynamic_image_size:
+                         tiles = dynamic_preprocess(
+                             pil_image, image_size=self.image_size, max_num=min(current_max_input_tiles, max_input_tiles_limited_by_context),
+                             use_thumbnail=self.model.config.use_thumbnail)
+                     else:
+                         tiles = [pil_image]
+                     image_tiles += tiles
+                 if len(image_tiles) * self.per_tile_len < self.context_len:
+                     break
+                 else:
+                     max_input_tiles_limited_by_context -= 2
+
+                 if max_input_tiles_limited_by_context < 1:
+                     break
+
+             pixel_values = [transform(item) for item in image_tiles]
+             pixel_values = torch.stack(pixel_values).to(self.model.device, dtype=torch.bfloat16)
+             print(f'Split images to {pixel_values.shape}')
+         else:
+             pixel_values = None
+
+         generation_config = dict(
+             num_beams=1,
+             max_new_tokens=max_new_tokens,
+             do_sample=do_sample,
+             temperature=temperature,
+             repetition_penalty=repetition_penalty,
+             max_length=self.context_len,
+             top_p=top_p,
+         )
+
+         response = self.model.chat(
+             tokenizer=self.tokenizer,
+             pixel_values=pixel_values,
+             question=question,
+             history=history,
+             return_history=False,
+             generation_config=generation_config,
+         )
+         self.model.system_message = old_system_message
+         return {'text': response, 'error_code': 0}
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--model-path', type=str, default='nvidia/Eagle2-2B')
+     parser.add_argument('--model-name', type=str, default='Eagle2-2B')
+     parser.add_argument('--device', type=str, default='cuda')
+     parser.add_argument('--load-8bit', action='store_true')
+     args = parser.parse_args()
+     print(f'args: {args}')
+
+     worker = ModelWorker(
+         args.model_path,
+         args.model_name,
+         args.load_8bit,
+         args.device)
+ ```
+ </details>
+
+ ### 2. Prepare the Prompt
+
+ - Single image input
+ ```python
+ prompt = [
+     {'role': 'system', 'content': 'You are a helpful assistant.'},
+     {'role': 'user', 'content': 'Describe this image in detail.',
+      'image': [
+          {'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'}
+      ],
+     }
+ ]
+ ```
+
+ - Multiple image input
+ ```python
+ prompt = [
+     {'role': 'system', 'content': 'You are a helpful assistant.'},
+     {'role': 'user', 'content': 'Describe these two images in detail.',
+      'image': [
+          {'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'},
+          {'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'}
+      ],
+     }
+ ]
+ ```
+
+ - Video input
+ ```python
+ prompt = [
+     {'role': 'system', 'content': 'You are a helpful assistant.'},
+     {'role': 'user', 'content': 'Describe this video in detail.',
+      'video': [
+          'path/to/your/video.mp4'
+      ],
+     }
+ ]
+ ```
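+
+ - Pure text input (a minimal sketch following the same message schema; `generate` sets `pixel_values` to `None` when no image or video is provided)
+ ```python
+ prompt = [
+     {'role': 'system', 'content': 'You are a helpful assistant.'},
+     {'role': 'user', 'content': 'Tell me about the Eagle2 model family.'},
+ ]
+ ```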
+
+ ### 3. Generate the response
+ ```python
+ params = {
+     'prompt': prompt,
+     'max_input_tiles': 24,
+     'temperature': 0.7,
+     'top_p': 1.0,
+     'max_new_tokens': 4096,
+     'repetition_penalty': 1.0,
+ }
+ worker.generate(params)
+ ```
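+
+ `worker.generate` returns a dict of the form `{'text': response, 'error_code': 0}` (see `ModelWorker.generate` above), so the response text can be read out like this:
+ ```python
+ result = worker.generate(params)
+ print(result['text'])
+ ```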
+
+ ## TODO
+ - [ ] Support vLLM Inference
+ - [ ] Provide AWQ Quantization Weights
+ - [ ] Provide fine-tuning scripts
+
+ ## License/Terms of Use
+ - The code is released under the Apache 2.0 license as found in the [LICENSE](https://huggingface.co/NVEagle/Eagle-X5-13B-Chat/blob/main/LICENSE) file.
+ - The pretrained model weights are released under the [Creative Commons Attribution-NonCommercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) license.
+ - The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
+   - Model License of Qwen2.5-1.5B-Instruct: [Apache-2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE)
+   - Model License of PaliGemma: [Gemma license](https://ai.google.dev/gemma/terms)
+
+ ## Citation
+
+ ## Ethical Considerations
+ NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).