# LiveCC-7B-Instruct

## Introduction

We introduce LiveCC, the first multimodal LLM with real-time video commentary capability, which is also strong at general image and video tasks.

- Project Page: https://showlab.github.io/livecc

> [!Important]
> This is the SFT model. The base model is at [LiveCC-7B-Base](https://huggingface.co/chenjoya/LiveCC-7B-Base).

## Training with Streaming Frame-Words Paradigm

![image/png](https://cdn-uploads.huggingface.co/production/uploads/642435a1a3adbc7142c3b0a6/T-Zs50VlFT2tE7RdV49TE.png)

## Quickstart

Like qwen-vl-utils, we offer a toolkit that helps you handle various types of visual input more conveniently, **especially video streaming inputs**. You can install it with:

```bash
pip install qwen-vl-utils livecc-utils
```

Here is a code snippet showing how to do **real-time video commentary** with `transformers` and the utils above:

```python
import functools, torch, os, tqdm
from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
apply_liger_kernel_to_qwen2_vl()
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, LogitsProcessor, logging
from livecc_utils import prepare_multiturn_multimodal_inputs_for_generation, get_smart_resized_clip, get_smart_resized_video_reader
from qwen_vl_utils import process_vision_info

logger = logging.get_logger(__name__)

class ThresholdLogitsProcessor(LogitsProcessor):
    """Suppress the streaming EOS token (' ...') until its probability exceeds a (step-wise relaxed) threshold."""
    def __init__(self, token_id: int, base_threshold: float, step: float):
        self.token_id = token_id
        self.base_threshold = base_threshold
        self.step = step
        self.count = 0

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        threshold = self.base_threshold + self.step * self.count
        low_confidence = torch.softmax(scores, dim=-1)[:, self.token_id] <= threshold
        if low_confidence.any():
            scores[low_confidence, self.token_id] = -float('inf')
        self.count += 1
        return scores

class LiveCCDemoInfer:
    VIDEO_PLAY_END = object()
    VIDEO_PLAY_CONTINUE = object()
    fps = 2
    initial_fps_frames = 6
    streaming_fps_frames = 2
    initial_time_interval = initial_fps_frames / fps
    streaming_time_interval = streaming_fps_frames / fps
    frame_time_interval = 1 / fps

    def __init__(self, model_path: str = None, device_id: int = 0):
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_path, torch_dtype="auto",
            device_map=f'cuda:{device_id}',
            attn_implementation='flash_attention_2'
        )
        self.processor = AutoProcessor.from_pretrained(model_path, use_fast=False)
        self.streaming_eos_token_id = self.processor.tokenizer(' ...').input_ids[-1]
        self.model.prepare_inputs_for_generation = functools.partial(prepare_multiturn_multimodal_inputs_for_generation, self.model)
        message = {
            "role": "user",
            "content": [
                {"type": "text", "text": 'livecc'},
            ]
        }
        texts = self.processor.apply_chat_template([message], tokenize=False)
        self.system_prompt_offset = texts.index('<|im_start|>user')
        self._cached_video_readers_with_hw = {}

    @torch.inference_mode()
    def live_cc(
        self,
        query: str,
        state: dict,
        max_pixels: int = 384 * 28 * 28,
        default_query: str = 'Please describe the video.',
        do_sample: bool = False,
        repetition_penalty: float = 1.05,
        streaming_eos_base_threshold: float = None,
        streaming_eos_threshold_step: float = None,
        **kwargs,
    ):
        """
        state: dict, (maybe) with keys:
            video_path: str, video path
            video_timestamp: float, current video timestamp
            last_timestamp: float, last processed video timestamp
            last_video_pts_index: int, last processed video frame index
            video_pts: torch.Tensor, video frame timestamps
            last_history: list, last processed history
        """
        # 1. preparation: video_reader, and last processing info
        video_timestamp, last_timestamp = state.get('video_timestamp', 0), state.get('last_timestamp', -1 / self.fps)
        video_path = state['video_path']
        if video_path not in self._cached_video_readers_with_hw:
            self._cached_video_readers_with_hw[video_path] = get_smart_resized_video_reader(video_path, max_pixels)
            video_reader = self._cached_video_readers_with_hw[video_path][0]
            video_reader.get_frame_timestamp(0)
            state['video_pts'] = torch.from_numpy(video_reader._frame_pts[:, 1])
            state['last_video_pts_index'] = -1
        video_pts = state['video_pts']
        if last_timestamp + self.frame_time_interval > video_pts[-1]:
            state['video_end'] = True
            return
        video_reader, resized_height, resized_width = self._cached_video_readers_with_hw[video_path]
        last_video_pts_index = state['last_video_pts_index']

        # 2. decide which frames will be processed
        initialized = last_timestamp >= 0
        if not initialized:
            video_timestamp = max(video_timestamp, self.initial_time_interval)
        if video_timestamp <= last_timestamp + self.frame_time_interval:
            return
        timestamps = torch.arange(last_timestamp + self.frame_time_interval, video_timestamp, self.frame_time_interval)

        # 3. fetch frames at the required timestamps
        clip, clip_timestamps, clip_idxs = get_smart_resized_clip(video_reader, resized_height, resized_width, timestamps, video_pts, video_pts_index_from=last_video_pts_index+1)
        state['last_video_pts_index'] = clip_idxs[-1]
        state['last_timestamp'] = clip_timestamps[-1]

        # 4. organize frames into interleaved chunks
        interleave_clips, interleave_timestamps = [], []
        if not initialized:
            interleave_clips.append(clip[:self.initial_fps_frames])
            interleave_timestamps.append(clip_timestamps[:self.initial_fps_frames])
            clip = clip[self.initial_fps_frames:]
            clip_timestamps = clip_timestamps[self.initial_fps_frames:]
        if len(clip) > 0:
            interleave_clips.extend(list(clip.split(self.streaming_fps_frames)))
            interleave_timestamps.extend(list(clip_timestamps.split(self.streaming_fps_frames)))

        # 5. make conversation and send to model
        for clip, timestamps in zip(interleave_clips, interleave_timestamps):
            start_timestamp, stop_timestamp = timestamps[0].item(), timestamps[-1].item() + self.frame_time_interval
            message = {
                "role": "user",
                "content": [
                    {"type": "text", "text": f'Time={start_timestamp:.1f}-{stop_timestamp:.1f}s'},
                    {"type": "video", "video": clip}
                ]
            }
            if not query and not state.get('query', None):
                query = default_query
                logger.warning(f'No query provided, use default_query={default_query}')
            if query and state.get('query', None) != query:
                message['content'].append({"type": "text", "text": query})
                state['query'] = query
            texts = self.processor.apply_chat_template([message], tokenize=False, add_generation_prompt=True, return_tensors='pt')
            past_ids = state.get('past_ids', None)
            if past_ids is not None:
                texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
            inputs = self.processor(
                text=texts,
                images=None,
                videos=[clip],
                return_tensors="pt",
                return_attention_mask=False
            )
            inputs.to('cuda')
            if past_ids is not None:
                inputs['input_ids'] = torch.cat([past_ids, inputs.input_ids], dim=1)
            if streaming_eos_base_threshold is not None:
                logits_processor = [ThresholdLogitsProcessor(self.streaming_eos_token_id, streaming_eos_base_threshold, streaming_eos_threshold_step)]
            else:
                logits_processor = None
            outputs = self.model.generate(
                **inputs, past_key_values=state.get('past_key_values', None),
                return_dict_in_generate=True, do_sample=do_sample,
                repetition_penalty=repetition_penalty,
                logits_processor=logits_processor,
            )
            state['past_key_values'] = outputs.past_key_values
            state['past_ids'] = outputs.sequences[:, :-1]
            yield (start_timestamp, stop_timestamp), self.processor.decode(outputs.sequences[0, inputs.input_ids.size(1):], skip_special_tokens=True), state

model_path = 'chenjoya/LiveCC-7B-Instruct'
video_path = "spacex_falcon9.mp4"
query = """Let's wait together!"""

infer = LiveCCDemoInfer(model_path=model_path)
state = {'video_path': video_path}
commentaries = []
for t in range(31):
    state['video_timestamp'] = t
    for (start_t, stop_t), response, state in infer.live_cc(
        query=query, state=state,
        max_pixels = 512 * 28 * 28, repetition_penalty=1.05,
        streaming_eos_base_threshold=0.0, streaming_eos_threshold_step=0
    ):
        print(f'{start_t}s-{stop_t}s: {response}')
        commentaries.append([start_t, stop_t, response])
    if state.get('video_end', False):
        break
```
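For intuition, the chunking schedule in steps 2 and 4 above can be sketched in plain Python: the first turn consumes `initial_fps_frames` frames (3 s of video at 2 fps), and every later turn consumes `streaming_fps_frames` frames (1 s each). The helper below is purely illustrative and not part of `livecc_utils`:

```python
FPS = 2
INITIAL_FRAMES = 6    # first turn covers 3 s of video
STREAMING_FRAMES = 2  # each later turn covers 1 s

def chunk_frames(num_frames: int) -> list[int]:
    """Return the per-turn frame counts for a stream of num_frames frames."""
    chunks = []
    if num_frames >= INITIAL_FRAMES:
        # initial turn: a larger window so the model sees enough context
        chunks.append(INITIAL_FRAMES)
        num_frames -= INITIAL_FRAMES
    while num_frames > 0:
        # streaming turns: small fixed-size windows for low latency
        take = min(STREAMING_FRAMES, num_frames)
        chunks.append(take)
        num_frames -= take
    return chunks

# A 10-second clip at 2 fps yields 20 frames: one 3 s initial turn,
# then seven 1 s streaming turns.
print(chunk_frames(20))  # [6, 2, 2, 2, 2, 2, 2, 2]
```

Each chunk becomes one `Time=start-stop s` user turn in the conversation, which is why the commentary arrives roughly once per second after the initial window.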

## Limitations

- This model starts from Qwen2-VL-7B-Base, so it may share the limitations noted at https://huggingface.co/Qwen/Qwen2-VL-7B.
- This model is trained only with the streaming frame-words paradigm, so it may only be capable of real-time video commentary.
- The training ASR data comes from YouTube CC, whose quality is known to be low, so the output formatting is imperfect (e.g., it cannot output punctuation).

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

## Citation

If you find our work helpful, feel free to cite us.

```bibtex
@inproceedings{livecc,
    author    = {Joya Chen and Ziyun Zeng and Yiqi Lin and Wei Li and Zejun Ma and Mike Zheng Shou},
    title     = {LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale},
    booktitle = {CVPR},
    year      = {2025},
}
```