Add pipeline tag, library name and paper link

#1
opened by nielsr (HF Staff)

Files changed (1): README.md (+12 -6)
```diff
@@ -1,25 +1,29 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen2-VL-7B
 datasets:
 - chenjoya/Live-CC-5M
 language:
 - en
-base_model:
-- Qwen/Qwen2-VL-7B
+license: apache-2.0
 tags:
 - qwen_vl
 - video
 - real-time
 - multimodal
 - LLM
+pipeline_tag: video-text-to-text
+library_name: transformers
 ---
+
 # LiveCC-7B-Base
 
 ## Introduction
 
-We introduce LiveCC, the first video LLM capable of real-time commentary, trained with a novel video-ASR streaming method, SOTA on both streaming and offline benchmarks.
+We introduce LiveCC, the first video LLM capable of real-time commentary, trained with a novel video-ASR streaming method, achieving SOTA on both streaming and offline benchmarks. The model takes video and text as input and generates text as output.
 
 - Project Page: https://showlab.github.io/livecc
+- Paper: [LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale](https://huggingface.co/papers/2504.16030)
 
 > [!Important]
 > This is the Base model. The base model is at [LiveCC-7B-Instruct](https://huggingface.co/chenjoya/LiveCC-7B-Instruct).
@@ -152,7 +156,8 @@ class LiveCCDemoInfer:
         texts = self.processor.apply_chat_template([message], tokenize=False, add_generation_prompt=True, return_tensors='pt')
         past_ids = state.get('past_ids', None)
         if past_ids is not None:
-            texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
+            texts = '<|im_end|>
+' + texts[self.system_prompt_offset:]
         inputs = self.processor(
             text=texts,
             images=None,
@@ -274,7 +279,8 @@ class LiveCCDemoInfer:
         image_inputs, video_inputs = process_vision_info(conversation)
         texts = self.processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, return_tensors='pt')
         if past_ids is not None:
-            texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
+            texts = '<|im_end|>
+' + texts[self.system_prompt_offset:]
         inputs = self.processor(
             text=texts,
             images=image_inputs,
```