chenjoya committed · verified
Commit f942dd9 · 1 Parent(s): 3c5ce67

Update README.md

Files changed (1): README.md +3 -3
README.md CHANGED
@@ -207,9 +207,9 @@ for t in range(31):
 ## Limitations
 
 - This model is finetuned on LiveCC-7B-Base, which starts from Qwen2-VL-7B-Base, so it may share the limitations noted at https://huggingface.co/Qwen/Qwen2-VL-7B.
-- This model is trained only with the streaming frame-words paradigm, so it may only be capable of real-time video commentary.
-- The training ASR data is from YouTube CC, which has well-known low quality, so its formatting is not good (e.g. it cannot output punctuation).
-
+- When performing real-time video commentary, the model may collapse (e.g., fall into a repeating pattern). If you encounter this, try adjusting repetition_penalty, streaming_eos_base_threshold, and streaming_eos_threshold_step.
+- This model only has a context window of 32768 tokens. Using more visual tokens per frame (e.g. 768 * 28 * 28) gives the best performance but shortens the working duration.
+
 These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
 
 ## Citation
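The new limitation mentions streaming_eos_base_threshold and streaming_eos_threshold_step as knobs against collapse. The sketch below is purely illustrative and is not the model's actual implementation: it shows one plausible reading of a base-plus-step EOS gate, where the stopping threshold is relaxed a little at each decoding step so long utterances are increasingly encouraged to end. The function name, default values, and the exact rule are all assumptions.

```python
# Illustrative sketch only -- NOT LiveCC's actual decoding logic.
# Assumed rule: stop once p(EOS) exceeds a threshold that starts at
# `base_threshold` and is lowered by `threshold_step` every step.

def should_stop_speaking(p_eos: float, step_idx: int,
                         base_threshold: float = 0.6,
                         threshold_step: float = 0.05) -> bool:
    """Return True when the EOS probability clears the per-step threshold."""
    threshold = max(0.0, base_threshold - threshold_step * step_idx)
    return p_eos > threshold

# With a fixed EOS probability, later steps are more likely to stop:
assert should_stop_speaking(0.5, step_idx=0) is False  # 0.5 <= 0.60
assert should_stop_speaking(0.5, step_idx=3) is True   # 0.5 >  0.45
```

Under this reading, raising the base threshold or shrinking the step makes the commentator speak longer before yielding, which is the kind of adjustment the limitation suggests when output collapses.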
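The context-window limitation implies a simple trade-off: a richer per-frame visual-token budget leaves room for fewer frames inside the 32768-token window. The back-of-envelope sketch below makes that arithmetic explicit; the per-frame token counts and the text-token reserve are assumptions for illustration, not Qwen2-VL's exact token accounting.

```python
# Back-of-envelope sketch (assumed numbers, not exact Qwen2-VL accounting):
# how many frames fit in the context window for a given per-frame budget.

CONTEXT_WINDOW = 32768  # stated model context length

def max_frames(tokens_per_frame: int, reserved_for_text: int = 2048) -> int:
    """Frames that fit after reserving some budget for text tokens."""
    return (CONTEXT_WINDOW - reserved_for_text) // tokens_per_frame

# A richer per-frame budget shortens the working duration:
assert max_frames(64) > max_frames(192)
```

So at a fixed frame rate, increasing visual tokens per frame divides the maximum commentary duration proportionally, which matches the "best performance but shorter working duration" trade-off above.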