Improve model card: add project page, specify task, add library name

#4
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +16 -10
README.md CHANGED
@@ -1,13 +1,14 @@
 ---
-pipeline_tag: image-text-to-text
-license: apache-2.0
 base_model:
 - Qwen/Qwen2.5-7B-Instruct
+datasets:
+- HuggingFaceFV/finevideo
 language:
 - en
 - zh
-datasets:
-- HuggingFaceFV/finevideo
+license: apache-2.0
+pipeline_tag: multi-modality
+library_name: transformers
 ---
 
 # Ola-7B
@@ -19,9 +20,10 @@ Based on Qwen2.5 language model, it is trained on text, image, video and audio d
 
 Ola offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.
 
-- **Repository:** https://github.com/Ola-Omni/Ola
-- **Languages:** English, Chinese
-- **Paper:** https://huggingface.co/papers/2502.04328
+- **Project Page:** https://ola-omni.github.io/
+- **Repository:** https://github.com/Ola-Omni/Ola
+- **Languages:** English, Chinese
+- **Paper:** https://huggingface.co/papers/2502.04328
 
 ## Use
 
@@ -177,11 +179,11 @@ def ola_inference(multimodal, audio_path):
     else:
         qs = ''
     if USE_SPEECH and audio_path:
-        qs = DEFAULT_IMAGE_TOKEN + "\n" + "User's question in speech: " + DEFAULT_SPEECH_TOKEN + '\n'
+        qs = DEFAULT_IMAGE_TOKEN + "\n" + "User's question in speech: " + DEFAULT_SPEECH_TOKEN + '\n'
     elif USE_SPEECH:
-        qs = DEFAULT_SPEECH_TOKEN + DEFAULT_IMAGE_TOKEN + "\n" + qs
+        qs = DEFAULT_SPEECH_TOKEN + DEFAULT_IMAGE_TOKEN + "\n" + qs
     else:
-        qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
+        qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
 
     conv = conv_templates[conv_mode].copy()
     conv.append_message(conv.roles[0], qs)