Add model card #3
opened by nielsr (HF Staff)

README.md (CHANGED)
@@ -2,7 +2,7 @@
 license: apache-2.0
 base_model:
 - Qwen/Qwen2.5-7B-Instruct
-pipeline_tag: any-to-
 language:
 - en
 - zh

@@ -13,20 +13,20 @@ language:
 ## Model Summary

 The Ola-7B model is developed by people from Tencent, Tsinghua University and Nanyang Technological University.
-Based on Qwen2.5 language model, it is trained on text, image, video and audio data with a context window of 32K tokens. It can take both image/video, text and audio as input and output text

 Ola offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.

 - **Repository:** https://github.com/Ola-Omni/Ola
 - **Languages:** English, Chinese
-- **Paper:** https://

 ## Use

 1. Download the speech encoder at https://huggingface.co/THUdyh/Ola_speech_encoders.
 2. Replace the path in config.json with local path of speech encoders.

-We provide a simple generation process for using our model. For more details, please refer to our [Github Repo](

 ```
 import os

@@ -299,19 +299,21 @@ def ola_inference(multimodal, audio_path):
 return outputs, None
 ```

-
-
-- **Data:** a mixture of more than 5M image/video/audio data, training for 3 stage.
-- **Precision:** BFloat16

 #### Hardware & Software

-
-
-

 ## Citation
 @article{liu2025ola,

@@ -320,3 +322,154 @@ author={Liu, Zuyan and Dong, Yuhao and Wang, Jiahui and Liu, Ziwei and Hu, Winst
 journal={arXiv preprint arXiv:2502.04328},
 year={2025}
 }
@@ -2,7 +2,7 @@
 license: apache-2.0
 base_model:
 - Qwen/Qwen2.5-7B-Instruct
+pipeline_tag: any-to-text
 language:
 - en
 - zh

@@ -13,20 +13,20 @@ language:
 ## Model Summary

 The Ola-7B model is developed by people from Tencent, Tsinghua University and Nanyang Technological University.
+Based on the Qwen2.5 language model, it is trained on text, image, video and audio data with a context window of 32K tokens. It can take image/video, text and audio as input and outputs text.

 Ola offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths.

 - **Repository:** https://github.com/Ola-Omni/Ola
 - **Languages:** English, Chinese
+- **Paper:** https://huggingface.co/papers/2502.04328

 ## Use

 1. Download the speech encoder at https://huggingface.co/THUdyh/Ola_speech_encoders.
 2. Replace the path in config.json with local path of speech encoders.

+We provide a simple generation process for using our model. For more details, please refer to our [Github Repo](https://github.com/Ola-Omni/Ola).

 ```
 import os

@@ -299,19 +299,21 @@ def ola_inference(multimodal, audio_path):
 return outputs, None
 ```
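The two setup steps in the "Use" section above can be sketched in code. This is a minimal sketch, not Ola's actual loader: the encoder-path key names (`speech_encoder`, `music_encoder`) and the filename `large-v3.pt` are assumptions, so check the keys actually present in your config.json.

```python
import json
from pathlib import PurePosixPath

def patch_config(cfg: dict, enc_dir: str, keys=("speech_encoder", "music_encoder")) -> dict:
    """Return a copy of cfg whose encoder paths point into enc_dir.
    The key names are assumptions; inspect your config.json for the real ones."""
    out = dict(cfg)
    for key in keys:
        if key in out:
            # keep the original filename, swap in the local directory
            out[key] = str(PurePosixPath(enc_dir) / PurePosixPath(out[key]).name)
    return out

# Demo on a stand-in config (the real config.json is far larger):
demo = {"speech_encoder": "large-v3.pt", "hidden_size": 3584}
print(json.dumps(patch_config(demo, "/data/ola/speech_encoders")))
# → {"speech_encoder": "/data/ola/speech_encoders/large-v3.pt", "hidden_size": 3584}
```

After downloading the THUdyh/Ola_speech_encoders repository locally, apply the same rewrite to the real config.json and save it back.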
 
+### Model Architecture
+
+- **Architecture:** Pre-trained [Oryx-ViT](https://huggingface.co/THUdyh/Oryx-ViT) + Qwen2.5-7B
+- **Data:** a mixture of more than 5M image/video/audio samples, trained in 3 stages
+- **Precision:** BFloat16

 #### Hardware & Software

+- **Hardware:** 64 \* NVIDIA Tesla A100
+- **Orchestration:** HuggingFace Trainer
+- **Code:** PyTorch

 ## Citation
 @article{liu2025ola,

@@ -320,3 +322,154 @@ author={Liu, Zuyan and Dong, Yuhao and Wang, Jiahui and Liu, Ziwei and Hu, Winst
 journal={arXiv preprint arXiv:2502.04328},
 year={2025}
 }
+
+# File information
+
+The repository contains the following file information:
+
+Filename: generation_config.json
+Content: {
+  "attn_implementation": "flash_attention_2",
+  "bos_token_id": 151643,
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "pad_token_id": 151643,
+  "repetition_penalty": 1.05,
+  "temperature": 0.7,
+  "top_k": 20,
+  "top_p": 0.8,
+  "transformers_version": "4.43.4"
+}
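The sampling knobs above (temperature 0.7, top_k 20, top_p 0.8) combine as in this pure-Python sketch of standard top-k plus nucleus filtering. It illustrates the semantics of the config values, not Ola's or transformers' actual decoding code.

```python
import math

def keep_tokens(logits, top_k=20, top_p=0.8, temperature=0.7):
    """Indices that survive top-k then nucleus (top-p) filtering, highest first.
    A sketch of the standard meaning of these knobs, not the library's code."""
    scaled = [x / temperature for x in logits]
    # top-k cut: the k highest-scoring token indices
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:top_k]
    # softmax over the surviving logits
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    z = sum(exps)
    kept, cum = [], 0.0
    for i, e in zip(order, exps):
        kept.append(i)
        cum += e / z
        if cum >= top_p:  # smallest prefix whose probability mass reaches top_p
            break
    return kept

print(keep_tokens([2.0, 1.0, 0.5, -1.0], top_k=3))  # → [0, 1]
```

With these values a token is sampled from the surviving set; `repetition_penalty` (1.05) would additionally down-weight already-generated tokens before this step.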
+
+Filename: merges.txt
+Content: "Content of the file is larger than 50 KB, too long to display."
+
+Filename: special_tokens_map.json
+Content: {
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|mm_pad|>"
+}
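The `<|im_start|>`/`<|im_end|>` tokens above are Qwen's ChatML delimiters. The tokenizer's `apply_chat_template` assembles prompts roughly like this simplified sketch (a default system message, which the real template may prepend, is omitted here):

```python
def chatml_prompt(messages):
    """Simplified stand-in for the tokenizer's apply_chat_template: wrap each
    message in the <|im_start|>/<|im_end|> special tokens listed above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"  # cue the model to answer

print(chatml_prompt([{"role": "user", "content": "Describe the audio."}]))
```

The `<|im_end|>` token doubles as the EOS token (id 151645 in generation_config.json), which is how generation stops at the end of the assistant turn.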
+
+Filename: model.safetensors.index.json
+Content: "Content of the file is larger than 50 KB, too long to display."
+
+Filename: config.json
+Content: "Content of the file is larger than 50 KB, too long to display."
+
+Filename: vocab.json
+Content: "Content of the file is larger than 50 KB, too long to display."
+
+Filename: tokenizer_config.json
+Content: "Content of the file is larger than 50 KB, too long to display."
+
+# Project page
+
+The project page URL we found has the following URL:
+
+# Github README
+
+The Github README we found contains the following content:
+
+<div align="center">
+
+<img src="assets/logo.png" width="30%"/>
+
+# OLA: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
+
+Join our [WeChat](http://imagebind-llm.opengvlab.com/qrcode/)
+[[Project Page](https://ola-omni.github.io/)] [[Demo](http://106.14.2.150:10020/)]
+
+</div>
+
+<img src="assets/teaser.png" width="100%"/>
+
+## News
+* [2025/02/07] Initial codebase for eval and training will be released ASAP! Thanks for your attention.
+
+## Model Zoo
+1. Speech-Visual Data
+   * [ ] image+text with local audio caption.
+   * [ ] videos from webvid2.5m with audio caption.
+2. Visual Tokenizer
+   * [ ] Imagebind small.
+   * [ ] Oryx-ViT 18B-1152.
+3. Training Pipeline
+   * [ ] image+text stage.
+   * [ ] audio+image+text stage.
+   * [ ] video+audio+image+text stage.
+
+## TODO
+- [ ] Multi Stage Training
+
+## Installation
+
+See [INSTALL.md](docs/INSTALL.md) for detailed instructions.
+
+## Quick Inference Code
+
+- Check out the [quick inference script](example/inference/image_audio.ipynb) using visual and audio data!
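The quick-inference entry point shown (truncated) earlier in this card has the signature `ola_inference(multimodal, audio_path)` and returns `(text, None)`. A hypothetical call might be wired like this; the field names in `multimodal` are assumptions, so follow the linked notebook for the real input format:

```python
# Hypothetical input wiring; see example/inference/image_audio.ipynb for the
# actual structure that ola_inference expects.
multimodal = {
    "modality": "image",                              # or "video"
    "path": "example/image.png",                      # visual input
    "text": "<image>\nDescribe what you see and hear.",
}
audio_path = "example/sound.wav"
# outputs, _ = ola_inference(multimodal, audio_path)  # returns (text, None)
print(multimodal["modality"])  # → image
```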
+
+## Citation
+```
+@article{liu2025ola,
+  title={Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment},
+  author={Liu, Zuyan and Dong, Yuhao and Wang, Jiahui and Liu, Ziwei and Hu, Winston and Lu, Jiwen and Rao, Yongming},
+  journal={arXiv preprint arXiv:2502.04328},
+  year={2025}
+}
+```
+
+## Acknowledgement
+- This project has been built using the great codebases of [Qwen](https://github.com/QwenLM/Qwen), [Video-LLaVA](https://github.com/mbai-xiao/Video-LLaVA), and [OpenFlamingo](https://github.com/mlfoundations/open_flamingo). We thank the authors for their wonderful works.
+
+## Contact
+- If you have any questions, feel free to open issues or pull requests.