benchang1110/TaiVisionLM-base-v1

@@ -9,18 +9,15 @@ pipeline_tag: image-text-to-text
 # Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ## English
 # TaiVisionLM: The First of Its Kind! 🚀
-🌟 This is a very fast and small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟
-✨ Built to be compatible with the Transformers library, TaiVisionLM is a breeze to load, fine-tune, and use for lightning-fast inference, all without needing any external libraries! ⚡️
 Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖
@@ -28,32 +25,38 @@ Ready to experience the Traditional Chinese visual language model? Let's go!
 ## Traditional Chinese
 # 臺視: 首創獨一無二的視覺語言模型!! 🚀
-🌟 TaiVisionLM 是一個非常快速且小巧的視覺語言模型（僅有 12 億參數），在 Hugging Face 上可以根據圖像輸入來回應繁體中文指令！🌟
-✨ TaiVisionLM 與 Transformers 完全相容，易於載入、微調和使用，用於快速推理——不需要任何外部庫！⚡️
-準備好體驗這個繁體中文視覺語言模型了嗎？讓我們開始吧！🖼️🤖
 ---
-# Model Details
 ## English
-This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. The vision projector connects the two modalities together.
 Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
 Here's the summary of the development process:
 1) **Unimodal pretraining**
-    - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) and a language model we trained ourselves, [Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
 2) **Feature Alignment**
-    - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train the vision projector using 1B image-text pairs to align visual and textual features.
 3) **Task Specific Training**
-    - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
     This stage will start once the dataset is ready!
 ## 中文
 這個模型是一個多模態的語言模型，結合了 [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) 作為其視覺編碼器，並使用 [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) 作為語言模型。視覺投影器將這兩種模態結合在一起。
 其架構與 [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma) 非常相似。
@@ -63,18 +66,15 @@ Here's the summary of the development process:
 1) **單模態預訓練**
    - 在這個階段，我利用了 [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) 的圖像編碼器，以及我們自己訓練的語言模型（[Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat)）。
 2) **特徵對齊**
-   - 根據 [LLaVA](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train)，我使用 10 億個圖文配對數據來訓練視覺投影器，以對齊視覺和文本特徵。
 3) **任務特定訓練**
-   - 對齊後的模型進行進一步的訓練，針對短描述、詳細描述和簡單視覺問答等任務，使用超過 100 萬組圖像-提示-完成三元組數據進行訓練。我們將在數據集準備好後進行這一階段的訓練！
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
-- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
-- **Language(s) (NLP):** *Traditional Chinese*
 ---
 ## How to Get Started with the Model

 # Model Card for Model ID
 ## Model Details
 ## English
 # TaiVisionLM: The First of Its Kind! 🚀
+🌟 This is a small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟
+✨ Built to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and use for fast inference without needing any external libraries! ⚡️
 Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖
 ## Traditional Chinese
 # 臺視: 首創獨一無二的視覺語言模型!! 🚀
+🌟 TaiVisionLM 是一個小型的視覺語言模型（僅有 12 億參數），可以根據圖像輸入來回覆繁體中文指令！🌟
+✨ TaiVisionLM 可以用 transformers 載入、微調和使用！⚡️
+準備好體驗"臺視"了嗎？讓我們開始吧！🖼️🤖
 ---
+### Model Description
 ## English
+This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/google/siglip-base-patch16-224) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. The vision projector connects the two modalities together.
 Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
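
As a rough illustration of how a PaliGemma-style projector bridges the two models, the sketch below works out the sequence shapes involved. The sizes are assumptions taken from the public configurations of the base checkpoints (SigLIP base patch16-224: 224x224 input, 16x16 patches, hidden width 768; TinyLlama 1.1B: hidden width 2048), and the single linear projector is borrowed from PaliGemma rather than stated in this card. `VisionLMShapes` is a hypothetical helper, not part of the released model code.

```python
from dataclasses import dataclass


@dataclass
class VisionLMShapes:
    """Hypothetical container for the sizes assumed in this sketch."""
    image_size: int = 224      # SigLIP base input resolution (assumed)
    patch_size: int = 16       # SigLIP base patch size (assumed)
    vision_hidden: int = 768   # SigLIP base hidden width (assumed)
    text_hidden: int = 2048    # TinyLlama 1.1B hidden width (assumed)

    def num_image_tokens(self) -> int:
        # One token per non-overlapping patch of the input image.
        per_side = self.image_size // self.patch_size
        return per_side * per_side

    def projector_params(self) -> int:
        # A single linear layer (assumed): weight matrix plus bias.
        return self.vision_hidden * self.text_hidden + self.text_hidden

    def sequence_length(self, num_text_tokens: int) -> int:
        # Projected image tokens are placed in front of the text tokens.
        return self.num_image_tokens() + num_text_tokens


shapes = VisionLMShapes()
print(shapes.num_image_tokens())   # 196
print(shapes.projector_params())   # 1574912
print(shapes.sequence_length(32))  # 228
```

Under these assumptions, every image costs 196 positions of the language model's context, and the projector itself is tiny compared with the 1.2B total parameters.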
 Here's the summary of the development process:
 1) **Unimodal pretraining**
+    - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224) and a language model we trained ourselves, [Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
 2) **Feature Alignment**
+    - I train the vision projector and fine-tune the language model with LoRA on 1B image-text pairs to align visual and textual features.
 3) **Task Specific Training**
+    - The aligned model will undergo further training for tasks such as short captioning, detailed captioning, and simple visual question answering.
     This stage will start once the dataset is ready!
+- **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
+- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
+- **Language(s) (NLP):** *Traditional Chinese*
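
For the LoRA fine-tuning mentioned under Feature Alignment, here is a minimal pure-Python sketch of the general idea: the pretrained weight stays frozen and only two small low-rank matrices are trained. The rank, scaling factor, and toy weight matrix are hypothetical values chosen for illustration; the card does not state the actual LoRA configuration, and `LoRALinear` is not the author's training code.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weight
        self.scale = alpha / rank       # standard LoRA scaling
        # A starts small and random, B starts at zero, so the update is zero
        # at initialization and training begins from the pretrained behavior.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x3 weight; rank and alpha are hypothetical.
weight = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 0.0],
          [3.0, 0.0, 1.0],
          [0.0, 2.0, 0.0]]
layer = LoRALinear(weight, rank=2, alpha=4.0)
x = [1.0, 2.0, 3.0]
print(layer.forward(x) == matvec(weight, x))  # True: B starts at zero
print(layer.trainable_params())               # 14
```

Only `A` and `B` would receive gradient updates, which is why LoRA makes it practical to adapt the language model alongside the projector without retraining all of its weights.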
 ## 中文
 這個模型是一個多模態的語言模型，結合了 [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) 作為其視覺編碼器，並使用 [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) 作為語言模型。視覺投影器將這兩種模態結合在一起。
 其架構與 [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma) 非常相似。
 1) **單模態預訓練**
    - 在這個階段，我利用了 [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) 的圖像編碼器，以及我們自己訓練的語言模型（[Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat)）。
 2) **特徵對齊**
+   - 我使用了10億個圖片和文本的配對來訓練圖像投影器 (visual projector)，並使用 LoRA 來微調語言模型的權重。
 3) **任務特定訓練**
+   - 對齊後的模型將進行進一步的訓練，針對短描述、詳細描述和簡單視覺問答等任務。我們將在數據集準備好後進行這一階段的訓練！
+- **創作者:** [benchang1110](https://huggingface.co/benchang1110)
+- **模型類型:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
+- **語言:** *Traditional Chinese*
 ---
 ## How to Get Started with the Model