---
library_name: transformers
datasets:
- benchang1110/TaiVision-pretrain-1M
language:
- zh
pipeline_tag: image-text-to-text
---
# Model Card for TaiVisionLM-base-v1
![TaivisionLM](TaivisionLM.png)
## Model Details
### TaiVisionLM: The First of Its Kind! 🚀
🌟 TaiVisionLM (台視, "Taiwan's visual language model") is a small visual language model (only 1.2B parameters) on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟
✨ Developed to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and use for lightning-fast inference without needing any external libraries! ⚡️
Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖
---
### Model Description
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/google/siglip-base-patch16-224) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. A vision projector connects the two modalities.
Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
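For a concrete picture of how the projector ties the two backbones together, here is a minimal sketch. The class name, hidden sizes, and the single-linear-layer design are illustrative assumptions, not taken from this repository's code:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Sketch of a projector mapping SigLIP patch embeddings into the LM embedding space."""

    def __init__(self, vision_hidden_size: int = 768, lm_hidden_size: int = 2048):
        super().__init__()
        # A single linear layer here; the projector actually used in this repo may be deeper.
        self.proj = nn.Linear(vision_hidden_size, lm_hidden_size)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_hidden) -> (batch, num_patches, lm_hidden)
        return self.proj(patch_embeddings)

# The projected image tokens are placed in front of the text token embeddings,
# so the language model attends to the image as a prefix, PaliGemma-style.
```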
Here's a summary of the development process:
1) **Unimodal pretraining**
   - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) and a language model we trained ourselves ([Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat)).
2) **Feature Alignment**
   - I train the vision projector with full-parameter finetuning and finetune the language model with LoRA on 1M image-text pairs to align visual and textual features.
3) **Task-Specific Training**
   - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering. This stage will start once the dataset is ready!
- **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Traditional Chinese*
---
## How to Get Started with the Model
In Transformers, you can load the model and run inference as follows:
**IMPORTANT NOTE:** The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set `trust_remote_code=True` when loading it. This will download the `configuration_taivisionlm.py`, `modeling_taivisionlm.py`, and `processing_taivisionlm.py` files from the repo. You can inspect the content of these files under the *Files and versions* tab and pin specific versions if you have any concerns about malicious code.
```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# trust_remote_code=True pulls the custom configuration, modeling, and processing files from the repo.
config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
model.eval()

# Fetch an example image and ask the model to describe it in Traditional Chinese.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"

inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to("cuda")
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
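The decoded string above includes the prompt and the tokenizer's special tokens. As a small optional variant (not part of the original example), you can drop the special tokens when decoding:

```python
# Reuses `model`, `processor`, and `inputs` from the snippet above.
generated_ids = model.generate(**inputs, max_length=512)
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```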
### Training Procedure
The following training hyperparameters are used in the feature alignment and task-specific training stages, respectively:
- **Feature Alignment**
| Data size | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
|--------------|-------------------|---------------|--------|------------|--------------|
| 1M | 16 | 5e-5 | 1 | 2048 | 1e-5 |
We use full-parameter finetuning for the projector and apply LoRA to the language model.
![metric](metrics.png)
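As a rough illustration of the finetuning setup described above (full-parameter training for the projector, LoRA for the language model), here is a minimal sketch using the `peft` library. The LoRA rank, alpha, dropout, and target modules are assumptions; the card does not state the exact values used:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Language model backbone; LoRA adapters are added on top of it.
language_model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/Taiwan-tinyllama-v1.0-chat", torch_dtype=torch.float16
)

# Hypothetical LoRA settings (not stated in this card).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
language_model = get_peft_model(language_model, lora_config)
language_model.print_trainable_parameters()

# The vision projector, by contrast, keeps all of its parameters trainable, e.g.:
# for param in vision_projector.parameters():
#     param.requires_grad = True
```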
### Compute Infrastructure
- **Feature Alignment**
1× V100 (32 GB); training took approximately 16 GPU hours.