# Model Card for Model ID

![TaivisionLM]()
## Model Details

## English
Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖

## 繁體中文 (Traditional Chinese)

# 臺視 (TaiVisionLM): a first-of-its-kind visual language model!! 🚀

🌟 TaiVisionLM is a small visual language model (only 1.2 billion parameters) that can respond to Traditional Chinese instructions based on an image input! 🌟
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Traditional Chinese*

## 繁體中文 (Traditional Chinese)

This model is a multimodal language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model; a vision projector ties the two modalities together.
Its architecture is very similar to [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
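
To make this composition concrete, here is a conceptual sketch of the PaliGemma-style forward pass in PyTorch; the function and attribute names are illustrative placeholders rather than the model's actual API:

```python
import torch

# Conceptual sketch only: `vision_tower`, `projector`, and `language_model`
# are placeholders, not the repository's actual module names.
def fuse_image_and_text(vision_tower, projector, language_model, pixel_values, input_ids):
    # 1. Encode the image into patch features with the SigLIP vision encoder.
    image_features = vision_tower(pixel_values).last_hidden_state
    # 2. Project the patch features into the language model's embedding space.
    image_embeds = projector(image_features)
    # 3. Embed the text tokens with the TinyLlama embedding table.
    text_embeds = language_model.get_input_embeddings()(input_ids)
    # 4. Prepend the projected image tokens to the text embeddings and run the LM.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    return language_model(inputs_embeds=inputs_embeds)
```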
## How to Get Started with the Model

## English

In Transformers, you can load the model and run inference as follows:

**IMPORTANT NOTE:** The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set ```trust_remote_code=True``` when loading the model. This will download the ```configuration_taivisionlm.py```, ```modeling_taivisionlm.py``` and ```processing_taivisionlm.py``` files from the repo. You can check the content of these files under the *Files and Versions* tab and pin specific versions if you have any concerns regarding malicious code.
```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

config = AutoConfig.from_pretrained("benchang1110/TaiVision-base",trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVision-base",trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("benchang1110/TaiVision-base",trust_remote_code=True,torch_dtype=torch.float16,attn_implementation="sdpa").to('cuda')
model.eval()
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"
inputs = processor(text=text,images=image, return_tensors="pt",padding=False).to('cuda')
outputs = processor.tokenizer.decode(model.generate(**inputs,max_length=512)[0])
print(outputs)
```
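
If you have reviewed the remote code and want to make sure it cannot change underneath you, you can pin the repository to a specific commit via the `revision` argument of `from_pretrained`; a minimal sketch, where the revision string is a placeholder to replace with a real commit hash from the repo history:

```python
import torch
from transformers import AutoModelForCausalLM

# "your-pinned-commit-hash" is a placeholder; substitute a commit hash from the
# *Files and Versions* history so the downloaded modeling/processing files stay fixed.
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVision-base",
    trust_remote_code=True,
    revision="your-pinned-commit-hash",
    torch_dtype=torch.float16,
).to('cuda')
```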
## 中文 (Chinese)

Using transformers, you can run inference with the following code:

**IMPORTANT NOTE:** 台視 (TaiVisionLM) has not yet been integrated into transformers, so you need to set ```trust_remote_code=True``` when downloading the model. Loading the model will use the three files ```configuration_taivisionlm.py```, ```modeling_taivisionlm.py``` and ```processing_taivisionlm.py```; if you are concerned about malicious code, please first click the *Files and Versions* tab on the right to review their contents.
```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# Load the config, processor, and model with remote code enabled (fp16, on GPU).
config = AutoConfig.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="sdpa").to('cuda')
model.eval()

# Fetch an example image and ask the model to describe it ("描述圖片" = "describe the image").
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to('cuda')

# Generate a response and decode it back to text.
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
### Training Procedure

The following training hyperparameters are used in the feature alignment and task-specific training stages, respectively:

- **Feature Alignment**

| Data size | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
|-----------|-------------------|---------------|--------|------------|--------------|
| 1B        | 16                | 5e-5          | 1      | 2048       | 1e-5         |

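
For reference, the feature-alignment row above maps roughly onto Hugging Face `TrainingArguments` as in the sketch below; the output directory and fp16 flag are assumptions, and this is not the actual training script:

```python
from transformers import TrainingArguments

# Sketch of the feature-alignment settings listed in the table above.
# output_dir and fp16 are assumptions; the max length (2048) is applied during tokenization, not here.
training_args = TrainingArguments(
    output_dir="taivisionlm-feature-alignment",  # placeholder path
    per_device_train_batch_size=16,              # global batch size 16 on a single GPU
    learning_rate=5e-5,
    num_train_epochs=1,
    weight_decay=1e-5,
    fp16=True,
)
```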
We use full-parameter finetuning for the projector and apply LoRA to the language model.

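
A minimal sketch of that setup with the `peft` library is shown below; the LoRA rank, target modules, and submodule names (`language_model`, `multi_modal_projector`) are illustrative assumptions, since the card does not specify them:

```python
from peft import LoraConfig, get_peft_model

# Assumed LoRA settings: the card does not list the rank, alpha, or target modules.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wrap only the language model with LoRA adapters (submodule name is an assumption)...
model.language_model = get_peft_model(model.language_model, lora_config)

# ...and keep the vision projector fully trainable, i.e. full-parameter finetuning.
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True
```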
### Compute Infrastructure

- **Feature Alignment**
  - 1xV100 (32GB), took approximately 16 GPU hours.