---
language:
- tr
pipeline_tag: image-text-to-text
license: apache-2.0
---

<!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
</div>

<!-- Provide a quick summary of what the model is/does. -->

## English
# 🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Developed to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and run for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

## Türkçe
# 🎉 TraVisionLM: Türünün İlk Örneği! 🚀

🌟 Türkçe görsel dil modelinin ilk hızlı ve kompakt (875M parametre) versiyonu! Bir görüntü ve Türkçe talimat verildiğinde Türkçe yanıt üretir! 🌟

✨ Transformers kütüphanesi ile uyumlu olarak geliştirilen TraVisionLM'yi yüklemek, eğitmek ve dış kütüphaneler kullanmadan hızlı sonuçlar almak çok kolay! ⚡️

Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım! 🇹🇷🖼️🤖

---

# Model Details

## English
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. A vision projector connects the two modalities.

Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.

Here's a glimpse into the development process:
1) **Unimodal pretraining**
   - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
2) **Feature Alignment**
   - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector on 500K image-text pairs to align the visual and textual features.
3) **Task-Specific Training**
   - The aligned model is then trained further on tasks such as short captioning, detailed captioning, and simple visual question answering, using more than 1M image-prompt-completion triplets.
4) **Finetuning on Downstream Tasks**
   - Finally, the model is fine-tuned for object detection to demonstrate its versatility on downstream tasks. See [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for details.
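The projector-based design above can be sketched numerically. The following is a minimal, illustrative numpy sketch, not the repository's actual code; the dimensions (256 patches from a 256×256 image with 16×16 patches, 768-d SigLIP features, 1280-d GPT2-large embeddings) are assumptions inferred from the cited checkpoints.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (inferred from the cited checkpoints, not this repo's code):
NUM_PATCHES = 256   # a 256x256 image with 16x16 patches -> (256/16)^2 patches
VISION_DIM = 768    # SigLIP-base hidden size
TEXT_DIM = 1280     # GPT2-large hidden size

# The vision projector: a single linear map from the image-feature space
# into the language model's embedding space.
W = rng.normal(0.0, 0.02, size=(VISION_DIM, TEXT_DIM))
b = np.zeros(TEXT_DIM)

def project(image_features: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) features to (num_patches, TEXT_DIM)."""
    return image_features @ W + b

# Stand-ins for SigLIP output on one image and the prompt token embeddings.
image_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))
prompt_embeds = rng.normal(size=(12, TEXT_DIM))  # 12 prompt tokens

# Projected image tokens are prepended to the text tokens, and the combined
# sequence is fed to the causal language model.
lm_input = np.concatenate([project(image_features), prompt_embeds], axis=0)
print(lm_input.shape)  # (268, 1280)
```

In stage 2, only `W` and `b` would receive gradients; the encoder and the language model stay frozen.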
## Türkçe
Bu model, [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) görsel kodlayıcısını ve [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) dil modelini birleştiren çok modlu büyük bir dil modelidir. Görsel projektör, iki modaliteyi bir araya getirir.

Mimarisi, [PaliGemma](https://arxiv.org/pdf/2407.07726) ile yakından benzerlik gösterir; ancak görsel projektörde ve neden-sonuç dil modellemesinde bazı uyarlamalar yapılmıştır.

Geliştirme sürecinin özeti:

1) **Tek Modalite Ön Eğitimi**
   - Bu aşamada, her iki modaliteyi sıfırdan eğitmek yerine, [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) modelinin görsel kodlayıcısını ve [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large) dil modelini kullanıyorum.
2) **Özellik Uyarlama**
   - [LLaVA eğitim tarifesi](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) izlenerek, görsel ve metin özelliklerini uyumlu hale getirmek için yalnızca görsel projektörü 500K görüntü-metin çifti ile eğitiyorum.
3) **Göreve Özgü Eğitim**
   - Uyumlulaştırılmış model; kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için 1M'den fazla görüntü-istem-tamamlama üçlüsü ile eğitilmeye devam ediyor.
4) **Alt Görevlerde İnce Ayar**
   - Son olarak, modelin çeşitli alt görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayar yapılmıştır. Daha fazla detay için [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) modelini inceleyebilirsiniz.
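On the training objective: the card does not spell out how the image-prompt-completion triplets are supervised. A common recipe in PaliGemma/LLaVA-style training, shown here purely as an assumption rather than a confirmed detail of this model, is to compute the causal LM loss only on the completion tokens, masking the image and prompt positions out of the loss:

```python
# Sketch of causal-LM supervision over image-prompt-completion triplets.
# Masking non-completion positions with -100 follows the PyTorch/Transformers
# cross-entropy convention; whether TraVisionLM does exactly this is an assumption.

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(num_image_tokens: int, prompt_ids: list[int],
                 completion_ids: list[int]) -> list[int]:
    """Supervise only the completion: image and prompt positions are masked."""
    masked = [IGNORE_INDEX] * (num_image_tokens + len(prompt_ids))
    return masked + completion_ids

labels = build_labels(4, prompt_ids=[11, 12, 13], completion_ids=[21, 22])
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 21, 22]
```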
### Model Description

- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Turkish*
- **License:** *Apache license 2.0*

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [ucsahin/TraVisionLM-base](https://huggingface.co/ucsahin/TraVisionLM-base)
- **Paper [optional]:** More info on this later.
- **Demo [optional]:** [More Information Needed]
## Uses

[More Information Needed]
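The card's "How to Get Started" code is not included in this excerpt. The following is a hypothetical sketch of loading the model with the Transformers library; the repository id, the need for `trust_remote_code=True`, the Turkish task keyword `Açıkla`, and the sample image URL are all assumptions, not details confirmed by this card.

```python
def build_prompt(task: str, question: str = "") -> str:
    # Compose a Turkish task prompt; the task keywords are hypothetical.
    return f"{task}: {question}" if question else f"{task}:"

def run_demo(model_id: str = "ucsahin/TraVisionLM-base") -> str:
    """Download the model and caption a test image (requires network access)."""
    import requests
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # trust_remote_code is assumed necessary because the architecture is custom.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    inputs = processor(text=build_prompt("Açıkla"), images=image,
                       return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.batch_decode(output, skip_special_tokens=True)[0]

# run_demo()  # uncomment to download the weights and run inference
```

If the repository does not ship custom model/processor classes, `trust_remote_code=True` can be dropped.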
### Model Architecture and Objective

[More Information Needed]

#### Software