---
language:
- tr
pipeline_tag: image-text-to-text
license: apache-2.0
---

<!-- # TraVisionLM - Fast and Native Turkish Visual Language Model -->
</div>

<!-- Provide a quick summary of what the model is/does. -->

## English
# 🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and compact (875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Developed to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and run for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

## Türkçe
# 🎉 TraVisionLM: Türünün İlk Örneği! 🚀

🌟 Türkçe görsel dil modelinin ilk hızlı ve kompakt (875M parametre) versiyonu! Bir görüntü ve Türkçe talimat verildiğinde Türkçe yanıt üretir! 🌟

✨ Transformers kütüphanesi ile uyumlu olarak geliştirilen TraVisionLM'yi yüklemek, eğitmek ve dış kütüphaneler kullanmadan hızlı sonuçlar almak çok kolay! ⚡️

Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım! 🇹🇷🖼️🤖

---

# Model Details

## English
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) as its language model. A vision projector connects the two modalities.

Its architecture closely resembles [PaliGemma](https://arxiv.org/pdf/2407.07726), with some refined adjustments to the vision projector and the causal language modeling.

Here's a glimpse into the development process:
1) **Unimodal pretraining**
   - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
2) **Feature Alignment**
   - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train only the vision projector on 500K image-text pairs to align the visual and textual features.
3) **Task-Specific Training**
   - The aligned model is then trained further on tasks such as short captioning, detailed captioning, and simple visual question answering, using more than 1M image-prompt-completion triplets.
4) **Finetuning on Downstream Tasks**
   - Finally, the model is fine-tuned for object detection to demonstrate its versatility on downstream tasks. See [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for details.
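The projector-based design above can be sketched numerically. The following is a minimal, illustrative numpy sketch, not the repository's actual code; the dimensions (256 patches from a 256×256 image with 16×16 patches, 768-d SigLIP features, 1280-d GPT2-large embeddings) are assumptions inferred from the cited checkpoints.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (inferred from the cited checkpoints, not this repo's code):
NUM_PATCHES = 256   # a 256x256 image with 16x16 patches -> (256/16)^2 patches
VISION_DIM = 768    # SigLIP-base hidden size
TEXT_DIM = 1280     # GPT2-large hidden size

# The vision projector: a single linear map from the image-feature space
# into the language model's embedding space.
W = rng.normal(0.0, 0.02, size=(VISION_DIM, TEXT_DIM))
b = np.zeros(TEXT_DIM)

def project(image_features: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) features to (num_patches, TEXT_DIM)."""
    return image_features @ W + b

# Stand-ins for SigLIP output on one image and the prompt token embeddings.
image_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))
prompt_embeds = rng.normal(size=(12, TEXT_DIM))  # 12 prompt tokens

# Projected image tokens are prepended to the text tokens, and the combined
# sequence is fed to the causal language model.
lm_input = np.concatenate([project(image_features), prompt_embeds], axis=0)
print(lm_input.shape)  # (268, 1280)
```

In stage 2, only `W` and `b` would receive gradients; the encoder and the language model stay frozen.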
## Türkçe
Bu model, [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) görsel kodlayıcısını ve [GPT2-large](https://huggingface.co/docs/transformers/en/model_doc/gpt2) dil modelini birleştiren çok modlu büyük bir dil modelidir. Görsel projektör, iki modaliteyi bir araya getirir.

Mimarisi, [PaliGemma](https://arxiv.org/pdf/2407.07726) ile yakından benzerlik gösterir; ancak görsel projektörde ve neden-sonuç dil modellemesinde bazı uyarlamalar yapılmıştır.

Geliştirme sürecinin özeti:

1) **Tek Modalite Ön Eğitimi**
   - Bu aşamada, her iki modaliteyi sıfırdan eğitmek yerine, [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) modelinin görsel kodlayıcısını ve [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large) dil modelini kullanıyorum.
2) **Özellik Uyarlama**
   - [LLaVA eğitim tarifesi](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train) izlenerek, görsel ve metin özelliklerini uyumlu hale getirmek için yalnızca görsel projektörü 500K görüntü-metin çifti ile eğitiyorum.
3) **Göreve Özgü Eğitim**
   - Uyumlulaştırılmış model; kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için 1M'den fazla görüntü-istem-tamamlama üçlüsü ile eğitilmeye devam ediyor.
4) **Alt Görevlerde İnce Ayar**
   - Son olarak, modelin çeşitli alt görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayar yapılmıştır. Daha fazla detay için [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) modelini inceleyebilirsiniz.
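On the training objective: the card does not spell out how the image-prompt-completion triplets are supervised. A common recipe in PaliGemma/LLaVA-style training, shown here purely as an assumption rather than a confirmed detail of this model, is to compute the causal LM loss only on the completion tokens, masking the image and prompt positions out of the loss:

```python
# Sketch of causal-LM supervision over image-prompt-completion triplets.
# Masking non-completion positions with -100 follows the PyTorch/Transformers
# cross-entropy convention; whether TraVisionLM does exactly this is an assumption.

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(num_image_tokens: int, prompt_ids: list[int],
                 completion_ids: list[int]) -> list[int]:
    """Supervise only the completion: image and prompt positions are masked."""
    masked = [IGNORE_INDEX] * (num_image_tokens + len(prompt_ids))
    return masked + completion_ids

labels = build_labels(4, prompt_ids=[11, 12, 13], completion_ids=[21, 22])
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 21, 22]
```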
### Model Description

- **Developed by:** [ucsahin](https://huggingface.co/ucsahin)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Turkish*
- **License:** *Apache license 2.0*

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [ucsahin/TraVisionLM-base](https://huggingface.co/ucsahin/TraVisionLM-base)
- **Paper [optional]:** More info on this later.
- **Demo [optional]:** [More Information Needed]
## Uses

[More Information Needed]
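The card's "How to Get Started" code is not included in this excerpt. The following is a hypothetical sketch of loading the model with the Transformers library; the repository id, the need for `trust_remote_code=True`, the Turkish task keyword `Açıkla`, and the sample image URL are all assumptions, not details confirmed by this card.

```python
def build_prompt(task: str, question: str = "") -> str:
    # Compose a Turkish task prompt; the task keywords are hypothetical.
    return f"{task}: {question}" if question else f"{task}:"

def run_demo(model_id: str = "ucsahin/TraVisionLM-base") -> str:
    """Download the model and caption a test image (requires network access)."""
    import requests
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # trust_remote_code is assumed necessary because the architecture is custom.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    inputs = processor(text=build_prompt("Açıkla"), images=image,
                       return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.batch_decode(output, skip_special_tokens=True)[0]

# run_demo()  # uncomment to download the weights and run inference
```

If the repository does not ship custom model/processor classes, `trust_remote_code=True` can be dropped.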
### Model Architecture and Objective

[More Information Needed]

#### Software