TeLVE v1.0dep released. Due to a dataset addressing problem during training, the model was trained on roughly half of the intended dataset, so its use is not recommended.
Browse files
- README.md +80 -79
- models/TeLVE_v1.0dep.pth +3 -0
README.md
CHANGED
---
license: cc-by-4.0
language:
- en
- tr
tags:
- VLM
- image2text
- lm
---

# TeLVE: Turkish efficient Language Vision Engine 🧿

[](https://creativecommons.org/licenses/by/4.0/)
[](https://huggingface.co/outsu/TeLVE)

## First Turkish VLM ever!

TeLVE is the first Visual Language Model specifically designed for Turkish language understanding and image description generation. Built on Vision Transformer (ViT) and BERT pre-trained encoder architectures, it bridges the gap in Turkish visual-linguistic processing.


## Model Description

TeLVE combines:
- 🖼️ Vision Transformer (ViT-base-patch16-224)
- 📝 Turkish BERT (dbmdz/bert-base-turkish-cased)
- 🔄 Cross-attention mechanism for vision-language fusion
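The fusion code itself is not part of this commit; as a rough sketch only, a cross-attention block between the two encoders could look like the following (the class name, dimensions, and head count are illustrative assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text tokens (queries) attend over
    image patch embeddings (keys/values)."""

    def __init__(self, hidden_dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats, image_feats):
        attended, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection plus layer norm, as in standard transformer blocks.
        return self.norm(text_feats + attended)

# Dummy encoder outputs: ViT-base emits 197 tokens (196 patches + [CLS]);
# both ViT-base and BERT-base use 768-dim hidden states.
image_feats = torch.randn(2, 197, 768)
text_feats = torch.randn(2, 16, 768)
fused = CrossAttentionFusion()(text_feats, image_feats)
```

The fused output keeps the text sequence length, so it can feed a captioning head directly.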
### Version Logs
- **TeLVE v1.0**: Trained on the Unsplash Lite dataset
- **TeLVE v1.0dep**: Dataset enhanced with selected images from Pexels; the encoder problem with the letter "ü" was fixed. *(Deprecated: performance decreased because of a dataset addressing problem; not recommended for use.)*
## Usage

The model can be used in two ways:
### Inference (imagine.py)
```bash
# Generate captions for images
python imagine.py
```
This script:
- Loads a trained TeLVE model
- Takes images from the `images` directory
- Generates Turkish captions for each image
- Outputs the results to the console
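`imagine.py` is not reproduced in this diff; the steps above boil down to a loop like this sketch, where `caption_directory` and `generate_caption` are hypothetical stand-ins for the script's actual model call:

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def caption_directory(image_dir, generate_caption):
    """Collect one caption per image file in the directory (illustrative helper)."""
    captions = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            # generate_caption stands in for the TeLVE forward pass.
            captions[path.name] = generate_caption(path)
    return captions
```

In the real script the callable would run the loaded model on each image and the results would be printed to the console.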
### Training (main.py)
Users can train their own models with ViT and BERT encoders.
```bash
# Train a new model
python main.py
```
This script:
- Loads and preprocesses image-caption pairs
- Initializes ViT and BERT encoders
- Trains the combined model
- Saves the model and tokenizer
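`main.py` is likewise not shown here; the training it describes generally reduces to a step like this sketch, with an `nn.Linear` toy standing in for the actual ViT+BERT model (all names below are illustrative assumptions):

```python
import torch
import torch.nn as nn

def train_step(model, batch, optimizer, loss_fn):
    """One optimization step over an image-caption batch (skeleton only)."""
    optimizer.zero_grad()
    logits = model(batch["pixel_values"])
    loss = loss_fn(logits, batch["labels"])
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear stand-in model over random features.
model = nn.Linear(8, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batch = {"pixel_values": torch.randn(6, 8), "labels": torch.randint(0, 4, (6,))}
loss_value = train_step(model, batch, optimizer, nn.CrossEntropyLoss())
```

The real script would iterate this step over the preprocessed image-caption pairs, then save the model and tokenizer at the end.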
## Performance

Performance scores will be evaluated.
<!--
| Model Version | Dataset         | BLEU-4 | METEOR | CIDEr |
|---------------|-----------------|--------|--------|-------|
| TeLVE v1.0    | Unsplash        | *TBD*  | *TBD*  | *TBD* |
| TeLVE v1.1    | Unsplash+Pexels | *TBD*  | *TBD*  | *TBD* |
-->
## Citation

```bibtex
@software{telve2024,
  author = {Öğüt Su Karagün},
  title = {TeLVE: Turkish efficient Language Vision Engine},
  year = {2024},
  url = {https://huggingface.co/outsu/TeLVE}
}
```
## License

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
models/TeLVE_v1.0dep.pth
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5e74ea3f021a45ff9f888c841e8f07924b175fe2a50c73696daa7039be10df48
size 904212666
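The checkpoint is stored as a Git LFS pointer: `oid` is the SHA-256 of the real file and `size` its byte count. After `git lfs pull`, a downloaded copy can be checked against the pointer with a small helper like this sketch (the function name is illustrative):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file in chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For `models/TeLVE_v1.0dep.pth`, the returned digest should match the `oid` above and the file should be 904212666 bytes.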