---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: dualtowervlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- dual-tower
- research
---
**DualTowerVLM** is a dual-tower Vision-Language Model (VLM) architecture that encodes images and text through separate towers before combining their representations.

For more information, check out the repository.
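The dual-tower idea described above can be sketched in a few lines. This is a hypothetical, simplified illustration, not the actual DualTowerVLM implementation: each modality passes through its own encoder ("tower"), and the two embeddings are then fused, here by simple concatenation.

```python
# Hypothetical sketch of a dual-tower forward pass (not the real model):
# each modality has its own encoder, and the outputs are fused late.

def encode_image(pixels):
    # Stand-in image tower: pool pixel values into a 2-d embedding.
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels)]

def encode_text(token_ids):
    # Stand-in text tower: pool token ids into a 2-d embedding.
    mean = sum(token_ids) / len(token_ids)
    return [mean, min(token_ids)]

def fuse(img_emb, txt_emb):
    # Late fusion: concatenate the two tower outputs into one joint vector.
    return img_emb + txt_emb

joint = fuse(encode_image([0.1, 0.5, 0.9]), encode_text([3, 7, 2]))
```

In the real model the towers are learned neural encoders and fusion is more sophisticated, but the separation of modalities until a late combination step is the defining property of the architecture.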
**Usage:**

```python
from models.dual_tower.dual_tower import DualTowerVLM
from models.config import VLMConfig

# Default configuration; adjust its fields before building the model if needed.
cfg = VLMConfig()

# Load pretrained weights from the Hugging Face Hub.
model = DualTowerVLM.from_pretrained("patrickamadeus/dt-cococaps")
```