---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: dualtowervlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- dual-tower
- research
---

**DualTowerVLM** is a dual-tower Vision-Language Model (VLM) architecture that encodes images and text through separate towers before combining their representations. For more information, check out the repository.

**Usage:**

```python
from models.dual_tower.dual_tower import DualTowerVLM
from models.config import VLMConfig

cfg = VLMConfig()
model = DualTowerVLM.from_pretrained("patrickamadeus/dt-cococaps")
```
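To illustrate the dual-tower idea, here is a minimal, framework-free sketch of the fusion step: each tower maps its modality to a fixed-size feature vector, and the two vectors are combined afterwards. The tower internals, dimensions, and helper names below (`image_tower`, `text_tower`, `combine`) are hypothetical stand-ins, not the actual DualTowerVLM implementation; concatenation is just one common fusion choice.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical shared embedding size


def image_tower(pixels: np.ndarray) -> np.ndarray:
    # Stand-in for a vision encoder: a fixed random projection of the
    # flattened pixels to EMBED_DIM (real towers are learned networks).
    rng = np.random.default_rng(0)
    w = rng.standard_normal((pixels.size, EMBED_DIM))
    return pixels.reshape(-1) @ w


def text_tower(token_ids: list[int]) -> np.ndarray:
    # Stand-in for a text encoder: mean-pool rows of a fixed embedding table.
    rng = np.random.default_rng(1)
    table = rng.standard_normal((1000, EMBED_DIM))
    return table[token_ids].mean(axis=0)


def combine(img_feat: np.ndarray, txt_feat: np.ndarray) -> np.ndarray:
    # Fuse the two towers by concatenation; the fused vector would feed a
    # downstream head (e.g., a caption decoder) in a full model.
    return np.concatenate([img_feat, txt_feat])


img = np.zeros((4, 4))               # dummy 4x4 "image"
fused = combine(image_tower(img), text_tower([1, 2, 3]))
print(fused.shape)  # (2 * EMBED_DIM,)
```

The key property this sketch preserves is that the two modalities stay independent until the `combine` step, which is what distinguishes a dual-tower design from early-fusion architectures.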