---
license: apache-2.0
language:
- en
- ar
tags:
- omda-architecture
- vision-language
- image-to-prompt
- binomda
---
# 🚀 OMDA-PROMPTER: Optical Multi-modal Description Architecture
This model is a core member of the **OMDA Family** by **BINOMDA**.
It bridges the gap between visual perception and detailed linguistic description: given an image, it generates a detailed text description of it.
## 🧠 Model DNA
- **Family Name:** OMDA (Optical Multi-modal Description Architecture)
- **Developer:** BINOMDA
- **Vision Backbone:** SigLIP (Frozen)
- **Language Decoder:** OMDA-GPT2 Hybrid
- **Architecture Type:** Multimodal Cross-Attention
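
As a rough illustration of what cross-attention between a language decoder and a vision backbone generally involves, the sketch below lets each text-token vector attend over a set of image-patch vectors. It is a dependency-free toy, not the OMDA implementation: the function names `softmax` and `cross_attention`, the single attention head, and the absence of learned projection matrices are all simplifying assumptions made here for clarity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_features):
    """Single-head cross-attention without learned projections (illustrative only).

    text_states:    one query vector per text token (decoder side)
    image_features: one key/value vector per image patch (vision side)
    Returns one image-conditioned vector per text token.
    """
    dim = len(image_features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in text_states:
        # Scaled dot-product similarity between this token and every patch.
        scores = [
            scale * sum(q * k for q, k in zip(query, patch))
            for patch in image_features
        ]
        weights = softmax(scores)
        # Weighted average of the patch vectors, one value per hidden dimension.
        mixed = [
            sum(w * patch[i] for w, patch in zip(weights, image_features))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Toy inputs: 2 text tokens attending over 3 image patches, hidden size 2.
text_states = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended = cross_attention(text_states, image_features)
```

In this toy setup the first text token ends up weighted toward the first patch and the second token toward the second patch, which is the basic mechanism by which a decoder can pull in visual information token by token. A real model would typically add learned query, key, and value projections and multiple heads.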
## 🛠 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image
import torch
# Load the OMDA model; trust_remote_code=True allows the custom model code in the repo to run
model = AutoModelForCausalLM.from_pretrained("BINOMDA/OMDA-PROMPTER", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("BINOMDA/OMDA-PROMPTER")
# Images are preprocessed with the SigLIP processor that matches the vision backbone
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

# Preprocess the image and move it to the model's device
image = Image.open("your-image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(model.device)

# Generate a description without tracking gradients
with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        max_new_tokens=800,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
description = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(description)
```
## Training Details
- **Training Data:** Curated dataset of images with detailed descriptions
- **Max Sequence Length:** 2048 tokens
- **Training Epochs:** 5
- **Learning Rate:** 2e-5
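
For anyone trying to reproduce a similar setup, the hyperparameters above can be collected in a small config object, sketched below. Only `max_seq_length`, `num_epochs`, and `learning_rate` come from this model card. The class name `TrainingConfig`, the `batch_size`, and the `dataset_size` are hypothetical placeholders chosen purely for illustration, since the card does not state them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values stated in this model card.
    max_seq_length: int = 2048
    num_epochs: int = 5
    learning_rate: float = 2e-5
    # Hypothetical placeholders, NOT taken from the model card.
    batch_size: int = 8
    dataset_size: int = 10_000

    def steps_per_epoch(self) -> int:
        """Optimizer steps per epoch, counting a final partial batch."""
        return math.ceil(self.dataset_size / self.batch_size)

    def total_steps(self) -> int:
        """Optimizer steps across all epochs (no gradient accumulation assumed)."""
        return self.steps_per_epoch() * self.num_epochs


config = TrainingConfig()
print(config.total_steps())
```

The step count is mainly useful for sizing a learning-rate schedule; substitute the real batch size and dataset size if they are known.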