---
license: apache-2.0
language:
- en
- ar
tags:
- omda-architecture
- vision-language
- image-to-prompt
- binomda
---

# OMDA-PROMPTER: Optical Multi-modal Description Architecture

This model is a core member of the **OMDA Family** by **BINOMDA**.
It bridges the gap between visual perception and detailed linguistic description: given an image, it generates a long-form text description.

## Model DNA
- **Family Name:** OMDA (Optical Multi-modal Description Architecture)
- **Developer:** BINOMDA
- **Vision Backbone:** SigLIP (Frozen)
- **Language Decoder:** OMDA-GPT2 Hybrid
- **Architecture Type:** Multimodal Cross-Attention
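
The list above names a frozen SigLIP vision backbone feeding a GPT-2-style decoder through cross-attention. As a rough illustration of that mechanism only (not the actual OMDA implementation, which ships as remote code with the checkpoint), the NumPy sketch below shows single-head cross-attention in which text hidden states act as queries and image patch features act as keys and values. The sizes (196 patches, width 768, 5 text tokens) and the random projection weights are assumptions chosen for illustration.

```python
import numpy as np


def cross_attention(text_states, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens attend over image patches."""
    q = text_states @ w_q                      # (num_tokens, dim)
    k = image_feats @ w_k                      # (num_patches, dim)
    v = image_feats @ w_v                      # (num_patches, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_tokens, num_patches)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over patches
    return weights @ v, weights


rng = np.random.default_rng(0)
num_patches, num_tokens, dim = 196, 5, 768  # assumed sizes, illustration only
image_feats = rng.standard_normal((num_patches, dim))  # stand-in for frozen vision features
text_states = rng.standard_normal((num_tokens, dim))   # stand-in for decoder hidden states
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

attended, weights = cross_attention(text_states, image_feats, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each text position receives a weighted mix of patch features, so the image features can stay fixed (frozen) while only the projection matrices on the decoder side would need training.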

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image
import torch

# Load the specialized OMDA architecture (custom modeling code, hence trust_remote_code)
model = AutoModelForCausalLM.from_pretrained("BINOMDA/OMDA-PROMPTER", trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("BINOMDA/OMDA-PROMPTER")
# The vision processor comes from the SigLIP checkpoint
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

# GPT-2-style tokenizers often define no pad token; fall back to EOS in that case
pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id

# Generate description
image = Image.open("your-image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(model.device)
with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        max_new_tokens=800,
        pad_token_id=pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
description = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(description)
```

## Training Details
- **Training Data:** Curated dataset of images with detailed descriptions
- **Max Sequence Length:** 2048 tokens
- **Training Epochs:** 5
- **Learning Rate:** 2e-5
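
For readers who want to reuse these hyperparameters, they can be collected into a small configuration object. The sketch below is a hypothetical helper, not part of the released code: the class name `TrainingConfig`, its methods, and the dataset size and batch size in the example are assumptions for illustration; only the three default values come from the details above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Defaults taken from the training details listed above.
    max_seq_length: int = 2048
    num_epochs: int = 5
    learning_rate: float = 2e-5

    def total_steps(self, num_examples: int, batch_size: int) -> int:
        """Optimizer steps over all epochs, assuming no gradient accumulation."""
        steps_per_epoch = math.ceil(num_examples / batch_size)
        return steps_per_epoch * self.num_epochs

    def truncate(self, token_ids: list[int]) -> list[int]:
        """Clip a tokenized description to the maximum sequence length."""
        return token_ids[: self.max_seq_length]


config = TrainingConfig()
# Hypothetical dataset size and batch size, for illustration only.
print(config.total_steps(num_examples=10_000, batch_size=16))
# A 3000-token description is clipped to the 2048-token limit.
print(len(config.truncate(list(range(3000)))))
```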