---
license: apache-2.0
language:
- en
- ar
tags:
- omda-architecture
- vision-language
- image-to-prompt
- binomda
---

# 🚀 OMDA-PROMPTER: Optical Multi-modal Description Architecture

This model is a core member of the **OMDA Family** by **BINOMDA**. It bridges the gap between visual perception and detailed linguistic description.

## 🧠 Model DNA

- **Family Name:** OMDA (Optical Multi-modal Description Architecture)
- **Developer:** BINOMDA
- **Vision Backbone:** SigLIP (Frozen)
- **Language Decoder:** OMDA-GPT2 Hybrid
- **Architecture Type:** Multimodal Cross-Attention

## 🛠 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image
import torch

# Load the specialized OMDA architecture (custom code, hence trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("BINOMDA/OMDA-PROMPTER", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("BINOMDA/OMDA-PROMPTER")
# The vision processor is for SigLIP
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

# Preprocess the image into the pixel tensor expected by the vision backbone
image = Image.open("your-image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(model.device)

# Generate the description; gradients are not needed at inference time
with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        max_new_tokens=800,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

description = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(description)
```

## Training Details

- **Training Data:** Curated dataset of images with detailed descriptions
- **Max Sequence Length:** 2048 tokens
- **Training Epochs:** 5
- **Learning Rate:** 2e-5
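
For readers who want to reproduce a comparable setup, the values stated under Training Details can be collected into a small configuration object. The following is a minimal sketch, not the training code used for this model: `TrainingConfig` and `total_optimizer_steps` are hypothetical names, and the batch size and dataset size are placeholder assumptions, since this card does not state either.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values stated under Training Details in this card.
    max_seq_length: int = 2048
    num_epochs: int = 5
    learning_rate: float = 2e-5
    # Hypothetical value: the card does not state a batch size.
    batch_size: int = 8


def total_optimizer_steps(num_examples: int, config: TrainingConfig) -> int:
    """Optimizer steps for a full run, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / config.batch_size)
    return steps_per_epoch * config.num_epochs


config = TrainingConfig()
# Hypothetical dataset size, for illustration only.
print(total_optimizer_steps(10_000, config))
```

The step count is mainly useful for sizing a learning-rate schedule (for example, choosing a warmup length); adjust `batch_size` and the dataset size to match the actual run.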