Mobile-O-1.5B

Unified Multimodal Understanding and Generation on Mobile Device


📌 Overview

Mobile-O-1.5B is the larger variant of the Mobile-O family: a unified vision–language–diffusion model that performs both multimodal understanding (VQA, OCR, reasoning) and image generation within a single architecture. It offers improved generation quality and understanding accuracy over the 0.5B variant.

| Spec | Detail |
|------|--------|
| Total Parameters | 3.5B |
| VLM Backbone | FastVLM-1.5B (FastViT + Qwen2-1.5B) |
| Diffusion Decoder | SANA-1600M-512 (Linear DiT + VAE) |
| Connector | Mobile Conditioning Projector |
| Image Resolution | 512×512 |

🎯 Supported Tasks

| Task | Input → Output |
|------|----------------|
| 💬 Conversational AI | Text → Text |
| 👁️ Image Understanding | Image + Text → Text |
| 🖼️ Image Generation | Text → Image |
| ✏️ Image Editing | Image + Text → Image |

🚀 Quick Start

Download

from huggingface_hub import snapshot_download

# Download the full checkpoint into ./checkpoints, the directory the
# inference scripts below expect via --model_path
snapshot_download(
    repo_id="Amshaker/Mobile-O-1.5B",
    repo_type="model",
    local_dir="checkpoints"
)

Image Understanding

python infer_und.py \
    --model_path checkpoints/ \
    --image_path assets/cute_cat.png \
    --prompt "What is in the image?"

Image Generation

python infer_gen.py \
    --model_path checkpoints/ \
    --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"

Image Editing

python infer_edit.py \
    --model_path checkpoints/ \
    --image_path assets/cute_cat.png \
    --prompt "Make the cat wear a hat"
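
Since all three entry points accept the same flags, they are easy to script. The wrapper below is a hypothetical convenience helper, not part of the repository; it only forwards the flags documented above.

import subprocess

# Hypothetical helper (not shipped with Mobile-O): routes a task to the
# corresponding documented CLI script using only the documented flags.
SCRIPTS = {
    "understand": "infer_und.py",
    "generate": "infer_gen.py",
    "edit": "infer_edit.py",
}

def run_mobile_o(task, prompt, image_path=None, model_path="checkpoints/"):
    cmd = ["python", SCRIPTS[task], "--model_path", model_path, "--prompt", prompt]
    if image_path is not None:  # understanding and editing take an input image
        cmd += ["--image_path", image_path]
    subprocess.run(cmd, check=True)

# Example: text-to-image generation with the downloaded checkpoint
run_mobile_o("generate", "A snow-covered mountain cabin at golden hour")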

πŸ—οΈ Architecture

Mobile-O consists of three main components:

  • Vision-Language Model (VLM): FastVLM-1.5B, a FastViT vision encoder paired with a Qwen2-1.5B language backbone
  • Diffusion Decoder: SANA-1600M-512, a lightweight linear DiT with VAE for 512×512 generation
  • Mobile Conditioning Projector (MCP): a ~2.4M-parameter connector using layerwise feature fusion with temperature-scaled weights, depthwise-separable 1D convolutions, and efficient channel attention; a code sketch follows this list
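
To make the connector concrete, here is a minimal, illustrative PyTorch sketch built from those three ingredients. The class name, dimensions, kernel sizes, and temperature are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MCPSketch(nn.Module):
    """Hypothetical stand-in for the Mobile Conditioning Projector."""

    def __init__(self, num_layers=4, vlm_dim=1536, cond_dim=2240, temperature=2.0):
        super().__init__()
        # Learnable per-layer weights for layerwise feature fusion,
        # softened by a temperature before the softmax
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature
        # Depthwise-separable 1D convolution over the token axis
        self.depthwise = nn.Conv1d(vlm_dim, vlm_dim, 3, padding=1, groups=vlm_dim)
        self.pointwise = nn.Conv1d(vlm_dim, cond_dim, 1)
        # Efficient channel attention (ECA): a 1D conv over the pooled
        # channel descriptor, with no dimensionality reduction
        self.eca = nn.Conv1d(1, 1, 3, padding=1, bias=False)

    def forward(self, hidden_states):
        # hidden_states: list of [B, T, vlm_dim] tensors, one per VLM layer
        w = F.softmax(self.layer_weights / self.temperature, dim=0)
        fused = sum(wi * h for wi, h in zip(w, hidden_states))     # [B, T, vlm_dim]
        x = self.pointwise(self.depthwise(fused.transpose(1, 2)))  # [B, cond_dim, T]
        s = x.mean(dim=2, keepdim=True).transpose(1, 2)            # [B, 1, cond_dim]
        gate = torch.sigmoid(self.eca(s)).transpose(1, 2)          # [B, cond_dim, 1]
        return (x * gate).transpose(1, 2)                          # [B, T, cond_dim]

# Illustrative sizes only: 1536-d states (Qwen2-1.5B hidden size) projected
# to an assumed 2240-d conditioning space for the diffusion decoder
mcp = MCPSketch()
cond = mcp([torch.randn(1, 64, 1536) for _ in range(4)])  # -> [1, 64, 2240]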

πŸ‹οΈ Training

Trained in three stages:

  1. Pre-training: cross-modal alignment on 9M text-image pairs
  2. SFT: supervised fine-tuning on ~105K curated pairs
  3. Post-training: unified multimodal training on ~105K quadruplets

🔗 Related Resources

| Resource | Link |
|----------|------|
| 🤗 Mobile-O-0.5B | Model |
| 🤗 Mobile-O-0.5B-iOS | iOS Components |
| 📱 iOS App Source Code | Mobile-O-App |

📄 Citation

@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

⚖️ License

Released under CC BY-NC 4.0. For research purposes only.
