Introducing PerceptionDLM: the first multimodal diffusion LLM for parallel region perception!
Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩
✨ Highlights
• ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency
• PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs)
• New benchmark: ParaDLC-Bench, which jointly evaluates caption quality AND inference efficiency
• Code, models & benchmark all open-sourced