PulYong's Collections

Unified MLLM

updated Jun 18, 2025

Unified models that generate text, images, and video

  • TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

    Paper • 2412.03069 • Published Dec 4, 2024 • 34

  • Are Emergent Abilities of Large Language Models a Mirage?

    Paper • 2304.15004 • Published Apr 28, 2023 • 8

  • Scaling Image Tokenizers with Grouped Spherical Quantization

    Paper • 2412.02632 • Published Dec 3, 2024 • 10

  • Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Paper • 2410.13848 • Published Oct 17, 2024 • 34

  • VisionZip: Longer is Better but Not Necessary in Vision Language Models

    Paper • 2412.04467 • Published Dec 5, 2024 • 117

  • Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

    Paper • 2412.04432 • Published Dec 5, 2024 • 16

  • Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Paper • 2412.14171 • Published Dec 18, 2024 • 24

  • Autoregressive Video Generation without Vector Quantization

    Paper • 2412.14169 • Published Dec 18, 2024 • 14