---
license: apache-2.0
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
<p align="center">
   📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a> | 🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a> | 🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a> | 📑 <a href="https://arxiv.org/abs/2506.18903v1">Paper</a>
</p>

## Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation tasks within **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
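To illustrate the camera-as-language idea, a camera configuration can be serialized into text alongside photographic vocabulary. The prompt format and FoV-to-lens thresholds below are illustrative assumptions, not Puffin's actual tokenization:

```python
def camera_to_language(roll_deg: float, pitch_deg: float, vfov_deg: float) -> str:
    """Serialize camera parameters into a text prompt (hypothetical format;
    Puffin's actual camera-to-text mapping may differ)."""
    # Map the vertical field of view onto common photographic terms.
    if vfov_deg >= 80:
        lens = "ultra-wide-angle lens"
    elif vfov_deg >= 60:
        lens = "wide-angle lens"
    elif vfov_deg >= 35:
        lens = "standard lens"
    else:
        lens = "telephoto lens"
    return (f"camera: roll {roll_deg:.1f} deg, pitch {pitch_deg:.1f} deg, "
            f"vertical FoV {vfov_deg:.1f} deg ({lens})")

print(camera_to_language(2.0, -15.0, 85.0))
```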

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
| **License** | Apache-2.0 |

---

### Direct Use
- **Camera-centric understanding and generation** from a single image or a text-camera pair; supports the thinking mode.
- **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
- **Spatial imagination**: imagines a scene description based on an initial view and a target camera configuration.
- **3D virtual object insertion** in AR/VR: assists virtual 3D object insertion into in-the-wild images by calibrating camera parameters.
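For the object-insertion use case, calibrated parameters can be converted into a pinhole intrinsic matrix for the rendering pipeline. The sketch below is a generic conversion assuming square pixels and a centered principal point, not Puffin's own code:

```python
import math

def intrinsics_from_vfov(vfov_deg: float, width: int, height: int):
    """Build a 3x3 pinhole intrinsic matrix from a vertical FoV estimate.

    Assumes square pixels and a principal point at the image center;
    a model-predicted FoV (e.g. from a calibration model) can be plugged in.
    """
    fy = (height / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    fx = fy  # square-pixel assumption
    cx, cy = width / 2.0, height / 2.0
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]

K = intrinsics_from_vfov(60.0, 1280, 720)
```

With such a matrix, a 3D asset can be projected into the image with a consistent perspective before compositing.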