---
license: apache-2.0
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
<p align="center">
   📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a> | 🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a> | 🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a> | 📑 <a href="https://arxiv.org/abs/2506.18903v1">Paper</a>
</p>

## Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation tasks within **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
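To illustrate the camera-as-language idea, a camera configuration can be serialized into text alongside photographic vocabulary. The prompt format and FoV-to-lens thresholds below are illustrative assumptions, not Puffin's actual tokenization:

```python
def camera_to_language(roll_deg: float, pitch_deg: float, vfov_deg: float) -> str:
    """Serialize camera parameters into a text prompt (hypothetical format;
    Puffin's actual camera-to-text mapping may differ)."""
    # Map the vertical field of view onto common photographic terms.
    if vfov_deg >= 80:
        lens = "ultra-wide-angle lens"
    elif vfov_deg >= 60:
        lens = "wide-angle lens"
    elif vfov_deg >= 35:
        lens = "standard lens"
    else:
        lens = "telephoto lens"
    return (f"camera: roll {roll_deg:.1f} deg, pitch {pitch_deg:.1f} deg, "
            f"vertical FoV {vfov_deg:.1f} deg ({lens})")

print(camera_to_language(2.0, -15.0, 85.0))
```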

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
| **License** | Apache-2.0 |

---

### Direct Use
- **Camera-centric understanding and generation** from a single image or a text-camera pair; supports the thinking mode.
- **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
- **Spatial imagination**: imagines a scene description based on an initial view and a target camera configuration.
- **3D virtual object insertion** in AR/VR: assists virtual 3D object insertion into in-the-wild images by calibrating camera parameters.
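For the object-insertion use case, calibrated parameters can be converted into a pinhole intrinsic matrix for the rendering pipeline. The sketch below is a generic conversion assuming square pixels and a centered principal point, not Puffin's own code:

```python
import math

def intrinsics_from_vfov(vfov_deg: float, width: int, height: int):
    """Build a 3x3 pinhole intrinsic matrix from a vertical FoV estimate.

    Assumes square pixels and a principal point at the image center;
    a model-predicted FoV (e.g. from a calibration model) can be plugged in.
    """
    fy = (height / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    fx = fy  # square-pixel assumption
    cx, cy = width / 2.0, height / 2.0
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]

K = intrinsics_from_vfov(60.0, 1280, 720)
```

With such a matrix, a 3D asset can be projected into the image with a consistent perspective before compositing.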