Commit a60c8da by KangLiao (verified) · Parent(s): d0d6ea7
Create README.md (README.md added, +28 −0)
---
license: apache-2.0
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
<p align="center">
&nbsp;&nbsp;📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a>&nbsp;&nbsp;|&nbsp;&nbsp;🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2506.18903v1">Paper</a>
</p>

## Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation tasks in **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language models, we introduce a novel paradigm that treats the camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across the geometric context.
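To make the camera-as-language idea concrete, here is a minimal, purely illustrative sketch of serializing camera parameters (roll, pitch, field of view) into a language-like string a model could reason over. The `<camera>` tag format and the `camera_to_text` helper are hypothetical for illustration; Puffin's actual camera tokenization may differ.

```python
def camera_to_text(roll: float, pitch: float, fov: float) -> str:
    """Serialize camera parameters (degrees) into a language-like string.

    Hypothetical format for illustration only; not Puffin's actual
    camera vocabulary.
    """
    return (
        f"<camera> roll={roll:.1f}deg "
        f"pitch={pitch:.1f}deg fov={fov:.1f}deg </camera>"
    )

print(camera_to_text(10.0, -5.0, 60.0))
# → <camera> roll=10.0deg pitch=-5.0deg fov=60.0deg </camera>
```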

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
| **License** | Apache-2.0 |

---

### Direct Use
- **Camera-centric understanding and generation** from a single image or a text-camera pair, with support for the thinking mode.
- **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
- **Spatial imagination**: imagines the scene description from an initial view and a target camera configuration.
- **3D virtual object insertion** in AR/VR: assists the insertion of virtual 3D objects into in-the-wild images by calibrating camera parameters.
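For the calibration-based insertion use case above, a standard pinhole-camera relation connects an estimated field of view to a focal length in pixels. This is generic camera geometry, not Puffin-specific code:

```python
import math

def focal_from_fov(fov_deg: float, image_width: int) -> float:
    """Focal length in pixels from horizontal field of view (pinhole model)."""
    return (image_width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

# A 90-degree FoV on a 1000-px-wide image gives a focal length of
# about 500 px (up to floating-point rounding).
print(focal_from_fov(90.0, 1000))
```

With intrinsics recovered this way, a virtual 3D object can be projected into the image with a perspective consistent with the photograph.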