nielsr HF Staff commited on
Commit
4c4ec5c
·
verified ·
1 Parent(s): e92e01f

Add comprehensive model card for LEGO (MV-ScanQA, TripAlign)

Browse files

This PR adds a comprehensive model card for the LEGO model, presented in the paper [Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset](https://huggingface.co/papers/2508.11058).

It includes:
- The appropriate `pipeline_tag`: `image-text-to-text`, allowing users to find the model easily on the Hub.
- The `library_name`: `transformers`, indicating compatibility with the 🤗 Transformers library.
- The `license`: `CC-BY-4.0` as specified in the repository.
- Links to the paper, project page, and the official GitHub repository for more detailed information and code.
- An overview of the model and its capabilities based on the paper abstract and GitHub README.
- A sample Python usage example demonstrating how to load and use the LoRA adapter with its Fuyu base model.

Please review and merge this PR if everything looks good.

Files changed (1) hide show
  1. README.md +189 -0
README.md ADDED
@@ -0,0 +1,189 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: cc-by-4.0
3
+ pipeline_tag: image-text-to-text
4
+ library_name: transformers
5
+ ---
6
+
7
+ # LEGO: A Model for Multi-View 3D Scene Understanding
8
+
9
+ This repository contains the official weights for **LEGO**, a baseline method for multi-view reasoning in 3D scene understanding. LEGO leverages knowledge from pre-trained 2D LVLMs (specifically fine-tuning a Fuyu-8B model) and is trained using the **TripAlign** pre-training dataset. It is evaluated on **MV-ScanQA**, a novel 3D question answering dataset designed to rigorously test multi-view compositional reasoning.
10
+
11
+ LEGO achieves state-of-the-art performance on MV-ScanQA, as well as on existing benchmarks for 3D dense captioning and question answering.
12
+
13
+ This model was presented in the paper [Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset](https://huggingface.co/papers/2508.11058).
14
+
15
+ - 🏠 [Project Page](https://matthewdm0816.github.io/tripalign-mvscanqa)
16
+ - 💻 [GitHub Repository](https://github.com/matthewdm0816/MV-ScanQA-TripAlign)
17
+
18
+ <div align="center">
19
+ <img src="https://raw.githubusercontent.com/matthewdm0816/MV-ScanQA-TripAlign/main/docs/teasor-mm-lego.svg" alt="LEGO Teaser Image" width="70%"/>
20
+ </div>
21
+
22
+ ## Overview of LEGO, MV-ScanQA, and TripAlign
23
+
24
+ The **MV-ScanQA** dataset addresses limitations in existing 3D vision-language datasets by introducing questions that explicitly require integrating information from multiple views, thus rigorously testing multi-view compositional reasoning over distant objects.
25
+
26
+ To facilitate training for such demanding scenarios, the **TripAlign** dataset is introduced. This large-scale, low-cost 2D-3D-language pre-training corpus contains 1M `<2D view, set of 3D objects, text>` triplets, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations.
27
+
28
+ **LEGO** (Large-scale Multi-View Grounding Objective) is the baseline method developed to tackle the multi-view reasoning challenge in MV-ScanQA. It transfers knowledge from pre-trained 2D LVLMs (like Fuyu-8B, which this model fine-tunes) to the 3D domain with TripAlign.
29
+
30
+ ## Usage
31
+
32
+ This model is a PEFT (Parameter-Efficient Fine-Tuning) LoRA adapter built on top of the `adept/fuyu-8b` base model. You can load and use it with the `transformers` and `peft` libraries.
33
+
34
+ First, ensure you have the necessary libraries installed:
35
+
36
+ ```bash
37
+ pip install transformers accelerate peft torch torchvision pillow
38
+ ```
39
+
40
+ Below is a sample code for inference. Please note that the image pre-processing functions (`build_transform`, `find_closest_aspect_ratio`, `dynamic_preprocess`, `load_image`) are adapted from the original repository's usage patterns for Fuyu-based models.
41
+
42
+ ```python
43
+ import numpy as np
44
+ import torch
45
+ import torchvision.transforms as T
46
+ from PIL import Image
47
+ from torchvision.transforms.functional import InterpolationMode
48
+ from transformers import AutoModelForCausalLM, AutoTokenizer
49
+ from peft import PeftModel, PeftConfig
50
+
51
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
52
+ IMAGENET_STD = (0.229, 0.224, 0.225)
53
+
54
+ def build_transform(input_size):
55
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
56
+ transform = T.Compose([
57
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
58
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
59
+ T.ToTensor(),
60
+ T.Normalize(mean=MEAN, std=STD)
61
+ ])
62
+ return transform
63
+
64
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
65
+ best_ratio_diff = float('inf')
66
+ best_ratio = (1, 1)
67
+ area = width * height
68
+ for ratio in target_ratios:
69
+ target_aspect_ratio = ratio[0] / ratio[1]
70
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
71
+ if ratio_diff < best_ratio_diff:
72
+ best_ratio_diff = ratio_diff
73
+ best_ratio = ratio
74
+ elif ratio_diff == best_ratio_diff:
75
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
76
+ best_ratio = ratio
77
+ return best_ratio
78
+
79
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
80
+ orig_width, orig_height = image.size
81
+ aspect_ratio = orig_width / orig_height
82
+
83
+ target_ratios = set(
84
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
85
+ i * j <= max_num and i * j >= min_num)
86
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
87
+
88
+ target_aspect_ratio = find_closest_aspect_ratio(
89
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
90
+
91
+ target_width = image_size * target_aspect_ratio[0]
92
+ target_height = image_size * target_aspect_ratio[1]
93
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
94
+
95
+ resized_img = image.resize((target_width, target_height))
96
+ processed_images = []
97
+ for i in range(blocks):
98
+ box = (
99
+ (i % (target_width // image_size)) * image_size,
100
+ (i // (target_width // image_size)) * image_size,
101
+ ((i % (target_width // image_size)) + 1) * image_size,
102
+ ((i // (target_width // image_size)) + 1) * image_size
103
+ )
104
+ split_img = resized_img.crop(box)
105
+ processed_images.append(split_img)
106
+ assert len(processed_images) == blocks
107
+ if use_thumbnail and len(processed_images) != 1:
108
+ thumbnail_img = image.resize((image_size, image_size))
109
+ processed_images.append(thumbnail_img)
110
+ return processed_images
111
+
112
+ def load_image(image_file, input_size=448, max_num=12):
113
+ image = Image.open(image_file).convert('RGB')
114
+ transform = build_transform(input_size=input_size)
115
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
116
+ pixel_values = [transform(image) for image in images]
117
+ pixel_values = torch.stack(pixel_values)
118
+ return pixel_values
119
+
120
+ # Define the base model and the LoRA adapter ID
121
+ base_model_name_or_path = "adept/fuyu-8b"
122
+ # Replace 'your-org/your-repo' with the actual model ID on Hugging Face Hub
123
+ peft_model_id = "your-org/your-repo" # e.g., kmichiru/LEGO
124
+
125
+ # Load the base model
126
+ print(f"Loading base model: {base_model_name_or_path}...")
127
+ base_model = AutoModelForCausalLM.from_pretrained(
128
+ base_model_name_or_path,
129
+ torch_dtype=torch.bfloat16,
130
+ low_cpu_mem_usage=True,
131
+ trust_remote_code=True,
132
+ device_map="auto" # Use 'auto' to load across available devices
133
+ )
134
+ tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path, trust_remote_code=True, use_fast=False)
135
+
136
+ # Load the PEFT adapter weights on top of the base model
137
+ print(f"Loading LoRA adapter: {peft_model_id}...")
138
+ model = PeftModel.from_pretrained(base_model, peft_model_id).eval()
139
+ print("Model loaded successfully!")
140
+
141
+ # Example usage (replace with your image path and question)
142
+ # You might need to download a sample image, e.g., from the GitHub repo
143
+ # A dummy image for testing:
144
+ # from PIL import ImageDraw
145
+ # dummy_image = Image.new('RGB', (800, 600), color = 'red')
146
+ # draw = ImageDraw.Draw(dummy_image)
147
+ # draw.text((10,10), "Sample Image", fill=(0,0,0))
148
+ # dummy_image.save("sample_image.png")
149
+ image_path = "sample_image.png" # Replace with path to a real image
150
+ if not Path(image_path).exists():
151
+ print(f"Warning: Image '{image_path}' not found. Please provide a valid image path or create a dummy image.")
152
+ # Exit or handle gracefully if no image is available for execution
153
+ exit()
154
+
155
+ pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda() # Ensure image is on GPU
156
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
157
+
158
+ question = "Describe the main objects in this 3D scene." # Example question
159
+ # For a Fuyu model, the prompt format might be specific. Refer to Fuyu documentation.
160
+ # This example uses a basic chat format.
161
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
162
+ print(f'User: {question}
163
+ Assistant: {response}')
164
+
165
+ # Example for 3D question answering (assuming the model outputs bounding box coordinates)
166
+ question_with_bbox = "What is the bounding box of the chair in this scene?"
167
+ response_bbox, history_bbox = model.chat(tokenizer, pixel_values, question_with_bbox, generation_config, history=None, return_history=True)
168
+ print(f'User: {question_with_bbox}
169
+ Assistant: {response_bbox}')
170
+ ```
171
+
172
+ ## Citation
173
+
174
+ If you find this codebase useful, please consider citing our work:
175
+
176
+ ```bibtex
177
+ @inproceedings{mo2025mvscanqa,
178
+ title={Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset},
179
+ author={Mo, Wentao and Chen, QingChao and Peng, Yuxin and Huang, Siyuan and Liu, Yang},
180
+ booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
181
+ year={2025},
182
+ }
183
+ ```
184
+
185
+ ## License
186
+
187
+ This code repository and datasets are licensed under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
188
+
189
+ Copyright (c) 2025 Wentao Mo.