Add comprehensive model card for LEGO (MV-ScanQA, TripAlign)

This PR adds a comprehensive model card for the LEGO model, presented in the paper [Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset](https://huggingface.co/papers/2508.11058).

It includes:
- The appropriate `pipeline_tag`: `image-text-to-text`, allowing users to find the model easily on the Hub.
- The `library_name`: `transformers`, indicating compatibility with the 🤗 Transformers library.
- The `license`: `CC-BY-4.0` as specified in the repository.
- Links to the paper, project page, and the official GitHub repository for more detailed information and code.
- An overview of the model and its capabilities based on the paper abstract and GitHub README.
- A sample Python usage example demonstrating how to load and use the LoRA adapter with its Fuyu base model.
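
The metadata fields listed above can be sanity-checked before merging. The snippet below is a minimal sketch, not part of the PR: `parse_front_matter`, `README_HEAD`, and `EXPECTED` are hypothetical names introduced for illustration. It assumes the front matter holds only flat `key: value` pairs, as in this README, and deliberately avoids a full YAML parser.

```python
def parse_front_matter(readme_text: str) -> dict:
    """Parse flat `key: value` pairs from a leading `---` delimited block.

    Hypothetical helper for illustration; it does not handle nested YAML.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return metadata  # closing delimiter reached
        if ":" in line:
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    return {}  # opening delimiter without a closing one


# The first lines of the README added in this PR.
README_HEAD = """---
license: cc-by-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# LEGO: A Model for Multi-View 3D Scene Understanding
"""

# Values taken from the PR description above.
EXPECTED = {
    "license": "cc-by-4.0",
    "pipeline_tag": "image-text-to-text",
    "library_name": "transformers",
}

metadata = parse_front_matter(README_HEAD)
missing = {k: v for k, v in EXPECTED.items() if metadata.get(k) != v}
print("front matter OK" if not missing else f"mismatched fields: {missing}")
```

If the README used nested keys or lists in its front matter, a real YAML parser would be the safer choice.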

Please review and merge this PR if everything looks good.

Files changed (1)

README.md +189 -0

README.md ADDED

	@@ -0,0 +1,189 @@

+---
+license: cc-by-4.0
+pipeline_tag: image-text-to-text
+library_name: transformers
+---
+# LEGO: A Model for Multi-View 3D Scene Understanding
+This repository contains the official weights for **LEGO**, a baseline method for multi-view reasoning in 3D scene understanding. LEGO leverages knowledge from pre-trained 2D LVLMs (specifically fine-tuning a Fuyu-8B model) and is trained using the **TripAlign** pre-training dataset. It is evaluated on **MV-ScanQA**, a novel 3D question answering dataset designed to rigorously test multi-view compositional reasoning.
+LEGO achieves state-of-the-art performance on MV-ScanQA, as well as on existing benchmarks for 3D dense captioning and question answering.
+This model was presented in the paper [Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset](https://huggingface.co/papers/2508.11058).
+- 🏠 [Project Page](https://matthewdm0816.github.io/tripalign-mvscanqa)
+- 💻 [GitHub Repository](https://github.com/matthewdm0816/MV-ScanQA-TripAlign)
+<div align="center">
+  <img src="https://raw.githubusercontent.com/matthewdm0816/MV-ScanQA-TripAlign/main/docs/teasor-mm-lego.svg" alt="LEGO Teaser Image" width="70%"/>
+</div>
+## Overview of LEGO, MV-ScanQA, and TripAlign
+The **MV-ScanQA** dataset addresses limitations in existing 3D vision-language datasets by introducing questions that explicitly require integrating information from multiple views, thus rigorously testing multi-view compositional reasoning over distant objects.
+To facilitate training for such demanding scenarios, the **TripAlign** dataset is introduced. This large-scale, low-cost 2D-3D-language pre-training corpus contains 1M `<2D view, set of 3D objects, text>` triplets, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations.
+**LEGO** (Large-scale Multi-View Grounding Objective) is the baseline method developed to tackle the multi-view reasoning challenge in MV-ScanQA. It transfers knowledge from pre-trained 2D LVLMs (like Fuyu-8B, which this model fine-tunes) to the 3D domain with TripAlign.
+## Usage
+This model is a PEFT (Parameter-Efficient Fine-Tuning) LoRA adapter built on top of the `adept/fuyu-8b` base model. You can load and use it with the `transformers` and `peft` libraries.
+First, ensure you have the necessary libraries installed:
+```bash
+pip install transformers accelerate peft torch torchvision pillow
+```
+Below is sample code for inference. The image pre-processing functions (`build_transform`, `find_closest_aspect_ratio`, `dynamic_preprocess`, `load_image`) are adapted from the original repository's usage patterns for Fuyu-based models.
+```python
+from pathlib import Path  # needed for the image-path check below
+import torch
+import torchvision.transforms as T
+from PIL import Image
+from torchvision.transforms.functional import InterpolationMode
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+IMAGENET_MEAN = (0.485, 0.456, 0.406)
+IMAGENET_STD = (0.229, 0.224, 0.225)
+def build_transform(input_size):
+    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+    transform = T.Compose([
+        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+        T.ToTensor(),
+        T.Normalize(mean=MEAN, std=STD)
+    ])
+    return transform
+def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+    best_ratio_diff = float('inf')
+    best_ratio = (1, 1)
+    area = width * height
+    for ratio in target_ratios:
+        target_aspect_ratio = ratio[0] / ratio[1]
+        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+        if ratio_diff < best_ratio_diff:
+            best_ratio_diff = ratio_diff
+            best_ratio = ratio
+        elif ratio_diff == best_ratio_diff:
+            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                best_ratio = ratio
+    return best_ratio
+def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
+    orig_width, orig_height = image.size
+    aspect_ratio = orig_width / orig_height
+    target_ratios = set(
+        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+        i * j <= max_num and i * j >= min_num)
+    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+    target_aspect_ratio = find_closest_aspect_ratio(
+        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+    target_width = image_size * target_aspect_ratio[0]
+    target_height = image_size * target_aspect_ratio[1]
+    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+    resized_img = image.resize((target_width, target_height))
+    processed_images = []
+    for i in range(blocks):
+        box = (
+            (i % (target_width // image_size)) * image_size,
+            (i // (target_width // image_size)) * image_size,
+            ((i % (target_width // image_size)) + 1) * image_size,
+            ((i // (target_width // image_size)) + 1) * image_size
+        )
+        split_img = resized_img.crop(box)
+        processed_images.append(split_img)
+    assert len(processed_images) == blocks
+    if use_thumbnail and len(processed_images) != 1:
+        thumbnail_img = image.resize((image_size, image_size))
+        processed_images.append(thumbnail_img)
+    return processed_images
+def load_image(image_file, input_size=448, max_num=12):
+    image = Image.open(image_file).convert('RGB')
+    transform = build_transform(input_size=input_size)
+    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+    pixel_values = [transform(image) for image in images]
+    pixel_values = torch.stack(pixel_values)
+    return pixel_values
+# Define the base model and the LoRA adapter ID
+base_model_name_or_path = "adept/fuyu-8b"
+# Replace 'your-org/your-repo' with the actual model ID on Hugging Face Hub
+peft_model_id = "your-org/your-repo" # e.g., kmichiru/LEGO
+# Load the base model
+print(f"Loading base model: {base_model_name_or_path}...")
+base_model = AutoModelForCausalLM.from_pretrained(
+    base_model_name_or_path,
+    torch_dtype=torch.bfloat16,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True,
+    device_map="auto" # Use 'auto' to load across available devices
+)
+tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path, trust_remote_code=True, use_fast=False)
+# Load the PEFT adapter weights on top of the base model
+print(f"Loading LoRA adapter: {peft_model_id}...")
+model = PeftModel.from_pretrained(base_model, peft_model_id).eval()
+print("Model loaded successfully!")
+# Example usage (replace with your image path and question)
+# You might need to download a sample image, e.g., from the GitHub repo
+# A dummy image for testing:
+# from PIL import ImageDraw
+# dummy_image = Image.new('RGB', (800, 600), color = 'red')
+# draw = ImageDraw.Draw(dummy_image)
+# draw.text((10,10), "Sample Image", fill=(0,0,0))
+# dummy_image.save("sample_image.png")
+image_path = "sample_image.png" # Replace with path to a real image
+if not Path(image_path).exists():
+    print(f"Warning: Image '{image_path}' not found. Please provide a valid image path or create a dummy image.")
+    # Exit or handle gracefully if no image is available for execution
+    exit()
+pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda() # Ensure image is on GPU
+generation_config = dict(max_new_tokens=1024, do_sample=True)
+question = "Describe the main objects in this 3D scene." # Example question
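+# NOTE: `chat` follows an InternVL-style interface and is not defined by the stock
+# `FuyuForCausalLM` class in transformers. If it is unavailable, build the inputs with
+# `FuyuProcessor` and call `model.generate` instead (refer to the Fuyu documentation
+# for the expected prompt format).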
+response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+print(f'User: {question}\nAssistant: {response}')
+# Example for 3D question answering (assuming the model outputs bounding box coordinates)
+question_with_bbox = "What is the bounding box of the chair in this scene?"
+response_bbox, history_bbox = model.chat(tokenizer, pixel_values, question_with_bbox, generation_config, history=None, return_history=True)
+print(f'User: {question_with_bbox}\nAssistant: {response_bbox}')
+```
+## Citation
+If you find this codebase useful, please consider citing our work:
+```bibtex
+@inproceedings{mo2025mvscanqa,
+  title={Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset},
+  author={Mo, Wentao and Chen, QingChao and Peng, Yuxin and Huang, Siyuan and Liu, Yang},
+  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
+  year={2025},
+}
+```
+## License
+This code repository and datasets are licensed under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
+Copyright (c) 2025 Wentao Mo.