---
license: mit
language:
- en
- zh
tags:
- 3d
- bim
- multimodal
- contrastive-learning
- point-cloud
- mesh
- vision-language
- building-information-modeling
pipeline_tag: image-classification
---
# BIM-CLIP Model Weights
**Figure 1: Overview of the BIM-CLIP framework.**
![Figure1](图/huggingface_README/Figure1.png)
**Figure 2: BIM-CLIP workflow and downstream applications.**
> Given heterogeneous inputs, the framework encodes each modality with a dedicated encoder and aligns the resulting features within a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.
![Figure3](图/huggingface_README/Figure3.png)
Model weights for **BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition**.
> Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
> Xi'an University of Technology
📄 [Paper (Preprint)](https://github.com/[to-be-updated]) | 💻 [GitHub](https://github.com/[to-be-updated]) | 🗂️ [BIMCompNet Dataset](https://bimcompnet-606lab.xaut.edu.cn/)
---
## Repository Structure
```
bim-clip-weights/
│
├── README.md
│
├── BIMCompNet/                  # Models trained on BIMCompNet
│   ├── multimodal/
│   │   ├── best_100.mdl         # BIMCompNet-100 (42 classes), multimodal
│   │   ├── best_500.mdl         # BIMCompNet-500 (31 classes), multimodal
│   │   └── best_1000.mdl        # BIMCompNet-1000 (24 classes), multimodal
│   └── single_modal/
│       ├── best_100.mdl         # BIMCompNet-100 (42 classes), single modality
│       ├── best_500.mdl         # BIMCompNet-500 (31 classes), single modality
│       └── best_1000.mdl        # BIMCompNet-1000 (24 classes), single modality
│
├── IFCNet/
│   └── best_ifcnet.mdl          # Multimodal, trained on IFCNet (20 classes)
│
├── ModelNet/
│   ├── best_10.mdl              # Multimodal, ModelNet-10
│   ├── best_40.mdl              # Multimodal, ModelNet-40
│   ├── ModelNet10.zip           # ModelNet-10 extended with PC + multi-view modalities
│   └── ModelNet40.zip           # ModelNet-40 extended with PC + multi-view modalities
│
└── ULIP2/
    ├── best_ulip2_1000.mdl      # ULIP-2 fine-tuned on BIMCompNet-1000
    └── ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt  # Official pretrained weights (866 MB)
```
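The BIMCompNet checkpoints follow a regular naming pattern, so a repo-relative path can be derived from the subset size and modality. The sketch below only mirrors the tree above; the helper name `bimcompnet_weight_path` is hypothetical and not part of the BIM-CLIP code.

```python
from pathlib import PurePosixPath

# Class counts per BIMCompNet subset, copied from the tree above.
BIMCOMPNET_CLASSES = {100: 42, 500: 31, 1000: 24}


def bimcompnet_weight_path(set_size: int, multimodal: bool = True) -> str:
    """Return the repo-relative path of a BIMCompNet checkpoint.

    Hypothetical helper: it encodes nothing beyond the directory layout above.
    """
    if set_size not in BIMCOMPNET_CLASSES:
        raise ValueError(f"unknown BIMCompNet subset: {set_size}")
    folder = "multimodal" if multimodal else "single_modal"
    return str(PurePosixPath("BIMCompNet") / folder / f"best_{set_size}.mdl")


print(bimcompnet_weight_path(1000))                   # multimodal checkpoint
print(bimcompnet_weight_path(100, multimodal=False))  # single-modality checkpoint
```

The returned string can be passed as the `filename` argument of `huggingface_hub.hf_hub_download`, together with this repository's id, to fetch a single checkpoint instead of the whole repository.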
---
## Model Summary
| File | Architecture | Training Set | Classes | Acc (%) | F1 (%) |
|------|-------------|--------------|---------|---------|--------|
| `BIMCompNet/multimodal/best_100.mdl` | BIM-CLIP (CMA) | BIMCompNet-100 | 42 | 87.38 | 87.44 |
| `BIMCompNet/multimodal/best_500.mdl` | BIM-CLIP (CMA) | BIMCompNet-500 | 31 | 91.35 | 91.28 |
| `BIMCompNet/multimodal/best_1000.mdl` | BIM-CLIP (CMA) | BIMCompNet-1000 | 24 | 91.79 | 91.83 |
| `BIMCompNet/single_modal/best_100.mdl` | BIM-CLIP (single modality) | BIMCompNet-100 | 42 | n/a | n/a |
| `BIMCompNet/single_modal/best_500.mdl` | BIM-CLIP (single modality) | BIMCompNet-500 | 31 | n/a | n/a |
| `BIMCompNet/single_modal/best_1000.mdl` | BIM-CLIP (single modality) | BIMCompNet-1000 | 24 | 88.69 | 87.90 |
| `IFCNet/best_ifcnet.mdl` | BIM-CLIP (CMA) | IFCNet | 20 | 91.00 | 90.39 |
| `ModelNet/best_10.mdl` | BIM-CLIP (CMA) | ModelNet-10 | 10 | 95.36 | 95.25\* |
| `ModelNet/best_40.mdl` | BIM-CLIP (CMA) | ModelNet-40 | 40 | 92.22 | 90.34\* |
| `ULIP2/best_ulip2_1000.mdl` | ULIP-2 | BIMCompNet-1000 | 24 | 90.98 | 91.02 |
| `ULIP2/ULIP-2-PointBERT-…-pretrained.pt` | ULIP-2 (official) | Objaverse | n/a | n/a | n/a |

\* For the ModelNet rows, the value reported is mAP. "n/a" indicates metrics not separately reported in the paper.
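To read the BIMCompNet-1000 rows side by side, the short script below computes how far the multimodal CMA model is ahead of the single-modality model and the ULIP-2 baseline. The numbers are copied from the table; the dictionary and function names are illustrative only.

```python
from typing import Tuple

# (accuracy %, F1 %) on BIMCompNet-1000, copied from the Model Summary table.
RESULTS_1000 = {
    "BIM-CLIP (CMA)": (91.79, 91.83),
    "BIM-CLIP (single modality)": (88.69, 87.90),
    "ULIP-2": (90.98, 91.02),
}


def gap(reference: str, other: str) -> Tuple[float, float]:
    """Return (accuracy, F1) of `reference` minus `other`, in percentage points."""
    ref_acc, ref_f1 = RESULTS_1000[reference]
    acc, f1 = RESULTS_1000[other]
    return round(ref_acc - acc, 2), round(ref_f1 - f1, 2)


for name in ("BIM-CLIP (single modality)", "ULIP-2"):
    d_acc, d_f1 = gap("BIM-CLIP (CMA)", name)
    print(f"CMA vs {name}: +{d_acc:.2f} acc, +{d_f1:.2f} F1")
```

On these figures the multimodal model leads the single-modality model by 3.10 points of accuracy and 3.93 points of F1, and leads ULIP-2 by 0.81 points on both metrics.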
---
## Usage
### Load and Run Evaluation
Clone the GitHub repository, then run the evaluation script from its root:
```bash
# Evaluate BIMCompNet-1000 multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
--data_root /path/to/BIMCompNet --index_root /path/to/index \
--set_size 1000 \
--model_path /path/to/BIMCompNet/multimodal/best_1000.mdl \
--embeddings_path embeddings.pt \
--yaml_path ./描述俑息.yaml \
--output_dir ./results
# Evaluate IFCNet multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
--ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
--model_path /path/to/IFCNet/best_ifcnet.mdl \
--embeddings_path ifcnet_embeddings.pt \
--yaml_path ./描述俑息.yaml \
--output_dir ./results
```
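To evaluate all three BIMCompNet subsets in one go, the command above can be assembled programmatically. The sketch below uses only the flags shown in the example; it assumes that `--set_size` also accepts 100 and 500 (the example only shows 1000), and the function name, the `weights_root` argument, and the per-subset output directory are illustrative choices rather than part of `bimclip.py`.

```python
import shlex
from typing import List


def eval_command(set_size: int, weights_root: str, yaml_path: str) -> List[str]:
    """Build the multimodal BIMCompNet evaluation command as an argv list."""
    return [
        "python", "bimclip.py",
        "--mode", "eval",
        "--data_type", "MULTI_MODAL",
        "--data_root", "/path/to/BIMCompNet",
        "--index_root", "/path/to/index",
        "--set_size", str(set_size),  # assumption: 100 and 500 are valid values
        "--model_path", f"{weights_root}/BIMCompNet/multimodal/best_{set_size}.mdl",
        "--embeddings_path", "embeddings.pt",
        "--yaml_path", yaml_path,
        "--output_dir", f"./results/{set_size}",  # one folder per subset (our choice)
    ]


for size in (100, 500, 1000):
    # Replace the placeholder YAML path with the description file from the repo.
    print(shlex.join(eval_command(size, "/path/to", "/path/to/descriptions.yaml")))
```

Each list can be handed to `subprocess.run(cmd, check=True)` from the root of the cloned repository once the placeholder paths are replaced.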
### Embeddings
Text embedding files are included in the GitHub repository:
| File | Classes | Dataset |
|------|---------|---------|
| `embeddings.pt` | 57 | BIMCompNet (all categories) |
| `ifcnet_embeddings.pt` | 20 | IFCNet |
| `model_net_10_embeddings.pt` | 10 | ModelNet-10 |
| `model_net_40_embeddings.pt` | 40 | ModelNet-40 |
> Use the matching embeddings file for each dataset. Do **not** mix across datasets.
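A small guard can catch a mismatched embeddings file before an evaluation run. The sketch below encodes only the table above; `check_embeddings` is a hypothetical helper, and how the class count is read from a file depends on the file's internal format, which this README does not describe.

```python
# Embeddings file and class count per dataset, copied from the table above.
EMBEDDINGS = {
    "BIMCompNet": ("embeddings.pt", 57),
    "IFCNet": ("ifcnet_embeddings.pt", 20),
    "ModelNet-10": ("model_net_10_embeddings.pt", 10),
    "ModelNet-40": ("model_net_40_embeddings.pt", 40),
}


def check_embeddings(dataset: str, embeddings_file: str, num_classes: int) -> None:
    """Raise ValueError if the file name or class count does not match `dataset`."""
    expected_file, expected_classes = EMBEDDINGS[dataset]
    if embeddings_file != expected_file:
        raise ValueError(
            f"{dataset} expects {expected_file}, got {embeddings_file}"
        )
    if num_classes != expected_classes:
        raise ValueError(
            f"{dataset} expects {expected_classes} classes, got {num_classes}"
        )


check_embeddings("IFCNet", "ifcnet_embeddings.pt", 20)  # passes silently
```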
### Dataset Access
BIMCompNet is hosted by the 606 Lab at Xi'an University of Technology. Visit the link below to apply for access or download:
👉 **[https://bimcompnet-606lab.xaut.edu.cn/](https://bimcompnet-606lab.xaut.edu.cn/)**
### ModelNet Multimodal Extension
`ModelNet10.zip` and `ModelNet40.zip` contain the original ModelNet meshes extended with point clouds and multi-view images, constructed using our multimodal data pipeline. The directory layout inside each zip is:
```
ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/              # Original mesh (.obj)
    │   ├── ply/              # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/    # 12 edge-rendered views (0.png to 11.png)
    └── test/
        └── (same structure)
```
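One way to walk this layout after unzipping is sketched below. It relies on one assumption that the tree does not state: that the files in `obj/` and `ply/` are named after the `{sample}` folder under `png/` (for example `{sample}.obj`). The function name `iter_samples` is illustrative.

```python
from pathlib import Path
from typing import Dict, Iterator


def iter_samples(root: Path, split: str = "train") -> Iterator[Dict[str, object]]:
    """Yield one record per sample found under {root}/{class}/{split}/png/."""
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        png_root = class_dir / split / "png"
        if not png_root.is_dir():
            continue  # class has no data for this split
        for sample_dir in sorted(p for p in png_root.iterdir() if p.is_dir()):
            name = sample_dir.name
            yield {
                "label": class_dir.name,
                # Assumed file names: mesh and point cloud share the sample name.
                "mesh": class_dir / split / "obj" / f"{name}.obj",
                "points": class_dir / split / "ply" / f"{name}.ply",
                # 12 edge-rendered views, 0.png to 11.png.
                "views": [sample_dir / "Edges" / f"{i}.png" for i in range(12)],
            }
```

Calling `list(iter_samples(Path("ModelNet10"), "test"))` would then give the test split as a list of records, each pairing a class label with its mesh, point cloud, and view paths.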
Point clouds are uniformly sampled from the mesh surface (1024 points per object). Multi-view images are edge-rendered from 12 fixed viewpoints following the camera placement strategy used in BIMCompNet.
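The README does not state which sampling algorithm the data pipeline uses. As an illustration only, area-weighted triangle sampling is one common way to draw points uniformly from a mesh surface: pick a triangle with probability proportional to its area, then pick a uniform point inside it. All names below are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Area of triangle abc: half the norm of the cross product of two edges."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cross = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(triangles: Sequence[Triangle], n: int = 1024, seed: int = 0) -> List[Point]:
    """Draw n points uniformly from the surface formed by `triangles`."""
    rng = random.Random(seed)
    areas = [triangle_area(*t) for t in triangles]
    # Larger triangles are chosen proportionally more often.
    chosen = rng.choices(list(triangles), weights=areas, k=n)
    points: List[Point] = []
    for a, b, c in chosen:
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)  # the square root keeps samples uniform inside the triangle
        w0, w1, w2 = 1.0 - s, s * (1.0 - r2), s * r2
        points.append(tuple(w0 * a[i] + w1 * b[i] + w2 * c[i] for i in range(3)))
    return points


# Unit square in the z = 0 plane, split into two triangles.
square = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)),
    ((0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
]
cloud = sample_surface(square, n=1024)
print(len(cloud))
```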
### Third-Party Weights (ULIP-2 baseline)
The official ULIP-2 pretrained PointBERT weights (866 MB) are included in this repository at:
```
ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
```
Alternatively, download directly from the original source:
```
https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
```
---
## Architecture Overview
BIM-CLIP uses three modality encoders: ViT for images, PointNet for point clouds, and MeshNet for meshes. Each encoder's output is projected into a shared 1536-dim language embedding space via contrastive alignment against `text-embedding-ada-002` anchors, and the aligned features are then fused by a **Language-Guided Cross-Modal Attention (CMA)** module. During fine-tuning, only the CMA module (10.62M parameters) is updated.
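To make the idea of a shared language embedding space concrete, the sketch below matches a feature vector to the closest class text anchor by cosine similarity. It is a toy illustration, not the BIM-CLIP training loss, the CMA module, or the model's actual classification head: the vectors are 3-dimensional stand-ins for the 1536-dim embeddings, and the class names and function names are made up.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_anchor(feature: Sequence[float], anchors: Dict[str, Sequence[float]]) -> str:
    """Return the class whose text anchor is most similar to `feature`."""
    return max(anchors, key=lambda name: cosine(feature, anchors[name]))


# Toy text anchors (hypothetical classes; real anchors would be 1536-dim).
anchors = {
    "wall": [1.0, 0.0, 0.0],
    "door": [0.0, 1.0, 0.0],
    "window": [0.0, 0.0, 1.0],
}
feature = [0.1, 0.9, 0.2]  # stand-in for a projected encoder output
print(nearest_anchor(feature, anchors))
```

Because every modality is projected into the same space, the same anchors can in principle score image, point-cloud, and mesh features alike.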
---
## Citation
```bibtex
@article{meng2026bimclip,
title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
journal={[to-be-updated upon acceptance]},
year={2026}
}
```