	---
	license: mit
	language:
	- en
	- zh
	tags:
	- 3d
	- bim
	- multimodal
	- contrastive-learning
	- point-cloud
	- mesh
	- vision-language
	- building-information-modeling
	pipeline_tag: image-classification
	---

	# BIM-CLIP Model Weights

	Figure 1: Overview of the BIM-CLIP framework.

	![Figure1](图/huggingface_README/Figure1.png)



	Figure 2: BIM-CLIP workflow and downstream applications.

	> Given heterogeneous inputs, the framework encodes each modality with a dedicated encoder and aligns the resulting features in a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical downstream BIM tasks.

	![Figure3](图/huggingface_README/Figure3.png)



	Model weights for BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition.

	> Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
	> Xi'an University of Technology

	📄 [Paper (Preprint)](https://github.com/[to-be-updated]) | 💻 [GitHub](https://github.com/[to-be-updated]) | 🗂️ [BIMCompNet Dataset](https://bimcompnet-606lab.xaut.edu.cn/)

	---

	## Repository Structure

	```
	bim-clip-weights/
	│
	├── README.md
	│
	├── BIMCompNet/                  # Models trained on BIMCompNet
	│   ├── multimodal/
	│   │   ├── best_100.mdl         # BIMCompNet-100 (42 classes), multimodal
	│   │   ├── best_500.mdl         # BIMCompNet-500 (31 classes), multimodal
	│   │   └── best_1000.mdl        # BIMCompNet-1000 (24 classes), multimodal
	│   └── single_modal/
	│       ├── best_100.mdl         # BIMCompNet-100 (42 classes), single modality
	│       ├── best_500.mdl         # BIMCompNet-500 (31 classes), single modality
	│       └── best_1000.mdl        # BIMCompNet-1000 (24 classes), single modality
	│
	├── IFCNet/
	│   └── best_ifcnet.mdl          # Multimodal, trained on IFCNet (20 classes)
	│
	├── ModelNet/
	│   ├── best_10.mdl              # Multimodal, ModelNet-10
	│   ├── best_40.mdl              # Multimodal, ModelNet-40
	│   ├── ModelNet10.zip           # ModelNet-10 extended with PC + multi-view modalities
	│   └── ModelNet40.zip           # ModelNet-40 extended with PC + multi-view modalities
	│
	└── ULIP2/
	    ├── best_ulip2_1000.mdl      # ULIP-2 fine-tuned on BIMCompNet-1000
	    └── ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt  # Official pretrained weights (866 MB)
	```
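The tree above maps directly onto file paths inside the repository, so a single checkpoint can be fetched without cloning everything. The following is a minimal sketch, assuming the weights are downloaded with `huggingface_hub`; `REPO_ID` is a hypothetical placeholder (this README does not state the repository ID), and `checkpoint_path` and `download_checkpoint` are illustrative helper names, not part of the BIM-CLIP code.

```python
from pathlib import PurePosixPath

# Hypothetical placeholder: replace with the actual Hub repository ID.
REPO_ID = "your-namespace/bim-clip-weights"


def checkpoint_path(set_size: int, multimodal: bool = True) -> str:
    """Return the in-repo path of a BIMCompNet checkpoint, following the tree above."""
    if set_size not in (100, 500, 1000):
        raise ValueError(f"unknown BIMCompNet subset: {set_size}")
    folder = "multimodal" if multimodal else "single_modal"
    return str(PurePosixPath("BIMCompNet") / folder / f"best_{set_size}.mdl")


def download_checkpoint(set_size: int, multimodal: bool = True,
                        repo_id: str = REPO_ID) -> str:
    """Download one checkpoint and return its path in the local cache."""
    # Imported lazily so checkpoint_path() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id,
                           filename=checkpoint_path(set_size, multimodal))
```

Calling `download_checkpoint(1000)` would then return a local file path that can be passed to `--model_path`.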

	---

	## Model Summary

	| File | Architecture | Training Set | Classes | Acc (%) | F1 (%) |
	|------|--------------|--------------|---------|---------|--------|
	| `BIMCompNet/multimodal/best_100.mdl` | BIM-CLIP (CMA) | BIMCompNet-100 | 42 | 87.38 | 87.44 |
	| `BIMCompNet/multimodal/best_500.mdl` | BIM-CLIP (CMA) | BIMCompNet-500 | 31 | 91.35 | 91.28 |
	| `BIMCompNet/multimodal/best_1000.mdl` | BIM-CLIP (CMA) | BIMCompNet-1000 | 24 | 91.79 | 91.83 |
	| `BIMCompNet/single_modal/best_100.mdl` | BIM-CLIP (single modality) | BIMCompNet-100 | 42 | — | — |
	| `BIMCompNet/single_modal/best_500.mdl` | BIM-CLIP (single modality) | BIMCompNet-500 | 31 | — | — |
	| `BIMCompNet/single_modal/best_1000.mdl` | BIM-CLIP (single modality) | BIMCompNet-1000 | 24 | 88.69 | 87.90 |
	| `IFCNet/best_ifcnet.mdl` | BIM-CLIP (CMA) | IFCNet | 20 | 91.00 | 90.39 |
	| `ModelNet/best_10.mdl` | BIM-CLIP (CMA) | ModelNet-10 | 10 | 95.36 | 95.25* |
	| `ModelNet/best_40.mdl` | BIM-CLIP (CMA) | ModelNet-40 | 40 | 92.22 | 90.34* |
	| `ULIP2/best_ulip2_1000.mdl` | ULIP-2 | BIMCompNet-1000 | 24 | 90.98 | 91.02 |
	| `ULIP2/ULIP-2-PointBERT-…-pretrained.pt` | ULIP-2 (official) | Objaverse | — | — | — |

	\* mAP is reported for ModelNet. "—" indicates metrics not separately reported in the paper.

	---

	## Usage

	### Load and Run Evaluation

	Clone the GitHub repo, then:

	```bash
	# Evaluate BIMCompNet-1000 multimodal
	python bimclip.py --mode eval --data_type MULTI_MODAL \
	    --data_root /path/to/BIMCompNet --index_root /path/to/index \
	    --set_size 1000 \
	    --model_path /path/to/BIMCompNet/multimodal/best_1000.mdl \
	    --embeddings_path embeddings.pt \
	    --yaml_path ./描述信息.yaml \
	    --output_dir ./results

	# Evaluate IFCNet multimodal
	python bimclip.py --mode eval --data_type MULTI_MODAL \
	    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
	    --model_path /path/to/IFCNet/best_ifcnet.mdl \
	    --embeddings_path ifcnet_embeddings.pt \
	    --yaml_path ./描述信息.yaml \
	    --output_dir ./results
	```
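To evaluate several BIMCompNet subsets in a row, the command above can be assembled programmatically. This is a rough sketch: the flags are the ones shown in the command, `build_eval_command` is a hypothetical helper, and it assumes that `--set_size` accepts 100 and 500 in the same way it accepts 1000, which this README does not show explicitly.

```python
def build_eval_command(set_size, weights_root, data_root, index_root,
                       embeddings_path="embeddings.pt",
                       yaml_path="./描述信息.yaml",
                       output_dir="./results"):
    """Assemble the BIMCompNet multimodal evaluation command as an argument list."""
    return [
        "python", "bimclip.py", "--mode", "eval", "--data_type", "MULTI_MODAL",
        "--data_root", data_root, "--index_root", index_root,
        "--set_size", str(set_size),
        # Checkpoint location follows the repository layout of the weights.
        "--model_path", f"{weights_root}/BIMCompNet/multimodal/best_{set_size}.mdl",
        "--embeddings_path", embeddings_path,
        "--yaml_path", yaml_path,
        "--output_dir", output_dir,
    ]


# Assumption: the three subset sizes are evaluated with otherwise identical flags.
commands = [
    build_eval_command(size, "/path/to/weights", "/path/to/BIMCompNet", "/path/to/index")
    for size in (100, 500, 1000)
]
```

Each entry of `commands` can be passed to `subprocess.run(cmd, check=True)` from inside the cloned repository.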

	### Embeddings

	Text embedding files are included in the GitHub repository:

	| File | Classes | Dataset |
	|------|---------|---------|
	| `embeddings.pt` | 57 | BIMCompNet (all categories) |
	| `ifcnet_embeddings.pt` | 20 | IFCNet |
	| `model_net_10_embeddings.pt` | 10 | ModelNet-10 |
	| `model_net_40_embeddings.pt` | 40 | ModelNet-40 |

	> Use the matching embeddings file for each dataset. Do not mix across datasets.
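One way to guard against mixing files is to look the embeddings file up by dataset name and to compare the number of loaded class embeddings with the count listed in the table. The sketch below copies the file names and class counts from the table; the helper names are hypothetical, and how the class embeddings are stored inside each `.pt` file is not documented here, so counting them is left to the caller.

```python
# File names and class counts copied from the table above.
EMBEDDINGS = {
    "BIMCompNet": ("embeddings.pt", 57),
    "IFCNet": ("ifcnet_embeddings.pt", 20),
    "ModelNet-10": ("model_net_10_embeddings.pt", 10),
    "ModelNet-40": ("model_net_40_embeddings.pt", 40),
}


def embeddings_for(dataset: str) -> str:
    """Return the embeddings file that matches a dataset name."""
    try:
        return EMBEDDINGS[dataset][0]
    except KeyError:
        raise ValueError(f"no embeddings file listed for {dataset!r}") from None


def check_class_count(dataset: str, num_embeddings: int) -> None:
    """Raise if the number of loaded class embeddings differs from the listed count."""
    expected = EMBEDDINGS[dataset][1]
    if num_embeddings != expected:
        raise ValueError(
            f"{dataset}: expected {expected} class embeddings, got {num_embeddings}"
        )
```

For example, `embeddings_for("IFCNet")` yields the value to pass as `--embeddings_path` when evaluating the IFCNet checkpoint.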

	### Dataset Access

	BIMCompNet is hosted by the 606 Lab at Xi'an University of Technology. Visit the link below to apply for access or to download the dataset:

	👉 [https://bimcompnet-606lab.xaut.edu.cn/](https://bimcompnet-606lab.xaut.edu.cn/)

	### ModelNet Multimodal Extension

	`ModelNet10.zip` and `ModelNet40.zip` contain the original ModelNet meshes extended with point clouds and multi-view images, constructed using our multimodal data pipeline. The directory layout inside each zip is:

	```
	ModelNet{10|40}/
	└── {class}/
	    ├── train/
	    │   ├── obj/              # Original mesh (.obj)
	    │   ├── ply/              # Point cloud sampled from mesh (1024 pts, .ply)
	    │   └── png/
	    │       └── {sample}/
	    │           └── Edges/    # 12 edge-rendered views (0.png – 11.png)
	    └── test/
	        └── (same structure)
	```
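After extracting a zip, the layout above can be traversed with a few lines of Python. The sketch below is illustrative only: `iter_samples` is a hypothetical helper, and it relies solely on the `png/{sample}/Edges/` branch because the file naming inside `obj/` and `ply/` is not listed in this README.

```python
from pathlib import Path


def iter_samples(root, split="train"):
    """Yield (class_name, sample_name, view_paths) for every sample in a split."""
    root = Path(root)
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        png_dir = class_dir / split / "png"
        if not png_dir.is_dir():
            continue  # class folder without this split
        for sample_dir in sorted(p for p in png_dir.iterdir() if p.is_dir()):
            # Views are named 0.png ... 11.png, so sort them numerically.
            views = sorted((sample_dir / "Edges").glob("*.png"),
                           key=lambda p: int(p.stem))
            yield class_dir.name, sample_dir.name, views
```

Iterating with `split="test"` walks the test partition in the same way.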

	Point clouds are uniformly sampled from the mesh surface (1024 points per object). Multi-view images are edge-rendered from 12 fixed viewpoints following the camera placement strategy used in BIMCompNet.
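For readers who want to build comparable point clouds for their own meshes, uniform surface sampling is commonly implemented by choosing triangles with probability proportional to their area and then drawing a uniform point inside each chosen triangle. The sketch below shows that standard technique in plain Python; it is an assumption about how such sampling can be done, not the pipeline actually used to build the zips, and the function names are hypothetical.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle, from the cross product of two edge vectors."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def sample_surface(vertices, faces, n_points=1024, seed=0):
    """Sample points uniformly over the surface of a triangle mesh."""
    rng = random.Random(seed)
    triangles = [tuple(vertices[i] for i in face) for face in faces]
    areas = [triangle_area(*tri) for tri in triangles]
    # Larger triangles receive proportionally more samples.
    chosen = rng.choices(triangles, weights=areas, k=n_points)
    points = []
    for a, b, c in chosen:
        # Square-root trick: uniform barycentric coordinates inside the triangle.
        s = math.sqrt(rng.random())
        r = rng.random()
        w0, w1, w2 = 1.0 - s, s * (1.0 - r), s * r
        points.append(tuple(w0 * a[i] + w1 * b[i] + w2 * c[i] for i in range(3)))
    return points
```

Here `vertices` is a list of `(x, y, z)` tuples and `faces` a list of vertex-index triples, as would be read from an `.obj` file.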

	### Third-Party Weights (ULIP-2 baseline)

	The official ULIP-2 pretrained PointBERT weights (866 MB) are included in this repository at:

	```
	ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
	```

	Alternatively, download directly from the original source:
	```
	https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
	```

	---

	## Architecture Overview

	BIM-CLIP uses three modality encoders: ViT for images, PointNet for point clouds, and MeshNet for meshes. Each encoder's output is projected into a shared 1536-dim language embedding space via contrastive alignment against `text-embedding-ada-002` anchors, and the aligned features are then fused through a Language-Guided Cross-Modal Attention (CMA) module. During fine-tuning, only the CMA module (10.62M parameters) is updated.
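The idea of aligning a feature with per-class text anchors can be illustrated with a generic CLIP-style objective: cosine similarities between the feature and every anchor are turned into a softmax, and the loss is the cross-entropy against the true class. The sketch below uses plain Python lists and toy 3-dim vectors instead of 1536-dim tensors; the temperature of 0.07 and all function names are assumptions, and this is not the authors' training code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_probabilities(feature, text_anchors, temperature=0.07):
    """Softmax over cosine similarities between one feature and each class anchor."""
    f = l2_normalize(feature)
    sims = [sum(a * b for a, b in zip(f, l2_normalize(anchor)))
            for anchor in text_anchors]
    return softmax([s / temperature for s in sims])


def contrastive_loss(feature, text_anchors, target, temperature=0.07):
    """Cross-entropy that pulls the feature toward the anchor of its own class."""
    return -math.log(anchor_probabilities(feature, text_anchors, temperature)[target])


# Toy anchors standing in for per-class text embeddings.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = anchor_probabilities([0.9, 0.1, 0.0], anchors)
```

At inference time the same similarities can be used for classification by taking the class whose anchor has the highest probability.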

	---

	## Citation

	```bibtex
	@article{meng2026bimclip,
	  title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
	  author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
	  journal={[to-be-updated upon acceptance]},
	  year={2026}
	}
	```