BIM-CLIP Model Weights
Figure 1: Overview of the BIM-CLIP framework.
Figure 2: BIM-CLIP workflow and downstream applications.
Given heterogeneous inputs, the framework encodes each modality through dedicated encoders and aligns them within a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.
Model weights for BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition.
Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
Xi'an University of Technology
Paper (Preprint) | GitHub | BIMCompNet Dataset
Repository Structure
bim-clip-weights/
│
├── README.md
│
├── BIMCompNet/                  # Models trained on BIMCompNet
│   ├── multimodal/
│   │   ├── best_100.mdl         # BIMCompNet-100 (42 classes), multimodal
│   │   ├── best_500.mdl         # BIMCompNet-500 (31 classes), multimodal
│   │   └── best_1000.mdl        # BIMCompNet-1000 (24 classes), multimodal
│   └── single_modal/
│       ├── best_100.mdl         # BIMCompNet-100 (42 classes), single modality
│       ├── best_500.mdl         # BIMCompNet-500 (31 classes), single modality
│       └── best_1000.mdl        # BIMCompNet-1000 (24 classes), single modality
│
├── IFCNet/
│   └── best_ifcnet.mdl          # Multimodal, trained on IFCNet (20 classes)
│
├── ModelNet/
│   ├── best_10.mdl              # Multimodal, ModelNet-10
│   ├── best_40.mdl              # Multimodal, ModelNet-40
│   ├── ModelNet10.zip           # ModelNet-10 extended with PC + multi-view modalities
│   └── ModelNet40.zip           # ModelNet-40 extended with PC + multi-view modalities
│
└── ULIP2/
    ├── best_ulip2_1000.mdl      # ULIP-2 fine-tuned on BIMCompNet-1000
    └── ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt   # Official pretrained weights (866 MB)
Model Summary
| File | Architecture | Training Set | Classes | Acc (%) | F1 (%) |
|---|---|---|---|---|---|
| `BIMCompNet/multimodal/best_100.mdl` | BIM-CLIP (CMA) | BIMCompNet-100 | 42 | 87.38 | 87.44 |
| `BIMCompNet/multimodal/best_500.mdl` | BIM-CLIP (CMA) | BIMCompNet-500 | 31 | 91.35 | 91.28 |
| `BIMCompNet/multimodal/best_1000.mdl` | BIM-CLIP (CMA) | BIMCompNet-1000 | 24 | 91.79 | 91.83 |
| `BIMCompNet/single_modal/best_100.mdl` | BIM-CLIP (single modality) | BIMCompNet-100 | 42 | – | – |
| `BIMCompNet/single_modal/best_500.mdl` | BIM-CLIP (single modality) | BIMCompNet-500 | 31 | – | – |
| `BIMCompNet/single_modal/best_1000.mdl` | BIM-CLIP (single modality) | BIMCompNet-1000 | 24 | 88.69 | 87.90 |
| `IFCNet/best_ifcnet.mdl` | BIM-CLIP (CMA) | IFCNet | 20 | 91.00 | 90.39 |
| `ModelNet/best_10.mdl` | BIM-CLIP (CMA) | ModelNet-10 | 10 | 95.36 | 95.25* |
| `ModelNet/best_40.mdl` | BIM-CLIP (CMA) | ModelNet-40 | 40 | 92.22 | 90.34* |
| `ULIP2/best_ulip2_1000.mdl` | ULIP-2 | BIMCompNet-1000 | 24 | 90.98 | 91.02 |
| `ULIP2/ULIP-2-PointBERT-…-pretrained.pt` | ULIP-2 (official) | Objaverse | – | – | – |
\* mAP reported for ModelNet in place of F1. – indicates metrics not separately reported in the paper.
Usage
Load and Run Evaluation
Clone the GitHub repo, then:
# Evaluate BIMCompNet-1000 multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
--data_root /path/to/BIMCompNet --index_root /path/to/index \
--set_size 1000 \
--model_path /path/to/BIMCompNet/multimodal/best_1000.mdl \
--embeddings_path embeddings.pt \
--yaml_path ./描述信息.yaml \
--output_dir ./results
# Evaluate IFCNet multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
--ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
--model_path /path/to/IFCNet/best_ifcnet.mdl \
--embeddings_path ifcnet_embeddings.pt \
--yaml_path ./描述信息.yaml \
--output_dir ./results
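Before launching an evaluation run, it can help to confirm that a downloaded checkpoint loads at all. The sketch below assumes the `.mdl` files are ordinary PyTorch checkpoints (a `state_dict`, or a dict wrapping one); the exact keys that `bimclip.py` expects are defined in the GitHub repository.

```python
# Minimal sanity check on a downloaded checkpoint (assumes a PyTorch-serialized file).
import torch

ckpt_path = "BIMCompNet/multimodal/best_1000.mdl"  # hypothetical local path
ckpt = torch.load(ckpt_path, map_location="cpu")

# Inspect the top-level structure to confirm the file loaded cleanly.
if isinstance(ckpt, dict):
    print("Top-level keys:", list(ckpt.keys())[:10])
else:
    print("Loaded object of type:", type(ckpt))
```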
Embeddings
Text embedding files are included in the GitHub repository:
| File | Classes | Dataset |
|---|---|---|
| `embeddings.pt` | 57 | BIMCompNet (all categories) |
| `ifcnet_embeddings.pt` | 20 | IFCNet |
| `model_net_10_embeddings.pt` | 10 | ModelNet-10 |
| `model_net_40_embeddings.pt` | 40 | ModelNet-40 |
Use the matching embeddings file for each dataset. Do not mix across datasets.
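For orientation, the snippet below shows one way such an embeddings file could be used for nearest-anchor classification. It assumes the `.pt` files store a mapping from class names to 1536-dimensional text embeddings; check the GitHub repository for the actual serialization format.

```python
# Illustrative nearest-anchor lookup against a text-embedding file.
# Assumes embeddings.pt is a dict {class_name: 1536-d tensor}; verify before use.
import torch
import torch.nn.functional as F

anchors = torch.load("embeddings.pt", map_location="cpu")
names = list(anchors.keys())
text_matrix = torch.stack([anchors[n].float() for n in names])  # (C, 1536)

component_feature = torch.randn(1536)  # placeholder for an encoder output

sims = F.cosine_similarity(component_feature.unsqueeze(0), text_matrix)
print("Closest class:", names[int(sims.argmax())])
```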
Dataset access
BIMCompNet is hosted by the 606 Lab at Xi'an University of Technology. Visit the link below to apply for access or download:
https://bimcompnet-606lab.xaut.edu.cn/
ModelNet multimodal extension
ModelNet10.zip and ModelNet40.zip contain the original ModelNet meshes extended with point clouds and multi-view images, constructed using our multimodal data pipeline. The directory layout inside each zip is:
ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/                 # Original mesh (.obj)
    │   ├── ply/                 # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/       # 12 edge-rendered views (0.png – 11.png)
    └── test/
        └── (same structure)
Point clouds are uniformly sampled from the mesh surface (1024 points per object). Multi-view images are edge-rendered from 12 fixed viewpoints following the camera placement strategy used in BIMCompNet.
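Should you need to regenerate a point cloud from one of the original meshes, uniform surface sampling with a library such as trimesh produces a comparable result. This is only a rough sketch of the idea, not the exact pipeline used to build the archives, and the sample path below is hypothetical.

```python
# Approximate reconstruction of a 1024-point cloud from a ModelNet mesh using trimesh.
import trimesh

mesh = trimesh.load("ModelNet10/chair/train/obj/chair_0001.obj", force="mesh")  # hypothetical path
points, _ = trimesh.sample.sample_surface(mesh, count=1024)  # uniform sampling over the surface

trimesh.PointCloud(points).export("chair_0001.ply")
```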
Third-Party Weights (ULIP-2 baseline)
The official ULIP-2 pretrained PointBERT weights (866 MB) are included in this repository at:
ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
Alternatively, download directly from the original source:
https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
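If you prefer to fetch the file programmatically, something along these lines should work with the huggingface_hub client (the repo id and file path are read off the URL above):

```python
# Download the official ULIP-2 PointBERT weights from the Hugging Face dataset repo.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="SFXX/ulip",
    repo_type="dataset",
    filename="ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt",
)
print("Saved to:", local_path)
```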
Architecture Overview
BIM-CLIP encodes each modality with a dedicated encoder (ViT for images, PointNet for point clouds, MeshNet for meshes), projects the features into a shared 1536-dimensional language embedding space via contrastive alignment against text-embedding-ada-002 anchors, and fuses them through a Language-Guided Cross-Modal Attention (CMA) module. During fine-tuning, only the CMA module (10.62M parameters) is updated.
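For orientation only, the following sketch shows the general shape of such a language-guided fusion block in PyTorch: the text anchor acts as the query attending over the stacked modality features. Dimensions follow the description above, but the layer count, parameterization, and exact fusion logic of the released CMA module are those in the GitHub repository, not this sketch.

```python
# Minimal sketch of language-guided cross-modal attention (CMA) fusion.
# Assumes 1536-d projected features per modality and a 1536-d text anchor.
import torch
import torch.nn as nn

class LanguageGuidedCMA(nn.Module):
    def __init__(self, dim: int = 1536, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_anchor: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        # text_anchor: (B, 1, dim) language query
        # modality_tokens: (B, M, dim) stacked image / point-cloud / mesh features
        fused, _ = self.attn(text_anchor, modality_tokens, modality_tokens)
        return self.norm(fused + text_anchor).squeeze(1)  # (B, dim) fused representation

cma = LanguageGuidedCMA()
out = cma(torch.randn(2, 1, 1536), torch.randn(2, 3, 1536))
print(out.shape)  # torch.Size([2, 1536])
```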
Citation
@article{meng2026bimclip,
title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
journal={[to-be-updated upon acceptance]},
year={2026}
}

