| --- |
| license: mit |
| language: |
| - en |
| - zh |
| tags: |
| - 3d |
| - bim |
| - multimodal |
| - contrastive-learning |
| - point-cloud |
| - mesh |
| - vision-language |
| - building-information-modeling |
| pipeline_tag: image-classification |
| --- |
| |
| # BIM-CLIP Model Weights |
|
|
| **Figure 1: Overview of the BIM-CLIP framework.** |
|
|
|  |
|
|
|
|
|
|
| **Figure 2: BIM-CLIP workflow and downstream applications.** |
|
|
| > Given heterogeneous inputs, the framework encodes each modality with a dedicated encoder and aligns the results in a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks. |
|
|
|  |
|
|
|
|
|
|
| Model weights for **BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition**. |
|
|
| > Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei |
| > Xi'an University of Technology |
|
|
| [Paper (Preprint)](https://github.com/[to-be-updated]) | [GitHub](https://github.com/[to-be-updated]) | [BIMCompNet Dataset](https://bimcompnet-606lab.xaut.edu.cn/) |
|
|
| --- |
|
|
| ## Repository Structure |
|
|
| ``` |
| bim-clip-weights/ |
| │ |
| ├── README.md |
| │ |
| ├── BIMCompNet/                  # Models trained on BIMCompNet |
| │   ├── multimodal/ |
| │   │   ├── best_100.mdl         # BIMCompNet-100 (42 classes), multimodal |
| │   │   ├── best_500.mdl         # BIMCompNet-500 (31 classes), multimodal |
| │   │   └── best_1000.mdl        # BIMCompNet-1000 (24 classes), multimodal |
| │   └── single_modal/ |
| │       ├── best_100.mdl         # BIMCompNet-100 (42 classes), single modality |
| │       ├── best_500.mdl         # BIMCompNet-500 (31 classes), single modality |
| │       └── best_1000.mdl        # BIMCompNet-1000 (24 classes), single modality |
| │ |
| ├── IFCNet/ |
| │   └── best_ifcnet.mdl          # Multimodal, trained on IFCNet (20 classes) |
| │ |
| ├── ModelNet/ |
| │   ├── best_10.mdl              # Multimodal, ModelNet-10 |
| │   ├── best_40.mdl              # Multimodal, ModelNet-40 |
| │   ├── ModelNet10.zip           # ModelNet-10 extended with PC + multi-view modalities |
| │   └── ModelNet40.zip           # ModelNet-40 extended with PC + multi-view modalities |
| │ |
| └── ULIP2/ |
|     ├── best_ulip2_1000.mdl      # ULIP-2 fine-tuned on BIMCompNet-1000 |
|     └── ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt # Official pretrained weights (866 MB) |
| ``` |
|
|
| --- |
|
|
| ## Model Summary |
|
|
| | File | Architecture | Training Set | Classes | Acc (%) | F1 (%) | |
| |------|-------------|--------------|---------|---------|--------| |
| | `BIMCompNet/multimodal/best_100.mdl` | BIM-CLIP (CMA) | BIMCompNet-100 | 42 | 87.38 | 87.44 | |
| | `BIMCompNet/multimodal/best_500.mdl` | BIM-CLIP (CMA) | BIMCompNet-500 | 31 | 91.35 | 91.28 | |
| | `BIMCompNet/multimodal/best_1000.mdl` | BIM-CLIP (CMA) | BIMCompNet-1000 | 24 | 91.79 | 91.83 | |
| `BIMCompNet/single_modal/best_100.mdl` | BIM-CLIP (single modality) | BIMCompNet-100 | 42 | - | - | |
| `BIMCompNet/single_modal/best_500.mdl` | BIM-CLIP (single modality) | BIMCompNet-500 | 31 | - | - | |
| `BIMCompNet/single_modal/best_1000.mdl` | BIM-CLIP (single modality) | BIMCompNet-1000 | 24 | 88.69 | 87.90 | |
| `IFCNet/best_ifcnet.mdl` | BIM-CLIP (CMA) | IFCNet | 20 | 91.00 | 90.39 | |
| `ModelNet/best_10.mdl` | BIM-CLIP (CMA) | ModelNet-10 | 10 | 95.36 | 95.25* | |
| `ModelNet/best_40.mdl` | BIM-CLIP (CMA) | ModelNet-40 | 40 | 92.22 | 90.34* | |
| `ULIP2/best_ulip2_1000.mdl` | ULIP-2 | BIMCompNet-1000 | 24 | 90.98 | 91.02 | |
| `ULIP2/ULIP-2-PointBERT-…-pretrained.pt` | ULIP-2 (official) | Objaverse | - | - | - | |
|
|
| *\*Values marked with an asterisk are mAP, reported for ModelNet only. A dash (-) indicates metrics not separately reported in the paper.* |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Load and Run Evaluation |
|
|
| Clone the GitHub repository, then run the evaluation script from its root directory: |
|
|
| ```bash |
| # Evaluate BIMCompNet-1000 multimodal |
| python bimclip.py --mode eval --data_type MULTI_MODAL \ |
| --data_root /path/to/BIMCompNet --index_root /path/to/index \ |
| --set_size 1000 \ |
| --model_path /path/to/BIMCompNet/multimodal/best_1000.mdl \ |
| --embeddings_path embeddings.pt \ |
| --yaml_path ./描述信息.yaml \ |
| --output_dir ./results |
| |
| # Evaluate IFCNet multimodal |
| python bimclip.py --mode eval --data_type MULTI_MODAL \ |
| --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \ |
| --model_path /path/to/IFCNet/best_ifcnet.mdl \ |
| --embeddings_path ifcnet_embeddings.pt \ |
| --yaml_path ./描述信息.yaml \ |
| --output_dir ./results |
| ``` |
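When several checkpoints need evaluating, it can help to assemble the command line programmatically rather than by hand. The following is a minimal sketch under stated assumptions: `build_eval_command` is a hypothetical helper (it is not part of the BIM-CLIP repository), it emits only the flags that appear in the commands above, and the YAML path in the usage example is a placeholder.

```python
import shlex


def build_eval_command(model_path, embeddings_path, yaml_path,
                       output_dir="./results", data_root=None,
                       index_root=None, set_size=None, ifcnet_root=None):
    """Return the argv list for a `bimclip.py --mode eval` run.

    Hypothetical helper: it only emits flags that appear in the
    commands above and does not validate them against bimclip.py.
    """
    # BIMCompNet runs use --data_root; IFCNet runs use --ifcnet_root.
    if (data_root is None) == (ifcnet_root is None):
        raise ValueError("pass exactly one of data_root or ifcnet_root")

    cmd = ["python", "bimclip.py", "--mode", "eval",
           "--data_type", "MULTI_MODAL"]
    if data_root is not None:
        cmd += ["--data_root", str(data_root)]
        if index_root is not None:
            cmd += ["--index_root", str(index_root)]
        if set_size is not None:
            cmd += ["--set_size", str(set_size)]
    else:
        cmd += ["--ifcnet_root", str(ifcnet_root)]
    cmd += ["--model_path", str(model_path),
            "--embeddings_path", str(embeddings_path),
            "--yaml_path", str(yaml_path),
            "--output_dir", str(output_dir)]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        model_path="/path/to/IFCNet/best_ifcnet.mdl",
        embeddings_path="ifcnet_embeddings.pt",
        yaml_path="/path/to/descriptions.yaml",  # placeholder path
        ifcnet_root="/path/to/IFCNetCorePly/IFCNetCore",
    )
    # Print a shell-quoted command; cmd can also go to subprocess.run.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps paths with spaces intact when the result is passed to `subprocess.run`.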
|
|
| ### Embeddings |
|
|
| Text embedding files are included in the GitHub repository: |
|
|
| | File | Classes | Dataset | |
| |------|---------|---------| |
| | `embeddings.pt` | 57 | BIMCompNet (all categories) | |
| | `ifcnet_embeddings.pt` | 20 | IFCNet | |
| | `model_net_10_embeddings.pt` | 10 | ModelNet-10 | |
| | `model_net_40_embeddings.pt` | 40 | ModelNet-40 | |
|
|
| > Use the matching embeddings file for each dataset. Do **not** mix across datasets. |
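To guard against mixing files across datasets, the table above can be encoded as a small lookup. This is only a sketch: `EMBEDDINGS`, `embeddings_for`, and `check_class_count` are hypothetical names, the dataset keys are chosen for illustration, and the internal layout of the `.pt` files is not documented here, so the caller is expected to obtain the class count from the loaded object itself.

```python
# File names and class counts copied from the table above.
EMBEDDINGS = {
    "BIMCompNet": ("embeddings.pt", 57),
    "IFCNet": ("ifcnet_embeddings.pt", 20),
    "ModelNet-10": ("model_net_10_embeddings.pt", 10),
    "ModelNet-40": ("model_net_40_embeddings.pt", 40),
}


def embeddings_for(dataset):
    """Return the embeddings file name that matches `dataset`."""
    try:
        return EMBEDDINGS[dataset][0]
    except KeyError:
        known = ", ".join(sorted(EMBEDDINGS))
        raise ValueError(f"unknown dataset {dataset!r}; expected one of: {known}")


def check_class_count(dataset, num_classes):
    """Raise if a loaded embeddings object has the wrong number of classes.

    `num_classes` is whatever count the caller read from the loaded file;
    how to read it depends on the file layout, which is an open assumption.
    """
    expected = EMBEDDINGS[dataset][1]
    if num_classes != expected:
        raise ValueError(
            f"{dataset}: expected {expected} classes, got {num_classes}; "
            "the embeddings file may belong to a different dataset"
        )


if __name__ == "__main__":
    print(embeddings_for("IFCNet"))
    check_class_count("IFCNet", 20)  # passes silently
```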
|
|
| ### Dataset Access |
|
|
| BIMCompNet is hosted by the 606 Lab at Xi'an University of Technology. Visit the link below to apply for access or download: |
|
|
| **[https://bimcompnet-606lab.xaut.edu.cn/](https://bimcompnet-606lab.xaut.edu.cn/)** |
|
|
| ### ModelNet Multimodal Extension |
|
|
| `ModelNet10.zip` and `ModelNet40.zip` contain the original ModelNet meshes extended with point clouds and multi-view images, constructed using our multimodal data pipeline. The directory layout inside each zip is: |
|
|
| ``` |
| ModelNet{10|40}/ |
| └── {class}/ |
|     ├── train/ |
|     │   ├── obj/              # Original mesh (.obj) |
|     │   ├── ply/              # Point cloud sampled from mesh (1024 pts, .ply) |
|     │   └── png/ |
|     │       └── {sample}/ |
|     │           └── Edges/    # 12 edge-rendered views (0.png – 11.png) |
|     └── test/ |
|         └── (same structure) |
| ``` |
|
|
| Point clouds are uniformly sampled from the mesh surface (1024 points per object). Multi-view images are edge-rendered from 12 fixed viewpoints following the camera placement strategy used in BIMCompNet. |
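For readers who want to reproduce comparable point clouds from their own meshes, uniform surface sampling is commonly done by picking triangles with probability proportional to their area and then drawing a uniform point inside each picked triangle. The sketch below uses only the Python standard library and illustrates that general technique. It is not the authors' pipeline, which may rely on a different sampler or library, and `sample_surface` and its parameters are names chosen here for illustration.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c) via the cross-product norm."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def sample_surface(vertices, faces, n_points=1024, seed=0):
    """Sample points uniformly from a triangle mesh surface.

    vertices: list of (x, y, z) tuples
    faces:    list of (i, j, k) vertex-index triples
    """
    rng = random.Random(seed)
    tris = [(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    areas = [triangle_area(*t) for t in tris]
    # Larger triangles are chosen proportionally more often.
    chosen = rng.choices(tris, weights=areas, k=n_points)
    points = []
    for a, b, c in chosen:
        # The square root makes the barycentric weights uniform over the
        # triangle instead of clustering near vertex a.
        r1, r2 = math.sqrt(rng.random()), rng.random()
        w0, w1, w2 = 1.0 - r1, r1 * (1.0 - r2), r1 * r2
        points.append(tuple(w0 * a[i] + w1 * b[i] + w2 * c[i] for i in range(3)))
    return points


if __name__ == "__main__":
    # Unit square in the z = 0 plane, split into two triangles.
    verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
    tris = [(0, 1, 2), (0, 2, 3)]
    cloud = sample_surface(verts, tris)
    print(len(cloud))
```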
|
|
| ### Third-Party Weights (ULIP-2 baseline) |
|
|
| The official ULIP-2 pretrained PointBERT weights (866 MB) are included in this repository at: |
|
|
| ``` |
| ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt |
| ``` |
|
|
| Alternatively, download directly from the original source: |
| ``` |
| https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt |
| ``` |
|
|
| --- |
|
|
| ## Architecture Overview |
|
|
| BIM-CLIP uses three modality encoders: ViT for images, PointNet for point clouds, and MeshNet for meshes. Each encoder's output is projected into a shared 1536-dimensional language embedding space by contrastive alignment against `text-embedding-ada-002` anchors, and the aligned features are then fused by a **Language-Guided Cross-Modal Attention (CMA)** module. During fine-tuning, only the CMA module (10.62M parameters) is updated. |
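To make "contrastive alignment against text anchors" more concrete, the sketch below computes an InfoNCE-style loss between modality embeddings and per-class text anchors in plain Python. This is the common textbook form, shown as an assumption: the exact loss, temperature, and batching used by BIM-CLIP are not specified in this README, and `info_nce` and `temperature=0.07` are illustrative choices rather than the authors' settings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(modal_embs, text_anchors, labels, temperature=0.07):
    """Mean cross-entropy of cosine-similarity logits against text anchors.

    modal_embs:   list of embedding vectors from one modality encoder
    text_anchors: one text embedding per class
    labels:       class index of each entry in modal_embs
    """
    anchors = [normalize(t) for t in text_anchors]
    total = 0.0
    for emb, label in zip(modal_embs, labels):
        e = normalize(emb)
        logits = [sum(a * b for a, b in zip(e, t)) / temperature for t in anchors]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(labels)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-D anchors for two classes
    aligned = info_nce([[0.9, 0.1], [0.2, 0.8]], anchors, labels=[0, 1])
    swapped = info_nce([[0.2, 0.8], [0.9, 0.1]], anchors, labels=[0, 1])
    print(aligned < swapped)
```

The loss falls as each embedding moves toward the anchor of its own class and away from the others, which is the behavior that pulls the three modalities into one shared space.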
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{meng2026bimclip, |
| title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition}, |
| author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong}, |
| journal={[to-be-updated upon acceptance]}, |
| year={2026} |
| } |
| ``` |
|
|