---
license: mit
language:
- en
- zh
tags:
- 3d
- bim
- multimodal
- contrastive-learning
- point-cloud
- mesh
- vision-language
- building-information-modeling
pipeline_tag: image-classification
---

# BIM-CLIP Model Weights

**Figure 1: Overview of the BIM-CLIP framework.**

![Figure1](图/huggingface_README/Figure1.png)

**Figure 2: BIM-CLIP workflow and downstream applications.**

> Given heterogeneous inputs, the framework encodes each modality through dedicated encoders and aligns them within a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.

![Figure3](图/huggingface_README/Figure3.png)

Model weights for **BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition**.

> Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
> Xi'an University of Technology

📄 [Paper (Preprint)](https://github.com/[to-be-updated]) | 💻 [GitHub](https://github.com/[to-be-updated]) | 🗂️ [BIMCompNet Dataset](https://bimcompnet-606lab.xaut.edu.cn/)

---

## Repository Structure

```
bim-clip-weights/
│
├── README.md
│
├── BIMCompNet/                  # Models trained on BIMCompNet
│   ├── multimodal/
│   │   ├── best_100.mdl         # BIMCompNet-100 (42 classes), multimodal
│   │   ├── best_500.mdl         # BIMCompNet-500 (31 classes), multimodal
│   │   └── best_1000.mdl        # BIMCompNet-1000 (24 classes), multimodal
│   └── single_modal/
│       ├── best_100.mdl         # BIMCompNet-100 (42 classes), single modality
│       ├── best_500.mdl         # BIMCompNet-500 (31 classes), single modality
│       └── best_1000.mdl        # BIMCompNet-1000 (24 classes), single modality
│
├── IFCNet/
│   └── best_ifcnet.mdl          # Multimodal, trained on IFCNet (20 classes)
│
├── ModelNet/
│   ├── best_10.mdl              # Multimodal, ModelNet-10
│   ├── best_40.mdl              # Multimodal, ModelNet-40
│   ├── ModelNet10.zip           # ModelNet-10 extended with PC + multi-view modalities
│   └── ModelNet40.zip           # ModelNet-40 extended with PC + multi-view modalities
│
└── ULIP2/
    ├── best_ulip2_1000.mdl      # ULIP-2 fine-tuned on BIMCompNet-1000
    └── ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt   # Official pretrained weights (866 MB)
```

---

## Model Summary

| File | Architecture | Training Set | Classes | Acc (%) | F1 (%) |
|------|--------------|--------------|---------|---------|--------|
| `BIMCompNet/multimodal/best_100.mdl` | BIM-CLIP (CMA) | BIMCompNet-100 | 42 | 87.38 | 87.44 |
| `BIMCompNet/multimodal/best_500.mdl` | BIM-CLIP (CMA) | BIMCompNet-500 | 31 | 91.35 | 91.28 |
| `BIMCompNet/multimodal/best_1000.mdl` | BIM-CLIP (CMA) | BIMCompNet-1000 | 24 | 91.79 | 91.83 |
| `BIMCompNet/single_modal/best_100.mdl` | BIM-CLIP (single modality) | BIMCompNet-100 | 42 | — | — |
| `BIMCompNet/single_modal/best_500.mdl` | BIM-CLIP (single modality) | BIMCompNet-500 | 31 | — | — |
| `BIMCompNet/single_modal/best_1000.mdl` | BIM-CLIP (single modality) | BIMCompNet-1000 | 24 | 88.69 | 87.90 |
| `IFCNet/best_ifcnet.mdl` | BIM-CLIP (CMA) | IFCNet | 20 | 91.00 | 90.39 |
| `ModelNet/best_10.mdl` | BIM-CLIP (CMA) | ModelNet-10 | 10 | 95.36 | 95.25* |
| `ModelNet/best_40.mdl` | BIM-CLIP (CMA) | ModelNet-40 | 40 | 92.22 | 90.34* |
| `ULIP2/best_ulip2_1000.mdl` | ULIP-2 | BIMCompNet-1000 | 24 | 90.98 | 91.02 |
| `ULIP2/ULIP-2-PointBERT-…-pretrained.pt` | ULIP-2 (official) | Objaverse | — | — | — |

*mAP reported for ModelNet. — indicates metrics not separately reported in the paper.*

---

## Usage

### Load and Run Evaluation

Clone the GitHub repo, then:

```bash
# Evaluate BIMCompNet-1000 multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 \
    --model_path /path/to/BIMCompNet/multimodal/best_1000.mdl \
    --embeddings_path embeddings.pt \
    --yaml_path ./描述信息.yaml \
    --output_dir ./results

# Evaluate IFCNet multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
    --model_path /path/to/IFCNet/best_ifcnet.mdl \
    --embeddings_path ifcnet_embeddings.pt \
    --yaml_path ./描述信息.yaml \
    --output_dir ./results
```

### Embeddings

Text embedding files are included in the GitHub repository:

| File | Classes | Dataset |
|------|---------|---------|
| `embeddings.pt` | 57 | BIMCompNet (all categories) |
| `ifcnet_embeddings.pt` | 20 | IFCNet |
| `model_net_10_embeddings.pt` | 10 | ModelNet-10 |
| `model_net_40_embeddings.pt` | 40 | ModelNet-40 |

> Use the matching embeddings file for each dataset. Do **not** mix embeddings across datasets.
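The evaluation script consumes these files directly via `--embeddings_path`. If you want to inspect or reuse them outside the script, the snippet below is a minimal, hypothetical sketch: it assumes each `.pt` file stores a mapping from class name to a 1536-dim text-anchor tensor (the `text-embedding-ada-002` dimensionality noted in the architecture overview below). Check the GitHub repo for the actual serialization format.

```python
import torch
import torch.nn.functional as F

# Hypothetical format: {class_name: 1536-dim text-anchor tensor}.
# Consult the GitHub repo for the actual layout of embeddings.pt.
text_embeddings = torch.load("embeddings.pt", map_location="cpu")
class_names = list(text_embeddings.keys())
anchors = F.normalize(torch.stack([text_embeddings[c] for c in class_names]), dim=-1)

def nearest_class(component_embedding: torch.Tensor) -> str:
    """Return the class whose text anchor has the highest cosine similarity."""
    z = F.normalize(component_embedding, dim=-1)   # (1536,)
    return class_names[int((anchors @ z).argmax())]

# Example with a random stand-in for a fused BIM-CLIP component embedding:
print(nearest_class(torch.randn(1536)))
```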
### Dataset Access

BIMCompNet is hosted by the 606 Lab at Xi'an University of Technology. Visit the link below to apply for access or download the dataset:

👉 **[https://bimcompnet-606lab.xaut.edu.cn/](https://bimcompnet-606lab.xaut.edu.cn/)**

### ModelNet Multimodal Extension

`ModelNet10.zip` and `ModelNet40.zip` contain the original ModelNet meshes extended with point clouds and multi-view images, constructed using our multimodal data pipeline. The directory layout inside each zip is:

```
ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/                 # Original mesh (.obj)
    │   ├── ply/                 # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/       # 12 edge-rendered views (0.png – 11.png)
    └── test/
        └── (same structure)
```

Point clouds are uniformly sampled from the mesh surface (1024 points per object). Multi-view images are edge-rendered from 12 fixed viewpoints, following the camera placement strategy used in BIMCompNet.

### Third-Party Weights (ULIP-2 Baseline)

The official ULIP-2 pretrained PointBERT weights (866 MB) are included in this repository at:

```
ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
```

Alternatively, download them directly from the original source:

```
https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
```

---

## Architecture Overview

BIM-CLIP uses three modality encoders (ViT for images, PointNet for point clouds, MeshNet for meshes), projects each into a shared 1536-dim language embedding space via contrastive alignment against `text-embedding-ada-002` anchors, and then fuses them through a **Language-Guided Cross-Modal Attention (CMA)** module. During fine-tuning, only the CMA module (10.62M parameters) is updated. An illustrative sketch of the fusion step is given in the appendix at the end of this card.

---

## Citation

```bibtex
@article{meng2026bimclip,
  title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
  author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
  journal={[to-be-updated upon acceptance]},
  year={2026}
}
```
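---

## Appendix: CMA Fusion Sketch

For readers who want a concrete picture of the architecture overview above, here is a minimal, hypothetical PyTorch sketch of the language-guided cross-modal attention step: a text anchor attends over the three projected modality embeddings and yields one fused component representation. The class name, layer choices, and head count are illustrative assumptions, not the released implementation; consult the GitHub repo for the real module.

```python
import torch
import torch.nn as nn

class LanguageGuidedCMA(nn.Module):
    """Hypothetical sketch of language-guided cross-modal attention:
    a text anchor (query) attends over per-modality embeddings
    (keys/values) and returns one fused component representation.
    All names and hyperparameters here are illustrative."""

    def __init__(self, dim: int = 1536, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_anchor: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        # text_anchor:     (B, 1, dim) language query
        # modality_tokens: (B, 3, dim) projected image / point-cloud / mesh embeddings
        fused, _ = self.attn(text_anchor, modality_tokens, modality_tokens)
        return self.norm(fused + text_anchor).squeeze(1)  # (B, dim)

# Toy usage with random stand-ins for the encoder outputs:
cma = LanguageGuidedCMA()
text = torch.randn(2, 1, 1536)   # one text anchor per sample
mods = torch.randn(2, 3, 1536)   # three modality embeddings per sample
print(cma(text, mods).shape)     # torch.Size([2, 1536])
```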