---
license: mit
language:
- en
- zh
tags:
- 3d
- bim
- multimodal
- contrastive-learning
- point-cloud
- mesh
- vision-language
- building-information-modeling
pipeline_tag: image-classification
---

# BIM-CLIP Model Weights

**Figure 1: Overview of the BIM-CLIP framework.**

![Figure1](图/huggingface_README/Figure1.png)

**Figure 2: BIM-CLIP workflow and downstream applications.**

> Given heterogeneous inputs, the framework encodes each modality through dedicated encoders and aligns them within a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.

![Figure3](图/huggingface_README/Figure3.png)

Model weights for **BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition**.

> Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
> Xi'an University of Technology

📄 [Paper (Preprint)](https://github.com/[to-be-updated]) | 💻 [GitHub](https://github.com/[to-be-updated]) | 🗂️ [BIMCompNet Dataset](https://bimcompnet-606lab.xaut.edu.cn/)

---

## Repository Structure

```
bim-clip-weights/
│
├── README.md
│
├── BIMCompNet/                  # Models trained on BIMCompNet
│   ├── multimodal/
│   │   ├── best_100.mdl         # BIMCompNet-100 (42 classes), multimodal
│   │   ├── best_500.mdl         # BIMCompNet-500 (31 classes), multimodal
│   │   └── best_1000.mdl        # BIMCompNet-1000 (24 classes), multimodal
│   └── single_modal/
│       ├── best_100.mdl         # BIMCompNet-100 (42 classes), single modality
│       ├── best_500.mdl         # BIMCompNet-500 (31 classes), single modality
│       └── best_1000.mdl        # BIMCompNet-1000 (24 classes), single modality
│
├── IFCNet/
│   └── best_ifcnet.mdl          # Multimodal, trained on IFCNet (20 classes)
│
├── ModelNet/
│   ├── best_10.mdl              # Multimodal, ModelNet-10
│   ├── best_40.mdl              # Multimodal, ModelNet-40
│   ├── ModelNet10.zip           # ModelNet-10 extended with PC + multi-view modalities
│   └── ModelNet40.zip           # ModelNet-40 extended with PC + multi-view modalities
│
└── ULIP2/
    ├── best_ulip2_1000.mdl      # ULIP-2 fine-tuned on BIMCompNet-1000
    └── ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt   # Official pretrained weights (866 MB)
```

---

## Model Summary

| File | Architecture | Training Set | Classes | Acc (%) | F1 (%) |
|------|--------------|--------------|---------|---------|--------|
| `BIMCompNet/multimodal/best_100.mdl` | BIM-CLIP (CMA) | BIMCompNet-100 | 42 | 87.38 | 87.44 |
| `BIMCompNet/multimodal/best_500.mdl` | BIM-CLIP (CMA) | BIMCompNet-500 | 31 | 91.35 | 91.28 |
| `BIMCompNet/multimodal/best_1000.mdl` | BIM-CLIP (CMA) | BIMCompNet-1000 | 24 | 91.79 | 91.83 |
| `BIMCompNet/single_modal/best_100.mdl` | BIM-CLIP (single modality) | BIMCompNet-100 | 42 | — | — |
| `BIMCompNet/single_modal/best_500.mdl` | BIM-CLIP (single modality) | BIMCompNet-500 | 31 | — | — |
| `BIMCompNet/single_modal/best_1000.mdl` | BIM-CLIP (single modality) | BIMCompNet-1000 | 24 | 88.69 | 87.90 |
| `IFCNet/best_ifcnet.mdl` | BIM-CLIP (CMA) | IFCNet | 20 | 91.00 | 90.39 |
| `ModelNet/best_10.mdl` | BIM-CLIP (CMA) | ModelNet-10 | 10 | 95.36 | 95.25* |
| `ModelNet/best_40.mdl` | BIM-CLIP (CMA) | ModelNet-40 | 40 | 92.22 | 90.34* |
| `ULIP2/best_ulip2_1000.mdl` | ULIP-2 | BIMCompNet-1000 | 24 | 90.98 | 91.02 |
| `ULIP2/ULIP-2-PointBERT-…-pretrained.pt` | ULIP-2 (official) | Objaverse | — | — | — |

*mAP reported for ModelNet. — indicates metrics not separately reported in the paper.*

---

## Usage

### Load and Run Evaluation

Clone the GitHub repo, then:

```bash
# Evaluate BIMCompNet-1000 multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 \
    --model_path /path/to/BIMCompNet/multimodal/best_1000.mdl \
    --embeddings_path embeddings.pt \
    --yaml_path ./描述信息.yaml \
    --output_dir ./results

# Evaluate IFCNet multimodal
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
    --model_path /path/to/IFCNet/best_ifcnet.mdl \
    --embeddings_path ifcnet_embeddings.pt \
    --yaml_path ./描述信息.yaml \
    --output_dir ./results
```

### Embeddings

Text embedding files are included in the GitHub repository:

| File | Classes | Dataset |
|------|---------|---------|
| `embeddings.pt` | 57 | BIMCompNet (all categories) |
| `ifcnet_embeddings.pt` | 20 | IFCNet |
| `model_net_10_embeddings.pt` | 10 | ModelNet-10 |
| `model_net_40_embeddings.pt` | 40 | ModelNet-40 |

> Use the matching embeddings file for each dataset. Do **not** mix embeddings across datasets.
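The evaluation script consumes these files directly via `--embeddings_path`. If you want to inspect or reuse them outside the script, the snippet below is a minimal, hypothetical sketch: it assumes each `.pt` file stores a mapping from class name to a 1536-dim text-anchor tensor (the `text-embedding-ada-002` dimensionality noted in the architecture overview below). Check the GitHub repo for the actual serialization format.

```python
import torch
import torch.nn.functional as F

# Hypothetical format: {class_name: 1536-dim text-anchor tensor}.
# Consult the GitHub repo for the actual layout of embeddings.pt.
text_embeddings = torch.load("embeddings.pt", map_location="cpu")
class_names = list(text_embeddings.keys())
anchors = F.normalize(torch.stack([text_embeddings[c] for c in class_names]), dim=-1)

def nearest_class(component_embedding: torch.Tensor) -> str:
    """Return the class whose text anchor has the highest cosine similarity."""
    z = F.normalize(component_embedding, dim=-1)   # (1536,)
    return class_names[int((anchors @ z).argmax())]

# Example with a random stand-in for a fused BIM-CLIP component embedding:
print(nearest_class(torch.randn(1536)))
```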
### Dataset Access

BIMCompNet is hosted by the 606 Lab at Xi'an University of Technology. Visit the link below to apply for access or download the dataset:

👉 **[https://bimcompnet-606lab.xaut.edu.cn/](https://bimcompnet-606lab.xaut.edu.cn/)**

### ModelNet Multimodal Extension

`ModelNet10.zip` and `ModelNet40.zip` contain the original ModelNet meshes extended with point clouds and multi-view images, constructed using our multimodal data pipeline. The directory layout inside each zip is:

```
ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/                 # Original mesh (.obj)
    │   ├── ply/                 # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/       # 12 edge-rendered views (0.png – 11.png)
    └── test/
        └── (same structure)
```

Point clouds are uniformly sampled from the mesh surface (1024 points per object). Multi-view images are edge-rendered from 12 fixed viewpoints, following the camera placement strategy used in BIMCompNet.

### Third-Party Weights (ULIP-2 Baseline)

The official ULIP-2 pretrained PointBERT weights (866 MB) are included in this repository at:

```
ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
```

Alternatively, download them directly from the original source:

```
https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
```

---

## Architecture Overview

BIM-CLIP uses three modality encoders (ViT for images, PointNet for point clouds, MeshNet for meshes), projects each into a shared 1536-dim language embedding space via contrastive alignment against `text-embedding-ada-002` anchors, and then fuses them through a **Language-Guided Cross-Modal Attention (CMA)** module. During fine-tuning, only the CMA module (10.62M parameters) is updated. An illustrative sketch of the fusion step is given in the appendix at the end of this card.

---

## Citation

```bibtex
@article{meng2026bimclip,
  title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
  author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
  journal={[to-be-updated upon acceptance]},
  year={2026}
}
```
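---

## Appendix: CMA Fusion Sketch

For readers who want a concrete picture of the architecture overview above, here is a minimal, hypothetical PyTorch sketch of the language-guided cross-modal attention step: a text anchor attends over the three projected modality embeddings and yields one fused component representation. The class name, layer choices, and head count are illustrative assumptions, not the released implementation; consult the GitHub repo for the real module.

```python
import torch
import torch.nn as nn

class LanguageGuidedCMA(nn.Module):
    """Hypothetical sketch of language-guided cross-modal attention:
    a text anchor (query) attends over per-modality embeddings
    (keys/values) and returns one fused component representation.
    All names and hyperparameters here are illustrative."""

    def __init__(self, dim: int = 1536, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_anchor: torch.Tensor, modality_tokens: torch.Tensor) -> torch.Tensor:
        # text_anchor:     (B, 1, dim) language query
        # modality_tokens: (B, 3, dim) projected image / point-cloud / mesh embeddings
        fused, _ = self.attn(text_anchor, modality_tokens, modality_tokens)
        return self.norm(fused + text_anchor).squeeze(1)  # (B, dim)

# Toy usage with random stand-ins for the encoder outputs:
cma = LanguageGuidedCMA()
text = torch.randn(2, 1, 1536)   # one text anchor per sample
mods = torch.randn(2, 3, 1536)   # three modality embeddings per sample
print(cma(text, mods).shape)     # torch.Size([2, 1536])
```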