---

license: apache-2.0
language:
- en
pipeline_tag: zero-shot-image-classification
tags:
- point-cloud
- contrastive-learning
- multi-modal
- clip
datasets:
- Ximeng0831/CTP-Dataset
---


# CTP: Contrastive Tensor Pre-training
[![arXiv](https://img.shields.io/badge/arXiv-2603.07874-b31b1b.svg)](https://arxiv.org/abs/2603.07874)
[![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-FFD21E)](https://huggingface.co/Ximeng0831/CTP)
[![Hugging Face Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/Ximeng0831/CTP-Dataset)
[![GitHub](https://img.shields.io/badge/GitHub-CTP-lightgrey?logo=github)](https://github.com/TAMU-CVRL/CTP)

This repository contains the model checkpoints for **CTP (Contrastive Tensor Pre-training)**. While [CLIP](https://arxiv.org/abs/2103.00020) focuses on aligning two modalities (Image and Text), CTP introduces a unified framework to align **multiple modalities** (Image, Text, and Point Cloud) simultaneously using tensor-based alignment.
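
Conceptually, the tensor-based objective scores all cross-modal triplets in a batch at once rather than one pair of modalities at a time. Here is a minimal illustrative sketch, not the paper's implementation (the toy shapes and the cosine-style contraction are assumptions for illustration): given L2-normalized batch embeddings from the three encoders, a single B×B×B similarity tensor can be formed by a three-way contraction.

```python
import numpy as np

# Illustrative only: a three-way "similarity tensor" over a batch of
# image/text/point-cloud embeddings, sim[i, j, k] = sum_d I[i,d]*T[j,d]*P[k,d].
rng = np.random.default_rng(0)
B, D = 4, 8  # toy batch size and embedding dimension

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img, txt, pts = (l2_normalize(rng.standard_normal((B, D))) for _ in range(3))

# One (B, B, B) tensor instead of three (B, B) pairwise matrices;
# the main diagonal sim[i, i, i] scores the matched triplets.
sim = np.einsum("id,jd,kd->ijk", img, txt, pts)
print(sim.shape)  # (4, 4, 4)
```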

## Repository Structure

The checkpoints are organized by experiment configuration. We use the following naming conventions:
- **`all`**: All three encoders (the CLIP ViT image encoder, the CLIP text encoder, and PointNet++) are pre-trained.
- **`pc`**: Only the PointNet++ (point cloud) backbone is trained; the image and text encoders remain frozen.
- **`nm`**: "No masking" variant (ablation study).

### Checkpoint Variations
| Folder Name | Method Description | Alignment Strategy |
| :--- | :--- | :--- |
| `192_l2_tensor_all` | **Default** | L2 Similarity Tensor |
| `192_l2_tensor_nm_all` | Default (No Masking) | L2 Similarity Tensor |
| `192_l2_tensor_pc` | Frozen Image/Text | L2 Similarity Tensor |
| `192_cos_tensor_all` | Cosine Variant | Cosine Similarity Tensor |
| `192_cos_matrix_all` | Pairwise Matrix | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_pc` | Pairwise (Frozen) | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_IP_pc`| Image-Point Only | 1× Similarity Matrix (I-P) |
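
The two alignment strategies in the table can be contrasted in a few lines. This is a hedged sketch, not the repository's code: the `cos_matrix` configurations score each pair of modalities with its own B×B matrix, whereas an L2 tensor folds the pairwise terms into one B×B×B score (the exact construction CTP uses is defined in the paper and config files).

```python
import numpy as np

# Illustrative sketch only; see the paper/configs for the actual objective.
rng = np.random.default_rng(1)
B, D = 4, 8
img, txt, pts = (rng.standard_normal((B, D)) for _ in range(3))

def sq_dists(a, b):
    """Pairwise squared L2 distances, shape (B, B)."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

# "Matrix" style: three separate pairwise distance matrices.
d_it, d_ip, d_tp = sq_dists(img, txt), sq_dists(img, pts), sq_dists(txt, pts)

# "Tensor" style (one plausible construction): combine the pairwise terms
# into a single (B, B, B) tensor; sim[i, j, k] scores image i / text j / cloud k.
sim = -(d_it[:, :, None] + d_ip[:, None, :] + d_tp[None, :, :])
print(sim.shape)  # (4, 4, 4)
```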

## Download the Checkpoints

You can download pretrained checkpoints using the `huggingface_hub` library:

```python
from huggingface_hub import hf_hub_download

# Available configurations:
# ["192_l2_tensor_all", "192_l2_tensor_nm_all", "192_cos_tensor_all",
#  "192_cos_matrix_all", "192_l2_tensor_pc", "192_cos_matrix_pc",
#  "192_cos_matrix_IP_pc"]
config_name = "192_l2_tensor_all"

# Returns the local path of the downloaded (cached) checkpoint file.
checkpoint_path = hf_hub_download(
    repo_id="Ximeng0831/CTP",
    subfolder=config_name,
    filename="ckpt_epoch9.pt",
    # local_dir="checkpoints",  # optional: save here instead of the HF cache
)
```
Source code: https://github.com/TAMU-CVRL/CTP

## Training Configurations

Detailed configuration files (YAML) for each experiment are available in the [Official GitHub Repository](https://github.com/TAMU-CVRL/CTP/tree/main/configs).

* **`all`:** Trained for **10 epochs** with a total batch size of **384** (192 per GPU) on **two NVIDIA A100 (40 GB)** GPUs.
* **`pc`:** Trained for **20 epochs** with a batch size of **192** on a **single NVIDIA RTX 4090** GPU.

> **Note:** For specific hyperparameter settings such as learning rate schedules and weight decay, please refer to the corresponding `.yaml` files in the link above.