---
license: apache-2.0
language:
- en
pipeline_tag: zero-shot-image-classification
tags:
- point-cloud
- contrastive-learning
- multi-modal
- clip
datasets:
- Ximeng0831/CTP-Dataset
---
# CTP: Contrastive Tensor Pre-training
[![arXiv](https://img.shields.io/badge/arXiv-2603.07874-b31b1b.svg)](https://arxiv.org/abs/2603.07874)
[![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-FFD21E)](https://huggingface.co/Ximeng0831/CTP)
[![Hugging Face Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https://huggingface.co/datasets/Ximeng0831/CTP-Dataset)
[![GitHub](https://img.shields.io/badge/GitHub-CTP-lightgrey?logo=github)](https://github.com/TAMU-CVRL/CTP)
This repository contains the model checkpoints for **CTP (Contrastive Tensor Pre-training)**. While [CLIP](https://arxiv.org/abs/2103.00020) focuses on aligning two modalities (Image and Text), CTP introduces a unified framework to align **multiple modalities** (Image, Text, and Point Cloud) simultaneously using tensor-based alignment.
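As a toy illustration of what "tensor-based alignment" means relative to CLIP's pairwise matrix, the sketch below builds a third-order cosine similarity tensor over a batch of image, text, and point-cloud embeddings. This is not the paper's exact objective; the triple-product form, the feature dimension, and the random features are all assumptions made purely for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit L2 norm (standard cosine-similarity prep)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

batch, dim = 4, 8  # dimensions chosen arbitrarily for the demo
rng = np.random.default_rng(0)
img = l2_normalize(rng.standard_normal((batch, dim)))
txt = l2_normalize(rng.standard_normal((batch, dim)))
pc  = l2_normalize(rng.standard_normal((batch, dim)))

# CLIP aligns two modalities with a 2-D similarity matrix:
M = img @ txt.T                              # shape (batch, batch)

# A three-modality analogue is a 3-D similarity tensor, e.g.
# T[i, j, k] = sum_d img[i, d] * txt[j, d] * pc[k, d]:
T = np.einsum('id,jd,kd->ijk', img, txt, pc)  # shape (batch, batch, batch)
print(M.shape, T.shape)
```

A contrastive loss over `T` can then pull the "diagonal" entries `T[i, i, i]` (matched triplets) above all mismatched entries, generalizing the matched-pair diagonal of the CLIP matrix.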
## Repository Structure
The checkpoints are organized by experiment configuration. We use the following naming conventions:
- **`all`**: Pre-training of all three encoders (CLIP ViT, CLIP Text, and PointNet++).
- **`pc`**: Only the PointNet++ (Point Cloud) backbone is trained; Image and Text encoders remain frozen.
- **`nm`**: "no masking" ablation variant.
### Checkpoint Variations
| Folder Name | Method Description | Alignment Strategy |
| :--- | :--- | :--- |
| `192_l2_tensor_all` | **Default** | L2 Similarity Tensor |
| `192_l2_tensor_nm_all` | Default (No Masking) | L2 Similarity Tensor |
| `192_l2_tensor_pc` | Frozen Image/Text | L2 Similarity Tensor |
| `192_cos_tensor_all` | Cosine Variant | Cosine Similarity Tensor |
| `192_cos_matrix_all` | Pairwise Matrix | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_pc` | Pairwise (Frozen) | 3× Pairwise Similarity Matrices |
| `192_cos_matrix_IP_pc`| Image-Point Only | 1× Pairwise Similarity Matrix (I-P) |
## Download the Checkpoints
You can download pretrained checkpoints using the `huggingface_hub` library:
```python
from huggingface_hub import hf_hub_download

# Available configurations:
# ["192_l2_tensor_all", "192_l2_tensor_nm_all", "192_cos_tensor_all",
#  "192_cos_matrix_all", "192_l2_tensor_pc", "192_cos_matrix_pc",
#  "192_cos_matrix_IP_pc"]
config_name = "192_l2_tensor_all"

checkpoint_path = hf_hub_download(
    repo_id="Ximeng0831/CTP",
    subfolder=config_name,
    filename="ckpt_epoch9.pt",
    # local_dir="checkpoints",  # optionally download into a fixed local directory
)
```
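Once downloaded, the `.pt` file can be opened with `torch.load`. The key layout inside the real CTP checkpoints is not documented here, so the sketch below is self-contained: it saves a dummy state dict with assumed keys, then reloads it the same way you would reload a downloaded checkpoint.

```python
import torch

# Dummy checkpoint standing in for a downloaded ckpt_epoch9.pt;
# the "epoch"/"model" keys are assumptions, not the real CTP layout.
dummy = {"epoch": 9, "model": {"w": torch.zeros(2, 2)}}
torch.save(dummy, "ckpt_epoch9.pt")

# map_location="cpu" lets you inspect GPU-trained checkpoints on any machine.
state = torch.load("ckpt_epoch9.pt", map_location="cpu")
print(sorted(state.keys()))
```

Inspect `state.keys()` on the real file to see which weights and metadata it actually contains before wiring it into a model.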
Source code: https://github.com/TAMU-CVRL/CTP
## Training Configurations
Detailed configuration files (YAML) for each experiment are available in the [Official GitHub Repository](https://github.com/TAMU-CVRL/CTP/tree/main/configs).
* **`all`:** Training runs for **10 epochs** with a total batch size of **384**, on **two NVIDIA A100 (40 GB)** GPUs.
* **`pc`:** Training runs for **20 epochs** with a batch size of **192**, on a **single NVIDIA RTX 4090** GPU.
> **Note:** For specific hyperparameter settings such as learning rate schedules and weight decay, please refer to the corresponding `.yaml` files in the link above.