---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[arXiv](https://arxiv.org/abs/2602.03098)
[GitHub](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality expansion framework that projects diverse modalities into LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) that map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions** — no paired multimodal data is needed.
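The projection head described above can be sketched as follows. The 2-layer MLP with GELU and BatchNorm matches the card's description; the hidden width (2048) and layer ordering are assumptions for illustration — see the GitHub repository for the reference implementation.

```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    """2-layer MLP mapping encoder embeddings into the 2560-dim anchor space.

    The hidden width and layer ordering here are illustrative assumptions.
    """

    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: project a batch of LanguageBind (768-dim) embeddings.
head = ProjectionHead(in_dim=768)
emb = torch.randn(4, 768)
print(head(emb).shape)  # torch.Size([4, 2560])
```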
## Repository Structure

```
├── projections/
│   ├── languagebind/                        # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt             # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt        # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt        # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt        # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt          # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt        # Video domain (59M)
│   └── target_encoders/                     # Target modality encoder projections
│       ├── clip.pt                          # CLIP → image (85M)
│       ├── viclip.pt                        # ViCLIP → video (59M)
│       ├── clap.pt                          # CLAP → audio (37M)
│       ├── uni3d.pt                         # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                      # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                   # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                    # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                  # LanguageBind → multi-modal (59M)
└── offsets/                                 # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|----------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download an offset vector
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full training and evaluation code.
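Once a projection output and an offset vector are in hand, the modality-gap correction can be sketched as below. The additive use of the mean-text-embedding offset and the renormalization step are assumptions for illustration — consult the repository for the exact recipe.

```python
import torch
import torch.nn.functional as F


def apply_offset(projected: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    """Shift projected embeddings by a precomputed offset, then renormalize.

    The additive form here is an illustrative assumption.
    """
    shifted = F.normalize(projected, dim=-1) + offset
    return F.normalize(shifted, dim=-1)


projected = torch.randn(4, 2560)   # stands in for projection-head outputs
offset = torch.randn(2560) * 0.01  # stands in for a loaded text_embed_mean.pkl
corrected = apply_offset(projected, offset)
print(corrected.shape)  # torch.Size([4, 2560])
```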

## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
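The temperature hyperparameter above suggests a contrastive alignment objective. A minimal sketch of a symmetric InfoNCE loss at temperature 0.07 is shown below; pairing projected embeddings against anchor-space text embeddings row-by-row is an assumption for illustration, not the confirmed training objective.

```python
import torch
import torch.nn.functional as F


def info_nce(proj: torch.Tensor, anchor: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over cosine-similarity logits (illustrative sketch)."""
    proj = F.normalize(proj, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = proj @ anchor.t() / tau     # [B, B] similarity matrix
    labels = torch.arange(proj.size(0))  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))


# Toy batch: 8 projected embeddings vs. 8 anchor embeddings (2560-dim).
loss = info_nce(torch.randn(8, 2560), torch.randn(8, 2560))
print(loss.item())
```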

## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @article{hong2026textme, |
| | title={TextME: Bridging Unseen Modalities Through Text Descriptions}, |
| | author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk}, |
| | journal={arXiv preprint arXiv:2602.03098}, |
| | year={2026} |
| | } |
| | ``` |
| |
|
| | ## License |
| |
|
| | MIT License |
| |
|