---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---
# TextME: Bridging Unseen Modalities Through Text Descriptions
[![arXiv](https://img.shields.io/badge/arXiv-2602.03098-b31b1b.svg)](https://arxiv.org/abs/2602.03098)
[![GitHub](https://img.shields.io/badge/GitHub-TextME-blue)](https://github.com/SoyeonHH/TextME)
Official projection checkpoints and offset vectors for **TextME**, a text-only modality expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.
## Model Description
TextME trains lightweight projection heads (2-layer MLP, ~10M params each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**; no paired multimodal data is needed.
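As a rough sanity check on the "~10M params" figure, the sketch below counts the learnable parameters of a 2-layer MLP head. The layer order (Linear, BatchNorm, GELU, Linear) and the hidden width of 2560 are assumptions made for illustration. This README does not state either, and the released checkpoints may use a different width.

```python
def mlp_param_count(in_dim: int, hidden_dim: int, out_dim: int) -> int:
    """Count learnable parameters of Linear -> BatchNorm -> GELU -> Linear."""
    first_linear = in_dim * hidden_dim + hidden_dim    # weights + bias
    batch_norm = 2 * hidden_dim                        # per-feature scale + shift
    second_linear = hidden_dim * out_dim + out_dim     # weights + bias
    return first_linear + batch_norm + second_linear   # GELU has no parameters


ANCHOR_DIM = 2560   # Qwen3-Embedding-4B embedding size (stated in this README)
HIDDEN_DIM = 2560   # assumption: the hidden width is not documented here

for name, in_dim in [("CLIP", 1024), ("ViCLIP", 768), ("CLAP", 512)]:
    total = mlp_param_count(in_dim, HIDDEN_DIM, ANCHOR_DIM)
    print(f"{name}: {total / 1e6:.2f}M parameters")
```

Under these assumptions the count depends mostly on the hidden-to-anchor layer, so heads for encoders with different input sizes land in the same order of magnitude.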
## Repository Structure
```
├── projections/
│   ├── languagebind/                        # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt             # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt        # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt        # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt        # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt          # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt        # Video domain (59M)
│   └── target_encoders/                     # Target modality encoder projections
│       ├── clip.pt                          # CLIP → image (85M)
│       ├── viclip.pt                        # ViCLIP → video (59M)
│       ├── clap.pt                          # CLAP → audio (37M)
│       ├── uni3d.pt                         # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                      # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                   # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                    # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                  # LanguageBind → multi-modal (59M)
└── offsets/                                 # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```
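To give a feel for what a "modality gap offset vector" can look like, here is a minimal sketch of one common approach: average a set of embeddings and subtract that mean before re-normalizing. How TextME actually computes and applies its offsets is defined in the GitHub code, so treat the function names and the centering step below as assumptions, not the authors' procedure.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length embedding vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def center(embedding, offset):
    """Subtract the offset from an embedding, then re-normalize."""
    return l2_normalize([e - o for e, o in zip(embedding, offset)])


# Toy 2-dim embeddings standing in for real encoder outputs.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
offset = mean_vector(text_embeddings)
print(offset)
print(center([1.0, 0.5], offset))
```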
## Supported Modalities
| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|---------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |
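The table and the repository tree can be combined into a small lookup that returns the repo-relative paths for a modality. This is a convenience sketch: the `TARGETS` dictionary, its keys, and the helper names are hypothetical, and the pairing of each offset directory with a modality is inferred from the directory names.

```python
# Encoder names, dimensions, and directory names are copied from this README.
# The modality keys and the offset-directory pairing are illustrative assumptions.
TARGETS = {
    "image": {"encoder": "clip", "dim": 1024, "offsets": "clip_coco"},
    "video": {"encoder": "viclip", "dim": 768, "offsets": "viclip_internvid"},
    "audio": {"encoder": "clap", "dim": 512, "offsets": "clap_audiocaps"},
    "3d": {"encoder": "uni3d", "dim": 1024, "offsets": "uni3d_objaverse"},
    "xray": {"encoder": "cxr_clip", "dim": 512, "offsets": "cxr_clip_chestxray"},
    "molecule": {"encoder": "moleculestm", "dim": 256, "offsets": "moleculestm_pubchem"},
    "remote_sensing": {"encoder": "remoteclip", "dim": 768, "offsets": "remoteclip_ret3"},
}


def target_checkpoint(modality: str) -> str:
    """Repo-relative path of the target-encoder projection checkpoint."""
    return f"projections/target_encoders/{TARGETS[modality]['encoder']}.pt"


def offset_directory(modality: str) -> str:
    """Repo-relative directory holding the offset files for a modality."""
    return f"offsets/{TARGETS[modality]['offsets']}/"


print(target_checkpoint("audio"))
print(offset_directory("audio"))
```

The returned checkpoint path can be passed as the `filename` argument of `hf_hub_download`.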
## Usage
```python
from huggingface_hub import hf_hub_download
import torch
# Download a projection checkpoint
ckpt_path = hf_hub_download(
repo_id="SoyeonHH/TextME",
filename="projections/target_encoders/clip.pt"
)
# Load checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")
# Download offset vectors
offset_path = hf_hub_download(
repo_id="SoyeonHH/TextME",
filename="offsets/clip_coco/text_embed_mean.pkl"
)
```
See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full evaluation and training code.
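The offset file is a pickle, so it has to be unpickled after download. The helper below is a minimal sketch. `load_offset` is a hypothetical name, and this README does not document the type of the stored object, so the demo round-trips a plain list as a stand-in. Unpickling can execute arbitrary code, so only load files from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_offset(path):
    """Unpickle an offset file. Only call this on files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round-trip demo with a stand-in vector; the real file's contents may differ.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "text_embed_mean.pkl"
    demo_path.write_bytes(pickle.dumps([0.1, -0.2, 0.3]))
    demo_offset = load_offset(demo_path)

print(type(demo_offset).__name__, len(demo_offset))
```

With a downloaded file, `load_offset(offset_path)` would be the equivalent call. If the pickle stores a NumPy array or a PyTorch tensor, that library must be installed for unpickling to succeed.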
## Training Details
| Parameter | Value |
|-----------|-------|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
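The table lists a temperature but does not name the loss. Assuming a contrastive, InfoNCE-style objective (an assumption, not something this README confirms), the sketch below shows how a temperature of 0.07 scales similarities before the softmax. The function name and the convention that row `i` matches column `i` are illustrative.

```python
import math


def info_nce(similarities, temperature=0.07):
    """Mean cross-entropy over rows, where row i's positive is column i.

    `similarities` is a square matrix of cosine similarities between
    projected embeddings (rows) and anchor embeddings (columns).
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


well_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each row matches its own column
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # no preference between columns
print(info_nce(well_aligned))
print(info_nce(uninformative))
```

A small temperature sharpens the softmax, so even modest similarity gaps between the positive and the negatives push the loss toward zero.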
## Results
### Text→X Retrieval (R@1)
| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |
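For readers unfamiliar with the metric, R@1 is the fraction of text queries whose top-ranked candidate is the correct item. The sketch below computes it on toy data. It assumes embeddings are already L2-normalized so that a dot product equals cosine similarity, and the function names are illustrative and not taken from the TextME evaluation code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_1(query_embeddings, item_embeddings, gold_indices):
    """Fraction of queries whose highest-scoring item is the gold item."""
    hits = 0
    for query, gold in zip(query_embeddings, gold_indices):
        scores = [dot(query, item) for item in item_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == gold)
    return hits / len(gold_indices)


# Toy data: three text queries, two candidate items.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
gold = [0, 1, 1]  # the third query is deliberately mismatched
print(recall_at_1(queries, items, gold))
```

Multiply the result by 100 to express it as a percentage.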
### Zero-Shot Classification (Top-1)
| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |
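Zero-shot classification in a shared embedding space is typically done by embedding one text prompt per class and predicting the class whose prompt embedding is most similar to the sample. The sketch below illustrates that idea on toy vectors. The function names, labels, and 2-dim embeddings are hypothetical, and the actual evaluation protocol lives in the GitHub repository.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_predict(sample_embedding, class_embeddings):
    """Return the label whose prompt embedding scores highest for the sample."""
    return max(
        class_embeddings,
        key=lambda label: dot(sample_embedding, class_embeddings[label]),
    )


def top1_accuracy(sample_embeddings, labels, class_embeddings):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(
        zero_shot_predict(sample, class_embeddings) == label
        for sample, label in zip(sample_embeddings, labels)
    )
    return correct / len(labels)


# Toy prompt embeddings for two classes and two projected samples.
class_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
samples = [[0.9, 0.1], [0.2, 0.8]]
labels = ["chair", "table"]
print(top1_accuracy(samples, labels, class_embeddings))
```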
## Citation
```bibtex
@article{hong2026textme,
title={TextME: Bridging Unseen Modalities Through Text Descriptions},
author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
journal={arXiv preprint arXiv:2602.03098},
year={2026}
}
```
## License
MIT License