---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[![arXiv](https://img.shields.io/badge/arXiv-2602.03098-b31b1b.svg)](https://arxiv.org/abs/2602.03098)
[![GitHub](https://img.shields.io/badge/GitHub-TextME-blue)](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality-expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) that map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**; no paired multimodal data is needed.

## Repository Structure

```
├── projections/
│   ├── languagebind/                      # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt           # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt      # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt      # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt      # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt        # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt      # Video domain (59M)
│   └── target_encoders/                   # Target modality encoder projections
│       ├── clip.pt                        # CLIP → image (85M)
│       ├── viclip.pt                      # ViCLIP → video (59M)
│       ├── clap.pt                        # CLAP → audio (37M)
│       ├── uni3d.pt                       # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                    # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                 # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                  # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                # LanguageBind → multi-modal (59M)
└── offsets/                               # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Anchor Dim |
|----------|---------------|----------------|------------|
| Image | LanguageBind (768) | CLIP (1024) | 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | 2560 |
| Audio | LanguageBind (768) | CLAP (512) | 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full training and evaluation code.
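The checkpoints above store the weights of per-encoder projection heads. As a rough, illustrative sketch of the architecture implied by the Model Description and Training Details (a 2-layer MLP with GELU and BatchNorm projecting into the 2560-dim anchor space): the hidden width, the exact layer ordering, and the additive use of the offset vector are assumptions here, not taken from the released code, and the random tensors stand in for real CLIP features and a real offset file.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """2-layer MLP with GELU and BatchNorm (hidden width is an assumption;
    the released checkpoints define the true shapes)."""
    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Project a batch of stand-in CLIP image embeddings (1024-dim) into the
# anchor space, then shift by a (here zero) modality-gap offset vector.
proj = ProjectionHead(in_dim=1024)
proj.eval()  # put BatchNorm into inference mode

clip_embeds = torch.randn(4, 1024)  # placeholder for real CLIP features
offset = torch.zeros(2560)          # placeholder for offsets/clip_coco/...
with torch.no_grad():
    anchor_embeds = proj(clip_embeds) + offset
print(anchor_embeds.shape)  # torch.Size([4, 2560])
```

In practice you would initialize the head from a downloaded `.pt` checkpoint (e.g. via `load_state_dict`) rather than from random weights, and load the offset from the corresponding `offsets/` directory.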
## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Learning rate | 5×10⁻⁴ (target encoders) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |

## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |

## Citation

```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```

## License

MIT License