---
license: mit
tags:
  - multimodal
  - cross-modal-retrieval
  - zero-shot-classification
  - text-only-training
  - modality-expansion
  - projection-network
language:
  - en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[![arXiv](https://img.shields.io/badge/arXiv-2602.03098-b31b1b.svg)](https://arxiv.org/abs/2602.03098)
[![GitHub](https://img.shields.io/badge/GitHub-TextME-blue)](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLP, ~10M params each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**; no paired multimodal data is needed.
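To make the shape of such a head concrete, here is a minimal sketch in PyTorch. The class name `ProjectionHead`, the hidden width of 2560, and the ordering of BatchNorm and GELU are assumptions made for illustration; the released checkpoints may be laid out differently, so treat this as a rough picture and not the authors' implementation.

```python
import torch
from torch import nn


class ProjectionHead(nn.Module):
    """Hypothetical 2-layer MLP mapping encoder embeddings to the anchor space."""

    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2560):
        super().__init__()
        # Layer order and hidden width are guesses, not read from the checkpoints.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


head = ProjectionHead(in_dim=1024)  # e.g. 1024-dim encoder embeddings
head.eval()  # use BatchNorm running statistics at inference time
with torch.no_grad():
    projected = head(torch.randn(4, 1024))  # random stand-in for real embeddings
print(projected.shape)
```

With these assumed sizes, a 1024-dim input gives roughly 9.2M parameters, which is in the same range as the figure quoted above.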

## Repository Structure

```
├── projections/
│   ├── languagebind/           # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt            # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt       # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt       # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt       # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt         # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt       # Video domain (59M)
│   └── target_encoders/        # Target modality encoder projections
│       ├── clip.pt              # CLIP → image (85M)
│       ├── viclip.pt            # ViCLIP → video (59M)
│       ├── clap.pt              # CLAP → audio (37M)
│       ├── uni3d.pt             # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt          # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt       # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt        # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt      # LanguageBind → multi-modal (59M)
└── offsets/                    # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|---------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |
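
For scripting, the table and the repository tree can be folded into a small lookup. The modality keys (such as `"xray"` and `"remote_sensing"`) and the helper name `checkpoint_filename` are illustrative and not part of the released code; the file names and dimensions are copied from this card.

```python
# Target-encoder checkpoint file and input dimension per modality,
# transcribed from the table and repository tree in this card.
TARGET_ENCODERS = {
    "image": ("clip.pt", 1024),
    "video": ("viclip.pt", 768),
    "audio": ("clap.pt", 512),
    "3d": ("uni3d.pt", 1024),
    "xray": ("cxr_clip.pt", 512),
    "molecule": ("moleculestm.pt", 256),
    "remote_sensing": ("remoteclip.pt", 768),
}
ANCHOR_DIM = 2560  # Qwen3-Embedding-4B anchor space


def checkpoint_filename(modality: str) -> str:
    """Return the in-repo path of the target-encoder projection (hypothetical helper)."""
    name, _input_dim = TARGET_ENCODERS[modality]
    return f"projections/target_encoders/{name}"


print(checkpoint_filename("audio"))
```

The returned string can be passed as the `filename` argument of `hf_hub_download`.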

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```
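
The internal layout of the loaded checkpoint object is not documented in this card, so it may help to inspect it before wiring it into a model. The sketch below uses a hypothetical helper, `summarize_checkpoint`, that walks nested dicts and reports tensor shapes; it makes no assumption about the actual key names, and the `fake_checkpoint` dict is only a stand-in so the snippet runs without a download.

```python
def summarize_checkpoint(obj, prefix=""):
    """Return (dotted_name, shape_or_type) pairs for a nested checkpoint object."""
    rows = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            rows.extend(summarize_checkpoint(value, f"{prefix}{key}."))
    else:
        # Tensors and arrays expose .shape; anything else is reported by type name.
        shape = getattr(obj, "shape", None)
        desc = tuple(shape) if shape is not None else type(obj).__name__
        rows.append((prefix.rstrip("."), desc))
    return rows


# Stand-in data; replace with the object returned by torch.load(...).
fake_checkpoint = {"state_dict": {"weight": 1.0}, "epoch": 3}
for name, desc in summarize_checkpoint(fake_checkpoint):
    print(name, desc)
```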

See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full evaluation and training code.

## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
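
The optimizer settings above translate directly into PyTorch. The temperature entry suggests a contrastive objective; the sketch below assumes a symmetric InfoNCE loss, which this card does not confirm. The function name `info_nce`, the single linear layer standing in for a projection head, and the random tensors are all placeholders.

```python
import torch
import torch.nn.functional as F


def info_nce(projected, anchors, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each input is a matched pair."""
    projected = F.normalize(projected, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    logits = projected @ anchors.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


# Stand-in for a projection head; a single linear layer keeps the sketch short.
head = torch.nn.Linear(1024, 2560)
# Learning rate for target-encoder projections, betas as listed in the table.
optimizer = torch.optim.AdamW(head.parameters(), lr=5e-4, betas=(0.9, 0.999))

# One step on random tensors, with the batch shrunk from 512 to 8 for the demo.
encoder_text = torch.randn(8, 1024)  # placeholder for frozen encoder text embeddings
anchor_text = torch.randn(8, 2560)   # placeholder for anchor-space text embeddings
loss = info_nce(head(encoder_text), anchor_text)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```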

## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |
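
For readers less familiar with the metric, R@1 is the percentage of queries whose top-ranked candidate is the correct match. A minimal sketch follows, assuming query `i` is paired with candidate `i`; the function name `recall_at_1` and the toy similarity matrix are illustrative, and the official evaluation code in the GitHub repository may handle multiple captions per item differently.

```python
def recall_at_1(similarity):
    """Percentage of rows whose largest score sits on the diagonal."""
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)  # index of the top-scoring candidate
        hits += int(best == i)
    return 100.0 * hits / len(similarity)


# Toy text-to-item similarity matrix: rows are text queries, columns are candidates.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],  # this query ranks candidate 0 first, so it is a miss
]
print(recall_at_1(toy_similarity))
```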

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |

## Citation

```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```

## License

MIT License