---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---
# TextME: Bridging Unseen Modalities Through Text Descriptions
[Paper (arXiv:2602.03098)](https://arxiv.org/abs/2602.03098) | [Code (GitHub)](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.
## Model Description
TextME trains lightweight projection heads (2-layer MLP, ~10M params each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**; no paired multimodal data is needed.
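
As a rough illustration of such a projection head, the sketch below maps a 1024-dim encoder embedding into the 2560-dim anchor space. The class name `ProjectionHead`, the hidden width of 2560, the Linear, BatchNorm, GELU, Linear layer order, and the final L2 normalization are all assumptions made for illustration. The released checkpoints may use a different layout, so check the GitHub repository before loading weights into it.

```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    """Hypothetical 2-layer MLP: encoder dim -> hidden -> 2560-dim anchor space."""

    def __init__(self, in_dim: int, hidden_dim: int = 2560, out_dim: int = 2560):
        super().__init__()
        # hidden_dim is an assumed value; the model card does not state it.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumed: L2-normalize so dot products behave as cosine similarities.
        return nn.functional.normalize(self.net(x), dim=-1)


head = ProjectionHead(in_dim=1024).eval()   # 1024 = CLIP embedding size
with torch.no_grad():
    projected = head(torch.randn(4, 1024))  # random stand-in for CLIP embeddings
print(projected.shape)  # torch.Size([4, 2560])
```
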
## Repository Structure
```
├── projections/
│   ├── languagebind/                        # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt             # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt        # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt        # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt        # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt          # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt        # Video domain (59M)
│   └── target_encoders/                     # Target modality encoder projections
│       ├── clip.pt                          # CLIP → image (85M)
│       ├── viclip.pt                        # ViCLIP → video (59M)
│       ├── clap.pt                          # CLAP → audio (37M)
│       ├── uni3d.pt                         # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                      # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                   # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                    # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                  # LanguageBind → multi-modal (59M)
└── offsets/                                 # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```
## Supported Modalities
| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|---------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |
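
To make the link between this table and the repository layout explicit, the following sketch collects the file names and encoder dimensions in one lookup. The names `ModalityFiles`, `REGISTRY`, and `repo_paths` are hypothetical helpers, and the pairing of each modality with an offset directory is inferred from the directory names rather than documented, so treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityFiles:
    target_ckpt: str   # projection for the target modality encoder
    source_ckpt: str   # per-domain LanguageBind text projection
    offset_dir: str    # offset directory (pairing inferred from its name)
    target_dim: int    # target encoder embedding size from the table


REGISTRY = {
    "image": ModalityFiles("clip.pt", "languagebind_coco.pt", "clip_coco", 1024),
    "video": ModalityFiles("viclip.pt", "languagebind_internvid.pt", "viclip_internvid", 768),
    "audio": ModalityFiles("clap.pt", "languagebind_audiocaps.pt", "clap_audiocaps", 512),
    "3d": ModalityFiles("uni3d.pt", "languagebind_objaverse.pt", "uni3d_objaverse", 1024),
    "xray": ModalityFiles("cxr_clip.pt", "languagebind_chestxray.pt", "cxr_clip_chestxray", 512),
    "molecule": ModalityFiles("moleculestm.pt", "languagebind_pubchem.pt", "moleculestm_pubchem", 256),
    "remote_sensing": ModalityFiles(
        "remoteclip.pt", "languagebind_remoteclip_ret3.pt", "remoteclip_ret3", 768
    ),
}


def repo_paths(modality: str) -> dict:
    """Return repository-relative paths for one modality."""
    files = REGISTRY[modality]
    return {
        "target": f"projections/target_encoders/{files.target_ckpt}",
        "source": f"projections/languagebind/{files.source_ckpt}",
        "offsets": f"offsets/{files.offset_dir}/",
    }


print(repo_paths("audio")["target"])  # projections/target_encoders/clap.pt
```

The returned strings can be passed as `filename` to `hf_hub_download`; the contents of each offset directory are not listed here, so only the directory path is built.
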
## Usage
```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt",
)

# Load checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl",
)
```
See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full evaluation and training code.
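
This card does not document the internal layout of the checkpoint or the offset pickle, so a reasonable first step is to inspect what was loaded. The helper below is a minimal sketch: `summarize` is a hypothetical name, and it only assumes that the loaded object is a (possibly nested) mapping whose leaves may expose a `.shape` attribute.

```python
from collections.abc import Mapping


def summarize(obj, prefix="", max_items=20):
    """Return 'name: shape-or-type' lines for a nested checkpoint-like object."""
    lines = []
    if isinstance(obj, Mapping):
        for key, value in list(obj.items())[:max_items]:
            lines.extend(summarize(value, f"{prefix}{key}.", max_items))
    else:
        shape = getattr(obj, "shape", None)
        desc = tuple(shape) if shape is not None else type(obj).__name__
        lines.append(f"{prefix.rstrip('.')}: {desc}")
    return lines


# Toy stand-in so the sketch runs without downloading anything.
toy = {"meta": {"note": "toy"}, "weights": {"fc.weight": [[0.0]]}}
for line in summarize(toy):
    print(line)

# With the real files from the snippet above (not run here):
#   print("\n".join(summarize(checkpoint)))
#   import pickle
#   with open(offset_path, "rb") as f:
#       offset = pickle.load(f)   # unpickling runs code; only load trusted files
#   print(type(offset), getattr(offset, "shape", None))
```
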
## Training Details
| Parameter | Value |
|-----------|-------|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
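
The temperature entry suggests a contrastive objective, although this card does not name the loss. The sketch below shows one optimizer step under that assumption: a symmetric InfoNCE loss between projected encoder text embeddings and anchor text embeddings, with the AdamW settings and target-encoder learning rate from the table. The function `info_nce`, the hidden width, the layer order, and the random input tensors are illustrative only, and the batch is reduced from 512 to 32 to keep the example light.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMPERATURE = 0.07  # from the table


def info_nce(projected, anchors, temperature=TEMPERATURE):
    """Symmetric InfoNCE over a batch of matched (projected, anchor) rows."""
    projected = F.normalize(projected, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    logits = projected @ anchors.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the two retrieval directions; row i matches column i.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


torch.manual_seed(0)
# Assumed head layout; hidden width 2560 is not stated in the card.
head = nn.Sequential(
    nn.Linear(1024, 2560), nn.BatchNorm1d(2560), nn.GELU(), nn.Linear(2560, 2560)
)
optimizer = torch.optim.AdamW(head.parameters(), lr=5e-4, betas=(0.9, 0.999))

# Random stand-ins for one batch of text embeddings.
encoder_text = torch.randn(32, 1024)  # e.g. CLIP text encoder output
anchor_text = torch.randn(32, 2560)   # e.g. Qwen3-Embedding-4B output

loss = info_nce(head(encoder_text), anchor_text)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```
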
## Results
### Text→X Retrieval (R@1)
| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |
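
For readers unfamiliar with the metric, R@1 is the percentage of queries whose correct item is ranked first. The sketch below computes it from a query-by-item similarity matrix, under the simplifying assumption that query `i` has exactly one correct item at index `i`; benchmarks with several captions per item need a label mapping instead. The function name `recall_at_k` and the toy matrix are illustrative and unrelated to the reported numbers.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Item indices sorted by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += query_idx in ranked[:k]
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix: rows are text queries, columns are items.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(round(recall_at_k(sim, k=1), 2))  # 66.67
print(recall_at_k(sim, k=2))            # 100.0
```
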
### Zero-Shot Classification (Top-1)
| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |
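
Zero-shot classification in a shared embedding space is commonly done by comparing each sample embedding with one text embedding per class and picking the most similar class. The sketch below shows that procedure with cosine similarity on toy 2-dim vectors; the function names, class labels, and vectors are made up for illustration and are not the evaluation code from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot_predict(sample, class_text_embeddings):
    """Return the label whose text embedding is most similar to the sample."""
    return max(
        class_text_embeddings,
        key=lambda label: cosine(sample, class_text_embeddings[label]),
    )


def top1_accuracy(samples, labels, class_text_embeddings):
    """Percentage of samples whose predicted label equals the true label."""
    correct = sum(
        zero_shot_predict(s, class_text_embeddings) == y for s, y in zip(samples, labels)
    )
    return 100.0 * correct / len(samples)


# Toy data: two classes, three samples (the last one is misclassified).
class_text = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
samples = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.5]]
labels = ["chair", "table", "table"]
print(round(top1_accuracy(samples, labels, class_text), 2))  # 66.67
```
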
## Citation
```bibtex
@article{hong2026textme,
title={TextME: Bridging Unseen Modalities Through Text Descriptions},
author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
journal={arXiv preprint arXiv:2602.03098},
year={2026}
}
```
## License
MIT License