---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[arXiv](https://arxiv.org/abs/2602.03098)
[Code](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality-expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**: no paired multimodal data is needed.

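As a rough sketch of what such a projection head might look like (the hidden width and exact layer ordering below are assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Illustrative 2-layer MLP with GELU and BatchNorm that maps
    encoder embeddings (e.g. 768-dim LanguageBind) into the 2560-dim
    anchor space. Hidden width 2048 is an assumption."""
    def __init__(self, in_dim=768, hidden_dim=2048, out_dim=2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

head = ProjectionHead()
head.eval()  # eval mode so BatchNorm uses running statistics
with torch.no_grad():
    out = head(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 2560])
```
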
## Repository Structure

```
├── projections/
│   ├── languagebind/                       # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt            # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt       # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt       # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt       # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt         # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt       # Video domain (59M)
│   └── target_encoders/                    # Target modality encoder projections
│       ├── clip.pt                         # CLIP → image (85M)
│       ├── viclip.pt                       # ViCLIP → video (59M)
│       ├── clap.pt                         # CLAP → audio (37M)
│       ├── uni3d.pt                        # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                     # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                  # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                   # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                 # LanguageBind → multi-modal (59M)
└── offsets/                                # Precomputed modality-gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    └── languagebind_coco/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|---------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint on CPU
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download the precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

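The offset vectors are meant to close the modality gap between projected embeddings and the text anchor space. As a hedged illustration only (the pickle layout and this exact post-processing step are assumptions; inspect the downloaded file before relying on them), applying an offset might look like:

```python
import torch
import torch.nn.functional as F

# Hypothetical post-processing sketch, not the official API:
# shift projected embeddings by the modality-gap offset vector,
# then L2-normalize before cosine-similarity retrieval.
torch.manual_seed(0)
offset = torch.randn(2560) * 0.01  # stand-in for the loaded offset vector
embed = torch.randn(4, 2560)       # stand-in projected embeddings
shifted = F.normalize(embed + offset, dim=-1)
print(shifted.shape)
```
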
See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full evaluation and training code.

## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU and BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |

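The temperature of 0.07 points to a contrastive training objective. As an illustrative sketch only (the exact TextME loss is not reproduced here; this is a generic symmetric InfoNCE over a batch of projected/anchor embedding pairs):

```python
import torch
import torch.nn.functional as F

def info_nce(proj, anchor, temperature=0.07):
    """Generic symmetric InfoNCE sketch: matched rows of `proj` and
    `anchor` are positives; all other rows in the batch are negatives."""
    proj = F.normalize(proj, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = proj @ anchor.t() / temperature
    labels = torch.arange(proj.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

loss = info_nce(torch.randn(8, 2560), torch.randn(8, 2560))
```
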
## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |

## Citation

```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```

## License

MIT License