---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[![arXiv](https://img.shields.io/badge/arXiv-2602.03098-b31b1b.svg)](https://arxiv.org/abs/2602.03098)
[![GitHub](https://img.shields.io/badge/GitHub-TextME-blue)](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality-expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) that map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**; no paired multimodal data is needed.

## Repository Structure

```
├── projections/
│   ├── languagebind/                      # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt           # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt      # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt      # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt      # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt        # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt      # Video domain (59M)
│   └── target_encoders/                   # Target modality encoder projections
│       ├── clip.pt                        # CLIP → image (85M)
│       ├── viclip.pt                      # ViCLIP → video (59M)
│       ├── clap.pt                        # CLAP → audio (37M)
│       ├── uni3d.pt                       # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                    # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                 # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                  # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                # LanguageBind → multi-modal (59M)
└── offsets/                               # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Anchor Dim |
|----------|---------------|----------------|------------|
| Image | LanguageBind (768) | CLIP (1024) | 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | 2560 |
| Audio | LanguageBind (768) | CLAP (512) | 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full training and evaluation code.
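The checkpoints above store the weights of per-encoder projection heads. As a rough, illustrative sketch of the architecture implied by the Model Description and Training Details (a 2-layer MLP with GELU and BatchNorm projecting into the 2560-dim anchor space): the hidden width, the exact layer ordering, and the additive use of the offset vector are assumptions here, not taken from the released code, and the random tensors stand in for real CLIP features and a real offset file.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """2-layer MLP with GELU and BatchNorm (hidden width is an assumption;
    the released checkpoints define the true shapes)."""
    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Project a batch of stand-in CLIP image embeddings (1024-dim) into the
# anchor space, then shift by a (here zero) modality-gap offset vector.
proj = ProjectionHead(in_dim=1024)
proj.eval()  # put BatchNorm into inference mode

clip_embeds = torch.randn(4, 1024)  # placeholder for real CLIP features
offset = torch.zeros(2560)          # placeholder for offsets/clip_coco/...
with torch.no_grad():
    anchor_embeds = proj(clip_embeds) + offset
print(anchor_embeds.shape)  # torch.Size([4, 2560])
```

In practice you would initialize the head from a downloaded `.pt` checkpoint (e.g. via `load_state_dict`) rather than from random weights, and load the offset from the corresponding `offsets/` directory.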
## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Learning rate | 5×10⁻⁴ (target encoders) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |

## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |

## Citation

```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```

## License

MIT License