---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[arXiv](https://arxiv.org/abs/2602.03098)
[GitHub](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality expansion framework that projects diverse modalities into LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) that map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions** — no paired multimodal data is needed.
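The projection head described above can be sketched as follows. The 2-layer MLP with GELU and BatchNorm matches the card's description; the hidden width (2048) and layer ordering are assumptions for illustration — see the GitHub repository for the reference implementation.

```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    """2-layer MLP mapping encoder embeddings into the 2560-dim anchor space.

    The hidden width and layer ordering here are illustrative assumptions.
    """

    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: project a batch of LanguageBind (768-dim) embeddings.
head = ProjectionHead(in_dim=768)
emb = torch.randn(4, 768)
print(head(emb).shape)  # torch.Size([4, 2560])
```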
## Repository Structure

```
├── projections/
│   ├── languagebind/                        # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt             # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt        # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt        # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt        # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt          # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt        # Video domain (59M)
│   └── target_encoders/                     # Target modality encoder projections
│       ├── clip.pt                          # CLIP → image (85M)
│       ├── viclip.pt                        # ViCLIP → video (59M)
│       ├── clap.pt                          # CLAP → audio (37M)
│       ├── uni3d.pt                         # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                      # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                   # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                    # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                  # LanguageBind → multi-modal (59M)
└── offsets/                                 # Precomputed modality gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|----------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download an offset vector
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full training and evaluation code.
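Once a projection output and an offset vector are in hand, the modality-gap correction can be sketched as below. The additive use of the mean-text-embedding offset and the renormalization step are assumptions for illustration — consult the repository for the exact recipe.

```python
import torch
import torch.nn.functional as F


def apply_offset(projected: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    """Shift projected embeddings by a precomputed offset, then renormalize.

    The additive form here is an illustrative assumption.
    """
    shifted = F.normalize(projected, dim=-1) + offset
    return F.normalize(shifted, dim=-1)


projected = torch.randn(4, 2560)   # stands in for projection-head outputs
offset = torch.randn(2560) * 0.01  # stands in for a loaded text_embed_mean.pkl
corrected = apply_offset(projected, offset)
print(corrected.shape)  # torch.Size([4, 2560])
```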

## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
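The temperature hyperparameter above suggests a contrastive alignment objective. A minimal sketch of a symmetric InfoNCE loss at temperature 0.07 is shown below; pairing projected embeddings against anchor-space text embeddings row-by-row is an assumption for illustration, not the confirmed training objective.

```python
import torch
import torch.nn.functional as F


def info_nce(proj: torch.Tensor, anchor: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over cosine-similarity logits (illustrative sketch)."""
    proj = F.normalize(proj, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = proj @ anchor.t() / tau     # [B, B] similarity matrix
    labels = torch.arange(proj.size(0))  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))


# Toy batch: 8 projected embeddings vs. 8 anchor embeddings (2560-dim).
loss = info_nce(torch.randn(8, 2560), torch.randn(8, 2560))
print(loss.item())
```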

## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |
|
| | ## Citation |
| |
|
| | ```bibtex |
| | @article{hong2026textme, |
| | title={TextME: Bridging Unseen Modalities Through Text Descriptions}, |
| | author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk}, |
| | journal={arXiv preprint arXiv:2602.03098}, |
| | year={2026} |
| | } |
| | ``` |
| |
|
| | ## License |
| |
|
| | MIT License |
| |
|