---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[arXiv](https://arxiv.org/abs/2602.03098)
[Code](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality-expansion framework that projects diverse modalities into an LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**: no paired multimodal data is needed.

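As a rough sketch of what such a projection head might look like (the hidden width and exact layer ordering below are assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Illustrative 2-layer MLP with GELU and BatchNorm that maps
    encoder embeddings (e.g. 768-dim LanguageBind) into the 2560-dim
    anchor space. Hidden width 2048 is an assumption."""
    def __init__(self, in_dim=768, hidden_dim=2048, out_dim=2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

head = ProjectionHead()
head.eval()  # eval mode so BatchNorm uses running statistics
with torch.no_grad():
    out = head(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 2560])
```
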
## Repository Structure

```
├── projections/
│   ├── languagebind/                       # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt            # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt       # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt       # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt       # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt         # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt       # Video domain (59M)
│   └── target_encoders/                    # Target modality encoder projections
│       ├── clip.pt                         # CLIP → image (85M)
│       ├── viclip.pt                       # ViCLIP → video (59M)
│       ├── clap.pt                         # CLAP → audio (37M)
│       ├── uni3d.pt                        # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                     # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                  # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                   # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                 # LanguageBind → multi-modal (59M)
└── offsets/                                # Precomputed modality-gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    └── languagebind_coco/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|---------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint on CPU
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download the precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

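The offset vectors are meant to close the modality gap between projected embeddings and the text anchor space. As a hedged illustration only (the pickle layout and this exact post-processing step are assumptions; inspect the downloaded file before relying on them), applying an offset might look like:

```python
import torch
import torch.nn.functional as F

# Hypothetical post-processing sketch, not the official API:
# shift projected embeddings by the modality-gap offset vector,
# then L2-normalize before cosine-similarity retrieval.
torch.manual_seed(0)
offset = torch.randn(2560) * 0.01  # stand-in for the loaded offset vector
embed = torch.randn(4, 2560)       # stand-in projected embeddings
shifted = F.normalize(embed + offset, dim=-1)
print(shifted.shape)
```
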
See the [GitHub repository](https://github.com/SoyeonHH/TextME) for full evaluation and training code.

## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU and BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |

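The temperature of 0.07 points to a contrastive training objective. As an illustrative sketch only (the exact TextME loss is not reproduced here; this is a generic symmetric InfoNCE over a batch of projected/anchor embedding pairs):

```python
import torch
import torch.nn.functional as F

def info_nce(proj, anchor, temperature=0.07):
    """Generic symmetric InfoNCE sketch: matched rows of `proj` and
    `anchor` are positives; all other rows in the batch are negatives."""
    proj = F.normalize(proj, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = proj @ anchor.t() / temperature
    labels = torch.arange(proj.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

loss = info_nce(torch.randn(8, 2560), torch.randn(8, 2560))
```
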
## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |

## Citation

```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```

## License

MIT License