---
license: mit
tags:
- multimodal
- cross-modal-retrieval
- zero-shot-classification
- text-only-training
- modality-expansion
- projection-network
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
---

# TextME: Bridging Unseen Modalities Through Text Descriptions

[![arXiv](https://img.shields.io/badge/arXiv-2602.03098-b31b1b.svg)](https://arxiv.org/abs/2602.03098)
[![GitHub](https://img.shields.io/badge/GitHub-TextME-blue)](https://github.com/SoyeonHH/TextME)

Official projection checkpoints and offset vectors for **TextME**, a text-only modality expansion framework that projects diverse modalities into LLM embedding space without paired cross-modal data.

## Model Description

TextME trains lightweight projection heads (2-layer MLPs, ~10M parameters each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses **only text descriptions**; no paired multimodal data is needed.

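A projection head of this shape is small enough to sketch directly. The snippet below is a minimal PyTorch illustration of a 2-layer MLP with GELU and BatchNorm that maps a 768-dim LanguageBind embedding into the 2560-dim anchor space; the hidden width and layer ordering are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """2-layer MLP projecting encoder embeddings into the anchor space.

    Hidden width and layer ordering are illustrative assumptions; see the
    official repository for the released architecture.
    """
    def __init__(self, in_dim: int = 768, hidden_dim: int = 2048, out_dim: int = 2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = ProjectionHead()
z = head(torch.randn(4, 768))  # batch of 4 source-encoder embeddings
print(z.shape)  # torch.Size([4, 2560])
```

At roughly `768*2048 + 2048*2560` weights, a head of this size is in the same ~10M-parameter regime described above.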
## Repository Structure

```
├── projections/
│   ├── languagebind/                      # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt           # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt      # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt      # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt      # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt        # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt      # Video domain (59M)
│   └── target_encoders/                   # Target modality encoder projections
│       ├── clip.pt                        # CLIP → image (85M)
│       ├── viclip.pt                      # ViCLIP → video (59M)
│       ├── clap.pt                        # CLAP → audio (37M)
│       ├── uni3d.pt                       # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                    # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                 # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                  # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                # LanguageBind → multi-modal (59M)
└── offsets/                               # Precomputed modality-gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    └── languagebind_coco/
```

## Supported Modalities

| Modality | Source Encoder | Target Encoder | Embedding Dim |
|----------|----------------|----------------|---------------|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |

## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt"
)

# Load the checkpoint
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl"
)
```

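The offset vectors are meant to correct the modality gap between projected target-encoder embeddings and the text region of the anchor space. Their exact format and usage are documented in the GitHub repository; the snippet below is a hypothetical sketch that assumes `text_embed_mean.pkl` holds a single mean text-embedding vector and applies a common mean-shift correction before cosine-similarity retrieval.

```python
import numpy as np

def apply_offset(embeds: np.ndarray, text_mean: np.ndarray) -> np.ndarray:
    """Shift projected embeddings toward the text region of the anchor space.

    Mean-shift correction is an assumption about how the offset files are
    used; consult the official repository for the exact procedure.
    """
    shifted = embeds - embeds.mean(axis=0) + text_mean
    # Re-normalize so downstream retrieval can use cosine similarity
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

# Toy example with random stand-ins for real projected embeddings
rng = np.random.default_rng(0)
embeds = rng.normal(size=(8, 2560))
text_mean = rng.normal(size=(2560,))
corrected = apply_offset(embeds, text_mean)
print(corrected.shape)  # (8, 2560)
```
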
See the [GitHub repository](https://github.com/SoyeonHH/TextME) for the full training and evaluation code.

## Training Details

| Parameter | Value |
|-----------|-------|
| Anchor space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target encoders) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |

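The temperature entry above suggests an InfoNCE-style contrastive objective. As a sketch only (this is an assumption, not a transcription of TextME's loss; see the paper for the exact formulation), a standard symmetric InfoNCE at τ = 0.07 over matched projection/anchor pairs looks like:

```python
import torch
import torch.nn.functional as F

def info_nce(proj: torch.Tensor, anchor: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between projected embeddings and anchor-space text
    embeddings. Assumes row i of `proj` and `anchor` form a positive pair.
    """
    proj = F.normalize(proj, dim=-1)
    anchor = F.normalize(anchor, dim=-1)
    logits = proj @ anchor.T / tau               # (B, B) cosine-similarity logits
    labels = torch.arange(proj.size(0))          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

loss = info_nce(torch.randn(16, 2560), torch.randn(16, 2560))
print(loss.item())
```

With matched pairs the diagonal similarities dominate at this low temperature, driving the loss toward zero; unrelated pairs stay near the log(B) chance level.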
## Results

### Text→X Retrieval (R@1)

| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|:-:|:-:|:-:|:-:|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |

### Zero-Shot Classification (Top-1)

| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|:-:|:-:|:-:|:-:|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |

## Citation

```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```

## License

MIT License