---
license: mit
---

# Triad2: Training extension to Triad
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64792e9d50ff700163188784/2o6JBAgVerp5sUVM7WChK.png)

I built Triad to explore dense feature correspondences between video, audio, and text modalities, focusing on learning fine-grained, localized relationships rather than just global alignment. The goal was to create a model that can ground features between specific image regions, audio segments, and text spans simultaneously.

This is a very early research checkpoint for dense multi-modal learning, with lots of room for improvement and experimentation. The current model was trained on a subset of AudioSet (~400k videos, roughly 20% of the full dataset) and CC3M (~2M image-text pairs) for just one epoch, so while it shows promising behavior, it is definitely not state-of-the-art yet.

#### TL;DR - The model embeds semantic concepts in dense features (image patches, audio segments, text tokens) instead of in a single global embedding. The embedding of a patch containing a cat in an image has high cosine similarity with the word "cat" and with an audio segment of a cat meowing.
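
As a rough sketch of what "dense" means here, the snippet below computes cosine similarities between every image patch embedding and every text token embedding. It assumes the model has been loaded as shown in the Inference section below, and `cat.jpg` is just a placeholder path for any local image.

```python
import torch.nn.functional as F

# Placeholder image path - substitute any local image of a cat
output = model(image="cat.jpg", text_list=["a cat meowing"])

patches = F.normalize(output['visual_feats'], dim=-1)  # [1, 256, 512] patch embeddings
tokens = F.normalize(output['text_feats'], dim=-1)     # [1, N_tokens, 512] token embeddings

# Cosine similarity between every text token and every image patch
sim = tokens @ patches.transpose(1, 2)                 # [1, N_tokens, 256]

# For each token, the patch it matches most strongly
best_patch = sim.argmax(dim=-1)                        # [1, N_tokens]
```

When image and text are passed together the model also returns a precomputed `vis_text_sim_matrix` (see the Output Key Reference below); the manual computation here just makes the cosine-similarity idea explicit.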

## What Makes This Interesting?
Unlike models that learn global alignment between modalities (think CLIP, ImageBind), Triad learns to map specific parts of each modality to each other. This means it can:
- Locate which parts of an image correspond to particular words or sounds
- Ground audio segments to relevant visual regions
- Connect text descriptions to precise areas in images
- (Potentially) Learn transitive audio-text relationships through the shared visual space

## What's Next?
I've got lots of ideas for making this better: longer training, playing with the architecture, investigating some interesting behaviors I've noticed, and tackling the big open problem of handling text and audio features that have no counterpart in the visual features.

I'm actively looking to push this research further and am super interested in tackling more multimodal learning problems. Feel free to reach out if you're working in this space!

## Inference

The model can process image, audio, and text inputs, either individually or together.

## Installation & Loading

```python
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
import torch
import json
import sys
from pathlib import Path

def load_model(path="SajayR/Triad", device="cpu"):
    # Download the weights, config, and model definition from the Hub
    model_path = hf_hub_download(repo_id=path, filename="model.safetensors")
    model_config = hf_hub_download(repo_id=path, filename="config.json")
    model_arch = hf_hub_download(repo_id=path, filename="hf_model.py")

    # Make the downloaded hf_model.py importable and build the model
    sys.path.append(str(Path(model_arch).parent))
    from hf_model import Triad

    with open(model_config) as f:
        model = Triad(**json.load(f))
    weights = load_file(model_path)
    model.load_state_dict(weights)
    return model.to(device)

# Initialize model
model = load_model()  # Use load_model(device="cuda") for GPU
```
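
The examples below call the model directly. For pure inference you will usually also want to switch to eval mode and disable gradient tracking; this is standard PyTorch practice rather than anything the checkpoint requires:

```python
import torch

model.eval()  # disable dropout / batch-norm updates

with torch.inference_mode():  # no gradients needed for inference
    output = model(image="path/to/image.jpg")
```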

## Single Modality Examples

### Image Input

You can provide images as file paths or tensors:

```python
# From file path
output = model(image="path/to/image.jpg")
output['visual_feats'].shape  # torch.Size([1, 256, 512])

# From tensor (already pre-processed)
from torchvision import transforms
from PIL import Image

# Load and preprocess image
image = Image.open("path/to/image.jpg").convert('RGB')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
image_tensor = transform(image)  # Shape: [3, 224, 224]

# Pass to model
output = model(image=image_tensor)
output['visual_feats'].shape  # torch.Size([1, 256, 512])
```

### Audio Input

```python
# Audio only - returns audio features (B, N_segments, D)
# The model is currently trained on 1-second audio segments; longer audio clips may perform worse
audio = torch.randn(1, 16331)  # Raw audio waveform
output = model(audio=audio)
output['audio_feats'].shape  # torch.Size([1, 50, 512])
```
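
To run on a real audio file rather than random noise, load and resample it yourself. The 16 kHz target below is an assumption based on the ~16k-sample, roughly one-second example waveform above; check the training configuration if results look off.

```python
import torchaudio

# Load a (hypothetical) audio file: returns [channels, samples] and its sample rate
waveform, sr = torchaudio.load("path/to/sound.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono: [1, samples]
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # assumed 16 kHz target

output = model(audio=waveform)
output['audio_feats'].shape  # (1, N_segments, 512)
```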

### Text Input

```python
# Text only - returns text features (B, N_tokens, D)
text_list = ["a man riding a bicycle"]
output = model(text_list=text_list)
output['text_feats'].shape  # torch.Size([1, 5, 512])
```

## Batch Processing

The model now supports batch processing for image inputs:

### Batch of Image Paths

```python
# Process a batch of image paths
image_paths = ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
output = model(image=image_paths)
output['visual_feats'].shape  # torch.Size([3, 256, 512])
```

### Batch of Image Tensors

```python
# Process a batch of image tensors
import torch
from torchvision import transforms
from PIL import Image

# Create a transform
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load and preprocess images
images = []
for path in ["image1.jpg", "image2.jpg", "image3.jpg"]:
    img = Image.open(path).convert('RGB')
    images.append(transform(img))

# Stack into a batch
batch = torch.stack(images)  # Shape: [3, 3, 224, 224]

# Process the batch
output = model(image=batch)
output['visual_feats'].shape  # torch.Size([3, 256, 512])
```

## Multi-Modal Examples

### Image and Audio Together

```python
# Process image and audio together (audio defined as in the Audio Input example above)
output = model(
    audio=audio,
    image="path/to/image.jpg"
)

print(output.keys())  # dict_keys(['visual_feats', 'audio_feats', 'vis_audio_sim_matrix'])

# Output shapes:
# - audio_feats: [1, 50, 512]           # (batch, audio_segments, features)
# - visual_feats: [1, 256, 512]         # (batch, image_patches, features)
# - vis_audio_sim_matrix: [1, 50, 256]  # (batch, audio_segments, image_patches)
```

The similarity matrix shows the correspondence between each audio segment and each image patch.
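
For example, you can use `vis_audio_sim_matrix` to ask which image patch each audio segment matches best. The 16x16 reshape below assumes the 256 patches come from a square ViT-style grid, which is an assumption about the visual backbone rather than something stated in this card:

```python
sim = output['vis_audio_sim_matrix'][0]  # [50, 256]: audio segments x image patches

# Most similar image patch for each audio segment
best_patch = sim.argmax(dim=-1)          # [50]

# Heatmap over the image for the first audio segment,
# assuming the 256 patches form a 16x16 grid
heatmap = sim[0].reshape(16, 16)
```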

## Output Key Reference

Depending on which modalities you provide, the model returns different outputs:

- `visual_feats`: (B, 256, 512) # When you pass an image
- `audio_feats`: (B, 50, 512) # When you pass audio
- `text_feats`: (B, N_tokens, 512) # When you pass text
- `vis_text_sim_matrix`: (B, N_tokens, 256) # When you pass both image and text
- `vis_audio_sim_matrix`: (B, 50, 256) # When you pass both image and audio
- `text_audio_sim_matrix`: (B, N_tokens, 50) # When you pass both text and audio

Where:
- B = batch size
- 256 = number of image patches
- 50 = number of audio segments
- N_tokens = variable number of text tokens
- 512 = embedding dimension
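
Putting it together, passing all three modalities in one call should return the dense features plus the pairwise similarity matrices listed above. The exact key set for the three-way case is not spelled out in this card, so treat this as a sketch of the expected behavior rather than a guarantee:

```python
output = model(
    image="path/to/image.jpg",
    audio=audio,  # as in the Audio Input example above
    text_list=["a dog barking in a park"]
)

print(output.keys())
# Expected, based on the reference above: visual_feats, audio_feats, text_feats,
# vis_text_sim_matrix, vis_audio_sim_matrix, text_audio_sim_matrix
```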