SajayR committed on
Commit f5ab9d3 · verified · 1 Parent(s): 4a47434

Push model using huggingface_hub.

Files changed (2):
  1. README.md +6 -181
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,185 +1,10 @@
 ---
 license: mit
 ---
- # Triad: Dense Cross-Modal Feature Learning
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64792e9d50ff700163188784/2o6JBAgVerp5sUVM7WChK.png)
 
- I built Triad to explore dense feature correspondences between video, audio, and text modalities - focusing on learning fine-grained, localized relationships rather than just global alignment. The goal was to create a model that could ground features between specific image regions, audio segments, and text spans simultaneously.
-
- This is a very early research checkpoint for dense multi-modal learning, with lots of room for improvement and experimentation. The current model was trained on a subset of AudioSet (\~400k videos, \~20% of the full dataset) and CC3M (\~2M image-text pairs) for just one epoch, so while it shows promising behavior, it's definitely not state-of-the-art yet.
-
- #### TL;DR: The model embeds semantic concepts on dense features (patches, audio segments, text spans) instead of packing them into a single global embedding. The embedding of a patch that contains a cat in an image has high cosine similarity with the embedding of the word "cat" and with that of an audio segment of a cat meowing.
-
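To make the TL;DR concrete, here is a minimal sketch of dense cross-modal similarity using synthetic tensors (the shapes mirror the examples below; the values are random stand-ins, not real model outputs):

```python
import torch
import torch.nn.functional as F

# Stand-ins for dense features: 256 image patches and 5 text tokens, 512-dim each
visual_feats = torch.randn(1, 256, 512)
text_feats = torch.randn(1, 5, 512)

# Cosine similarity = dot product of L2-normalized features
v = F.normalize(visual_feats, dim=-1)
t = F.normalize(text_feats, dim=-1)
sim = torch.bmm(t, v.transpose(1, 2))  # (1, 5, 256): token-to-patch similarity

# For each token, the index of its best-matching patch
best_patch = sim.argmax(dim=-1)  # (1, 5)
```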
- ## What Makes This Interesting?
- Unlike models that learn global alignment between modalities (think CLIP, ImageBind), Triad learns to map specific parts of each modality to each other. This means it can:
- - Locate which parts of an image correspond to particular words or sounds
- - Ground audio segments to relevant visual regions
- - Connect text descriptions to precise areas in images
- - (Potentially) Learn transitive audio-text relationships through the shared visual space
-
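For localization, a flat patch index can be mapped back to an image region. A small sketch, assuming the 256 patches form a 16x16 grid over a 224x224 input (the grid layout is my assumption, not something stated in this card):

```python
# Map a flat patch index (0-255) to its (row, col) cell in an assumed 16x16 grid
def patch_to_cell(idx: int, grid: int = 16) -> tuple[int, int]:
    return divmod(idx, grid)

# Patch 0 is top-left; patch 255 is bottom-right
print(patch_to_cell(0))    # (0, 0)
print(patch_to_cell(255))  # (15, 15)
```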
- ## What's Next?
- I've got lots of ideas for making this better: longer training, playing with the architecture, investigating some interesting behaviors I've noticed, and tackling the big open issue of text and audio features that have no counterpart in the visual features.
-
- I'm actively looking to push this research further and am super interested in tackling more multimodal learning problems. Feel free to reach out if you're working in this space!
-
- ## Inference
-
- The model can process image, audio, and text inputs - either individually or together.
-
- ## Installation & Loading
-
- ```python
- from safetensors.torch import load_file
- from huggingface_hub import hf_hub_download
- import torch
- import json
- import sys
- from pathlib import Path
-
- def load_model(path="SajayR/Triad", device="cpu"):
-     model_path = hf_hub_download(repo_id=path, filename="model.safetensors")
-     model_config = hf_hub_download(repo_id=path, filename="config.json")
-     model_arch = hf_hub_download(repo_id=path, filename="hf_model.py")
-
-     sys.path.append(str(Path(model_arch).parent))
-     from hf_model import Triad
-
-     with open(model_config) as f:
-         model = Triad(**json.load(f))
-     weights = load_file(model_path)
-     model.load_state_dict(weights)
-     return model.to(device)
-
- # Initialize model
- model = load_model()  # Use load_model(device="cuda") for GPU
- ```
-
- ## Single Modality Examples
-
- ### Image Input
-
- You can provide images as file paths or tensors:
-
- ```python
- # From file path
- output = model(image="path/to/image.jpg")
- output['visual_feats'].shape  # torch.Size([1, 256, 512])
-
- # From tensor (already pre-processed)
- from torchvision import transforms
- from PIL import Image
-
- # Load and preprocess image
- image = Image.open("path/to/image.jpg").convert('RGB')
- transform = transforms.Compose([
-     transforms.Resize((224, 224)),
-     transforms.ToTensor(),
-     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
- ])
- image_tensor = transform(image)  # Shape: [3, 224, 224]
-
- # Pass to model
- output = model(image=image_tensor)
- output['visual_feats'].shape  # torch.Size([1, 256, 512])
- ```
-
- ### Audio Input
-
- ```python
- # Audio only - returns audio features (B, N_segments, D)
- # The model is currently trained on audio segments of 1 second each;
- # longer audio sequences may perform worse
- audio = torch.randn(1, 16331)  # Raw audio waveform
- output = model(audio=audio)
- output['audio_feats'].shape  # torch.Size([1, 50, 512])
- ```
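Since the model is trained on short segments, a longer clip can be split into roughly 1-second chunks and processed one at a time. A minimal sketch, assuming a 16 kHz sample rate (my assumption; the 16331-sample example above suggests roughly one second at 16 kHz):

```python
import torch

def chunk_waveform(wav: torch.Tensor, sr: int = 16000) -> list[torch.Tensor]:
    """Split a mono waveform of shape (1, T) into full 1-second chunks."""
    chunks = wav.squeeze(0).split(sr)
    return [c.unsqueeze(0) for c in chunks if c.numel() == sr]  # drop short tail

wav = torch.randn(1, 3 * 16000 + 500)  # ~3.03 s of audio
segments = chunk_waveform(wav)         # 3 chunks of shape (1, 16000)
```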
-
- ### Text Input
-
- ```python
- # Text only - returns text features (B, N_tokens, D)
- text_list = ["a man riding a bicycle"]
- output = model(text_list=text_list)
- output['text_feats'].shape  # torch.Size([1, 5, 512])
- ```
-
- ## Batch Processing
-
- The model now supports batch processing for image inputs:
-
- ### Batch of Image Paths
-
- ```python
- # Process a batch of image paths
- image_paths = ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"]
- output = model(image=image_paths)
- output['visual_feats'].shape  # torch.Size([3, 256, 512])
- ```
-
- ### Batch of Image Tensors
-
- ```python
- # Process a batch of image tensors
- import torch
- from torchvision import transforms
- from PIL import Image
-
- # Create a transform
- transform = transforms.Compose([
-     transforms.Resize((224, 224)),
-     transforms.ToTensor(),
-     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
- ])
-
- # Load and preprocess images
- images = []
- for path in ["image1.jpg", "image2.jpg", "image3.jpg"]:
-     img = Image.open(path).convert('RGB')
-     images.append(transform(img))
-
- # Stack into a batch
- batch = torch.stack(images)  # Shape: [3, 3, 224, 224]
-
- # Process the batch
- output = model(image=batch)
- output['visual_feats'].shape  # torch.Size([3, 256, 512])
- ```
-
- ## Multi-Modal Examples
-
- ### Image and Audio Together
-
- ```python
- # Process image and audio together
- output = model(
-     audio=audio,
-     image="path/to/image.jpg"
- )
-
- print(output.keys())  # dict_keys(['visual_feats', 'audio_feats', 'vis_audio_sim_matrix'])
-
- # Output shapes:
- # - audio_feats: [1, 50, 512]          # (batch, audio_segments, features)
- # - visual_feats: [1, 256, 512]        # (batch, image_patches, features)
- # - vis_audio_sim_matrix: [1, 50, 256] # (batch, audio_segments, image_patches)
- ```
-
- The similarity matrix shows the correspondence between each audio segment and each image patch.
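One row of such a matrix can be reshaped and upsampled to image resolution for visualization. A sketch with a random stand-in matrix, again assuming the 256 patches form a 16x16 grid (my assumption about the patch layout):

```python
import torch
import torch.nn.functional as F

# Stand-in for vis_audio_sim_matrix: (batch, audio_segments, image_patches)
sim = torch.randn(1, 50, 256)

# Similarity map of the first audio segment over an assumed 16x16 patch grid
seg_map = sim[0, 0].reshape(1, 1, 16, 16)

# Upsample to 224x224 so it can be overlaid on the input image
heatmap = F.interpolate(seg_map, size=(224, 224), mode="bilinear", align_corners=False)
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```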
-
- ## Output Key Reference
-
- Depending on which modalities you provide, the model returns different outputs:
-
- - `visual_feats`: (B, 256, 512)              # When you pass an image
- - `audio_feats`: (B, 50, 512)                # When you pass audio
- - `text_feats`: (B, N_tokens, 512)           # When you pass text
- - `vis_text_sim_matrix`: (B, N_tokens, 256)  # When you pass both image and text
- - `vis_audio_sim_matrix`: (B, 50, 256)       # When you pass both image and audio
- - `text_audio_sim_matrix`: (B, N_tokens, 50) # When you pass both text and audio
-
- Where:
- - B = batch size
- - 256 = number of image patches
- - 50 = number of audio segments
- - N_tokens = variable length of text tokens
- - 512 = embedding dimension
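The key convention above can be restated as a small helper; this is just a convenience restatement of the table, not part of the model's API:

```python
def expected_keys(image: bool = False, audio: bool = False, text: bool = False) -> set[str]:
    """Output keys the model should return for a given combination of inputs."""
    keys = set()
    if image:
        keys.add("visual_feats")
    if audio:
        keys.add("audio_feats")
    if text:
        keys.add("text_feats")
    if image and text:
        keys.add("vis_text_sim_matrix")
    if image and audio:
        keys.add("vis_audio_sim_matrix")
    if text and audio:
        keys.add("text_audio_sim_matrix")
    return keys

print(sorted(expected_keys(image=True, audio=True)))
# ['audio_feats', 'vis_audio_sim_matrix', 'visual_feats']
```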
 
 ---
 license: mit
+ tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
 ---
 
 
+ This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: [More Information Needed]
+ - Docs: [More Information Needed]
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:1d74973238c3e88faa0528646e4c0f1bf462964d503fe89595ec59611dff0121
 size 994047476

 version https://git-lfs.github.com/spec/v1
+ oid sha256:6744ec47f7ebeb4aabe0be576ce6c4943f9951d041e04bdb7154204413e0556f
 size 994047476