SajayR committed · Commit 4d81923 · verified · Parent(s): d6ee458

Update README.md

Files changed (1): README.md +1 -3
README.md CHANGED
```diff
@@ -3,6 +3,7 @@ license: mit
 ---
 
 # Triad: Dense Cross-Modal Feature Learning
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64792e9d50ff700163188784/2o6JBAgVerp5sUVM7WChK.png)
 
 I built Triad to explore dense feature correspondences between video, audio and text modalities - focusing on learning fine-grained, localized relationships rather than just global alignment. The goal was to create a model that could ground features between specific image regions, audio segments, and text spans simultaneously.
 
@@ -14,9 +15,6 @@ Unlike models that learn global alignment between modalities (think CLIP, ImageB
 - Connect text descriptions to precise areas in images
 - (Potentially) Learn transitive audio-text relationships through the shared visual space
 
-## Visualization
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64792e9d50ff700163188784/2o6JBAgVerp5sUVM7WChK.png)
-
 ## What's Next?
 I've got lots of ideas for making this better - longer training, playing with the architecture, investigating some interesting behaviors I've noticed and solving that massive issue of dealing with text, audio features that do not exist in the visual features.
 
```
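The distinction this README draws between global alignment (CLIP, ImageBind) and dense, localized correspondence can be sketched roughly as follows. This is a minimal illustration only: the feature shapes, mean pooling, and cosine similarity are my assumptions for the sketch, not Triad's actual architecture.

```python
import numpy as np

# Hypothetical feature shapes - illustrative only, not Triad's design.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))   # e.g. 14x14 grid of image patch features
tokens = rng.normal(size=(8, 64))      # e.g. 8 text token features

def l2_normalize(x):
    # L2-normalize rows so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

patches = l2_normalize(patches)
tokens = l2_normalize(tokens)

# Global alignment (CLIP-style): pool each modality to a single vector,
# producing one score per (image, text) pair - localization is lost.
global_score = (l2_normalize(patches.mean(0, keepdims=True))
                @ l2_normalize(tokens.mean(0, keepdims=True)).T).item()

# Dense alignment: keep the full token-by-patch similarity map, so each
# text token (or audio segment) can be grounded to specific image regions.
sim_map = tokens @ patches.T           # shape (8, 196)
best_patch = sim_map.argmax(axis=1)    # most similar patch per token
```

The dense map is what makes region-level grounding possible: instead of a single scalar, every text span keeps its own similarity profile over image patches.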