Update README.md
README.md
CHANGED
@@ -3,6 +3,7 @@ license: mit
 ---
 
 # Triad: Dense Cross-Modal Feature Learning
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f233820a16587ea967adb8/KqE5wsvVJ8gKe3iQNSM7E.png)
 
 I built Triad to explore dense feature correspondences between video, audio and text modalities - focusing on learning fine-grained, localized relationships rather than just global alignment. The goal was to create a model that could ground features between specific image regions, audio segments, and text spans simultaneously.
 
@@ -14,9 +15,6 @@ Unlike models that learn global alignment between modalities (think CLIP, ImageB
 - Connect text descriptions to precise areas in images
 - (Potentially) Learn transitive audio-text relationships through the shared visual space
 
-## Visualization
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f233820a16587ea967adb8/KqE5wsvVJ8gKe3iQNSM7E.png)
-
 ## What's Next?
 I've got lots of ideas for making this better - longer training, playing with the architecture, investigating some interesting behaviors I've noticed and solving that massive issue of dealing with text, audio features that do not exist in the visual features.
 