A collection of public datasets for vision-language modalities, especially for Frozen Vision Language Alignment.
Note SynthRecap