---
license: apache-2.0
tags:
- multimodal
- agentic-ai
- retrieval-augmented
- explainable-ai
- reasoning
- automation
- accessibility
- vision-language
- audio-processing
- table-understanding
language:
- en
- multilingual
pipeline_tag: any-to-any
---

# Universal-Multimodal-Agent (UMA)

## New: Multimodal Datasets Catalog (Phase 1 Data Collection)

We’re kicking off data collection for UMA. Below is a curated, growing catalog of widely used public multimodal datasets, organized by category, with brief notes and links for immediate use; a minimal loading sketch follows the Section A list.

### A. Text–Image

- LAION-5B — Massive web-scale image–text pairs; LAION-400M/1B subsets available. https://laion.ai/blog/laion-5b/
- COCO (2017 Captions) — Image captioning and detection; strong baselines. https://cocodataset.org/#home
- Visual Genome — Dense region descriptions, objects, attributes, relationships. https://visualgenome.org/
- Conceptual Captions (CC3M / CC12M) — Web image–alt-text pairs derived from Common Crawl. https://ai.google.com/research/ConceptualCaptions/ and https://github.com/google-research-datasets/conceptual-captions
- Flickr30k / Flickr8k — Classic captioning sets. https://hockenmaier.cs.illinois.edu/CS546-2014/data/flickr30k.html
- SBU Captions — Image–text pairs from Flickr. http://www.cs.virginia.edu/~vicente/sbucaptions/
- TextCaps — OCR-centric captioning with text in images. https://textvqa.org/textcaps/
- VizWiz — Images taken by blind users; accessibility focus. https://vizwiz.org/
- WebLI (if accessible) — Large-scale multilingual image–text pairs. https://ai.google/discover/papers/webli/
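
Many of these sets can be pulled straight through the Hugging Face `datasets` library. The sketch below is illustrative only: the Hub ID `conceptual_captions` and its `caption` / `image_url` columns are assumptions about the particular mirror you use, so verify the ID and license before relying on it.

```python
# Minimal sketch: stream an image–text set instead of downloading it in full.
# "conceptual_captions" and its column names are assumptions; web-scale sets
# like this usually ship (caption, image URL) rows, so images are fetched separately.
from datasets import load_dataset

ds = load_dataset("conceptual_captions", split="train", streaming=True)

for i, row in enumerate(ds):
    print(row["caption"], "->", row["image_url"])
    if i == 2:  # peek at a few rows only
        break
```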

### B. Text–Image Reasoning / VQA / Document QA

- VQAv2 — Visual Question Answering benchmark. https://visualqa.org/
- GQA — Compositional reasoning over scenes. https://cs.stanford.edu/people/dorarad/gqa/
- OK-VQA / A-OKVQA — Requires external knowledge. https://okvqa.allenai.org/
- ScienceQA — Multimodal science questions with diagrams. https://scienceqa.github.io/
- DocVQA / TextVQA — Reading text in images. https://textvqa.org/
- InfographicVQA — VQA on charts/infographics. https://www.microsoft.com/en-us/research/project/infographicvqa/
- ChartQA / PlotQA / Chart-to-Text — Chart understanding and reasoning. https://github.com/vis-nlp/ChartQA

### C. Text–Table (Structured Data)

- TabFact — Table fact verification from Wikipedia. https://tabfact.github.io/
- WikiTableQuestions — Semantic parsing over tables. https://ppasupat.github.io/WikiTableQuestions/
- ToTTo — Controlled table-to-text generation. https://github.com/google-research-datasets/ToTTo
- SQA (Sequential QA over tables) — Multi-turn QA on tables. https://allenai.org/data/sqa
- Spider — Text-to-SQL over multiple databases (semi-structured). https://yale-lily.github.io/spider
- TURL — Table understanding pretraining. https://github.com/sunlab-osu/TURL
- OpenTabQA — Open-domain QA over tables. https://github.com/IBM/OpenTabQA
- MultiTab / TABBIE resources — Tabular reasoning. https://multitab-project.github.io/

### D. Text–Audio / Speech

- LibriSpeech — ASR with read English speech. https://www.openslr.org/12
- Common Voice — Multilingual crowdsourced speech. https://commonvoice.mozilla.org/
- Libri-Light — Large-scale unlabeled speech for self-supervised learning. https://github.com/facebookresearch/libri-light
- TED-LIUM / How2 — Talks with transcripts and multimodal context. https://lium.univ-lemans.fr/ted-lium/
- AudioSet — Weakly labeled ontology of sounds (with YouTube links). https://research.google.com/audioset/
- ESC-50 / UrbanSound8K — Environmental sound classification. https://github.com/karoldvl/ESC-50
- VoxCeleb — Speaker identification/verification. http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- SPGISpeech (if license allows) — Financial-domain ASR. https://datasets.kensho.com/s/spgispeech
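
Audio sets load the same way, with the extra step of decoding waveforms. A minimal sketch, assuming the Hub mirror `librispeech_asr` with the `clean` config; the exact ID, config, and any `trust_remote_code` requirement depend on the mirror you use.

```python
# Minimal sketch: stream LibriSpeech and decode audio on the fly.
# The Hub ID/config ("librispeech_asr", "clean") are assumptions; adjust to the
# mirror you use. Casting the audio column resamples to a common 16 kHz rate.
from datasets import Audio, load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = next(iter(ds))
print(sample["text"])
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
```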

### E. Full Multimodal / Multi-domain (Text–Image–Table–Audio and more)

- MMMU — Massive Multi-discipline Multimodal Understanding benchmark. https://mmmu-benchmark.github.io/
- MMBench / MME / LVLM-eHub — Comprehensive LVLM evaluation suites. https://mmbench.opencompass.org.cn/
- EgoSchema / Ego4D (video + audio + text) — Egocentric multi-sensor datasets. https://ego4d-data.org/
- Multimodal C4 (MMC4) — Web-scale corpus of images interleaved with text. https://github.com/allenai/mmc4
- WebQA / MultimodalQA — QA over web images and text. https://github.com/omni-us/research-multimodalqa
- Chart/Document suites: DocLayNet, PubLayNet, DocVQA series. https://github.com/ibm-aur-nlp/PubLayNet
- ArXivDoc / ChartX / SynthChart — Synthetic + real document/chart sets. https://github.com/vis-nlp/ChartX

### F. Safety, Bias, and Accessibility-focused Sets

- Hateful Memes — Multimodal bias/toxicity benchmark. https://github.com/facebookresearch/mmf/tree/main/projects/hateful_memes
- ImageNet-A/O/R — Robustness variants of ImageNet. https://github.com/hendrycks/imagenet-r
- VizWiz (also listed under A) — Accessibility-oriented images and questions. https://vizwiz.org/
- MS MARCO (multimodal passages via documents) + OCR corpora — Retrieval grounding. https://microsoft.github.io/msmarco/

### G. Licensing and Usage Notes

- Always check each dataset’s license and terms of use; some require access requests or restrict commercial use.
- Maintain separate manifests with source, license, checksum, and intended use. Prefer mirrored, deduplicated shards with exact provenance. A sketch of one manifest entry follows.
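
The manifest schema is not fixed yet. As a minimal sketch, with file layout, field names, and the checksum helper all being assumptions that simply mirror the source/license/checksum/intended-use fields named above, one entry per line of `datasets/manifest/*.jsonl` could be written like this:

```python
# Minimal sketch: append one manifest entry per shard to datasets/manifest/*.jsonl.
# Field names and paths are assumptions; they mirror the notes above
# (source, license, checksum, intended use, provenance).
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large shards never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

shard = Path("shards/coco2017-train-0000.tar")  # hypothetical local/mirrored shard
entry = {
    "dataset": "coco_captions_2017",
    "source_url": "https://cocodataset.org/#home",
    "license": "<copy the upstream license string verbatim>",
    "shard": str(shard),
    "sha256": sha256_of(shard) if shard.exists() else None,
    "intended_use": "captioning pretraining",
    "retrieved": "2025-01-01",  # provenance: when this copy was taken
}

# Append one JSON object per line to the category manifest.
manifest = Path("datasets/manifest/text_image.jsonl")
manifest.parent.mkdir(parents=True, exist_ok=True)
with manifest.open("a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```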

---

## Call for Collaboration: Build UMA with Us

We’re assembling an open team. If you’re passionate about agentic multimodal AI, join us.

Roles we’re seeking (volunteer or sponsored collaborations):

- Research Scientists: Multimodal learning, alignment, grounding, evaluation.
- Research Engineers: Training pipelines, distributed systems, retrieval, tool-use interfaces.
- Data Scientists / Data Engineers: Dataset curation, cleaning, deduplication, data governance.
- Domain Experts: Finance, healthcare, education, accessibility, scientific communication.
- Accessibility Specialists: Inclusive design, alt-text/sonification, screen-reader workflows, disability advocacy.
- MLOps/Infra: Dataset storage, versioning, scalable training and evaluation infrastructure (HF Datasets, WebDataset, Parquet, Arrow).
- Community & Documentation: Tutorials, examples, benchmark harnesses, governance.

How to get involved now:

- Open a Discussion with your background and interests: https://huggingface.co/amalsp/Universal-Multimodal-Agent/discussions
- Propose datasets or contribute manifests via PRs (add to datasets/manifest/*.jsonl)
- Share domain-specific tasks and evaluation rubrics
- Star and watch the repo for updates

Initial roadmap for data:

- Phase 1: Curate public datasets and licenses; build manifests and downloaders
- Phase 2: Unified preprocessing (image, OCR, tables, audio), deduplication, quality filters (a minimal dedup sketch follows this list)
- Phase 3: Balanced training mixtures + evaluation suites (MMMU / MMBench / DocVQA / ASR)
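
For Phase 2, exact-duplicate removal can start with plain content hashing before moving to perceptual or embedding-based near-duplicate detection. The helper and field names below are illustrative assumptions, not a committed design.

```python
# Minimal exact-dedup sketch for Phase 2: drop records whose chosen field
# hashes to something already seen. Field names ("image", "text") are assumptions
# and vary per source; near-duplicate detection (pHash, MinHash, embeddings) comes later.
import hashlib
from typing import Iterable, Iterator

def dedup_exact(records: Iterable[dict], key: str = "image") -> Iterator[dict]:
    """Yield records whose `key` field has not been seen before (byte-identical)."""
    seen: set[str] = set()
    for rec in records:
        value = rec[key]
        payload = value if isinstance(value, bytes) else str(value).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield rec

# Tiny usage example on toy records; the repeated caption is dropped.
toy = [
    {"text": "a dog on a beach"},
    {"text": "a chart of quarterly revenue"},
    {"text": "a dog on a beach"},
]
print([r["text"] for r in dedup_exact(toy, key="text")])
```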

Ethics & Safety:

- Respect dataset licenses, privacy, and consent. Implement filter lists and red-teaming sets.
- Document known biases and limitations; enable opt-out mechanisms where applicable.

Contributors will be acknowledged in the README and in a future preprint.

## Original Project Overview

[Existing content retained below]