---
license: apache-2.0
tags:
- multimodal
- agentic-ai
- retrieval-augmented
- explainable-ai
- reasoning
- automation
- accessibility
- vision-language
- audio-processing
- table-understanding
language:
- en
- multilingual
pipeline_tag: any-to-any
---

# Universal-Multimodal-Agent (UMA)

## New: Multimodal Datasets Catalog (Phase 1 Data Collection)

We’re kicking off data collection for UMA. Below is a curated, growing catalog of widely used public multimodal datasets, organized by category, with brief notes and links for immediate use; a minimal loading sketch follows the Section A list.

### A. Text–Image

- LAION-5B — Massive web-scale image–text pairs; LAION-400M/1B subsets available. https://laion.ai/blog/laion-5b/
- COCO (2017 Captions) — Image captioning and detection; strong baselines. https://cocodataset.org/#home
- Visual Genome — Dense region descriptions, objects, attributes, relationships. https://visualgenome.org/
- Conceptual Captions (CC3M / CC12M) — Web image–alt-text pairs derived from Common Crawl. https://ai.google.com/research/ConceptualCaptions/ and https://github.com/google-research-datasets/conceptual-captions
- Flickr30k / Flickr8k — Classic captioning sets. https://hockenmaier.cs.illinois.edu/CS546-2014/data/flickr30k.html
- SBU Captions — Image–text pairs from Flickr. http://www.cs.virginia.edu/~vicente/sbucaptions/
- TextCaps — OCR-centric captioning with text in images. https://textvqa.org/textcaps/
- VizWiz — Images taken by blind users; accessibility focus. https://vizwiz.org/
- WebLI (if accessible) — Large-scale multilingual image–text pairs. https://ai.google/discover/papers/webli/
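
Many of these sets can be pulled straight through the Hugging Face `datasets` library. The sketch below is illustrative only: the Hub ID `conceptual_captions` and its `caption` / `image_url` columns are assumptions about the particular mirror you use, so verify the ID and license before relying on it.

```python
# Minimal sketch: stream an image–text set instead of downloading it in full.
# "conceptual_captions" and its column names are assumptions; web-scale sets
# like this usually ship (caption, image URL) rows, so images are fetched separately.
from datasets import load_dataset

ds = load_dataset("conceptual_captions", split="train", streaming=True)

for i, row in enumerate(ds):
    print(row["caption"], "->", row["image_url"])
    if i == 2:  # peek at a few rows only
        break
```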

### B. Text–Image Reasoning / VQA / Document QA

- VQAv2 — Visual Question Answering benchmark. https://visualqa.org/
- GQA — Compositional reasoning over scenes. https://cs.stanford.edu/people/dorarad/gqa/
- OK-VQA / A-OKVQA — Requires external knowledge. https://okvqa.allenai.org/
- ScienceQA — Multimodal science questions with diagrams. https://scienceqa.github.io/
- DocVQA / TextVQA — Reading text in images. https://textvqa.org/
- InfographicVQA — VQA on charts/infographics. https://www.microsoft.com/en-us/research/project/infographicvqa/
- ChartQA / PlotQA / Chart-to-Text — Chart understanding and reasoning. https://github.com/vis-nlp/ChartQA

### C. Text–Table (Structured Data)

- TabFact — Table fact verification from Wikipedia. https://tabfact.github.io/
- WikiTableQuestions — Semantic parsing over tables. https://ppasupat.github.io/WikiTableQuestions/
- ToTTo — Controlled table-to-text generation. https://github.com/google-research-datasets/ToTTo
- SQA (Sequential QA over tables) — Multi-turn QA on tables. https://allenai.org/data/sqa
- Spider — Text-to-SQL over multiple databases (semi-structured). https://yale-lily.github.io/spider
- TURL — Table understanding pretraining. https://github.com/sunlab-osu/TURL
- OpenTabQA — Open-domain QA over tables. https://github.com/IBM/OpenTabQA
- MultiTab / TABBIE resources — Tabular reasoning. https://multitab-project.github.io/

### D. Text–Audio / Speech

- LibriSpeech — ASR with read English speech. https://www.openslr.org/12
- Common Voice — Multilingual crowdsourced speech. https://commonvoice.mozilla.org/
- Libri-Light — Large-scale unlabeled speech for self-supervised learning. https://github.com/facebookresearch/libri-light
- TED-LIUM / How2 — Talks with transcripts and multimodal context. https://lium.univ-lemans.fr/ted-lium/
- AudioSet — Weakly labeled ontology of sounds (with YouTube links). https://research.google.com/audioset/
- ESC-50 / UrbanSound8K — Environmental sound classification. https://github.com/karoldvl/ESC-50
- VoxCeleb — Speaker identification/verification. http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- SPGISpeech (if license allows) — Financial-domain ASR. https://datasets.kensho.com/s/spgispeech
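
Audio sets load the same way, with the extra step of decoding waveforms. A minimal sketch, assuming the Hub mirror `librispeech_asr` with the `clean` config; the exact ID, config, and any `trust_remote_code` requirement depend on the mirror you use.

```python
# Minimal sketch: stream LibriSpeech and decode audio on the fly.
# The Hub ID/config ("librispeech_asr", "clean") are assumptions; adjust to the
# mirror you use. Casting the audio column resamples to a common 16 kHz rate.
from datasets import Audio, load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = next(iter(ds))
print(sample["text"])
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
```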

### E. Full Multimodal / Multi-domain (Text–Image–Table–Audio and more)

- MMMU — Massive Multi-discipline Multimodal Understanding benchmark. https://mmmu-benchmark.github.io/
- MMBench / MME / LVLM-eHub — Comprehensive LVLM evaluation suites. https://mmbench.opencompass.org.cn/
- EgoSchema / Ego4D (video + audio + text) — Egocentric multi-sensor datasets. https://ego4d-data.org/
- Multimodal C4 (MMC4) — Web-scale corpus of images interleaved with text. https://github.com/allenai/mmc4
- WebQA / MultimodalQA — QA over web images and text. https://github.com/omni-us/research-multimodalqa
- Chart/Document suites: DocLayNet, PubLayNet, DocVQA series. https://github.com/ibm-aur-nlp/PubLayNet
- ArXivDoc / ChartX / SynthChart — Synthetic + real document/chart sets. https://github.com/vis-nlp/ChartX

### F. Safety, Bias, and Accessibility-focused Sets

- Hateful Memes — Multimodal bias/toxicity benchmark. https://github.com/facebookresearch/mmf/tree/main/projects/hateful_memes
- ImageNet-A/O/R — Robustness variants of ImageNet. https://github.com/hendrycks/imagenet-r
- VizWiz (also listed under A) — Accessibility-oriented images and questions. https://vizwiz.org/
- MS MARCO (multimodal passages via documents) + OCR corpora — Retrieval grounding. https://microsoft.github.io/msmarco/

### G. Licensing and Usage Notes

- Always check each dataset’s license and terms of use; some require access requests or restrict commercial use.
- Maintain separate manifests with source, license, checksum, and intended use. Prefer mirrored, deduplicated shards with exact provenance. A sketch of one manifest entry follows.
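
The manifest schema is not fixed yet. As a minimal sketch, with file layout, field names, and the checksum helper all being assumptions that simply mirror the source/license/checksum/intended-use fields named above, one entry per line of `datasets/manifest/*.jsonl` could be written like this:

```python
# Minimal sketch: append one manifest entry per shard to datasets/manifest/*.jsonl.
# Field names and paths are assumptions; they mirror the notes above
# (source, license, checksum, intended use, provenance).
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large shards never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

shard = Path("shards/coco2017-train-0000.tar")  # hypothetical local/mirrored shard
entry = {
    "dataset": "coco_captions_2017",
    "source_url": "https://cocodataset.org/#home",
    "license": "<copy the upstream license string verbatim>",
    "shard": str(shard),
    "sha256": sha256_of(shard) if shard.exists() else None,
    "intended_use": "captioning pretraining",
    "retrieved": "2025-01-01",  # provenance: when this copy was taken
}

# Append one JSON object per line to the category manifest.
manifest = Path("datasets/manifest/text_image.jsonl")
manifest.parent.mkdir(parents=True, exist_ok=True)
with manifest.open("a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```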

---

## Call for Collaboration: Build UMA with Us

We’re assembling an open team. If you’re passionate about agentic multimodal AI, join us.

Roles we’re seeking (volunteer or sponsored collaborations):

- Research Scientists: Multimodal learning, alignment, grounding, evaluation.
- Research Engineers: Training pipelines, distributed systems, retrieval, tool-use interfaces.
- Data Scientists / Data Engineers: Dataset curation, cleaning, deduplication, data governance.
- Domain Experts: Finance, healthcare, education, accessibility, scientific communication.
- Accessibility Specialists: Inclusive design, alt-text/sonification, screen-reader workflows, disability advocacy.
- MLOps/Infra: Dataset storage, versioning, scalable training and evaluation infrastructure (HF Datasets, WebDataset, Parquet, Arrow).
- Community & Documentation: Tutorials, examples, benchmark harnesses, governance.

How to get involved now:

- Open a Discussion with your background and interests: https://huggingface.co/amalsp/Universal-Multimodal-Agent/discussions
- Propose datasets or contribute manifests via PRs (add to datasets/manifest/*.jsonl)
- Share domain-specific tasks and evaluation rubrics
- Star and watch the repo for updates

Initial roadmap for data:

- Phase 1: Curate public datasets and licenses; build manifests and downloaders
- Phase 2: Unified preprocessing (image, OCR, tables, audio), deduplication, quality filters (a minimal dedup sketch follows this list)
- Phase 3: Balanced training mixtures + evaluation suites (MMMU / MMBench / DocVQA / ASR)
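
For Phase 2, exact-duplicate removal can start with plain content hashing before moving to perceptual or embedding-based near-duplicate detection. The helper and field names below are illustrative assumptions, not a committed design.

```python
# Minimal exact-dedup sketch for Phase 2: drop records whose chosen field
# hashes to something already seen. Field names ("image", "text") are assumptions
# and vary per source; near-duplicate detection (pHash, MinHash, embeddings) comes later.
import hashlib
from typing import Iterable, Iterator

def dedup_exact(records: Iterable[dict], key: str = "image") -> Iterator[dict]:
    """Yield records whose `key` field has not been seen before (byte-identical)."""
    seen: set[str] = set()
    for rec in records:
        value = rec[key]
        payload = value if isinstance(value, bytes) else str(value).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield rec

# Tiny usage example on toy records; the repeated caption is dropped.
toy = [
    {"text": "a dog on a beach"},
    {"text": "a chart of quarterly revenue"},
    {"text": "a dog on a beach"},
]
print([r["text"] for r in dedup_exact(toy, key="text")])
```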

Ethics & Safety:

- Respect dataset licenses, privacy, and consent. Implement filter lists and red-teaming sets.
- Document known biases and limitations; enable opt-out mechanisms where applicable.

Contributors will be acknowledged in the README and in a future preprint.

## Original Project Overview

[Existing content retained below]