	Vision-Language Foundation
	==========================

	Humans naturally process information through multiple senses, such as sight and sound. Similarly, multimodal learning aims to build models that handle several data types, such as images, text, and audio. A growing number of models combine vision and language, such as OpenAI's CLIP. These models excel at tasks like aligning image and text features, image captioning, and visual question answering. Their ability to generalize to new tasks without task-specific training makes them useful in many practical applications. For detailed support information, refer to the `NeMo Framework User Guide for Multimodal Models <https://docs.nvidia.com/nemo-framework/user-guide/latest/multimodalmodels/index.html>`_.

	.. toctree::
	   :maxdepth: 1

	   datasets
	   configs
	   checkpoint
	   clip