
---
title: Capstone
emoji: 🐨
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: mit
short_description: A multimodal LLM
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


### Hugging Face Gradio App

`app.py` is a multimodal AI application that accepts image, audio, and text inputs. It builds on three pre-trained models:

- CLIP, which handles the vision side by encoding images
- Phi-2, which generates the text response
- WhisperX, which transcribes audio

The script sets up the tokenizers and processors that prepare each kind of input, and defines a custom residual block, `SimpleResBlock`, that transforms embeddings to make learning more stable. After loading the pretrained and fine-tuned weights for the projection and residual layers, it defines `model_generate_ans`, which processes the inputs from each modality, combines their embeddings, and generates the response sequentially.
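The sketch below illustrates how a projection layer, `SimpleResBlock`, embedding concatenation, and sequential generation could fit together. It is a minimal sketch under stated assumptions, not the code in `app.py`: `SimpleResBlock` is assumed to follow the pre-norm residual MLP pattern that LLaVA's projector uses under the same name (LayerNorm, then `x + MLP(x)`), the projection is assumed to be a single `nn.Linear` from the CLIP feature size to the language model's hidden size, and decoding is assumed to be greedy. `ImageProjector`, `build_inputs_embeds`, `greedy_generate`, and all dimensions are hypothetical.

```python
import torch
from torch import nn


class SimpleResBlock(nn.Module):
    """Pre-norm residual MLP (assumed LLaVA-style): y = norm(x) + MLP(norm(x))."""

    def __init__(self, channels: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(channels)
        self.proj = nn.Sequential(
            nn.Linear(channels, channels),
            nn.GELU(),
            nn.Linear(channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pre_norm(x)
        # The skip connection keeps the normalized input; the MLP only learns a correction.
        return x + self.proj(x)


class ImageProjector(nn.Module):
    """Hypothetical projection: CLIP feature size -> language-model hidden size."""

    def __init__(self, clip_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(clip_dim, lm_dim)
        self.res = SimpleResBlock(lm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, clip_dim) -> (batch, num_patches, lm_dim)
        return self.res(self.proj(image_feats))


def build_inputs_embeds(image_embeds: torch.Tensor, text_ids: torch.Tensor,
                        embed_tokens: nn.Module) -> torch.Tensor:
    """Place projected image embeddings before the token embeddings (sequence axis)."""
    text_embeds = embed_tokens(text_ids)  # (batch, seq_len, lm_dim)
    return torch.cat([image_embeds, text_embeds], dim=1)


@torch.no_grad()
def greedy_generate(lm_logits, embed_tokens, inputs_embeds, max_new_tokens, eos_id):
    """Greedy decoding, one token per step; assumes batch size 1.

    lm_logits maps (batch, seq, lm_dim) embeddings to (batch, seq, vocab) logits.
    """
    generated = []
    for _ in range(max_new_tokens):
        logits = lm_logits(inputs_embeds)
        next_id = logits[:, -1, :].argmax(dim=-1)  # most likely next token, shape (1,)
        if next_id.item() == eos_id:
            break
        generated.append(next_id.item())
        # Feed the chosen token back in as an embedding for the next step.
        next_embed = embed_tokens(next_id).unsqueeze(1)  # (1, 1, lm_dim)
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)
    return generated
```

With a Hugging Face causal LM such as Phi-2, `lm_logits` could be `lambda e: model(inputs_embeds=e).logits` and `embed_tokens` could be `model.get_input_embeddings()`, with `lm_dim` set to Phi-2's hidden size (2560) and `clip_dim` set by whichever CLIP variant is used. This loop recomputes the whole sequence at every step instead of using a key/value cache, which keeps it short but makes it slow; the actual app may decode differently.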
Along the way, the model extracts image embeddings, transcribes and embeds audio, and tokenizes text before predicting a response. The app wraps this in a Gradio web interface where users can upload an image, record or upload audio, and type a text query, then read the generated answer in the browser. The result is a single interface for questions that mix images, speech, and text.