---
title: Capstone
emoji: 🐨
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: mit
short_description: A multimodal LLM
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

### Hugging Face Gradio App

The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models: CLIP for vision, Phi-2 for text generation, and WhisperX for audio transcription. The script sets up the tokenizers and processors needed to handle each input type and defines a custom residual block (`SimpleResBlock`) that transforms embeddings for more stable learning. After loading pretrained and fine-tuned weights for the projection and residual layers, it implements `model_generate_ans`, which processes the inputs from the different modalities, combines their embeddings, and generates a response token by token. The model handles image embedding extraction, audio transcription and embedding, and text tokenization to predict responses. The app exposes a Gradio interface where users can upload images, record or upload audio, and submit text queries, receiving multimodal answers through a single web interface.
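As a rough illustration of the residual-block idea, here is a minimal PyTorch sketch of what a block like `SimpleResBlock` could look like. The pre-norm layout, GELU activation, and layer sizes are assumptions, not the app's actual definition:

```python
import torch
import torch.nn as nn


class SimpleResBlock(nn.Module):
    """Residual MLP block: normalize, project, and add a skip connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection (x + ...) is what stabilizes learning:
        # the block only has to model a small correction to its input.
        return x + self.proj(self.norm(x))
```

The skip connection lets the fine-tuned projection start close to an identity mapping, which tends to make training on top of frozen backbone embeddings more stable.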
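The embedding-combination step can be pictured as concatenating the per-modality embeddings into one sequence before handing it to the language model. The helper below is a hypothetical sketch (the function name and the assumption that all modalities are already projected to the same dimension are ours, not taken from app.py):

```python
import torch


def combine_embeddings(image_emb: torch.Tensor,
                       audio_emb: torch.Tensor,
                       text_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate per-modality embeddings along the sequence axis.

    Each input has shape (seq_len_i, dim); the output has shape
    (seq_len_img + seq_len_audio + seq_len_text, dim) and is the
    sequence the language model conditions on when generating.
    """
    return torch.cat([image_emb, audio_emb, text_emb], dim=0)
```

After this concatenation, generation proceeds as ordinary autoregressive decoding over the combined sequence.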
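The Gradio front end described above can be sketched roughly as follows. The `answer` function here is a stand-in for the real `model_generate_ans` pipeline, and the component choices (PIL images, filepath audio) are assumptions:

```python
import gradio as gr


def answer(image, audio, question):
    # Placeholder: the real app would run model_generate_ans on the
    # three inputs and return the generated text.
    return (f"image provided: {image is not None}, "
            f"audio provided: {audio is not None}, "
            f"question: {question!r}")


demo = gr.Interface(
    fn=answer,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Audio(type="filepath", label="Audio (record or upload)"),
        gr.Textbox(label="Text query"),
    ],
    outputs=gr.Textbox(label="Answer"),
    title="Capstone: multimodal Q&A",
)

if __name__ == "__main__":
    demo.launch()
```

Any of the three inputs can be left empty; the handler receives `None` for missing image or audio, so the pipeline can fall back to text-only answering.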