---
title: Multimodal Coherence AI
emoji: 🎨
colorFrom: purple
colorTo: pink
sdk: streamlit
sdk_version: "1.41.0"
app_file: app.py
pinned: false
license: mit
short_description: Coherent text + image + audio with MSCI
---
# Multimodal Coherence AI

Generate semantically coherent **text + image + audio** bundles and evaluate
cross-modal alignment using the **Multimodal Semantic Coherence Index (MSCI)**.
## How it works

1. **Text**: generated via the HF Inference API
2. **Image**: retrieved from a curated index using CLIP (ViT-B/32) embeddings
3. **Audio**: retrieved from a curated index using CLAP (HTSAT-unfused) embeddings
4. **MSCI**: computed as `0.45 * cos_sim(text, image) + 0.45 * cos_sim(text, audio)`
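The MSCI formula above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names and toy vectors are placeholders, and in practice the text is embedded separately with CLIP (for the image pair) and CLAP (for the audio pair), since the two models live in different embedding spaces.

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity between two embedding vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def msci(clip_text_emb, clip_image_emb, clap_text_emb, clap_audio_emb,
         w_image=0.45, w_audio=0.45):
    # MSCI as defined in the list above: a weighted sum of the
    # text-image and text-audio cosine similarities (weights 0.45 each,
    # as stated in the README).
    return (w_image * cos_sim(clip_text_emb, clip_image_emb)
            + w_audio * cos_sim(clap_text_emb, clap_audio_emb))

# Toy vectors standing in for CLIP/CLAP embeddings (illustrative only).
t_img, img = [1.0, 0.0], [1.0, 0.0]   # perfectly aligned -> cos_sim = 1
t_aud, aud = [1.0, 0.0], [0.0, 1.0]   # orthogonal -> cos_sim = 0
print(msci(t_img, img, t_aud, aud))   # 0.45 * 1 + 0.45 * 0 = 0.45
```

Note that, as given, the two weights sum to 0.9 rather than 1.0, so a fully aligned bundle scores 0.9 under this formula.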
## Research

This demo accompanies a study evaluating multimodal semantic coherence across
three research questions:

- **RQ1**: Is MSCI sensitive to controlled semantic perturbations? (Supported, d > 2.0)
- **RQ2**: Does structured planning improve cross-modal alignment? (Not supported)
- **RQ3**: Does MSCI correlate with human coherence judgments? (Supported, rho = 0.379)
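For reference, the effect size reported for RQ1 is a Cohen's d computed between MSCI scores of coherent and perturbed bundles. A minimal sketch of that computation follows; the toy score lists are illustrative and not the study's data:

```python
import statistics

def cohens_d(coherent, perturbed):
    # Cohen's d between two groups of MSCI scores, using the
    # pooled-standard-deviation formulation.
    n1, n2 = len(coherent), len(perturbed)
    m1, m2 = statistics.fmean(coherent), statistics.fmean(perturbed)
    v1, v2 = statistics.variance(coherent), statistics.variance(perturbed)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

# Toy MSCI scores for intact vs. semantically perturbed bundles.
coherent  = [0.71, 0.68, 0.74, 0.70]
perturbed = [0.42, 0.39, 0.45, 0.41]
print(cohens_d(coherent, perturbed))  # large positive effect size
```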