Faham committed · Commit 1d798d1 · 1 Parent(s): 93e56c4

UPDATE: add max fusion parts

Browse files
- .gitignore +1 -0
- README.md +17 -0
- VIDEO_PROCESSING_GUIDE.md +495 -0
- app.py +538 -0
- pyproject.toml +4 -0
- requirements.txt +0 -0
- uv.lock +0 -0
.gitignore
CHANGED
@@ -40,6 +40,7 @@ venv/
 env/
 ENV/
 .venv/
+.venv2/
 .env/

 # IDE
README.md
CHANGED
@@ -52,6 +52,18 @@ This project implements a **fused sentiment analysis system** that combines pred
 - **Capability**: Provides comprehensive sentiment analysis across modalities
 - **Status**: ✅ Fully integrated and ready to use

+### 5. 🎬 Max Fusion
+
+- **Approach**: Video-based comprehensive sentiment analysis
+- **Capability**: Analyzes 5-second videos by extracting frames, audio, and transcribing speech
+- **Features**:
+  - Video recording or file upload (MP4, AVI, MOV, MKV, WMV, FLV)
+  - Automatic frame extraction for vision analysis
+  - Audio extraction for vocal sentiment analysis
+  - Speech-to-text transcription for text sentiment analysis
+  - Combined results from all three modalities
+- **Status**: ✅ Fully integrated and ready to use
+
 ## Project Structure

 ```
@@ -134,6 +146,7 @@ sentiment-fused/
 - 🎵 **Audio Sentiment**: Analyze audio files or record with microphone
 - 🖼️ **Vision Sentiment**: Analyze images or capture with camera
 - 🔀 **Fused Model**: Combine all three models
+- 🎬 **Max Fusion**: Video-based comprehensive analysis

 ## Model Development

@@ -238,6 +251,8 @@ Key libraries used:
 - **Librosa**: Audio processing
 - **TextBlob**: Natural language processing
 - **Gdown**: Google Drive file downloader
+- **MoviePy**: Video processing and audio extraction
+- **SpeechRecognition**: Audio transcription

 ## What This Project Demonstrates

@@ -247,5 +262,7 @@ Key libraries used:
 4. **Smart Preprocessing**: Automatic format conversion and optimization
 5. **Modern Web UI**: Professional Streamlit application with custom styling
 6. **Production Ready**: Docker containerization and deployment
+7. **Video Analysis**: Comprehensive video processing with multi-modal extraction
+8. **Speech Recognition**: Audio-to-text transcription for enhanced analysis

 This project serves as a comprehensive example of building production-ready multimodal AI applications with modern Python tools and frameworks.
VIDEO_PROCESSING_GUIDE.md
ADDED
@@ -0,0 +1,495 @@
# 🎬 Video Processing Pipeline Guide for Multimodal Analysis

## 🎯 **Objective & Scope**

**Goal**: Create a Streamlit app that uploads a video and extracts its core components for multimodal analysis:

- **Visual Frames**: Representative images from the video
- **Audio Track**: Extracted audio in WAV format
- **Transcribed Text**: Speech converted to text

**Scope**: This guide covers the complete extraction and conversion pipeline. Machine learning models and sentiment analysis are excluded; the focus is purely on data processing and UI components.

---

## 📚 **Step 1: Essential Libraries & Setup**

### **Required Python Libraries**

```bash
pip install streamlit opencv-python-headless moviepy SpeechRecognition
```

### **Requirements.txt**

```txt
streamlit
opencv-python-headless
moviepy
SpeechRecognition
```

### **FFmpeg Dependency**

- **MoviePy** requires FFmpeg for video processing (a quick availability check is sketched below)
- **Windows**: Download from https://ffmpeg.org/download.html
- **macOS**: `brew install ffmpeg`
- **Linux**: `sudo apt install ffmpeg`
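
Before running the app, it can save debugging time to verify that FFmpeg is actually reachable. A minimal sketch, assuming the `ffmpeg` binary should be on `PATH`:

```python
import shutil
import subprocess

# shutil.which returns None when the binary is not on PATH
if shutil.which("ffmpeg") is None:
    print("FFmpeg not found; install it before using MoviePy")
else:
    # The first line of `ffmpeg -version` names the installed build
    info = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    print(info.stdout.splitlines()[0])
```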

---

## 🖥️ **Step 2: Creating the Streamlit Interface**

### **Basic UI Setup**

```python
import streamlit as st

st.set_page_config(
    page_title="Video Processing Pipeline",
    page_icon="🎬",
    layout="wide"
)

st.title("🎬 Video Processing Pipeline")
st.markdown("Upload a video to extract frames, audio, and text for analysis")

# File uploader
uploaded_video = st.file_uploader(
    "Choose a video file",
    type=["mp4", "avi", "mov", "mkv", "wmv", "flv"],
    help="Supported formats: MP4, AVI, MOV, MKV, WMV, FLV"
)

# Process button (process_video is defined in Step 4)
if st.button("🚀 Process Video", type="primary", use_container_width=True):
    if uploaded_video:
        process_video(uploaded_video)
    else:
        st.warning("Please upload a video file first")
```
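
To try the interface on its own, save the snippet as a script (the filename `app.py` here is just an example) and launch it with `streamlit run app.py`; until Step 4 is in place, the `process_video` call can be stubbed out with a `st.write` placeholder.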

---

## ⚙️ **Step 3: The Core Extraction Logic**

### **3.1 Video-to-Frames Extraction**

```python
def extract_frames_from_video(video_file, max_frames=5):
    """
    Extract representative frames from video using OpenCV

    Args:
        video_file: Video file object or path
        max_frames: Maximum frames to extract (default: 5)

    Returns:
        List of PIL Image objects
    """
    try:
        import os
        import tempfile

        import cv2
        from PIL import Image

        # Save video bytes to temporary file
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_file:
            if hasattr(video_file, "getvalue"):
                tmp_file.write(video_file.getvalue())
            else:
                tmp_file.write(video_file)
            tmp_file_path = tmp_file.name

        try:
            # Open video with OpenCV
            cap = cv2.VideoCapture(tmp_file_path)

            if not cap.isOpened():
                st.error("Could not open video file")
                return []

            # Get video properties
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            fps = cap.get(cv2.CAP_PROP_FPS)
            duration = total_frames / fps if fps > 0 else 0

            st.info(f"📹 Video: {total_frames} frames, {fps:.1f} FPS, {duration:.1f}s duration")

            # Extract frames at strategic intervals
            frames = []
            if total_frames > 0:
                # Select frames: start, 25%, 50%, 75%, end
                frame_indices = [
                    0,
                    int(total_frames * 0.25),
                    int(total_frames * 0.5),
                    int(total_frames * 0.75),
                    total_frames - 1
                ]
                frame_indices = list(set(frame_indices))  # Remove duplicates
                frame_indices.sort()

                for frame_idx in frame_indices:
                    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
                    ret, frame = cap.read()
                    if ret:
                        # Convert BGR to RGB
                        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                        # Convert to PIL Image
                        pil_image = Image.fromarray(frame_rgb)
                        frames.append(pil_image)

            cap.release()
            return frames

        finally:
            # Clean up temporary file
            try:
                os.unlink(tmp_file_path)
            except (OSError, PermissionError):
                pass

    except ImportError:
        st.error("OpenCV not installed. Please install it with: pip install opencv-python")
        return []
    except Exception as e:
        st.error(f"Error extracting frames: {str(e)}")
        return []
```

**How it works:**

1. **Temporary File**: Saves the video bytes to a temporary MP4 file
2. **OpenCV Capture**: Opens the video and reads its properties (frame count, FPS, duration)
3. **Strategic Sampling**: Selects frames at key points (start, 25%, 50%, 75%, end); an evenly spaced variant is sketched below
4. **Format Conversion**: Converts BGR to RGB and creates PIL Image objects
5. **Cleanup**: Removes the temporary file safely
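
Note that `max_frames` only caps the five fixed anchor points above. If you want the parameter to actually control how many frames come back, the index selection can be swapped for an evenly spaced sweep. A minimal sketch, intended as a drop-in for the `frame_indices` computation (not part of the original code):

```python
import numpy as np

def evenly_spaced_indices(total_frames: int, max_frames: int) -> list:
    """Pick up to max_frames frame indices spread evenly across the video."""
    if total_frames <= 0:
        return []
    count = min(max_frames, total_frames)
    # linspace includes both endpoints, so the start and end frames are kept
    return sorted(set(np.linspace(0, total_frames - 1, count, dtype=int).tolist()))

print(evenly_spaced_indices(150, 5))  # [0, 37, 74, 111, 149]
```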

---

### **3.2 Video-to-Audio Conversion**

```python
def extract_audio_from_video(video_file):
    """
    Extract audio track from video using MoviePy

    Args:
        video_file: Video file object or path

    Returns:
        Audio bytes in WAV format
    """
    try:
        import os
        import tempfile
        import time

        from moviepy import VideoFileClip

        # Save video bytes to temporary file
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_file:
            if hasattr(video_file, "getvalue"):
                tmp_file.write(video_file.getvalue())
            else:
                tmp_file.write(video_file)
            tmp_file_path = tmp_file.name

        try:
            # Extract audio using MoviePy
            video = VideoFileClip(tmp_file_path)
            audio = video.audio

            if audio is None:
                st.warning("No audio track found in video")
                return None

            # Reserve a temporary WAV file path
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as audio_file:
                audio_path = audio_file.name

            # Export audio as WAV (logger=None silences MoviePy's progress output)
            audio.write_audiofile(audio_path, logger=None)

            # Read the audio file and return bytes
            with open(audio_path, "rb") as f:
                audio_bytes = f.read()

            # Clean up temporary audio file
            try:
                os.unlink(audio_path)
            except (OSError, PermissionError):
                pass

            return audio_bytes

        finally:
            # Clean up temporary video file
            try:
                # Close video and audio objects first
                if "video" in locals():
                    video.close()
                if "audio" in locals() and audio:
                    audio.close()

                # Give the OS a moment to release file handles (helps on Windows)
                time.sleep(0.1)

                os.unlink(tmp_file_path)
            except (OSError, PermissionError):
                pass

    except ImportError:
        st.error("MoviePy not installed. Please install it with: pip install moviepy")
        return None
    except Exception as e:
        st.error(f"Error extracting audio: {str(e)}")
        return None
```

**How it works:**

1. **Temporary File**: Creates a temporary MP4 file from the video bytes
2. **MoviePy Processing**: Uses `VideoFileClip` to access the audio track
3. **WAV Export**: Converts the audio to WAV format
4. **Bytes Return**: Reads the WAV file back and returns it as bytes
5. **Resource Management**: Properly closes video/audio objects and cleans up files
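
One portability note: the import path changed between MoviePy major versions, and `write_audiofile` dropped its `verbose=` argument in 2.x. A version-tolerant import is a small sketch; pinning one MoviePy version in production is the cleaner choice:

```python
try:
    # MoviePy >= 2.0 exposes classes at the package root
    from moviepy import VideoFileClip
except ImportError:
    # MoviePy 1.x used the moviepy.editor module
    from moviepy.editor import VideoFileClip
```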

---

### **3.3 Audio-to-Text Transcription**

```python
def transcribe_audio(audio_bytes):
    """
    Transcribe audio to text using SpeechRecognition

    Args:
        audio_bytes: Audio bytes in WAV format

    Returns:
        Transcribed text string
    """
    if audio_bytes is None:
        return ""

    try:
        import os
        import tempfile

        import speech_recognition as sr

        # Save audio bytes to temporary file
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_file:
            tmp_file.write(audio_bytes)
            tmp_file_path = tmp_file.name

        try:
            # Initialize recognizer
            recognizer = sr.Recognizer()

            # Load audio file
            with sr.AudioFile(tmp_file_path) as source:
                # Read audio data
                audio_data = recognizer.record(source)

            # Transcribe using Google Speech Recognition
            try:
                text = recognizer.recognize_google(audio_data)
                return text
            except sr.UnknownValueError:
                st.warning("Speech could not be understood")
                return ""
            except sr.RequestError as e:
                st.error(f"Could not request results from speech recognition service: {e}")
                return ""

        finally:
            # Clean up temporary file
            try:
                os.unlink(tmp_file_path)
            except (OSError, PermissionError):
                pass

    except ImportError:
        st.error("SpeechRecognition not installed. Please install it with: pip install SpeechRecognition")
        return ""
    except Exception as e:
        st.error(f"Error transcribing audio: {str(e)}")
        return ""
```

**How it works:**

1. **Temporary File**: Saves the audio bytes to a temporary WAV file
2. **Speech Recognition**: Uses Google's free web speech recognition service
3. **Audio Processing**: Records and processes the audio data
4. **Text Return**: Returns the transcribed text, or an empty string on failure
5. **Cleanup**: Removes the temporary file safely
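
`recognize_google` also accepts a few useful keyword arguments. A small sketch of the two most common ones, reusing `recognizer` and `audio_data` from the function above (the values shown are examples, not required defaults):

```python
# Transcribe non-English speech by passing a BCP-47 language tag
text = recognizer.recognize_google(audio_data, language="de-DE")

# show_all=True returns the raw response with alternatives and confidences
# instead of just the single best transcript
raw = recognizer.recognize_google(audio_data, show_all=True)
```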

---

## 🔄 **Step 4: Complete Processing Pipeline**

### **Integrated Processing Function**

```python
def process_video(uploaded_video):
    """Complete video processing pipeline"""

    st.subheader("🎬 Video Processing Pipeline")
    st.info("🔄 Processing uploaded video file...")

    # 1. Extract frames
    st.markdown("**1. 🎥 Frame Extraction**")
    frames = extract_frames_from_video(uploaded_video, max_frames=5)

    if frames:
        st.success(f"✅ Extracted {len(frames)} representative frames")

        # Display extracted frames
        cols = st.columns(len(frames))
        for i, frame in enumerate(frames):
            with cols[i]:
                st.image(frame, caption=f"Frame {i+1}", use_container_width=True)
    else:
        st.warning("⚠️ Could not extract frames from video")
        frames = []

    # 2. Extract audio
    st.markdown("**2. 🎵 Audio Extraction**")
    audio_bytes = extract_audio_from_video(uploaded_video)

    if audio_bytes:
        st.success("✅ Audio extracted successfully")
        st.audio(audio_bytes, format="audio/wav")
    else:
        st.warning("⚠️ Could not extract audio from video")
        audio_bytes = None

    # 3. Transcribe audio
    st.markdown("**3. 📝 Audio Transcription**")
    if audio_bytes:
        transcribed_text = transcribe_audio(audio_bytes)
        if transcribed_text:
            st.success("✅ Audio transcribed successfully")
            st.markdown(f'**Transcribed Text:** "{transcribed_text}"')
        else:
            st.warning("⚠️ Could not transcribe audio")
            transcribed_text = ""
    else:
        transcribed_text = ""
        st.info("ℹ️ No audio available for transcription")

    # Store results for later use
    st.session_state.processed_frames = frames
    st.session_state.processed_audio = audio_bytes
    st.session_state.transcribed_text = transcribed_text

    st.success("🎉 Video processing completed! All components extracted successfully.")
```

---

## 🎯 **Key Benefits of This Approach**

### **1. Real Video Processing**

- ✅ **Actual Audio**: Extracts real audio from uploaded videos
- ✅ **Representative Frames**: Strategic frame selection (not just sequential)
- ✅ **Real Transcription**: Converts actual speech to text

### **2. Robust Error Handling**

- ✅ **File Access**: Handles temporary file conflicts gracefully
- ✅ **Resource Management**: Properly closes video/audio objects
- ✅ **Cleanup**: Safe temporary file removal

### **3. User Experience**

- ✅ **Visual Feedback**: Shows extracted frames, an audio player, and the transcript
- ✅ **Progress Tracking**: Clear step-by-step processing display
- ✅ **Error Messages**: Informative feedback for troubleshooting

### **4. Scalability**

- ✅ **Modular Design**: Each extraction function is independent
- ✅ **Reusable Components**: Functions can be used in other parts of the app
- ✅ **Easy Maintenance**: Clear separation of concerns

---

## 🚀 **Usage Example**

```python
# Complete working example (assumes process_video from Step 4 is in scope)
import streamlit as st

# Setup page
st.set_page_config(page_title="Video Processor", layout="wide")
st.title("🎬 Video Processing Pipeline")

# File upload
uploaded_video = st.file_uploader("Choose video file", type=["mp4", "avi", "mov"])

# Process button
if st.button("🚀 Process Video", type="primary"):
    if uploaded_video:
        process_video(uploaded_video)
    else:
        st.warning("Please upload a video first")

# Display results
if "processed_frames" in st.session_state:
    st.subheader("📊 Processing Results")
    st.write(f"Frames: {len(st.session_state.processed_frames)}")
    st.write(f"Audio: {'✅' if st.session_state.processed_audio else '❌'}")
    st.write(f"Text: {'✅' if st.session_state.transcribed_text else '❌'}")
```

---

## 🔧 **Troubleshooting Common Issues**

### **1. FFmpeg Not Found**

```bash
# Windows: add FFmpeg to PATH
# macOS: brew install ffmpeg
# Linux: sudo apt install ffmpeg
```

### **2. OpenCV Import Error**

```bash
pip install opencv-python-headless
```

### **3. MoviePy Audio Issues**

```bash
pip install moviepy --upgrade
# Ensure FFmpeg is installed
```

### **4. Speech Recognition Errors**

```bash
pip install SpeechRecognition
# Check internet connection for the Google service
```
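
If the machine has no internet access, the Google web recognizer will always raise `RequestError`. SpeechRecognition also ships an offline engine backed by CMU Sphinx; a minimal sketch (requires `pip install pocketsphinx`, and accuracy is noticeably lower than the online service):

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:  # the path is an example
    audio_data = recognizer.record(source)

try:
    # Offline transcription via PocketSphinx (English models bundled)
    print(recognizer.recognize_sphinx(audio_data))
except sr.UnknownValueError:
    print("Speech could not be understood")
```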

---

## 📝 **Summary**

This guide provides a complete video processing pipeline that:

1. **🎥 Extracts Frames**: Strategic sampling of representative video frames
2. **🎵 Extracts Audio**: Converts the video's audio track to WAV format
3. **📝 Transcribes Speech**: Converts audio to searchable text
4. **🖥️ Provides UI**: Clean Streamlit interface with progress tracking
5. **🔧 Handles Errors**: Robust error handling and resource management

The result is a production-ready video processing system that extracts all necessary components for multimodal analysis without any machine learning dependencies. Each component is extracted independently and can be used for further processing or analysis as needed.

**Next Steps**: Use the extracted frames, audio, and text with your preferred analysis models, or export them for external processing (one possible export helper is sketched below).
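
A minimal export helper matching that next step; the directory layout and file names here are arbitrary choices, not part of the pipeline above:

```python
from pathlib import Path

def export_components(frames, audio_bytes, text, out_dir="exported"):
    """Write extracted frames, audio, and transcript to disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, frame in enumerate(frames):
        # Frames are PIL Images, so Image.save handles the PNG encoding
        frame.save(out / f"frame_{i:02d}.png")
    if audio_bytes:
        (out / "audio.wav").write_bytes(audio_bytes)
    (out / "transcript.txt").write_text(text or "", encoding="utf-8")
```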
app.py
CHANGED
@@ -6,6 +6,7 @@ import torch
 import torch.nn as nn
 from torchvision import transforms, models
 import torch.nn.functional as F
+import cv2

 # Import the Google Drive model manager
 from simple_model_manager import SimpleModelManager
@@ -520,6 +521,223 @@ def predict_fused_sentiment(text=None, audio_bytes=None, image=None):
     return final_sentiment, avg_confidence


+def extract_frames_from_video(video_file, max_frames=10):
+    """
+    Extract frames from video file for vision sentiment analysis
+
+    Args:
+        video_file: StreamlitUploadedFile or bytes
+        max_frames: Maximum number of frames to extract
+
+    Returns:
+        List of PIL Image objects
+    """
+    try:
+        import cv2
+        import numpy as np
+        import tempfile
+
+        # Save video bytes to temporary file
+        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_file:
+            if hasattr(video_file, "getvalue"):
+                tmp_file.write(video_file.getvalue())
+            else:
+                tmp_file.write(video_file)
+            tmp_file_path = tmp_file.name
+
+        try:
+            # Open video with OpenCV
+            cap = cv2.VideoCapture(tmp_file_path)
+
+            if not cap.isOpened():
+                st.error("Could not open video file")
+                return []
+
+            frames = []
+            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+            fps = cap.get(cv2.CAP_PROP_FPS)
+            duration = total_frames / fps if fps > 0 else 0
+
+            st.info(
+                f"📹 Video: {total_frames} frames, {fps:.1f} FPS, {duration:.1f}s duration"
+            )
+
+            # Extract frames at strategic intervals
+            if total_frames > 0:
+                # Select frames: start, 25%, 50%, 75%, end
+                frame_indices = [
+                    0,
+                    int(total_frames * 0.25),
+                    int(total_frames * 0.5),
+                    int(total_frames * 0.75),
+                    total_frames - 1,
+                ]
+                frame_indices = list(set(frame_indices))  # Remove duplicates
+                frame_indices.sort()
+
+                for frame_idx in frame_indices:
+                    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
+                    ret, frame = cap.read()
+                    if ret:
+                        # Convert BGR to RGB
+                        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+                        # Convert to PIL Image
+                        pil_image = Image.fromarray(frame_rgb)
+                        frames.append(pil_image)
+
+            cap.release()
+            return frames
+
+        finally:
+            # Clean up temporary file
+            os.unlink(tmp_file_path)
+
+    except ImportError:
+        st.error(
+            "OpenCV not installed. Please install it with: pip install opencv-python"
+        )
+        return []
+    except Exception as e:
+        st.error(f"Error extracting frames: {str(e)}")
+        return []
+
+
+def extract_audio_from_video(video_file):
+    """
+    Extract audio from video file for audio sentiment analysis
+
+    Args:
+        video_file: StreamlitUploadedFile or bytes
+
+    Returns:
+        Audio bytes in WAV format
+    """
+    try:
+        import tempfile
+        from moviepy import VideoFileClip
+
+        # Save video bytes to temporary file
+        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_file:
+            if hasattr(video_file, "getvalue"):
+                tmp_file.write(video_file.getvalue())
+            else:
+                tmp_file.write(video_file)
+            tmp_file_path = tmp_file.name
+
+        try:
+            # Extract audio using moviepy
+            video = VideoFileClip(tmp_file_path)
+            audio = video.audio
+
+            if audio is None:
+                st.warning("No audio track found in video")
+                return None
+
+            # Save audio to temporary WAV file
+            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as audio_file:
+                audio_path = audio_file.name
+
+            # Export audio as WAV
+            audio.write_audiofile(audio_path, logger=None)
+
+            # Read the audio file and return bytes
+            with open(audio_path, "rb") as f:
+                audio_bytes = f.read()
+
+            # Clean up temporary audio file
+            try:
+                os.unlink(audio_path)
+            except (OSError, PermissionError):
+                # File might be in use, skip cleanup
+                pass
+
+            return audio_bytes
+
+        finally:
+            # Clean up temporary video file
+            try:
+                # Close video and audio objects first
+                if "video" in locals():
+                    video.close()
+                if "audio" in locals() and audio:
+                    audio.close()
+
+                # Wait a bit before trying to delete
+                import time
+
+                time.sleep(0.1)
+
+                os.unlink(tmp_file_path)
+            except (OSError, PermissionError):
+                # File might be in use, skip cleanup
+                pass
+
+    except ImportError:
+        st.error("MoviePy not installed. Please install it with: pip install moviepy")
+        return None
+    except Exception as e:
+        st.error(f"Error extracting audio: {str(e)}")
+        return None
+
+
+def transcribe_audio(audio_bytes):
+    """
+    Transcribe audio to text for text sentiment analysis
+
+    Args:
+        audio_bytes: Audio bytes in WAV format
+
+    Returns:
+        Transcribed text string
+    """
+    if audio_bytes is None:
+        return ""
+
+    try:
+        import tempfile
+        import speech_recognition as sr
+
+        # Save audio bytes to temporary file
+        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_file:
+            tmp_file.write(audio_bytes)
+            tmp_file_path = tmp_file.name
+
+        try:
+            # Initialize recognizer
+            recognizer = sr.Recognizer()
+
+            # Load audio file
+            with sr.AudioFile(tmp_file_path) as source:
+                # Read audio data
+                audio_data = recognizer.record(source)
+
+            # Transcribe using Google Speech Recognition
+            try:
+                text = recognizer.recognize_google(audio_data)
+                return text
+            except sr.UnknownValueError:
+                st.warning("Speech could not be understood")
+                return ""
+            except sr.RequestError as e:
+                st.error(
+                    f"Could not request results from speech recognition service: {e}"
+                )
+                return ""
+
+        finally:
+            # Clean up temporary file
+            os.unlink(tmp_file_path)
+
+    except ImportError:
+        st.error(
+            "SpeechRecognition not installed. Please install it with: pip install SpeechRecognition"
+        )
+        return ""
+    except Exception as e:
+        st.error(f"Error transcribing audio: {str(e)}")
+        return ""
+
+
 # Sidebar navigation
 st.sidebar.title("Sentiment Analysis")
 st.sidebar.markdown("---")
@@ -533,6 +751,7 @@ page = st.sidebar.selectbox(
         "Audio Sentiment",
         "Vision Sentiment",
         "Fused Model",
+        "Max Fusion",
     ],
 )

@@ -626,6 +845,23 @@ if page == "Home":
         unsafe_allow_html=True,
     )

+    st.markdown(
+        """
+    <div class="model-card">
+        <h3>🎬 Max Fusion</h3>
+        <p>Ultimate video-based sentiment analysis combining all three modalities</p>
+        <ul>
+            <li>🎥 Record or upload 5-second videos</li>
+            <li>🖼️ Extract frames for vision analysis</li>
+            <li>🎵 Extract audio for vocal sentiment</li>
+            <li>📝 Transcribe audio for text analysis</li>
+            <li>📊 Comprehensive multi-modal results</li>
+        </ul>
+    </div>
+    """,
+        unsafe_allow_html=True,
+    )
+
     st.markdown("---")
     st.markdown(
         """
@@ -1195,6 +1431,308 @@ elif page == "Fused Model":
         "Please provide at least one input (text, audio, or image) for fused analysis."
     )

+# Max Fusion Page
+elif page == "Max Fusion":
+    st.title("Max Fusion - Multi-Modal Sentiment Analysis")
+    st.markdown(
+        """
+    <div class="model-card">
+        <h3>Ultimate Multi-Modal Sentiment Analysis</h3>
+        <p>Take photos with camera or upload videos to get comprehensive sentiment analysis from multiple modalities:</p>
+        <ul>
+            <li>📸 <strong>Vision Analysis:</strong> Camera photos or video frames for facial expression analysis</li>
+            <li>🎵 <strong>Audio Analysis:</strong> Audio files or extracted audio from videos for vocal sentiment</li>
+            <li>📝 <strong>Text Analysis:</strong> Transcribed audio for text sentiment analysis</li>
+        </ul>
+    </div>
+    """,
+        unsafe_allow_html=True,
+    )
+
+    # Video input method selection
+    st.subheader("Video Input")
+    video_input_method = st.radio(
+        "Choose input method:",
+        ["Upload Video File", "Record Video (Coming Soon)"],
+        horizontal=True,
+        index=0,  # Default to upload video
+    )
+
+    if video_input_method == "Record Video (Coming Soon)":
+        # Coming Soon message for video recording
+        st.info("🎥 Video recording feature is coming soon!")
+        st.info("📁 Please use the Upload Video File option for now.")
+
+        # Show a nice coming soon message
+        st.markdown("---")
+        col1, col2, col3 = st.columns([1, 2, 1])
+        with col2:
+            st.markdown(
+                """
+            <div style="text-align: center; padding: 20px; background: linear-gradient(90deg, #667eea 0%, #764ba2 100%); border-radius: 10px; color: white;">
+                <h3>🚧 Coming Soon 🚧</h3>
+                <p>Video recording feature is under development</p>
+                <p>Use Upload Video File for now!</p>
+            </div>
+            """,
+                unsafe_allow_html=True,
+            )
+
+        # Placeholder for future recording functionality
+        st.markdown(
+            """
+        **Future Features:**
+        - Real-time video recording with camera
+        - Audio capture during recording
+        - Automatic frame extraction
+        - Live transcription
+        - WebRTC integration for low-latency streaming
+        """
+        )
+
+        # Skip all the recording logic for now
+        uploaded_video = None
+        video_source = None
+        video_name = None
+        video_file = None
+
+    elif video_input_method == "Upload Video File":
+        # File upload option
+        st.markdown(
+            """
+        <div class="upload-section">
+            <h4>📁 Upload Video File</h4>
+            <p>Upload a video file for comprehensive multimodal analysis.</p>
+            <p><strong>Supported Formats:</strong> MP4, AVI, MOV, MKV, WMV, FLV</p>
+            <p><strong>Recommended:</strong> Videos with clear audio and visual content</p>
+        </div>
+        """,
+            unsafe_allow_html=True,
+        )
+
+        uploaded_video = st.file_uploader(
+            "Choose a video file",
+            type=["mp4", "avi", "mov", "mkv", "wmv", "flv"],
+            help="Supported formats: MP4, AVI, MOV, MKV, WMV, FLV",
+        )
+
+        video_source = "uploaded_file"
+        video_name = uploaded_video.name if uploaded_video else None
+        video_file = uploaded_video
+
+    # Video recording using streamlit-webrtc component - COMING SOON
+
+    if video_file is not None:
+        # Display video or photo
+        if video_source == "camera_photo":
+            # For camera photos, we already displayed the image above
+            st.info(f"Source: Camera Photo | Ready for vision analysis")
+
+            # Add audio upload option for camera photo mode
+            st.subheader("🎵 Audio Input for Analysis")
+            st.info(
+                "Since we're using a photo, please upload an audio file for audio sentiment analysis:"
+            )
+
+            uploaded_audio = st.file_uploader(
+                "Upload audio file for audio analysis:",
+                type=["wav", "mp3", "m4a", "flac"],
+                key="camera_audio",
+                help="Upload an audio file to complement the photo analysis",
+            )
+
+            if uploaded_audio:
+                st.audio(
+                    uploaded_audio, format=f'audio/{uploaded_audio.name.split(".")[-1]}'
+                )
+                st.success("✅ Audio uploaded successfully!")
+                audio_bytes = uploaded_audio.getvalue()
+            else:
+                audio_bytes = None
+                st.warning("⚠️ Please upload an audio file for complete analysis")
+
+        else:
+            # For uploaded videos
+            st.video(video_file)
+            if hasattr(video_file, "getvalue"):
+                file_size = len(video_file.getvalue()) / 1024  # KB
+            else:
+                file_size = len(video_file) / 1024  # KB
+            st.info(f"File: {video_name} | Size: {file_size:.1f} KB")
+            audio_bytes = None  # Will be extracted from video
+
+        # Video Processing Pipeline
+        st.subheader("🎬 Video Processing Pipeline")
+
+        # Initialize variables
+        frames = []
+        audio_bytes = None
+        transcribed_text = ""
+
+        # Process uploaded video
+        if uploaded_video:
+            st.info("🔄 Processing uploaded video file...")
+
+            # Extract frames
+            st.markdown("**1. 🎥 Frame Extraction**")
+            frames = extract_frames_from_video(uploaded_video, max_frames=5)
+
+            if frames:
+                st.success(f"✅ Extracted {len(frames)} representative frames")
+
+                # Display extracted frames
+                cols = st.columns(len(frames))
+                for i, frame in enumerate(frames):
+                    with cols[i]:
+                        st.image(
+                            frame, caption=f"Frame {i+1}", use_container_width=True
+                        )
+            else:
+                st.warning("⚠️ Could not extract frames from video")
+                frames = []
+
+            # Extract audio
+            st.markdown("**2. 🎵 Audio Extraction**")
+            audio_bytes = extract_audio_from_video(uploaded_video)
+
+            if audio_bytes:
+                st.success("✅ Audio extracted successfully")
+                st.audio(audio_bytes, format="audio/wav")
+            else:
+                st.warning("⚠️ Could not extract audio from video")
+                audio_bytes = None
+
+            # Transcribe audio
+            st.markdown("**3. 📝 Audio Transcription**")
+            if audio_bytes:
+                transcribed_text = transcribe_audio(audio_bytes)
+                if transcribed_text:
+                    st.success("✅ Audio transcribed successfully")
+                    st.markdown(f'**Transcribed Text:** "{transcribed_text}"')
+                else:
+                    st.warning("⚠️ Could not transcribe audio")
+                    transcribed_text = ""
+            else:
+                transcribed_text = ""
+                st.info("ℹ️ No audio available for transcription")
+
+        # Analysis button
+        if st.button(
+            "🚀 Run Max Fusion Analysis", type="primary", use_container_width=True
+        ):
+            with st.spinner(
+                "🔄 Processing video and running comprehensive analysis..."
+            ):
+                # Run individual analyses
+                st.subheader("📊 Individual Model Analysis")
+
+                results_data = []
+
+                # Vision analysis (use first frame for uploaded videos)
+                if frames:
+                    st.markdown("**Vision Analysis:**")
+
+                    # For uploaded videos, use first frame
+                    vision_sentiment, vision_conf = predict_vision_sentiment(
+                        frames[0], crop_tightness=0.0
+                    )
+                    results_data.append(
+                        {
+                            "Model": "Vision (ResNet-50)",
+                            "Input": f"Video Frame 1",
+                            "Sentiment": vision_sentiment,
+                            "Confidence": f"{vision_conf:.2f}",
+                        }
+                    )
+                    st.success(
+                        f"Vision: {vision_sentiment} (Confidence: {vision_conf:.2f})"
+                    )
+
+                # Audio analysis
+                if audio_bytes:
+                    st.markdown("**Audio Analysis:**")
+                    audio_sentiment, audio_conf = predict_audio_sentiment(audio_bytes)
+                    results_data.append(
+                        {
+                            "Model": "Audio (Wav2Vec2)",
+                            "Input": f"Video Audio",
+                            "Sentiment": audio_sentiment,
+                            "Confidence": f"{audio_conf:.2f}",
+                        }
+                    )
+                    st.success(
+                        f"Audio: {audio_sentiment} (Confidence: {audio_conf:.2f})"
+                    )
+
+                # Text analysis
+                if transcribed_text:
+                    st.markdown("**Text Analysis:**")
+                    text_sentiment, text_conf = predict_text_sentiment(transcribed_text)
+                    results_data.append(
+                        {
+                            "Model": "Text (TextBlob)",
+                            "Input": f"Transcribed: {transcribed_text[:50]}...",
+                            "Sentiment": text_sentiment,
+                            "Confidence": f"{text_conf:.2f}",
+                        }
+                    )
+                    st.success(f"Text: {text_sentiment} (Confidence: {text_conf:.2f})")
+
+                # Run fused analysis
+                st.subheader("🎯 Max Fusion Results")
+
+                if results_data:
+                    # Display results table
+                    df = pd.DataFrame(results_data)
+                    st.dataframe(df, use_container_width=True)
+
+                    # Calculate fused sentiment
+                    image_for_fusion = frames[0] if frames else None
+                    sentiment, confidence = predict_fused_sentiment(
+                        text=transcribed_text if transcribed_text else None,
+                        audio_bytes=audio_bytes,
+                        image=image_for_fusion,
+                    )
+
+                    # Display final results
+                    col1, col2 = st.columns(2)
+                    with col1:
+                        st.metric("🎯 Final Sentiment", sentiment)
+                    with col2:
+                        st.metric("📊 Overall Confidence", f"{confidence:.2f}")
+
+                    # Color-coded sentiment display
+                    sentiment_colors = {
+                        "Positive": "🟢",
+                        "Negative": "🔴",
+                        "Neutral": "🟡",
+                    }
+
+                    st.markdown(
+                        f"""
+                    <div class="result-box">
+                        <h4>{sentiment_colors.get(sentiment, "❓")} Max Fusion Sentiment: {sentiment}</h4>
+                        <p><strong>Overall Confidence:</strong> {confidence:.2f}</p>
+                        <p><strong>Modalities Analyzed:</strong> {len(results_data)}</p>
+                        <p><strong>Video Source:</strong> {video_name}</p>
+                        <p><strong>Analysis Type:</strong> Comprehensive Multi-Modal Sentiment Analysis</p>
+                    </div>
+                    """,
+                        unsafe_allow_html=True,
+                    )
+                else:
+                    st.error(
+                        "❌ No analysis could be performed. Please check your video input."
+                    )
+
+    else:
+        if video_input_method == "Record Video (Coming Soon)":
+            st.info(
+                "🎥 Video recording feature is coming soon! Please use Upload Video File for now."
+            )
+        else:
+            st.info("📁 Please upload a video file to begin Max Fusion analysis.")
+
 # Footer
 st.markdown("---")
 st.markdown(
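
For context, `predict_fused_sentiment(text, audio_bytes, image)` itself is untouched by this commit; only its `return final_sentiment, avg_confidence` line appears as diff context. Judging from that signature, it combines the per-modality predictions and averages their confidences. A minimal sketch of one plausible combiner, purely an assumption about behavior the diff does not show:

```python
from collections import Counter

def combine_predictions(predictions):
    """predictions: list of (sentiment_label, confidence) pairs, one per modality.

    Majority vote on the label; ties fall back to the most confident prediction.
    This is an assumed strategy, not the app's actual predict_fused_sentiment.
    """
    if not predictions:
        return "Neutral", 0.0
    votes = Counter(label for label, _ in predictions)
    top_label, top_count = votes.most_common(1)[0]
    if sum(1 for c in votes.values() if c == top_count) > 1:
        # Tie: defer to the single most confident modality
        top_label = max(predictions, key=lambda p: p[1])[0]
    avg_confidence = sum(conf for _, conf in predictions) / len(predictions)
    return top_label, avg_confidence

print(combine_predictions([("Positive", 0.9), ("Negative", 0.6), ("Positive", 0.7)]))
# ('Positive', 0.733...)
```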
pyproject.toml
CHANGED
@@ -7,4 +7,8 @@ requires-python = ">=3.9"
 dependencies = [
     "gdown>=5.2.0",
     "python-dotenv>=1.1.1",
+    "moviepy>=1.0.3",
+    "speechrecognition>=3.10.0",
+    "streamlit-webrtc>=0.47.0",
+    "opencv-python-headless>=4.12.0.88",
 ]
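
Since the repo tracks a `uv.lock`, these dependency additions line up with a uv-managed workflow; the equivalent command would be `uv add moviepy speechrecognition streamlit-webrtc opencv-python-headless`, which updates both `pyproject.toml` and `uv.lock` (assuming uv is the tool in use here; a plain `pip install` of the same packages works too).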
requirements.txt
CHANGED
Binary files a/requirements.txt and b/requirements.txt differ

uv.lock
CHANGED
The diff for this file is too large to render. See raw diff.