DeepCritical / MULTIMODAL_SETTINGS_IMPLEMENTATION_PLAN.md
Joseph Pollack

Multimodal Settings & File Rendering - Implementation Plan

Executive Summary

This document provides a comprehensive analysis of the current settings implementation, multimodal input handling, and file rendering in src/app.py, along with a detailed implementation plan to improve the user experience.

1. Current Settings Analysis

1.1 Settings Structure in src/app.py

Current Implementation (Lines 741-887):

  1. Sidebar Structure:

    • Authentication section (lines 745-750)
    • About section (lines 752-764)
    • Settings section (lines 767-850):
      • Research Configuration Accordion (lines 771-796):
        • mode_radio: Orchestrator mode selector
        • graph_mode_radio: Graph research mode selector
        • use_graph_checkbox: Graph execution toggle
      • Audio Output Accordion (lines 798-850):
        • enable_audio_output_checkbox: TTS enable/disable
        • tts_voice_dropdown: Voice selection
        • tts_speed_slider: Speech speed control
        • tts_gpu_dropdown: GPU type (non-interactive, visible only if Modal available)
  2. Hidden Components (Lines 852-865):

    • hf_model_dropdown: Hidden Textbox for model selection
    • hf_provider_dropdown: Hidden Textbox for provider selection
  3. Main Area Components (Lines 867-887):

    • audio_output: Audio output component (visible based on settings.enable_audio_output)
    • Visibility update function for TTS components

1.2 Settings Flow

Settings → Function Parameters:

  • Settings from sidebar accordions are passed via additional_inputs to research_agent() function
  • Hidden textboxes are also passed; their empty-string values are converted to None
  • OAuth token/profile are automatically passed by Gradio

Function Signature (Lines 535-546):

async def research_agent(
    message: str | MultimodalPostprocess,
    history: list[dict[str, Any]],
    mode: str = "simple",
    hf_model: str | None = None,
    hf_provider: str | None = None,
    graph_mode: str = "auto",
    use_graph: bool = True,
    tts_voice: str = "af_heart",
    tts_speed: float = 1.0,
    oauth_token: gr.OAuthToken | None = None,
    oauth_profile: gr.OAuthProfile | None = None,
)

1.3 Issues Identified

  1. Settings Organization:

    • Audio output component is in main area, not sidebar
    • Hidden components (hf_model, hf_provider) should be visible or removed
    • No image input enable/disable setting (only audio input has this)
  2. Visibility:

    • Audio output visibility is controlled by checkbox, but component placement is suboptimal
    • TTS settings visibility is controlled by checkbox change event
  3. Configuration Gaps:

    • No enable_image_input setting in config (only enable_audio_input exists)
    • Image processing always happens if files are present (the comment at line 626 says "not just when enable_image_input is True", but that setting does not exist)

2. Multimodal Input Analysis

2.1 Current Implementation

ChatInterface Configuration (Line 892-958):

  • multimodal=True: Enables MultimodalTextbox component
  • MultimodalTextbox automatically provides:
    • Text input
    • Image upload button
    • Audio recording button
    • File upload support

Input Processing (Lines 613-642):

  • Message can be str or MultimodalPostprocess (dict format)
  • MultimodalPostprocess format: {"text": str, "files": list[FileData], "audio": tuple | None}
  • Processing happens in research_agent() function:
    • Extracts text, files, and audio from message
    • Calls multimodal_service.process_multimodal_input()
    • Condition: if files or (audio_input_data is not None and settings.enable_audio_input)

Multimodal Service (src/services/multimodal_processing.py):

  • Processes audio input if settings.enable_audio_input is True
  • Processes image files (no enable/disable check - always processes if files present)
  • Extracts text from images using OCR service
  • Transcribes audio using STT service

2.2 Gradio Documentation Findings

MultimodalTextbox (ChatInterface with multimodal=True):

  • Automatically provides image and audio input capabilities
  • Inputs are always visible when ChatInterface is rendered
  • No explicit visibility control needed - it's part of the textbox component
  • Files are handled via files array in MultimodalPostprocess
  • Audio recording is handled via audio tuple in MultimodalPostprocess

Reference Implementation Pattern:

gr.ChatInterface(
    fn=chat_function,
    multimodal=True,  # Enables image/audio inputs
    # ... other parameters
)

2.3 Issues Identified

  1. Visibility:

    • Multimodal inputs ARE always visible (they're part of MultimodalTextbox)
    • No explicit control needed - this is working correctly
    • However, users may not realize image/audio inputs are available
  2. Configuration:

    • No enable_image_input setting to disable image processing
    • Image processing always happens if files are present
    • Audio processing respects settings.enable_audio_input
  3. User Experience:

    • No visual indication that multimodal inputs are available
    • Description mentions "🎤 Multimodal Support" but could be more prominent

3. File Rendering Analysis

3.1 Current Implementation

File Detection (Lines 168-195):

  • _is_file_path(): Checks if text looks like a file path
  • Checks for file extensions and path separators

File Rendering in Events (Lines 242-298):

  • For "complete" events, checks event.data for "files" or "file" keys
  • Validates files exist using os.path.exists()
  • Formats files as markdown download links: 📎 [Download: filename](filepath)
  • Stores files in metadata for potential future use

File Links Format:

file_links = "\n\n".join([
    f"📎 [Download: {_get_file_name(f)}]({f})"
    for f in valid_files
])
result["content"] = f"{content}\n\n{file_links}"

3.2 Issues Identified

  1. Rendering Method:

    • Uses markdown links in content string
    • May not work reliably in all Gradio versions
    • Better approach: use Gradio's native gr.File component for downloads
  2. File Validation:

    • Only checks if file exists
    • Doesn't validate file type or size
    • No error handling for inaccessible files
  3. User Experience:

    • Files appear as text links, not as proper file components
    • No preview for images/PDFs
    • No file size information

4. Implementation Plan

Activity 1: Settings Reorganization

Goal: Move all settings to sidebar with better organization

File: src/app.py

Tasks:

  1. Move Audio Output Component to Sidebar (Lines 867-887)

    • Move audio_output component into sidebar
    • Place it in Audio Output accordion or create separate section
    • Update visibility logic to work within sidebar
  2. Add Image Input Settings (New)

    • Add enable_image_input checkbox to sidebar
    • Create "Image Input" accordion or add to existing "Multimodal Input" accordion
    • Update config to include enable_image_input setting
  3. Organize Settings Accordions

    • Research Configuration (existing)
    • Multimodal Input (new - combine image and audio input settings)
    • Audio Output (existing - move component here)
    • Model Configuration (new - for hf_model, hf_provider if we make them visible)

Subtasks:

  • Line 867-871: Move audio_output component definition into sidebar
  • Line 873-887: Update visibility update function to work with sidebar placement
  • Line 798-850: Reorganize Audio Output accordion to include audio_output component
  • Line 767-796: Keep Research Configuration as-is
  • After line 796: Add new "Multimodal Input" accordion with enable_image_input and enable_audio_input checkboxes
  • Line 852-865: Consider making hf_model and hf_provider visible or remove them

Activity 2: Multimodal Input Visibility

Goal: Ensure multimodal inputs are always visible and well-documented

File: src/app.py

Tasks:

  1. Verify Multimodal Inputs Are Visible

    • Confirm multimodal=True in ChatInterface (already done - line 894)
    • Add visual indicators in description
    • Add tooltips or help text
  2. Add Image Input Configuration

    • Add enable_image_input to config (src/utils/config.py)
    • Update multimodal processing to respect this setting
    • Add UI control in sidebar

Subtasks:

  • Line 894: Verify multimodal=True is set (already correct)
  • Line 908: Enhance description to highlight multimodal capabilities
  • src/utils/config.py: Add enable_image_input: bool = Field(default=True, ...)
  • src/services/multimodal_processing.py: Add check for settings.enable_image_input before processing images
  • src/app.py: Add enable_image_input checkbox to sidebar

Activity 3: File Rendering Improvements

Goal: Improve file rendering using proper Gradio components

File: src/app.py

Tasks:

  1. Improve File Rendering Method

    • Use Gradio File component or proper file handling
    • Add file previews for images
    • Show file size and type information
  2. Enhance File Validation

    • Validate file types
    • Check file accessibility
    • Handle errors gracefully

Subtasks:

  • Line 280-296: Replace markdown link approach with proper file component rendering
  • Line 168-195: Enhance _is_file_path() to validate file types
  • Line 242-298: Update event_to_chat_message() to use Gradio File components
  • Add file preview functionality for images
  • Add error handling for inaccessible files

Activity 4: Configuration Updates

Goal: Add missing configuration settings

File: src/utils/config.py

Tasks:

  1. Add Image Input Setting
    • Add enable_image_input field
    • Add ocr_api_url field if missing
    • Add property methods for availability checks

Subtasks:

  • After line 147: Add enable_image_input: bool = Field(default=True, description="Enable image input (OCR) in multimodal interface")
  • Check if ocr_api_url exists (should be in config)
  • Add image_ocr_available property if missing
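A sketch of the proposed field, using the pydantic Field style the plan quotes (the surrounding Settings class is simplified for illustration):

```python
from pydantic import BaseModel, Field


class Settings(BaseModel):
    # Existing setting, shown for context.
    enable_audio_input: bool = Field(default=True)
    # Proposed new setting.
    enable_image_input: bool = Field(
        default=True,
        description="Enable image input (OCR) in multimodal interface",
    )

    @property
    def image_ocr_available(self) -> bool:
        # Simplified: the real property would also require ocr_api_url
        # to be configured; that wiring is elided here.
        return self.enable_image_input
```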

Activity 5: Multimodal Service Updates

Goal: Respect image input enable/disable setting

File: src/services/multimodal_processing.py

Tasks:

  1. Add Image Input Check
    • Check settings.enable_image_input before processing images
    • Log when image processing is skipped due to setting

Subtasks:

  • Line 66-77: Add check for settings.enable_image_input before processing image files
  • Add logging when image processing is skipped
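The gating logic can be sketched as a pure helper (hypothetical name; in the service it would wrap the existing image-processing branch):

```python
import logging

logger = logging.getLogger(__name__)


def should_process_images(files: list, enable_image_input: bool) -> bool:
    """Process images only when files are present AND the setting allows it."""
    if files and not enable_image_input:
        logger.info("Skipping image processing: enable_image_input is False")
        return False
    return bool(files)
```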

5. Detailed File-Level Tasks

File: src/app.py

Line-Level Subtasks:

  1. Lines 741-850: Sidebar Reorganization

    • 741-765: Keep authentication and about sections
    • 767-796: Keep Research Configuration accordion
    • 797: Add new "Multimodal Input" accordion after Research Configuration
    • 798-850: Reorganize Audio Output accordion, move audio_output component here
    • 852-865: Review hidden components - make visible or remove
  2. Lines 867-887: Audio Output Component

    • 867-871: Move audio_output definition into sidebar (Audio Output accordion)
    • 873-887: Update visibility function to work with sidebar placement
  3. Lines 892-958: ChatInterface Configuration

    • 894: Verify multimodal=True (already correct)
    • 908: Enhance description with multimodal capabilities
    • 946-956: Review additional_inputs - ensure all settings are included
  4. Lines 242-298: File Rendering

    • 280-296: Replace markdown links with proper file component rendering
    • Add file preview for images
    • Add file size/type information
  5. Lines 613-642: Multimodal Input Processing

    • 626: Update condition to check settings.enable_image_input for files
    • Add logging for when image processing is skipped

File: src/utils/config.py

Line-Level Subtasks:

  1. Lines 143-180: Audio/Image Configuration
    • 144-147: enable_audio_input exists (keep as-is)
    • After 147: Add enable_image_input: bool = Field(default=True, description="Enable image input (OCR) in multimodal interface")
    • Check if ocr_api_url exists (add if missing)
    • Add image_ocr_available property method

File: src/services/multimodal_processing.py

Line-Level Subtasks:

  1. Lines 65-77: Image Processing
    • 66: Add check: if files and settings.enable_image_input:
    • 71-77: Keep image processing logic inside the new condition
    • Add logging when image processing is skipped

6. Testing Checklist

  • Verify all settings are in sidebar
  • Test multimodal inputs (image upload, audio recording)
  • Test file rendering (markdown, PDF, images)
  • Test enable/disable toggles for image and audio inputs
  • Test audio output generation and display
  • Test file download links
  • Verify settings persist across chat sessions
  • Test on different screen sizes (responsive design)

7. Implementation Order

  1. Phase 1: Configuration (Foundation)

    • Add enable_image_input to config
    • Update multimodal service to respect setting
  2. Phase 2: Settings Reorganization (UI)

    • Move audio output to sidebar
    • Add image input settings to sidebar
    • Organize accordions
  3. Phase 3: File Rendering (Enhancement)

    • Improve file rendering method
    • Add file previews
    • Enhance validation
  4. Phase 4: Testing & Refinement (Quality)

    • Test all functionality
    • Fix any issues
    • Refine UI/UX

8. Success Criteria

  • ✅ All settings are in the sidebar
  • ✅ Multimodal inputs are always visible and functional
  • ✅ Files are rendered properly with previews
  • ✅ Image and audio input can be enabled/disabled
  • ✅ Settings are well-organized and intuitive
  • ✅ No regressions in existing functionality