Spaces:
Running
Multimodal Settings & File Rendering - Implementation Plan
Executive Summary
This document provides a comprehensive analysis of the current settings implementation, multimodal input handling, and file rendering in src/app.py, along with a detailed implementation plan to improve the user experience.
1. Current Settings Analysis
1.1 Settings Structure in src/app.py
Current Implementation (Lines 741-887):
Sidebar Structure:
- Authentication section (lines 745-750)
- About section (lines 752-764)
- Settings section (lines 767-850):
- Research Configuration Accordion (lines 771-796):
mode_radio: Orchestrator mode selectorgraph_mode_radio: Graph research mode selectoruse_graph_checkbox: Graph execution toggle
- Audio Output Accordion (lines 798-850):
enable_audio_output_checkbox: TTS enable/disabletts_voice_dropdown: Voice selectiontts_speed_slider: Speech speed controltts_gpu_dropdown: GPU type (non-interactive, visible only if Modal available)
- Research Configuration Accordion (lines 771-796):
Hidden Components (Lines 852-865):
hf_model_dropdown: Hidden Textbox for model selectionhf_provider_dropdown: Hidden Textbox for provider selection
Main Area Components (Lines 867-887):
audio_output: Audio output component (visible based onsettings.enable_audio_output)- Visibility update function for TTS components
1.2 Settings Flow
Settings β Function Parameters:
- Settings from sidebar accordions are passed via
additional_inputstoresearch_agent()function - Hidden textboxes are also passed but use empty strings (converted to None)
- OAuth token/profile are automatically passed by Gradio
Function Signature (Lines 535-546):
async def research_agent(
message: str | MultimodalPostprocess,
history: list[dict[str, Any]],
mode: str = "simple",
hf_model: str | None = None,
hf_provider: str | None = None,
graph_mode: str = "auto",
use_graph: bool = True,
tts_voice: str = "af_heart",
tts_speed: float = 1.0,
oauth_token: gr.OAuthToken | None = None,
oauth_profile: gr.OAuthProfile | None = None,
)
1.3 Issues Identified
Settings Organization:
- Audio output component is in main area, not sidebar
- Hidden components (hf_model, hf_provider) should be visible or removed
- No image input enable/disable setting (only audio input has this)
Visibility:
- Audio output visibility is controlled by checkbox, but component placement is suboptimal
- TTS settings visibility is controlled by checkbox change event
Configuration Gaps:
- No
enable_image_inputsetting in config (onlyenable_audio_inputexists) - Image processing always happens if files are present (line 626 comment says "not just when enable_image_input is True" but setting doesn't exist)
- No
2. Multimodal Input Analysis
2.1 Current Implementation
ChatInterface Configuration (Line 892-958):
multimodal=True: Enables MultimodalTextbox component- MultimodalTextbox automatically provides:
- Text input
- Image upload button
- Audio recording button
- File upload support
Input Processing (Lines 613-642):
- Message can be
strorMultimodalPostprocess(dict format) - MultimodalPostprocess format:
{"text": str, "files": list[FileData], "audio": tuple | None} - Processing happens in
research_agent()function:- Extracts text, files, and audio from message
- Calls
multimodal_service.process_multimodal_input() - Condition:
if files or (audio_input_data is not None and settings.enable_audio_input)
Multimodal Service (src/services/multimodal_processing.py):
- Processes audio input if
settings.enable_audio_inputis True - Processes image files (no enable/disable check - always processes if files present)
- Extracts text from images using OCR service
- Transcribes audio using STT service
2.2 Gradio Documentation Findings
MultimodalTextbox (ChatInterface with multimodal=True):
- Automatically provides image and audio input capabilities
- Inputs are always visible when ChatInterface is rendered
- No explicit visibility control needed - it's part of the textbox component
- Files are handled via
filesarray in MultimodalPostprocess - Audio recording is handled via
audiotuple in MultimodalPostprocess
Reference Implementation Pattern:
gr.ChatInterface(
fn=chat_function,
multimodal=True, # Enables image/audio inputs
# ... other parameters
)
2.3 Issues Identified
Visibility:
- Multimodal inputs ARE always visible (they're part of MultimodalTextbox)
- No explicit control needed - this is working correctly
- However, users may not realize image/audio inputs are available
Configuration:
- No
enable_image_inputsetting to disable image processing - Image processing always happens if files are present
- Audio processing respects
settings.enable_audio_input
- No
User Experience:
- No visual indication that multimodal inputs are available
- Description mentions "π€ Multimodal Support" but could be more prominent
3. File Rendering Analysis
3.1 Current Implementation
File Detection (Lines 168-195):
_is_file_path(): Checks if text looks like a file path- Checks for file extensions and path separators
File Rendering in Events (Lines 242-298):
- For "complete" events, checks
event.datafor "files" or "file" keys - Validates files exist using
os.path.exists() - Formats files as markdown download links:
π [Download: filename](filepath) - Stores files in metadata for potential future use
File Links Format:
file_links = "\n\n".join([
f"π [Download: {_get_file_name(f)}]({f})"
for f in valid_files
])
result["content"] = f"{content}\n\n{file_links}"
3.2 Issues Identified
Rendering Method:
- Uses markdown links in content string
- May not work reliably in all Gradio versions
- Better approach: Use Gradio's native file components or File component
File Validation:
- Only checks if file exists
- Doesn't validate file type or size
- No error handling for inaccessible files
User Experience:
- Files appear as text links, not as proper file components
- No preview for images/PDFs
- No file size information
4. Implementation Plan
Activity 1: Settings Reorganization
Goal: Move all settings to sidebar with better organization
File: src/app.py
Tasks:
Move Audio Output Component to Sidebar (Lines 867-887)
- Move
audio_outputcomponent into sidebar - Place it in Audio Output accordion or create separate section
- Update visibility logic to work within sidebar
- Move
Add Image Input Settings (New)
- Add
enable_image_inputcheckbox to sidebar - Create "Image Input" accordion or add to existing "Multimodal Input" accordion
- Update config to include
enable_image_inputsetting
- Add
Organize Settings Accordions
- Research Configuration (existing)
- Multimodal Input (new - combine image and audio input settings)
- Audio Output (existing - move component here)
- Model Configuration (new - for hf_model, hf_provider if we make them visible)
Subtasks:
- Line 867-871: Move
audio_outputcomponent definition into sidebar - Line 873-887: Update visibility update function to work with sidebar placement
- Line 798-850: Reorganize Audio Output accordion to include audio_output component
- Line 767-796: Keep Research Configuration as-is
- After line 796: Add new "Multimodal Input" accordion with enable_image_input and enable_audio_input checkboxes
- Line 852-865: Consider making hf_model and hf_provider visible or remove them
Activity 2: Multimodal Input Visibility
Goal: Ensure multimodal inputs are always visible and well-documented
File: src/app.py
Tasks:
Verify Multimodal Inputs Are Visible
- Confirm
multimodal=Truein ChatInterface (already done - line 894) - Add visual indicators in description
- Add tooltips or help text
- Confirm
Add Image Input Configuration
- Add
enable_image_inputto config (src/utils/config.py) - Update multimodal processing to respect this setting
- Add UI control in sidebar
- Add
Subtasks:
- Line 894: Verify
multimodal=Trueis set (already correct) - Line 908: Enhance description to highlight multimodal capabilities
- src/utils/config.py: Add
enable_image_input: bool = Field(default=True, ...) - src/services/multimodal_processing.py: Add check for
settings.enable_image_inputbefore processing images - src/app.py: Add enable_image_input checkbox to sidebar
Activity 3: File Rendering Improvements
Goal: Improve file rendering using proper Gradio components
File: src/app.py
Tasks:
Improve File Rendering Method
- Use Gradio File component or proper file handling
- Add file previews for images
- Show file size and type information
Enhance File Validation
- Validate file types
- Check file accessibility
- Handle errors gracefully
Subtasks:
- Line 280-296: Replace markdown link approach with proper file component rendering
- Line 168-195: Enhance
_is_file_path()to validate file types - Line 242-298: Update
event_to_chat_message()to use Gradio File components - Add file preview functionality for images
- Add error handling for inaccessible files
Activity 4: Configuration Updates
Goal: Add missing configuration settings
File: src/utils/config.py
Tasks:
- Add Image Input Setting
- Add
enable_image_inputfield - Add
ocr_api_urlfield if missing - Add property methods for availability checks
- Add
Subtasks:
- After line 147: Add
enable_image_input: bool = Field(default=True, description="Enable image input (OCR) in multimodal interface") - Check if
ocr_api_urlexists (should be in config) - Add
image_ocr_availableproperty if missing
Activity 5: Multimodal Service Updates
Goal: Respect image input enable/disable setting
File: src/services/multimodal_processing.py
Tasks:
- Add Image Input Check
- Check
settings.enable_image_inputbefore processing images - Log when image processing is skipped due to setting
- Check
Subtasks:
- Line 66-77: Add check for
settings.enable_image_inputbefore processing image files - Add logging when image processing is skipped
5. Detailed File-Level Tasks
File: src/app.py
Line-Level Subtasks:
Lines 741-850: Sidebar Reorganization
- 741-765: Keep authentication and about sections
- 767-796: Keep Research Configuration accordion
- 797: Add new "Multimodal Input" accordion after Research Configuration
- 798-850: Reorganize Audio Output accordion, move audio_output component here
- 852-865: Review hidden components - make visible or remove
Lines 867-887: Audio Output Component
- 867-871: Move
audio_outputdefinition into sidebar (Audio Output accordion) - 873-887: Update visibility function to work with sidebar placement
- 867-871: Move
Lines 892-958: ChatInterface Configuration
- 894: Verify
multimodal=True(already correct) - 908: Enhance description with multimodal capabilities
- 946-956: Review
additional_inputs- ensure all settings are included
- 894: Verify
Lines 242-298: File Rendering
- 280-296: Replace markdown links with proper file component rendering
- Add file preview for images
- Add file size/type information
Lines 613-642: Multimodal Input Processing
- 626: Update condition to check
settings.enable_image_inputfor files - Add logging for when image processing is skipped
- 626: Update condition to check
File: src/utils/config.py
Line-Level Subtasks:
- Lines 143-180: Audio/Image Configuration
- 144-147:
enable_audio_inputexists (keep as-is) - After 147: Add
enable_image_input: bool = Field(default=True, description="Enable image input (OCR) in multimodal interface") - Check if
ocr_api_urlexists (add if missing) - Add
image_ocr_availableproperty method
- 144-147:
File: src/services/multimodal_processing.py
Line-Level Subtasks:
- Lines 65-77: Image Processing
- 66: Add check:
if files and settings.enable_image_input: - 71-77: Keep image processing logic inside the new condition
- Add logging when image processing is skipped
- 66: Add check:
6. Testing Checklist
- Verify all settings are in sidebar
- Test multimodal inputs (image upload, audio recording)
- Test file rendering (markdown, PDF, images)
- Test enable/disable toggles for image and audio inputs
- Test audio output generation and display
- Test file download links
- Verify settings persist across chat sessions
- Test on different screen sizes (responsive design)
7. Implementation Order
Phase 1: Configuration (Foundation)
- Add
enable_image_inputto config - Update multimodal service to respect setting
- Add
Phase 2: Settings Reorganization (UI)
- Move audio output to sidebar
- Add image input settings to sidebar
- Organize accordions
Phase 3: File Rendering (Enhancement)
- Improve file rendering method
- Add file previews
- Enhance validation
Phase 4: Testing & Refinement (Quality)
- Test all functionality
- Fix any issues
- Refine UI/UX
8. Success Criteria
- β All settings are in sidebar
- β Multimodal inputs are always visible and functional
- β Files are rendered properly with previews
- β Image and audio input can be enabled/disabled
- β Settings are well-organized and intuitive
- β No regressions in existing functionality