# System Architecture

> **Meeting Intelligence Agent - Technical Architecture Documentation**

This document provides a high-level overview of the system architecture, design decisions, and component relationships for the Meeting Intelligence Agent.

---

## Table of Contents

- [System Overview](#system-overview)
- [Architecture Diagram](#architecture-diagram)
- [Core Components](#core-components)
- [Data Flow](#data-flow)
- [Key Design Decisions](#key-design-decisions)
- [State Management](#state-management)
- [Scalability & Performance](#scalability--performance)
- [Security Architecture](#security-architecture)
- [Technology Stack](#technology-stack)

---
## System Overview

The Meeting Intelligence Agent is a **conversational AI system** built on LangGraph that orchestrates meeting video processing, transcription, storage, and intelligent querying through natural language interaction.

### Core Capabilities

1. **Video Processing Pipeline**: Upload → Transcription → Speaker Diarization → Metadata Extraction → Vector Storage
2. **Semantic Search**: RAG-based querying across meeting transcripts using natural language
3. **External Integrations**: MCP (Model Context Protocol) servers for Notion and time-aware queries
4. **Conversational Interface**: Gradio-based chat UI with file upload support

### Design Philosophy

- **Conversational-First**: All functionality accessible through natural language
- **Modular Architecture**: Clear separation between UI, agent, tools, and services
- **Extensible**: The MCP protocol enables easy addition of new capabilities
- **Async-Ready**: Supports long-running operations (transcription, MCP calls)
- **Production-Ready**: Docker support, error handling, graceful degradation

---
## Architecture Diagram

```mermaid
graph TB
    subgraph "Frontend Layer"
        UI[Gradio Interface]
        Chat[Chat Component]
        Upload[File Upload]
        Editor[Transcript Editor]
    end

    subgraph "Agent Layer (LangGraph)"
        Agent[Conversational Agent]
        StateMachine[State Machine]
        ToolRouter[Tool Router]
    end

    subgraph "Tool Layer"
        VideoTools[Video Processing Tools]
        QueryTools[Meeting Query Tools]
        MCPTools[MCP Integration Tools]
    end

    subgraph "Processing Layer"
        WhisperX[WhisperX Transcription]
        Pyannote[Speaker Diarization]
        MetadataExtractor[GPT-4o-mini Metadata]
        Embeddings[OpenAI Embeddings]
    end

    subgraph "Storage Layer"
        Pinecone[(Pinecone Vector DB)]
        LocalState[Local State Cache]
    end

    subgraph "External Services"
        OpenAI[OpenAI API]
        NotionMCP[Notion MCP Server]
        TimeMCP[Time MCP Server]
        ZoomMCP[Zoom MCP Server<br/>In Development]
    end

    UI --> Agent
    Agent --> StateMachine
    StateMachine --> ToolRouter
    ToolRouter --> VideoTools
    ToolRouter --> QueryTools
    ToolRouter --> MCPTools
    VideoTools --> WhisperX
    VideoTools --> Pyannote
    VideoTools --> MetadataExtractor
    VideoTools --> Embeddings
    QueryTools --> Embeddings
    QueryTools --> Pinecone
    MCPTools --> NotionMCP
    MCPTools --> TimeMCP
    MCPTools -.-> ZoomMCP
    Embeddings --> Pinecone
    MetadataExtractor --> OpenAI
    Agent --> OpenAI

    classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
    classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000000
    classDef tools fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000000
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
    classDef storage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000000
    classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000

    class UI,Chat,Upload,Editor frontend
    class Agent,StateMachine,ToolRouter agent
    class VideoTools,QueryTools,MCPTools tools
    class WhisperX,Pyannote,MetadataExtractor,Embeddings processing
    class Pinecone,LocalState storage
    class OpenAI,NotionMCP,TimeMCP,ZoomMCP external
```
---

## Core Components

### 1. Frontend Layer (Gradio)

**Purpose**: User interface for interaction and file management

**Components**:

- **Chat Interface**: Primary conversational UI using `gr.ChatInterface`
- **File Upload**: Video file upload widget
- **Transcript Editor**: Editable text area for manual corrections
- **State Display**: Real-time feedback on processing status

**Technology**: Gradio 5.x with async support

**Key Files**:

- `src/ui/gradio_app.py` - UI component definitions and event handlers
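
For orientation, a minimal sketch of the conversational shell built on `gr.ChatInterface`. The handler here is a placeholder; the real wiring (file upload, transcript editor, state display) lives in `src/ui/gradio_app.py`:

```python
import gradio as gr

async def respond(message, history):
    # Placeholder handler: the real one delegates to the LangGraph agent's
    # generate_response() and streams partial answers back to the chat UI.
    yield f"(processing) {message}"

demo = gr.ChatInterface(fn=respond, title="Meeting Intelligence Agent")
demo.launch()
```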
---

### 2. Agent Layer (LangGraph)

**Purpose**: Orchestrates the entire workflow through conversational AI

**Architecture**: State machine with three nodes and a conditional edge:

```
┌─────────┐      ┌─────────┐      ┌─────────┐
│ PREPARE │ ───> │  AGENT  │ ───> │  TOOLS  │
└─────────┘      └─────────┘      └─────────┘
                      ▲                │
                      └────────────────┘
```

**Components**:

1. **Prepare Node**: Converts chat history to LangChain messages
2. **Agent Node**: LLM decides which tools to call
3. **Tools Node**: Executes the selected tools
4. **Conditional Router**: Determines whether more tool calls are needed

**State Structure**:

```python
{
    "message": str,                 # Current user query
    "history": List[List[str]],     # Conversation history
    "llm_messages": List[Message],  # LangChain message format
    "response": str,                # Generated response
    "error": Optional[str]          # Error tracking
}
```
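
The nodes and the conditional edge are wired together with LangGraph's `StateGraph`. A sketch of that wiring, assuming node functions `prepare_node` and `agent_node` and a `tools` list defined elsewhere in `src/agents/conversational.py` (these names are illustrative):

```python
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode

graph = StateGraph(ConversationalAgentState)   # state schema shown above
graph.add_node("prepare", prepare_node)        # history -> LangChain messages
graph.add_node("agent", agent_node)            # tool-bound LLM call
graph.add_node("tools", ToolNode(tools))       # executes requested tools

graph.set_entry_point("prepare")
graph.add_edge("prepare", "agent")
graph.add_edge("tools", "agent")               # loop back after tool execution

def should_continue(state):
    # Keep routing to the tool node while the latest AI message requests tools
    last = state["llm_messages"][-1]
    return "tools" if getattr(last, "tool_calls", None) else END

graph.add_conditional_edges("agent", should_continue)
compiled_graph = graph.compile()
```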
**Key Files**:

- `src/agents/conversational.py` - LangGraph agent implementation (570 lines)

---

### 3. Tool Layer

**Purpose**: Provides discrete capabilities that the agent can invoke

**Categories**:

#### Video Processing Tools (8 tools)

- File upload management
- Transcription orchestration
- Speaker name mapping
- Transcript editing
- Pinecone upload

#### Meeting Query Tools (6 tools)

- Semantic search
- Metadata retrieval
- Meeting listing
- Text upsert
- Notion import/export

#### MCP Integration Tools (6+ tools)

- Notion API operations
- Time queries
- Future: Zoom RTMS

**Design Pattern**: LangChain `@tool` decorator for automatic schema generation
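
A minimal illustration of that pattern; the tool name, docstring, and body below are hypothetical, not one of the project's actual tools:

```python
from langchain_core.tools import tool

@tool
def list_recent_meetings(limit: int = 5) -> str:
    """Return the titles and dates of the most recently stored meetings."""
    # The docstring and type hints above become the tool's schema,
    # which the LLM uses to decide when and how to call the tool.
    return "No meetings stored yet."  # placeholder body
```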
**Key Files**:

- `src/tools/video.py` - Video processing tools (528 lines)
- `src/tools/general.py` - Query and integration tools (577 lines)
- `src/tools/mcp/` - MCP client wrappers

---

### 4. Processing Layer

**Purpose**: Handles compute-intensive operations

**Components**:

#### WhisperX Transcription

- **Model**: Configurable (tiny/small/medium/large)
- **Features**: Word-level timestamps, language detection
- **Performance**: GPU-accelerated when available

#### Pyannote Speaker Diarization

- **Model**: `pyannote/speaker-diarization-3.1`
- **Output**: Speaker segments with timestamps
- **Integration**: Aligned with WhisperX word timestamps
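
A condensed sketch of how these two steps typically fit together with the WhisperX Python API. Model size, device handling, and the `HF_TOKEN` variable are assumptions; the project's actual pipeline lives in `src/processing/transcription.py`:

```python
import os
import whisperx

device = "cuda"  # or "cpu" when no GPU is available
audio = whisperx.load_audio("meeting.mp4")

# 1. Transcribe with WhisperX
model = whisperx.load_model("small", device)
result = model.transcribe(audio, batch_size=16)

# 2. Align words to precise timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize with Pyannote and attach speaker labels to each segment
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for segment in result["segments"]:
    print(segment.get("speaker", "UNKNOWN"), segment["text"])
```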
#### Metadata Extraction

- **Model**: GPT-4o-mini (cost-optimized)
- **Extracts**: Title, date, summary, speaker mapping
- **Format**: Structured JSON output

#### Embeddings

- **Model**: OpenAI `text-embedding-3-small`
- **Dimension**: 1536
- **Usage**: Query and document embedding
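
Generating an embedding is a single call with the OpenAI Python SDK; this sketch assumes `OPENAI_API_KEY` is set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What action items were assigned in the Q4 planning meeting?"],
)
vector = response.data[0].embedding
print(len(vector))  # 1536
```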
**Key Files**:

- `src/processing/transcription.py` - WhisperX + Pyannote pipeline
- `src/processing/metadata_extractor.py` - GPT-4o-mini extraction

---

### 5. Storage Layer

**Purpose**: Persistent and temporary data storage

#### Pinecone Vector Database

- **Type**: Serverless
- **Index**: `meeting-transcripts-1-dev`
- **Namespace**: Environment-based (`development`/`production`)
- **Metadata**: Rich metadata for filtering (title, date, source, speakers)

**Schema**:

```python
{
    "id": "meeting_abc12345_chunk_001",
    "values": [...],  # 1536-dimensional embedding
    "metadata": {
        "meeting_id": "meeting_abc12345",
        "meeting_title": "Q4 Planning",
        "meeting_date": "2024-12-07",
        "summary": "...",
        "speaker_mapping": {...},
        "source": "video",
        "chunk_index": 1,
        "text": "actual transcript chunk"
    }
}
```
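
Writing these records uses the standard Pinecone client; a minimal sketch assuming the index and namespace above already exist (the dummy vector stands in for a real embedding):

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("meeting-transcripts-1-dev")

# Upsert a single chunk; in practice "values" comes from the embeddings call above
index.upsert(
    vectors=[{
        "id": "meeting_abc12345_chunk_001",
        "values": [0.0] * 1536,  # placeholder for the real 1536-dim embedding
        "metadata": {
            "meeting_id": "meeting_abc12345",
            "meeting_title": "Q4 Planning",
            "text": "actual transcript chunk",
        },
    }],
    namespace="development",
)
```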
#### Local State Cache

- **Purpose**: Temporary storage for the video processing workflow
- **Scope**: In-memory, per-session
- **Contents**: Uploaded video path, transcription text, timing info

**Key Files**:

- `src/retrievers/pinecone.py` - Vector database manager

---

### 6. External Services

**Purpose**: Third-party APIs and custom MCP servers

#### OpenAI API

- **Models**: GPT-3.5-turbo (agent), GPT-4o-mini (metadata)
- **Usage**: Agent reasoning, metadata extraction, embeddings

#### Notion MCP Server

- **Type**: Official `@notionhq/notion-mcp-server`
- **Transport**: stdio (local subprocess)
- **Capabilities**: Search, read, create, update pages

#### Time MCP Server (Custom)

- **Type**: Gradio-based MCP server
- **Transport**: SSE (Server-Sent Events)
- **Deployment**: HuggingFace Spaces
- **URL**: `https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse`
- **Purpose**: Time-aware query support

#### Zoom RTMS Server (In Development)

- **Type**: FastAPI + Gradio hybrid
- **Transport**: stdio + webhooks
- **Status**: Prototype, API integration pending
- **Purpose**: Live meeting transcription

**Key Files**:

- `src/tools/mcp/mcp_manager.py` - Multi-server MCP client
- `external_mcp_servers/time_mcp_server/` - Custom time server
- `external_mcp_servers/zoom_mcp/` - Zoom RTMS prototype
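
One common way to connect to both servers from LangChain is the `langchain-mcp-adapters` package. Whether `src/tools/mcp/mcp_manager.py` uses this exact client is an assumption, so treat the snippet as a sketch of the multi-server pattern (authentication headers for the Notion server are omitted):

```python
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient

async def load_mcp_tools():
    client = MultiServerMCPClient({
        "notion": {
            # Official Notion MCP server, launched as a local subprocess over stdio
            "command": "npx",
            "args": ["-y", "@notionhq/notion-mcp-server"],
            "transport": "stdio",
        },
        "time": {
            # Custom Gradio MCP server on HuggingFace Spaces, reached over SSE
            "url": "https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse",
            "transport": "sse",
        },
    })
    return await client.get_tools()  # LangChain tools the agent binds to its LLM

tools = asyncio.run(load_mcp_tools())
```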
---

## Data Flow

### Video Upload Flow

```
User uploads video.mp4
        ↓
Gradio saves to temp directory
        ↓
Agent calls transcribe_uploaded_video(path)
        ↓
WhisperX extracts audio + transcribes
        ↓
Pyannote identifies speakers
        ↓
Alignment: Match speakers to transcript
        ↓
Format: SPEAKER_00, SPEAKER_01, etc.
        ↓
Return formatted transcript to agent
        ↓
Agent shows transcript to user
        ↓
User optionally edits or updates speaker names
        ↓
Agent calls upload_transcription_to_pinecone()
        ↓
GPT-4o-mini extracts metadata
        ↓
Text chunked into semantic segments
        ↓
OpenAI embeddings generated
        ↓
Upsert to Pinecone with metadata
        ↓
Return meeting_id to user
```
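
The "chunked into semantic segments" step can be done with a standard text splitter. A sketch using LangChain's `RecursiveCharacterTextSplitter`; the chunk sizes are illustrative, not the project's actual settings:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters per chunk (illustrative value)
    chunk_overlap=150,    # overlap keeps context across chunk boundaries
)

transcript = "SPEAKER_00: Let's review the Q4 roadmap...\nSPEAKER_01: Agreed..."
chunks = splitter.split_text(transcript)
# Each chunk is then embedded and upserted to Pinecone as meeting_<id>_chunk_<n>
```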
### Query Flow

```
User asks: "What action items were assigned last Tuesday?"
        ↓
Agent receives query
        ↓
Agent calls get_time_for_city("Berlin") [Time MCP]
        ↓
Time server returns: "2024-12-07"
        ↓
Agent calculates: "Last Tuesday = 2024-12-03"
        ↓
Agent calls search_meetings(query="action items", date_filter="2024-12-03")
        ↓
Query embedded via OpenAI
        ↓
Pinecone vector search
        ↓
Top-k chunks retrieved with metadata
        ↓
Results returned to agent
        ↓
Agent synthesizes answer from chunks
        ↓
Response streamed to user
```
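
The vector search step combines the query embedding with a metadata filter. Continuing from the Pinecone snippet in the Storage Layer section (the exact filter used by `search_meetings` is an assumption; the field name follows the schema above):

```python
results = index.query(
    vector=query_embedding,   # 1536-dim embedding of "action items"
    top_k=5,
    namespace="development",
    include_metadata=True,
    filter={"meeting_date": {"$eq": "2024-12-03"}},  # restrict to last Tuesday
)

for match in results.matches:
    print(match.score, match.metadata["meeting_title"], match.metadata["text"][:80])
```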
### Notion Integration Flow

```
User: "Import 'Meeting 3' from Notion"
        ↓
Agent calls import_notion_to_pinecone(query="Meeting 3")
        ↓
Tool calls Notion MCP: API-post-search(query="Meeting 3")
        ↓
Notion returns page_id
        ↓
Tool calls API-retrieve-a-page(page_id) → metadata
        ↓
Tool calls API-get-block-children(page_id) → content blocks
        ↓
Recursive extraction of nested blocks
        ↓
Full text assembled
        ↓
GPT-4o-mini extracts metadata
        ↓
Text chunked and embedded
        ↓
Upsert to Pinecone
        ↓
Return success message with meeting_id
```

---
## Key Design Decisions

### 1. Why LangGraph?

**Decision**: Use LangGraph instead of LangChain's AgentExecutor or other frameworks

**Rationale**:

- ✅ **Explicit state management**: Full control over conversation state
- ✅ **Async support**: Required for MCP tools (Notion API)
- ✅ **Debugging**: Clear visibility into state transitions
- ✅ **Flexibility**: Easy to add custom nodes and conditional routing
- ✅ **Streaming**: Native support for response streaming

**Alternative Considered**: LangChain AgentExecutor (rejected due to limited async support)
---

### 2. Why Separate MCP Servers?

**Decision**: Deploy custom MCP servers in `external_mcp_servers/` as standalone applications

**Rationale**:

- ✅ **Independent scaling**: The time server can handle multiple agents
- ✅ **Deployment flexibility**: Update servers without redeploying the agent
- ✅ **Development isolation**: Test MCP servers independently
- ✅ **Reusability**: Other projects can use the same MCP servers
- ✅ **Transport options**: HTTP (SSE) for remote, stdio for local

**Architecture**:

```
Main Agent (HF Space 1)
    ├── HTTP/SSE ──> Time MCP Server (HF Space 2)
    └── HTTP/SSE ──> Zoom MCP Server (HF Space 3)
```

**Alternative Considered**: Embed MCP servers in the main app (rejected due to coupling)
---

### 3. Why Pinecone Serverless?

**Decision**: Use Pinecone serverless for vector storage

**Rationale**:

- ✅ **No infrastructure management**: Fully managed
- ✅ **Cost-effective**: Pay per usage, no idle costs
- ✅ **Scalability**: Auto-scales with demand
- ✅ **Metadata filtering**: Rich filtering capabilities
- ✅ **Namespaces**: Environment isolation (dev/prod)

**Alternative Considered**: Chroma (rejected due to self-hosting requirements)

---

### 4. Why GPT-3.5-turbo for the Agent?

**Decision**: Use GPT-3.5-turbo instead of GPT-4 for agent reasoning

**Rationale**:

- ✅ **Cost**: Roughly 10x cheaper than GPT-4
- ✅ **Speed**: Faster response times
- ✅ **Sufficient**: Tool calling works well with 3.5-turbo
- ✅ **Budget**: GPT-4o-mini used for metadata extraction (specialized task)

**Cost Comparison** (per 1M tokens):

- GPT-3.5-turbo: $0.50 input / $1.50 output
- GPT-4: $30 input / $60 output
- GPT-4o-mini: $0.15 input / $0.60 output

---
### 5. Why Async Patterns?

**Decision**: Use `async/await` throughout the agent

**Rationale**:

- ✅ **MCP requirement**: Notion MCP tools are async
- ✅ **Long operations**: Transcription can take minutes
- ✅ **Streaming**: Gradio async streaming for better UX
- ✅ **Concurrency**: Handle multiple tool calls efficiently

**Implementation**:

```python
async def generate_response(self, message, history):
    initial_state = {"message": message, "history": history,
                     "llm_messages": [], "response": "", "error": None}
    async for event in self.graph.astream(initial_state):
        # Extract the partial response from each graph event (details elided)
        yield response_chunk
```
---

## State Management

### LangGraph State

**Structure**: TypedDict with an annotated message list

```python
class ConversationalAgentState(TypedDict):
    message: str                                      # Current query
    history: List[List[str]]                          # Gradio format
    llm_messages: Annotated[List[Any], add_messages]  # LangChain format
    response: str                                     # Generated response
    error: Optional[str]                              # Error tracking
```

**State Transitions**:

1. **Prepare**: `history` → `llm_messages` (format conversion)
2. **Agent**: `llm_messages` → `llm_messages` (append AI response)
3. **Tools**: `llm_messages` → `llm_messages` (append tool results)

**Persistence**: In-memory only, no database persistence (stateless per session)
---

### Video Processing State

**Purpose**: Track the video upload workflow across multiple tool calls

**Storage**: Global dictionary in `src/tools/video.py`

```python
_video_state = {
    "uploaded_video_path": None,
    "transcription_text": None,
    "transcription_segments": None,
    "timing_info": None,
    "show_video_upload": False,
    "show_transcription_editor": False,
    "transcription_in_progress": False
}
```

**Lifecycle**:

1. `request_video_upload()` → sets `show_video_upload = True`
2. `transcribe_uploaded_video()` → stores the transcript
3. `upload_transcription_to_pinecone()` → clears the state

**Reset**: Automatic after a successful upload, or manual via `cancel_video_workflow()`
---

### UI State Synchronization

**Challenge**: Keep the Gradio UI in sync with the agent's state

**Solution**: Tools return UI state changes via `get_video_state()`

```python
# Tool returns the current UI-relevant state
state = get_video_state()
return {
    "show_upload": state["show_video_upload"],
    "show_editor": state["show_transcription_editor"],
    "transcript": state["transcription_text"]
}
```

**Gradio Integration**: UI components update based on the returned state
---

## Scalability & Performance

### Concurrency

**Current**: Single-user sessions (Gradio default)

**Scalability**:

- ✅ Stateless agent (can handle multiple sessions)
- ✅ Pinecone auto-scales
- ✅ MCP servers deployed independently
- ⚠️ WhisperX requires a GPU (bottleneck for concurrent transcriptions)

**Future Improvements**:

- Queue system for transcription jobs
- Separate transcription service (microservice)
- Redis for shared state across instances

---

### Caching

**Current Caching**:

- ❌ No LLM response caching
- ❌ No embedding caching
- ✅ Pinecone handles vector index caching

**Future Improvements**:

- Cache frequent queries (e.g., "list meetings")
- Cache embeddings for repeated text
- LangChain cache for LLM responses

---

### Performance Bottlenecks

1. **Transcription**: 2-5 minutes for a typical meeting (GPU-dependent)
2. **Metadata Extraction**: 5-10 seconds (GPT-4o-mini API call)
3. **Embedding**: 1-2 seconds per chunk (OpenAI API)
4. **Pinecone Upsert**: 1-3 seconds for a typical meeting

**Optimization Strategies**:

- Parallel embedding generation
- Batch Pinecone upserts (see the sketch below)
- Async MCP calls
- Streaming responses to the user
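
Batching upserts is straightforward with the Pinecone client. A sketch that groups vectors into batches of 100; the batch size is an illustrative choice, and `records` is assumed to be a list of dicts shaped like the schema above:

```python
BATCH_SIZE = 100  # Pinecone accepts multiple vectors per upsert call

def batched_upsert(index, records, namespace="development"):
    """Upsert records in fixed-size batches instead of one call per chunk."""
    for start in range(0, len(records), BATCH_SIZE):
        batch = records[start:start + BATCH_SIZE]
        index.upsert(vectors=batch, namespace=namespace)
```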
---

## Security Architecture

### API Key Management

**Storage**: Environment variables via a `.env` file

```bash
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
NOTION_TOKEN=secret_...
```

**Access**: Loaded via `python-dotenv` in `src/config/settings.py`
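
A minimal sketch of that loading pattern; the variable names mirror the `.env` example above, and the actual contents of `src/config/settings.py` may differ:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
NOTION_TOKEN = os.getenv("NOTION_TOKEN")

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; check .env or the Space secrets")
```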
**Best Practices**:

- ✅ Never commit `.env` to git (`.gitignore` is configured accordingly)
- ✅ Use HuggingFace Spaces secrets for deployment
- ✅ Rotate keys regularly

---

### Data Privacy

**User Data**:

- Video files: Stored temporarily, deleted after processing
- Transcripts: Stored in Pinecone (user-controlled index)
- Conversation history: In-memory only, not persisted

**Third-Party Data Sharing**:

- OpenAI: Transcripts are sent for embedding and metadata extraction
- Pinecone: Encrypted at rest and in transit
- Notion: Only accessed with the user's token

**Compliance**:

- GDPR: Users can delete the Pinecone index
- Data retention: No long-term storage of raw videos

---

### MCP Server Security

**Notion MCP**:

- Authentication: User's Notion token
- Permissions: Limited to the token's access scope
- Transport: stdio (local process, no network exposure)

**Time MCP**:

- Authentication: None required (public API)
- Transport: HTTPS (TLS encrypted)
- Rate limiting: HuggingFace Spaces default limits

**Zoom MCP** (planned):

- Authentication: OAuth 2.0
- Webhook validation: HMAC-SHA256 signature
- Transport: HTTPS + WebSocket (TLS)
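
HMAC-SHA256 webhook validation follows a standard pattern. A generic sketch with Python's standard library; header names and the exact message format Zoom signs are assumptions, not the finalized implementation:

```python
import hashlib
import hmac

def verify_webhook_signature(secret: str, raw_body: bytes, received_signature: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the signature delivered in the webhook header."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels when comparing signatures
    return hmac.compare_digest(expected, received_signature)
```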
---

## Technology Stack

### Core Framework

- **Python**: 3.11+
- **LangGraph**: Agent orchestration
- **LangChain**: Tool abstractions, message handling
- **Gradio**: Web UI framework

### AI/ML Models

- **OpenAI GPT-3.5-turbo**: Agent reasoning
- **OpenAI GPT-4o-mini**: Metadata extraction
- **OpenAI text-embedding-3-small**: Vector embeddings
- **WhisperX**: Speech-to-text transcription
- **Pyannote**: Speaker diarization

### Storage & Databases

- **Pinecone**: Vector database (serverless)
- **Local filesystem**: Temporary video storage

### External Integrations

- **Notion API**: Via MCP server
- **Custom Time API**: Via Gradio MCP server
- **Zoom API** (planned): Via custom MCP server

### Development Tools

- **Docker**: Containerization
- **FFmpeg**: Audio extraction
- **pytest**: Testing (planned)
- **LangSmith**: Tracing and debugging (optional)

### Deployment

- **HuggingFace Spaces**: Primary deployment platform
- **Docker**: Container runtime
- **Environment Variables**: Configuration management

---

## Related Documentation

- [TECHNICAL_IMPLEMENTATION.md](TECHNICAL_IMPLEMENTATION.md) - Detailed tool reference and code examples
- [DEPLOYMENT_GUIDE.md](DEPLOYMENT_GUIDE.md) - Step-by-step deployment instructions
- [README.md](../README.md) - Project overview and quick start

---

## Version History

- **v4.0** (Current): LangGraph-based conversational agent with MCP integration
- **v3.0**: Experimental agent patterns
- **v2.0**: Basic agent with video processing
- **v1.0**: Initial prototype

---

**Last Updated**: December 5, 2025
**Maintained By**: Meeting Intelligence Agent Team