# System Architecture
> **Meeting Intelligence Agent - Technical Architecture Documentation**
This document provides a high-level overview of the system architecture, design decisions, and component relationships for the Meeting Intelligence Agent.
---
## πŸ“‹ Table of Contents
- [System Overview](#-system-overview)
- [Architecture Diagram](#-architecture-diagram)
- [Core Components](#-core-components)
- [Data Flow](#-data-flow)
- [Key Design Decisions](#-key-design-decisions)
- [State Management](#-state-management)
- [Scalability & Performance](#-scalability--performance)
- [Security Architecture](#-security-architecture)
- [Technology Stack](#-technology-stack)
---
## 🎯 System Overview
The Meeting Intelligence Agent is a **conversational AI system** built on LangGraph that orchestrates meeting video processing, transcription, storage, and intelligent querying through natural language interaction.
### Core Capabilities
1. **Video Processing Pipeline**: Upload β†’ Transcription β†’ Speaker Diarization β†’ Metadata Extraction β†’ Vector Storage
2. **Semantic Search**: RAG-based querying across meeting transcripts using natural language
3. **External Integrations**: MCP (Model Context Protocol) servers for Notion and time-aware queries
4. **Conversational Interface**: Gradio-based chat UI with file upload support
### Design Philosophy
- **Conversational-First**: All functionality accessible through natural language
- **Modular Architecture**: Clear separation between UI, agent, tools, and services
- **Extensible**: MCP protocol enables easy addition of new capabilities
- **Async-Ready**: Supports long-running operations (transcription, MCP calls)
- **Production-Ready**: Docker support, error handling, graceful degradation
---
## πŸ—οΈ Architecture Diagram
```mermaid
graph TB
subgraph "Frontend Layer"
UI[Gradio Interface]
Chat[Chat Component]
Upload[File Upload]
Editor[Transcript Editor]
end
subgraph "Agent Layer (LangGraph)"
Agent[Conversational Agent]
StateMachine[State Machine]
ToolRouter[Tool Router]
end
subgraph "Tool Layer"
VideoTools[Video Processing Tools]
QueryTools[Meeting Query Tools]
MCPTools[MCP Integration Tools]
end
subgraph "Processing Layer"
WhisperX[WhisperX Transcription]
Pyannote[Speaker Diarization]
MetadataExtractor[GPT-4o-mini Metadata]
Embeddings[OpenAI Embeddings]
end
subgraph "Storage Layer"
Pinecone[(Pinecone Vector DB)]
LocalState[Local State Cache]
end
subgraph "External Services"
OpenAI[OpenAI API]
NotionMCP[Notion MCP Server]
TimeMCP[Time MCP Server]
ZoomMCP[Zoom MCP Server<br/>In Development]
end
UI --> Agent
Agent --> StateMachine
StateMachine --> ToolRouter
ToolRouter --> VideoTools
ToolRouter --> QueryTools
ToolRouter --> MCPTools
VideoTools --> WhisperX
VideoTools --> Pyannote
VideoTools --> MetadataExtractor
VideoTools --> Embeddings
QueryTools --> Embeddings
QueryTools --> Pinecone
MCPTools --> NotionMCP
MCPTools --> TimeMCP
MCPTools -.-> ZoomMCP
Embeddings --> Pinecone
MetadataExtractor --> OpenAI
Agent --> OpenAI
classDef frontend fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000000
classDef agent fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000000
classDef tools fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000000
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
classDef storage fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:#000000
classDef external fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000000
class UI,Chat,Upload,Editor frontend
class Agent,StateMachine,ToolRouter agent
class VideoTools,QueryTools,MCPTools tools
class WhisperX,Pyannote,MetadataExtractor,Embeddings processing
class Pinecone,LocalState storage
class OpenAI,NotionMCP,TimeMCP,ZoomMCP external
```
---
## 🧩 Core Components
### 1. Frontend Layer (Gradio)
**Purpose**: User interface for interaction and file management
**Components**:
- **Chat Interface**: Primary conversational UI using `gr.ChatInterface`
- **File Upload**: Video file upload widget
- **Transcript Editor**: Editable text area for manual corrections
- **State Display**: Real-time feedback on processing status
**Technology**: Gradio 5.x with async support
**Key Files**:
- `src/ui/gradio_app.py` - UI component definitions and event handlers
---
### 2. Agent Layer (LangGraph)
**Purpose**: Orchestrates the entire workflow through conversational AI
**Architecture**: State machine with three nodes:
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”
β”‚ PREPARE β”‚ --> β”‚ AGENT β”‚ --> β”‚ TOOLS β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”˜
                    ↑             β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
**Components**:
1. **Prepare Node**: Converts chat history to LangChain messages
2. **Agent Node**: LLM decides which tools to call
3. **Tools Node**: Executes selected tools
4. **Conditional Router**: Determines if more tool calls are needed
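The node cycle above can be illustrated with a minimal pure-Python sketch. This is *not* LangGraph itself (the real implementation uses `StateGraph` in `src/agents/conversational.py`); the node functions and message shapes here are simplified stand-ins:

```python
# Minimal sketch of the prepare -> agent -> tools loop (illustrative only;
# the real implementation wires these as LangGraph nodes with a
# conditional edge acting as the router).

def prepare(state):
    # Convert Gradio-style [user, assistant] history pairs into a flat
    # LLM message list, then append the current query.
    messages = []
    for user_msg, ai_msg in state["history"]:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": ai_msg})
    messages.append({"role": "user", "content": state["message"]})
    state["llm_messages"] = messages
    return state

def run(state, agent_node, tools_node):
    state = prepare(state)
    while True:
        ai_message = agent_node(state["llm_messages"])
        state["llm_messages"].append(ai_message)
        if not ai_message.get("tool_calls"):   # conditional router: done?
            state["response"] = ai_message["content"]
            return state
        # Tools node executes the requested calls and appends results.
        state["llm_messages"].extend(tools_node(ai_message["tool_calls"]))
```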
**State Structure**:
```python
{
    "message": str,                 # Current user query
    "history": List[List[str]],     # Conversation history
    "llm_messages": List[Message],  # LangChain message format
    "response": str,                # Generated response
    "error": Optional[str]          # Error tracking
}
```
**Key Files**:
- `src/agents/conversational.py` - LangGraph agent implementation (570 lines)
---
### 3. Tool Layer
**Purpose**: Provides discrete capabilities that the agent can invoke
**Categories**:
#### Video Processing Tools (8 tools)
- File upload management
- Transcription orchestration
- Speaker name mapping
- Transcript editing
- Pinecone upload
#### Meeting Query Tools (6 tools)
- Semantic search
- Metadata retrieval
- Meeting listing
- Text upsert
- Notion import/export
#### MCP Integration Tools (6+ tools)
- Notion API operations
- Time queries
- Future: Zoom RTMS
**Design Pattern**: LangChain `@tool` decorator for automatic schema generation
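The `@tool` decorator derives a call schema from each function's signature and docstring. A rough stdlib-only approximation of what it extracts (illustrative; `tool_schema` is a hypothetical helper, and the real decorator comes from `langchain_core.tools`):

```python
import inspect

def tool_schema(fn):
    """Derive a minimal call schema from a function, roughly as @tool does.

    Hypothetical helper for illustration; LangChain's real schema also
    includes per-argument descriptions and JSON-schema types.
    """
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {
            name: param.annotation.__name__
            for name, param in sig.parameters.items()
            if param.annotation is not inspect.Parameter.empty
        },
    }

def search_meetings(query: str, top_k: int = 5) -> str:
    """Semantic search across stored meeting transcripts."""
    ...

schema = tool_schema(search_meetings)
```

This is why the tool functions' docstrings matter: the agent's LLM only sees the generated schema, not the implementation.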
**Key Files**:
- `src/tools/video.py` - Video processing tools (528 lines)
- `src/tools/general.py` - Query and integration tools (577 lines)
- `src/tools/mcp/` - MCP client wrappers
---
### 4. Processing Layer
**Purpose**: Handles compute-intensive operations
**Components**:
#### WhisperX Transcription
- **Model**: Configurable (tiny/small/medium/large)
- **Features**: Word-level timestamps, language detection
- **Performance**: GPU-accelerated when available
#### Pyannote Speaker Diarization
- **Model**: `pyannote/speaker-diarization-3.1`
- **Output**: Speaker segments with timestamps
- **Integration**: Aligned with WhisperX word timestamps
#### Metadata Extraction
- **Model**: GPT-4o-mini (cost-optimized)
- **Extracts**: Title, date, summary, speaker mapping
- **Format**: Structured JSON output
#### Embeddings
- **Model**: OpenAI `text-embedding-3-small`
- **Dimension**: 1536
- **Usage**: Query and document embedding
**Key Files**:
- `src/processing/transcription.py` - WhisperX + Pyannote pipeline
- `src/processing/metadata_extractor.py` - GPT-4o-mini extraction
---
### 5. Storage Layer
**Purpose**: Persistent and temporary data storage
#### Pinecone Vector Database
- **Type**: Serverless
- **Index**: `meeting-transcripts-1-dev`
- **Namespace**: Environment-based (`development`/`production`)
- **Metadata**: Rich metadata for filtering (title, date, source, speakers)
**Schema**:
```python
{
    "id": "meeting_abc12345_chunk_001",
    "values": [...],  # 1536-dim embedding vector
    "metadata": {
        "meeting_id": "meeting_abc12345",
        "meeting_title": "Q4 Planning",
        "meeting_date": "2024-12-07",
        "summary": "...",
        "speaker_mapping": {...},
        "source": "video",
        "chunk_index": 1,
        "text": "actual transcript chunk"
    }
}
```
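Assembling one record in this schema is straightforward. A sketch with a hypothetical `build_record` helper (not part of the codebase; the zero-padded chunk ID format is taken from the example above):

```python
def build_record(meeting_id, chunk_index, embedding, text, meta):
    """Assemble one Pinecone record following the schema above.

    Hypothetical helper for illustration; the real upsert path lives in
    src/retrievers/pinecone.py.
    """
    return {
        "id": f"{meeting_id}_chunk_{chunk_index:03d}",
        "values": embedding,  # 1536-dim vector in practice
        "metadata": {**meta, "meeting_id": meeting_id,
                     "chunk_index": chunk_index, "text": text},
    }

record = build_record(
    "meeting_abc12345", 1, [0.0] * 1536, "actual transcript chunk",
    {"meeting_title": "Q4 Planning", "meeting_date": "2024-12-07",
     "source": "video"},
)
```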
#### Local State Cache
- **Purpose**: Temporary storage for video processing workflow
- **Scope**: In-memory, per-session
- **Contents**: Uploaded video path, transcription text, timing info
**Key Files**:
- `src/retrievers/pinecone.py` - Vector database manager
---
### 6. External Services
**Purpose**: Third-party APIs and custom MCP servers
#### OpenAI API
- **Models**: GPT-3.5-turbo (agent), GPT-4o-mini (metadata)
- **Usage**: Agent reasoning, metadata extraction, embeddings
#### Notion MCP Server
- **Type**: Official `@notionhq/notion-mcp-server`
- **Transport**: stdio (local subprocess)
- **Capabilities**: Search, read, create, update pages
#### Time MCP Server (Custom)
- **Type**: Gradio-based MCP server
- **Transport**: SSE (Server-Sent Events)
- **Deployment**: HuggingFace Spaces
- **URL**: `https://gfiamon-date-time-mpc-server-tool.hf.space/gradio_api/mcp/sse`
- **Purpose**: Time-aware query support
#### Zoom RTMS Server (In Development)
- **Type**: FastAPI + Gradio hybrid
- **Transport**: stdio + webhooks
- **Status**: Prototype, API integration pending
- **Purpose**: Live meeting transcription
**Key Files**:
- `src/tools/mcp/mcp_manager.py` - Multi-server MCP client
- `external_mcp_servers/time_mcp_server/` - Custom time server
- `external_mcp_servers/zoom_mcp/` - Zoom RTMS prototype
---
## πŸ”„ Data Flow
### Video Upload Flow
```
User uploads video.mp4
↓
Gradio saves to temp directory
↓
Agent calls transcribe_uploaded_video(path)
↓
WhisperX extracts audio + transcribes
↓
Pyannote identifies speakers
↓
Alignment: Match speakers to transcript
↓
Format: SPEAKER_00, SPEAKER_01, etc.
↓
Return formatted transcript to agent
↓
Agent shows transcript to user
↓
User optionally edits or updates speaker names
↓
Agent calls upload_transcription_to_pinecone()
↓
GPT-4o-mini extracts metadata
↓
Text chunked into semantic segments
↓
OpenAI embeddings generated
↓
Upsert to Pinecone with metadata
↓
Return meeting_id to user
```
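The "text chunked into semantic segments" step can be sketched as a sliding word window with overlap, so adjacent chunks share context. The window size and overlap below are assumptions; the pipeline's actual chunking parameters may differ:

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split a transcript into overlapping word windows.

    Illustrative sketch; the real pipeline may chunk on semantic
    boundaries and use different sizes.
    """
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Overlap matters for retrieval quality: a sentence that straddles a chunk boundary still appears intact in at least one chunk.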
### Query Flow
```
User asks: "What action items were assigned last Tuesday?"
↓
Agent receives query
↓
Agent calls get_time_for_city("Berlin") [Time MCP]
↓
Time server returns: "2024-12-07"
↓
Agent calculates: "Last Tuesday = 2024-12-03"
↓
Agent calls search_meetings(query="action items", date_filter="2024-12-03")
↓
Query embedded via OpenAI
↓
Pinecone vector search
↓
Top-k chunks retrieved with metadata
↓
Results returned to agent
↓
Agent synthesizes answer from chunks
↓
Response streamed to user
```
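The "last Tuesday" step in the flow above is plain date arithmetic once the Time MCP server has supplied today's date. A minimal sketch of that reasoning (the agent does this via the LLM, not this exact code):

```python
from datetime import date, timedelta

def last_weekday(today, weekday):
    """Most recent occurrence of `weekday` strictly before `today`
    (Monday=0 ... Sunday=6)."""
    delta = (today.weekday() - weekday) % 7
    return today - timedelta(days=delta or 7)

# 2024-12-07 is a Saturday; the previous Tuesday is 2024-12-03,
# matching the example flow above.
assert last_weekday(date(2024, 12, 7), 1) == date(2024, 12, 3)
```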
### Notion Integration Flow
```
User: "Import 'Meeting 3' from Notion"
↓
Agent calls import_notion_to_pinecone(query="Meeting 3")
↓
Tool calls Notion MCP: API-post-search(query="Meeting 3")
↓
Notion returns page_id
↓
Tool calls API-retrieve-a-page(page_id) β†’ metadata
↓
Tool calls API-get-block-children(page_id) β†’ content blocks
↓
Recursive extraction of nested blocks
↓
Full text assembled
↓
GPT-4o-mini extracts metadata
↓
Text chunked and embedded
↓
Upsert to Pinecone
↓
Return success message with meeting_id
```
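The "recursive extraction of nested blocks" step walks the Notion block tree depth-first, keeping parent text before child text. A simplified sketch over a toy block shape (real Notion block JSON has typed rich-text arrays and pagination; this only shows the recursion):

```python
def extract_text(blocks):
    """Recursively flatten Notion-style nested blocks into text lines.

    Sketch only: the real tool calls API-get-block-children per level
    and handles the full Notion block schema.
    """
    lines = []
    for block in blocks:
        if block.get("text"):
            lines.append(block["text"])
        lines.extend(extract_text(block.get("children", [])))
    return lines

blocks = [
    {"text": "Agenda", "children": [{"text": "Budget review"}]},
    {"text": "Decisions"},
]
full_text = "\n".join(extract_text(blocks))
```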
---
## 🎨 Key Design Decisions
### 1. Why LangGraph?
**Decision**: Use LangGraph instead of LangChain's AgentExecutor or other frameworks
**Rationale**:
- βœ… **Explicit state management**: Full control over conversation state
- βœ… **Async support**: Required for MCP tools (Notion API)
- βœ… **Debugging**: Clear visibility into state transitions
- βœ… **Flexibility**: Easy to add custom nodes and conditional routing
- βœ… **Streaming**: Native support for response streaming
**Alternative Considered**: LangChain AgentExecutor (rejected due to limited async support)
---
### 2. Why Separate MCP Servers?
**Decision**: Deploy custom MCP servers in `external_mcp_servers/` as standalone applications
**Rationale**:
- βœ… **Independent scaling**: Time server can handle multiple agents
- βœ… **Deployment flexibility**: Update servers without redeploying agent
- βœ… **Development isolation**: Test MCP servers independently
- βœ… **Reusability**: Other projects can use the same MCP servers
- βœ… **Transport options**: HTTP (SSE) for remote, stdio for local
**Architecture**:
```
Main Agent (HF Space 1)
↓ HTTP/SSE
Time MCP Server (HF Space 2)
↓ HTTP/SSE
Zoom MCP Server (HF Space 3)
```
**Alternative Considered**: Embed MCP servers in main app (rejected due to coupling)
---
### 3. Why Pinecone Serverless?
**Decision**: Use Pinecone serverless for vector storage
**Rationale**:
- βœ… **No infrastructure management**: Fully managed
- βœ… **Cost-effective**: Pay per usage, no idle costs
- βœ… **Scalability**: Auto-scales with demand
- βœ… **Metadata filtering**: Rich filtering capabilities
- βœ… **Namespaces**: Environment isolation (dev/prod)
**Alternative Considered**: Chroma (rejected due to self-hosting requirements)
---
### 4. Why GPT-3.5-turbo for Agent?
**Decision**: Use GPT-3.5-turbo instead of GPT-4 for agent reasoning
**Rationale**:
- βœ… **Cost**: 10x cheaper than GPT-4
- βœ… **Speed**: Faster response times
- βœ… **Sufficient**: Tool calling works well with 3.5-turbo
- βœ… **Budget**: GPT-4o-mini used for metadata extraction (specialized task)
**Cost Comparison** (per 1M tokens):
- GPT-3.5-turbo: $0.50 input / $1.50 output
- GPT-4: $30 input / $60 output
- GPT-4o-mini: $0.15 input / $0.60 output
---
### 5. Why Async Patterns?
**Decision**: Use `async/await` throughout the agent
**Rationale**:
- βœ… **MCP requirement**: Notion MCP tools are async
- βœ… **Long operations**: Transcription can take minutes
- βœ… **Streaming**: Gradio async streaming for better UX
- βœ… **Concurrency**: Handle multiple tool calls efficiently
**Implementation**:
```python
async def generate_response(self, message, history):
    async for event in self.graph.astream(initial_state):
        # Process events
        yield response_chunk
```
---
## πŸ—‚οΈ State Management
### LangGraph State
**Structure**: TypedDict with annotated message list
```python
class ConversationalAgentState(TypedDict):
    message: str                                      # Current query
    history: List[List[str]]                          # Gradio format
    llm_messages: Annotated[List[Any], add_messages]  # LangChain format
    response: str                                     # Generated response
    error: Optional[str]                              # Error tracking
```
**State Transitions**:
1. **Prepare**: `history` β†’ `llm_messages` (format conversion)
2. **Agent**: `llm_messages` β†’ `llm_messages` (append AI response)
3. **Tools**: `llm_messages` β†’ `llm_messages` (append tool results)
**Persistence**: In-memory only, no database persistence (stateless per session)
---
### Video Processing State
**Purpose**: Track video upload workflow across multiple tool calls
**Storage**: Global dictionary in `src/tools/video.py`
```python
_video_state = {
    "uploaded_video_path": None,
    "transcription_text": None,
    "transcription_segments": None,
    "timing_info": None,
    "show_video_upload": False,
    "show_transcription_editor": False,
    "transcription_in_progress": False
}
```
**Lifecycle**:
1. `request_video_upload()` β†’ sets `show_video_upload = True`
2. `transcribe_uploaded_video()` β†’ stores transcript
3. `upload_transcription_to_pinecone()` β†’ clears state
**Reset**: Automatic after successful upload or manual via `cancel_video_workflow()`
---
### UI State Synchronization
**Challenge**: Keep Gradio UI in sync with agent state
**Solution**: Tools return UI state changes via `get_video_state()`
```python
# Tool returns state
state = get_video_state()
return {
    "show_upload": state["show_video_upload"],
    "show_editor": state["show_transcription_editor"],
    "transcript": state["transcription_text"]
}
```
**Gradio Integration**: UI components update based on returned state
---
## ⚑ Scalability & Performance
### Concurrency
**Current**: Single-user sessions (Gradio default)
**Scalability**:
- βœ… Stateless agent (can handle multiple sessions)
- βœ… Pinecone auto-scales
- βœ… MCP servers deployed independently
- ⚠️ WhisperX requires GPU (bottleneck for concurrent transcriptions)
**Future Improvements**:
- Queue system for transcription jobs
- Separate transcription service (microservice)
- Redis for shared state across instances
---
### Caching
**Current Caching**:
- ❌ No LLM response caching
- ❌ No embedding caching
- βœ… Pinecone handles vector index caching
**Future Improvements**:
- Cache frequent queries (e.g., "list meetings")
- Cache embeddings for repeated text
- LangChain cache for LLM responses
---
### Performance Bottlenecks
1. **Transcription**: 2-5 minutes for typical meeting (GPU-dependent)
2. **Metadata Extraction**: 5-10 seconds (GPT-4o-mini API call)
3. **Embedding**: 1-2 seconds per chunk (OpenAI API)
4. **Pinecone Upsert**: 1-3 seconds for typical meeting
**Optimization Strategies**:
- Parallel embedding generation
- Batch Pinecone upserts
- Async MCP calls
- Streaming responses to user
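The first two strategies combine naturally: fan out embedding calls with `asyncio.gather`, then group the results into batches for upsert. A stdlib sketch with a stand-in `embed` coroutine (the real calls hit the OpenAI and Pinecone APIs; the batch size is an assumption):

```python
import asyncio

async def embed(chunk):
    """Stand-in for an embedding API call (the real call hits OpenAI)."""
    await asyncio.sleep(0)  # simulate network I/O
    return [float(len(chunk))]

async def embed_all(chunks, batch_size=100):
    """Embed chunks concurrently, then yield upsert-sized batches."""
    vectors = await asyncio.gather(*(embed(c) for c in chunks))
    for i in range(0, len(vectors), batch_size):
        yield vectors[i:i + batch_size]

async def main():
    return [b async for b in embed_all(["a", "bb", "ccc"], batch_size=2)]

batches = asyncio.run(main())
```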
---
## πŸ”’ Security Architecture
### API Key Management
**Storage**: Environment variables via `.env` file
```bash
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
NOTION_TOKEN=secret_...
```
**Access**: Loaded via `python-dotenv` in `src/config/settings.py`
**Best Practices**:
- βœ… Never commit `.env` to git (`.gitignore` configured)
- βœ… Use HuggingFace Spaces secrets for deployment
- βœ… Rotate keys regularly
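A common companion to this setup is a fail-fast accessor that surfaces a missing secret immediately at startup rather than as a confusing API error later. A sketch of that pattern (`require_env` is a hypothetical helper; the actual code in `src/config/settings.py` may differ):

```python
import os

def require_env(name):
    """Fetch a required secret, failing fast with a clear message.

    Hypothetical helper illustrating the fail-fast pattern; pair it with
    python-dotenv's load_dotenv() so .env values reach os.environ.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# e.g. OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```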
---
### Data Privacy
**User Data**:
- Video files: Stored temporarily, deleted after processing
- Transcripts: Stored in Pinecone (user-controlled index)
- Conversation history: In-memory only, not persisted
**Third-Party Data Sharing**:
- OpenAI: Transcripts sent for embedding/metadata extraction
- Pinecone: Encrypted at rest and in transit
- Notion: Only accessed with user's token
**Compliance**:
- GDPR: User can delete Pinecone index
- Data retention: No long-term storage of raw videos
---
### MCP Server Security
**Notion MCP**:
- Authentication: User's Notion token
- Permissions: Limited to token's access scope
- Transport: stdio (local process, no network exposure)
**Time MCP**:
- Authentication: None required (public API)
- Transport: HTTPS (TLS encrypted)
- Rate limiting: HuggingFace Spaces default limits
**Zoom MCP** (planned):
- Authentication: OAuth 2.0
- Webhook validation: HMAC-SHA256 signature
- Transport: HTTPS + WebSocket (TLS)
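HMAC-SHA256 webhook validation follows a standard shape: recompute the signature over the raw body with the shared secret, then compare in constant time. A generic stdlib sketch (Zoom's actual header names and signing-string format differ and should be taken from its webhook docs):

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Validate an HMAC-SHA256 webhook signature in constant time.

    Generic sketch; real providers typically sign a composed string
    (e.g. version + timestamp + body), not the bare body.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels on the comparison
    return hmac.compare_digest(expected, signature_hex)
```

Note the constant-time comparison: a plain `==` on hex strings can leak how many leading characters match.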
---
## πŸ› οΈ Technology Stack
### Core Framework
- **Python**: 3.11+
- **LangGraph**: Agent orchestration
- **LangChain**: Tool abstractions, message handling
- **Gradio**: Web UI framework
### AI/ML Models
- **OpenAI GPT-3.5-turbo**: Agent reasoning
- **OpenAI GPT-4o-mini**: Metadata extraction
- **OpenAI text-embedding-3-small**: Vector embeddings
- **WhisperX**: Speech-to-text transcription
- **Pyannote**: Speaker diarization
### Storage & Databases
- **Pinecone**: Vector database (serverless)
- **Local filesystem**: Temporary video storage
### External Integrations
- **Notion API**: Via MCP server
- **Custom Time API**: Via Gradio MCP server
- **Zoom API** (planned): Via custom MCP server
### Development Tools
- **Docker**: Containerization
- **FFmpeg**: Audio extraction
- **pytest**: Testing (planned)
- **LangSmith**: Tracing and debugging (optional)
### Deployment
- **HuggingFace Spaces**: Primary deployment platform
- **Docker**: Container runtime
- **Environment Variables**: Configuration management
---
## πŸ“š Related Documentation
- [TECHNICAL_IMPLEMENTATION.md](TECHNICAL_IMPLEMENTATION.md) - Detailed tool reference and code examples
- [DEPLOYMENT_GUIDE.md](DEPLOYMENT_GUIDE.md) - Step-by-step deployment instructions
- [README.md](../README.md) - Project overview and quick start
---
## πŸ”„ Version History
- **v4.0** (Current): LangGraph-based conversational agent with MCP integration
- **v3.0**: Experimental agent patterns
- **v2.0**: Basic agent with video processing
- **v1.0**: Initial prototype
---
**Last Updated**: December 5, 2025
**Maintained By**: Meeting Intelligence Agent Team