mcp-server

Paused

App Files Files Community

mcp-server / ARCHITECTURE.md

NiWaRe

refactor for stateless: turn stateless on for FastMCP to work with OpenAI client etc

40e1a91 5 months ago

preview code

raw

history blame contribute delete

9.27 kB

	# W&B MCP Server - Architecture & Scalability Guide

	## Table of Contents
	1. [Architecture Decision](#architecture-decision)
	2. [Stateless HTTP Design](#stateless-http-design)
	3. [Performance & Scalability](#performance--scalability)
	4. [Load Test Results](#load-test-results)
	5. [Deployment Recommendations](#deployment-recommendations)

	---

	## Architecture Decision

	### Decision: Pure Stateless HTTP Mode

	The W&B MCP Server uses pure stateless HTTP mode (`stateless_http=True`).

	This fundamental architecture decision enables:
	- ✅ Universal client compatibility (OpenAI, Cursor, LeChat, Claude)
	- ✅ Horizontal scaling capabilities
	- ✅ Simpler operations and maintenance
	- ✅ Cloud-native deployment patterns

	### Why Stateless?

	The Model Context Protocol traditionally used stateful sessions, but this created issues:

	\| Client \| Behavior \| Problem with Stateful \|
	\|--------\|----------\|----------------------\|
	\| OpenAI \| Deletes session after listing tools, then reuses ID \| Session not found errors \|
	\| Cursor \| Sends Bearer token with every request \| Expects stateless behavior \|
	\| Claude \| Can work with either model \| No issues \|

	### The Solution

	```python
	# Pure stateless operation - no session persistence
	mcp = FastMCP("wandb-mcp-server", stateless_http=True)
	```

	With this approach:
	- Session IDs are correlation IDs only - they match requests to responses
	- No state persists between requests - each request is independent
	- Authentication required per request - Bearer token must be included
	- Any worker can handle any request - enables horizontal scaling

	---

	## Stateless HTTP Design

	### Architecture Overview

	```
	┌─────────────────────────────────────┐
	│ MCP Clients (OpenAI/Cursor/etc) │
	│ Bearer Token with Each Request │
	└─────────────┬───────────────────────┘
	│ HTTPS
	┌─────────────▼───────────────────────┐
	│ Load Balancer (Optional) │
	│ Round-Robin Distribution │
	└──┬──────────┬──────────┬────────────┘
	│ │ │
	┌──▼───┐ ┌──▼───┐ ┌──▼───┐
	│ W1 │ │ W2 │ │ W3 │ (Multiple Workers Possible)
	│ │ │ │ │ │
	│ ASGI │ │ ASGI │ │ ASGI │ Uvicorn/Gunicorn
	└──┬───┘ └──┬───┘ └──┬───┘
	│ │ │
	┌──▼──────────▼──────────▼────────────┐
	│ FastAPI Application │
	│ ┌────────────────────────────┐ │
	│ │ Stateless Auth Middleware │ │
	│ │ (Bearer Token Validation) │ │
	│ └────────────────────────────┘ │
	│ ┌────────────────────────────┐ │
	│ │ MCP Stateless Handler │ │
	│ │ (No Session Storage) │ │
	│ └────────────────────────────┘ │
	└─────────────┬───────────────────────┘
	│
	┌─────────────▼───────────────────────┐
	│ W&B API Integration │
	└─────────────────────────────────────┘
	```

	### Request Flow

	1. Client sends request with Bearer token and session ID
	2. Middleware validates Bearer token
	3. MCP processes request (session ID used for correlation only)
	4. Response sent with matching session ID
	5. No state persisted - request complete

	### Key Implementation Details

	```python
	async def thread_safe_auth_middleware(request: Request, call_next):
	"""Stateless authentication middleware."""

	# Session IDs are correlation IDs only
	session_id = request.headers.get("Mcp-Session-Id")
	if session_id:
	logger.debug(f"Correlation ID: {session_id[:8]}...")

	# Every request must have Bearer token
	authorization = request.headers.get("Authorization", "")
	if authorization.startswith("Bearer "):
	api_key = authorization[7:].strip()
	# Use API key for this request only
	# No session storage or retrieval
	```

	---

	## Performance & Scalability

	### Single Worker Performance

	Based on testing with stateless mode:

	\| Metric \| Local Server \| Remote (HF Spaces) \|
	\|--------\|--------------\|-------------------\|
	\| Max Concurrent \| 1000 clients \| 500+ clients \|
	\| Throughput \| ~50-60 req/s \| ~35 req/s \|
	\| Latency (p50) \| <500ms \| <2s \|
	\| Memory Usage \| 200-500MB \| 300-600MB \|

	### Horizontal Scaling Potential

	With stateless mode, the server supports true horizontal scaling:

	\| Workers \| Max Concurrent \| Total Throughput \| Notes \|
	\|---------\|----------------\|------------------\|-------\|
	\| 1 \| 1000 \| ~50 req/s \| Current deployment \|
	\| 2 \| 2000 \| ~100 req/s \| Linear scaling \|
	\| 4 \| 4000 \| ~200 req/s \| Near-linear \|
	\| 8 \| 8000 \| ~400 req/s \| Some overhead \|

	Key Advantage: No session affinity required - any worker can handle any request!

	---

	## Load Test Results

	### Latest Test Results (2025-09-25)

	#### Local Server (MacOS, Single Worker)

	\| Concurrent Clients \| Success Rate \| Throughput \| Mean Response \|
	\|--------------------\|-------------\|------------\|---------------\|
	\| 10 \| 100% \| 47 req/s \| 89ms \|
	\| 100 \| 100% \| 47 req/s \| 1.2s \|
	\| 500 \| 100% \| 56 req/s \| 4.4s \|
	\| 1000 \| 100% \| 48 req/s \| 9.3s \|
	\| 1500 \| 80% \| 51 req/s \| 15.4s \|
	\| 2000 \| 70% \| 53 req/s \| 20.8s \|

	Breaking Point: ~1500 concurrent connections

	#### Remote Server (mcp.withwandb.com)

	\| Concurrent Clients \| Success Rate \| Throughput \| Mean Response \|
	\|--------------------\|-------------\|------------\|---------------\|
	\| 10 \| 100% \| 10 req/s \| 0.8s \|
	\| 50 \| 100% \| 29 req/s \| 1.2s \|
	\| 100 \| 100% \| 33 req/s \| 1.9s \|
	\| 200 \| 100% \| 34 req/s \| 3.3s \|
	\| 500 \| 100% \| 35 req/s \| 7.5s \|

	Key Finding: Remote server handles 500+ concurrent connections reliably!

	### Performance Sweet Spots

	1. Low Latency (<1s response): Use ≤50 concurrent connections
	2. Balanced (good throughput & latency): Use 100-200 concurrent connections
	3. Maximum Throughput: Use 200-300 concurrent connections
	4. Maximum Capacity: Up to 500 concurrent (remote) or 1000 (local)

	---

	## Deployment Recommendations

	### Current Deployment (HuggingFace Spaces)

	```yaml
	Configuration:
	- Single worker (can be increased)
	- Stateless HTTP mode
	- 2 vCPU, 16GB RAM
	- Port 7860

	Performance:
	- 500+ concurrent connections
	- ~35 req/s throughput
	- 100% reliability up to 500 concurrent
	```

	### Scaling Options

	#### Option 1: Vertical Scaling
	- Increase CPU/RAM on HuggingFace Spaces
	- Can improve single-worker throughput

	#### Option 2: Horizontal Scaling (Recommended)
	```python
	# app.py - Enable multiple workers
	uvicorn.run(app, host="0.0.0.0", port=PORT, workers=4)
	```

	#### Option 3: Multi-Region Deployment
	- Deploy to multiple regions
	- Use global load balancer
	- Reduce latency for users worldwide

	### Production Checklist

	✅ Stateless mode enabled (`stateless_http=True`)
	✅ Bearer authentication on every request
	✅ Health check endpoint (`/health`)
	✅ Monitoring for response times and errors
	✅ Rate limiting (recommended: 100 req/s per client)
	✅ Connection limits (recommended: 500 concurrent)

	### Configuration Example

	```python
	# Production configuration
	mcp = FastMCP("wandb-mcp-server", stateless_http=True)

	# Uvicorn with multiple workers (if needed)
	if __name__ == "__main__":
	uvicorn.run(
	app,
	host="0.0.0.0",
	port=7860,
	workers=1, # Increase for horizontal scaling
	limit_concurrency=1000, # Connection limit
	timeout_keep_alive=30, # Keepalive timeout
	)
	```

	### Security Considerations

	1. API Key Validation: Every request validates Bearer token
	2. No Session Storage: No risk of session hijacking
	3. Rate Limiting: Protect against abuse
	4. HTTPS Only: Always use TLS in production
	5. Token Rotation: Encourage regular API key rotation

	---

	## Summary

	The W&B MCP Server's stateless architecture provides:

	- Universal Compatibility: Works with all MCP clients
	- Excellent Performance: 500+ concurrent connections, ~35 req/s
	- Horizontal Scalability: Add workers to increase capacity
	- Simple Operations: No session management complexity
	- Production Ready: Deployed and tested at scale

	The stateless design is not a compromise - it's the optimal architecture for MCP servers in production environments.