# MAC — MBM AI Cloud
## API Design & Architecture Plan
### Phases 1–8 · Complete Platform Blueprint
**Prepared for Professor Review · 07 April 2026**
---
## Executive Summary
MAC (MBM AI Cloud) is a **fully self-hosted, zero-cloud AI inference platform** purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models — text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning — through a standardised REST API authenticated by per-student API keys.
**Key design goals:**
- **Zero cloud cost** — all inference runs on college-owned GPUs
- **OpenAI-compatible API** — students use familiar SDKs; existing tutorials work unchanged
- **Scalable from day one** — starts on a single PC, scales horizontally to 30+ nodes with no code changes
- **Secure by default** — JWT authentication, role-based access, rate limiting, input/output guardrails
- **Academically useful** — RAG over college textbooks, web-grounded search, per-student usage tracking
---
## Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| **API Gateway** | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-safe, among the fastest Python frameworks |
| **Database** | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| **Cache / Rate Limiter** | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| **LLM Inference** | vLLM | PagedAttention, continuous batching, highest throughput for local GPUs |
| **Model Router / Proxy** | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| **Vector Database** | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| **Web Search** | SearXNG | Self-hosted meta-search, no API keys needed |
| **Reverse Proxy** | Nginx | TLS termination, request buffering, static file serving |
| **Containerisation** | Docker + Docker Compose | Reproducible deployments, one-command startup |
| **Frontend** | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| **Task Queue** | Celery + Redis | Background jobs: document ingestion, model downloads |
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                          College LAN                             │
│                  Students / Faculty / Lab PCs                    │
└────────────────────────┬────────────────────────────────────────┘
                         │ HTTPS
                         ▼
              ┌─────────────────────┐
              │        Nginx        │
              │    Reverse Proxy    │
              │    TLS · Caching    │
              └─────────┬───────────┘
                        │
              ┌─────────▼───────────┐
              │       FastAPI       │
              │     API Gateway     │
              │    Auth · Routing   │
              │    Rate Limiting    │
              └───┬────┬────┬───┬──┘
                  │    │    │   │
         ┌────────┘    │    │   └────────┐
         ▼             ▼    ▼            ▼
   ┌──────────┐  ┌─────────────┐  ┌──────────┐
   │PostgreSQL│  │   LiteLLM   │  │  Qdrant  │
   │  Users   │  │    Proxy    │  │ VectorDB │
   │  Logs    │  │   Routing   │  │  (RAG)   │
   │  Keys    │  └──────┬──────┘  └──────────┘
   └──────────┘         │
                        ▼
              ┌─────────────────────┐
              │        vLLM         │
              │    Model Workers    │
              │    GPU Inference    │
              └─────────────────────┘
```
**Scaling model:** The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.
---
## Build Phases — 8 Phases, Sequential
| # | Phase | Description | Dependencies |
|---|---|---|---|
| 1 | **API Endpoints** | Core REST API — explore, query, usage, auth | None |
| 2 | **LLM Models** | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | **API–Model Integration** | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phases 1, 2 |
| 4 | **API Usage Control** | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | **Web Interface** | Dashboard, user management, admin panel, model access controls | Phases 1, 4 |
| 6 | **Guardrails** | Input + output content filtering, safety checks | Phase 3 |
| 7 | **Knowledgebase + RAG** | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | **Retrieval + Search** | SearXNG web search, Wikipedia, real-time grounded answers | Phases 3, 7 |
---
## Base URL
```
http://<server-ip>/api/v1
```
> The server IP is configured via the environment variable `MAC_HOST`. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.
---
# Phase 1 — API Endpoints
## 1.1 Authentication — `/auth`
Handles user login, session management, and the JWT lifecycle. All passwords are hashed with **bcrypt** (work factor 12). Tokens use **RS256** JWT signing.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/auth/login` | Roll number + password → JWT access token + refresh token |
| `POST` | `/auth/logout` | Invalidate current session / revoke refresh token |
| `POST` | `/auth/refresh` | Exchange refresh token for a new access token |
| `GET` | `/auth/me` | Current user's profile, role, department, and API key |
| `POST` | `/auth/change-password` | Change password (requires current password verification) |
### Request — `POST /auth/login`
```json
{
  "roll_number": "21CS045",
  "password": "secure_password"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `roll_number` | string | Yes | Student/faculty roll number, e.g. `21CS045` |
| `password` | string | Yes | Account password (min 8 chars) |
### Response — `POST /auth/login`
```json
{
  "access_token": "eyJhbG...",
  "refresh_token": "dGhpcyBp...",
  "token_type": "bearer",
  "expires_in": 86400,
  "user": {
    "roll_number": "21CS045",
    "name": "Aryan Sharma",
    "department": "CSE",
    "role": "student",
    "api_key": "mac_sk_live_abc123..."
  }
}
```
| Field | Description |
|---|---|
| `access_token` | JWT, 24-hour expiry, used for all authenticated requests |
| `refresh_token` | 30-day expiry, used only at `/auth/refresh` |
| `user.role` | One of: `student`, `faculty`, `admin` |
| `user.api_key` | Personal API key for direct model queries |
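For reference, a minimal Python sketch of this login flow, assuming the `requests` package; the host and credentials are placeholders taken from the examples above:
```python
# Sketch: exchange credentials for tokens, then call an authenticated endpoint.
import requests

BASE_URL = "http://mac-server/api/v1"  # placeholder host

resp = requests.post(
    f"{BASE_URL}/auth/login",
    json={"roll_number": "21CS045", "password": "secure_password"},
    timeout=10,
)
resp.raise_for_status()
login = resp.json()

# The JWT access token authenticates session endpoints such as /auth/me;
# the personal API key is what scripts pass to the /query endpoints.
headers = {"Authorization": f"Bearer {login['access_token']}"}
print(requests.get(f"{BASE_URL}/auth/me", headers=headers, timeout=10).json())
print("API key:", login["user"]["api_key"])
```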
---
## 1.2 Explore — `/explore`
Read-only discovery endpoints that let students see which models and capabilities are available before writing any code.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/explore/models` | List all deployed models with capabilities, context length, and status |
| `GET` | `/explore/models/{model_id}` | Detailed info — parameters, benchmarks, example prompts |
| `GET` | `/explore/models/search` | Search by capability tag: `?tag=vision`, `?tag=code`, `?tag=math` |
| `GET` | `/explore/endpoints` | List every API endpoint with method, path, auth requirement, description |
| `GET` | `/explore/health` | Platform health — node status, GPU temperatures, inference queue depth |
| `GET` | `/explore/usage-stats` | Aggregated platform analytics (admin-only) — tokens/day, active users |
### Response — `GET /explore/models`
```json
{
  "models": [
    {
      "id": "qwen2.5-coder-7b",
      "name": "Qwen2.5-Coder 7B",
      "specialty": "Code generation, debugging, explanation",
      "parameters": "7B",
      "context_length": 32768,
      "status": "loaded",
      "capabilities": ["code", "chat", "completion"]
    }
  ],
  "total": 5
}
```
### Response — `GET /explore/health`
```json
{
  "status": "healthy",
  "uptime_seconds": 345600,
  "nodes": [
    {
      "ip": "192.168.1.101",
      "gpu": "NVIDIA RTX 3060 12GB",
      "gpu_temp_c": 62,
      "vram_used_gb": 9.2,
      "vram_total_gb": 12.0,
      "requests_in_flight": 3,
      "status": "active"
    }
  ],
  "queue_depth": 7
}
```
---
## 1.3 Query — `/query`
The core inference API. All endpoints require an `Authorization: Bearer <api_key>` header. Responses follow the **OpenAI Chat Completions format**, so existing OpenAI SDK code works with zero changes after swapping `base_url`.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/query/chat` | Chat completion — multi-turn conversation (text/code/math) |
| `POST` | `/query/completions` | Raw text completion (OpenAI-compatible) |
| `POST` | `/query/vision` | Image + text → answer (vision model) |
| `POST` | `/query/speech-to-text` | Upload audio → transcribed text (Whisper) |
| `POST` | `/query/text-to-speech` | Text → audio file download (TTS) |
| `POST` | `/query/embeddings` | Text → vector embedding (for RAG / similarity search) |
| `POST` | `/query/rerank` | Re-rank passages by relevance to a query |
### Request — `POST /query/chat`
```json
{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a linked list."}
  ],
  "temperature": 0.7,
  "max_tokens": 2048,
  "stream": false
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID or `"auto"` for smart routing |
| `messages` | array | Yes | Array of `{role, content}` — system / user / assistant |
| `temperature` | float | No | Sampling temperature 0.0–2.0 (default `0.7`) |
| `max_tokens` | integer | No | Max tokens to generate (default `2048`, max `4096`) |
| `stream` | boolean | No | Stream tokens as Server-Sent Events (default `false`) |
| `context_id` | string | No | Maintain conversation state server-side |
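Because `stream: true` emits Server-Sent Events in the OpenAI format, streaming with the OpenAI Python SDK is a one-flag change. A minimal sketch, assuming the compatibility described above (host and key are placeholders):
```python
# Streaming sketch against the OpenAI-compatible /query surface.
from openai import OpenAI

client = OpenAI(
    base_url="http://mac-server/api/v1",  # placeholder host
    api_key="mac_sk_live_your_key_here",  # placeholder key
)

# stream=True yields chunks as the model generates them (SSE under the hood)
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain quicksort briefly."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()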
### Response — `POST /query/chat`
```json
{
  "id": "mac-chat-a1b2c3d4",
  "object": "chat.completion",
  "created": 1743984000,
  "model": "qwen2.5-coder-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a Python function to reverse a linked list..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 187,
    "total_tokens": 229
  }
}
```
### Smart Routing (`model: "auto"`)
When students set `model` to `"auto"`, the API gateway inspects the request and routes to the optimal model:
| Signal Detected | Routed To |
|---|---|
| Code keywords, programming language names, `write a function`, `debug this` | `qwen2.5-coder-7b` |
| Math expressions, equations, `solve`, `prove`, `step by step` | `deepseek-r1-8b` |
| Image attachment in request body | `llava-1.6-7b` |
| Audio file upload | `whisper-large-v3` |
| General text, summarisation, writing, Q&A | `qwen2.5-14b` |
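A deliberately simplified sketch of what such keyword routing could look like; the rule set mirrors the table above, but the production gateway may combine these signals differently:
```python
# Simplified keyword router mirroring the routing table. Illustrative only.
CODE_HINTS = {"function", "debug", "code", "python", "javascript"}
MATH_HINTS = {"solve", "equation", "prove", "integral", "step by step"}

def route(prompt: str, has_image: bool = False, has_audio: bool = False) -> str:
    """Pick a model ID for `model: "auto"` requests."""
    if has_audio:
        return "whisper-large-v3"
    if has_image:
        return "llava-1.6-7b"
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return "qwen2.5-coder-7b"
    if any(hint in text for hint in MATH_HINTS):
        return "deepseek-r1-8b"
    return "qwen2.5-14b"  # general default

assert route("Write a Python function to reverse a linked list") == "qwen2.5-coder-7b"
assert route("Prove that sqrt(2) is irrational") == "deepseek-r1-8b"
```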
---
## 1.4 Usage — `/usage`
Per-student consumption tracking. Every API call logs token counts, model used, and timestamp.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/usage/me` | My tokens used — today, this week, this month, broken down by model |
| `GET` | `/usage/me/history` | Full request history — timestamps, models, token counts, latency |
| `GET` | `/usage/me/quota` | My current quota limits and remaining balance |
| `GET` | `/usage/admin/all` | All users' usage summary (admin-only) |
| `GET` | `/usage/admin/user/{roll_number}` | Specific student's full usage details (admin-only) |
| `GET` | `/usage/admin/models` | Per-model usage statistics across the platform (admin-only) |
### Response — `GET /usage/me`
```json
{
  "roll_number": "21CS045",
  "usage": {
    "today": {
      "total_tokens": 12450,
      "requests": 23,
      "by_model": {
        "qwen2.5-coder-7b": {"tokens": 8200, "requests": 15},
        "qwen2.5-14b": {"tokens": 4250, "requests": 8}
      }
    },
    "this_week": {"total_tokens": 67800, "requests": 142},
    "this_month": {"total_tokens": 234500, "requests": 487}
  },
  "quota": {
    "daily_limit": 50000,
    "remaining_today": 37550
  }
}
```
---
# Phase 2 — LLM Models
## 2.1 Model Selection
Five best-in-class open-source models, each a domain specialist. Models are quantised where needed so that each fits within consumer GPU VRAM.
| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
|---|---|---|---|---|---|
| `qwen2.5-coder-7b` | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| `deepseek-r1-8b` | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| `llava-1.6-7b` | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| `whisper-large-v3` | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| `qwen2.5-14b` | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |
**Total VRAM for all 5 models:** ~32 GB — fits across a dual-GPU setup, or a single GPU can load models on demand
### Feasibility on Single PC
For a single-PC deployment with one GPU (e.g., RTX 3060 12GB or RTX 4070 16GB):
- Models are loaded/unloaded on demand — only the requested model occupies VRAM at inference time
- A model loading queue ensures graceful transitions
- Frequently-used models remain resident; idle models are evicted after a configurable timeout (see the sketch below)
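A minimal sketch of that idle-eviction policy; the load/unload calls are stubbed, since the real manager would drive vLLM worker processes and the loading queue:
```python
# Illustrative idle-eviction sketch; load/unload are stand-ins for vLLM hooks.
import time

IDLE_TIMEOUT_S = 600  # configurable eviction timeout

class ModelManager:
    def __init__(self) -> None:
        self.last_used: dict[str, float] = {}  # model_id -> last request time

    def acquire(self, model_id: str) -> None:
        """Ensure a model is resident before serving a request."""
        if model_id not in self.last_used:
            print(f"loading {model_id} into VRAM")  # stub for vLLM load
        self.last_used[model_id] = time.monotonic()

    def evict_idle(self) -> None:
        """Unload models idle longer than the timeout; run on a timer."""
        now = time.monotonic()
        for model_id, ts in list(self.last_used.items()):
            if now - ts > IDLE_TIMEOUT_S:
                print(f"unloading {model_id}")  # stub for vLLM unload
                del self.last_used[model_id]

mgr = ModelManager()
mgr.acquire("qwen2.5-coder-7b")
mgr.evict_idle()  # nothing evicted yet
```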
## 2.2 Model Management Endpoints — `/models`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/models` | List all models with status: `loaded` / `downloading` / `queued` / `offline` |
| `GET` | `/models/{model_id}` | Single model details — context length, capabilities, benchmark scores |
| `POST` | `/models/{model_id}/load` | Load model into GPU memory (admin-only) |
| `POST` | `/models/{model_id}/unload` | Unload model from VRAM to free resources (admin-only) |
| `GET` | `/models/{model_id}/health` | Ping model — returns latency, memory usage, ready status |
| `POST` | `/models/download` | Download a model from HuggingFace by ID (admin-only) |
| `GET` | `/models/download/{task_id}` | Check download progress for a model download task |
### Response — `GET /models`
```json
{
  "models": [
    {
      "id": "qwen2.5-coder-7b",
      "name": "Qwen2.5-Coder 7B",
      "status": "loaded",
      "vram_mb": 5120,
      "context_length": 32768,
      "capabilities": ["code", "chat"],
      "loaded_at": "2026-04-07T08:30:00Z"
    },
    {
      "id": "deepseek-r1-8b",
      "name": "DeepSeek-R1 8B",
      "status": "offline",
      "vram_mb": 6144,
      "context_length": 65536,
      "capabilities": ["reasoning", "math", "chat"],
      "loaded_at": null
    }
  ]
}
```
---
# Phase 3 — API–Model Integration
## 3.1 Architecture
**LiteLLM Proxy** sits between the FastAPI gateway and vLLM workers. It:
1. Translates every `/query` request into the correct vLLM inference call
2. Handles smart routing (auto-model selection)
3. Load-balances across multiple workers (when scaled)
4. Retries on worker failure with exponential backoff
5. Returns OpenAI-compatible JSON responses
```
FastAPI Gateway ──► LiteLLM Proxy ──► vLLM Worker(s)
                    (routing +        (GPU inference)
                     load balance)
```
## 3.2 Integration Endpoints — `/integration`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/integration/routing-rules` | Show current routing rules (which task type → which model) |
| `PUT` | `/integration/routing-rules` | Update routing rules (admin-only) |
| `GET` | `/integration/workers` | List all vLLM worker nodes and their current load |
| `GET` | `/integration/workers/{node_id}` | Single worker — GPU temp, VRAM used, requests in flight |
| `POST` | `/integration/workers/{node_id}/drain` | Mark worker as draining — no new requests routed (admin-only) |
| `GET` | `/integration/queue` | Current global inference queue depth across all workers |
### Response — `GET /integration/routing-rules`
```json
{
  "rules": [
    {"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
    {"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
    {"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
    {"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
    {"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
  ]
}
```
## 3.3 Scaling Strategy
| Deployment | Configuration |
|---|---|
| **Single PC** | One vLLM process, models loaded/swapped on demand |
| **2–5 PCs** | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| **6–30 PCs** | Multiple workers per model for redundancy; least-busy routing; auto-failover |
All worker addresses are stored in configuration (environment variables / config file) — **no IPs are hardcoded**. Adding a new node requires only updating the LiteLLM config and restarting the proxy.
---
# Phase 4 — API Usage Control
## 4.1 API Key Management — `/keys`
Every student receives a unique API key upon account creation. Two key types are supported:
| Key Type | Behaviour |
|---|---|
| **Static** | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| **Refresh** | Auto-rotates every 30 days. Old key has a 48-hour grace period. Ideal for long-running scripts. |
Key format: `mac_sk_live_<32-char-random-hex>`
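A sketch of how such keys could be generated and verified, consistent with the key format above and the hash-only storage policy in the Security Design section; function names are illustrative:
```python
# Key generation/verification sketch. Only the hash and a display prefix
# would be persisted; the raw key is shown once at creation.
import hashlib
import secrets

def generate_api_key() -> tuple[str, str, str]:
    """Return (raw_key, key_hash, key_prefix)."""
    raw = f"mac_sk_live_{secrets.token_hex(16)}"  # 32 random hex chars
    key_hash = hashlib.sha256(raw.encode()).hexdigest()  # stored in DB
    key_prefix = raw[:16]  # e.g. "mac_sk_live_a1b2", for masked display
    return raw, key_hash, key_prefix

def verify_api_key(presented: str, stored_hash: str) -> bool:
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_hash)  # constant-time compare

raw, key_hash, prefix = generate_api_key()
assert verify_api_key(raw, key_hash) and raw.startswith("mac_sk_live_")
```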
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/keys/my-key` | Get current API key (partially masked) and metadata |
| `POST` | `/keys/generate` | Generate a new API key (invalidates the previous one) |
| `GET` | `/keys/my-key/stats` | Tokens consumed against this key — today, week, month |
| `DELETE` | `/keys/my-key` | Revoke current key permanently (must re-generate) |
| `GET` | `/keys/admin/all` | List all student API keys and status (admin-only) |
| `POST` | `/keys/admin/revoke` | Force-revoke a specific student's API key (admin-only) |
### Response — `GET /keys/my-key`
```json
{
  "key": "mac_sk_live_a1b2...****",
  "type": "refresh",
  "created_at": "2026-03-01T10:00:00Z",
  "expires_at": "2026-03-31T10:00:00Z",
  "last_used": "2026-04-07T14:23:00Z",
  "status": "active",
  "total_requests": 487
}
```
## 4.2 Rate Limiting & Quotas — `/quota`
Rate limits are enforced at the **Redis layer** using a sliding-window algorithm (sketched below). Limits are per-role and can be overridden per-user.
| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
|---|---|---|---|
| **Student** | 50,000 | 100 | 4,096 |
| **Faculty** | 200,000 | 500 | 8,192 |
| **Admin** | Unlimited | Unlimited | 16,384 |
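A minimal sketch of the sliding-window check using a Redis sorted set via `redis-py`; key naming and the exact accounting are illustrative:
```python
# Sliding-window limiter sketch: one sorted set per user, scored by time.
import time
import redis

r = redis.Redis(host="redis", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 3600) -> bool:
    """Record this request and check the count within the window."""
    key = f"rl:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)       # drop expired entries
    pipe.zadd(key, {f"{now}-{time.monotonic_ns()}": now})  # record request
    pipe.zcard(key)                                     # count within window
    pipe.expire(key, window_s)                          # let idle keys decay
    _, _, count, _ = pipe.execute()
    return count <= limit

if not allow_request("21CS045"):
    pass  # return HTTP 429 with retry_after; otherwise proceed to inference
```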
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/quota/limits` | Show default quota limits per role |
| `GET` | `/quota/me` | My personal limits and current consumption |
| `PUT` | `/quota/admin/user/{roll_number}` | Override quota for a specific user (admin-only) |
| `GET` | `/quota/admin/exceeded` | List users who have exceeded their daily quota (admin-only) |
### Response — `GET /quota/me`
```json
{
  "role": "student",
  "limits": {
    "daily_tokens": 50000,
    "requests_per_hour": 100,
    "max_tokens_per_request": 4096
  },
  "current": {
    "tokens_used_today": 12450,
    "requests_this_hour": 8,
    "remaining_tokens": 37550,
    "resets_at": "2026-04-08T00:00:00Z"
  }
}
```
### Rate Limit Response Headers
Every API response includes:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1743987600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1744070400
```
When a limit is exceeded, the API returns **HTTP 429 Too Many Requests**:
```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
    "retry_after": 847
  }
}
```
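One way a client script might honour this response; the helper below is illustrative, not part of the platform:
```python
# Retry-on-429 sketch that respects the structured retry_after field.
import time
import requests

def post_with_retry(url: str, payload: dict, headers: dict, max_retries: int = 3):
    for _ in range(max_retries):
        resp = requests.post(url, json=payload, headers=headers, timeout=60)
        if resp.status_code != 429:
            return resp
        # Prefer the structured field; fall back to a short default wait.
        wait = resp.json().get("error", {}).get("retry_after", 30)
        time.sleep(wait)
    raise RuntimeError("rate limit: retries exhausted")
```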
---
# Phase 5 — Web Interface
A React-based single-page application served by the FastAPI backend. Three views: **Student Dashboard**, **Key Management**, and **Admin Panel**.
## 5.1 Web Interface Endpoints — `/ui`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/dashboard` | Student home — usage chart, quick-start code snippets, model status cards |
| `GET` | `/ui/keys` | Key management — view, copy, regenerate API key |
| `GET` | `/ui/history` | Request history table — timestamp, model, tokens, latency, status |
| `GET` | `/ui/playground` | Interactive chat playground — test models directly in the browser |
## 5.2 Admin Panel — `/ui/admin`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/admin/users` | Full user list with roles, quotas, last active, status |
| `POST` | `/ui/admin/users/create` | Create user / bulk-create from CSV (roll no., name, department) |
| `PUT` | `/ui/admin/users/{roll_number}` | Edit user — change role, quota overrides, model access |
| `DELETE` | `/ui/admin/users/{roll_number}` | Deactivate a user account |
| `GET` | `/ui/admin/models` | Model management — load/unload, see which node serves what |
| `GET` | `/ui/admin/logs` | Live request logs — error rates, latency percentiles, throughput |
| `GET` | `/ui/admin/analytics` | Platform analytics — daily active users, peak hours, top models |
### Dashboard Wireframe
```
┌──────────────────────────────────────────────────────┐
│  MAC — MBM AI Cloud                     [Profile ▼]  │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐               │
│  │ Tokens  │  │Requests │  │ Quota   │               │
│  │ Today   │  │ Today   │  │Remaining│               │
│  │ 12,450  │  │   23    │  │  75.1%  │               │
│  └─────────┘  └─────────┘  └─────────┘               │
│                                                      │
│  Models Available          Quick Start               │
│  ┌──────────────────┐      ┌────────────────────┐    │
│  │ ✅ Qwen2.5-Coder │      │ from openai import │    │
│  │ ✅ DeepSeek-R1   │      │   OpenAI           │    │
│  │ ✅ Qwen2.5 14B   │      │ client = OpenAI(   │    │
│  │ ⏳ LLaVA (load.) │      │   base_url=        │    │
│  │ ✅ Whisper       │      │   "http://mac/v1", │    │
│  └──────────────────┘      │   api_key="mac_sk_"│    │
│                            │ )                  │    │
│  Usage This Week           └────────────────────┘    │
│  ┌──────────────────┐                                │
│  │ ▁▃▅▇▅▃▁ (chart)  │                                │
│  └──────────────────┘                                │
└──────────────────────────────────────────────────────┘
```
---
# Phase 6 — Guardrails
Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.
## 6.1 Guardrails Endpoints — `/guardrails`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/guardrails/check-input` | Run input text through content filter before sending to model |
| `POST` | `/guardrails/check-output` | Run model output through safety filter before returning to user |
| `GET` | `/guardrails/rules` | List active guardrail rules (admin-only) |
| `PUT` | `/guardrails/rules` | Update rules — blocked categories, max prompt length (admin-only) |
## 6.2 Filtering Pipeline
```
User Input ──► Input Filter ──► Model Inference ──► Output Filter ──► Response
                    │                                     │
                    ├─ Prompt injection detection         ├─ PII/sensitive data redaction
                    ├─ Blocked topic detection            ├─ Harmful content detection
                    ├─ Max prompt length enforcement      ├─ Hallucination disclaimer
                    └─ Academic integrity checks          └─ Source attribution
```
### Guardrail Categories
| Category | Action | Description |
|---|---|---|
| **Prompt Injection** | Block + log | Detect attempts to override system prompts |
| **Harmful Content** | Block | Violence, self-harm, illegal activities |
| **Academic Dishonesty** | Flag + disclaimer | Full essay/assignment generation adds academic integrity notice |
| **PII in Output** | Redact | Strip emails, phone numbers, addresses from model output |
| **Max Prompt Length** | Reject | Configurable per-role; prevents resource abuse |
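As an illustration of the PII "Redact" action above, a deliberately simple output-filter sketch; production patterns would need to be far more thorough:
```python
# Minimal PII-redaction sketch for the output filter stage.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact_pii(text: str) -> str:
    text = EMAIL.sub("[REDACTED EMAIL]", text)
    text = PHONE.sub("[REDACTED PHONE]", text)
    return text

print(redact_pii("Contact me at jane@example.com or +91 98765 43210."))
# -> Contact me at [REDACTED EMAIL] or [REDACTED PHONE].
```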
### Request — `POST /guardrails/check-input`
```json
{
  "text": "User input to check",
  "context": "chat"
}
```
### Response
```json
{
  "safe": true,
  "flags": [],
  "modified_text": null
}
```
---
# Phase 7 — Knowledgebase + RAG
Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.
## 7.1 RAG Endpoints — `/rag`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/rag/ingest` | Upload PDF / DOCX / TXT — chunk, embed, store in Qdrant vector DB |
| `GET` | `/rag/documents` | List all ingested documents with metadata (title, pages, chunk count) |
| `GET` | `/rag/documents/{id}` | Single document details and its chunks |
| `DELETE` | `/rag/documents/{id}` | Remove a document from the knowledgebase (admin-only) |
| `POST` | `/rag/query` | Ask a question — retrieves top-k relevant chunks → sends to LLM with context |
| `GET` | `/rag/query/{query_id}/sources` | Get source citations for a RAG response |
| `POST` | `/rag/collections` | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| `GET` | `/rag/collections` | List all collections |
## 7.2 RAG Pipeline
```
Document Upload ──► Chunking (512 tokens, 50 overlap)
                          │
                          ▼
                  Embedding (768-dim)
                          │
                          ▼
                   Qdrant Vector DB
                          │
User Question ──► Embed Query ──► Similarity Search (top-k=5)
                                          │
                                          ▼
                            Retrieved Chunks + Question
                                          │
                                          ▼
                                   LLM Generation
                                          │
                                          ▼
                              Answer with Citations
```
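A sketch of the overlapping chunker implied by the parameters above (512-token chunks, 50-token overlap); whitespace splitting stands in for the real tokenizer, which would match the embedding model:
```python
# Overlapping-chunk sketch. Each chunk starts (size - overlap) tokens
# after the previous one, so adjacent chunks share `overlap` tokens.
def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    tokens = text.split()  # stand-in for a real tokenizer
    step = size - overlap
    return [
        " ".join(tokens[i:i + size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]

pieces = chunk("word " * 1200)
print(len(pieces), len(pieces[0].split()))  # -> 3 512
```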
### Request — `POST /rag/query`
```json
{
  "question": "Explain the difference between process and thread in operating systems",
  "collection": "OS",
  "top_k": 5,
  "model": "auto"
}
```
### Response — `POST /rag/query`
```json
{
  "answer": "A process is an independent program in execution with its own memory space...",
  "sources": [
    {
      "document": "Galvin - Operating System Concepts, 10th Ed",
      "chapter": "Chapter 3: Processes",
      "page": 105,
      "relevance_score": 0.92,
      "chunk_preview": "A process is a program in execution. A process is more than..."
    },
    {
      "document": "Galvin - Operating System Concepts, 10th Ed",
      "chapter": "Chapter 4: Threads",
      "page": 163,
      "relevance_score": 0.89,
      "chunk_preview": "A thread is a basic unit of CPU utilization..."
    }
  ],
  "model_used": "qwen2.5-14b",
  "tokens_used": 847
}
```
---
# Phase 8 — Retrieval + Search
Real-time web grounding via self-hosted SearXNG. Enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.
## 8.1 Search Endpoints — `/search`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/search/web` | Query SearXNG — aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| `POST` | `/search/wikipedia` | Targeted Wikipedia search with summary extraction |
| `POST` | `/search/grounded` | Search + LLM — retrieves web results, then generates a cited answer |
| `GET` | `/search/cache` | List recently cached search results (reduces redundant fetches) |
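Internally, `/search/web` could query the SearXNG instance's JSON API directly. A sketch, assuming `format=json` is enabled in the SearXNG settings and the Docker Compose service name and port listed below:
```python
# SearXNG query sketch; host/port follow the Compose table below and
# the JSON output format is assumed to be enabled in SearXNG settings.
import requests

def web_search(query: str, max_results: int = 5) -> list[dict]:
    resp = requests.get(
        "http://searxng:8888/search",
        params={"q": query, "format": "json"},
        timeout=15,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])[:max_results]
    # Keep only the fields the grounded-answer prompt needs
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("content", "")}
        for r in results
    ]
```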
### Request — `POST /search/grounded`
```json
{
  "query": "What are the latest improvements in vLLM v0.8?",
  "max_sources": 5,
  "model": "auto"
}
```
### Response — `POST /search/grounded`
```json
{
  "answer": "vLLM v0.8 introduced several key improvements including...",
  "sources": [
    {
      "title": "vLLM v0.8.0 Release Notes",
      "url": "https://github.com/vllm-project/vllm/releases",
      "snippet": "Key features: improved prefix caching, multi-modal support..."
    }
  ],
  "model_used": "qwen2.5-14b",
  "cached": false
}
```
---
# Database Schema
## Entity-Relationship Overview
```
┌─────────────┐      ┌─────────────┐       ┌─────────────────┐
│   users     │      │  api_keys   │       │  request_logs   │
├─────────────┤      ├─────────────┤       ├─────────────────┤
│ id (PK)     │──┐   │ id (PK)     │       │ id (PK)         │
│ roll_number │  ├──►│ user_id(FK) │   ┌──►│ user_id (FK)    │
│ name        │  │   │ key_hash    │   │   │ api_key_id (FK) │
│ department  │  │   │ key_prefix  │   │   │ model           │
│ role        │  │   │ type        │   │   │ endpoint        │
│ password    │  │   │ is_active   │   │   │ tokens_in       │
│ is_active   │  │   │ created_at  │   │   │ tokens_out      │
│ created_at  │  │   │ expires_at  │   │   │ latency_ms      │
│ updated_at  │  │   │ last_used   │   │   │ status_code     │
└─────────────┘  │   └─────────────┘   │   │ created_at      │
                 │                     │   └─────────────────┘
                 └─────────────────────┘

┌──────────────────┐      ┌─────────────────┐
│ quota_overrides  │      │  rag_documents  │
├──────────────────┤      ├─────────────────┤
│ id (PK)          │      │ id (PK)         │
│ user_id (FK)     │      │ title           │
│ daily_tokens     │      │ collection      │
│ requests_per_hr  │      │ file_type       │
│ max_tokens_req   │      │ chunk_count     │
│ set_by (FK)      │      │ uploaded_by(FK) │
│ created_at       │      │ created_at      │
└──────────────────┘      └─────────────────┘
```
## SQL Schema — Key Tables
### users
```sql
CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    roll_number     VARCHAR(20) UNIQUE NOT NULL,
    name            VARCHAR(100) NOT NULL,
    department      VARCHAR(50) NOT NULL,
    role            VARCHAR(20) NOT NULL DEFAULT 'student'
                    CHECK (role IN ('student', 'faculty', 'admin')),
    hashed_password VARCHAR(255) NOT NULL,
    is_active       BOOLEAN DEFAULT TRUE,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
```
### api_keys
```sql
CREATE TABLE api_keys (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    key_hash    VARCHAR(255) NOT NULL,
    key_prefix  VARCHAR(20) NOT NULL,
    type        VARCHAR(10) NOT NULL DEFAULT 'static'
                CHECK (type IN ('static', 'refresh')),
    is_active   BOOLEAN DEFAULT TRUE,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    expires_at  TIMESTAMPTZ,
    last_used   TIMESTAMPTZ
);
```
### request_logs
```sql
CREATE TABLE request_logs (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id),
    api_key_id  UUID REFERENCES api_keys(id),
    model       VARCHAR(50) NOT NULL,
    endpoint    VARCHAR(100) NOT NULL,
    tokens_in   INTEGER NOT NULL DEFAULT 0,
    tokens_out  INTEGER NOT NULL DEFAULT 0,
    latency_ms  INTEGER,
    status_code SMALLINT NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at);
CREATE INDEX idx_request_logs_model ON request_logs (model, created_at);
```
---
# Project Folder Structure
```
mac/
├── api/
│   ├── main.py                  # FastAPI application entry point
│   ├── routers/
│   │   ├── auth.py              # /auth endpoints
│   │   ├── explore.py           # /explore endpoints
│   │   ├── query.py             # /query endpoints
│   │   ├── usage.py             # /usage endpoints
│   │   ├── keys.py              # /keys endpoints
│   │   ├── models.py            # /models endpoints
│   │   ├── integration.py       # /integration endpoints
│   │   ├── quota.py             # /quota endpoints
│   │   ├── guardrails.py        # /guardrails endpoints
│   │   ├── rag.py               # /rag endpoints
│   │   └── search.py            # /search endpoints
│   ├── models/
│   │   ├── user.py              # SQLAlchemy User model
│   │   ├── api_key.py           # APIKey model
│   │   ├── request_log.py       # RequestLog model
│   │   └── document.py          # RAG Document model
│   ├── schemas/
│   │   ├── auth.py              # Pydantic request/response schemas
│   │   ├── query.py             # Chat, completions, vision schemas
│   │   ├── usage.py             # Usage response schemas
│   │   └── common.py            # Shared schemas (pagination, errors)
│   ├── core/
│   │   ├── config.py            # Settings from environment variables
│   │   ├── security.py          # JWT creation/validation, password hashing
│   │   ├── rate_limit.py        # Redis-based sliding window rate limiter
│   │   ├── dependencies.py      # FastAPI dependency injection (get_current_user, etc.)
│   │   └── db.py                # Database session factory
│   ├── services/
│   │   ├── model_router.py      # Smart routing logic (auto model selection)
│   │   ├── inference.py         # LiteLLM client for model calls
│   │   ├── usage_tracker.py     # Token counting and logging
│   │   ├── guardrails.py        # Input/output filtering logic
│   │   └── rag.py               # Document ingestion, embedding, retrieval
│   └── requirements.txt
├── litellm/
│   └── config.yaml              # LiteLLM proxy routing configuration
├── nginx/
│   └── nginx.conf               # Reverse proxy configuration
├── frontend/
│   ├── src/
│   │   ├── pages/               # Dashboard, Keys, History, Admin pages
│   │   ├── components/          # Reusable UI components
│   │   └── api/                 # API client (Axios/fetch wrappers)
│   ├── package.json
│   └── vite.config.ts
├── scripts/
│   ├── seed_users.py            # Bulk-create students from CSV
│   ├── download_models.py       # Pull models from HuggingFace
│   └── health_check.py          # Verify all services are running
├── docker-compose.yml           # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile                   # FastAPI container
├── .env.example                 # Template for environment variables
├── alembic/                     # Database migrations
│   └── versions/
└── README.md
```
---
# Security Design
| Concern | Implementation |
|---|---|
| **Password Storage** | bcrypt with work factor 12; passwords never stored in plaintext |
| **JWT Signing** | RS256 asymmetric keys; access tokens short-lived (24h) |
| **API Key Storage** | Only SHA-256 hash stored in DB; raw key shown once at creation |
| **Transport** | Nginx terminates TLS (HTTPS); internal services communicate over Docker network |
| **Input Validation** | Pydantic schema validation on every request; max payload size enforced |
| **SQL Injection** | SQLAlchemy ORM with parameterised queries throughout |
| **Rate Limiting** | Redis sliding-window per user + per IP; prevents abuse and DoS |
| **CORS** | Strict origin allowlist — only the MAC frontend domain |
| **Admin Endpoints** | Role-based access control; admin-only routes check JWT role claim |
| **Prompt Injection** | Input guardrails detect and block prompt override attempts |
| **Secrets Management** | All secrets in `.env` file; never committed to version control |
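A sketch of the RS256 token lifecycle with PyJWT (`pip install pyjwt[crypto]`); the key paths match the `.env` template below, and the claim names are illustrative:
```python
# RS256 sign/verify sketch. Signing needs the private key; verification
# needs only the public key, so any service can validate tokens.
import datetime
import jwt  # PyJWT

with open("/run/secrets/jwt_private.pem") as f:
    PRIVATE_KEY = f.read()
with open("/run/secrets/jwt_public.pem") as f:
    PUBLIC_KEY = f.read()

def issue_access_token(roll_number: str, role: str) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": roll_number,
        "role": role,  # checked by admin-only routes
        "iat": now,
        "exp": now + datetime.timedelta(hours=24),  # 24h access-token expiry
    }
    return jwt.encode(claims, PRIVATE_KEY, algorithm="RS256")

def verify_access_token(token: str) -> dict:
    return jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
```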
---
# Error Response Format
All errors follow a consistent structure:
```json
{
  "error": {
    "code": "authentication_failed",
    "message": "Invalid roll number or password.",
    "status": 401,
    "timestamp": "2026-04-07T14:30:00Z",
    "request_id": "mac-req-a1b2c3"
  }
}
```
### Standard Error Codes
| HTTP Status | Code | Description |
|---|---|---|
| 400 | `bad_request` | Malformed request body or invalid parameters |
| 401 | `authentication_failed` | Missing or invalid credentials / API key |
| 403 | `forbidden` | Valid auth but insufficient role permissions |
| 404 | `not_found` | Resource does not exist |
| 409 | `conflict` | Duplicate resource (e.g., user already exists) |
| 422 | `validation_error` | Request schema validation failed |
| 429 | `rate_limit_exceeded` | Request or token quota exceeded |
| 500 | `internal_error` | Unexpected server error |
| 503 | `model_unavailable` | Requested model is not loaded or all workers are busy |
---
# Deployment Configuration
## Environment Variables (`.env`)
```env
# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production

# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=<generated-secret>

# Redis
REDIS_HOST=redis
REDIS_PORT=6379

# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30

# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=<generated-secret>

# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333

# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b
```
## Docker Compose Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| `api` | Custom (Dockerfile) | 8000 | FastAPI gateway |
| `db` | postgres:16-alpine | 5432 | User data, logs, keys |
| `redis` | redis:7-alpine | 6379 | Rate limiting, caching |
| `litellm` | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| `qdrant` | qdrant/qdrant | 6333 | Vector database (RAG) |
| `searxng` | searxng/searxng | 8888 | Web search engine |
| `nginx` | nginx:alpine | 80, 443 | Reverse proxy + TLS |
---
# API Versioning & Compatibility
- All endpoints are prefixed with `/api/v1`
- The `/query` endpoints return **OpenAI-compatible JSON** — any OpenAI SDK works by changing `base_url`
- Breaking changes will increment the version (`/api/v2`) while maintaining the previous version for one semester
- List endpoints use cursor-based pagination
### OpenAI SDK Compatibility Example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://mac-server/api/v1",
    api_key="mac_sk_live_your_key_here"
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "Explain binary search in Python"}
    ]
)

print(response.choices[0].message.content)
```
---
# Summary
MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.
**Phases 1–4** (API Endpoints + Models + Integration + Usage Control) establish the complete backend — once deployed, students can start querying AI models from their laptops on day one.
---
*MAC — MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026*