# MAC — MBM AI Cloud
## API Design & Architecture Plan
### Phases 1–8 · Complete Platform Blueprint
**Prepared for Professor Review · 07 April 2026**
---
## Executive Summary
MAC (MBM AI Cloud) is a **fully self-hosted, zero-cloud AI inference platform** purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models — text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning — through a standardised REST API authenticated by per-student API keys.
**Key design goals:**
- **Zero cloud cost** — all inference runs on college-owned GPUs
- **OpenAI-compatible API** — students use familiar SDKs; existing tutorials work unchanged
- **Scalable from day one** — starts on a single PC, scales horizontally to 30+ nodes with no code changes
- **Secure by default** — JWT authentication, role-based access, rate limiting, input/output guardrails
- **Academically useful** — RAG over college textbooks, web-grounded search, per-student usage tracking
---
## Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| **API Gateway** | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-safe, among the fastest Python frameworks |
| **Database** | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| **Cache / Rate Limiter** | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| **LLM Inference** | vLLM | PagedAttention, continuous batching, highest throughput for local GPUs |
| **Model Router / Proxy** | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| **Vector Database** | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| **Web Search** | SearXNG | Self-hosted meta-search, no API keys needed |
| **Reverse Proxy** | Nginx | TLS termination, request buffering, static file serving |
| **Containerisation** | Docker + Docker Compose | Reproducible deployments, one-command startup |
| **Frontend** | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| **Task Queue** | Celery + Redis | Background jobs: document ingestion, model downloads |
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ College LAN │
│ Students / Faculty / Lab PCs │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS
┌─────────────────────┐
│ Nginx │
│ Reverse Proxy │
│ TLS · Caching │
└─────────┬───────────┘
┌─────────▼───────────┐
│ FastAPI │
│ API Gateway │
│ Auth · Routing │
│ Rate Limiting │
└───┬────┬────┬───┬──┘
│ │ │ │
┌────────┘ │ │ └────────┐
▼ ▼ ▼ ▼
┌──────────┐ ┌─────────────┐ ┌──────────┐
│PostgreSQL│ │ LiteLLM │ │ Qdrant │
│ Users │ │ Proxy │ │ VectorDB │
│ Logs │ │ Routing │ │ (RAG) │
│ Keys │ └──────┬──────┘ └──────────┘
└──────────┘ │
┌─────────────────────┐
│ vLLM │
│ Model Workers │
│ GPU Inference │
└─────────────────────┘
```
**Scaling model:** The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.
---
## Build Phases — 8 Phases, Sequential
| # | Phase | Description | Dependencies |
|---|---|---|---|
| 1 | **API Endpoints** | Core REST API — explore, query, usage, auth | None |
| 2 | **LLM Models** | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | **API–Model Integration** | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phase 1, 2 |
| 4 | **API Usage Control** | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | **Web Interface** | Dashboard, user management, admin panel, model access controls | Phase 1, 4 |
| 6 | **Guardrails** | Input + output content filtering, safety checks | Phase 3 |
| 7 | **Knowledgebase + RAG** | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | **Retrieval + Search** | SearXNG web search, Wikipedia, real-time grounded answers | Phase 3, 7 |
---
## Base URL
```
http://<server-ip>/api/v1
```
> The server IP is configured via environment variable `MAC_HOST`. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.
---
# Phase 1 — API Endpoints
## 1.1 Authentication — `/auth`
Handles user login, session management, and JWT token lifecycle. All passwords are hashed with **bcrypt** (work factor 12). Tokens use **RS256** JWT signing.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/auth/login` | Roll number + password → JWT access token + refresh token |
| `POST` | `/auth/logout` | Invalidate current session / revoke refresh token |
| `POST` | `/auth/refresh` | Exchange refresh token for a new access token |
| `GET` | `/auth/me` | Current user's profile, role, department, and API key |
| `POST` | `/auth/change-password` | Change password (requires current password verification) |
### Request — `POST /auth/login`
```json
{
"roll_number": "21CS045",
"password": "secure_password"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `roll_number` | string | Yes | Student/faculty roll number, e.g. `21CS045` |
| `password` | string | Yes | Account password (min 8 chars) |
### Response — `POST /auth/login`
```json
{
"access_token": "eyJhbG...",
"refresh_token": "dGhpcyBp...",
"token_type": "bearer",
"expires_in": 86400,
"user": {
"roll_number": "21CS045",
"name": "Aryan Sharma",
"department": "CSE",
"role": "student",
"api_key": "mac_sk_live_abc123..."
}
}
```
| Field | Description |
|---|---|
| `access_token` | JWT, 24-hour expiry, used for all authenticated requests |
| `refresh_token` | 30-day expiry, used only at `/auth/refresh` |
| `user.role` | One of: `student`, `faculty`, `admin` |
| `user.api_key` | Personal API key for direct model queries |
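For reference, a minimal client-side login flow might look like this (the base URL is a placeholder for the configured `MAC_HOST`; request and response shapes follow the tables above):
```python
import requests

BASE = "http://mac-server/api/v1"  # placeholder; use the configured MAC_HOST on the college LAN

# Log in with roll number + password to obtain JWT tokens
resp = requests.post(f"{BASE}/auth/login", json={
    "roll_number": "21CS045",
    "password": "secure_password",
})
resp.raise_for_status()
tokens = resp.json()

# Use the access token as a Bearer token on authenticated endpoints
headers = {"Authorization": f"Bearer {tokens['access_token']}"}
profile = requests.get(f"{BASE}/auth/me", headers=headers).json()
print(profile)
```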
---
## 1.2 Explore — `/explore`
Read-only discovery endpoints. Allow students to see what models and capabilities are available before writing code.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/explore/models` | List all deployed models with capabilities, context length, and status |
| `GET` | `/explore/models/{model_id}` | Detailed info — parameters, benchmarks, example prompts |
| `GET` | `/explore/models/search` | Search by capability tag: `?tag=vision`, `?tag=code`, `?tag=math` |
| `GET` | `/explore/endpoints` | List every API endpoint with method, path, auth requirement, description |
| `GET` | `/explore/health` | Platform health — node status, GPU temperatures, inference queue depth |
| `GET` | `/explore/usage-stats` | Aggregated platform analytics (admin-only) — tokens/day, active users |
### Response — `GET /explore/models`
```json
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"specialty": "Code generation, debugging, explanation",
"parameters": "7B",
"context_length": 32768,
"status": "loaded",
"capabilities": ["code", "chat", "completion"]
}
],
"total": 5
}
```
### Response — `GET /explore/health`
```json
{
"status": "healthy",
"uptime_seconds": 345600,
"nodes": [
{
"ip": "192.168.1.101",
"gpu": "NVIDIA RTX 3060 12GB",
"gpu_temp_c": 62,
"vram_used_gb": 9.2,
"vram_total_gb": 12.0,
"requests_in_flight": 3,
"status": "active"
}
],
"queue_depth": 7
}
```
---
## 1.3 Query — `/query`
The core inference API. All endpoints require an `Authorization: Bearer <api_key>` header. Responses follow the **OpenAI Chat Completions format** — existing OpenAI SDK code works unchanged apart from swapping the `base_url`.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/query/chat` | Chat completion — multi-turn conversation (text/code/math) |
| `POST` | `/query/completions` | Raw text completion (OpenAI-compatible) |
| `POST` | `/query/vision` | Image + text → answer (vision model) |
| `POST` | `/query/speech-to-text` | Upload audio → transcribed text (Whisper) |
| `POST` | `/query/text-to-speech` | Text → audio file download (TTS) |
| `POST` | `/query/embeddings` | Text → vector embedding (for RAG / similarity search) |
| `POST` | `/query/rerank` | Re-rank passages by relevance to a query |
### Request — `POST /query/chat`
```json
{
"model": "auto",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a linked list."}
],
"temperature": 0.7,
"max_tokens": 2048,
"stream": false
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID or `"auto"` for smart routing |
| `messages` | array | Yes | Array of `{role, content}` — system / user / assistant |
| `temperature` | float | No | Sampling temperature 0.0–2.0 (default `0.7`) |
| `max_tokens` | integer | No | Max tokens to generate (default `2048`, max `4096`) |
| `stream` | boolean | No | Stream tokens as Server-Sent Events (default `false`) |
| `context_id` | string | No | Maintain conversation state server-side |
### Response — `POST /query/chat`
```json
{
"id": "mac-chat-a1b2c3d4",
"object": "chat.completion",
"created": 1743984000,
"model": "qwen2.5-coder-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a Python function to reverse a linked list..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 187,
"total_tokens": 229
}
}
```
### Smart Routing (`model: "auto"`)
When students set `model` to `"auto"`, the API gateway inspects the request and routes it to the optimal model (a keyword-matching sketch follows the table):
| Signal Detected | Routed To |
|---|---|
| Code keywords, programming language names, `write a function`, `debug this` | `qwen2.5-coder-7b` |
| Math expressions, equations, `solve`, `prove`, `step by step` | `deepseek-r1-8b` |
| Image attachment in request body | `llava-1.6-7b` |
| Audio file upload | `whisper-large-v3` |
| General text, summarisation, writing, Q&A | `qwen2.5-14b` |
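A simplified sketch of the keyword matching (the rule set mirrors the table above; function and constant names are illustrative, not the gateway's actual implementation):
```python
# Illustrative "auto" routing sketch; the real gateway reads rules from /integration/routing-rules.
CODE_KEYWORDS = {"function", "debug", "code", "python", "javascript"}
MATH_KEYWORDS = {"solve", "equation", "prove", "integral", "step by step"}

def pick_model(messages, has_image=False, has_audio=False):
    if has_audio:
        return "whisper-large-v3"
    if has_image:
        return "llava-1.6-7b"
    text = " ".join(m["content"].lower() for m in messages if m["role"] == "user")
    if any(kw in text for kw in CODE_KEYWORDS):
        return "qwen2.5-coder-7b"
    if any(kw in text for kw in MATH_KEYWORDS):
        return "deepseek-r1-8b"
    return "qwen2.5-14b"  # general-purpose fallback

# Routes to the coder model:
print(pick_model([{"role": "user", "content": "Write a function to reverse a linked list"}]))
```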
---
## 1.4 Usage — `/usage`
Per-student consumption tracking. Every API call logs token counts, model used, and timestamp.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/usage/me` | My tokens used — today, this week, this month, broken down by model |
| `GET` | `/usage/me/history` | Full request history — timestamps, models, token counts, latency |
| `GET` | `/usage/me/quota` | My current quota limits and remaining balance |
| `GET` | `/usage/admin/all` | All users' usage summary (admin-only) |
| `GET` | `/usage/admin/user/{roll_number}` | Specific student's full usage details (admin-only) |
| `GET` | `/usage/admin/models` | Per-model usage statistics across the platform (admin-only) |
### Response — `GET /usage/me`
```json
{
"roll_number": "21CS045",
"usage": {
"today": {
"total_tokens": 12450,
"requests": 23,
"by_model": {
"qwen2.5-coder-7b": {"tokens": 8200, "requests": 15},
"qwen2.5-14b": {"tokens": 4250, "requests": 8}
}
},
"this_week": {"total_tokens": 67800, "requests": 142},
"this_month": {"total_tokens": 234500, "requests": 487}
},
"quota": {
"daily_limit": 50000,
"remaining_today": 37550
}
}
```
---
# Phase 2 — LLM Models
## 2.1 Model Selection
Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM.
| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
|---|---|---|---|---|---|
| `qwen2.5-coder-7b` | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| `deepseek-r1-8b` | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| `llava-1.6-7b` | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| `whisper-large-v3` | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| `qwen2.5-14b` | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |
**Total VRAM for all 5 models:** ~32 GB — fits across a dual-GPU setup, or a single GPU can load models on demand.
### Feasibility on Single PC
For a single-PC deployment with one GPU (e.g., RTX 3060 12GB or RTX 4070 16GB):
- Models are loaded/unloaded on demand — only the requested model occupies VRAM at inference time
- A model loading queue ensures graceful transitions
- Frequently used models remain resident; idle models are evicted after a configurable timeout (see the eviction sketch after this list)
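A minimal sketch of that idle-eviction behaviour (the timeout value and function names are illustrative; the real loader would call the model load/unload endpoints backed by vLLM):
```python
import time

IDLE_TIMEOUT_S = 600   # illustrative: evict models idle for more than 10 minutes
loaded = {}            # model_id -> last-used timestamp for models currently in VRAM

def touch(model_id):
    """Record that a model was just used; load it first if it is not resident."""
    if model_id not in loaded:
        print(f"loading {model_id} into VRAM")       # placeholder for POST /models/{id}/load
    loaded[model_id] = time.time()

def evict_idle():
    """Unload any model that has not been used within the timeout."""
    now = time.time()
    for model_id, last_used in list(loaded.items()):
        if now - last_used > IDLE_TIMEOUT_S:
            print(f"unloading {model_id} to free VRAM")  # placeholder for POST /models/{id}/unload
            del loaded[model_id]
```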
## 2.2 Model Management Endpoints — `/models`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/models` | List all models with status: `loaded` / `downloading` / `queued` / `offline` |
| `GET` | `/models/{model_id}` | Single model details — context length, capabilities, benchmark scores |
| `POST` | `/models/{model_id}/load` | Load model into GPU memory (admin-only) |
| `POST` | `/models/{model_id}/unload` | Unload model from VRAM to free resources (admin-only) |
| `GET` | `/models/{model_id}/health` | Ping model — returns latency, memory usage, ready status |
| `POST` | `/models/download` | Download a model from HuggingFace by ID (admin-only) |
| `GET` | `/models/download/{task_id}` | Check download progress for a model download task |
### Response — `GET /models`
```json
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"status": "loaded",
"vram_mb": 5120,
"context_length": 32768,
"capabilities": ["code", "chat"],
"loaded_at": "2026-04-07T08:30:00Z"
},
{
"id": "deepseek-r1-8b",
"name": "DeepSeek-R1 8B",
"status": "offline",
"vram_mb": 6144,
"context_length": 65536,
"capabilities": ["reasoning", "math", "chat"],
"loaded_at": null
}
]
}
```
---
# Phase 3 — API–Model Integration
## 3.1 Architecture
**LiteLLM Proxy** sits between the FastAPI gateway and vLLM workers. It:
1. Translates every `/query` request into the correct vLLM inference call
2. Handles smart routing (auto-model selection)
3. Load-balances across multiple workers (when scaled)
4. Retries on worker failure with exponential backoff
5. Returns OpenAI-compatible JSON responses
```
FastAPI Gateway ──► LiteLLM Proxy ──► vLLM Worker(s)
                    (routing +        (GPU inference)
                     load balance)
```
## 3.2 Integration Endpoints — `/integration`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/integration/routing-rules` | Show current routing rules (which task type → which model) |
| `PUT` | `/integration/routing-rules` | Update routing rules (admin-only) |
| `GET` | `/integration/workers` | List all vLLM worker nodes and their current load |
| `GET` | `/integration/workers/{node_id}` | Single worker — GPU temp, VRAM used, requests in flight |
| `POST` | `/integration/workers/{node_id}/drain` | Mark worker as draining — no new requests routed (admin-only) |
| `GET` | `/integration/queue` | Current global inference queue depth across all workers |
### Response — `GET /integration/routing-rules`
```json
{
"rules": [
{"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
{"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
{"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
{"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
{"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
]
}
```
## 3.3 Scaling Strategy
| Deployment | Configuration |
|---|---|
| **Single PC** | One vLLM process, models loaded/swapped on demand |
| **2–5 PCs** | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| **6–30 PCs** | Multiple workers per model for redundancy; least-busy routing; auto-failover |
All worker addresses are stored in configuration (environment variables / config file) — **no IPs are hardcoded**. Adding a new node requires only updating the LiteLLM config and restarting the proxy.
---
# Phase 4 — API Usage Control
## 4.1 API Key Management — `/keys`
Every student receives a unique API key upon account creation. Two key types are supported:
| Key Type | Behaviour |
|---|---|
| **Static** | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| **Refresh** | Auto-rotates every 30 days. Old key has a 48-hour grace period. Ideal for long-running scripts. |
Key format: `mac_sk_live_<32-char-random-hex>`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/keys/my-key` | Get current API key (partially masked) and metadata |
| `POST` | `/keys/generate` | Generate a new API key (invalidates the previous one) |
| `GET` | `/keys/my-key/stats` | Tokens consumed against this key — today, week, month |
| `DELETE` | `/keys/my-key` | Revoke current key permanently (must re-generate) |
| `GET` | `/keys/admin/all` | List all student API keys and status (admin-only) |
| `POST` | `/keys/admin/revoke` | Force-revoke a specific student's API key (admin-only) |
### Response — `GET /keys/my-key`
```json
{
"key": "mac_sk_live_a1b2...****",
"type": "refresh",
"created_at": "2026-03-01T10:00:00Z",
"expires_at": "2026-03-31T10:00:00Z",
"last_used": "2026-04-07T14:23:00Z",
"status": "active",
"total_requests": 487
}
```
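A sketch of how keys in the `mac_sk_live_` format could be generated and verified — only a SHA-256 hash and a short display prefix are persisted, per the Security Design section (helper names are illustrative):
```python
import hashlib
import secrets

def generate_api_key():
    """Return (raw_key, key_prefix, key_hash); the raw key is shown to the user only once."""
    raw_key = "mac_sk_live_" + secrets.token_hex(16)         # 32 random hex chars
    key_prefix = raw_key[:20]                                # stored for masked display
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()  # only the hash is stored
    return raw_key, key_prefix, key_hash

def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented key's hash against the stored hash."""
    candidate = hashlib.sha256(presented_key.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_hash)
```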
## 4.2 Rate Limiting & Quotas — `/quota`
Rate limits are enforced at the **Redis layer** using a sliding-window algorithm. Limits are per-role and can be overridden per-user.
| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
|---|---|---|---|
| **Student** | 50,000 | 100 | 4,096 |
| **Faculty** | 200,000 | 500 | 8,192 |
| **Admin** | Unlimited | Unlimited | 16,384 |
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/quota/limits` | Show default quota limits per role |
| `GET` | `/quota/me` | My personal limits and current consumption |
| `PUT` | `/quota/admin/user/{roll_number}` | Override quota for a specific user (admin-only) |
| `GET` | `/quota/admin/exceeded` | List users who have exceeded their daily quota (admin-only) |
### Response — `GET /quota/me`
```json
{
"role": "student",
"limits": {
"daily_tokens": 50000,
"requests_per_hour": 100,
"max_tokens_per_request": 4096
},
"current": {
"tokens_used_today": 12450,
"requests_this_hour": 8,
"remaining_tokens": 37550,
"resets_at": "2026-04-08T00:00:00Z"
}
}
```
### Rate Limit Response Headers
Every API response includes:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1775523600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1775606400
```
When a limit is exceeded, the API returns:
```json
HTTP 429 Too Many Requests
{
"error": {
"code": "rate_limit_exceeded",
"message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
"retry_after": 847
}
}
```
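Under the hood, the hourly request window could be enforced with a Redis sorted set; a minimal sketch (key naming and window size are illustrative, and the real limiter also tracks daily token counts):
```python
import time
import redis

r = redis.Redis(host="redis", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 3600) -> bool:
    """Sliding-window check: allow at most `limit` requests per `window_s` seconds."""
    key = f"ratelimit:{user_id}"                   # illustrative key naming
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop entries outside the window
    pipe.zadd(key, {str(now): now})                # record this request
    pipe.zcard(key)                                # count requests inside the window
    pipe.expire(key, window_s)
    _, _, count, _ = pipe.execute()
    return count <= limit

# When allow_request(...) returns False, the gateway responds with the HTTP 429 body shown above.
```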
---
# Phase 5 — Web Interface
A React-based single-page application served by the FastAPI backend. Three views: **Student Dashboard**, **Key Management**, and **Admin Panel**.
## 5.1 Web Interface Endpoints — `/ui`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/dashboard` | Student home — usage chart, quick-start code snippets, model status cards |
| `GET` | `/ui/keys` | Key management — view, copy, regenerate API key |
| `GET` | `/ui/history` | Request history table — timestamp, model, tokens, latency, status |
| `GET` | `/ui/playground` | Interactive chat playground — test models directly in the browser |
## 5.2 Admin Panel — `/ui/admin`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/admin/users` | Full user list with roles, quotas, last active, status |
| `POST` | `/ui/admin/users/create` | Create user / bulk-create from CSV (roll no., name, department) |
| `PUT` | `/ui/admin/users/{roll_number}` | Edit user — change role, quota overrides, model access |
| `DELETE` | `/ui/admin/users/{roll_number}` | Deactivate a user account |
| `GET` | `/ui/admin/models` | Model management — load/unload, see which node serves what |
| `GET` | `/ui/admin/logs` | Live request logs — error rates, latency percentiles, throughput |
| `GET` | `/ui/admin/analytics` | Platform analytics — daily active users, peak hours, top models |
### Dashboard Wireframe
```
┌──────────────────────────────────────────────────────┐
│ MAC — MBM AI Cloud [Profile ▼] │
├──────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Tokens │ │Requests │ │ Quota │ │
│ │ Today │ │ Today │ │Remaining│ │
│ │ 12,450 │ │ 23 │ │ 75.1% │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Models Available Quick Start │
│ ┌──────────────────┐ ┌────────────────────┐ │
│ │ ✅ Qwen2.5-Coder │ │ from openai import │ │
│ │ ✅ DeepSeek-R1 │ │ OpenAI │ │
│ │ ✅ Qwen2.5 14B │ │ client = OpenAI( │ │
│ │ ⏳ LLaVA (load.) │ │ base_url= │ │
│ │ ✅ Whisper │ │ "http://mac/v1", │ │
│ └──────────────────┘ │ api_key="mac_sk_" │ │
│ │ ) │ │
│ Usage This Week └────────────────────┘ │
│ ┌──────────────────┐ │
│ │ ▁▃▅▇▅▃▁ (chart) │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────┘
```
---
# Phase 6 — Guardrails
Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.
## 6.1 Guardrails Endpoints — `/guardrails`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/guardrails/check-input` | Run input text through content filter before sending to model |
| `POST` | `/guardrails/check-output` | Run model output through safety filter before returning to user |
| `GET` | `/guardrails/rules` | List active guardrail rules (admin-only) |
| `PUT` | `/guardrails/rules` | Update rules — blocked categories, max prompt length (admin-only) |
## 6.2 Filtering Pipeline
```
User Input ──► Input Filter ──► Model Inference ──► Output Filter ──► Response
│ │
├─ Prompt injection detection ├─ PII/sensitive data redaction
├─ Blocked topic detection ├─ Harmful content detection
├─ Max prompt length enforcement ├─ Hallucination disclaimer
└─ Academic integrity checks └─ Source attribution
```
### Guardrail Categories
| Category | Action | Description |
|---|---|---|
| **Prompt Injection** | Block + log | Detect attempts to override system prompts |
| **Harmful Content** | Block | Violence, self-harm, illegal activities |
| **Academic Dishonesty** | Flag + disclaimer | Full essay/assignment generation adds academic integrity notice |
| **PII in Output** | Redact | Strip emails, phone numbers, addresses from model output |
| **Max Prompt Length** | Reject | Configurable per-role; prevents resource abuse |
### Request — `POST /guardrails/check-input`
```json
{
"text": "User input to check",
"context": "chat"
}
```
### Response
```json
{
"safe": true,
"flags": [],
"modified_text": null
}
```
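A simplified sketch of the input-check logic (the patterns below are illustrative examples of prompt-injection and length rules, not the production rule set managed via `PUT /guardrails/rules`):
```python
import re

# Illustrative rules only; the live rule set is configured via PUT /guardrails/rules
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard the system prompt",
]
MAX_PROMPT_CHARS = 8000  # illustrative per-role limit

def check_input(text: str) -> dict:
    flags = []
    if len(text) > MAX_PROMPT_CHARS:
        flags.append("max_prompt_length")
    if any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS):
        flags.append("prompt_injection")
    return {"safe": not flags, "flags": flags, "modified_text": None}

print(check_input("User input to check"))  # {'safe': True, 'flags': [], 'modified_text': None}
```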
---
# Phase 7 — Knowledgebase + RAG
Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.
## 7.1 RAG Endpoints — `/rag`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/rag/ingest` | Upload PDF / DOCX / TXT — chunk, embed, store in Qdrant vector DB |
| `GET` | `/rag/documents` | List all ingested documents with metadata (title, pages, chunk count) |
| `GET` | `/rag/documents/{id}` | Single document details and its chunks |
| `DELETE` | `/rag/documents/{id}` | Remove a document from the knowledgebase (admin-only) |
| `POST` | `/rag/query` | Ask a question — retrieves top-k relevant chunks → sends to LLM with context |
| `GET` | `/rag/query/{query_id}/sources` | Get source citations for a RAG response |
| `POST` | `/rag/collections` | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| `GET` | `/rag/collections` | List all collections |
## 7.2 RAG Pipeline
```
Document Upload ──► Chunking (512 tokens, 50 overlap)
                              │
                              ▼
                    Embedding (768-dim)
                              │
                              ▼
                    Qdrant Vector DB

User Question ──► Embed Query ──► Similarity Search (top-k=5)
                                            │
                                            ▼
                              Retrieved Chunks + Question
                                            │
                                            ▼
                                      LLM Generation
                                            │
                                            ▼
                              Answer with Citations
```
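A minimal sketch of the chunking step from the pipeline above (whitespace words stand in for tokens here; the real ingestion worker would count tokens with the embedding model's tokenizer):
```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, approximating tokens with whitespace words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk is then embedded (768-dim) and upserted into the relevant Qdrant collection.
```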
### Request — `POST /rag/query`
```json
{
"question": "Explain the difference between process and thread in operating systems",
"collection": "OS",
"top_k": 5,
"model": "auto"
}
```
### Response — `POST /rag/query`
```json
{
"answer": "A process is an independent program in execution with its own memory space...",
"sources": [
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 3: Processes",
"page": 105,
"relevance_score": 0.92,
"chunk_preview": "A process is a program in execution. A process is more than..."
},
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 4: Threads",
"page": 163,
"relevance_score": 0.89,
"chunk_preview": "A thread is a basic unit of CPU utilization..."
}
],
"model_used": "qwen2.5-14b",
"tokens_used": 847
}
```
---
# Phase 8 — Retrieval + Search
Real-time web grounding via self-hosted SearXNG. Enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.
## 8.1 Search Endpoints — `/search`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/search/web` | Query SearXNG — aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| `POST` | `/search/wikipedia` | Targeted Wikipedia search with summary extraction |
| `POST` | `/search/grounded` | Search + LLM — retrieves web results, then generates a cited answer |
| `GET` | `/search/cache` | List recently cached search results (reduces redundant fetches) |
### Request — `POST /search/grounded`
```json
{
"query": "What are the latest improvements in vLLM v0.8?",
"max_sources": 5,
"model": "auto"
}
```
### Response — `POST /search/grounded`
```json
{
"answer": "vLLM v0.8 introduced several key improvements including...",
"sources": [
{
"title": "vLLM v0.8.0 Release Notes",
"url": "https://github.com/vllm-project/vllm/releases",
"snippet": "Key features: improved prefix caching, multi-modal support..."
}
],
"model_used": "qwen2.5-14b",
"cached": false
}
```
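A sketch of the grounded flow behind this endpoint (it assumes the self-hosted SearXNG instance has JSON output enabled; the URL and prompt template are illustrative):
```python
import requests

SEARXNG_URL = "http://searxng:8888/search"  # internal Docker-network address (illustrative)

def grounded_prompt(query: str, max_sources: int = 5) -> str:
    """Fetch web results from SearXNG and build a prompt that asks the LLM to cite them."""
    results = requests.get(
        SEARXNG_URL,
        params={"q": query, "format": "json"},  # JSON output must be enabled in SearXNG settings
        timeout=10,
    ).json().get("results", [])[:max_sources]

    context = "\n".join(
        f"[{i + 1}] {r.get('title', '')} — {r.get('content', '')} ({r.get('url', '')})"
        for i, r in enumerate(results)
    )
    return (
        "Answer the question using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

# The assembled prompt is sent to /query/chat (model: "auto"); sources are returned alongside the answer.
```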
---
# Database Schema
## Entity-Relationship Overview
```
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│ users │ │ api_keys │ │ request_logs │
├─────────────┤ ├─────────────┤ ├─────────────────┤
│ id (PK) │──┐ │ id (PK) │ │ id (PK) │
│ roll_number │ ├───►│ user_id(FK) │ ┌──►│ user_id (FK) │
│ name │ │ │ key_hash │ │ │ api_key_id (FK) │
│ department │ │ │ key_prefix │ │ │ model │
│ role │ │ │ type │ │ │ endpoint │
│ password │ │ │ is_active │ │ │ tokens_in │
│ is_active │ │ │ created_at │ │ │ tokens_out │
│ created_at │ │ │ expires_at │ │ │ latency_ms │
│ updated_at │ │ │ last_used │ │ │ status_code │
└─────────────┘ │ └─────────────┘ │ │ created_at │
│ │ └─────────────────┘
└──────────────────────┘
┌──────────────────┐ ┌─────────────────┐
│ quota_overrides │ │ rag_documents │
├──────────────────┤ ├─────────────────┤
│ id (PK) │ │ id (PK) │
│ user_id (FK) │ │ title │
│ daily_tokens │ │ collection │
│ requests_per_hr │ │ file_type │
│ max_tokens_req │ │ chunk_count │
│ set_by (FK) │ │ uploaded_by(FK) │
│ created_at │ │ created_at │
└──────────────────┘ └─────────────────┘
```
## SQL Schema — Key Tables
### users
```sql
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
roll_number VARCHAR(20) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
department VARCHAR(50) NOT NULL,
role VARCHAR(20) NOT NULL DEFAULT 'student'
CHECK (role IN ('student', 'faculty', 'admin')),
hashed_password VARCHAR(255) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
```
### api_keys
```sql
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
key_hash VARCHAR(255) NOT NULL,
key_prefix VARCHAR(20) NOT NULL,
type VARCHAR(10) NOT NULL DEFAULT 'static'
CHECK (type IN ('static', 'refresh')),
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
last_used TIMESTAMPTZ
);
```
### request_logs
```sql
CREATE TABLE request_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
api_key_id UUID REFERENCES api_keys(id),
model VARCHAR(50) NOT NULL,
endpoint VARCHAR(100) NOT NULL,
tokens_in INTEGER NOT NULL DEFAULT 0,
tokens_out INTEGER NOT NULL DEFAULT 0,
latency_ms INTEGER,
status_code SMALLINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at);
CREATE INDEX idx_request_logs_model ON request_logs (model, created_at);
```
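The `(user_id, created_at)` index supports the per-user daily roll-ups behind `/usage/me`. A sketch of that aggregation with SQLAlchemy (the connection string is illustrative; the real engine lives in `api/core/db.py`):
```python
from sqlalchemy import create_engine, text

# Illustrative connection string; the real engine is configured from environment variables
engine = create_engine("postgresql+psycopg2://mac_admin:password@db:5432/mac")

DAILY_USAGE_SQL = text("""
    SELECT model,
           COUNT(*)                    AS requests,
           SUM(tokens_in + tokens_out) AS tokens
    FROM request_logs
    WHERE user_id = :user_id
      AND created_at >= date_trunc('day', NOW())
    GROUP BY model
""")

def usage_today(user_id) -> dict:
    with engine.connect() as conn:
        rows = conn.execute(DAILY_USAGE_SQL, {"user_id": str(user_id)}).mappings().all()
    return {r["model"]: {"tokens": int(r["tokens"]), "requests": r["requests"]} for r in rows}
```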
---
# Project Folder Structure
```
mac/
├── api/
│ ├── main.py # FastAPI application entry point
│ ├── routers/
│ │ ├── auth.py # /auth endpoints
│ │ ├── explore.py # /explore endpoints
│ │ ├── query.py # /query endpoints
│ │ ├── usage.py # /usage endpoints
│ │ ├── keys.py # /keys endpoints
│ │ ├── models.py # /models endpoints
│ │ ├── integration.py # /integration endpoints
│ │ ├── quota.py # /quota endpoints
│ │ ├── guardrails.py # /guardrails endpoints
│ │ ├── rag.py # /rag endpoints
│ │ └── search.py # /search endpoints
│ ├── models/
│ │ ├── user.py # SQLAlchemy User model
│ │ ├── api_key.py # APIKey model
│ │ ├── request_log.py # RequestLog model
│ │ └── document.py # RAG Document model
│ ├── schemas/
│ │ ├── auth.py # Pydantic request/response schemas
│ │ ├── query.py # Chat, completions, vision schemas
│ │ ├── usage.py # Usage response schemas
│ │ └── common.py # Shared schemas (pagination, errors)
│ ├── core/
│ │ ├── config.py # Settings from environment variables
│ │ ├── security.py # JWT creation/validation, password hashing
│ │ ├── rate_limit.py # Redis-based sliding window rate limiter
│ │ ├── dependencies.py # FastAPI dependency injection (get_current_user, etc.)
│ │ └── db.py # Database session factory
│ ├── services/
│ │ ├── model_router.py # Smart routing logic (auto model selection)
│ │ ├── inference.py # LiteLLM client for model calls
│ │ ├── usage_tracker.py # Token counting and logging
│ │ ├── guardrails.py # Input/output filtering logic
│ │ └── rag.py # Document ingestion, embedding, retrieval
│ └── requirements.txt
├── litellm/
│ └── config.yaml # LiteLLM proxy routing configuration
├── nginx/
│ └── nginx.conf # Reverse proxy configuration
├── frontend/
│ ├── src/
│ │ ├── pages/ # Dashboard, Keys, History, Admin pages
│ │ ├── components/ # Reusable UI components
│ │ └── api/ # API client (Axios/fetch wrappers)
│ ├── package.json
│ └── vite.config.ts
├── scripts/
│ ├── seed_users.py # Bulk-create students from CSV
│ ├── download_models.py # Pull models from HuggingFace
│ └── health_check.py # Verify all services are running
├── docker-compose.yml # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile # FastAPI container
├── .env.example # Template for environment variables
├── alembic/ # Database migrations
│ └── versions/
└── README.md
```
---
# Security Design
| Concern | Implementation |
|---|---|
| **Password Storage** | bcrypt with work factor 12; passwords never stored in plaintext |
| **JWT Signing** | RS256 asymmetric keys; access tokens short-lived (24h) |
| **API Key Storage** | Only SHA-256 hash stored in DB; raw key shown once at creation |
| **Transport** | Nginx terminates TLS (HTTPS); internal services communicate over Docker network |
| **Input Validation** | Pydantic schema validation on every request; max payload size enforced |
| **SQL Injection** | SQLAlchemy ORM with parameterised queries throughout |
| **Rate Limiting** | Redis sliding-window per user + per IP; prevents abuse and DoS |
| **CORS** | Strict origin allowlist — only the MAC frontend domain |
| **Admin Endpoints** | Role-based access control; admin-only routes check JWT role claim |
| **Prompt Injection** | Input guardrails detect and block prompt override attempts |
| **Secrets Management** | All secrets in `.env` file; never committed to version control |
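A minimal sketch of the RS256 token flow with PyJWT (key paths follow the `.env` template in the Deployment Configuration section; claim names are illustrative, and RS256 requires the `cryptography` package):
```python
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

# Key paths match the .env template in the Deployment Configuration section
with open("/run/secrets/jwt_private.pem") as f:
    PRIVATE_KEY = f.read()
with open("/run/secrets/jwt_public.pem") as f:
    PUBLIC_KEY = f.read()

def create_access_token(roll_number: str, role: str, minutes: int = 1440) -> str:
    payload = {
        "sub": roll_number,  # illustrative claim names
        "role": role,        # checked by admin-only routes
        "exp": datetime.now(timezone.utc) + timedelta(minutes=minutes),
    }
    return jwt.encode(payload, PRIVATE_KEY, algorithm="RS256")

def decode_access_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError for bad or expired tokens
    return jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
```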
---
# Error Response Format
All errors follow a consistent structure:
```json
{
"error": {
"code": "authentication_failed",
"message": "Invalid roll number or password.",
"status": 401,
"timestamp": "2026-04-07T14:30:00Z",
"request_id": "mac-req-a1b2c3"
}
}
```
### Standard Error Codes
| HTTP Status | Code | Description |
|---|---|---|
| 400 | `bad_request` | Malformed request body or invalid parameters |
| 401 | `authentication_failed` | Missing or invalid credentials / API key |
| 403 | `forbidden` | Valid auth but insufficient role permissions |
| 404 | `not_found` | Resource does not exist |
| 409 | `conflict` | Duplicate resource (e.g., user already exists) |
| 422 | `validation_error` | Request schema validation failed |
| 429 | `rate_limit_exceeded` | Request or token quota exceeded |
| 500 | `internal_error` | Unexpected server error |
| 503 | `model_unavailable` | Requested model is not loaded or all workers are busy |
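One way to emit this envelope consistently is a global FastAPI exception handler; a sketch (the status-to-code mapping mirrors the table above and is illustrative):
```python
import uuid
from datetime import datetime, timezone

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Illustrative mapping from HTTP status to the error codes in the table above
ERROR_CODES = {400: "bad_request", 401: "authentication_failed", 403: "forbidden",
               404: "not_found", 409: "conflict", 422: "validation_error",
               429: "rate_limit_exceeded", 500: "internal_error", 503: "model_unavailable"}

@app.exception_handler(HTTPException)
async def http_exception_handler(request: Request, exc: HTTPException):
    return JSONResponse(
        status_code=exc.status_code,
        content={"error": {
            "code": ERROR_CODES.get(exc.status_code, "internal_error"),
            "message": str(exc.detail),
            "status": exc.status_code,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "request_id": f"mac-req-{uuid.uuid4().hex[:6]}",
        }},
    )
```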
---
# Deployment Configuration
## Environment Variables (`.env`)
```env
# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production
# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=<generated-secret>
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30
# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=<generated-secret>
# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333
# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b
```
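These variables are loaded into typed settings in `api/core/config.py`; a sketch assuming pydantic v2 with `pydantic-settings` (field names mirror the template above):
```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Typed settings read from the environment / .env file; names mirror the template above."""
    model_config = SettingsConfigDict(env_file=".env")

    mac_host: str = "0.0.0.0"
    mac_port: int = 8000
    mac_env: str = "production"

    postgres_host: str = "db"
    postgres_port: int = 5432
    postgres_db: str = "mac"
    postgres_user: str = "mac_admin"
    postgres_password: str = ""

    redis_host: str = "redis"
    redis_port: int = 6379

    litellm_proxy_url: str = "http://litellm:4000"
    default_model: str = "qwen2.5-14b"

settings = Settings()  # e.g. settings.postgres_host == "db" inside the Docker network
```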
## Docker Compose Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| `api` | Custom (Dockerfile) | 8000 | FastAPI gateway |
| `db` | postgres:16-alpine | 5432 | User data, logs, keys |
| `redis` | redis:7-alpine | 6379 | Rate limiting, caching |
| `litellm` | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| `qdrant` | qdrant/qdrant | 6333 | Vector database (RAG) |
| `searxng` | searxng/searxng | 8888 | Web search engine |
| `nginx` | nginx:alpine | 80, 443 | Reverse proxy + TLS |
---
# API Versioning & Compatibility
- All endpoints are prefixed with `/api/v1`
- The `/query` endpoints return **OpenAI-compatible JSON** — any OpenAI SDK works by changing `base_url`
- Breaking changes will increment the version (`/api/v2`) while maintaining the previous version for one semester
- List endpoints use cursor-based pagination
### OpenAI SDK Compatibility Example
```python
from openai import OpenAI
client = OpenAI(
base_url="http://mac-server/api/v1",
api_key="mac_sk_live_your_key_here"
)
response = client.chat.completions.create(
model="auto",
messages=[
{"role": "user", "content": "Explain binary search in Python"}
]
)
print(response.choices[0].message.content)
```
---
# Summary
MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.
**Phases 1–4** (API + Models + Integration + Usage Control) establish the complete backend — once deployed, students can start querying AI models from their laptops on day one.
---
*MAC — MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026*