
MAC — MBM AI Cloud

API Design & Architecture Plan

Phase 1–8 · Complete Platform Blueprint

Prepared for Professor Review · 07 April 2026


Executive Summary

MAC (MBM AI Cloud) is a fully self-hosted, zero-cloud AI inference platform purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models — text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning — through a standardised REST API authenticated by per-student API keys.

Key design goals:

  • Zero cloud cost — all inference runs on college-owned GPUs
  • OpenAI-compatible API — students use familiar SDKs; existing tutorials work unchanged
  • Scalable from day one — starts on a single PC, scales horizontally to 30+ nodes with no code changes
  • Secure by default — JWT authentication, role-based access, rate limiting, input/output guardrails
  • Academically useful — RAG over college textbooks, web-grounded search, per-student usage tracking

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| API Gateway | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-safe, among the fastest Python frameworks |
| Database | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| Cache / Rate Limiter | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| LLM Inference | vLLM | PagedAttention, continuous batching, high throughput on local GPUs |
| Model Router / Proxy | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| Vector Database | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| Web Search | SearXNG | Self-hosted meta-search, no API keys needed |
| Reverse Proxy | Nginx | TLS termination, request buffering, static file serving |
| Containerisation | Docker + Docker Compose | Reproducible deployments, one-command startup |
| Frontend | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| Task Queue | Celery + Redis | Background jobs: document ingestion, model downloads |

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        College LAN                              │
│   Students / Faculty / Lab PCs                                  │
└─────────────────────────┬───────────────────────────────────────┘
                          │ HTTPS
                          ▼
               ┌─────────────────────┐
               │       Nginx         │
               │   Reverse Proxy     │
               │   TLS · Caching     │
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │      FastAPI        │
               │   API Gateway       │
               │  Auth · Routing     │
               │  Rate Limiting      │
               └───┬────┬────┬───┬───┘
                   │    │    │   │
          ┌────────┘    │    │   └────────┐
          ▼             ▼    ▼            ▼
    ┌──────────┐  ┌─────────────┐  ┌──────────┐
    │PostgreSQL│  │   LiteLLM   │  │  Qdrant  │
    │  Users   │  │   Proxy     │  │ VectorDB │
    │  Logs    │  │  Routing    │  │  (RAG)   │
    │  Keys    │  └──────┬──────┘  └──────────┘
    └──────────┘         │
                         ▼
               ┌─────────────────────┐
               │       vLLM          │
               │  Model Workers      │
               │  GPU Inference      │
               └─────────────────────┘

Scaling model: The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.


Build Phases — 8 Phases, Sequential

| # | Phase | Description | Dependencies |
| --- | --- | --- | --- |
| 1 | API Endpoints | Core REST API — explore, query, usage, auth | None |
| 2 | LLM Models | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | API–Model Integration | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phase 1, 2 |
| 4 | API Usage Control | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | Web Interface | Dashboard, user management, admin panel, model access controls | Phase 1, 4 |
| 6 | Guardrails | Input + output content filtering, safety checks | Phase 3 |
| 7 | Knowledgebase + RAG | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | Retrieval + Search | SearXNG web search, Wikipedia, real-time grounded answers | Phase 3, 7 |

Base URL

http://<server-ip>/api/v1

The server IP is configured via environment variable MAC_HOST. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.
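A minimal client sketch using the requests library. MAC_SERVER and MAC_API_KEY are illustrative environment variable names, not names defined elsewhere in this document; the /query/chat endpoint and payload shape are specified in Phase 1 below.

```python
# Minimal client sketch. MAC_SERVER and MAC_API_KEY are illustrative
# environment variables; reading them from the environment mirrors the
# "no hardcoded IPs" rule above.
import os
import requests

BASE_URL = f"http://{os.environ['MAC_SERVER']}/api/v1"

resp = requests.post(
    f"{BASE_URL}/query/chat",
    headers={"Authorization": f"Bearer {os.environ['MAC_API_KEY']}"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello, MAC!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```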


Phase 1 — API Endpoints

1.1 Authentication — /auth

Handles user login, session management, and JWT token lifecycle. All passwords are hashed with bcrypt (work factor 12). Tokens use RS256 JWT signing.
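A sketch of those two primitives using the bcrypt and PyJWT libraries — plausible choices here, not mandated by this document. The claims and the 24-hour expiry follow the token table below.

```python
# Sketch of password hashing (bcrypt, work factor 12) and RS256 token
# signing with PyJWT. Library choice is illustrative.
from datetime import datetime, timedelta, timezone

import bcrypt
import jwt  # PyJWT


def hash_password(password: str) -> str:
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt(rounds=12)).decode()


def verify_password(password: str, hashed: str) -> bool:
    return bcrypt.checkpw(password.encode(), hashed.encode())


def create_access_token(roll_number: str, role: str, private_key_pem: str) -> str:
    claims = {
        "sub": roll_number,
        "role": role,  # checked by admin-only routes
        "exp": datetime.now(timezone.utc) + timedelta(hours=24),
    }
    return jwt.encode(claims, private_key_pem, algorithm="RS256")
```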

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /auth/login | Roll number + password → JWT access token + refresh token |
| POST | /auth/logout | Invalidate current session / revoke refresh token |
| POST | /auth/refresh | Exchange refresh token for a new access token |
| GET | /auth/me | Current user's profile, role, department, and API key |
| POST | /auth/change-password | Change password (requires current password verification) |

Request — POST /auth/login

{
  "roll_number": "21CS045",
  "password": "secure_password"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| roll_number | string | Yes | Student/faculty roll number, e.g. 21CS045 |
| password | string | Yes | Account password (min 8 chars) |

Response — POST /auth/login

{
  "access_token": "eyJhbG...",
  "refresh_token": "dGhpcyBp...",
  "token_type": "bearer",
  "expires_in": 86400,
  "user": {
    "roll_number": "21CS045",
    "name": "Aryan Sharma",
    "department": "CSE",
    "role": "student",
    "api_key": "mac_sk_live_abc123..."
  }
}
| Field | Description |
| --- | --- |
| access_token | JWT, 24-hour expiry, used for all authenticated requests |
| refresh_token | 30-day expiry, used only at /auth/refresh |
| user.role | One of: student, faculty, admin |
| user.api_key | Personal API key for direct model queries |

1.2 Explore — /explore

Read-only discovery endpoints that let students see which models and capabilities are available before writing code.

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /explore/models | List all deployed models with capabilities, context length, and status |
| GET | /explore/models/{model_id} | Detailed info — parameters, benchmarks, example prompts |
| GET | /explore/models/search | Search by capability tag: ?tag=vision, ?tag=code, ?tag=math |
| GET | /explore/endpoints | List every API endpoint with method, path, auth requirement, description |
| GET | /explore/health | Platform health — node status, GPU temperatures, inference queue depth |
| GET | /explore/usage-stats | Aggregated platform analytics (admin-only) — tokens/day, active users |

Response — GET /explore/models

{
  "models": [
    {
      "id": "qwen2.5-coder-7b",
      "name": "Qwen2.5-Coder 7B",
      "specialty": "Code generation, debugging, explanation",
      "parameters": "7B",
      "context_length": 32768,
      "status": "loaded",
      "capabilities": ["code", "chat", "completion"]
    }
  ],
  "total": 5
}

Response — GET /explore/health

{
  "status": "healthy",
  "uptime_seconds": 345600,
  "nodes": [
    {
      "ip": "192.168.1.101",
      "gpu": "NVIDIA RTX 3060 12GB",
      "gpu_temp_c": 62,
      "vram_used_gb": 9.2,
      "vram_total_gb": 12.0,
      "requests_in_flight": 3,
      "status": "active"
    }
  ],
  "queue_depth": 7
}

1.3 Query — /query

The core inference API. All endpoints accept an Authorization: Bearer <api_key> header. Responses follow the OpenAI Chat Completions format — existing OpenAI SDK code works with zero changes by swapping base_url.

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /query/chat | Chat completion — multi-turn conversation (text/code/math) |
| POST | /query/completions | Raw text completion (OpenAI-compatible) |
| POST | /query/vision | Image + text → answer (vision model) |
| POST | /query/speech-to-text | Upload audio → transcribed text (Whisper) |
| POST | /query/text-to-speech | Text → audio file download (TTS) |
| POST | /query/embeddings | Text → vector embedding (for RAG / similarity search) |
| POST | /query/rerank | Re-rank passages by relevance to a query |

Request — POST /query/chat

{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a linked list."}
  ],
  "temperature": 0.7,
  "max_tokens": 2048,
  "stream": false
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID or "auto" for smart routing |
| messages | array | Yes | Array of {role, content} — system / user / assistant |
| temperature | float | No | Sampling temperature 0.0–2.0 (default 0.7) |
| max_tokens | integer | No | Max tokens to generate (default 2048, max 4096) |
| stream | boolean | No | Stream tokens as Server-Sent Events (default false) |
| context_id | string | No | Maintain conversation state server-side |

Response — POST /query/chat

{
  "id": "mac-chat-a1b2c3d4",
  "object": "chat.completion",
  "created": 1743984000,
  "model": "qwen2.5-coder-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a Python function to reverse a linked list..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 187,
    "total_tokens": 229
  }
}
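With stream set to true the gateway emits Server-Sent Events, which the OpenAI SDK (see the compatibility example at the end of this document) surfaces as an iterator of chunks. A sketch, with placeholder host and key:

```python
# Streaming sketch: stream=True yields Server-Sent Events, exposed by
# the OpenAI SDK as chunk objects. Host and key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://mac-server/api/v1", api_key="mac_sk_live_...")

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain recursion in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```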

Smart Routing (model: "auto")

When students set model to "auto", the API gateway inspects the request and routes to the optimal model:

| Signal Detected | Routed To |
| --- | --- |
| Code keywords, programming language names, "write a function", "debug this" | qwen2.5-coder-7b |
| Math expressions, equations, "solve", "prove", "step by step" | deepseek-r1-8b |
| Image attachment in request body | llava-1.6-7b |
| Audio file upload | whisper-large-v3 |
| General text, summarisation, writing, Q&A | qwen2.5-14b |
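An illustrative sketch of this routing logic. The production rule set lives behind /integration/routing-rules (Phase 3); the keywords below simply mirror the table, and the image-attachment check assumes the OpenAI-style multimodal content format.

```python
# Illustrative keyword router mirroring the table above. Audio uploads
# are routed before this text check and are omitted here.
CODE_HINTS = {"function", "debug", "code", "python", "javascript"}
MATH_HINTS = {"solve", "equation", "prove", "step by step"}


def route(messages: list[dict]) -> str:
    # Vision: any message whose content carries an image attachment
    # (OpenAI-style content parts are assumed).
    for m in messages:
        if isinstance(m["content"], list) and any(
            part.get("type") == "image_url" for part in m["content"]
        ):
            return "llava-1.6-7b"
    text = " ".join(
        m["content"] for m in messages if isinstance(m["content"], str)
    ).lower()
    if any(hint in text for hint in CODE_HINTS):
        return "qwen2.5-coder-7b"
    if any(hint in text for hint in MATH_HINTS):
        return "deepseek-r1-8b"
    return "qwen2.5-14b"  # general-purpose default
```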

1.4 Usage — /usage

Per-student consumption tracking. Every API call logs token counts, model used, and timestamp.

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /usage/me | My tokens used — today, this week, this month, broken down by model |
| GET | /usage/me/history | Full request history — timestamps, models, token counts, latency |
| GET | /usage/me/quota | My current quota limits and remaining balance |
| GET | /usage/admin/all | All users' usage summary (admin-only) |
| GET | /usage/admin/user/{roll_number} | Specific student's full usage details (admin-only) |
| GET | /usage/admin/models | Per-model usage statistics across the platform (admin-only) |

Response — GET /usage/me

{
  "roll_number": "21CS045",
  "usage": {
    "today": {
      "total_tokens": 12450,
      "requests": 23,
      "by_model": {
        "qwen2.5-coder-7b": {"tokens": 8200, "requests": 15},
        "qwen2.5-14b": {"tokens": 4250, "requests": 8}
      }
    },
    "this_week": {"total_tokens": 67800, "requests": 142},
    "this_month": {"total_tokens": 234500, "requests": 487}
  },
  "quota": {
    "daily_limit": 50000,
    "remaining_today": 37550
  }
}

Phase 2 — LLM Models

2.1 Model Selection

Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM.

| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
| --- | --- | --- | --- | --- | --- |
| qwen2.5-coder-7b | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| deepseek-r1-8b | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| llava-1.6-7b | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| whisper-large-v3 | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| qwen2.5-14b | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |

Total VRAM for all 5 models: ~32 GB (fits on dual-GPU setups or loads models on demand)

Feasibility on Single PC

For a single-PC deployment with one GPU (e.g., an RTX 3060 12 GB or an RTX 4070 Ti SUPER 16 GB):

  • Models are loaded/unloaded on demand β€” only the requested model occupies VRAM at inference time
  • A model loading queue ensures graceful transitions
  • Frequently-used models remain resident; idle models are evicted after a configurable timeout

2.2 Model Management Endpoints — /models

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /models | List all models with status: loaded / downloading / queued / offline |
| GET | /models/{model_id} | Single model details — context length, capabilities, benchmark scores |
| POST | /models/{model_id}/load | Load model into GPU memory (admin-only) |
| POST | /models/{model_id}/unload | Unload model from VRAM to free resources (admin-only) |
| GET | /models/{model_id}/health | Ping model — returns latency, memory usage, ready status |
| POST | /models/download | Download a model from HuggingFace by ID (admin-only) |
| GET | /models/download/{task_id} | Check download progress for a model download task |

Response — GET /models

{
  "models": [
    {
      "id": "qwen2.5-coder-7b",
      "name": "Qwen2.5-Coder 7B",
      "status": "loaded",
      "vram_mb": 5120,
      "context_length": 32768,
      "capabilities": ["code", "chat"],
      "loaded_at": "2026-04-07T08:30:00Z"
    },
    {
      "id": "deepseek-r1-8b",
      "name": "DeepSeek-R1 8B",
      "status": "offline",
      "vram_mb": 6144,
      "context_length": 65536,
      "capabilities": ["reasoning", "math", "chat"],
      "loaded_at": null
    }
  ]
}

Phase 3 — API–Model Integration

3.1 Architecture

LiteLLM Proxy sits between the FastAPI gateway and vLLM workers. It:

  1. Translates every /query request into the correct vLLM inference call
  2. Handles smart routing (auto-model selection)
  3. Load-balances across multiple workers (when scaled)
  4. Retries on worker failure with exponential backoff
  5. Returns OpenAI-compatible JSON responses
FastAPI Gateway  ──►  LiteLLM Proxy  ──►  vLLM Worker(s)
                      (routing +           (GPU inference)
                       load balance)
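A sketch of the retry behaviour in step 4, assuming an httpx client in front of the proxy; the URL matches the LITELLM_PROXY_URL default in the deployment section, and the function name is illustrative.

```python
# Sketch of retry-with-exponential-backoff against the LiteLLM proxy.
# httpx is an assumed client library; the URL mirrors LITELLM_PROXY_URL.
import time

import httpx


def call_with_retries(payload: dict, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        try:
            r = httpx.post(
                "http://litellm:4000/v1/chat/completions",
                json=payload,
                timeout=120,
            )
            r.raise_for_status()
            return r.json()
        except httpx.HTTPError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted — surface the error
            time.sleep(0.5 * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
```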

3.2 Integration Endpoints — /integration

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /integration/routing-rules | Show current routing rules (which task type → which model) |
| PUT | /integration/routing-rules | Update routing rules (admin-only) |
| GET | /integration/workers | List all vLLM worker nodes and their current load |
| GET | /integration/workers/{node_id} | Single worker — GPU temp, VRAM used, requests in flight |
| POST | /integration/workers/{node_id}/drain | Mark worker as draining — no new requests routed (admin-only) |
| GET | /integration/queue | Current global inference queue depth across all workers |

Response — GET /integration/routing-rules

{
  "rules": [
    {"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
    {"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
    {"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
    {"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
    {"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
  ]
}

3.3 Scaling Strategy

| Deployment | Configuration |
| --- | --- |
| Single PC | One vLLM process, models loaded/swapped on demand |
| 2–5 PCs | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| 6–30 PCs | Multiple workers per model for redundancy; least-busy routing; auto-failover |

All worker addresses are stored in configuration (environment variables / config file) — no IPs are hardcoded. Adding a new node requires only updating the LiteLLM config and restarting the proxy.


Phase 4 — API Usage Control

4.1 API Key Management — /keys

Every student receives a unique API key upon account creation. Two key types are supported:

| Key Type | Behaviour |
| --- | --- |
| Static | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| Refresh | Auto-rotates every 30 days. Old key has a 48-hour grace period. Ideal for long-running scripts. |

Key format: mac_sk_live_<32-char-random-hex>
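Key issuance can be sketched as follows: only the SHA-256 hash and a display prefix are persisted (see the api_keys table and Security Design), and the raw key is returned to the student exactly once.

```python
# Sketch of key issuance matching the stated format. Only the SHA-256
# hash and a masked-display prefix are stored; the raw key is shown once.
import hashlib
import secrets


def generate_api_key() -> tuple[str, str, str]:
    raw = "mac_sk_live_" + secrets.token_hex(16)  # 32 random hex chars
    key_hash = hashlib.sha256(raw.encode()).hexdigest()  # -> api_keys.key_hash
    key_prefix = raw[:16]  # -> api_keys.key_prefix, for masked display
    return raw, key_hash, key_prefix
```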

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /keys/my-key | Get current API key (partially masked) and metadata |
| POST | /keys/generate | Generate a new API key (invalidates the previous one) |
| GET | /keys/my-key/stats | Tokens consumed against this key — today, week, month |
| DELETE | /keys/my-key | Revoke current key permanently (must re-generate) |
| GET | /keys/admin/all | List all student API keys and status (admin-only) |
| POST | /keys/admin/revoke | Force-revoke a specific student's API key (admin-only) |

Response — GET /keys/my-key

{
  "key": "mac_sk_live_a1b2...****",
  "type": "refresh",
  "created_at": "2026-03-01T10:00:00Z",
  "expires_at": "2026-03-31T10:00:00Z",
  "last_used": "2026-04-07T14:23:00Z",
  "status": "active",
  "total_requests": 487
}

4.2 Rate Limiting & Quotas — /quota

Rate limits are enforced at the Redis layer using a sliding-window algorithm. Limits are per-role and can be overridden per-user.
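A minimal sketch of the sliding-window check using redis-py sorted sets; the key naming and the one-hour window are illustrative, not the production values.

```python
# Sliding-window rate limit sketch with redis-py sorted sets.
import time
import uuid

import redis

r = redis.Redis(host="redis", port=6379)


def allow_request(user_id: str, limit: int, window_s: int = 3600) -> bool:
    key = f"ratelimit:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop expired entries
    pipe.zadd(key, {uuid.uuid4().hex: now})        # record this request
    pipe.zcard(key)                                # count within the window
    pipe.expire(key, window_s)                     # housekeeping TTL
    _, _, count, _ = pipe.execute()
    return count <= limit
```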

| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
| --- | --- | --- | --- |
| Student | 50,000 | 100 | 4,096 |
| Faculty | 200,000 | 500 | 8,192 |
| Admin | Unlimited | Unlimited | 16,384 |

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /quota/limits | Show default quota limits per role |
| GET | /quota/me | My personal limits and current consumption |
| PUT | /quota/admin/user/{roll_number} | Override quota for a specific user (admin-only) |
| GET | /quota/admin/exceeded | List users who have exceeded their daily quota (admin-only) |

Response — GET /quota/me

{
  "role": "student",
  "limits": {
    "daily_tokens": 50000,
    "requests_per_hour": 100,
    "max_tokens_per_request": 4096
  },
  "current": {
    "tokens_used_today": 12450,
    "requests_this_hour": 8,
    "remaining_tokens": 37550,
    "resets_at": "2026-04-08T00:00:00Z"
  }
}

Rate Limit Response Headers

Every API response includes:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1743987600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1744070400

When a limit is exceeded, the API returns:

HTTP 429 Too Many Requests
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
    "retry_after": 847
  }
}

Phase 5 — Web Interface

A React-based single-page application served by the FastAPI backend. Three views: Student Dashboard, Key Management, and Admin Panel.

5.1 Web Interface Endpoints — /ui

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /ui/dashboard | Student home — usage chart, quick-start code snippets, model status cards |
| GET | /ui/keys | Key management — view, copy, regenerate API key |
| GET | /ui/history | Request history table — timestamp, model, tokens, latency, status |
| GET | /ui/playground | Interactive chat playground — test models directly in the browser |

5.2 Admin Panel — /ui/admin

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /ui/admin/users | Full user list with roles, quotas, last active, status |
| POST | /ui/admin/users/create | Create user / bulk-create from CSV (roll no., name, department) |
| PUT | /ui/admin/users/{roll_number} | Edit user — change role, quota overrides, model access |
| DELETE | /ui/admin/users/{roll_number} | Deactivate a user account |
| GET | /ui/admin/models | Model management — load/unload, see which node serves what |
| GET | /ui/admin/logs | Live request logs — error rates, latency percentiles, throughput |
| GET | /ui/admin/analytics | Platform analytics — daily active users, peak hours, top models |

Dashboard Wireframe

┌──────────────────────────────────────────────────────┐
│  MAC — MBM AI Cloud                    [Profile ▼]   │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐              │
│  │ Tokens  │  │Requests │  │  Quota  │              │
│  │  Today  │  │  Today  │  │Remaining│              │
│  │ 12,450  │  │   23    │  │  75.1%  │              │
│  └─────────┘  └─────────┘  └─────────┘              │
│                                                      │
│  Models Available          Quick Start               │
│  ┌──────────────────┐     ┌─────────────────────┐   │
│  │ ✅ Qwen2.5-Coder │     │ from openai import  │   │
│  │ ✅ DeepSeek-R1   │     │ OpenAI              │   │
│  │ ✅ Qwen2.5 14B   │     │ client = OpenAI(    │   │
│  │ ⏳ LLaVA (load.) │     │   base_url=         │   │
│  │ ✅ Whisper       │     │   "http://mac/v1",  │   │
│  └──────────────────┘     │   api_key="mac_sk_" │   │
│                           │ )                   │   │
│  Usage This Week          └─────────────────────┘   │
│  ┌──────────────────┐                               │
│  │ ▁▃▅▇▅▃▁ (chart)  │                               │
│  └──────────────────┘                               │
└──────────────────────────────────────────────────────┘

Phase 6 — Guardrails

Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.

6.1 Guardrails Endpoints — /guardrails

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /guardrails/check-input | Run input text through content filter before sending to model |
| POST | /guardrails/check-output | Run model output through safety filter before returning to user |
| GET | /guardrails/rules | List active guardrail rules (admin-only) |
| PUT | /guardrails/rules | Update rules — blocked categories, max prompt length (admin-only) |

6.2 Filtering Pipeline

User Input  ──►  Input Filter  ──►  Model Inference  ──►  Output Filter  ──►  Response
                 │                                        │
                 ├─ Prompt injection detection            ├─ PII/sensitive data redaction
                 ├─ Blocked topic detection               ├─ Harmful content detection
                 ├─ Max prompt length enforcement         ├─ Hallucination disclaimer
                 └─ Academic integrity checks             └─ Source attribution

Guardrail Categories

| Category | Action | Description |
| --- | --- | --- |
| Prompt Injection | Block + log | Detect attempts to override system prompts |
| Harmful Content | Block | Violence, self-harm, illegal activities |
| Academic Dishonesty | Flag + disclaimer | Full essay/assignment generation adds academic integrity notice |
| PII in Output | Redact | Strip emails, phone numbers, addresses from model output |
| Max Prompt Length | Reject | Configurable per-role; prevents resource abuse |
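An illustrative sketch covering two of the categories above: injection-phrase detection on input and PII redaction on output. The patterns are toy examples, not the production rule set served by /guardrails/rules.

```python
# Toy sketch of two guardrail categories. Patterns are illustrative only.
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.I),
    re.compile(r"reveal (the|your) system prompt", re.I),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{10}\b")


def check_input(text: str) -> dict:
    # Mirrors the /guardrails/check-input response shape.
    flags = ["prompt_injection" for p in INJECTION_PATTERNS if p.search(text)]
    return {"safe": not flags, "flags": flags, "modified_text": None}


def filter_output(text: str) -> str:
    # Redact emails and 10-digit phone numbers from model output.
    return PHONE.sub("[redacted]", EMAIL.sub("[redacted]", text))
```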

Request — POST /guardrails/check-input

{
  "text": "User input to check",
  "context": "chat"
}

Response

{
  "safe": true,
  "flags": [],
  "modified_text": null
}

Phase 7 — Knowledgebase + RAG

Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.

7.1 RAG Endpoints — /rag

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /rag/ingest | Upload PDF / DOCX / TXT — chunk, embed, store in Qdrant vector DB |
| GET | /rag/documents | List all ingested documents with metadata (title, pages, chunk count) |
| GET | /rag/documents/{id} | Single document details and its chunks |
| DELETE | /rag/documents/{id} | Remove a document from the knowledgebase (admin-only) |
| POST | /rag/query | Ask a question — retrieves top-k relevant chunks → sends to LLM with context |
| GET | /rag/query/{query_id}/sources | Get source citations for a RAG response |
| POST | /rag/collections | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| GET | /rag/collections | List all collections |

7.2 RAG Pipeline

Document Upload  ──►  Chunking (512 tokens, 50 overlap)
                              │
                              ▼
                      Embedding (768-dim)
                              │
                              ▼
                      Qdrant Vector DB
                              │
     User Question  ──►  Embed Query  ──►  Similarity Search (top-k=5)
                                                    │
                                                    ▼
                                          Retrieved Chunks + Question
                                                    │
                                                    ▼
                                              LLM Generation
                                                    │
                                                    ▼
                                          Answer with Citations
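The chunking step can be sketched as below; whitespace-separated words stand in for real tokens to keep the example short, so "512 tokens" is approximate here.

```python
# Sketch of fixed-size chunking with overlap (512 tokens, 50 overlap).
# Whitespace words stand in for real tokenizer tokens.
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start : start + size]))
    return chunks
```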

Request — POST /rag/query

{
  "question": "Explain the difference between process and thread in operating systems",
  "collection": "OS",
  "top_k": 5,
  "model": "auto"
}

Response — POST /rag/query

{
  "answer": "A process is an independent program in execution with its own memory space...",
  "sources": [
    {
      "document": "Galvin - Operating System Concepts, 10th Ed",
      "chapter": "Chapter 3: Processes",
      "page": 105,
      "relevance_score": 0.92,
      "chunk_preview": "A process is a program in execution. A process is more than..."
    },
    {
      "document": "Galvin - Operating System Concepts, 10th Ed",
      "chapter": "Chapter 4: Threads",
      "page": 163,
      "relevance_score": 0.89,
      "chunk_preview": "A thread is a basic unit of CPU utilization..."
    }
  ],
  "model_used": "qwen2.5-14b",
  "tokens_used": 847
}

Phase 8 — Retrieval + Search

Real-time web grounding via self-hosted SearXNG. This enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.

8.1 Search Endpoints — /search

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /search/web | Query SearXNG — aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| POST | /search/wikipedia | Targeted Wikipedia search with summary extraction |
| POST | /search/grounded | Search + LLM — retrieves web results, then generates a cited answer |
| GET | /search/cache | List recently cached search results (reduces redundant fetches) |

Request — POST /search/grounded

{
  "query": "What are the latest improvements in vLLM v0.8?",
  "max_sources": 5,
  "model": "auto"
}

Response — POST /search/grounded

{
  "answer": "vLLM v0.8 introduced several key improvements including...",
  "sources": [
    {
      "title": "vLLM v0.8.0 Release Notes",
      "url": "https://github.com/vllm-project/vllm/releases",
      "snippet": "Key features: improved prefix caching, multi-modal support..."
    }
  ],
  "model_used": "qwen2.5-14b",
  "cached": false
}

Database Schema

Entity-Relationship Overview

┌─────────────┐       ┌─────────────┐       ┌─────────────────┐
│    users    │       │  api_keys   │       │  request_logs   │
├─────────────┤       ├─────────────┤       ├─────────────────┤
│ id (PK)     │──┐    │ id (PK)     │       │ id (PK)         │
│ roll_number │  ├───►│ user_id(FK) │   ┌──►│ user_id (FK)    │
│ name        │  │    │ key_hash    │   │   │ api_key_id (FK) │
│ department  │  │    │ key_prefix  │   │   │ model           │
│ role        │  │    │ type        │   │   │ endpoint        │
│ password    │  │    │ is_active   │   │   │ tokens_in       │
│ is_active   │  │    │ created_at  │   │   │ tokens_out      │
│ created_at  │  │    │ expires_at  │   │   │ latency_ms      │
│ updated_at  │  │    │ last_used   │   │   │ status_code     │
└─────────────┘  │    └─────────────┘   │   │ created_at      │
                 │                      │   └─────────────────┘
                 └──────────────────────┘

┌──────────────────┐       ┌─────────────────┐
│  quota_overrides │       │  rag_documents  │
├──────────────────┤       ├─────────────────┤
│ id (PK)          │       │ id (PK)         │
│ user_id (FK)     │       │ title           │
│ daily_tokens     │       │ collection      │
│ requests_per_hr  │       │ file_type       │
│ max_tokens_req   │       │ chunk_count     │
│ set_by (FK)      │       │ uploaded_by(FK) │
│ created_at       │       │ created_at      │
└──────────────────┘       └─────────────────┘

SQL Schema — Key Tables

users

CREATE TABLE users (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    roll_number     VARCHAR(20) UNIQUE NOT NULL,
    name            VARCHAR(100) NOT NULL,
    department      VARCHAR(50) NOT NULL,
    role            VARCHAR(20) NOT NULL DEFAULT 'student'
                    CHECK (role IN ('student', 'faculty', 'admin')),
    hashed_password VARCHAR(255) NOT NULL,
    is_active       BOOLEAN DEFAULT TRUE,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);

api_keys

CREATE TABLE api_keys (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    key_hash    VARCHAR(255) NOT NULL,
    key_prefix  VARCHAR(20) NOT NULL,
    type        VARCHAR(10) NOT NULL DEFAULT 'static'
                CHECK (type IN ('static', 'refresh')),
    is_active   BOOLEAN DEFAULT TRUE,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    expires_at  TIMESTAMPTZ,
    last_used   TIMESTAMPTZ
);

request_logs

CREATE TABLE request_logs (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id     UUID NOT NULL REFERENCES users(id),
    api_key_id  UUID REFERENCES api_keys(id),
    model       VARCHAR(50) NOT NULL,
    endpoint    VARCHAR(100) NOT NULL,
    tokens_in   INTEGER NOT NULL DEFAULT 0,
    tokens_out  INTEGER NOT NULL DEFAULT 0,
    latency_ms  INTEGER,
    status_code SMALLINT NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at);
CREATE INDEX idx_request_logs_model ON request_logs (model, created_at);

Project Folder Structure

mac/
├── api/
│   ├── main.py                 # FastAPI application entry point
│   ├── routers/
│   │   ├── auth.py             # /auth endpoints
│   │   ├── explore.py          # /explore endpoints
│   │   ├── query.py            # /query endpoints
│   │   ├── usage.py            # /usage endpoints
│   │   ├── keys.py             # /keys endpoints
│   │   ├── models.py           # /models endpoints
│   │   ├── integration.py      # /integration endpoints
│   │   ├── quota.py            # /quota endpoints
│   │   ├── guardrails.py       # /guardrails endpoints
│   │   ├── rag.py              # /rag endpoints
│   │   └── search.py           # /search endpoints
│   ├── models/
│   │   ├── user.py             # SQLAlchemy User model
│   │   ├── api_key.py          # APIKey model
│   │   ├── request_log.py      # RequestLog model
│   │   └── document.py         # RAG Document model
│   ├── schemas/
│   │   ├── auth.py             # Pydantic request/response schemas
│   │   ├── query.py            # Chat, completions, vision schemas
│   │   ├── usage.py            # Usage response schemas
│   │   └── common.py           # Shared schemas (pagination, errors)
│   ├── core/
│   │   ├── config.py           # Settings from environment variables
│   │   ├── security.py         # JWT creation/validation, password hashing
│   │   ├── rate_limit.py       # Redis-based sliding window rate limiter
│   │   ├── dependencies.py     # FastAPI dependency injection (get_current_user, etc.)
│   │   └── db.py               # Database session factory
│   ├── services/
│   │   ├── model_router.py     # Smart routing logic (auto model selection)
│   │   ├── inference.py        # LiteLLM client for model calls
│   │   ├── usage_tracker.py    # Token counting and logging
│   │   ├── guardrails.py       # Input/output filtering logic
│   │   └── rag.py              # Document ingestion, embedding, retrieval
│   └── requirements.txt
├── litellm/
│   └── config.yaml             # LiteLLM proxy routing configuration
├── nginx/
│   └── nginx.conf              # Reverse proxy configuration
├── frontend/
│   ├── src/
│   │   ├── pages/              # Dashboard, Keys, History, Admin pages
│   │   ├── components/         # Reusable UI components
│   │   └── api/                # API client (Axios/fetch wrappers)
│   ├── package.json
│   └── vite.config.ts
├── scripts/
│   ├── seed_users.py           # Bulk-create students from CSV
│   ├── download_models.py      # Pull models from HuggingFace
│   └── health_check.py         # Verify all services are running
├── docker-compose.yml          # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile                  # FastAPI container
├── .env.example                # Template for environment variables
├── alembic/                    # Database migrations
│   └── versions/
└── README.md

Security Design

| Concern | Implementation |
| --- | --- |
| Password Storage | bcrypt with work factor 12; passwords never stored in plaintext |
| JWT Signing | RS256 asymmetric keys; access tokens short-lived (24h) |
| API Key Storage | Only SHA-256 hash stored in DB; raw key shown once at creation |
| Transport | Nginx terminates TLS (HTTPS); internal services communicate over Docker network |
| Input Validation | Pydantic schema validation on every request; max payload size enforced |
| SQL Injection | SQLAlchemy ORM with parameterised queries throughout |
| Rate Limiting | Redis sliding-window per user + per IP; prevents abuse and DoS |
| CORS | Strict origin allowlist — only the MAC frontend domain |
| Admin Endpoints | Role-based access control; admin-only routes check JWT role claim |
| Prompt Injection | Input guardrails detect and block prompt override attempts |
| Secrets Management | All secrets in .env file; never committed to version control |

Error Response Format

All errors follow a consistent structure:

{
  "error": {
    "code": "authentication_failed",
    "message": "Invalid roll number or password.",
    "status": 401,
    "timestamp": "2026-04-07T14:30:00Z",
    "request_id": "mac-req-a1b2c3"
  }
}

Standard Error Codes

| HTTP Status | Code | Description |
| --- | --- | --- |
| 400 | bad_request | Malformed request body or invalid parameters |
| 401 | authentication_failed | Missing or invalid credentials / API key |
| 403 | forbidden | Valid auth but insufficient role permissions |
| 404 | not_found | Resource does not exist |
| 409 | conflict | Duplicate resource (e.g., user already exists) |
| 422 | validation_error | Request schema validation failed |
| 429 | rate_limit_exceeded | Request or token quota exceeded |
| 500 | internal_error | Unexpected server error |
| 503 | model_unavailable | Requested model is not loaded or all workers are busy |

Deployment Configuration

Environment Variables (.env)

# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production

# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=<generated-secret>

# Redis
REDIS_HOST=redis
REDIS_PORT=6379

# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30

# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=<generated-secret>

# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333

# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b

Docker Compose Services

| Service | Image | Ports | Purpose |
| --- | --- | --- | --- |
| api | Custom (Dockerfile) | 8000 | FastAPI gateway |
| db | postgres:16-alpine | 5432 | User data, logs, keys |
| redis | redis:7-alpine | 6379 | Rate limiting, caching |
| litellm | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| qdrant | qdrant/qdrant | 6333 | Vector database (RAG) |
| searxng | searxng/searxng | 8888 | Web search engine |
| nginx | nginx:alpine | 80, 443 | Reverse proxy + TLS |

API Versioning & Compatibility

  • All endpoints are prefixed with /api/v1
  • The /query endpoints return OpenAI-compatible JSON β€” any OpenAI SDK works by changing base_url
  • Breaking changes will increment the version (/api/v2) while maintaining the previous version for one semester
  • Response pagination follows cursor-based pagination for list endpoints

OpenAI SDK Compatibility Example

from openai import OpenAI

client = OpenAI(
    base_url="http://mac-server/api/v1",
    api_key="mac_sk_live_your_key_here"
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "Explain binary search in Python"}
    ]
)

print(response.choices[0].message.content)

Summary

MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.

Phases 1–4 (API + Models + Integration + Usage Control) establish the complete backend — once deployed, students can start querying AI models from their laptops on day one.


MAC — MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026