MAC – MBM AI Cloud
API Design & Architecture Plan
Phase 1–8 · Complete Platform Blueprint
Prepared for Professor Review · 07 April 2026
Executive Summary
MAC (MBM AI Cloud) is a fully self-hosted, zero-cloud AI inference platform purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models – text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning – through a standardised REST API authenticated by per-student API keys.
Key design goals:
- Zero cloud cost – all inference runs on college-owned GPUs
- OpenAI-compatible API – students use familiar SDKs; existing tutorials work unchanged
- Scalable from day one – starts on a single PC, scales horizontally to 30+ nodes with no code changes
- Secure by default – JWT authentication, role-based access, rate limiting, input/output guardrails
- Academically useful – RAG over college textbooks, web-grounded search, per-student usage tracking
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| API Gateway | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-hinted validation, one of the fastest Python web frameworks |
| Database | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| Cache / Rate Limiter | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| LLM Inference | vLLM | PagedAttention, continuous batching, highest throughput for local GPUs |
| Model Router / Proxy | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| Vector Database | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| Web Search | SearXNG | Self-hosted meta-search, no API keys needed |
| Reverse Proxy | Nginx | TLS termination, request buffering, static file serving |
| Containerisation | Docker + Docker Compose | Reproducible deployments, one-command startup |
| Frontend | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| Task Queue | Celery + Redis | Background jobs: document ingestion, model downloads |
Architecture Overview
+-------------------------------------------------------------------+
|                           College LAN                             |
|                   Students / Faculty / Lab PCs                     |
+---------------------------------+---------------------------------+
                                  | HTTPS
                                  v
                       +----------------------+
                       |        Nginx         |
                       |    Reverse Proxy     |
                       |    TLS · Caching     |
                       +----------+-----------+
                                  |
                       +----------v-----------+
                       |       FastAPI        |
                       |     API Gateway      |
                       |    Auth · Routing    |
                       |    Rate Limiting     |
                       +---+------+-------+---+
                           |      |       |
            +--------------+      |       +--------------+
            v                     v                      v
    +--------------+     +----------------+      +--------------+
    |  PostgreSQL  |     |    LiteLLM     |      |    Qdrant    |
    |    Users     |     |     Proxy      |      |   VectorDB   |
    |    Logs      |     |    Routing     |      |    (RAG)     |
    |    Keys      |     +-------+--------+      +--------------+
    +--------------+             |
                                 v
                       +----------------------+
                       |         vLLM         |
                       |    Model Workers     |
                       |    GPU Inference     |
                       +----------------------+
Scaling model: The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.
Build Phases – 8 Phases, Sequential
| # | Phase | Description | Dependencies |
|---|---|---|---|
| 1 | API Endpoints | Core REST API – explore, query, usage, auth | None |
| 2 | LLM Models | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | API–Model Integration | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phase 1, 2 |
| 4 | API Usage Control | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | Web Interface | Dashboard, user management, admin panel, model access controls | Phase 1, 4 |
| 6 | Guardrails | Input + output content filtering, safety checks | Phase 3 |
| 7 | Knowledgebase + RAG | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | Retrieval + Search | SearXNG web search, Wikipedia, real-time grounded answers | Phase 3, 7 |
Base URL
http://<server-ip>/api/v1
The server IP is configured via the MAC_HOST environment variable. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.
Phase 1 – API Endpoints
1.1 Authentication – /auth
Handles user login, session management, and JWT token lifecycle. All passwords are hashed with bcrypt (work factor 12). Tokens use RS256 JWT signing.
| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/login | Roll number + password → JWT access token + refresh token |
| POST | /auth/logout | Invalidate current session / revoke refresh token |
| POST | /auth/refresh | Exchange refresh token for a new access token |
| GET | /auth/me | Current user's profile, role, department, and API key |
| POST | /auth/change-password | Change password (requires current password verification) |
Request – POST /auth/login
{
"roll_number": "21CS045",
"password": "secure_password"
}
| Field | Type | Required | Description |
|---|---|---|---|
| roll_number | string | Yes | Student/faculty roll number, e.g. 21CS045 |
| password | string | Yes | Account password (min 8 chars) |
Response – POST /auth/login
{
"access_token": "eyJhbG...",
"refresh_token": "dGhpcyBp...",
"token_type": "bearer",
"expires_in": 86400,
"user": {
"roll_number": "21CS045",
"name": "Aryan Sharma",
"department": "CSE",
"role": "student",
"api_key": "mac_sk_live_abc123..."
}
}
| Field | Description |
|---|---|
| access_token | JWT, 24-hour expiry, used for all authenticated requests |
| refresh_token | 30-day expiry, used only at /auth/refresh |
| user.role | One of: student, faculty, admin |
| user.api_key | Personal API key for direct model queries |
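For reference, a minimal client-side sketch of this login flow using the requests library; the host is a placeholder (matching the SDK example later in this document) and the field access follows the request/response examples above.
import requests

BASE_URL = "http://mac-server/api/v1"  # placeholder host; set from your deployment

# Exchange roll number + password for JWT tokens (shapes as in the examples above)
resp = requests.post(
    f"{BASE_URL}/auth/login",
    json={"roll_number": "21CS045", "password": "secure_password"},
    timeout=10,
)
resp.raise_for_status()
tokens = resp.json()
print(tokens["user"]["name"], tokens["user"]["role"])

# The access token goes in the Authorization header of subsequent calls
headers = {"Authorization": f"Bearer {tokens['access_token']}"}
profile = requests.get(f"{BASE_URL}/auth/me", headers=headers, timeout=10)
print(profile.status_code)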
1.2 Explore – /explore
Read-only discovery endpoints. Allow students to see what models and capabilities are available before writing code.
| Method | Endpoint | Description |
|---|---|---|
| GET | /explore/models | List all deployed models with capabilities, context length, and status |
| GET | /explore/models/{model_id} | Detailed info – parameters, benchmarks, example prompts |
| GET | /explore/models/search | Search by capability tag: ?tag=vision, ?tag=code, ?tag=math |
| GET | /explore/endpoints | List every API endpoint with method, path, auth requirement, description |
| GET | /explore/health | Platform health – node status, GPU temperatures, inference queue depth |
| GET | /explore/usage-stats | Aggregated platform analytics (admin-only) – tokens/day, active users |
Response – GET /explore/models
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"specialty": "Code generation, debugging, explanation",
"parameters": "7B",
"context_length": 32768,
"status": "loaded",
"capabilities": ["code", "chat", "completion"]
}
],
"total": 5
}
Response – GET /explore/health
{
"status": "healthy",
"uptime_seconds": 345600,
"nodes": [
{
"ip": "192.168.1.101",
"gpu": "NVIDIA RTX 3060 12GB",
"gpu_temp_c": 62,
"vram_used_gb": 9.2,
"vram_total_gb": 12.0,
"requests_in_flight": 3,
"status": "active"
}
],
"queue_depth": 7
}
1.3 Query – /query
The core inference API. All endpoints accept an Authorization: Bearer <api_key> header. Responses follow the OpenAI Chat Completions format – existing OpenAI SDK code works with zero changes after swapping base_url.
| Method | Endpoint | Description |
|---|---|---|
| POST | /query/chat | Chat completion – multi-turn conversation (text/code/math) |
| POST | /query/completions | Raw text completion (OpenAI-compatible) |
| POST | /query/vision | Image + text → answer (vision model) |
| POST | /query/speech-to-text | Upload audio → transcribed text (Whisper) |
| POST | /query/text-to-speech | Text → audio file download (TTS) |
| POST | /query/embeddings | Text → vector embedding (for RAG / similarity search) |
| POST | /query/rerank | Re-rank passages by relevance to a query |
Request – POST /query/chat
{
"model": "auto",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a linked list."}
],
"temperature": 0.7,
"max_tokens": 2048,
"stream": false
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or "auto" for smart routing |
| messages | array | Yes | Array of {role, content} – system / user / assistant |
| temperature | float | No | Sampling temperature 0.0–2.0 (default 0.7) |
| max_tokens | integer | No | Max tokens to generate (default 2048, max 4096) |
| stream | boolean | No | Stream tokens as Server-Sent Events (default false) |
| context_id | string | No | Maintain conversation state server-side |
Response – POST /query/chat
{
"id": "mac-chat-a1b2c3d4",
"object": "chat.completion",
"created": 1743984000,
"model": "qwen2.5-coder-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a Python function to reverse a linked list..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 187,
"total_tokens": 229
}
}
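Since responses follow the OpenAI format, streamed output (stream: true) can be consumed with the standard OpenAI SDK. A hedged sketch, assuming the gateway's Server-Sent Events match the OpenAI chunk format; the base_url is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://mac-server/api/v1", api_key="mac_sk_live_your_key_here")

# stream=True asks the gateway for Server-Sent Events; each chunk carries an
# incremental delta in the OpenAI chunk format
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a Python function to reverse a linked list."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)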
Smart Routing (model: "auto")
When students set model to "auto", the API gateway inspects the request and routes to the optimal model:
| Signal Detected | Routed To |
|---|---|
| Code keywords, programming language names, write a function, debug this | qwen2.5-coder-7b |
| Math expressions, equations, solve, prove, step by step | deepseek-r1-8b |
| Image attachment in request body | llava-1.6-7b |
| Audio file upload | whisper-large-v3 |
| General text, summarisation, writing, Q&A | qwen2.5-14b |
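A simplified sketch of how such keyword-based routing could work, mirroring the signals in the table above. The function and rule names are illustrative; the actual rules are configurable via /integration/routing-rules (Phase 3).
import re

# Illustrative rules mirroring the table above (real rules live in the routing config)
ROUTING_RULES = [
    ("code", re.compile(r"\b(function|debug|code|python|javascript)\b", re.I), "qwen2.5-coder-7b"),
    ("math", re.compile(r"\b(solve|equation|prove|integral)\b", re.I), "deepseek-r1-8b"),
]
DEFAULT_MODEL = "qwen2.5-14b"

def route(messages: list[dict], has_image: bool = False, has_audio: bool = False) -> str:
    """Pick a model for a request sent with model='auto'."""
    if has_image:
        return "llava-1.6-7b"
    if has_audio:
        return "whisper-large-v3"
    text = " ".join(m.get("content", "") for m in messages if isinstance(m.get("content"), str))
    for _task, pattern, model in ROUTING_RULES:
        if pattern.search(text):
            return model
    return DEFAULT_MODEL

# Example: a code-related prompt routes to the coder model
print(route([{"role": "user", "content": "Debug this Python function"}]))  # qwen2.5-coder-7b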
1.4 Usage – /usage
Per-student consumption tracking. Every API call logs token counts, model used, and timestamp.
| Method | Endpoint | Description |
|---|---|---|
| GET | /usage/me | My tokens used – today, this week, this month, broken down by model |
| GET | /usage/me/history | Full request history – timestamps, models, token counts, latency |
| GET | /usage/me/quota | My current quota limits and remaining balance |
| GET | /usage/admin/all | All users' usage summary (admin-only) |
| GET | /usage/admin/user/{roll_number} | Specific student's full usage details (admin-only) |
| GET | /usage/admin/models | Per-model usage statistics across the platform (admin-only) |
Response – GET /usage/me
{
"roll_number": "21CS045",
"usage": {
"today": {
"total_tokens": 12450,
"requests": 23,
"by_model": {
"qwen2.5-coder-7b": {"tokens": 8200, "requests": 15},
"qwen2.5-14b": {"tokens": 4250, "requests": 8}
}
},
"this_week": {"total_tokens": 67800, "requests": 142},
"this_month": {"total_tokens": 234500, "requests": 487}
},
"quota": {
"daily_limit": 50000,
"remaining_today": 37550
}
}
Phase 2 – LLM Models
2.1 Model Selection
Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM.
| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
|---|---|---|---|---|---|
| qwen2.5-coder-7b | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| deepseek-r1-8b | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| llava-1.6-7b | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| whisper-large-v3 | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| qwen2.5-14b | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |
Total VRAM for all 5 models: ~32 GB (fits on dual-GPU setups or loads models on demand)
Feasibility on Single PC
For a single-PC deployment with one GPU (e.g., RTX 3060 12GB or RTX 4070 16GB):
- Models are loaded/unloaded on demand – only the requested model occupies VRAM at inference time
- A model loading queue ensures graceful transitions
- Frequently-used models remain resident; idle models are evicted after a configurable timeout (a minimal eviction sketch follows this list)
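A minimal sketch of the idle-eviction idea, assuming a configurable timeout; the timeout value and the unload helper are illustrative placeholders, not the actual worker implementation.
import time

IDLE_TIMEOUT_S = 900  # assumption: evict models unused for 15 minutes
last_used: dict[str, float] = {}  # model_id -> timestamp of last request

def touch(model_id: str) -> None:
    """Record that a model just served a request."""
    last_used[model_id] = time.monotonic()

def evict_idle(unload_fn) -> None:
    """Unload any model idle for longer than the timeout.

    unload_fn is a placeholder for the worker's unload call
    (i.e. whatever backs POST /models/{model_id}/unload).
    """
    now = time.monotonic()
    for model_id, ts in list(last_used.items()):
        if now - ts > IDLE_TIMEOUT_S:
            unload_fn(model_id)
            del last_used[model_id]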
2.2 Model Management Endpoints – /models
| Method | Endpoint | Description |
|---|---|---|
| GET | /models | List all models with status: loaded / downloading / queued / offline |
| GET | /models/{model_id} | Single model details – context length, capabilities, benchmark scores |
| POST | /models/{model_id}/load | Load model into GPU memory (admin-only) |
| POST | /models/{model_id}/unload | Unload model from VRAM to free resources (admin-only) |
| GET | /models/{model_id}/health | Ping model – returns latency, memory usage, ready status |
| POST | /models/download | Download a model from HuggingFace by ID (admin-only) |
| GET | /models/download/{task_id} | Check download progress for a model download task |
Response – GET /models
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"status": "loaded",
"vram_mb": 5120,
"context_length": 32768,
"capabilities": ["code", "chat"],
"loaded_at": "2026-04-07T08:30:00Z"
},
{
"id": "deepseek-r1-8b",
"name": "DeepSeek-R1 8B",
"status": "offline",
"vram_mb": 6144,
"context_length": 65536,
"capabilities": ["reasoning", "math", "chat"],
"loaded_at": null
}
]
}
Phase 3 – API–Model Integration
3.1 Architecture
LiteLLM Proxy sits between the FastAPI gateway and vLLM workers. It:
- Translates every /query request into the correct vLLM inference call
- Handles smart routing (auto-model selection)
- Load-balances across multiple workers (when scaled)
- Retries on worker failure with exponential backoff
- Returns OpenAI-compatible JSON responses
FastAPI Gateway ---> LiteLLM Proxy ---> vLLM Worker(s)
                     (routing +         (GPU inference)
                      load balance)
3.2 Integration Endpoints – /integration
| Method | Endpoint | Description |
|---|---|---|
| GET | /integration/routing-rules | Show current routing rules (which task type → which model) |
| PUT | /integration/routing-rules | Update routing rules (admin-only) |
| GET | /integration/workers | List all vLLM worker nodes and their current load |
| GET | /integration/workers/{node_id} | Single worker – GPU temp, VRAM used, requests in flight |
| POST | /integration/workers/{node_id}/drain | Mark worker as draining – no new requests routed (admin-only) |
| GET | /integration/queue | Current global inference queue depth across all workers |
Response – GET /integration/routing-rules
{
"rules": [
{"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
{"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
{"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
{"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
{"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
]
}
3.3 Scaling Strategy
| Deployment | Configuration |
|---|---|
| Single PC | One vLLM process, models loaded/swapped on demand |
| 2–5 PCs | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| 6–30 PCs | Multiple workers per model for redundancy; least-busy routing; auto-failover |
All worker addresses are stored in configuration (environment variables / config file) – no IPs are hardcoded. Adding a new node requires only updating the LiteLLM config and restarting the proxy.
Phase 4 – API Usage Control
4.1 API Key Management – /keys
Every student receives a unique API key upon account creation. Two key types are supported:
| Key Type | Behaviour |
|---|---|
| Static | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| Refresh | Auto-rotates every 30 days. Old key has a 48-hour grace period. Ideal for long-running scripts. |
Key format: mac_sk_live_<32-char-random-hex>
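A sketch of key generation consistent with the format above and the security table (only a SHA-256 hash is persisted; the raw key is shown once at creation). The helper names are illustrative, not the actual service code.
import hashlib
import secrets

def generate_api_key() -> tuple[str, str, str]:
    """Return (raw_key, key_prefix, key_hash) for a new key.

    Only key_prefix and key_hash are stored; the raw key is displayed
    to the student once at creation time.
    """
    raw_key = "mac_sk_live_" + secrets.token_hex(16)  # 32 random hex chars
    key_prefix = raw_key[:20]                          # used for masked display
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    return raw_key, key_prefix, key_hash

def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of a presented key against the stored hash."""
    return secrets.compare_digest(hashlib.sha256(presented.encode()).hexdigest(), stored_hash)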
| Method | Endpoint | Description |
|---|---|---|
| GET | /keys/my-key | Get current API key (partially masked) and metadata |
| POST | /keys/generate | Generate a new API key (invalidates the previous one) |
| GET | /keys/my-key/stats | Tokens consumed against this key – today, week, month |
| DELETE | /keys/my-key | Revoke current key permanently (must re-generate) |
| GET | /keys/admin/all | List all student API keys and status (admin-only) |
| POST | /keys/admin/revoke | Force-revoke a specific student's API key (admin-only) |
Response – GET /keys/my-key
{
"key": "mac_sk_live_a1b2...****",
"type": "refresh",
"created_at": "2026-03-01T10:00:00Z",
"expires_at": "2026-03-31T10:00:00Z",
"last_used": "2026-04-07T14:23:00Z",
"status": "active",
"total_requests": 487
}
4.2 Rate Limiting & Quotas – /quota
Rate limits are enforced at the Redis layer using a sliding-window algorithm. Limits are per-role and can be overridden per-user.
| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
|---|---|---|---|
| Student | 50,000 | 100 | 4,096 |
| Faculty | 200,000 | 500 | 8,192 |
| Admin | Unlimited | Unlimited | 16,384 |
| Method | Endpoint | Description |
|---|---|---|
| GET | /quota/limits | Show default quota limits per role |
| GET | /quota/me | My personal limits and current consumption |
| PUT | /quota/admin/user/{roll_number} | Override quota for a specific user (admin-only) |
| GET | /quota/admin/exceeded | List users who have exceeded their daily quota (admin-only) |
Response – GET /quota/me
{
"role": "student",
"limits": {
"daily_tokens": 50000,
"requests_per_hour": 100,
"max_tokens_per_request": 4096
},
"current": {
"tokens_used_today": 12450,
"requests_this_hour": 8,
"remaining_tokens": 37550,
"resets_at": "2026-04-08T00:00:00Z"
}
}
Rate Limit Response Headers
Every API response includes:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1743987600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1744070400
When a limit is exceeded, the API returns:
HTTP 429 Too Many Requests
{
"error": {
"code": "rate_limit_exceeded",
"message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
"retry_after": 847
}
}
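A minimal sketch of the sliding-window check described at the start of this section, using redis-py sorted sets. Key naming and defaults are illustrative; the production limiter lives in api/core/rate_limit.py.
import time
import uuid

import redis

r = redis.Redis(host="redis", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 3600) -> bool:
    """Sliding-window log: one sorted-set member per request, scored by timestamp."""
    now = time.time()
    key = f"ratelimit:{user_id}"
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop requests that left the window
    pipe.zadd(key, {str(uuid.uuid4()): now})       # record this request
    pipe.zcard(key)                                # count requests still in the window
    pipe.expire(key, window_s)
    _, _, count, _ = pipe.execute()
    # retry_after (as in the 429 body above) can be derived from the age of the
    # oldest remaining member: window_s - (now - oldest_score)
    return count <= limit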
Phase 5 – Web Interface
A React-based single-page application served by the FastAPI backend. Three views: Student Dashboard, Key Management, and Admin Panel.
5.1 Web Interface Endpoints – /ui
| Method | Endpoint | Description |
|---|---|---|
| GET | /ui/dashboard | Student home – usage chart, quick-start code snippets, model status cards |
| GET | /ui/keys | Key management – view, copy, regenerate API key |
| GET | /ui/history | Request history table – timestamp, model, tokens, latency, status |
| GET | /ui/playground | Interactive chat playground – test models directly in the browser |
5.2 Admin Panel – /ui/admin
| Method | Endpoint | Description |
|---|---|---|
| GET | /ui/admin/users | Full user list with roles, quotas, last active, status |
| POST | /ui/admin/users/create | Create user / bulk-create from CSV (roll no., name, department) |
| PUT | /ui/admin/users/{roll_number} | Edit user – change role, quota overrides, model access |
| DELETE | /ui/admin/users/{roll_number} | Deactivate a user account |
| GET | /ui/admin/models | Model management – load/unload, see which node serves what |
| GET | /ui/admin/logs | Live request logs – error rates, latency percentiles, throughput |
| GET | /ui/admin/analytics | Platform analytics – daily active users, peak hours, top models |
Dashboard Wireframe
+----------------------------------------------------------+
|  MAC – MBM AI Cloud                         [Profile v]   |
+----------------------------------------------------------+
|                                                           |
|  +---------+   +---------+   +---------+                  |
|  | Tokens  |   |Requests |   |  Quota  |                  |
|  |  Today  |   |  Today  |   |Remaining|                  |
|  | 12,450  |   |   23    |   |  75.1%  |                  |
|  +---------+   +---------+   +---------+                  |
|                                                           |
|  Models Available         Quick Start                     |
|  +------------------+     +----------------------+        |
|  | * Qwen2.5-Coder  |     | from openai import   |        |
|  | * DeepSeek-R1    |     |     OpenAI           |        |
|  | * Qwen2.5 14B    |     | client = OpenAI(     |        |
|  | ~ LLaVA (load.)  |     |   base_url=          |        |
|  | * Whisper        |     |     "http://mac/v1", |        |
|  +------------------+     |   api_key="mac_sk_"  |        |
|                           | )                    |        |
|  Usage This Week          +----------------------+        |
|  +------------------+                                     |
|  |     (chart)      |                                     |
|  +------------------+                                     |
+----------------------------------------------------------+
Phase 6 – Guardrails
Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.
6.1 Guardrails Endpoints – /guardrails
| Method | Endpoint | Description |
|---|---|---|
| POST | /guardrails/check-input | Run input text through content filter before sending to model |
| POST | /guardrails/check-output | Run model output through safety filter before returning to user |
| GET | /guardrails/rules | List active guardrail rules (admin-only) |
| PUT | /guardrails/rules | Update rules – blocked categories, max prompt length (admin-only) |
6.2 Filtering Pipeline
User Input ---> Input Filter ---> Model Inference ---> Output Filter ---> Response
                     |                                       |
                     +- Prompt injection detection           +- PII/sensitive data redaction
                     +- Blocked topic detection              +- Harmful content detection
                     +- Max prompt length enforcement        +- Hallucination disclaimer
                     +- Academic integrity checks            +- Source attribution
Guardrail Categories
| Category | Action | Description |
|---|---|---|
| Prompt Injection | Block + log | Detect attempts to override system prompts |
| Harmful Content | Block | Violence, self-harm, illegal activities |
| Academic Dishonesty | Flag + disclaimer | Full essay/assignment generation adds academic integrity notice |
| PII in Output | Redact | Strip emails, phone numbers, addresses from model output |
| Max Prompt Length | Reject | Configurable per-role; prevents resource abuse |
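A small sketch of the output-side PII redaction step from the table above. The regex patterns are illustrative and would need tuning; real rules are configurable via PUT /guardrails/rules.
import re

# Illustrative patterns only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact_pii(text: str) -> str:
    """Strip emails and phone numbers from model output before returning it."""
    text = EMAIL_RE.sub("[redacted email]", text)
    text = PHONE_RE.sub("[redacted phone]", text)
    return text

print(redact_pii("Contact me at jane@example.com or +91 98765 43210"))
# Contact me at [redacted email] or [redacted phone]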
Request – POST /guardrails/check-input
{
"text": "User input to check",
"context": "chat"
}
Response
{
"safe": true,
"flags": [],
"modified_text": null
}
Phase 7 – Knowledgebase + RAG
Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.
7.1 RAG Endpoints – /rag
| Method | Endpoint | Description |
|---|---|---|
| POST | /rag/ingest | Upload PDF / DOCX / TXT → chunk, embed, store in Qdrant vector DB |
| GET | /rag/documents | List all ingested documents with metadata (title, pages, chunk count) |
| GET | /rag/documents/{id} | Single document details and its chunks |
| DELETE | /rag/documents/{id} | Remove a document from the knowledgebase (admin-only) |
| POST | /rag/query | Ask a question → retrieves top-k relevant chunks → sends to LLM with context |
| GET | /rag/query/{query_id}/sources | Get source citations for a RAG response |
| POST | /rag/collections | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| GET | /rag/collections | List all collections |
7.2 RAG Pipeline
Document Upload ---> Chunking (512 tokens, 50 overlap)
                            |
                            v
                    Embedding (768-dim)
                            |
                            v
                     Qdrant Vector DB
                            |
User Question ---> Embed Query ---> Similarity Search (top-k=5)
                            |
                            v
                Retrieved Chunks + Question
                            |
                            v
                      LLM Generation
                            |
                            v
                  Answer with Citations
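A condensed sketch of the ingest-and-retrieve path against Qdrant using the qdrant-client library. The collection name, embedding callable, and payload fields are illustrative; the real logic lives in api/services/rag.py.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="qdrant", port=6333)

# One-time collection setup: 768-dim vectors, cosine similarity
client.create_collection(
    collection_name="OS",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

def ingest(chunks: list[str], embed) -> None:
    """Embed each chunk and upsert it with its text as payload.

    `embed` is a placeholder for the embedding model call
    (e.g. whatever backs POST /query/embeddings).
    """
    points = [
        PointStruct(id=i, vector=embed(text), payload={"text": text})
        for i, text in enumerate(chunks)
    ]
    client.upsert(collection_name="OS", points=points)

def retrieve(question: str, embed, top_k: int = 5) -> list[str]:
    """Return the top-k chunk texts most similar to the question."""
    hits = client.search(collection_name="OS", query_vector=embed(question), limit=top_k)
    return [hit.payload["text"] for hit in hits]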
Request – POST /rag/query
{
"question": "Explain the difference between process and thread in operating systems",
"collection": "OS",
"top_k": 5,
"model": "auto"
}
Response – POST /rag/query
{
"answer": "A process is an independent program in execution with its own memory space...",
"sources": [
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 3: Processes",
"page": 105,
"relevance_score": 0.92,
"chunk_preview": "A process is a program in execution. A process is more than..."
},
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 4: Threads",
"page": 163,
"relevance_score": 0.89,
"chunk_preview": "A thread is a basic unit of CPU utilization..."
}
],
"model_used": "qwen2.5-14b",
"tokens_used": 847
}
Phase 8 – Retrieval + Search
Real-time web grounding via self-hosted SearXNG. Enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.
8.1 Search Endpoints – /search
| Method | Endpoint | Description |
|---|---|---|
| POST | /search/web | Query SearXNG – aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| POST | /search/wikipedia | Targeted Wikipedia search with summary extraction |
| POST | /search/grounded | Search + LLM – retrieves web results, then generates a cited answer |
| GET | /search/cache | List recently cached search results (reduces redundant fetches) |
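A sketch of the underlying call that /search/web could make to the self-hosted SearXNG instance. This assumes SearXNG's JSON output format is enabled in its settings; the host/port follow the Docker Compose table later in this document, and the returned field names are the commonly used ones.
import requests

def search_web(query: str, max_results: int = 5) -> list[dict]:
    """Query the self-hosted SearXNG instance and return title/url/snippet dicts."""
    resp = requests.get(
        "http://searxng:8888/search",
        params={"q": query, "format": "json"},  # JSON format must be enabled in SearXNG settings
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])[:max_results]
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("content")}
        for r in results
    ]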
Request – POST /search/grounded
{
"query": "What are the latest improvements in vLLM v0.8?",
"max_sources": 5,
"model": "auto"
}
Response – POST /search/grounded
{
"answer": "vLLM v0.8 introduced several key improvements including...",
"sources": [
{
"title": "vLLM v0.8.0 Release Notes",
"url": "https://github.com/vllm-project/vllm/releases",
"snippet": "Key features: improved prefix caching, multi-modal support..."
}
],
"model_used": "qwen2.5-14b",
"cached": false
}
Database Schema
Entity-Relationship Overview
+---------------+       +---------------+       +-------------------+
| users         |       | api_keys      |       | request_logs      |
+---------------+       +---------------+       +-------------------+
| id (PK)       |--+    | id (PK)       |--+    | id (PK)           |
| roll_number   |  +--->| user_id (FK)  |  |    | user_id (FK)      |<--+
| name          |  |    | key_hash      |  +--->| api_key_id (FK)   |   |
| department    |  |    | key_prefix    |       | model             |   |
| role          |  |    | type          |       | endpoint          |   |
| password      |  |    | is_active     |       | tokens_in         |   |
| is_active     |  |    | created_at    |       | tokens_out        |   |
| created_at    |  |    | expires_at    |       | latency_ms        |   |
| updated_at    |  |    | last_used     |       | status_code       |   |
+---------------+  |    +---------------+       | created_at        |   |
                   |                            +-------------------+   |
                   +-----------------------------------------------------+

+--------------------+       +-------------------+
| quota_overrides    |       | rag_documents     |
+--------------------+       +-------------------+
| id (PK)            |       | id (PK)           |
| user_id (FK)       |       | title             |
| daily_tokens       |       | collection        |
| requests_per_hr    |       | file_type         |
| max_tokens_req     |       | chunk_count       |
| set_by (FK)        |       | uploaded_by (FK)  |
| created_at         |       | created_at        |
+--------------------+       +-------------------+
SQL Schema – Key Tables
users
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
roll_number VARCHAR(20) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
department VARCHAR(50) NOT NULL,
role VARCHAR(20) NOT NULL DEFAULT 'student'
CHECK (role IN ('student', 'faculty', 'admin')),
hashed_password VARCHAR(255) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
api_keys
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
key_hash VARCHAR(255) NOT NULL,
key_prefix VARCHAR(20) NOT NULL,
type VARCHAR(10) NOT NULL DEFAULT 'static'
CHECK (type IN ('static', 'refresh')),
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
last_used TIMESTAMPTZ
);
request_logs
CREATE TABLE request_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
api_key_id UUID REFERENCES api_keys(id),
model VARCHAR(50) NOT NULL,
endpoint VARCHAR(100) NOT NULL,
tokens_in INTEGER NOT NULL DEFAULT 0,
tokens_out INTEGER NOT NULL DEFAULT 0,
latency_ms INTEGER,
status_code SMALLINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at);
CREATE INDEX idx_request_logs_model ON request_logs (model, created_at);
Project Folder Structure
mac/
├── api/
│   ├── main.py                # FastAPI application entry point
│   ├── routers/
│   │   ├── auth.py            # /auth endpoints
│   │   ├── explore.py         # /explore endpoints
│   │   ├── query.py           # /query endpoints
│   │   ├── usage.py           # /usage endpoints
│   │   ├── keys.py            # /keys endpoints
│   │   ├── models.py          # /models endpoints
│   │   ├── integration.py     # /integration endpoints
│   │   ├── quota.py           # /quota endpoints
│   │   ├── guardrails.py      # /guardrails endpoints
│   │   ├── rag.py             # /rag endpoints
│   │   └── search.py          # /search endpoints
│   ├── models/
│   │   ├── user.py            # SQLAlchemy User model
│   │   ├── api_key.py         # APIKey model
│   │   ├── request_log.py     # RequestLog model
│   │   └── document.py        # RAG Document model
│   ├── schemas/
│   │   ├── auth.py            # Pydantic request/response schemas
│   │   ├── query.py           # Chat, completions, vision schemas
│   │   ├── usage.py           # Usage response schemas
│   │   └── common.py          # Shared schemas (pagination, errors)
│   ├── core/
│   │   ├── config.py          # Settings from environment variables
│   │   ├── security.py        # JWT creation/validation, password hashing
│   │   ├── rate_limit.py      # Redis-based sliding window rate limiter
│   │   ├── dependencies.py    # FastAPI dependency injection (get_current_user, etc.)
│   │   └── db.py              # Database session factory
│   ├── services/
│   │   ├── model_router.py    # Smart routing logic (auto model selection)
│   │   ├── inference.py       # LiteLLM client for model calls
│   │   ├── usage_tracker.py   # Token counting and logging
│   │   ├── guardrails.py      # Input/output filtering logic
│   │   └── rag.py             # Document ingestion, embedding, retrieval
│   └── requirements.txt
├── litellm/
│   └── config.yaml            # LiteLLM proxy routing configuration
├── nginx/
│   └── nginx.conf             # Reverse proxy configuration
├── frontend/
│   ├── src/
│   │   ├── pages/             # Dashboard, Keys, History, Admin pages
│   │   ├── components/        # Reusable UI components
│   │   └── api/               # API client (Axios/fetch wrappers)
│   ├── package.json
│   └── vite.config.ts
├── scripts/
│   ├── seed_users.py          # Bulk-create students from CSV
│   ├── download_models.py     # Pull models from HuggingFace
│   └── health_check.py        # Verify all services are running
├── docker-compose.yml         # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile                 # FastAPI container
├── .env.example               # Template for environment variables
├── alembic/                   # Database migrations
│   └── versions/
└── README.md
Security Design
| Concern | Implementation |
|---|---|
| Password Storage | bcrypt with work factor 12; passwords never stored in plaintext |
| JWT Signing | RS256 asymmetric keys; access tokens short-lived (24h) |
| API Key Storage | Only SHA-256 hash stored in DB; raw key shown once at creation |
| Transport | Nginx terminates TLS (HTTPS); internal services communicate over Docker network |
| Input Validation | Pydantic schema validation on every request; max payload size enforced |
| SQL Injection | SQLAlchemy ORM with parameterised queries throughout |
| Rate Limiting | Redis sliding-window per user + per IP; prevents abuse and DoS |
| CORS | Strict origin allowlist β only the MAC frontend domain |
| Admin Endpoints | Role-based access control; admin-only routes check JWT role claim |
| Prompt Injection | Input guardrails detect and block prompt override attempts |
| Secrets Management | All secrets in .env file; never committed to version control |
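A sketch of how admin-only routes could enforce the JWT role claim via a FastAPI dependency. get_current_user is the dependency listed under api/core/dependencies.py in the folder structure; its import path and return shape here are assumptions.
from fastapi import Depends, HTTPException, status

from api.core.dependencies import get_current_user  # assumed import path

def require_role(*allowed_roles: str):
    """Dependency factory: reject requests whose JWT role claim is not allowed."""
    def checker(user=Depends(get_current_user)):
        if user.role not in allowed_roles:
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail={"code": "forbidden", "message": "Insufficient role permissions."},
            )
        return user
    return checker

# Usage on an admin-only route:
# @router.get("/usage/admin/all", dependencies=[Depends(require_role("admin"))])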
Error Response Format
All errors follow a consistent structure:
{
"error": {
"code": "authentication_failed",
"message": "Invalid roll number or password.",
"status": 401,
"timestamp": "2026-04-07T14:30:00Z",
"request_id": "mac-req-a1b2c3"
}
}
Standard Error Codes
| HTTP Status | Code | Description |
|---|---|---|
| 400 | bad_request | Malformed request body or invalid parameters |
| 401 | authentication_failed | Missing or invalid credentials / API key |
| 403 | forbidden | Valid auth but insufficient role permissions |
| 404 | not_found | Resource does not exist |
| 409 | conflict | Duplicate resource (e.g., user already exists) |
| 422 | validation_error | Request schema validation failed |
| 429 | rate_limit_exceeded | Request or token quota exceeded |
| 500 | internal_error | Unexpected server error |
| 503 | model_unavailable | Requested model is not loaded or all workers are busy |
Deployment Configuration
Environment Variables (.env)
# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production
# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=<generated-secret>
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30
# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=<generated-secret>
# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333
# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b
Docker Compose Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| api | Custom (Dockerfile) | 8000 | FastAPI gateway |
| db | postgres:16-alpine | 5432 | User data, logs, keys |
| redis | redis:7-alpine | 6379 | Rate limiting, caching |
| litellm | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| qdrant | qdrant/qdrant | 6333 | Vector database (RAG) |
| searxng | searxng/searxng | 8888 | Web search engine |
| nginx | nginx:alpine | 80, 443 | Reverse proxy + TLS |
API Versioning & Compatibility
- All endpoints are prefixed with /api/v1
- The /query endpoints return OpenAI-compatible JSON – any OpenAI SDK works by changing base_url
- Breaking changes will increment the version (/api/v2) while maintaining the previous version for one semester
- List endpoints use cursor-based pagination
OpenAI SDK Compatibility Example
from openai import OpenAI
client = OpenAI(
base_url="http://mac-server/api/v1",
api_key="mac_sk_live_your_key_here"
)
response = client.chat.completions.create(
model="auto",
messages=[
{"role": "user", "content": "Explain binary search in Python"}
]
)
print(response.choices[0].message.content)
Summary
MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.
Phases 1–4 (API + Models + Integration + Usage Control) establish the complete backend – once deployed, students can start querying AI models from their laptops on day one.
MAC – MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026