# MAC — MBM AI Cloud

## API Design & Architecture Plan

### Phase 1–8 · Complete Platform Blueprint

**Prepared for Professor Review · 07 April 2026**

---

## Executive Summary

MAC (MBM AI Cloud) is a **fully self-hosted, zero-cloud AI inference platform** purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models — text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning — through a standardised REST API authenticated by per-student API keys.

**Key design goals:**

- **Zero cloud cost** — all inference runs on college-owned GPUs
- **OpenAI-compatible API** — students use familiar SDKs; existing tutorials work unchanged
- **Scalable from day one** — starts on a single PC, scales horizontally to 30+ nodes with no code changes
- **Secure by default** — JWT authentication, role-based access, rate limiting, input/output guardrails
- **Academically useful** — RAG over college textbooks, web-grounded search, per-student usage tracking

---

## Technology Stack

| Layer | Technology | Rationale |
|---|---|---|
| **API Gateway** | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-safe, among the fastest Python frameworks |
| **Database** | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| **Cache / Rate Limiter** | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| **LLM Inference** | vLLM | PagedAttention, continuous batching, highest throughput for local GPUs |
| **Model Router / Proxy** | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| **Vector Database** | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| **Web Search** | SearXNG | Self-hosted meta-search, no API keys needed |
| **Reverse Proxy** | Nginx | TLS termination, request buffering, static file serving |
| **Containerisation** | Docker + Docker Compose | Reproducible deployments, one-command startup |
| **Frontend** | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| **Task Queue** | Celery + Redis | Background jobs: document ingestion, model downloads |

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                          College LAN                            │
│                  Students / Faculty / Lab PCs                   │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTPS
                             ▼
                  ┌─────────────────────┐
                  │        Nginx        │
                  │    Reverse Proxy    │
                  │    TLS · Caching    │
                  └──────────┬──────────┘
                             │
                  ┌──────────▼──────────┐
                  │       FastAPI       │
                  │     API Gateway     │
                  │    Auth · Routing   │
                  │    Rate Limiting    │
                  └───┬──────┬──────┬───┘
                      │      │      │
           ┌──────────┘      │      └──────────┐
           ▼                 ▼                 ▼
     ┌──────────┐     ┌─────────────┐    ┌──────────┐
     │PostgreSQL│     │   LiteLLM   │    │  Qdrant  │
     │  Users   │     │    Proxy    │    │ VectorDB │
     │  Logs    │     │   Routing   │    │  (RAG)   │
     │  Keys    │     └──────┬──────┘    └──────────┘
     └──────────┘            │
                             ▼
                  ┌─────────────────────┐
                  │        vLLM         │
                  │    Model Workers    │
                  │    GPU Inference    │
                  └─────────────────────┘
```

**Scaling model:** The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.
---

## Build Phases — 8 Phases, Sequential

| # | Phase | Description | Dependencies |
|---|---|---|---|
| 1 | **API Endpoints** | Core REST API — explore, query, usage, auth | None |
| 2 | **LLM Models** | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | **API–Model Integration** | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phase 1, 2 |
| 4 | **API Usage Control** | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | **Web Interface** | Dashboard, user management, admin panel, model access controls | Phase 1, 4 |
| 6 | **Guardrails** | Input + output content filtering, safety checks | Phase 3 |
| 7 | **Knowledgebase + RAG** | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | **Retrieval + Search** | SearXNG web search, Wikipedia, real-time grounded answers | Phase 3, 7 |

---

## Base URL

```
http://<MAC_HOST>/api/v1
```

> The server IP is configured via the environment variable `MAC_HOST`. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.

---

# Phase 1 — API Endpoints

## 1.1 Authentication — `/auth`

Handles user login, session management, and the JWT token lifecycle. All passwords are hashed with **bcrypt** (work factor 12). Tokens use **RS256** JWT signing.

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/auth/login` | Roll number + password → JWT access token + refresh token |
| `POST` | `/auth/logout` | Invalidate current session / revoke refresh token |
| `POST` | `/auth/refresh` | Exchange refresh token for a new access token |
| `GET` | `/auth/me` | Current user's profile, role, department, and API key |
| `POST` | `/auth/change-password` | Change password (requires current password verification) |

### Request — `POST /auth/login`

```json
{
  "roll_number": "21CS045",
  "password": "secure_password"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `roll_number` | string | Yes | Student/faculty roll number, e.g. `21CS045` |
| `password` | string | Yes | Account password (min 8 chars) |

### Response — `POST /auth/login`

```json
{
  "access_token": "eyJhbG...",
  "refresh_token": "dGhpcyBp...",
  "token_type": "bearer",
  "expires_in": 86400,
  "user": {
    "roll_number": "21CS045",
    "name": "Aryan Sharma",
    "department": "CSE",
    "role": "student",
    "api_key": "mac_sk_live_abc123..."
  }
}
```

| Field | Description |
|---|---|
| `access_token` | JWT, 24-hour expiry, used for all authenticated requests |
| `refresh_token` | 30-day expiry, used only at `/auth/refresh` |
| `user.role` | One of: `student`, `faculty`, `admin` |
| `user.api_key` | Personal API key for direct model queries |
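The credential flow above is small enough to sketch end to end. The snippet below is a minimal sketch, assuming `passlib` for bcrypt hashing and `PyJWT` (with the `cryptography` extra) for RS256 signing; the function names and key-loading details are placeholders, not the final `security.py` API.

```python
# Illustrative sketch of the /auth/login credential flow (not final service code).
from datetime import datetime, timedelta, timezone

import jwt                       # PyJWT; RS256 requires the `cryptography` package
from passlib.hash import bcrypt

ACCESS_TOKEN_TTL = timedelta(hours=24)   # matches expires_in = 86400 above

def hash_password(plain: str) -> str:
    # Work factor 12, per the Security Design table later in this document.
    return bcrypt.using(rounds=12).hash(plain)

def verify_password(plain: str, hashed: str) -> bool:
    return bcrypt.verify(plain, hashed)

def issue_access_token(roll_number: str, role: str, private_key_pem: str) -> str:
    # RS256: signed with the gateway's private key; anything holding the
    # public key can verify the token without being able to mint one.
    now = datetime.now(timezone.utc)
    payload = {
        "sub": roll_number,
        "role": role,                    # checked by admin-only routes
        "iat": now,
        "exp": now + ACCESS_TOKEN_TTL,   # 24-hour expiry, as documented above
    }
    return jwt.encode(payload, private_key_pem, algorithm="RS256")
```

Keeping the private key on the gateway only is the main reason to prefer RS256 over a shared HMAC secret here: worker services can verify tokens with the public key alone.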
---

## 1.2 Explore — `/explore`

Read-only discovery endpoints that let students see which models and capabilities are available before writing code.

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/explore/models` | List all deployed models with capabilities, context length, and status |
| `GET` | `/explore/models/{model_id}` | Detailed info — parameters, benchmarks, example prompts |
| `GET` | `/explore/models/search` | Search by capability tag: `?tag=vision`, `?tag=code`, `?tag=math` |
| `GET` | `/explore/endpoints` | List every API endpoint with method, path, auth requirement, description |
| `GET` | `/explore/health` | Platform health — node status, GPU temperatures, inference queue depth |
| `GET` | `/explore/usage-stats` | Aggregated platform analytics (admin-only) — tokens/day, active users |

### Response — `GET /explore/models`

```json
{
  "models": [
    {
      "id": "qwen2.5-coder-7b",
      "name": "Qwen2.5-Coder 7B",
      "specialty": "Code generation, debugging, explanation",
      "parameters": "7B",
      "context_length": 32768,
      "status": "loaded",
      "capabilities": ["code", "chat", "completion"]
    }
  ],
  "total": 5
}
```

### Response — `GET /explore/health`

```json
{
  "status": "healthy",
  "uptime_seconds": 345600,
  "nodes": [
    {
      "ip": "192.168.1.101",
      "gpu": "NVIDIA RTX 3060 12GB",
      "gpu_temp_c": 62,
      "vram_used_gb": 9.2,
      "vram_total_gb": 12.0,
      "requests_in_flight": 3,
      "status": "active"
    }
  ],
  "queue_depth": 7
}
```

---

## 1.3 Query — `/query`

The core inference API. All endpoints require an `Authorization: Bearer <access_token>` header. Responses follow the **OpenAI Chat Completions format** — existing OpenAI SDK code works with zero changes by swapping `base_url`.

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/query/chat` | Chat completion — multi-turn conversation (text/code/math) |
| `POST` | `/query/completions` | Raw text completion (OpenAI-compatible) |
| `POST` | `/query/vision` | Image + text → answer (vision model) |
| `POST` | `/query/speech-to-text` | Upload audio → transcribed text (Whisper) |
| `POST` | `/query/text-to-speech` | Text → audio file download (TTS) |
| `POST` | `/query/embeddings` | Text → vector embedding (for RAG / similarity search) |
| `POST` | `/query/rerank` | Re-rank passages by relevance to a query |

### Request — `POST /query/chat`

```json
{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a linked list."}
  ],
  "temperature": 0.7,
  "max_tokens": 2048,
  "stream": false
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID or `"auto"` for smart routing |
| `messages` | array | Yes | Array of `{role, content}` — system / user / assistant |
| `temperature` | float | No | Sampling temperature 0.0–2.0 (default `0.7`) |
| `max_tokens` | integer | No | Max tokens to generate (default `2048`, max `4096`) |
| `stream` | boolean | No | Stream tokens as Server-Sent Events (default `false`) |
| `context_id` | string | No | Maintain conversation state server-side |

### Response — `POST /query/chat`

```json
{
  "id": "mac-chat-a1b2c3d4",
  "object": "chat.completion",
  "created": 1743984000,
  "model": "qwen2.5-coder-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a Python function to reverse a linked list..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 187,
    "total_tokens": 229
  }
}
```

### Smart Routing (`model: "auto"`)

When students set `model` to `"auto"`, the API gateway inspects the request and routes to the optimal model:

| Signal Detected | Routed To |
|---|---|
| Code keywords, programming language names, `write a function`, `debug this` | `qwen2.5-coder-7b` |
| Math expressions, equations, `solve`, `prove`, `step by step` | `deepseek-r1-8b` |
| Image attachment in request body | `llava-1.6-7b` |
| Audio file upload | `whisper-large-v3` |
| General text, summarisation, writing, Q&A | `qwen2.5-14b` |
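The routing table above reduces to a short priority scan. This is a minimal sketch, assuming rules arrive in the shape returned by `GET /integration/routing-rules` (Phase 3); the keyword heuristic is intentionally naive and could be swapped for a classifier without changing the endpoint contract.

```python
# Naive keyword router behind model="auto" — a sketch, not the final logic.
DEFAULT_MODEL = "qwen2.5-14b"

ROUTING_RULES = [  # same shape as GET /integration/routing-rules
    {"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"],
     "model": "qwen2.5-coder-7b", "priority": 1},
    {"task": "math", "keywords": ["solve", "equation", "prove", "integral"],
     "model": "deepseek-r1-8b", "priority": 2},
]

def route(messages: list[dict], has_image: bool = False, has_audio: bool = False) -> str:
    # Attachments are unambiguous signals, so they win outright.
    if has_audio:
        return "whisper-large-v3"
    if has_image:
        return "llava-1.6-7b"
    # Otherwise scan keyword rules in priority order against the user turns.
    text = " ".join(m["content"] for m in messages if m["role"] == "user").lower()
    for rule in sorted(ROUTING_RULES, key=lambda r: r["priority"]):
        if any(kw in text for kw in rule.get("keywords", [])):
            return rule["model"]
    return DEFAULT_MODEL  # general chat fallback

# route([{"role": "user", "content": "Debug this Python function"}])
# -> "qwen2.5-coder-7b"
```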
}, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 42, "completion_tokens": 187, "total_tokens": 229 } } ``` ### Smart Routing (`model: "auto"`) When students set `model` to `"auto"`, the API gateway inspects the request and routes to the optimal model: | Signal Detected | Routed To | |---|---| | Code keywords, programming language names, `write a function`, `debug this` | `qwen2.5-coder-7b` | | Math expressions, equations, `solve`, `prove`, `step by step` | `deepseek-r1-8b` | | Image attachment in request body | `llava-1.6-7b` | | Audio file upload | `whisper-large-v3` | | General text, summarisation, writing, Q&A | `qwen2.5-14b` | --- ## 1.4 Usage — `/usage` Per-student consumption tracking. Every API call logs token counts, model used, and timestamp. | Method | Endpoint | Description | |---|---|---| | `GET` | `/usage/me` | My tokens used — today, this week, this month, broken down by model | | `GET` | `/usage/me/history` | Full request history — timestamps, models, token counts, latency | | `GET` | `/usage/me/quota` | My current quota limits and remaining balance | | `GET` | `/usage/admin/all` | All users' usage summary (admin-only) | | `GET` | `/usage/admin/user/{roll_number}` | Specific student's full usage details (admin-only) | | `GET` | `/usage/admin/models` | Per-model usage statistics across the platform (admin-only) | ### Response — `GET /usage/me` ```json { "roll_number": "21CS045", "usage": { "today": { "total_tokens": 12450, "requests": 23, "by_model": { "qwen2.5-coder-7b": {"tokens": 8200, "requests": 15}, "qwen2.5-14b": {"tokens": 4250, "requests": 8} } }, "this_week": {"total_tokens": 67800, "requests": 142}, "this_month": {"total_tokens": 234500, "requests": 487} }, "quota": { "daily_limit": 50000, "remaining_today": 37550 } } ``` --- # Phase 2 — LLM Models ## 2.1 Model Selection Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM. 
---

# Phase 2 — LLM Models

## 2.1 Model Selection

Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM.

| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
|---|---|---|---|---|---|
| `qwen2.5-coder-7b` | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| `deepseek-r1-8b` | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| `llava-1.6-7b` | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| `whisper-large-v3` | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| `qwen2.5-14b` | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |

**Total VRAM for all 5 models:** ~32 GB (fits on dual-GPU setups, or models are loaded on demand)

### Feasibility on Single PC

For a single-PC deployment with one GPU in the 12–16 GB class (e.g., an RTX 3060 12 GB):

- Models are loaded/unloaded on demand — only the requested model occupies VRAM at inference time
- A model loading queue ensures graceful transitions
- Frequently-used models remain resident; idle models are evicted after a configurable timeout (a sketch of this residency policy follows section 2.2)

## 2.2 Model Management Endpoints — `/models`

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/models` | List all models with status: `loaded` / `downloading` / `queued` / `offline` |
| `GET` | `/models/{model_id}` | Single model details — context length, capabilities, benchmark scores |
| `POST` | `/models/{model_id}/load` | Load model into GPU memory (admin-only) |
| `POST` | `/models/{model_id}/unload` | Unload model from VRAM to free resources (admin-only) |
| `GET` | `/models/{model_id}/health` | Ping model — returns latency, memory usage, ready status |
| `POST` | `/models/download` | Download a model from HuggingFace by ID (admin-only) |
| `GET` | `/models/download/{task_id}` | Check download progress for a model download task |

### Response — `GET /models`

```json
{
  "models": [
    {
      "id": "qwen2.5-coder-7b",
      "name": "Qwen2.5-Coder 7B",
      "status": "loaded",
      "vram_mb": 5120,
      "context_length": 32768,
      "capabilities": ["code", "chat"],
      "loaded_at": "2026-04-07T08:30:00Z"
    },
    {
      "id": "deepseek-r1-8b",
      "name": "DeepSeek-R1 8B",
      "status": "offline",
      "vram_mb": 6144,
      "context_length": 65536,
      "capabilities": ["reasoning", "math", "chat"],
      "loaded_at": null
    }
  ]
}
```
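The residency policy from 2.1 (load on demand, evict idle or least-recently-used models) can be sketched as a small VRAM-aware manager. Everything here is illustrative: `start_worker` and `stop_worker` are hypothetical wrappers around launching and terminating a vLLM process, not vLLM library calls, and the VRAM figures mirror the table above.

```python
# Sketch of on-demand model residency for a single-GPU node.
# start_worker / stop_worker are hypothetical process wrappers, NOT vLLM APIs.
import time

VRAM_BUDGET_MB = 11_000        # e.g. a 12 GB card, minus headroom
IDLE_EVICT_SECONDS = 600       # configurable idle timeout

MODEL_VRAM_MB = {
    "qwen2.5-coder-7b": 5_120,
    "deepseek-r1-8b": 6_144,
    "qwen2.5-14b": 10_240,
}

loaded: dict[str, float] = {}  # model_id -> last-used timestamp

def start_worker(model_id: str) -> None:
    """Hypothetical: spawn a vLLM server process for model_id."""

def stop_worker(model_id: str) -> None:
    """Hypothetical: terminate the vLLM server process for model_id."""

def ensure_loaded(model_id: str) -> None:
    needed = MODEL_VRAM_MB[model_id]
    if model_id in loaded:
        loaded[model_id] = time.time()
        return
    # Evict least-recently-used models until the requested one fits.
    while sum(MODEL_VRAM_MB[m] for m in loaded) + needed > VRAM_BUDGET_MB:
        victim = min(loaded, key=loaded.get)
        stop_worker(victim)
        del loaded[victim]
    start_worker(model_id)
    loaded[model_id] = time.time()

def evict_idle() -> None:
    # Called periodically: frees VRAM held by models idle past the timeout.
    now = time.time()
    for model_id, last_used in list(loaded.items()):
        if now - last_used > IDLE_EVICT_SECONDS:
            stop_worker(model_id)
            del loaded[model_id]
```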
---

# Phase 3 — API–Model Integration

## 3.1 Architecture

**LiteLLM Proxy** sits between the FastAPI gateway and the vLLM workers. It:

1. Translates every `/query` request into the correct vLLM inference call
2. Handles smart routing (auto-model selection)
3. Load-balances across multiple workers (when scaled)
4. Retries on worker failure with exponential backoff
5. Returns OpenAI-compatible JSON responses

```
FastAPI Gateway ──► LiteLLM Proxy ──► vLLM Worker(s)
                    (routing +        (GPU inference)
                     load balance)
```

## 3.2 Integration Endpoints — `/integration`

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/integration/routing-rules` | Show current routing rules (which task type → which model) |
| `PUT` | `/integration/routing-rules` | Update routing rules (admin-only) |
| `GET` | `/integration/workers` | List all vLLM worker nodes and their current load |
| `GET` | `/integration/workers/{node_id}` | Single worker — GPU temp, VRAM used, requests in flight |
| `POST` | `/integration/workers/{node_id}/drain` | Mark worker as draining — no new requests routed (admin-only) |
| `GET` | `/integration/queue` | Current global inference queue depth across all workers |

### Response — `GET /integration/routing-rules`

```json
{
  "rules": [
    {"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
    {"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
    {"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
    {"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
    {"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
  ]
}
```

## 3.3 Scaling Strategy

| Deployment | Configuration |
|---|---|
| **Single PC** | One vLLM process, models loaded/swapped on demand |
| **2–5 PCs** | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| **6–30 PCs** | Multiple workers per model for redundancy; least-busy routing; auto-failover |

All worker addresses are stored in configuration (environment variables / config file) — **no IPs are hardcoded**. Adding a new node requires only updating the LiteLLM config and restarting the proxy.

---

# Phase 4 — API Usage Control

## 4.1 API Key Management — `/keys`

Every student receives a unique API key upon account creation. Two key types are supported:

| Key Type | Behaviour |
|---|---|
| **Static** | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| **Refresh** | Auto-rotates every 30 days. The old key has a 48-hour grace period. Ideal for long-running scripts. |

Key format: `mac_sk_live_<32-char-random-hex>`

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/keys/my-key` | Get current API key (partially masked) and metadata |
| `POST` | `/keys/generate` | Generate a new API key (invalidates the previous one) |
| `GET` | `/keys/my-key/stats` | Tokens consumed against this key — today, week, month |
| `DELETE` | `/keys/my-key` | Revoke current key permanently (must re-generate) |
| `GET` | `/keys/admin/all` | List all student API keys and status (admin-only) |
| `POST` | `/keys/admin/revoke` | Force-revoke a specific student's API key (admin-only) |

### Response — `GET /keys/my-key`

```json
{
  "key": "mac_sk_live_a1b2...****",
  "type": "refresh",
  "created_at": "2026-03-01T10:00:00Z",
  "expires_at": "2026-03-31T10:00:00Z",
  "last_used": "2026-04-07T14:23:00Z",
  "status": "active",
  "total_requests": 487
}
```
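Key issuance and verification follow directly from the format above and the "only the SHA-256 hash is stored" rule in the Security Design section. A stdlib-only sketch:

```python
# Sketch of API key issuance: the raw key is shown once, and only its
# SHA-256 hash is persisted (api_keys.key_hash in the schema below).
import hashlib
import secrets

def generate_api_key() -> tuple[str, str, str]:
    raw = f"mac_sk_live_{secrets.token_hex(16)}"          # 32-char random hex suffix
    key_hash = hashlib.sha256(raw.encode()).hexdigest()   # stored in the DB
    key_prefix = raw[:16]                                 # stored for masked display
    return raw, key_hash, key_prefix

def verify_api_key(presented: str, stored_hash: str) -> bool:
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids leaking hash prefixes via timing.
    return secrets.compare_digest(candidate, stored_hash)
```

Storing only the hash means a database leak does not expose usable keys; the `key_prefix` column exists purely so the dashboard can render the masked form shown above.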
## 4.2 Rate Limiting & Quotas — `/quota`

Rate limits are enforced at the **Redis layer** using a sliding-window algorithm. Limits are per-role and can be overridden per-user.

| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
|---|---|---|---|
| **Student** | 50,000 | 100 | 4,096 |
| **Faculty** | 200,000 | 500 | 8,192 |
| **Admin** | Unlimited | Unlimited | 16,384 |

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/quota/limits` | Show default quota limits per role |
| `GET` | `/quota/me` | My personal limits and current consumption |
| `PUT` | `/quota/admin/user/{roll_number}` | Override quota for a specific user (admin-only) |
| `GET` | `/quota/admin/exceeded` | List users who have exceeded their daily quota (admin-only) |

### Response — `GET /quota/me`

```json
{
  "role": "student",
  "limits": {
    "daily_tokens": 50000,
    "requests_per_hour": 100,
    "max_tokens_per_request": 4096
  },
  "current": {
    "tokens_used_today": 12450,
    "requests_this_hour": 8,
    "remaining_tokens": 37550,
    "resets_at": "2026-04-08T00:00:00Z"
  }
}
```

### Rate Limit Response Headers

Every API response includes:

```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1743987600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1744070400
```

When a limit is exceeded, the API returns **HTTP 429 Too Many Requests**:

```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
    "retry_after": 847
  }
}
```
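A compact sketch of the sliding window described above, assuming `redis-py`: one sorted set per user, where members are unique request IDs and scores are unix timestamps.

```python
# Redis sliding-window rate limiter sketch (redis-py).
# Sorted set per user: member = unique request id, score = unix timestamp.
import time
import uuid

import redis

r = redis.Redis(host="redis", port=6379)

def allow_request(roll_number: str, limit_per_hour: int = 100) -> bool:
    key = f"rl:{roll_number}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - 3600)   # drop entries older than 1 hour
    pipe.zadd(key, {str(uuid.uuid4()): now})    # record this request
    pipe.zcard(key)                             # count requests in the window
    pipe.expire(key, 3600)                      # garbage-collect idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit_per_hour
```

Because expired members are trimmed on every call, the count always reflects exactly the trailing hour, unlike a fixed-window counter that resets on the hour and permits bursts at the boundary.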
---

# Phase 5 — Web Interface

A React-based single-page application served by the FastAPI backend. Three views: **Student Dashboard**, **Key Management**, and **Admin Panel**.

## 5.1 Web Interface Endpoints — `/ui`

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/dashboard` | Student home — usage chart, quick-start code snippets, model status cards |
| `GET` | `/ui/keys` | Key management — view, copy, regenerate API key |
| `GET` | `/ui/history` | Request history table — timestamp, model, tokens, latency, status |
| `GET` | `/ui/playground` | Interactive chat playground — test models directly in the browser |

## 5.2 Admin Panel — `/ui/admin`

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/admin/users` | Full user list with roles, quotas, last active, status |
| `POST` | `/ui/admin/users/create` | Create user / bulk-create from CSV (roll no., name, department) |
| `PUT` | `/ui/admin/users/{roll_number}` | Edit user — change role, quota overrides, model access |
| `DELETE` | `/ui/admin/users/{roll_number}` | Deactivate a user account |
| `GET` | `/ui/admin/models` | Model management — load/unload, see which node serves what |
| `GET` | `/ui/admin/logs` | Live request logs — error rates, latency percentiles, throughput |
| `GET` | `/ui/admin/analytics` | Platform analytics — daily active users, peak hours, top models |

### Dashboard Wireframe

```
┌──────────────────────────────────────────────────────┐
│  MAC — MBM AI Cloud                     [Profile ▼]  │
├──────────────────────────────────────────────────────┤
│                                                      │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐               │
│  │ Tokens  │  │Requests │  │ Quota   │               │
│  │ Today   │  │ Today   │  │Remaining│               │
│  │ 12,450  │  │ 23      │  │ 75.1%   │               │
│  └─────────┘  └─────────┘  └─────────┘               │
│                                                      │
│  Models Available        Quick Start                 │
│  ┌──────────────────┐    ┌────────────────────┐      │
│  │ ✅ Qwen2.5-Coder │    │ from openai import │      │
│  │ ✅ DeepSeek-R1   │    │ OpenAI             │      │
│  │ ✅ Qwen2.5 14B   │    │ client = OpenAI(   │      │
│  │ ⏳ LLaVA (load.) │    │   base_url=        │      │
│  │ ✅ Whisper       │    │   "http://mac/v1", │      │
│  └──────────────────┘    │   api_key="mac_sk_"│      │
│                          │ )                  │      │
│  Usage This Week         └────────────────────┘      │
│  ┌──────────────────┐                                │
│  │ ▁▃▅▇▅▃▁ (chart)  │                                │
│  └──────────────────┘                                │
└──────────────────────────────────────────────────────┘
```

---

# Phase 6 — Guardrails

Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.

## 6.1 Guardrails Endpoints — `/guardrails`

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/guardrails/check-input` | Run input text through the content filter before sending to the model |
| `POST` | `/guardrails/check-output` | Run model output through the safety filter before returning to the user |
| `GET` | `/guardrails/rules` | List active guardrail rules (admin-only) |
| `PUT` | `/guardrails/rules` | Update rules — blocked categories, max prompt length (admin-only) |

## 6.2 Filtering Pipeline

```
User Input ──► Input Filter ──► Model Inference ──► Output Filter ──► Response
                   │                                     │
                   ├─ Prompt injection detection         ├─ PII/sensitive data redaction
                   ├─ Blocked topic detection            ├─ Harmful content detection
                   ├─ Max prompt length enforcement      ├─ Hallucination disclaimer
                   └─ Academic integrity checks          └─ Source attribution
```

### Guardrail Categories

| Category | Action | Description |
|---|---|---|
| **Prompt Injection** | Block + log | Detect attempts to override system prompts |
| **Harmful Content** | Block | Violence, self-harm, illegal activities |
| **Academic Dishonesty** | Flag + disclaimer | Full essay/assignment generation adds an academic integrity notice |
| **PII in Output** | Redact | Strip emails, phone numbers, addresses from model output |
| **Max Prompt Length** | Reject | Configurable per-role; prevents resource abuse |

### Request — `POST /guardrails/check-input`

```json
{
  "text": "User input to check",
  "context": "chat"
}
```

### Response

```json
{
  "safe": true,
  "flags": [],
  "modified_text": null
}
```
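Two of the rules above, max prompt length on input and PII redaction on output, are simple enough to sketch with plain regexes. The patterns and per-role limits here are illustrative assumptions, not the production rule set.

```python
# Sketch of two guardrail rules: max prompt length (input) and PII redaction
# (output). Patterns and limits are illustrative, not the production rule set.
import re

MAX_PROMPT_CHARS = {"student": 8_000, "faculty": 16_000, "admin": 32_000}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{10}\b")   # naive 10-digit match

def check_input(text: str, role: str) -> dict:
    # Mirrors the response shape of POST /guardrails/check-input above.
    if len(text) > MAX_PROMPT_CHARS[role]:
        return {"safe": False, "flags": ["max_prompt_length"], "modified_text": None}
    return {"safe": True, "flags": [], "modified_text": None}

def filter_output(text: str) -> dict:
    redacted = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    redacted = PHONE_RE.sub("[REDACTED PHONE]", redacted)
    flags = ["pii_redacted"] if redacted != text else []
    return {"safe": True, "flags": flags,
            "modified_text": redacted if flags else None}
```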
---

# Phase 7 — Knowledgebase + RAG

Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.

## 7.1 RAG Endpoints — `/rag`

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/rag/ingest` | Upload PDF / DOCX / TXT — chunk, embed, store in Qdrant vector DB |
| `GET` | `/rag/documents` | List all ingested documents with metadata (title, pages, chunk count) |
| `GET` | `/rag/documents/{id}` | Single document details and its chunks |
| `DELETE` | `/rag/documents/{id}` | Remove a document from the knowledgebase (admin-only) |
| `POST` | `/rag/query` | Ask a question — retrieves top-k relevant chunks → sends to LLM with context |
| `GET` | `/rag/query/{query_id}/sources` | Get source citations for a RAG response |
| `POST` | `/rag/collections` | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| `GET` | `/rag/collections` | List all collections |

## 7.2 RAG Pipeline

```
Document Upload ──► Chunking (512 tokens, 50 overlap)
                         │
                         ▼
                  Embedding (768-dim)
                         │
                         ▼
                  Qdrant Vector DB
                         │
User Question ──► Embed Query ──► Similarity Search (top-k=5)
                         │
                         ▼
              Retrieved Chunks + Question
                         │
                         ▼
                  LLM Generation
                         │
                         ▼
              Answer with Citations
```

### Request — `POST /rag/query`

```json
{
  "question": "Explain the difference between process and thread in operating systems",
  "collection": "OS",
  "top_k": 5,
  "model": "auto"
}
```

### Response — `POST /rag/query`

```json
{
  "answer": "A process is an independent program in execution with its own memory space...",
  "sources": [
    {
      "document": "Galvin - Operating System Concepts, 10th Ed",
      "chapter": "Chapter 3: Processes",
      "page": 105,
      "relevance_score": 0.92,
      "chunk_preview": "A process is a program in execution. A process is more than..."
    },
    {
      "document": "Galvin - Operating System Concepts, 10th Ed",
      "chapter": "Chapter 4: Threads",
      "page": 163,
      "relevance_score": 0.89,
      "chunk_preview": "A thread is a basic unit of CPU utilization..."
    }
  ],
  "model_used": "qwen2.5-14b",
  "tokens_used": 847
}
```

---

# Phase 8 — Retrieval + Search

Real-time web grounding via self-hosted SearXNG. Enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.

## 8.1 Search Endpoints — `/search`

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/search/web` | Query SearXNG — aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| `POST` | `/search/wikipedia` | Targeted Wikipedia search with summary extraction |
| `POST` | `/search/grounded` | Search + LLM — retrieves web results, then generates a cited answer |
| `GET` | `/search/cache` | List recently cached search results (reduces redundant fetches) |

### Request — `POST /search/grounded`

```json
{
  "query": "What are the latest improvements in vLLM v0.8?",
  "max_sources": 5,
  "model": "auto"
}
```

### Response — `POST /search/grounded`

```json
{
  "answer": "vLLM v0.8 introduced several key improvements including...",
  "sources": [
    {
      "title": "vLLM v0.8.0 Release Notes",
      "url": "https://github.com/vllm-project/vllm/releases",
      "snippet": "Key features: improved prefix caching, multi-modal support..."
    }
  ],
  "model_used": "qwen2.5-14b",
  "cached": false
}
```
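A sketch of the retrieval half of `POST /rag/query`, assuming `qdrant-client`; `embed()` is a placeholder for a call to the platform's own `/query/embeddings` endpoint, and the payload field names (`text`, `document`) are assumptions about how `/rag/ingest` stores chunks.

```python
# Sketch of the retrieval step in POST /rag/query (qdrant-client).
from qdrant_client import QdrantClient

client = QdrantClient(host="qdrant", port=6333)

def embed(text: str) -> list[float]:
    """Placeholder: in MAC this would call POST /query/embeddings."""
    raise NotImplementedError

def retrieve(question: str, collection: str = "OS", top_k: int = 5) -> list[dict]:
    hits = client.search(
        collection_name=collection,
        query_vector=embed(question),   # 768-dim vector, per the pipeline above
        limit=top_k,
    )
    # Payload fields assumed to be set at ingest time.
    return [{"chunk": h.payload["text"],
             "source": h.payload["document"],
             "relevance_score": h.score} for h in hits]

def build_prompt(question: str, chunks: list[dict]) -> str:
    # The retrieved chunks become grounding context for the LLM call.
    context = "\n\n".join(c["chunk"] for c in chunks)
    return ("Answer using only the context below. Cite your sources.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```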
} ], "model_used": "qwen2.5-14b", "cached": false } ``` --- # Database Schema ## Entity-Relationship Overview ``` ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ users │ │ api_keys │ │ request_logs │ ├─────────────┤ ├─────────────┤ ├─────────────────┤ │ id (PK) │──┐ │ id (PK) │ │ id (PK) │ │ roll_number │ ├───►│ user_id(FK) │ ┌──►│ user_id (FK) │ │ name │ │ │ key_hash │ │ │ api_key_id (FK) │ │ department │ │ │ key_prefix │ │ │ model │ │ role │ │ │ type │ │ │ endpoint │ │ password │ │ │ is_active │ │ │ tokens_in │ │ is_active │ │ │ created_at │ │ │ tokens_out │ │ created_at │ │ │ expires_at │ │ │ latency_ms │ │ updated_at │ │ │ last_used │ │ │ status_code │ └─────────────┘ │ └─────────────┘ │ │ created_at │ │ │ └─────────────────┘ └──────────────────────┘ ┌──────────────────┐ ┌─────────────────┐ │ quota_overrides │ │ rag_documents │ ├──────────────────┤ ├─────────────────┤ │ id (PK) │ │ id (PK) │ │ user_id (FK) │ │ title │ │ daily_tokens │ │ collection │ │ requests_per_hr │ │ file_type │ │ max_tokens_req │ │ chunk_count │ │ set_by (FK) │ │ uploaded_by(FK) │ │ created_at │ │ created_at │ └──────────────────┘ └─────────────────┘ ``` ## SQL Schema — Key Tables ### users ```sql CREATE TABLE users ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), roll_number VARCHAR(20) UNIQUE NOT NULL, name VARCHAR(100) NOT NULL, department VARCHAR(50) NOT NULL, role VARCHAR(20) NOT NULL DEFAULT 'student' CHECK (role IN ('student', 'faculty', 'admin')), hashed_password VARCHAR(255) NOT NULL, is_active BOOLEAN DEFAULT TRUE, created_at TIMESTAMPTZ DEFAULT NOW(), updated_at TIMESTAMPTZ DEFAULT NOW() ); ``` ### api_keys ```sql CREATE TABLE api_keys ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE, key_hash VARCHAR(255) NOT NULL, key_prefix VARCHAR(20) NOT NULL, type VARCHAR(10) NOT NULL DEFAULT 'static' CHECK (type IN ('static', 'refresh')), is_active BOOLEAN DEFAULT TRUE, created_at TIMESTAMPTZ DEFAULT NOW(), expires_at TIMESTAMPTZ, last_used TIMESTAMPTZ ); ``` ### request_logs ```sql CREATE TABLE request_logs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), user_id UUID NOT NULL REFERENCES users(id), api_key_id UUID REFERENCES api_keys(id), model VARCHAR(50) NOT NULL, endpoint VARCHAR(100) NOT NULL, tokens_in INTEGER NOT NULL DEFAULT 0, tokens_out INTEGER NOT NULL DEFAULT 0, latency_ms INTEGER, status_code SMALLINT NOT NULL, created_at TIMESTAMPTZ DEFAULT NOW() ); CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at); CREATE INDEX idx_request_logs_model ON request_logs (model, created_at); ``` --- # Project Folder Structure ``` mac/ ├── api/ │ ├── main.py # FastAPI application entry point │ ├── routers/ │ │ ├── auth.py # /auth endpoints │ │ ├── explore.py # /explore endpoints │ │ ├── query.py # /query endpoints │ │ ├── usage.py # /usage endpoints │ │ ├── keys.py # /keys endpoints │ │ ├── models.py # /models endpoints │ │ ├── integration.py # /integration endpoints │ │ ├── quota.py # /quota endpoints │ │ ├── guardrails.py # /guardrails endpoints │ │ ├── rag.py # /rag endpoints │ │ └── search.py # /search endpoints │ ├── models/ │ │ ├── user.py # SQLAlchemy User model │ │ ├── api_key.py # APIKey model │ │ ├── request_log.py # RequestLog model │ │ └── document.py # RAG Document model │ ├── schemas/ │ │ ├── auth.py # Pydantic request/response schemas │ │ ├── query.py # Chat, completions, vision schemas │ │ ├── usage.py # Usage response schemas │ │ └── common.py # Shared schemas (pagination, errors) │ ├── core/ │ │ ├── config.py 
---

# Project Folder Structure

```
mac/
├── api/
│   ├── main.py                  # FastAPI application entry point
│   ├── routers/
│   │   ├── auth.py              # /auth endpoints
│   │   ├── explore.py           # /explore endpoints
│   │   ├── query.py             # /query endpoints
│   │   ├── usage.py             # /usage endpoints
│   │   ├── keys.py              # /keys endpoints
│   │   ├── models.py            # /models endpoints
│   │   ├── integration.py       # /integration endpoints
│   │   ├── quota.py             # /quota endpoints
│   │   ├── guardrails.py        # /guardrails endpoints
│   │   ├── rag.py               # /rag endpoints
│   │   └── search.py            # /search endpoints
│   ├── models/
│   │   ├── user.py              # SQLAlchemy User model
│   │   ├── api_key.py           # APIKey model
│   │   ├── request_log.py       # RequestLog model
│   │   └── document.py          # RAG Document model
│   ├── schemas/
│   │   ├── auth.py              # Pydantic request/response schemas
│   │   ├── query.py             # Chat, completions, vision schemas
│   │   ├── usage.py             # Usage response schemas
│   │   └── common.py            # Shared schemas (pagination, errors)
│   ├── core/
│   │   ├── config.py            # Settings from environment variables
│   │   ├── security.py          # JWT creation/validation, password hashing
│   │   ├── rate_limit.py        # Redis-based sliding window rate limiter
│   │   ├── dependencies.py      # FastAPI dependency injection (get_current_user, etc.)
│   │   └── db.py                # Database session factory
│   ├── services/
│   │   ├── model_router.py      # Smart routing logic (auto model selection)
│   │   ├── inference.py         # LiteLLM client for model calls
│   │   ├── usage_tracker.py     # Token counting and logging
│   │   ├── guardrails.py        # Input/output filtering logic
│   │   └── rag.py               # Document ingestion, embedding, retrieval
│   └── requirements.txt
├── litellm/
│   └── config.yaml              # LiteLLM proxy routing configuration
├── nginx/
│   └── nginx.conf               # Reverse proxy configuration
├── frontend/
│   ├── src/
│   │   ├── pages/               # Dashboard, Keys, History, Admin pages
│   │   ├── components/          # Reusable UI components
│   │   └── api/                 # API client (Axios/fetch wrappers)
│   ├── package.json
│   └── vite.config.ts
├── scripts/
│   ├── seed_users.py            # Bulk-create students from CSV
│   ├── download_models.py       # Pull models from HuggingFace
│   └── health_check.py          # Verify all services are running
├── docker-compose.yml           # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile                   # FastAPI container
├── .env.example                 # Template for environment variables
├── alembic/                     # Database migrations
│   └── versions/
└── README.md
```

---

# Security Design

| Concern | Implementation |
|---|---|
| **Password Storage** | bcrypt with work factor 12; passwords never stored in plaintext |
| **JWT Signing** | RS256 asymmetric keys; access tokens short-lived (24h) |
| **API Key Storage** | Only SHA-256 hash stored in DB; raw key shown once at creation |
| **Transport** | Nginx terminates TLS (HTTPS); internal services communicate over the Docker network |
| **Input Validation** | Pydantic schema validation on every request; max payload size enforced |
| **SQL Injection** | SQLAlchemy ORM with parameterised queries throughout |
| **Rate Limiting** | Redis sliding-window per user + per IP; prevents abuse and DoS |
| **CORS** | Strict origin allowlist — only the MAC frontend domain |
| **Admin Endpoints** | Role-based access control; admin-only routes check the JWT role claim |
| **Prompt Injection** | Input guardrails detect and block prompt override attempts |
| **Secrets Management** | All secrets in `.env` file; never committed to version control |

---

# Error Response Format

All errors follow a consistent structure:

```json
{
  "error": {
    "code": "authentication_failed",
    "message": "Invalid roll number or password.",
    "status": 401,
    "timestamp": "2026-04-07T14:30:00Z",
    "request_id": "mac-req-a1b2c3"
  }
}
```

### Standard Error Codes

| HTTP Status | Code | Description |
|---|---|---|
| 400 | `bad_request` | Malformed request body or invalid parameters |
| 401 | `authentication_failed` | Missing or invalid credentials / API key |
| 403 | `forbidden` | Valid auth but insufficient role permissions |
| 404 | `not_found` | Resource does not exist |
| 409 | `conflict` | Duplicate resource (e.g., user already exists) |
| 422 | `validation_error` | Request schema validation failed |
| 429 | `rate_limit_exceeded` | Request or token quota exceeded |
| 500 | `internal_error` | Unexpected server error |
| 503 | `model_unavailable` | Requested model is not loaded or all workers are busy |
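In FastAPI, one exception handler can produce this envelope for every failure path. A minimal sketch, where `MacError` is a hypothetical application exception type carrying the code and status:

```python
# Sketch: a single FastAPI handler emits the uniform error envelope above.
# MacError is a hypothetical application exception, not an existing class.
import uuid
from datetime import datetime, timezone

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class MacError(Exception):
    def __init__(self, code: str, message: str, status: int):
        self.code, self.message, self.status = code, message, status

@app.exception_handler(MacError)
async def mac_error_handler(request: Request, exc: MacError) -> JSONResponse:
    # Every routed error passes through here, so the envelope stays consistent.
    return JSONResponse(
        status_code=exc.status,
        content={"error": {
            "code": exc.code,
            "message": exc.message,
            "status": exc.status,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "request_id": f"mac-req-{uuid.uuid4().hex[:6]}",
        }},
    )

# Usage inside a router:
#   raise MacError("not_found", "Model does not exist.", 404)
```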
---

# Deployment Configuration

## Environment Variables (`.env`)

```env
# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production

# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=

# Redis
REDIS_HOST=redis
REDIS_PORT=6379

# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30

# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=

# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333

# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b
```

## Docker Compose Services

| Service | Image | Ports | Purpose |
|---|---|---|---|
| `api` | Custom (Dockerfile) | 8000 | FastAPI gateway |
| `db` | postgres:16-alpine | 5432 | User data, logs, keys |
| `redis` | redis:7-alpine | 6379 | Rate limiting, caching |
| `litellm` | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| `qdrant` | qdrant/qdrant | 6333 | Vector database (RAG) |
| `searxng` | searxng/searxng | 8888 | Web search engine |
| `nginx` | nginx:alpine | 80, 443 | Reverse proxy + TLS |

---

# API Versioning & Compatibility

- All endpoints are prefixed with `/api/v1`
- The `/query` endpoints return **OpenAI-compatible JSON** — any OpenAI SDK works by changing `base_url`
- Breaking changes will increment the version (`/api/v2`) while the previous version is maintained for one semester
- List endpoints use cursor-based pagination

### OpenAI SDK Compatibility Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://mac-server/api/v1",
    api_key="mac_sk_live_your_key_here"
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "Explain binary search in Python"}
    ]
)

print(response.choices[0].message.content)
```

---

# Summary

MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.

**Phases 1–4** (API Endpoints + LLM Models + Integration + Usage Control) establish the complete backend — once deployed, students can start querying AI models from their laptops on day one.

---

*MAC — MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026*