# MAC – MBM AI Cloud
## API Design & Architecture Plan
### Phase 1–8 · Complete Platform Blueprint
**Prepared for Professor Review · 07 April 2026**
---
## Executive Summary
MAC (MBM AI Cloud) is a **fully self-hosted, zero-cloud AI inference platform** purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models – text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning – through a standardised REST API authenticated by per-student API keys.
**Key design goals:**
- **Zero cloud cost** – all inference runs on college-owned GPUs
- **OpenAI-compatible API** – students use familiar SDKs; existing tutorials work unchanged
- **Scalable from day one** – starts on a single PC, scales horizontally to 30+ nodes with no code changes
- **Secure by default** – JWT authentication, role-based access, rate limiting, input/output guardrails
- **Academically useful** – RAG over college textbooks, web-grounded search, per-student usage tracking
---
## Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| **API Gateway** | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-safe, fastest Python framework |
| **Database** | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| **Cache / Rate Limiter** | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| **LLM Inference** | vLLM | PagedAttention, continuous batching, highest throughput for local GPUs |
| **Model Router / Proxy** | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| **Vector Database** | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| **Web Search** | SearXNG | Self-hosted meta-search, no API keys needed |
| **Reverse Proxy** | Nginx | TLS termination, request buffering, static file serving |
| **Containerisation** | Docker + Docker Compose | Reproducible deployments, one-command startup |
| **Frontend** | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| **Task Queue** | Celery + Redis | Background jobs: document ingestion, model downloads |
---
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│                            College LAN                           │
│                   Students / Faculty / Lab PCs                   │
└──────────────────────────┬───────────────────────────────────────┘
                           │ HTTPS
                           ▼
                ┌─────────────────────┐
                │        Nginx        │
                │    Reverse Proxy    │
                │    TLS · Caching    │
                └──────────┬──────────┘
                           │
                ┌──────────▼──────────┐
                │       FastAPI       │
                │     API Gateway     │
                │    Auth · Routing   │
                │    Rate Limiting    │
                └───┬──────┬──────┬───┘
                    │      │      │
         ┌──────────┘      │      └──────────┐
         ▼                 │                 ▼
   ┌──────────┐     ┌──────▼──────┐    ┌──────────┐
   │PostgreSQL│     │   LiteLLM   │    │  Qdrant  │
   │  Users   │     │    Proxy    │    │ VectorDB │
   │  Logs    │     │   Routing   │    │  (RAG)   │
   │  Keys    │     └──────┬──────┘    └──────────┘
   └──────────┘            │
                           ▼
                ┌─────────────────────┐
                │        vLLM         │
                │    Model Workers    │
                │    GPU Inference    │
                └─────────────────────┘
```
**Scaling model:** The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.
---
## Build Phases – 8 Phases, Sequential
| # | Phase | Description | Dependencies |
|---|---|---|---|
| 1 | **API Endpoints** | Core REST API – explore, query, usage, auth | None |
| 2 | **LLM Models** | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | **API–Model Integration** | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phase 1, 2 |
| 4 | **API Usage Control** | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | **Web Interface** | Dashboard, user management, admin panel, model access controls | Phase 1, 4 |
| 6 | **Guardrails** | Input + output content filtering, safety checks | Phase 3 |
| 7 | **Knowledgebase + RAG** | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | **Retrieval + Search** | SearXNG web search, Wikipedia, real-time grounded answers | Phase 3, 7 |
---
## Base URL
```
http://<server-ip>/api/v1
```
> The server IP is configured via environment variable `MAC_HOST`. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.
---
# Phase 1 – API Endpoints
## 1.1 Authentication – `/auth`
Handles user login, session management, and JWT token lifecycle. All passwords are hashed with **bcrypt** (work factor 12). Tokens use **RS256** JWT signing.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/auth/login` | Roll number + password → JWT access token + refresh token |
| `POST` | `/auth/logout` | Invalidate current session / revoke refresh token |
| `POST` | `/auth/refresh` | Exchange refresh token for a new access token |
| `GET` | `/auth/me` | Current user's profile, role, department, and API key |
| `POST` | `/auth/change-password` | Change password (requires current password verification) |
### Request – `POST /auth/login`
```json
{
"roll_number": "21CS045",
"password": "secure_password"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `roll_number` | string | Yes | Student/faculty roll number, e.g. `21CS045` |
| `password` | string | Yes | Account password (min 8 chars) |
### Response – `POST /auth/login`
```json
{
"access_token": "eyJhbG...",
"refresh_token": "dGhpcyBp...",
"token_type": "bearer",
"expires_in": 86400,
"user": {
"roll_number": "21CS045",
"name": "Aryan Sharma",
"department": "CSE",
"role": "student",
"api_key": "mac_sk_live_abc123..."
}
}
```
| Field | Description |
|---|---|
| `access_token` | JWT, 24-hour expiry, used for all authenticated requests |
| `refresh_token` | 30-day expiry, used only at `/auth/refresh` |
| `user.role` | One of: `student`, `faculty`, `admin` |
| `user.api_key` | Personal API key for direct model queries |
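
The gateway implements this flow in `core/security.py` (see the folder structure later in this document). Below is a minimal sketch of password verification and token issuance, assuming `passlib` for bcrypt and `PyJWT` for RS256 signing; the helper names are illustrative, not the final implementation:
```python
# Sketch only: bcrypt (work factor 12) + RS256 access tokens.
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT
from passlib.context import CryptContext

pwd_context = CryptContext(schemes=["bcrypt"], bcrypt__rounds=12)

with open("/run/secrets/jwt_private.pem") as f:
    PRIVATE_KEY = f.read()

def verify_password(plain: str, hashed: str) -> bool:
    return pwd_context.verify(plain, hashed)

def create_access_token(roll_number: str, role: str) -> str:
    now = datetime.now(timezone.utc)
    claims = {
        "sub": roll_number,
        "role": role,                      # checked by admin-only routes
        "iat": now,
        "exp": now + timedelta(hours=24),  # matches expires_in: 86400
    }
    return jwt.encode(claims, PRIVATE_KEY, algorithm="RS256")
```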
---
## 1.2 Explore – `/explore`
Read-only discovery endpoints. Allow students to see what models and capabilities are available before writing code.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/explore/models` | List all deployed models with capabilities, context length, and status |
| `GET` | `/explore/models/{model_id}` | Detailed info – parameters, benchmarks, example prompts |
| `GET` | `/explore/models/search` | Search by capability tag: `?tag=vision`, `?tag=code`, `?tag=math` |
| `GET` | `/explore/endpoints` | List every API endpoint with method, path, auth requirement, description |
| `GET` | `/explore/health` | Platform health – node status, GPU temperatures, inference queue depth |
| `GET` | `/explore/usage-stats` | Aggregated platform analytics (admin-only) – tokens/day, active users |
### Response – `GET /explore/models`
```json
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"specialty": "Code generation, debugging, explanation",
"parameters": "7B",
"context_length": 32768,
"status": "loaded",
"capabilities": ["code", "chat", "completion"]
}
],
"total": 5
}
```
### Response – `GET /explore/health`
```json
{
"status": "healthy",
"uptime_seconds": 345600,
"nodes": [
{
"ip": "192.168.1.101",
"gpu": "NVIDIA RTX 3060 12GB",
"gpu_temp_c": 62,
"vram_used_gb": 9.2,
"vram_total_gb": 12.0,
"requests_in_flight": 3,
"status": "active"
}
],
"queue_depth": 7
}
```
---
## 1.3 Query – `/query`
The core inference API. All endpoints require an `Authorization: Bearer <api_key>` header. Responses follow the **OpenAI Chat Completions format** – existing OpenAI SDK code works with zero changes after swapping `base_url`.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/query/chat` | Chat completion – multi-turn conversation (text/code/math) |
| `POST` | `/query/completions` | Raw text completion (OpenAI-compatible) |
| `POST` | `/query/vision` | Image + text → answer (vision model) |
| `POST` | `/query/speech-to-text` | Upload audio → transcribed text (Whisper) |
| `POST` | `/query/text-to-speech` | Text → audio file download (TTS) |
| `POST` | `/query/embeddings` | Text → vector embedding (for RAG / similarity search) |
| `POST` | `/query/rerank` | Re-rank passages by relevance to a query |
### Request – `POST /query/chat`
```json
{
"model": "auto",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a linked list."}
],
"temperature": 0.7,
"max_tokens": 2048,
"stream": false
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID or `"auto"` for smart routing |
| `messages` | array | Yes | Array of `{role, content}` – system / user / assistant |
| `temperature` | float | No | Sampling temperature 0.0–2.0 (default `0.7`) |
| `max_tokens` | integer | No | Max tokens to generate (default `2048`, max `4096`) |
| `stream` | boolean | No | Stream tokens as Server-Sent Events (default `false`) |
| `context_id` | string | No | Maintain conversation state server-side |
### Response – `POST /query/chat`
```json
{
"id": "mac-chat-a1b2c3d4",
"object": "chat.completion",
"created": 1743984000,
"model": "qwen2.5-coder-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a Python function to reverse a linked list..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 187,
"total_tokens": 229
}
}
```
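For clients that do not use an OpenAI SDK, a plain HTTP call works the same way. A minimal sketch with the `requests` library (the server address and API key are placeholders):
```python
import requests

resp = requests.post(
    "http://<server-ip>/api/v1/query/chat",
    headers={"Authorization": "Bearer mac_sk_live_your_key_here"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Reverse a linked list in Python"}],
        "max_tokens": 512,
    },
    timeout=120,
)
resp.raise_for_status()
data = resp.json()
print(data["choices"][0]["message"]["content"])
print("tokens used:", data["usage"]["total_tokens"])
```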
### Smart Routing (`model: "auto"`)
When students set `model` to `"auto"`, the API gateway inspects the request and routes to the optimal model:
| Signal Detected | Routed To |
|---|---|
| Code keywords, programming language names, `write a function`, `debug this` | `qwen2.5-coder-7b` |
| Math expressions, equations, `solve`, `prove`, `step by step` | `deepseek-r1-8b` |
| Image attachment in request body | `llava-1.6-7b` |
| Audio file upload | `whisper-large-v3` |
| General text, summarisation, writing, Q&A | `qwen2.5-14b` |
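
A simplified sketch of this routing table in code; the deployed `services/model_router.py` may weigh signals differently:
```python
# Keyword/attachment-based routing sketch (keyword lists are illustrative).
CODE_KEYWORDS = {"function", "debug", "code", "python", "javascript"}
MATH_KEYWORDS = {"solve", "equation", "prove", "integral", "step by step"}

def route(request: dict) -> str:
    if request.get("image"):               # image attachment in request body
        return "llava-1.6-7b"
    if request.get("audio"):               # audio file upload
        return "whisper-large-v3"
    text = " ".join(
        m["content"] for m in request["messages"]
        if isinstance(m.get("content"), str)
    ).lower()
    if any(kw in text for kw in CODE_KEYWORDS):
        return "qwen2.5-coder-7b"
    if any(kw in text for kw in MATH_KEYWORDS):
        return "deepseek-r1-8b"
    return "qwen2.5-14b"                   # general-purpose default
```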
---
## 1.4 Usage – `/usage`
Per-student consumption tracking. Every API call logs token counts, model used, and timestamp.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/usage/me` | My tokens used – today, this week, this month, broken down by model |
| `GET` | `/usage/me/history` | Full request history – timestamps, models, token counts, latency |
| `GET` | `/usage/me/quota` | My current quota limits and remaining balance |
| `GET` | `/usage/admin/all` | All users' usage summary (admin-only) |
| `GET` | `/usage/admin/user/{roll_number}` | Specific student's full usage details (admin-only) |
| `GET` | `/usage/admin/models` | Per-model usage statistics across the platform (admin-only) |
### Response – `GET /usage/me`
```json
{
"roll_number": "21CS045",
"usage": {
"today": {
"total_tokens": 12450,
"requests": 23,
"by_model": {
"qwen2.5-coder-7b": {"tokens": 8200, "requests": 15},
"qwen2.5-14b": {"tokens": 4250, "requests": 8}
}
},
"this_week": {"total_tokens": 67800, "requests": 142},
"this_month": {"total_tokens": 234500, "requests": 487}
},
"quota": {
"daily_limit": 50000,
"remaining_today": 37550
}
}
```
---
# Phase 2 – LLM Models
## 2.1 Model Selection
Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM.
| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
|---|---|---|---|---|---|
| `qwen2.5-coder-7b` | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| `deepseek-r1-8b` | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| `llava-1.6-7b` | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| `whisper-large-v3` | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| `qwen2.5-14b` | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |
**Total VRAM for all 5 models:** ~32 GB (fits dual-GPU setups outright; a single GPU loads models on demand)
### Feasibility on Single PC
For a single-PC deployment with one GPU (e.g., RTX 3060 12GB or RTX 4060 Ti 16GB):
- Models are loaded/unloaded on demand – only the requested model occupies VRAM at inference time
- A model loading queue ensures graceful transitions
- Frequently-used models remain resident; idle models are evicted after a configurable timeout
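
A minimal sketch of that load/evict policy as an LRU residency manager; the class and method names are assumptions, not the final implementation:
```python
import time

class ModelManager:
    """LRU-style residency manager: evict least-recently-used or idle models."""

    def __init__(self, vram_budget_mb: int, idle_timeout_s: int = 600):
        self.vram_budget_mb = vram_budget_mb
        self.idle_timeout_s = idle_timeout_s
        self.loaded: dict[str, float] = {}   # model_id -> last-used timestamp

    def acquire(self, model_id: str, vram_mb: dict[str, int]) -> None:
        """Ensure model_id is resident, evicting LRU models if VRAM is tight."""
        if model_id not in self.loaded:
            used = sum(vram_mb[m] for m in self.loaded)
            while self.loaded and used + vram_mb[model_id] > self.vram_budget_mb:
                victim = min(self.loaded, key=self.loaded.get)  # least recently used
                used -= vram_mb[victim]
                self.unload(victim)
            self.load(model_id)
        self.loaded[model_id] = time.time()

    def evict_idle(self) -> None:
        """Periodic sweep: unload models idle past the configured timeout."""
        cutoff = time.time() - self.idle_timeout_s
        for m, last_used in list(self.loaded.items()):
            if last_used < cutoff:
                self.unload(m)

    def load(self, model_id: str) -> None: ...  # start vLLM worker / load weights
    def unload(self, model_id: str) -> None:    # stop worker, free VRAM
        self.loaded.pop(model_id, None)
```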
## 2.2 Model Management Endpoints – `/models`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/models` | List all models with status: `loaded` / `downloading` / `queued` / `offline` |
| `GET` | `/models/{model_id}` | Single model details – context length, capabilities, benchmark scores |
| `POST` | `/models/{model_id}/load` | Load model into GPU memory (admin-only) |
| `POST` | `/models/{model_id}/unload` | Unload model from VRAM to free resources (admin-only) |
| `GET` | `/models/{model_id}/health` | Ping model – returns latency, memory usage, ready status |
| `POST` | `/models/download` | Download a model from HuggingFace by ID (admin-only) |
| `GET` | `/models/download/{task_id}` | Check download progress for a model download task |
### Response – `GET /models`
```json
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"status": "loaded",
"vram_mb": 5120,
"context_length": 32768,
"capabilities": ["code", "chat"],
"loaded_at": "2026-04-07T08:30:00Z"
},
{
"id": "deepseek-r1-8b",
"name": "DeepSeek-R1 8B",
"status": "offline",
"vram_mb": 6144,
"context_length": 65536,
"capabilities": ["reasoning", "math", "chat"],
"loaded_at": null
}
]
}
```
---
# Phase 3 – API–Model Integration
## 3.1 Architecture
**LiteLLM Proxy** sits between the FastAPI gateway and vLLM workers. It:
1. Translates every `/query` request into the correct vLLM inference call
2. Handles smart routing (auto-model selection)
3. Load-balances across multiple workers (when scaled)
4. Retries on worker failure with exponential backoff
5. Returns OpenAI-compatible JSON responses
```
FastAPI Gateway ──► LiteLLM Proxy ──► vLLM Worker(s)
                    (routing +        (GPU inference)
                     load balance)
```
## 3.2 Integration Endpoints – `/integration`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/integration/routing-rules` | Show current routing rules (which task type → which model) |
| `PUT` | `/integration/routing-rules` | Update routing rules (admin-only) |
| `GET` | `/integration/workers` | List all vLLM worker nodes and their current load |
| `GET` | `/integration/workers/{node_id}` | Single worker – GPU temp, VRAM used, requests in flight |
| `POST` | `/integration/workers/{node_id}/drain` | Mark worker as draining – no new requests routed (admin-only) |
| `GET` | `/integration/queue` | Current global inference queue depth across all workers |
### Response – `GET /integration/routing-rules`
```json
{
"rules": [
{"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
{"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
{"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
{"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
{"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
]
}
```
## 3.3 Scaling Strategy
| Deployment | Configuration |
|---|---|
| **Single PC** | One vLLM process, models loaded/swapped on demand |
| **2–5 PCs** | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| **6–30 PCs** | Multiple workers per model for redundancy; least-busy routing; auto-failover |
All worker addresses are stored in configuration (environment variables / config file) – **no IPs are hardcoded**. Adding a new node requires only updating the LiteLLM config and restarting the proxy.
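For illustration, a hedged sketch of what `litellm/config.yaml` could look like for a small multi-worker deployment; the hostnames, ports, and routing strategy shown are assumptions:
```yaml
# Sketch of litellm/config.yaml -- worker addresses are illustrative.
# Two entries sharing a model_name are load-balanced by the proxy.
model_list:
  - model_name: qwen2.5-coder-7b
    litellm_params:
      model: openai/qwen2.5-coder-7b      # vLLM serves an OpenAI-compatible API
      api_base: http://worker-1:8001/v1
  - model_name: qwen2.5-coder-7b
    litellm_params:
      model: openai/qwen2.5-coder-7b
      api_base: http://worker-2:8001/v1   # second replica of the same model
  - model_name: qwen2.5-14b
    litellm_params:
      model: openai/qwen2.5-14b
      api_base: http://worker-3:8002/v1

router_settings:
  routing_strategy: least-busy
```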
---
# Phase 4 – API Usage Control
## 4.1 API Key Management – `/keys`
Every student receives a unique API key upon account creation. Two key types are supported:
| Key Type | Behaviour |
|---|---|
| **Static** | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| **Refresh** | Auto-rotates every 30 days. Old key has a 48-hour grace period. Ideal for long-running scripts. |
Key format: `mac_sk_live_<32-char-random-hex>`
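A minimal generation sketch consistent with this format and with the Security Design section (only a SHA-256 hash of the raw key is persisted; the raw key is shown to the student once):
```python
import hashlib
import secrets

def generate_api_key() -> tuple[str, str, str]:
    raw = f"mac_sk_live_{secrets.token_hex(16)}"          # 32 random hex chars
    key_hash = hashlib.sha256(raw.encode()).hexdigest()   # -> api_keys.key_hash
    key_prefix = raw[:16]                                 # -> api_keys.key_prefix
    return raw, key_hash, key_prefix
```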
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/keys/my-key` | Get current API key (partially masked) and metadata |
| `POST` | `/keys/generate` | Generate a new API key (invalidates the previous one) |
| `GET` | `/keys/my-key/stats` | Tokens consumed against this key – today, week, month |
| `DELETE` | `/keys/my-key` | Revoke current key permanently (must re-generate) |
| `GET` | `/keys/admin/all` | List all student API keys and status (admin-only) |
| `POST` | `/keys/admin/revoke` | Force-revoke a specific student's API key (admin-only) |
### Response – `GET /keys/my-key`
```json
{
"key": "mac_sk_live_a1b2...****",
"type": "refresh",
"created_at": "2026-03-01T10:00:00Z",
"expires_at": "2026-03-31T10:00:00Z",
"last_used": "2026-04-07T14:23:00Z",
"status": "active",
"total_requests": 487
}
```
## 4.2 Rate Limiting & Quotas – `/quota`
Rate limits are enforced at the **Redis layer** using a sliding-window algorithm. Limits are per-role and can be overridden per-user.
| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
|---|---|---|---|
| **Student** | 50,000 | 100 | 4,096 |
| **Faculty** | 200,000 | 500 | 8,192 |
| **Admin** | Unlimited | Unlimited | 16,384 |
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/quota/limits` | Show default quota limits per role |
| `GET` | `/quota/me` | My personal limits and current consumption |
| `PUT` | `/quota/admin/user/{roll_number}` | Override quota for a specific user (admin-only) |
| `GET` | `/quota/admin/exceeded` | List users who have exceeded their daily quota (admin-only) |
### Response – `GET /quota/me`
```json
{
"role": "student",
"limits": {
"daily_tokens": 50000,
"requests_per_hour": 100,
"max_tokens_per_request": 4096
},
"current": {
"tokens_used_today": 12450,
"requests_this_hour": 8,
"remaining_tokens": 37550,
"resets_at": "2026-04-08T00:00:00Z"
}
}
```
### Rate Limit Response Headers
Every API response includes:
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1743987600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1744070400
```
When a limit is exceeded, the API responds with `HTTP 429 Too Many Requests`:
```json
{
"error": {
"code": "rate_limit_exceeded",
"message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
"retry_after": 847
}
}
```
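A minimal sketch of the sliding-window check behind these limits, using `redis-py` with one sorted set per user; the key layout is an assumption:
```python
import time
import uuid

import redis

r = redis.Redis(host="redis", port=6379)

def allow_request(roll_number: str, limit_per_hour: int) -> bool:
    """Sliding one-hour window: count timestamped entries in a sorted set."""
    key = f"ratelimit:{roll_number}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - 3600)   # drop entries older than 1 hour
    pipe.zadd(key, {uuid.uuid4().hex: now})     # record this request
    pipe.zcard(key)                             # count requests in the window
    pipe.expire(key, 3600)                      # let idle keys expire
    _, _, count, _ = pipe.execute()
    return count <= limit_per_hour
```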
---
# Phase 5 – Web Interface
A React-based single-page application served by the FastAPI backend. Three views: **Student Dashboard**, **Key Management**, and **Admin Panel**.
## 5.1 Web Interface Endpoints β `/ui`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/dashboard` | Student home – usage chart, quick-start code snippets, model status cards |
| `GET` | `/ui/keys` | Key management – view, copy, regenerate API key |
| `GET` | `/ui/history` | Request history table – timestamp, model, tokens, latency, status |
| `GET` | `/ui/playground` | Interactive chat playground – test models directly in the browser |
## 5.2 Admin Panel – `/ui/admin`
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/ui/admin/users` | Full user list with roles, quotas, last active, status |
| `POST` | `/ui/admin/users/create` | Create user / bulk-create from CSV (roll no., name, department) |
| `PUT` | `/ui/admin/users/{roll_number}` | Edit user – change role, quota overrides, model access |
| `DELETE` | `/ui/admin/users/{roll_number}` | Deactivate a user account |
| `GET` | `/ui/admin/models` | Model management – load/unload, see which node serves what |
| `GET` | `/ui/admin/logs` | Live request logs – error rates, latency percentiles, throughput |
| `GET` | `/ui/admin/analytics` | Platform analytics – daily active users, peak hours, top models |
### Dashboard Wireframe
```
┌────────────────────────────────────────────────────────┐
│  MAC – MBM AI Cloud                       [Profile ▼]  │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐               │
│  │ Tokens  │   │Requests │   │  Quota  │               │
│  │  Today  │   │  Today  │   │Remaining│               │
│  │ 12,450  │   │   23    │   │  75.1%  │               │
│  └─────────┘   └─────────┘   └─────────┘               │
│                                                        │
│  Models Available        Quick Start                   │
│  ┌──────────────────┐    ┌─────────────────────┐       │
│  │ ✓ Qwen2.5-Coder  │    │ from openai import  │       │
│  │ ✓ DeepSeek-R1    │    │ OpenAI              │       │
│  │ ✓ Qwen2.5 14B    │    │ client = OpenAI(    │       │
│  │ ⏳ LLaVA (load.)  │    │   base_url=         │       │
│  │ ✓ Whisper        │    │   "http://mac/v1",  │       │
│  └──────────────────┘    │   api_key="mac_sk_" │       │
│                          │ )                   │       │
│  Usage This Week         └─────────────────────┘       │
│  ┌──────────────────┐                                  │
│  │ ▂▃▅▇▆▄ (chart)   │                                  │
│  └──────────────────┘                                  │
└────────────────────────────────────────────────────────┘
```
---
# Phase 6 – Guardrails
Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.
## 6.1 Guardrails Endpoints – `/guardrails`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/guardrails/check-input` | Run input text through content filter before sending to model |
| `POST` | `/guardrails/check-output` | Run model output through safety filter before returning to user |
| `GET` | `/guardrails/rules` | List active guardrail rules (admin-only) |
| `PUT` | `/guardrails/rules` | Update rules – blocked categories, max prompt length (admin-only) |
## 6.2 Filtering Pipeline
```
User Input ──► Input Filter ──► Model Inference ──► Output Filter ──► Response
                    │                                     │
                    ├─ Prompt injection detection         ├─ PII/sensitive data redaction
                    ├─ Blocked topic detection            ├─ Harmful content detection
                    ├─ Max prompt length enforcement      ├─ Hallucination disclaimer
                    └─ Academic integrity checks          └─ Source attribution
```
### Guardrail Categories
| Category | Action | Description |
|---|---|---|
| **Prompt Injection** | Block + log | Detect attempts to override system prompts |
| **Harmful Content** | Block | Violence, self-harm, illegal activities |
| **Academic Dishonesty** | Flag + disclaimer | Full essay/assignment generation adds academic integrity notice |
| **PII in Output** | Redact | Strip emails, phone numbers, addresses from model output |
| **Max Prompt Length** | Reject | Configurable per-role; prevents resource abuse |
### Request – `POST /guardrails/check-input`
```json
{
"text": "User input to check",
"context": "chat"
}
```
### Response
```json
{
"safe": true,
"flags": [],
"modified_text": null
}
```
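A simplified sketch of an input-filter pass that produces this response shape; the regex heuristics here are illustrative, and the deployed filter may add classifier models:
```python
import re

INJECTION_PATTERNS = [
    r"ignore (all|previous|above) instructions",
    r"you are now",
    r"reveal (the|your) system prompt",
]
MAX_PROMPT_CHARS = 16_000  # assumed per-role limit

def check_input(text: str) -> dict:
    flags = []
    if len(text) > MAX_PROMPT_CHARS:
        flags.append("max_prompt_length")
    if any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS):
        flags.append("prompt_injection")
    return {"safe": not flags, "flags": flags, "modified_text": None}
```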
---
# Phase 7 – Knowledgebase + RAG
Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.
## 7.1 RAG Endpoints – `/rag`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/rag/ingest` | Upload PDF / DOCX / TXT → chunk, embed, store in Qdrant vector DB |
| `GET` | `/rag/documents` | List all ingested documents with metadata (title, pages, chunk count) |
| `GET` | `/rag/documents/{id}` | Single document details and its chunks |
| `DELETE` | `/rag/documents/{id}` | Remove a document from the knowledgebase (admin-only) |
| `POST` | `/rag/query` | Ask a question → retrieves top-k relevant chunks → sends to LLM with context |
| `GET` | `/rag/query/{query_id}/sources` | Get source citations for a RAG response |
| `POST` | `/rag/collections` | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| `GET` | `/rag/collections` | List all collections |
## 7.2 RAG Pipeline
```
Document Upload ──► Chunking (512 tokens, 50 overlap)
                         │
                         ▼
                  Embedding (768-dim)
                         │
                         ▼
                  Qdrant Vector DB
                         │
User Question ──► Embed Query ──► Similarity Search (top-k=5)
                         │
                         ▼
              Retrieved Chunks + Question
                         │
                         ▼
                  LLM Generation
                         │
                         ▼
              Answer with Citations
```
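A sketch of the retrieval step using `qdrant-client` and `sentence-transformers`; the embedding model named here is an assumed 768-dimensional choice, not a project decision:
```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim embeddings
qdrant = QdrantClient(host="qdrant", port=6333)

def retrieve(question: str, collection: str = "OS", top_k: int = 5):
    """Embed the question and return the top-k scored chunks with payloads."""
    query_vec = embedder.encode(question).tolist()
    hits = qdrant.search(
        collection_name=collection,
        query_vector=query_vec,
        limit=top_k,
    )
    # Each hit's payload holds the chunk text plus document/page metadata
    return [(hit.score, hit.payload) for hit in hits]
```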
### Request – `POST /rag/query`
```json
{
"question": "Explain the difference between process and thread in operating systems",
"collection": "OS",
"top_k": 5,
"model": "auto"
}
```
### Response – `POST /rag/query`
```json
{
"answer": "A process is an independent program in execution with its own memory space...",
"sources": [
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 3: Processes",
"page": 105,
"relevance_score": 0.92,
"chunk_preview": "A process is a program in execution. A process is more than..."
},
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 4: Threads",
"page": 163,
"relevance_score": 0.89,
"chunk_preview": "A thread is a basic unit of CPU utilization..."
}
],
"model_used": "qwen2.5-14b",
"tokens_used": 847
}
```
---
# Phase 8 – Retrieval + Search
Real-time web grounding via self-hosted SearXNG. Enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.
## 8.1 Search Endpoints – `/search`
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/search/web` | Query SearXNG – aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| `POST` | `/search/wikipedia` | Targeted Wikipedia search with summary extraction |
| `POST` | `/search/grounded` | Search + LLM – retrieves web results, then generates a cited answer |
| `GET` | `/search/cache` | List recently cached search results (reduces redundant fetches) |
### Request – `POST /search/grounded`
```json
{
"query": "What are the latest improvements in vLLM v0.8?",
"max_sources": 5,
"model": "auto"
}
```
### Response – `POST /search/grounded`
```json
{
"answer": "vLLM v0.8 introduced several key improvements including...",
"sources": [
{
"title": "vLLM v0.8.0 Release Notes",
"url": "https://github.com/vllm-project/vllm/releases",
"snippet": "Key features: improved prefix caching, multi-modal support..."
}
],
"model_used": "qwen2.5-14b",
"cached": false
}
```
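A sketch of the grounded-search flow, assuming SearXNG's JSON output format is enabled in its settings; the hostname and prompt shape are illustrative:
```python
import requests

def grounded_answer(query: str, max_sources: int = 5) -> dict:
    """Fetch web results from SearXNG, then build a citation-style LLM prompt."""
    res = requests.get(
        "http://searxng:8888/search",
        params={"q": query, "format": "json"},
        timeout=10,
    ).json()
    sources = [
        {"title": r["title"], "url": r["url"], "snippet": r.get("content", "")}
        for r in res["results"][:max_sources]
    ]
    context = "\n\n".join(
        f"[{i + 1}] {s['title']}\n{s['snippet']}" for i, s in enumerate(sources)
    )
    prompt = (
        f"Answer the question using only these sources, citing them as [n].\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    # `prompt` is then sent through /query/chat (Phase 1) with `sources` attached
    return {"prompt": prompt, "sources": sources}
```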
---
# Database Schema
## Entity-Relationship Overview
```
┌──────────────┐      ┌──────────────┐      ┌──────────────────┐
│    users     │      │   api_keys   │      │   request_logs   │
├──────────────┤      ├──────────────┤      ├──────────────────┤
│ id (PK)      │──┐   │ id (PK)      │      │ id (PK)          │
│ roll_number  │  ├──►│ user_id (FK) │  ┌──►│ user_id (FK)     │
│ name         │  │   │ key_hash     │  │   │ api_key_id (FK)  │
│ department   │  │   │ key_prefix   │  │   │ model            │
│ role         │  │   │ type         │  │   │ endpoint         │
│ password     │  │   │ is_active    │  │   │ tokens_in        │
│ is_active    │  │   │ created_at   │  │   │ tokens_out       │
│ created_at   │  │   │ expires_at   │  │   │ latency_ms       │
│ updated_at   │  │   │ last_used    │  │   │ status_code      │
└──────────────┘  │   └──────────────┘  │   │ created_at       │
                  │                     │   └──────────────────┘
                  └─────────────────────┘

┌──────────────────┐    ┌──────────────────┐
│ quota_overrides  │    │  rag_documents   │
├──────────────────┤    ├──────────────────┤
│ id (PK)          │    │ id (PK)          │
│ user_id (FK)     │    │ title            │
│ daily_tokens     │    │ collection       │
│ requests_per_hr  │    │ file_type        │
│ max_tokens_req   │    │ chunk_count      │
│ set_by (FK)      │    │ uploaded_by (FK) │
│ created_at       │    │ created_at       │
└──────────────────┘    └──────────────────┘
```
## SQL Schema – Key Tables
### users
```sql
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
roll_number VARCHAR(20) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
department VARCHAR(50) NOT NULL,
role VARCHAR(20) NOT NULL DEFAULT 'student'
CHECK (role IN ('student', 'faculty', 'admin')),
hashed_password VARCHAR(255) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
```
### api_keys
```sql
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
key_hash VARCHAR(255) NOT NULL,
key_prefix VARCHAR(20) NOT NULL,
type VARCHAR(10) NOT NULL DEFAULT 'static'
CHECK (type IN ('static', 'refresh')),
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
last_used TIMESTAMPTZ
);
```
### request_logs
```sql
CREATE TABLE request_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
api_key_id UUID REFERENCES api_keys(id),
model VARCHAR(50) NOT NULL,
endpoint VARCHAR(100) NOT NULL,
tokens_in INTEGER NOT NULL DEFAULT 0,
tokens_out INTEGER NOT NULL DEFAULT 0,
latency_ms INTEGER,
status_code SMALLINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at);
CREATE INDEX idx_request_logs_model ON request_logs (model, created_at);
```
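For illustration, the per-model breakdown behind `GET /usage/me` can be a single aggregation over this table (served efficiently by `idx_request_logs_user_date`):
```sql
-- Today's tokens and request count per model for one user (sketch)
SELECT model,
       SUM(tokens_in + tokens_out) AS tokens,
       COUNT(*)                    AS requests
FROM request_logs
WHERE user_id = :user_id
  AND created_at >= date_trunc('day', NOW())
GROUP BY model;
```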
---
# Project Folder Structure
```
mac/
├── api/
│   ├── main.py                # FastAPI application entry point
│   ├── routers/
│   │   ├── auth.py            # /auth endpoints
│   │   ├── explore.py         # /explore endpoints
│   │   ├── query.py           # /query endpoints
│   │   ├── usage.py           # /usage endpoints
│   │   ├── keys.py            # /keys endpoints
│   │   ├── models.py          # /models endpoints
│   │   ├── integration.py     # /integration endpoints
│   │   ├── quota.py           # /quota endpoints
│   │   ├── guardrails.py      # /guardrails endpoints
│   │   ├── rag.py             # /rag endpoints
│   │   └── search.py          # /search endpoints
│   ├── models/
│   │   ├── user.py            # SQLAlchemy User model
│   │   ├── api_key.py         # APIKey model
│   │   ├── request_log.py     # RequestLog model
│   │   └── document.py        # RAG Document model
│   ├── schemas/
│   │   ├── auth.py            # Pydantic request/response schemas
│   │   ├── query.py           # Chat, completions, vision schemas
│   │   ├── usage.py           # Usage response schemas
│   │   └── common.py          # Shared schemas (pagination, errors)
│   ├── core/
│   │   ├── config.py          # Settings from environment variables
│   │   ├── security.py        # JWT creation/validation, password hashing
│   │   ├── rate_limit.py      # Redis-based sliding window rate limiter
│   │   ├── dependencies.py    # FastAPI dependency injection (get_current_user, etc.)
│   │   └── db.py              # Database session factory
│   ├── services/
│   │   ├── model_router.py    # Smart routing logic (auto model selection)
│   │   ├── inference.py       # LiteLLM client for model calls
│   │   ├── usage_tracker.py   # Token counting and logging
│   │   ├── guardrails.py      # Input/output filtering logic
│   │   └── rag.py             # Document ingestion, embedding, retrieval
│   └── requirements.txt
├── litellm/
│   └── config.yaml            # LiteLLM proxy routing configuration
├── nginx/
│   └── nginx.conf             # Reverse proxy configuration
├── frontend/
│   ├── src/
│   │   ├── pages/             # Dashboard, Keys, History, Admin pages
│   │   ├── components/        # Reusable UI components
│   │   └── api/               # API client (Axios/fetch wrappers)
│   ├── package.json
│   └── vite.config.ts
├── scripts/
│   ├── seed_users.py          # Bulk-create students from CSV
│   ├── download_models.py     # Pull models from HuggingFace
│   └── health_check.py        # Verify all services are running
├── docker-compose.yml         # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile                 # FastAPI container
├── .env.example               # Template for environment variables
├── alembic/                   # Database migrations
│   └── versions/
└── README.md
```
---
# Security Design
| Concern | Implementation |
|---|---|
| **Password Storage** | bcrypt with work factor 12; passwords never stored in plaintext |
| **JWT Signing** | RS256 asymmetric keys; access tokens short-lived (24h) |
| **API Key Storage** | Only SHA-256 hash stored in DB; raw key shown once at creation |
| **Transport** | Nginx terminates TLS (HTTPS); internal services communicate over Docker network |
| **Input Validation** | Pydantic schema validation on every request; max payload size enforced |
| **SQL Injection** | SQLAlchemy ORM with parameterised queries throughout |
| **Rate Limiting** | Redis sliding-window per user + per IP; prevents abuse and DoS |
| **CORS** | Strict origin allowlist – only the MAC frontend domain |
| **Admin Endpoints** | Role-based access control; admin-only routes check JWT role claim |
| **Prompt Injection** | Input guardrails detect and block prompt override attempts |
| **Secrets Management** | All secrets in `.env` file; never committed to version control |
---
# Error Response Format
All errors follow a consistent structure:
```json
{
"error": {
"code": "authentication_failed",
"message": "Invalid roll number or password.",
"status": 401,
"timestamp": "2026-04-07T14:30:00Z",
"request_id": "mac-req-a1b2c3"
}
}
```
### Standard Error Codes
| HTTP Status | Code | Description |
|---|---|---|
| 400 | `bad_request` | Malformed request body or invalid parameters |
| 401 | `authentication_failed` | Missing or invalid credentials / API key |
| 403 | `forbidden` | Valid auth but insufficient role permissions |
| 404 | `not_found` | Resource does not exist |
| 409 | `conflict` | Duplicate resource (e.g., user already exists) |
| 422 | `validation_error` | Request schema validation failed |
| 429 | `rate_limit_exceeded` | Request or token quota exceeded |
| 500 | `internal_error` | Unexpected server error |
| 503 | `model_unavailable` | Requested model is not loaded or all workers are busy |
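
A minimal FastAPI exception-handler sketch that emits this envelope; the status-to-code mapping and request-ID scheme shown are illustrative:
```python
import uuid
from datetime import datetime, timezone

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse

app = FastAPI()

ERROR_CODES = {
    400: "bad_request", 401: "authentication_failed", 403: "forbidden",
    404: "not_found", 409: "conflict", 422: "validation_error",
    429: "rate_limit_exceeded", 503: "model_unavailable",
}

@app.exception_handler(HTTPException)
async def http_exception_handler(request: Request, exc: HTTPException):
    return JSONResponse(
        status_code=exc.status_code,
        content={"error": {
            "code": ERROR_CODES.get(exc.status_code, "internal_error"),
            "message": str(exc.detail),
            "status": exc.status_code,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "request_id": f"mac-req-{uuid.uuid4().hex[:6]}",
        }},
    )
```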
---
# Deployment Configuration
## Environment Variables (`.env`)
```env
# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production
# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=<generated-secret>
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30
# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=<generated-secret>
# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333
# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b
```
## Docker Compose Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| `api` | Custom (Dockerfile) | 8000 | FastAPI gateway |
| `db` | postgres:16-alpine | 5432 | User data, logs, keys |
| `redis` | redis:7-alpine | 6379 | Rate limiting, caching |
| `litellm` | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| `qdrant` | qdrant/qdrant | 6333 | Vector database (RAG) |
| `searxng` | searxng/searxng | 8888 | Web search engine |
| `nginx` | nginx:alpine | 80, 443 | Reverse proxy + TLS |
---
# API Versioning & Compatibility
- All endpoints are prefixed with `/api/v1`
- The `/query` endpoints return **OpenAI-compatible JSON** – any OpenAI SDK works by changing `base_url`
- Breaking changes will increment the version (`/api/v2`) while maintaining the previous version for one semester
- List endpoints use cursor-based pagination
### OpenAI SDK Compatibility Example
```python
from openai import OpenAI
client = OpenAI(
base_url="http://mac-server/api/v1",
api_key="mac_sk_live_your_key_here"
)
response = client.chat.completions.create(
model="auto",
messages=[
{"role": "user", "content": "Explain binary search in Python"}
]
)
print(response.choices[0].message.content)
```
---
# Summary
MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.
**Phases 1–4** (API + Models + Integration + Usage Control) establish the complete backend – once deployed, students can start querying AI models from their laptops on day one.
---
*MAC – MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026*