MAC – MBM AI Cloud
API Design & Architecture Plan
Phase 1–8 · Complete Platform Blueprint
Prepared for Professor Review · 07 April 2026
Executive Summary
MAC (MBM AI Cloud) is a fully self-hosted, zero-cloud AI inference platform purpose-built for MBM Engineering College. Students and faculty on the college LAN access state-of-the-art AI models – text generation, code assistance, vision understanding, speech-to-text, and mathematical reasoning – through a standardised REST API authenticated by per-student API keys.
Key design goals:
- Zero cloud cost – all inference runs on college-owned GPUs
- OpenAI-compatible API – students use familiar SDKs; existing tutorials work unchanged
- Scalable from day one – starts on a single PC, scales horizontally to 30+ nodes with no code changes
- Secure by default – JWT authentication, role-based access, rate limiting, input/output guardrails
- Academically useful – RAG over college textbooks, web-grounded search, per-student usage tracking
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| API Gateway | FastAPI (Python 3.11+) | Async, auto-docs (OpenAPI/Swagger), type-hinted validation, one of the fastest Python web frameworks |
| Database | PostgreSQL 16 | ACID-compliant, battle-tested, excellent JSON support |
| Cache / Rate Limiter | Redis 7 | In-memory, sub-ms latency, native rate-limit primitives |
| LLM Inference | vLLM | PagedAttention, continuous batching, highest throughput for local GPUs |
| Model Router / Proxy | LiteLLM | Unified OpenAI-compatible proxy, load balancing, fallback routing |
| Vector Database | Qdrant | Purpose-built for embeddings, filtering, snapshotting |
| Web Search | SearXNG | Self-hosted meta-search, no API keys needed |
| Reverse Proxy | Nginx | TLS termination, request buffering, static file serving |
| Containerisation | Docker + Docker Compose | Reproducible deployments, one-command startup |
| Frontend | React + Tailwind CSS | Component-driven dashboard, responsive, fast |
| Task Queue | Celery + Redis | Background jobs: document ingestion, model downloads |
Architecture Overview
+-------------------------------------------------------------------+
|                           College LAN                             |
|                   Students / Faculty / Lab PCs                     |
+---------------------------------+---------------------------------+
                                  | HTTPS
                                  v
                       +----------------------+
                       |        Nginx         |
                       |    Reverse Proxy     |
                       |    TLS · Caching     |
                       +----------+-----------+
                                  |
                       +----------v-----------+
                       |       FastAPI        |
                       |     API Gateway      |
                       |    Auth · Routing    |
                       |    Rate Limiting     |
                       +---+------+-------+---+
                           |      |       |
            +--------------+      |       +--------------+
            v                     v                      v
    +--------------+     +----------------+      +--------------+
    |  PostgreSQL  |     |    LiteLLM     |      |    Qdrant    |
    |    Users     |     |     Proxy      |      |   VectorDB   |
    |    Logs      |     |    Routing     |      |    (RAG)     |
    |    Keys      |     +-------+--------+      +--------------+
    +--------------+             |
                                 v
                       +----------------------+
                       |         vLLM         |
                       |    Model Workers     |
                       |    GPU Inference     |
                       +----------------------+
Scaling model: The entire stack is containerised. To scale from 1 PC to N PCs, new vLLM worker containers are added and registered in the LiteLLM config. The FastAPI gateway, database, and proxy remain centralised on the primary node. Zero code changes required.
Build Phases – 8 Phases, Sequential
| # | Phase | Description | Dependencies |
|---|---|---|---|
| 1 | API Endpoints | Core REST API – explore, query, usage, auth | None |
| 2 | LLM Models | Select, download, feasibility-check, deploy 5 models | Phase 1 |
| 3 | API–Model Integration | Wire every model to its dedicated endpoint via LiteLLM + vLLM | Phase 1, 2 |
| 4 | API Usage Control | Rate limiting, token accounting, static + refresh API keys | Phase 1 |
| 5 | Web Interface | Dashboard, user management, admin panel, model access controls | Phase 1, 4 |
| 6 | Guardrails | Input + output content filtering, safety checks | Phase 3 |
| 7 | Knowledgebase + RAG | Vector DB, document ingestion, retrieval-augmented generation | Phase 3 |
| 8 | Retrieval + Search | SearXNG web search, Wikipedia, real-time grounded answers | Phase 3, 7 |
Base URL
http://<server-ip>/api/v1
The server IP is configured via the MAC_HOST environment variable. No IP addresses are hardcoded anywhere in the codebase. For production, the Nginx reverse proxy terminates TLS and forwards to the FastAPI gateway.
Phase 1 – API Endpoints
1.1 Authentication – /auth
Handles user login, session management, and JWT token lifecycle. All passwords are hashed with bcrypt (work factor 12). Tokens use RS256 JWT signing.
| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/login | Roll number + password → JWT access token + refresh token |
| POST | /auth/logout | Invalidate current session / revoke refresh token |
| POST | /auth/refresh | Exchange refresh token for a new access token |
| GET | /auth/me | Current user's profile, role, department, and API key |
| POST | /auth/change-password | Change password (requires current password verification) |
Request – POST /auth/login
{
"roll_number": "21CS045",
"password": "secure_password"
}
| Field | Type | Required | Description |
|---|---|---|---|
| roll_number | string | Yes | Student/faculty roll number, e.g. 21CS045 |
| password | string | Yes | Account password (min 8 chars) |
Response – POST /auth/login
{
"access_token": "eyJhbG...",
"refresh_token": "dGhpcyBp...",
"token_type": "bearer",
"expires_in": 86400,
"user": {
"roll_number": "21CS045",
"name": "Aryan Sharma",
"department": "CSE",
"role": "student",
"api_key": "mac_sk_live_abc123..."
}
}
| Field | Description |
|---|---|
| access_token | JWT, 24-hour expiry, used for all authenticated requests |
| refresh_token | 30-day expiry, used only at /auth/refresh |
| user.role | One of: student, faculty, admin |
| user.api_key | Personal API key for direct model queries |
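For reference, a minimal client-side sketch of this login flow using the requests library; the host is a placeholder (matching the SDK example later in this document) and the field access follows the request/response examples above.
import requests

BASE_URL = "http://mac-server/api/v1"  # placeholder host; set from your deployment

# Exchange roll number + password for JWT tokens (shapes as in the examples above)
resp = requests.post(
    f"{BASE_URL}/auth/login",
    json={"roll_number": "21CS045", "password": "secure_password"},
    timeout=10,
)
resp.raise_for_status()
tokens = resp.json()
print(tokens["user"]["name"], tokens["user"]["role"])

# The access token goes in the Authorization header of subsequent calls
headers = {"Authorization": f"Bearer {tokens['access_token']}"}
profile = requests.get(f"{BASE_URL}/auth/me", headers=headers, timeout=10)
print(profile.status_code)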
1.2 Explore – /explore
Read-only discovery endpoints. Allow students to see what models and capabilities are available before writing code.
| Method | Endpoint | Description |
|---|---|---|
| GET | /explore/models | List all deployed models with capabilities, context length, and status |
| GET | /explore/models/{model_id} | Detailed info – parameters, benchmarks, example prompts |
| GET | /explore/models/search | Search by capability tag: ?tag=vision, ?tag=code, ?tag=math |
| GET | /explore/endpoints | List every API endpoint with method, path, auth requirement, description |
| GET | /explore/health | Platform health – node status, GPU temperatures, inference queue depth |
| GET | /explore/usage-stats | Aggregated platform analytics (admin-only) – tokens/day, active users |
Response – GET /explore/models
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"specialty": "Code generation, debugging, explanation",
"parameters": "7B",
"context_length": 32768,
"status": "loaded",
"capabilities": ["code", "chat", "completion"]
}
],
"total": 5
}
Response – GET /explore/health
{
"status": "healthy",
"uptime_seconds": 345600,
"nodes": [
{
"ip": "192.168.1.101",
"gpu": "NVIDIA RTX 3060 12GB",
"gpu_temp_c": 62,
"vram_used_gb": 9.2,
"vram_total_gb": 12.0,
"requests_in_flight": 3,
"status": "active"
}
],
"queue_depth": 7
}
1.3 Query – /query
The core inference API. All endpoints accept an Authorization: Bearer <api_key> header. Responses follow the OpenAI Chat Completions format – existing OpenAI SDK code works with zero changes after swapping base_url.
| Method | Endpoint | Description |
|---|---|---|
| POST | /query/chat | Chat completion – multi-turn conversation (text/code/math) |
| POST | /query/completions | Raw text completion (OpenAI-compatible) |
| POST | /query/vision | Image + text → answer (vision model) |
| POST | /query/speech-to-text | Upload audio → transcribed text (Whisper) |
| POST | /query/text-to-speech | Text → audio file download (TTS) |
| POST | /query/embeddings | Text → vector embedding (for RAG / similarity search) |
| POST | /query/rerank | Re-rank passages by relevance to a query |
Request – POST /query/chat
{
"model": "auto",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to reverse a linked list."}
],
"temperature": 0.7,
"max_tokens": 2048,
"stream": false
}
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID or "auto" for smart routing |
| messages | array | Yes | Array of {role, content} – system / user / assistant |
| temperature | float | No | Sampling temperature 0.0–2.0 (default 0.7) |
| max_tokens | integer | No | Max tokens to generate (default 2048, max 4096) |
| stream | boolean | No | Stream tokens as Server-Sent Events (default false) |
| context_id | string | No | Maintain conversation state server-side |
Response – POST /query/chat
{
"id": "mac-chat-a1b2c3d4",
"object": "chat.completion",
"created": 1743984000,
"model": "qwen2.5-coder-7b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Here's a Python function to reverse a linked list..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 187,
"total_tokens": 229
}
}
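Since responses follow the OpenAI format, streamed output (stream: true) can be consumed with the standard OpenAI SDK. A hedged sketch, assuming the gateway's Server-Sent Events match the OpenAI chunk format; the base_url is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://mac-server/api/v1", api_key="mac_sk_live_your_key_here")

# stream=True asks the gateway for Server-Sent Events; each chunk carries an
# incremental delta in the OpenAI chunk format
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Write a Python function to reverse a linked list."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)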
Smart Routing (model: "auto")
When students set model to "auto", the API gateway inspects the request and routes to the optimal model:
| Signal Detected | Routed To |
|---|---|
| Code keywords, programming language names, write a function, debug this | qwen2.5-coder-7b |
| Math expressions, equations, solve, prove, step by step | deepseek-r1-8b |
| Image attachment in request body | llava-1.6-7b |
| Audio file upload | whisper-large-v3 |
| General text, summarisation, writing, Q&A | qwen2.5-14b |
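A simplified sketch of how such keyword-based routing could work, mirroring the signals in the table above. The function and rule names are illustrative; the actual rules are configurable via /integration/routing-rules (Phase 3).
import re

# Illustrative rules mirroring the table above (real rules live in the routing config)
ROUTING_RULES = [
    ("code", re.compile(r"\b(function|debug|code|python|javascript)\b", re.I), "qwen2.5-coder-7b"),
    ("math", re.compile(r"\b(solve|equation|prove|integral)\b", re.I), "deepseek-r1-8b"),
]
DEFAULT_MODEL = "qwen2.5-14b"

def route(messages: list[dict], has_image: bool = False, has_audio: bool = False) -> str:
    """Pick a model for a request sent with model='auto'."""
    if has_image:
        return "llava-1.6-7b"
    if has_audio:
        return "whisper-large-v3"
    text = " ".join(m.get("content", "") for m in messages if isinstance(m.get("content"), str))
    for _task, pattern, model in ROUTING_RULES:
        if pattern.search(text):
            return model
    return DEFAULT_MODEL

# Example: a code-related prompt routes to the coder model
print(route([{"role": "user", "content": "Debug this Python function"}]))  # qwen2.5-coder-7b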
1.4 Usage – /usage
Per-student consumption tracking. Every API call logs token counts, model used, and timestamp.
| Method | Endpoint | Description |
|---|---|---|
| GET | /usage/me | My tokens used – today, this week, this month, broken down by model |
| GET | /usage/me/history | Full request history – timestamps, models, token counts, latency |
| GET | /usage/me/quota | My current quota limits and remaining balance |
| GET | /usage/admin/all | All users' usage summary (admin-only) |
| GET | /usage/admin/user/{roll_number} | Specific student's full usage details (admin-only) |
| GET | /usage/admin/models | Per-model usage statistics across the platform (admin-only) |
Response – GET /usage/me
{
"roll_number": "21CS045",
"usage": {
"today": {
"total_tokens": 12450,
"requests": 23,
"by_model": {
"qwen2.5-coder-7b": {"tokens": 8200, "requests": 15},
"qwen2.5-14b": {"tokens": 4250, "requests": 8}
}
},
"this_week": {"total_tokens": 67800, "requests": 142},
"this_month": {"total_tokens": 234500, "requests": 487}
},
"quota": {
"daily_limit": 50000,
"remaining_today": 37550
}
}
Phase 2 – LLM Models
2.1 Model Selection
Five best-in-class open-source models, each a domain specialist. All models are quantised to fit within consumer GPU VRAM.
| Model ID | Name | Specialty | Parameters | VRAM Required | Quantisation |
|---|---|---|---|---|---|
| qwen2.5-coder-7b | Qwen2.5-Coder 7B | Code generation, debugging, explanation | 7B | ~5 GB | GPTQ-Int4 |
| deepseek-r1-8b | DeepSeek-R1 8B | Maths, reasoning, step-by-step logic | 8B | ~6 GB | AWQ-Int4 |
| llava-1.6-7b | LLaVA 1.6 7B | Image understanding, visual Q&A | 7B | ~8 GB | FP16 |
| whisper-large-v3 | Whisper Large v3 | Speech-to-text, transcription | 1.5B | ~3 GB | FP16 |
| qwen2.5-14b | Qwen2.5 14B | General chat, summarisation, writing | 14B | ~10 GB | GPTQ-Int4 |
Total VRAM for all 5 models: ~32 GB (fits on dual-GPU setups or loads models on demand)
Feasibility on Single PC
For a single-PC deployment with one GPU (e.g., RTX 3060 12GB or RTX 4070 16GB):
- Models are loaded/unloaded on demand – only the requested model occupies VRAM at inference time
- A model loading queue ensures graceful transitions
- Frequently-used models remain resident; idle models are evicted after a configurable timeout (a minimal eviction sketch follows this list)
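A minimal sketch of the idle-eviction idea, assuming a configurable timeout; the timeout value and the unload helper are illustrative placeholders, not the actual worker implementation.
import time

IDLE_TIMEOUT_S = 900  # assumption: evict models unused for 15 minutes
last_used: dict[str, float] = {}  # model_id -> timestamp of last request

def touch(model_id: str) -> None:
    """Record that a model just served a request."""
    last_used[model_id] = time.monotonic()

def evict_idle(unload_fn) -> None:
    """Unload any model idle for longer than the timeout.

    unload_fn is a placeholder for the worker's unload call
    (i.e. whatever backs POST /models/{model_id}/unload).
    """
    now = time.monotonic()
    for model_id, ts in list(last_used.items()):
        if now - ts > IDLE_TIMEOUT_S:
            unload_fn(model_id)
            del last_used[model_id]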
2.2 Model Management Endpoints – /models
| Method | Endpoint | Description |
|---|---|---|
| GET | /models | List all models with status: loaded / downloading / queued / offline |
| GET | /models/{model_id} | Single model details – context length, capabilities, benchmark scores |
| POST | /models/{model_id}/load | Load model into GPU memory (admin-only) |
| POST | /models/{model_id}/unload | Unload model from VRAM to free resources (admin-only) |
| GET | /models/{model_id}/health | Ping model – returns latency, memory usage, ready status |
| POST | /models/download | Download a model from HuggingFace by ID (admin-only) |
| GET | /models/download/{task_id} | Check download progress for a model download task |
Response – GET /models
{
"models": [
{
"id": "qwen2.5-coder-7b",
"name": "Qwen2.5-Coder 7B",
"status": "loaded",
"vram_mb": 5120,
"context_length": 32768,
"capabilities": ["code", "chat"],
"loaded_at": "2026-04-07T08:30:00Z"
},
{
"id": "deepseek-r1-8b",
"name": "DeepSeek-R1 8B",
"status": "offline",
"vram_mb": 6144,
"context_length": 65536,
"capabilities": ["reasoning", "math", "chat"],
"loaded_at": null
}
]
}
Phase 3 – API–Model Integration
3.1 Architecture
LiteLLM Proxy sits between the FastAPI gateway and vLLM workers. It:
- Translates every /query request into the correct vLLM inference call
- Handles smart routing (auto-model selection)
- Load-balances across multiple workers (when scaled)
- Retries on worker failure with exponential backoff
- Returns OpenAI-compatible JSON responses
FastAPI Gateway ---> LiteLLM Proxy ---> vLLM Worker(s)
                     (routing +         (GPU inference)
                      load balance)
3.2 Integration Endpoints – /integration
| Method | Endpoint | Description |
|---|---|---|
| GET | /integration/routing-rules | Show current routing rules (which task type → which model) |
| PUT | /integration/routing-rules | Update routing rules (admin-only) |
| GET | /integration/workers | List all vLLM worker nodes and their current load |
| GET | /integration/workers/{node_id} | Single worker – GPU temp, VRAM used, requests in flight |
| POST | /integration/workers/{node_id}/drain | Mark worker as draining – no new requests routed (admin-only) |
| GET | /integration/queue | Current global inference queue depth across all workers |
Response – GET /integration/routing-rules
{
"rules": [
{"task": "code", "keywords": ["function", "debug", "code", "python", "javascript"], "model": "qwen2.5-coder-7b", "priority": 1},
{"task": "math", "keywords": ["solve", "equation", "prove", "integral"], "model": "deepseek-r1-8b", "priority": 2},
{"task": "vision", "trigger": "image_attachment", "model": "llava-1.6-7b", "priority": 3},
{"task": "speech", "trigger": "audio_upload", "model": "whisper-large-v3", "priority": 4},
{"task": "general", "trigger": "default", "model": "qwen2.5-14b", "priority": 99}
]
}
3.3 Scaling Strategy
| Deployment | Configuration |
|---|---|
| Single PC | One vLLM process, models loaded/swapped on demand |
| 2–5 PCs | Each PC runs a vLLM worker with 1–2 dedicated models; LiteLLM routes by model |
| 6–30 PCs | Multiple workers per model for redundancy; least-busy routing; auto-failover |
All worker addresses are stored in configuration (environment variables / config file) – no IPs are hardcoded. Adding a new node requires only updating the LiteLLM config and restarting the proxy.
Phase 4 – API Usage Control
4.1 API Key Management – /keys
Every student receives a unique API key upon account creation. Two key types are supported:
| Key Type | Behaviour |
|---|---|
| Static | Never expires. Manually rotated by the student or admin. Ideal for quick experiments. |
| Refresh | Auto-rotates every 30 days. Old key has a 48-hour grace period. Ideal for long-running scripts. |
Key format: mac_sk_live_<32-char-random-hex>
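A sketch of key generation consistent with the format above and the security table (only a SHA-256 hash is persisted; the raw key is shown once at creation). The helper names are illustrative, not the actual service code.
import hashlib
import secrets

def generate_api_key() -> tuple[str, str, str]:
    """Return (raw_key, key_prefix, key_hash) for a new key.

    Only key_prefix and key_hash are stored; the raw key is displayed
    to the student once at creation time.
    """
    raw_key = "mac_sk_live_" + secrets.token_hex(16)  # 32 random hex chars
    key_prefix = raw_key[:20]                          # used for masked display
    key_hash = hashlib.sha256(raw_key.encode()).hexdigest()
    return raw_key, key_prefix, key_hash

def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of a presented key against the stored hash."""
    return secrets.compare_digest(hashlib.sha256(presented.encode()).hexdigest(), stored_hash)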
| Method | Endpoint | Description |
|---|---|---|
| GET | /keys/my-key | Get current API key (partially masked) and metadata |
| POST | /keys/generate | Generate a new API key (invalidates the previous one) |
| GET | /keys/my-key/stats | Tokens consumed against this key – today, week, month |
| DELETE | /keys/my-key | Revoke current key permanently (must re-generate) |
| GET | /keys/admin/all | List all student API keys and status (admin-only) |
| POST | /keys/admin/revoke | Force-revoke a specific student's API key (admin-only) |
Response – GET /keys/my-key
{
"key": "mac_sk_live_a1b2...****",
"type": "refresh",
"created_at": "2026-03-01T10:00:00Z",
"expires_at": "2026-03-31T10:00:00Z",
"last_used": "2026-04-07T14:23:00Z",
"status": "active",
"total_requests": 487
}
4.2 Rate Limiting & Quotas – /quota
Rate limits are enforced at the Redis layer using a sliding-window algorithm. Limits are per-role and can be overridden per-user.
| Role | Daily Token Limit | Requests/Hour | Max Tokens/Request |
|---|---|---|---|
| Student | 50,000 | 100 | 4,096 |
| Faculty | 200,000 | 500 | 8,192 |
| Admin | Unlimited | Unlimited | 16,384 |
| Method | Endpoint | Description |
|---|---|---|
| GET | /quota/limits | Show default quota limits per role |
| GET | /quota/me | My personal limits and current consumption |
| PUT | /quota/admin/user/{roll_number} | Override quota for a specific user (admin-only) |
| GET | /quota/admin/exceeded | List users who have exceeded their daily quota (admin-only) |
Response – GET /quota/me
{
"role": "student",
"limits": {
"daily_tokens": 50000,
"requests_per_hour": 100,
"max_tokens_per_request": 4096
},
"current": {
"tokens_used_today": 12450,
"requests_this_hour": 8,
"remaining_tokens": 37550,
"resets_at": "2026-04-08T00:00:00Z"
}
}
Rate Limit Response Headers
Every API response includes:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 92
X-RateLimit-Reset: 1743987600
X-TokenLimit-Limit: 50000
X-TokenLimit-Remaining: 37550
X-TokenLimit-Reset: 1744070400
When a limit is exceeded, the API returns:
HTTP 429 Too Many Requests
{
"error": {
"code": "rate_limit_exceeded",
"message": "You have exceeded your hourly request limit. Try again in 847 seconds.",
"retry_after": 847
}
}
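A minimal sketch of the sliding-window check described at the start of this section, using redis-py sorted sets. Key naming and defaults are illustrative; the production limiter lives in api/core/rate_limit.py.
import time
import uuid

import redis

r = redis.Redis(host="redis", port=6379)

def allow_request(user_id: str, limit: int = 100, window_s: int = 3600) -> bool:
    """Sliding-window log: one sorted-set member per request, scored by timestamp."""
    now = time.time()
    key = f"ratelimit:{user_id}"
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # drop requests that left the window
    pipe.zadd(key, {str(uuid.uuid4()): now})       # record this request
    pipe.zcard(key)                                # count requests still in the window
    pipe.expire(key, window_s)
    _, _, count, _ = pipe.execute()
    # retry_after (as in the 429 body above) can be derived from the age of the
    # oldest remaining member: window_s - (now - oldest_score)
    return count <= limit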
Phase 5 – Web Interface
A React-based single-page application served by the FastAPI backend. Three views: Student Dashboard, Key Management, and Admin Panel.
5.1 Web Interface Endpoints – /ui
| Method | Endpoint | Description |
|---|---|---|
| GET | /ui/dashboard | Student home – usage chart, quick-start code snippets, model status cards |
| GET | /ui/keys | Key management – view, copy, regenerate API key |
| GET | /ui/history | Request history table – timestamp, model, tokens, latency, status |
| GET | /ui/playground | Interactive chat playground – test models directly in the browser |
5.2 Admin Panel – /ui/admin
| Method | Endpoint | Description |
|---|---|---|
| GET | /ui/admin/users | Full user list with roles, quotas, last active, status |
| POST | /ui/admin/users/create | Create user / bulk-create from CSV (roll no., name, department) |
| PUT | /ui/admin/users/{roll_number} | Edit user – change role, quota overrides, model access |
| DELETE | /ui/admin/users/{roll_number} | Deactivate a user account |
| GET | /ui/admin/models | Model management – load/unload, see which node serves what |
| GET | /ui/admin/logs | Live request logs – error rates, latency percentiles, throughput |
| GET | /ui/admin/analytics | Platform analytics – daily active users, peak hours, top models |
Dashboard Wireframe
+----------------------------------------------------------+
|  MAC – MBM AI Cloud                         [Profile v]   |
+----------------------------------------------------------+
|                                                           |
|  +---------+   +---------+   +---------+                  |
|  | Tokens  |   |Requests |   |  Quota  |                  |
|  |  Today  |   |  Today  |   |Remaining|                  |
|  | 12,450  |   |   23    |   |  75.1%  |                  |
|  +---------+   +---------+   +---------+                  |
|                                                           |
|  Models Available         Quick Start                     |
|  +------------------+     +----------------------+        |
|  | * Qwen2.5-Coder  |     | from openai import   |        |
|  | * DeepSeek-R1    |     |     OpenAI           |        |
|  | * Qwen2.5 14B    |     | client = OpenAI(     |        |
|  | ~ LLaVA (load.)  |     |   base_url=          |        |
|  | * Whisper        |     |     "http://mac/v1", |        |
|  +------------------+     |   api_key="mac_sk_"  |        |
|                           | )                    |        |
|  Usage This Week          +----------------------+        |
|  +------------------+                                     |
|  |     (chart)      |                                     |
|  +------------------+                                     |
+----------------------------------------------------------+
Phase 6 – Guardrails
Input and output content filtering to ensure safe, appropriate use of AI models within an academic environment.
6.1 Guardrails Endpoints – /guardrails
| Method | Endpoint | Description |
|---|---|---|
| POST | /guardrails/check-input | Run input text through content filter before sending to model |
| POST | /guardrails/check-output | Run model output through safety filter before returning to user |
| GET | /guardrails/rules | List active guardrail rules (admin-only) |
| PUT | /guardrails/rules | Update rules – blocked categories, max prompt length (admin-only) |
6.2 Filtering Pipeline
User Input ---> Input Filter ---> Model Inference ---> Output Filter ---> Response
                     |                                       |
                     +- Prompt injection detection           +- PII/sensitive data redaction
                     +- Blocked topic detection              +- Harmful content detection
                     +- Max prompt length enforcement        +- Hallucination disclaimer
                     +- Academic integrity checks            +- Source attribution
Guardrail Categories
| Category | Action | Description |
|---|---|---|
| Prompt Injection | Block + log | Detect attempts to override system prompts |
| Harmful Content | Block | Violence, self-harm, illegal activities |
| Academic Dishonesty | Flag + disclaimer | Full essay/assignment generation adds academic integrity notice |
| PII in Output | Redact | Strip emails, phone numbers, addresses from model output |
| Max Prompt Length | Reject | Configurable per-role; prevents resource abuse |
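A small sketch of the output-side PII redaction step from the table above. The regex patterns are illustrative and would need tuning; real rules are configurable via PUT /guardrails/rules.
import re

# Illustrative patterns only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact_pii(text: str) -> str:
    """Strip emails and phone numbers from model output before returning it."""
    text = EMAIL_RE.sub("[redacted email]", text)
    text = PHONE_RE.sub("[redacted phone]", text)
    return text

print(redact_pii("Contact me at jane@example.com or +91 98765 43210"))
# Contact me at [redacted email] or [redacted phone]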
Request – POST /guardrails/check-input
{
"text": "User input to check",
"context": "chat"
}
Response
{
"safe": true,
"flags": [],
"modified_text": null
}
Phase 7 – Knowledgebase + RAG
Retrieval-Augmented Generation over college textbooks, course materials, and reference documents. Students can ask questions grounded in actual course content.
7.1 RAG Endpoints – /rag
| Method | Endpoint | Description |
|---|---|---|
| POST | /rag/ingest | Upload PDF / DOCX / TXT → chunk, embed, store in Qdrant vector DB |
| GET | /rag/documents | List all ingested documents with metadata (title, pages, chunk count) |
| GET | /rag/documents/{id} | Single document details and its chunks |
| DELETE | /rag/documents/{id} | Remove a document from the knowledgebase (admin-only) |
| POST | /rag/query | Ask a question → retrieves top-k relevant chunks → sends to LLM with context |
| GET | /rag/query/{query_id}/sources | Get source citations for a RAG response |
| POST | /rag/collections | Create a named collection (e.g., "DSA", "DBMS", "OS") (admin-only) |
| GET | /rag/collections | List all collections |
7.2 RAG Pipeline
Document Upload ---> Chunking (512 tokens, 50 overlap)
                            |
                            v
                    Embedding (768-dim)
                            |
                            v
                     Qdrant Vector DB
                            |
User Question ---> Embed Query ---> Similarity Search (top-k=5)
                            |
                            v
                Retrieved Chunks + Question
                            |
                            v
                      LLM Generation
                            |
                            v
                  Answer with Citations
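A condensed sketch of the ingest-and-retrieve path against Qdrant using the qdrant-client library. The collection name, embedding callable, and payload fields are illustrative; the real logic lives in api/services/rag.py.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="qdrant", port=6333)

# One-time collection setup: 768-dim vectors, cosine similarity
client.create_collection(
    collection_name="OS",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

def ingest(chunks: list[str], embed) -> None:
    """Embed each chunk and upsert it with its text as payload.

    `embed` is a placeholder for the embedding model call
    (e.g. whatever backs POST /query/embeddings).
    """
    points = [
        PointStruct(id=i, vector=embed(text), payload={"text": text})
        for i, text in enumerate(chunks)
    ]
    client.upsert(collection_name="OS", points=points)

def retrieve(question: str, embed, top_k: int = 5) -> list[str]:
    """Return the top-k chunk texts most similar to the question."""
    hits = client.search(collection_name="OS", query_vector=embed(question), limit=top_k)
    return [hit.payload["text"] for hit in hits]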
Request – POST /rag/query
{
"question": "Explain the difference between process and thread in operating systems",
"collection": "OS",
"top_k": 5,
"model": "auto"
}
Response – POST /rag/query
{
"answer": "A process is an independent program in execution with its own memory space...",
"sources": [
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 3: Processes",
"page": 105,
"relevance_score": 0.92,
"chunk_preview": "A process is a program in execution. A process is more than..."
},
{
"document": "Galvin - Operating System Concepts, 10th Ed",
"chapter": "Chapter 4: Threads",
"page": 163,
"relevance_score": 0.89,
"chunk_preview": "A thread is a basic unit of CPU utilization..."
}
],
"model_used": "qwen2.5-14b",
"tokens_used": 847
}
Phase 8 – Retrieval + Search
Real-time web grounding via self-hosted SearXNG. Enables the platform to answer questions about current events, recent research, and topics not covered in the knowledgebase.
8.1 Search Endpoints – /search
| Method | Endpoint | Description |
|---|---|---|
| POST | /search/web | Query SearXNG – aggregates results from Google, Bing, DuckDuckGo, Wikipedia |
| POST | /search/wikipedia | Targeted Wikipedia search with summary extraction |
| POST | /search/grounded | Search + LLM – retrieves web results, then generates a cited answer |
| GET | /search/cache | List recently cached search results (reduces redundant fetches) |
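A sketch of the underlying call that /search/web could make to the self-hosted SearXNG instance. This assumes SearXNG's JSON output format is enabled in its settings; the host/port follow the Docker Compose table later in this document, and the returned field names are the commonly used ones.
import requests

def search_web(query: str, max_results: int = 5) -> list[dict]:
    """Query the self-hosted SearXNG instance and return title/url/snippet dicts."""
    resp = requests.get(
        "http://searxng:8888/search",
        params={"q": query, "format": "json"},  # JSON format must be enabled in SearXNG settings
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])[:max_results]
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("content")}
        for r in results
    ]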
Request – POST /search/grounded
{
"query": "What are the latest improvements in vLLM v0.8?",
"max_sources": 5,
"model": "auto"
}
Response – POST /search/grounded
{
"answer": "vLLM v0.8 introduced several key improvements including...",
"sources": [
{
"title": "vLLM v0.8.0 Release Notes",
"url": "https://github.com/vllm-project/vllm/releases",
"snippet": "Key features: improved prefix caching, multi-modal support..."
}
],
"model_used": "qwen2.5-14b",
"cached": false
}
Database Schema
Entity-Relationship Overview
+---------------+       +---------------+       +-------------------+
| users         |       | api_keys      |       | request_logs      |
+---------------+       +---------------+       +-------------------+
| id (PK)       |--+    | id (PK)       |--+    | id (PK)           |
| roll_number   |  +--->| user_id (FK)  |  |    | user_id (FK)      |<--+
| name          |  |    | key_hash      |  +--->| api_key_id (FK)   |   |
| department    |  |    | key_prefix    |       | model             |   |
| role          |  |    | type          |       | endpoint          |   |
| password      |  |    | is_active     |       | tokens_in         |   |
| is_active     |  |    | created_at    |       | tokens_out        |   |
| created_at    |  |    | expires_at    |       | latency_ms        |   |
| updated_at    |  |    | last_used     |       | status_code       |   |
+---------------+  |    +---------------+       | created_at        |   |
                   |                            +-------------------+   |
                   +-----------------------------------------------------+

+--------------------+       +-------------------+
| quota_overrides    |       | rag_documents     |
+--------------------+       +-------------------+
| id (PK)            |       | id (PK)           |
| user_id (FK)       |       | title             |
| daily_tokens       |       | collection        |
| requests_per_hr    |       | file_type         |
| max_tokens_req     |       | chunk_count       |
| set_by (FK)        |       | uploaded_by (FK)  |
| created_at         |       | created_at        |
+--------------------+       +-------------------+
SQL Schema – Key Tables
users
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
roll_number VARCHAR(20) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
department VARCHAR(50) NOT NULL,
role VARCHAR(20) NOT NULL DEFAULT 'student'
CHECK (role IN ('student', 'faculty', 'admin')),
hashed_password VARCHAR(255) NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
api_keys
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
key_hash VARCHAR(255) NOT NULL,
key_prefix VARCHAR(20) NOT NULL,
type VARCHAR(10) NOT NULL DEFAULT 'static'
CHECK (type IN ('static', 'refresh')),
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
last_used TIMESTAMPTZ
);
request_logs
CREATE TABLE request_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
api_key_id UUID REFERENCES api_keys(id),
model VARCHAR(50) NOT NULL,
endpoint VARCHAR(100) NOT NULL,
tokens_in INTEGER NOT NULL DEFAULT 0,
tokens_out INTEGER NOT NULL DEFAULT 0,
latency_ms INTEGER,
status_code SMALLINT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_request_logs_user_date ON request_logs (user_id, created_at);
CREATE INDEX idx_request_logs_model ON request_logs (model, created_at);
Project Folder Structure
mac/
├── api/
│   ├── main.py                # FastAPI application entry point
│   ├── routers/
│   │   ├── auth.py            # /auth endpoints
│   │   ├── explore.py         # /explore endpoints
│   │   ├── query.py           # /query endpoints
│   │   ├── usage.py           # /usage endpoints
│   │   ├── keys.py            # /keys endpoints
│   │   ├── models.py          # /models endpoints
│   │   ├── integration.py     # /integration endpoints
│   │   ├── quota.py           # /quota endpoints
│   │   ├── guardrails.py      # /guardrails endpoints
│   │   ├── rag.py             # /rag endpoints
│   │   └── search.py          # /search endpoints
│   ├── models/
│   │   ├── user.py            # SQLAlchemy User model
│   │   ├── api_key.py         # APIKey model
│   │   ├── request_log.py     # RequestLog model
│   │   └── document.py        # RAG Document model
│   ├── schemas/
│   │   ├── auth.py            # Pydantic request/response schemas
│   │   ├── query.py           # Chat, completions, vision schemas
│   │   ├── usage.py           # Usage response schemas
│   │   └── common.py          # Shared schemas (pagination, errors)
│   ├── core/
│   │   ├── config.py          # Settings from environment variables
│   │   ├── security.py        # JWT creation/validation, password hashing
│   │   ├── rate_limit.py      # Redis-based sliding window rate limiter
│   │   ├── dependencies.py    # FastAPI dependency injection (get_current_user, etc.)
│   │   └── db.py              # Database session factory
│   ├── services/
│   │   ├── model_router.py    # Smart routing logic (auto model selection)
│   │   ├── inference.py       # LiteLLM client for model calls
│   │   ├── usage_tracker.py   # Token counting and logging
│   │   ├── guardrails.py      # Input/output filtering logic
│   │   └── rag.py             # Document ingestion, embedding, retrieval
│   └── requirements.txt
├── litellm/
│   └── config.yaml            # LiteLLM proxy routing configuration
├── nginx/
│   └── nginx.conf             # Reverse proxy configuration
├── frontend/
│   ├── src/
│   │   ├── pages/             # Dashboard, Keys, History, Admin pages
│   │   ├── components/        # Reusable UI components
│   │   └── api/               # API client (Axios/fetch wrappers)
│   ├── package.json
│   └── vite.config.ts
├── scripts/
│   ├── seed_users.py          # Bulk-create students from CSV
│   ├── download_models.py     # Pull models from HuggingFace
│   └── health_check.py        # Verify all services are running
├── docker-compose.yml         # Full stack: api, db, redis, litellm, nginx, qdrant
├── Dockerfile                 # FastAPI container
├── .env.example               # Template for environment variables
├── alembic/                   # Database migrations
│   └── versions/
└── README.md
Security Design
| Concern | Implementation |
|---|---|
| Password Storage | bcrypt with work factor 12; passwords never stored in plaintext |
| JWT Signing | RS256 asymmetric keys; access tokens short-lived (24h) |
| API Key Storage | Only SHA-256 hash stored in DB; raw key shown once at creation |
| Transport | Nginx terminates TLS (HTTPS); internal services communicate over Docker network |
| Input Validation | Pydantic schema validation on every request; max payload size enforced |
| SQL Injection | SQLAlchemy ORM with parameterised queries throughout |
| Rate Limiting | Redis sliding-window per user + per IP; prevents abuse and DoS |
| CORS | Strict origin allowlist β only the MAC frontend domain |
| Admin Endpoints | Role-based access control; admin-only routes check JWT role claim |
| Prompt Injection | Input guardrails detect and block prompt override attempts |
| Secrets Management | All secrets in .env file; never committed to version control |
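A sketch of how admin-only routes could enforce the JWT role claim via a FastAPI dependency. get_current_user is the dependency listed under api/core/dependencies.py in the folder structure; its import path and return shape here are assumptions.
from fastapi import Depends, HTTPException, status

from api.core.dependencies import get_current_user  # assumed import path

def require_role(*allowed_roles: str):
    """Dependency factory: reject requests whose JWT role claim is not allowed."""
    def checker(user=Depends(get_current_user)):
        if user.role not in allowed_roles:
            raise HTTPException(
                status_code=status.HTTP_403_FORBIDDEN,
                detail={"code": "forbidden", "message": "Insufficient role permissions."},
            )
        return user
    return checker

# Usage on an admin-only route:
# @router.get("/usage/admin/all", dependencies=[Depends(require_role("admin"))])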
Error Response Format
All errors follow a consistent structure:
{
"error": {
"code": "authentication_failed",
"message": "Invalid roll number or password.",
"status": 401,
"timestamp": "2026-04-07T14:30:00Z",
"request_id": "mac-req-a1b2c3"
}
}
Standard Error Codes
| HTTP Status | Code | Description |
|---|---|---|
| 400 | bad_request | Malformed request body or invalid parameters |
| 401 | authentication_failed | Missing or invalid credentials / API key |
| 403 | forbidden | Valid auth but insufficient role permissions |
| 404 | not_found | Resource does not exist |
| 409 | conflict | Duplicate resource (e.g., user already exists) |
| 422 | validation_error | Request schema validation failed |
| 429 | rate_limit_exceeded | Request or token quota exceeded |
| 500 | internal_error | Unexpected server error |
| 503 | model_unavailable | Requested model is not loaded or all workers are busy |
Deployment Configuration
Environment Variables (.env)
# Server
MAC_HOST=0.0.0.0
MAC_PORT=8000
MAC_ENV=production
# Database
POSTGRES_HOST=db
POSTGRES_PORT=5432
POSTGRES_DB=mac
POSTGRES_USER=mac_admin
POSTGRES_PASSWORD=<generated-secret>
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
# JWT
JWT_PRIVATE_KEY_PATH=/run/secrets/jwt_private.pem
JWT_PUBLIC_KEY_PATH=/run/secrets/jwt_public.pem
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=1440
JWT_REFRESH_TOKEN_EXPIRE_DAYS=30
# LiteLLM
LITELLM_PROXY_URL=http://litellm:4000
LITELLM_MASTER_KEY=<generated-secret>
# Qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333
# Models
MODEL_STORAGE_PATH=/models
DEFAULT_MODEL=qwen2.5-14b
Docker Compose Services
| Service | Image | Ports | Purpose |
|---|---|---|---|
| api | Custom (Dockerfile) | 8000 | FastAPI gateway |
| db | postgres:16-alpine | 5432 | User data, logs, keys |
| redis | redis:7-alpine | 6379 | Rate limiting, caching |
| litellm | ghcr.io/berriai/litellm | 4000 | Model proxy + routing |
| qdrant | qdrant/qdrant | 6333 | Vector database (RAG) |
| searxng | searxng/searxng | 8888 | Web search engine |
| nginx | nginx:alpine | 80, 443 | Reverse proxy + TLS |
API Versioning & Compatibility
- All endpoints are prefixed with /api/v1
- The /query endpoints return OpenAI-compatible JSON – any OpenAI SDK works by changing base_url
- Breaking changes will increment the version (/api/v2) while maintaining the previous version for one semester
- List endpoints use cursor-based pagination
OpenAI SDK Compatibility Example
from openai import OpenAI
client = OpenAI(
base_url="http://mac-server/api/v1",
api_key="mac_sk_live_your_key_here"
)
response = client.chat.completions.create(
model="auto",
messages=[
{"role": "user", "content": "Explain binary search in Python"}
]
)
print(response.choices[0].message.content)
Summary
MAC (MBM AI Cloud) delivers a production-grade, self-hosted AI platform that gives every student access to state-of-the-art AI models at zero recurring cost. The 8-phase build plan ensures a solid foundation before adding advanced features, and the architecture scales from a single lab PC to 30+ machines without rewriting a single line of code.
Phases 1–4 (API + Models + Integration + Usage Control) establish the complete backend – once deployed, students can start querying AI models from their laptops on day one.
MAC – MBM AI Cloud · API Design Document · Version 1.0 · 07 April 2026