---
title: matrix-ai
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
---
# matrix-ai
**matrix-ai** is the AI planning microservice for the Matrix EcoSystem. It generates **short, low-risk, auditable remediation plans** from compact health context provided by **Matrix Guardian**, and also exposes a lightweight **RAG** Q&A over MatrixHub documents.
It is optimized for **Hugging Face Spaces / Inference Endpoints**, but also runs locally and in containers.
> **Endpoints**
>
> * `POST /v1/plan` – internal API for Matrix Guardian: returns a safe JSON plan.
> * `POST /v1/chat` – Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
> * `GET /v1/chat/stream` – **SSE** token stream for interactive chat (production-hardened).
> * `POST /v1/chat/stream` – same as `GET` but with JSON payloads.
The service emphasizes **safety, performance, and auditability**:
* Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
* PII redaction before calling upstream model endpoints
* **Multi-provider LLM cascade:** **GROQ → Gemini → HF Router (Zephyr → Mistral)** with automatic failover
* Production-safe **SSE** streaming & middleware (no body buffering, trace IDs, CORS, gzip)
* Exponential backoff, short timeouts, and structured JSON logs
* Per-IP rate limiting; optional `ADMIN_TOKEN` for private deployments
* RAG with SentenceTransformers (optional CrossEncoder re-ranker) over `data/kb.jsonl`
* ETag & response caching for non-mutating reads (where applicable)
*Last Updated: 2025-10-01 (UTC)*
---
## Architecture (at a glance)
```mermaid
flowchart LR
subgraph Client [Matrix Operators / Observers]
end
Client -->|monitor| HubAPI[Matrix-Hub API]
Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
Guardian -->|/status,/apps,...| HubAPI
HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]
subgraph LLM [LLM Providers fallback cascade]
GROQ[Groq<br/>llama-3.1-8b-instant]
GEM[Google Gemini<br/>gemini-2.5-flash]
HF[Hugging Face Router<br/>Zephyr → Mistral]
end
AI -->|primary| GROQ
AI -->|fallback| GEM
AI -->|final| HF
classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
class Guardian,AI,HubAPI svc
class DB db
```
### Sequence: `POST /v1/plan` (planning)
```mermaid
sequenceDiagram
participant G as Matrix-Guardian
participant A as matrix-ai
participant P as Provider Cascade
G->>A: POST /v1/plan { context, constraints }
A->>A: redact PII, validate payload (schema)
A->>P: generate plan (timeouts, retries)
alt Provider available
P-->>A: model output text
else Provider unavailable/limited
P-->>A: fallback to next provider
end
A->>A: parse → strict JSON plan (safe defaults if needed)
A-->>G: 200 { plan_id, steps[], risk, explanation }
```
### Sequence: `GET/POST /v1/chat/stream` (SSE chat)
```mermaid
sequenceDiagram
participant C as Client (UI)
participant A as matrix-ai (SSE-safe middleware)
participant P as Provider Cascade
C->>A: GET /v1/chat/stream?query=...
A->>P: chat(messages, stream=True)
loop token chunks
P-->>A: delta (text)
A-->>C: SSE data: {"delta": "..."}
end
A-->>C: SSE data: [DONE]
```
---
## Quick Start (Local Development)
```bash
# 1) Create venv
python3 -m venv .venv
source .venv/bin/activate
# 2) Install deps
pip install -r requirements.txt
# 3) Configure env (local only; use Space Secrets in prod)
cp configs/.env.example configs/.env
# Edit configs/.env with your keys (do NOT commit):
# GROQ_API_KEY=...
# GOOGLE_API_KEY=...
# HF_TOKEN=...
# 4) Run
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
OpenAPI docs: [http://localhost:7860/docs](http://localhost:7860/docs)
---
## Provider Cascade (GROQ → Gemini → HF Router)
**matrix-ai** uses a production-ready multi-provider orchestrator:
1. **Groq** (`llama-3.1-8b-instant`) – free, fast, great latency
2. **Gemini** (`gemini-2.5-flash`) – free tier
3. **HF Router** – `HuggingFaceH4/zephyr-7b-beta` → `mistralai/Mistral-7B-Instruct-v0.2`
Order is configurable via `provider_order`. Providers are skipped automatically if misconfigured or if quotas/credits are exceeded.
**Streaming:** Groq streams true tokens; Gemini/HF may yield one chunk (normalized to SSE).
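The failover loop itself is small. Below is a minimal, hypothetical sketch of it (the `ProviderError` type, `cascade_chat` name, and provider callables are illustrative, not the service's actual code): each provider in `provider_order` is tried in turn, unconfigured ones are skipped, and quota or transport failures move on to the next.
```python
from typing import Callable, Iterable

class ProviderError(Exception):
    """Illustrative: a provider is misconfigured, rate-limited, or out of credits."""

def cascade_chat(
    messages: list[dict],
    providers: dict[str, Callable[[list[dict]], str]],
    order: Iterable[str] = ("groq", "gemini", "router"),
) -> str:
    """Try each configured provider in order, failing over on errors."""
    last_err: Exception | None = None
    for name in order:
        call = providers.get(name)
        if call is None:
            continue  # provider not configured -> skip it
        try:
            return call(messages)
        except ProviderError as err:
            last_err = err  # quota/availability failure -> try the next one
    raise RuntimeError("all providers in the cascade failed") from last_err
```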
---
## Configuration
All options can be set via environment variables (Space Secrets in HF), `.env` for local use, and/or `configs/settings.yaml`.
### `configs/settings.yaml` (excerpt)
```yaml
model:
# HF router defaults (used at the last step)
name: "HuggingFaceH4/zephyr-7b-beta"
fallback: "mistralai/Mistral-7B-Instruct-v0.2"
provider: "featherless-ai"
max_new_tokens: 256
temperature: 0.2
# Provider-specific defaults (free-tier friendly)
groq_model: "llama-3.1-8b-instant"
gemini_model: "gemini-2.5-flash"
# Try providers in this order
provider_order:
- groq
- gemini
- router
# Switch to the multi-provider path
chat_backend: "multi"
chat_stream: true
limits:
rate_per_min: 60
cache_size: 256
rag:
index_dataset: ""
top_k: 4
matrixhub:
base_url: "https://api.matrixhub.io"
security:
admin_token: ""
```
### Environment variables
| Variable | Default | Purpose |
| ---------------- | -----------------------------------: | ----------------------------------------- |
| `GROQ_API_KEY`   | –                                     | API key for Groq (primary)                 |
| `GOOGLE_API_KEY` | –                                     | API key for Gemini                         |
| `HF_TOKEN`       | –                                     | Token for Hugging Face Inference Router    |
| `GROQ_MODEL` | `llama-3.1-8b-instant` | Override Groq model |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Override Gemini model |
| `MODEL_NAME` | `HuggingFaceH4/zephyr-7b-beta` | HF Router primary model |
| `MODEL_FALLBACK` | `mistralai/Mistral-7B-Instruct-v0.2` | HF Router fallback |
| `MODEL_PROVIDER` | `featherless-ai` | HF provider tag (`model:provider`) |
| `PROVIDER_ORDER` | `groq,gemini,router` | Comma-sep. cascade order |
| `CHAT_STREAM` | `true` | Enable streaming where available |
| `RATE_LIMITS` | `60` | Per-IP req/min (middleware) |
| `ADMIN_TOKEN`    | –                                     | Gate `/v1/plan` & `/v1/chat*` (Bearer)     |
| `RAG_KB_PATH` | `data/kb.jsonl` | Path to KB (if present) |
| `RAG_RERANK` | `true` | Enable CrossEncoder re-ranker (GPU-aware) |
| `LOG_LEVEL` | `INFO` | Structured JSON logs level |
> Never commit real API keys. Use Space Secrets / Vault in production.
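For local tinkering, the table above maps onto plain environment lookups. A minimal sketch, assuming the variable names and defaults documented above (the service's real settings loader may differ):
```python
import os

# Assumed names/defaults from the table above; illustrative only.
PROVIDER_ORDER = [
    p.strip()
    for p in os.getenv("PROVIDER_ORDER", "groq,gemini,router").split(",")
]
GROQ_MODEL = os.getenv("GROQ_MODEL", "llama-3.1-8b-instant")
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
CHAT_STREAM = os.getenv("CHAT_STREAM", "true").lower() == "true"
RATE_LIMITS = int(os.getenv("RATE_LIMITS", "60"))
```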
---
## API
### `POST /v1/plan`
**Description:** Generate a short, low-risk remediation plan from a compact app health context.
**Headers**
```
Content-Type: application/json
Authorization: Bearer <ADMIN_TOKEN> # required if ADMIN_TOKEN set
```
**Request (example)**
```json
{
"context": {
"entity_uid": "matrix-ai",
"health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
"recent_checks": [
{"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
]
},
"constraints": {"max_steps": 3, "risk": "low"}
}
```
**Response (example)**
```json
{
"plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
"risk": "low",
"steps": [
{"action": "reprobe", "target": "https://service/health", "retries": 2},
{"action": "pin_lkg", "entity_uid": "matrix-ai"}
],
"explanation": "Transient HTTP failures observed; re-probe and pin to last-known-good if still failing."
}
```
**Status codes**
* `200` – plan generated
* `400` – invalid payload (schema)
* `401/403` – missing/invalid bearer (only if `ADMIN_TOKEN` configured)
* `429` – rate limited
* `502` – upstream model error after retries
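A minimal client call using the request shape above (`httpx` is just one convenient HTTP client; the URL assumes a local run on port 7860):
```python
import os
import httpx

payload = {
    "context": {
        "entity_uid": "matrix-ai",
        "health": {"score": 0.64, "status": "degraded",
                   "last_checked": "2025-10-01T00:00:00Z"},
        "recent_checks": [
            {"check": "http", "result": "fail", "latency_ms": 900,
             "ts": "2025-10-01T00:00:00Z"}
        ],
    },
    "constraints": {"max_steps": 3, "risk": "low"},
}
headers = {}
if os.getenv("ADMIN_TOKEN"):  # only needed when the deployment sets ADMIN_TOKEN
    headers["Authorization"] = f"Bearer {os.environ['ADMIN_TOKEN']}"

resp = httpx.post("http://localhost:7860/v1/plan",
                  json=payload, headers=headers, timeout=30.0)
resp.raise_for_status()
print(resp.json()["steps"])
```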
### `POST /v1/chat`
Given a query about MatrixHub, returns an answer with citations **if** a local KB is configured at `RAG_KB_PATH`. Uses the same provider cascade.
### `GET /v1/chat/stream` & `POST /v1/chat/stream`
Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (`Cache-Control: no-cache`, `X-Trace-Id`, `X-Process-Time-Ms`, `Server-Timing`).
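A minimal SSE consumer, assuming the `data: {"delta": ...}` / `data: [DONE]` framing shown in the sequence diagram above:
```python
import json
import httpx

url = "http://localhost:7860/v1/chat/stream"
with httpx.stream("GET", url, params={"query": "What is MatrixHub?"},
                  timeout=None) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data:"):
            continue  # skip comments/blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        print(json.loads(data)["delta"], end="", flush=True)
```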
---
## Safety & Reliability
* **PII redaction** – tokens/emails removed from prompts as a pre-filter (sketched below)
* **Strict schema** – JSON plan parsing with safe defaults; rejects unsafe shapes
* **Time-boxed** – short timeouts and bounded retries to providers
* **Rate-limited** – per-IP fixed window (configurable)
* **Structured logs** – JSON logs with `trace_id` for correlation
* **SSE-safe middleware** – never consumes streaming bodies; avoids Starlette "No response returned" pitfalls
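As a rough illustration of the PII pre-filter, a regex-based redactor might look like the sketch below. The patterns here are examples only; the service's actual redaction rules are not shown in this README and are likely broader.
```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Example token shapes only (e.g. sk_/hf_/gsk_-style prefixes); illustrative.
TOKEN_RE = re.compile(r"\b(?:sk|hf|gsk)_[A-Za-z0-9]{8,}")

def redact(text: str) -> str:
    """Strip emails and API-token-looking strings before prompting a model."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return TOKEN_RE.sub("[REDACTED_TOKEN]", text)

assert "alice@example.com" not in redact("contact alice@example.com")
```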
---
## RAG (Optional)
* **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (GPU-aware)
* **Re-ranking:** optional `cross-encoder/ms-marco-MiniLM-L-2-v2` (GPU-aware)
* **KB:** `data/kb.jsonl` (one JSON object per line: `{ "text": "...", "source": "..." }`); loading and retrieval are sketched below
* **Tunable:** `rag.top_k`, `RAG_RERANK`, `RAG_KB_PATH`
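Putting those pieces together, the retrieval path can be sketched as follows. The model names are the ones listed above; the loading and search code is illustrative, not the service's actual implementation.
```python
import json
from sentence_transformers import CrossEncoder, SentenceTransformer, util

with open("data/kb.jsonl", encoding="utf-8") as fh:
    kb = [json.loads(line) for line in fh]  # {"text": ..., "source": ...}

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2")
corpus = embedder.encode([doc["text"] for doc in kb], convert_to_tensor=True)

def retrieve(query: str, top_k: int = 4, rerank: bool = True) -> list[dict]:
    """Embed the query, take top-k by cosine, optionally re-rank pairwise."""
    q = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q, corpus, top_k=top_k)[0]
    docs = [kb[h["corpus_id"]] for h in hits]
    if rerank:
        scores = reranker.predict([(query, d["text"]) for d in docs])
        docs = [d for _, d in sorted(zip(scores, docs), key=lambda p: -p[0])]
    return docs
```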
---
## Deployments
### Hugging Face Spaces (recommended for demo)
1. Push repo to a new **Space** (FastAPI).
2. **Settings → Secrets**:
* `GROQ_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN` (as needed by cascade)
* `ADMIN_TOKEN` (optional; gates `/v1/plan` & `/v1/chat*`)
3. Choose hardware (CPU is fine; GPU improves RAG throughput and cross-encoder).
4. Space runs `uvicorn` and exposes all endpoints.
### Containers / Cloud
* Use a minimal Python base, install with `pip install -r requirements.txt`.
* Expose port `7860` (configurable).
* Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
* Scale with multiple Uvicorn workers; put behind an HTTP proxy that supports streaming (e.g., nginx with `proxy_buffering off` for SSE).
---
## Observability
* **Trace IDs** (`X-Trace-Id`) attached per request and logged (middleware sketched after this list)
* **Timing headers**: `X-Process-Time-Ms`, `Server-Timing`
* Provider selection logs (e.g., `Provider 'groq' succeeded in 0.82s`)
* Metrics endpoints can be added behind an auth wall (Prometheus friendly)
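A stripped-down version of the tracing middleware might look like the following. This is illustrative only; the production middleware is written to stay SSE-safe, which may mean a raw ASGI wrapper rather than FastAPI's decorator style.
```python
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def tracing(request: Request, call_next):
    """Attach the documented trace/timing headers to every response."""
    trace_id = request.headers.get("X-Trace-Id", uuid.uuid4().hex)
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    response.headers["X-Trace-Id"] = trace_id
    response.headers["X-Process-Time-Ms"] = f"{elapsed_ms:.1f}"
    response.headers["Server-Timing"] = f"app;dur={elapsed_ms:.1f}"
    return response
```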
---
## Development Notes
* Keep `/v1/plan` **internal** behind a network boundary or `ADMIN_TOKEN`.
* Validate payloads rigorously (Pydantic) and write contract tests for the plan schema (a sketch follows below).
* If you switch models, re-run golden tests to guard against plan drift.
* Avoid logging sensitive data; logs are structured JSON only.
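For the contract tests, a Pydantic model pinning the plan shape could serve as the golden schema. Field names below follow the `/v1/plan` response example; the allowed `risk` values and the step bound are assumptions, not the service's actual model.
```python
from typing import Literal, Optional

from pydantic import BaseModel, Field

class PlanStep(BaseModel):
    action: str
    target: Optional[str] = None
    entity_uid: Optional[str] = None
    retries: Optional[int] = None

class Plan(BaseModel):
    plan_id: str
    risk: Literal["low", "medium", "high"]  # assumed label set
    steps: list[PlanStep] = Field(max_length=3)  # mirrors max_steps: 3
    explanation: str

Plan.model_validate({
    "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
    "risk": "low",
    "steps": [{"action": "reprobe", "target": "https://service/health",
               "retries": 2}],
    "explanation": "Transient HTTP failures; re-probe first.",
})
```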
---
## License
Apache-2.0
---
**Tip:** The cascade order is controlled by `provider_order` (`groq,gemini,router`). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then to the Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.