---
title: matrix-ai
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
---
# matrix-ai
**matrix-ai** is the AI planning microservice for the Matrix EcoSystem. It generates **short, low-risk, auditable remediation plans** from compact health context provided by **Matrix Guardian**, and also exposes a lightweight **RAG** Q&A over MatrixHub documents.
It is optimized for **Hugging Face Spaces / Inference Endpoints**, but also runs locally and in containers.
> **Endpoints**
>
> * `POST /v1/plan` – internal API for Matrix Guardian: returns a safe JSON plan.
> * `POST /v1/chat` – Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
> * `GET /v1/chat/stream` – **SSE** token stream for interactive chat (production-hardened).
> * `POST /v1/chat/stream` – same as `GET` but with JSON payloads.
The service emphasizes **safety, performance, and auditability**:
* Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
* PII redaction before calling upstream model endpoints
* **Multi-provider LLM cascade:** **Groq → Gemini → HF Router (Zephyr → Mistral)** with automatic failover
* Production-safe **SSE** streaming & middleware (no body buffering, trace IDs, CORS, gzip)
* Exponential backoff, short timeouts, and structured JSON logs
* Per-IP rate limiting; optional `ADMIN_TOKEN` for private deployments
* RAG with SentenceTransformers (optional CrossEncoder re-ranker) over `data/kb.jsonl`
* ETag & response caching for non-mutating reads (where applicable)
*Last Updated: 2025-10-01 (UTC)*
---
## Architecture (at a glance)
```mermaid
flowchart LR
  subgraph Client [Matrix Operators / Observers]
  end
  Client -->|monitor| HubAPI[Matrix-Hub API]
  Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
  Guardian -->|/status,/apps,...| HubAPI
  HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]
  subgraph LLM [LLM Providers fallback cascade]
    GROQ[Groq<br/>llama-3.1-8b-instant]
    GEM[Google Gemini<br/>gemini-2.5-flash]
    HF[Hugging Face Router<br/>Zephyr → Mistral]
  end
  AI -->|primary| GROQ
  AI -->|fallback| GEM
  AI -->|final| HF
  classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
  classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
  class Guardian,AI,HubAPI svc
  class DB db
```
### Sequence: `POST /v1/plan` (planning)
```mermaid
sequenceDiagram
  participant G as Matrix-Guardian
  participant A as matrix-ai
  participant P as Provider Cascade
  G->>A: POST /v1/plan { context, constraints }
  A->>A: redact PII, validate payload (schema)
  A->>P: generate plan (timeouts, retries)
  alt Provider available
    P-->>A: model output text
  else Provider unavailable/limited
    P-->>A: fallback to next provider
  end
  A->>A: parse → strict JSON plan (safe defaults if needed)
  A-->>G: 200 { plan_id, steps[], risk, explanation }
```
### Sequence: `GET/POST /v1/chat/stream` (SSE chat)
```mermaid
sequenceDiagram
  participant C as Client (UI)
  participant A as matrix-ai (SSE-safe middleware)
  participant P as Provider Cascade
  C->>A: GET /v1/chat/stream?query=...
  A->>P: chat(messages, stream=True)
  loop token chunks
    P-->>A: delta (text)
    A-->>C: SSE data: {"delta": "..."}
  end
  A-->>C: SSE data: [DONE]
```
---
## Quick Start (Local Development)
```bash
# 1) Create venv
python3 -m venv .venv
source .venv/bin/activate
# 2) Install deps
pip install -r requirements.txt
# 3) Configure env (local only; use Space Secrets in prod)
cp configs/.env.example configs/.env
# Edit configs/.env with your keys (do NOT commit):
# GROQ_API_KEY=...
# GOOGLE_API_KEY=...
# HF_TOKEN=...
# 4) Run
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
OpenAPI docs: [http://localhost:7860/docs](http://localhost:7860/docs)
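Once the server is up, you can smoke-test the chat endpoint from Python. This is a minimal sketch using `httpx` (install it separately if it is not in your environment); the `query` field name is an assumption, so check `/docs` for the live request schema:

```python
# Minimal local smoke test for POST /v1/chat (assumes the server from step 4 is running).
import httpx

resp = httpx.post(
    "http://localhost:7860/v1/chat",
    json={"query": "What is MatrixHub?"},  # field name assumed; verify at /docs
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```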
---
## Provider Cascade (Groq → Gemini → HF Router)
**matrix-ai** uses a production-ready multi-provider orchestrator:
1. **Groq** (`llama-3.1-8b-instant`) – free tier, fast, low latency
2. **Gemini** (`gemini-2.5-flash`) – free tier
3. **HF Router** – `HuggingFaceH4/zephyr-7b-beta` → `mistralai/Mistral-7B-Instruct-v0.2`
Order is configurable via `provider_order`. Providers are skipped automatically if misconfigured or if quotas/credits are exceeded.
**Streaming:** Groq streams true tokens; Gemini/HF may yield one chunk (normalized to SSE).
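Conceptually, the orchestrator walks `provider_order` and returns the first successful completion. The sketch below shows the failover shape only; the provider callables and error handling are illustrative, not the service's actual internals:

```python
# Simplified failover cascade; the provider callables are illustrative placeholders.
from typing import Callable

def generate(prompt: str,
             providers: dict[str, Callable[[str], str]],
             order: list[str]) -> str:
    last_err: Exception | None = None
    for name in order:                 # e.g. ["groq", "gemini", "router"]
        fn = providers.get(name)
        if fn is None:                 # provider not configured: skip it
            continue
        try:
            return fn(prompt)          # first successful provider wins
        except Exception as err:       # quota/timeout/etc.: try the next one
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")
```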
---
## Configuration
All options can be set via environment variables (Space Secrets in HF), `.env` for local use, and/or `configs/settings.yaml`.
### `configs/settings.yaml` (excerpt)
```yaml
model:
  # HF router defaults (used at the last cascade step)
  name: "HuggingFaceH4/zephyr-7b-beta"
  fallback: "mistralai/Mistral-7B-Instruct-v0.2"
  provider: "featherless-ai"
  max_new_tokens: 256
  temperature: 0.2

  # Provider-specific defaults (free-tier friendly)
  groq_model: "llama-3.1-8b-instant"
  gemini_model: "gemini-2.5-flash"

  # Try providers in this order
  provider_order:
    - groq
    - gemini
    - router

  # Switch to the multi-provider path
  chat_backend: "multi"
  chat_stream: true

limits:
  rate_per_min: 60
  cache_size: 256

rag:
  index_dataset: ""
  top_k: 4

matrixhub:
  base_url: "https://api.matrixhub.io"

security:
  admin_token: ""
```
### Environment variables
| Variable | Default | Purpose |
| ---------------- | -----------------------------------: | ----------------------------------------- |
| `GROQ_API_KEY` | β€” | API key for Groq (primary) |
| `GOOGLE_API_KEY` | β€” | API key for Gemini |
| `HF_TOKEN` | β€” | Token for Hugging Face Inference Router |
| `GROQ_MODEL` | `llama-3.1-8b-instant` | Override Groq model |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Override Gemini model |
| `MODEL_NAME` | `HuggingFaceH4/zephyr-7b-beta` | HF Router primary model |
| `MODEL_FALLBACK` | `mistralai/Mistral-7B-Instruct-v0.2` | HF Router fallback |
| `MODEL_PROVIDER` | `featherless-ai` | HF provider tag (`model:provider`) |
| `PROVIDER_ORDER` | `groq,gemini,router` | Comma-sep. cascade order |
| `CHAT_STREAM` | `true` | Enable streaming where available |
| `RATE_LIMITS` | `60` | Per-IP req/min (middleware) |
| `ADMIN_TOKEN` | β€” | Gate `/v1/plan` & `/v1/chat*` (Bearer) |
| `RAG_KB_PATH` | `data/kb.jsonl` | Path to KB (if present) |
| `RAG_RERANK` | `true` | Enable CrossEncoder re-ranker (GPU-aware) |
| `LOG_LEVEL` | `INFO` | Structured JSON logs level |
> Never commit real API keys. Use Space Secrets / Vault in production.
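For ad-hoc scripts and debugging, the same variables can be read with the documented defaults (illustrative only; the service's own settings loader may resolve YAML/env precedence differently):

```python
# Read cascade-related settings from the environment, using the defaults above.
import os

provider_order = os.getenv("PROVIDER_ORDER", "groq,gemini,router").split(",")
chat_stream = os.getenv("CHAT_STREAM", "true").lower() == "true"
rate_per_min = int(os.getenv("RATE_LIMITS", "60"))

print(provider_order, chat_stream, rate_per_min)
```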
---
## API
### `POST /v1/plan`
**Description:** Generate a short, low-risk remediation plan from a compact app health context.
**Headers**
```
Content-Type: application/json
Authorization: Bearer <ADMIN_TOKEN> # required if ADMIN_TOKEN set
```
**Request (example)**
```json
{
  "context": {
    "entity_uid": "matrix-ai",
    "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
    "recent_checks": [
      {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
    ]
  },
  "constraints": {"max_steps": 3, "risk": "low"}
}
```
**Response (example)**
```json
{
  "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
  "risk": "low",
  "steps": [
    {"action": "reprobe", "target": "https://service/health", "retries": 2},
    {"action": "pin_lkg", "entity_uid": "matrix-ai"}
  ],
  "explanation": "Transient HTTP failures observed; re-probe and pin to last-known-good if still failing."
}
```
**Status codes**
* `200` – plan generated
* `400` – invalid payload (schema)
* `401/403` – missing/invalid bearer (only if `ADMIN_TOKEN` configured)
* `429` – rate limited
* `502` – upstream model error after retries
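A matching client call, mirroring the example payload above (a sketch using `httpx`; the bearer header is only needed when your deployment sets `ADMIN_TOKEN`):

```python
# Example client for POST /v1/plan, mirroring the request body documented above.
import os
import httpx

headers = {}
if os.getenv("ADMIN_TOKEN"):  # only required when the server gates the endpoint
    headers["Authorization"] = f"Bearer {os.environ['ADMIN_TOKEN']}"

payload = {
    "context": {
        "entity_uid": "matrix-ai",
        "health": {"score": 0.64, "status": "degraded",
                   "last_checked": "2025-10-01T00:00:00Z"},
        "recent_checks": [
            {"check": "http", "result": "fail", "latency_ms": 900,
             "ts": "2025-10-01T00:00:00Z"}
        ],
    },
    "constraints": {"max_steps": 3, "risk": "low"},
}

resp = httpx.post("http://localhost:7860/v1/plan",
                  json=payload, headers=headers, timeout=30)
resp.raise_for_status()
plan = resp.json()
print(plan["risk"], [step["action"] for step in plan["steps"]])
```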
### `POST /v1/chat`
Given a query about MatrixHub, the service returns an answer with citations **if** a local KB is configured at `RAG_KB_PATH`. It uses the same provider cascade as `/v1/plan`.
### `GET /v1/chat/stream` & `POST /v1/chat/stream`
Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (`Cache-Control: no-cache`, `X-Trace-Id`, `X-Process-Time-Ms`, `Server-Timing`).
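Consuming the stream from Python looks roughly like this (a sketch using `httpx`; the `{"delta": "..."}` payload shape matches the sequence diagram above):

```python
# Minimal SSE consumer for GET /v1/chat/stream (illustrative).
import httpx

url = "http://localhost:7860/v1/chat/stream"
with httpx.stream("GET", url, params={"query": "What is MatrixHub?"},
                  timeout=None) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data:"):
            continue                       # skip comments/blank keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":               # end-of-stream sentinel
            break
        print(data)                        # e.g. {"delta": "..."}
```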
---
## Safety & Reliability
* **PII redaction** – tokens/emails removed from prompts as a pre-filter (see the sketch after this list)
* **Strict schema** – JSON plan parsing with safe defaults; rejects unsafe shapes
* **Time-boxed** – short timeouts and bounded retries to providers
* **Rate-limited** – per-IP fixed window (configurable)
* **Structured logs** – JSON logs with `trace_id` for correlation
* **SSE-safe middleware** – never consumes streaming bodies; avoids Starlette “No response returned” pitfalls
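The PII pre-filter is conceptually a set of pattern-based redactions applied before any prompt leaves the service. A minimal sketch (the patterns here are illustrative, not the service's actual rules):

```python
# Illustrative PII pre-filter; patterns are examples, not the service's rules.
import re

PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
    (re.compile(r"\b(?:hf_|gsk_|sk-)[A-Za-z0-9]{10,}\b"), "<token>"),
]

def redact(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("contact admin@example.com, key hf_abcdefghijkl"))
# -> contact <email>, key <token>
```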
---
## RAG (Optional)
* **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (GPU-aware)
* **Re-ranking:** optional `cross-encoder/ms-marco-MiniLM-L-2-v2` (GPU-aware)
* **KB:** `data/kb.jsonl` (one JSON object per line: `{ "text": "...", "source": "..." }`; see the retrieval sketch after this list)
* **Tunable:** `rag.top_k`, `RAG_RERANK`, `RAG_KB_PATH`
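A toy retrieval pass over `data/kb.jsonl` with the embedding model above (a sketch; the query string is invented, and the service's actual index and re-rank code may differ):

```python
# Toy top-k retrieval over data/kb.jsonl using the embedding model above.
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

with open("data/kb.jsonl", encoding="utf-8") as fh:
    docs = [json.loads(line) for line in fh if line.strip()]

corpus_emb = model.encode([d["text"] for d in docs], convert_to_tensor=True)
query_emb = model.encode("How do I install an app from MatrixHub?",
                         convert_to_tensor=True)

for hit in util.semantic_search(query_emb, corpus_emb, top_k=4)[0]:
    doc = docs[hit["corpus_id"]]
    print(f"{hit['score']:.3f}  {doc['source']}: {doc['text'][:80]}")
```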
---
## Deployments
### Hugging Face Spaces (recommended for demo)
1. Push the repo to a new **Space** running the FastAPI app (the front matter above declares `sdk: docker`).
2. **Settings β†’ Secrets**:
* `GROQ_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN` (as needed by cascade)
* `ADMIN_TOKEN` (optional; gates `/v1/plan` & `/v1/chat*`)
3. Choose hardware (CPU is fine; a GPU improves RAG embedding throughput and cross-encoder re-ranking).
4. Space runs `uvicorn` and exposes all endpoints.
### Containers / Cloud
* Use a minimal Python base, install with `pip install -r requirements.txt`.
* Expose port `7860` (configurable).
* Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
* Scale with multiple Uvicorn workers; put behind an HTTP proxy that supports streaming (e.g., nginx with `proxy_buffering off` for SSE).
---
## Observability
* **Trace IDs** (`X-Trace-Id`) attached per request and logged
* **Timing headers**: `X-Process-Time-Ms`, `Server-Timing`
* Provider selection logs (e.g., `Provider 'groq' succeeded in 0.82s`)
* Metrics endpoints can be added behind an auth wall (Prometheus-friendly)
---
## Development Notes
* Keep `/v1/plan` **internal** behind a network boundary or `ADMIN_TOKEN`.
* Validate payloads rigorously (Pydantic) and write contract tests for the plan schema (see the sketch after this list).
* If you switch models, re-run golden tests to guard against plan drift.
* Avoid logging sensitive data; logs are structured JSON only.
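A contract-test style sketch of the plan shape, with field names taken from the `/v1/plan` example above (assumed, not the service's canonical models):

```python
# Illustrative Pydantic v2 models mirroring the /v1/plan response example.
from typing import Literal
from pydantic import BaseModel, Field

class Step(BaseModel):
    action: str
    model_config = {"extra": "allow"}  # "target", "retries", etc. vary per action

class Plan(BaseModel):
    plan_id: str
    risk: Literal["low", "medium", "high"]  # label set assumed
    steps: list[Step] = Field(max_length=5)  # bound is illustrative
    explanation: str

Plan.model_validate({
    "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
    "risk": "low",
    "steps": [{"action": "reprobe", "target": "https://service/health", "retries": 2}],
    "explanation": "Transient HTTP failures observed; re-probe.",
})
```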
---
## License
Apache-2.0
---
**Tip:** The cascade order is controlled by `provider_order` (`groq,gemini,router`). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.