---
title: matrix-ai
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
---
# matrix-ai
**matrix-ai** is the AI planning microservice for the Matrix EcoSystem. It generates **short, low-risk, auditable remediation plans** from compact health context provided by **Matrix Guardian**, and also exposes a lightweight **RAG** Q&A over MatrixHub documents.
It is optimized for **Hugging Face Spaces / Inference Endpoints**, but also runs locally and in containers.
> **Endpoints**
>
> * `POST /v1/plan` – internal API for Matrix Guardian: returns a safe JSON plan.
> * `POST /v1/chat` – Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
> * `GET /v1/chat/stream` – **SSE** token stream for interactive chat (production-hardened).
> * `POST /v1/chat/stream` – same as `GET` but with JSON payloads.
The service emphasizes **safety, performance, and auditability**:
* Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
* PII redaction before calling upstream model endpoints
* **Multi-provider LLM cascade:** **Groq → Gemini → HF Router (Zephyr → Mistral)** with automatic failover
* Production-safe **SSE** streaming & middleware (no body buffering, trace IDs, CORS, gzip)
* Exponential backoff, short timeouts, and structured JSON logs
* Per-IP rate limiting; optional `ADMIN_TOKEN` for private deployments
* RAG with SentenceTransformers (optional CrossEncoder re-ranker) over `data/kb.jsonl`
* ETag & response caching for non-mutating reads (where applicable)
*Last Updated: 2025-10-01 (UTC)*
---
## Architecture (at a glance)
```mermaid
flowchart LR
  subgraph Client [Matrix Operators / Observers]
  end
  Client -->|monitor| HubAPI[Matrix-Hub API]
  Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
  Guardian -->|/status,/apps,...| HubAPI
  HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]
  subgraph LLM [LLM Providers fallback cascade]
    GROQ[Groq<br/>llama-3.1-8b-instant]
    GEM[Google Gemini<br/>gemini-2.5-flash]
    HF[Hugging Face Router<br/>Zephyr → Mistral]
  end
  AI -->|primary| GROQ
  AI -->|fallback| GEM
  AI -->|final| HF
  classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
  classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
  class Guardian,AI,HubAPI svc
  class DB db
```
### Sequence: `POST /v1/plan` (planning)
```mermaid
sequenceDiagram
  participant G as Matrix-Guardian
  participant A as matrix-ai
  participant P as Provider Cascade
  G->>A: POST /v1/plan { context, constraints }
  A->>A: redact PII, validate payload (schema)
  A->>P: generate plan (timeouts, retries)
  alt Provider available
    P-->>A: model output text
  else Provider unavailable/limited
    P-->>A: fallback to next provider
  end
  A->>A: parse → strict JSON plan (safe defaults if needed)
  A-->>G: 200 { plan_id, steps[], risk, explanation }
```
### Sequence: `GET/POST /v1/chat/stream` (SSE chat)
```mermaid
sequenceDiagram
  participant C as Client (UI)
  participant A as matrix-ai (SSE-safe middleware)
  participant P as Provider Cascade
  C->>A: GET /v1/chat/stream?query=...
  A->>P: chat(messages, stream=True)
  loop token chunks
    P-->>A: delta (text)
    A-->>C: SSE data: {"delta": "..."}
  end
  A-->>C: SSE data: [DONE]
```
---
## Quick Start (Local Development)
```bash
# 1) Create venv
python3 -m venv .venv
source .venv/bin/activate
# 2) Install deps
pip install -r requirements.txt
# 3) Configure env (local only; use Space Secrets in prod)
cp configs/.env.example configs/.env
# Edit configs/.env with your keys (do NOT commit):
# GROQ_API_KEY=...
# GOOGLE_API_KEY=...
# HF_TOKEN=...
# 4) Run
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
OpenAPI docs: [http://localhost:7860/docs](http://localhost:7860/docs)
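Once the server is up, you can smoke-test the chat endpoint from Python. This is a minimal sketch using `httpx` (install it separately if it is not in your environment); the `query` field name is an assumption, so check `/docs` for the live request schema:

```python
# Minimal local smoke test for POST /v1/chat (assumes the server from step 4 is running).
import httpx

resp = httpx.post(
    "http://localhost:7860/v1/chat",
    json={"query": "What is MatrixHub?"},  # field name assumed; verify at /docs
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```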
---
## Provider Cascade (Groq → Gemini → HF Router)
**matrix-ai** uses a production-ready multi-provider orchestrator:
1. **Groq** (`llama-3.1-8b-instant`) – free tier, fast, low latency
2. **Gemini** (`gemini-2.5-flash`) – free tier
3. **HF Router** – `HuggingFaceH4/zephyr-7b-beta` → `mistralai/Mistral-7B-Instruct-v0.2`
Order is configurable via `provider_order`. Providers are skipped automatically if misconfigured or if quotas/credits are exceeded.
**Streaming:** Groq streams true tokens; Gemini/HF may yield one chunk (normalized to SSE).
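Conceptually, the orchestrator walks `provider_order` and returns the first successful completion. The sketch below shows the failover shape only; the provider callables and error handling are illustrative, not the service's actual internals:

```python
# Simplified failover cascade; the provider callables are illustrative placeholders.
from typing import Callable

def generate(prompt: str,
             providers: dict[str, Callable[[str], str]],
             order: list[str]) -> str:
    last_err: Exception | None = None
    for name in order:                 # e.g. ["groq", "gemini", "router"]
        fn = providers.get(name)
        if fn is None:                 # provider not configured: skip it
            continue
        try:
            return fn(prompt)          # first successful provider wins
        except Exception as err:       # quota/timeout/etc.: try the next one
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")
```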
---
## Configuration
All options can be set via environment variables (Space Secrets in HF), `.env` for local use, and/or `configs/settings.yaml`.
### `configs/settings.yaml` (excerpt)
```yaml
model:
  # HF router defaults (used at the last cascade step)
  name: "HuggingFaceH4/zephyr-7b-beta"
  fallback: "mistralai/Mistral-7B-Instruct-v0.2"
  provider: "featherless-ai"
  max_new_tokens: 256
  temperature: 0.2

  # Provider-specific defaults (free-tier friendly)
  groq_model: "llama-3.1-8b-instant"
  gemini_model: "gemini-2.5-flash"

  # Try providers in this order
  provider_order:
    - groq
    - gemini
    - router

  # Switch to the multi-provider path
  chat_backend: "multi"
  chat_stream: true

limits:
  rate_per_min: 60
  cache_size: 256

rag:
  index_dataset: ""
  top_k: 4

matrixhub:
  base_url: "https://api.matrixhub.io"

security:
  admin_token: ""
```
### Environment variables
| Variable | Default | Purpose |
| ---------------- | -----------------------------------: | ----------------------------------------- |
| `GROQ_API_KEY` | β€” | API key for Groq (primary) |
| `GOOGLE_API_KEY` | β€” | API key for Gemini |
| `HF_TOKEN` | β€” | Token for Hugging Face Inference Router |
| `GROQ_MODEL` | `llama-3.1-8b-instant` | Override Groq model |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Override Gemini model |
| `MODEL_NAME` | `HuggingFaceH4/zephyr-7b-beta` | HF Router primary model |
| `MODEL_FALLBACK` | `mistralai/Mistral-7B-Instruct-v0.2` | HF Router fallback |
| `MODEL_PROVIDER` | `featherless-ai` | HF provider tag (`model:provider`) |
| `PROVIDER_ORDER` | `groq,gemini,router` | Comma-sep. cascade order |
| `CHAT_STREAM` | `true` | Enable streaming where available |
| `RATE_LIMITS` | `60` | Per-IP req/min (middleware) |
| `ADMIN_TOKEN` | β€” | Gate `/v1/plan` & `/v1/chat*` (Bearer) |
| `RAG_KB_PATH` | `data/kb.jsonl` | Path to KB (if present) |
| `RAG_RERANK` | `true` | Enable CrossEncoder re-ranker (GPU-aware) |
| `LOG_LEVEL` | `INFO` | Structured JSON logs level |
> Never commit real API keys. Use Space Secrets / Vault in production.
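For ad-hoc scripts and debugging, the same variables can be read with the documented defaults (illustrative only; the service's own settings loader may resolve YAML/env precedence differently):

```python
# Read cascade-related settings from the environment, using the defaults above.
import os

provider_order = os.getenv("PROVIDER_ORDER", "groq,gemini,router").split(",")
chat_stream = os.getenv("CHAT_STREAM", "true").lower() == "true"
rate_per_min = int(os.getenv("RATE_LIMITS", "60"))

print(provider_order, chat_stream, rate_per_min)
```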
---
## API
### `POST /v1/plan`
**Description:** Generate a short, low-risk remediation plan from a compact app health context.
**Headers**
```
Content-Type: application/json
Authorization: Bearer <ADMIN_TOKEN> # required if ADMIN_TOKEN set
```
**Request (example)**
```json
{
  "context": {
    "entity_uid": "matrix-ai",
    "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
    "recent_checks": [
      {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
    ]
  },
  "constraints": {"max_steps": 3, "risk": "low"}
}
```
**Response (example)**
```json
{
  "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
  "risk": "low",
  "steps": [
    {"action": "reprobe", "target": "https://service/health", "retries": 2},
    {"action": "pin_lkg", "entity_uid": "matrix-ai"}
  ],
  "explanation": "Transient HTTP failures observed; re-probe and pin to last-known-good if still failing."
}
```
**Status codes**
* `200` – plan generated
* `400` – invalid payload (schema)
* `401/403` – missing/invalid bearer (only if `ADMIN_TOKEN` configured)
* `429` – rate limited
* `502` – upstream model error after retries
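A matching client call, mirroring the example payload above (a sketch using `httpx`; the bearer header is only needed when your deployment sets `ADMIN_TOKEN`):

```python
# Example client for POST /v1/plan, mirroring the request body documented above.
import os
import httpx

headers = {}
if os.getenv("ADMIN_TOKEN"):  # only required when the server gates the endpoint
    headers["Authorization"] = f"Bearer {os.environ['ADMIN_TOKEN']}"

payload = {
    "context": {
        "entity_uid": "matrix-ai",
        "health": {"score": 0.64, "status": "degraded",
                   "last_checked": "2025-10-01T00:00:00Z"},
        "recent_checks": [
            {"check": "http", "result": "fail", "latency_ms": 900,
             "ts": "2025-10-01T00:00:00Z"}
        ],
    },
    "constraints": {"max_steps": 3, "risk": "low"},
}

resp = httpx.post("http://localhost:7860/v1/plan",
                  json=payload, headers=headers, timeout=30)
resp.raise_for_status()
plan = resp.json()
print(plan["risk"], [step["action"] for step in plan["steps"]])
```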
### `POST /v1/chat`
Given a query about MatrixHub, the service returns an answer with citations **if** a local KB is configured at `RAG_KB_PATH`. It uses the same provider cascade as `/v1/plan`.
### `GET /v1/chat/stream` & `POST /v1/chat/stream`
Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (`Cache-Control: no-cache`, `X-Trace-Id`, `X-Process-Time-Ms`, `Server-Timing`).
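Consuming the stream from Python looks roughly like this (a sketch using `httpx`; the `{"delta": "..."}` payload shape matches the sequence diagram above):

```python
# Minimal SSE consumer for GET /v1/chat/stream (illustrative).
import httpx

url = "http://localhost:7860/v1/chat/stream"
with httpx.stream("GET", url, params={"query": "What is MatrixHub?"},
                  timeout=None) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data:"):
            continue                       # skip comments/blank keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":               # end-of-stream sentinel
            break
        print(data)                        # e.g. {"delta": "..."}
```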
---
## Safety & Reliability
* **PII redaction** – tokens/emails removed from prompts as a pre-filter (see the sketch after this list)
* **Strict schema** – JSON plan parsing with safe defaults; rejects unsafe shapes
* **Time-boxed** – short timeouts and bounded retries to providers
* **Rate-limited** – per-IP fixed window (configurable)
* **Structured logs** – JSON logs with `trace_id` for correlation
* **SSE-safe middleware** – never consumes streaming bodies; avoids Starlette “No response returned” pitfalls
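The PII pre-filter is conceptually a set of pattern-based redactions applied before any prompt leaves the service. A minimal sketch (the patterns here are illustrative, not the service's actual rules):

```python
# Illustrative PII pre-filter; patterns are examples, not the service's rules.
import re

PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
    (re.compile(r"\b(?:hf_|gsk_|sk-)[A-Za-z0-9]{10,}\b"), "<token>"),
]

def redact(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("contact admin@example.com, key hf_abcdefghijkl"))
# -> contact <email>, key <token>
```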
---
## RAG (Optional)
* **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (GPU-aware)
* **Re-ranking:** optional `cross-encoder/ms-marco-MiniLM-L-2-v2` (GPU-aware)
* **KB:** `data/kb.jsonl` (one JSON object per line: `{ "text": "...", "source": "..." }`; see the retrieval sketch after this list)
* **Tunable:** `rag.top_k`, `RAG_RERANK`, `RAG_KB_PATH`
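A toy retrieval pass over `data/kb.jsonl` with the embedding model above (a sketch; the query string is invented, and the service's actual index and re-rank code may differ):

```python
# Toy top-k retrieval over data/kb.jsonl using the embedding model above.
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

with open("data/kb.jsonl", encoding="utf-8") as fh:
    docs = [json.loads(line) for line in fh if line.strip()]

corpus_emb = model.encode([d["text"] for d in docs], convert_to_tensor=True)
query_emb = model.encode("How do I install an app from MatrixHub?",
                         convert_to_tensor=True)

for hit in util.semantic_search(query_emb, corpus_emb, top_k=4)[0]:
    doc = docs[hit["corpus_id"]]
    print(f"{hit['score']:.3f}  {doc['source']}: {doc['text'][:80]}")
```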
---
## Deployments
### Hugging Face Spaces (recommended for demo)
1. Push the repo to a new **Space** running the FastAPI app (the front matter above declares `sdk: docker`).
2. **Settings β†’ Secrets**:
* `GROQ_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN` (as needed by cascade)
* `ADMIN_TOKEN` (optional; gates `/v1/plan` & `/v1/chat*`)
3. Choose hardware (CPU is fine; a GPU improves RAG embedding throughput and cross-encoder re-ranking).
4. Space runs `uvicorn` and exposes all endpoints.
### Containers / Cloud
* Use a minimal Python base, install with `pip install -r requirements.txt`.
* Expose port `7860` (configurable).
* Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
* Scale with multiple Uvicorn workers; put behind an HTTP proxy that supports streaming (e.g., nginx with `proxy_buffering off` for SSE).
---
## Observability
* **Trace IDs** (`X-Trace-Id`) attached per request and logged
* **Timing headers**: `X-Process-Time-Ms`, `Server-Timing`
* Provider selection logs (e.g., `Provider 'groq' succeeded in 0.82s`)
* Metrics endpoints can be added behind an auth wall (Prometheus-friendly)
---
## Development Notes
* Keep `/v1/plan` **internal** behind a network boundary or `ADMIN_TOKEN`.
* Validate payloads rigorously (Pydantic) and write contract tests for the plan schema (see the sketch after this list).
* If you switch models, re-run golden tests to guard against plan drift.
* Avoid logging sensitive data; logs are structured JSON only.
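A contract-test style sketch of the plan shape, with field names taken from the `/v1/plan` example above (assumed, not the service's canonical models):

```python
# Illustrative Pydantic v2 models mirroring the /v1/plan response example.
from typing import Literal
from pydantic import BaseModel, Field

class Step(BaseModel):
    action: str
    model_config = {"extra": "allow"}  # "target", "retries", etc. vary per action

class Plan(BaseModel):
    plan_id: str
    risk: Literal["low", "medium", "high"]  # label set assumed
    steps: list[Step] = Field(max_length=5)  # bound is illustrative
    explanation: str

Plan.model_validate({
    "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
    "risk": "low",
    "steps": [{"action": "reprobe", "target": "https://service/health", "retries": 2}],
    "explanation": "Transient HTTP failures observed; re-probe.",
})
```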
---
## License
Apache-2.0
---
**Tip:** The cascade order is controlled by `provider_order` (`groq,gemini,router`). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.