ruslanmv committed on
Commit 5a47824 · 1 parent: cd6d7ff

Update README.md

Files changed (1):
  1. README.md +210 -74
README.md CHANGED
@@ -5,26 +5,36 @@ colorFrom: purple
  colorTo: indigo
  sdk: docker
  pinned: false
  ---

  # matrix-ai

- **matrix-ai** is the AI planning microservice for the Matrix EcoSystem. It generates **short, low-risk, auditable remediation plans** from a compact health context provided by **Matrix Guardian**. The service is designed for **Hugging Face Spaces** or **Inference Endpoints**, but also runs locally.

  > **Endpoints**
  >
  > * `POST /v1/plan` – internal API for Matrix Guardian: returns a safe JSON plan.
- > * `POST /v1/chat` – (optional) RAG-style Q&A about MatrixHub (kept lightweight in Stage-1).

  The service emphasizes **safety, performance, and auditability**:

- * Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
  * PII redaction before calling upstream model endpoints
  * Exponential backoff, short timeouts, and structured JSON logs
- * In-memory rate limiting (per-IP), optional auth for private deployments
- * ETag support and response caching for non-mutating reads

- *Last Updated: 2025-09-27 (UTC)*

  ---

@@ -32,45 +42,74 @@ The service emphasizes **safety, performance, and auditability**:

  ```mermaid
  flowchart LR
-   subgraph Client[Matrix Operators / Observers]
-   end

-   Client -->|monitor| HubAPI[Matrix-Hub API]
-   Guardian[Matrix-Guardian
-   control plane] -->|/v1/plan| AI[matrix-ai
-   HF Space]
-   Guardian -->|/status,/apps,...| HubAPI
-   HubAPI <-->|SQL| DB[(MatrixDB
-   Postgres)]
-
-   AI -->|HF Inference| HF[Hugging Face
-   Inference API]
-
-   classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
-   classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
-   class Guardian,AI,HubAPI svc
-   class DB db
  ```

- ### Sequence: `POST /v1/plan`

  ```mermaid
  sequenceDiagram
-   participant G as Matrix-Guardian
-   participant A as matrix-ai
-   participant H as HF Inference
-
-   G->>A: POST /v1/plan { context, constraints }
-   A->>A: redact PII, validate payload
-   A->>H: model.generate prompt [retries, timeout]
-   H-->>A: model output text
-   A->>A: parse strict JSON plan (fallback if needed)
-   A-->>G: 200 { plan_id, steps[], risk, explanation }
  ```

  ---

  ## Quick Start (Local Development)

  ```bash
  # 1) Create venv
  python3 -m venv .venv
@@ -80,47 +119,99 @@ source .venv/bin/activate
  pip install -r requirements.txt

  # 3) Configure env (local only; use Space Secrets in prod)
- export HF_TOKEN="your_hugging_face_token"

  # 4) Run
  uvicorn app.main:app --host 0.0.0.0 --port 7860
  ```

- OpenAPI docs: http://localhost:7860/docs

  ---

- ## Deploy to Hugging Face Spaces

- 1) Push the repository to a new Space.
- 2) In **Settings → Secrets**, add:
-    * `HF_TOKEN` (required) – used by the upstream HF Inference client
-    * `ADMIN_TOKEN` (optional) – if set, private-gates `/v1/plan` and `/v1/chat`
- 3) Choose hardware. CPU is fine for tests; GPU recommended for larger models.
- 4) The Space will serve FastAPI on the default port; the two endpoints are ready.

- > For Inference Endpoints, mirror the same env and start command.

  ---

  ## Configuration

- All options can be set via environment variables (Space Secrets in HF) or `.env` for local use.

- | Variable | Default | Purpose |
- |---|---:|---|
- | `HF_TOKEN` | — | Token for Hugging Face Inference API (required) |
- | `MODEL_NAME` | `meta-llama/Meta-Llama-3.1-8B-Instruct` | Upstream model ID (example) |
- | `MAX_NEW_TOKENS` | `256` | Output token cap for plan generations |
- | `TEMPERATURE` | `0.2` | Generation temperature |
- | `RATE_LIMIT_PER_MIN` | `120` | Per-IP fixed-window limit |
- | `REQUEST_TIMEOUT_SEC` | `15` | HTTP client timeout to HF |
- | `RETRY_MAX_ATTEMPTS` | `3` | Retry budget to HF |
- | `CACHE_TTL_SEC` | `30` | Optional in-memory caching for GET |
- | `ADMIN_TOKEN` | — | If set, requires `Authorization: Bearer <ADMIN_TOKEN>` |
- | `LOG_LEVEL` | `INFO` | Log level (JSON logs) |

- > Names are illustrative; keep them in sync with your `configs/settings.yaml` if present.

  ---

@@ -128,24 +219,24 @@ All options can be set via environment variables (Space Secrets in HF) or `.env`

  ### `POST /v1/plan`

- **Description:** Generate a short, low-risk remediation plan from a compact app health context.

  **Headers**

  ```
  Content-Type: application/json
- Authorization: Bearer <ADMIN_TOKEN>   # required iff ADMIN_TOKEN set
  ```

- **Request body (example)**

  ```json
  {
    "context": {
      "entity_uid": "matrix-ai",
-     "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-09-27T00:00:00Z"},
      "recent_checks": [
-       {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-09-27T00:00:00Z"}
      ]
    },
    "constraints": {"max_steps": 3, "risk": "low"}
@@ -167,6 +258,7 @@ Authorization: Bearer <ADMIN_TOKEN> # required iff ADMIN_TOKEN set
  ```

  **Status codes**
  * `200` – plan generated
  * `400` – invalid payload (schema)
  * `401/403` – missing/invalid bearer (only if `ADMIN_TOKEN` configured)
@@ -175,25 +267,64 @@ Authorization: Bearer <ADMIN_TOKEN> # required iff ADMIN_TOKEN set

  ### `POST /v1/chat`

- *Optional, Stage-1 placeholder.* Given a query about MatrixHub, returns an answer with citations if a local KB is configured.

  ---

  ## Safety & Reliability

- * **PII redaction** – tokens/emails removed from prompts as a pre-filter
- * **Strict schema** – JSON plan parsing with fallbacks; rejects unsafe shapes
- * **Time-boxed** – short timeouts and bounded retries to HF Inference
- * **Rate-limited** – per-IP fixed window (configurable)
- * **Structured logs** – JSON logs only; no sensitive payloads are logged

  ---

  ## Observability

- * Request IDs (correlated across Guardian → AI)
- * Latency + retry counters
- * Plan success/failure metrics (prom-friendly if you expose metrics)

  ---

@@ -201,10 +332,15 @@ Authorization: Bearer <ADMIN_TOKEN> # required iff ADMIN_TOKEN set

  * Keep `/v1/plan` **internal** behind a network boundary or `ADMIN_TOKEN`.
  * Validate payloads rigorously (Pydantic) and write contract tests for the plan schema.
- * If you switch models, re-run golden tests to guard against plan drift.

  ---

  ## License

- Apache-2.0

  colorTo: indigo
  sdk: docker
  pinned: false
  ---

  # matrix-ai

+ **matrix-ai** is the AI planning microservice for the Matrix EcoSystem. It generates **short, low-risk, auditable remediation plans** from compact health context provided by **Matrix Guardian**, and also exposes a lightweight **RAG** Q&A over MatrixHub documents.
+
+ It is optimized for **Hugging Face Spaces / Inference Endpoints**, but also runs locally and in containers.

  > **Endpoints**
  >
  > * `POST /v1/plan` – internal API for Matrix Guardian: returns a safe JSON plan.
+ > * `POST /v1/chat` – Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
+ > * `GET /v1/chat/stream` – **SSE** token stream for interactive chat (production-hardened).
+ > * `POST /v1/chat/stream` – same as `GET` but with JSON payloads.

  The service emphasizes **safety, performance, and auditability**:

+ * Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
  * PII redaction before calling upstream model endpoints
+ * **Multi-provider LLM cascade:** **GROQ → Gemini → HF Router (Zephyr → Mistral)** with automatic failover
+ * Production-safe **SSE** streaming & middleware (no body buffering, trace IDs, CORS, gzip)
  * Exponential backoff, short timeouts, and structured JSON logs
+ * Per-IP rate limiting; optional `ADMIN_TOKEN` for private deployments
+ * RAG with SentenceTransformers (optional CrossEncoder re-ranker) over `data/kb.jsonl`
+ * ETag & response caching for non-mutating reads (where applicable)

+ *Last Updated: 2025-10-01 (UTC)*

  ---

  ```mermaid
  flowchart LR
+   subgraph Client [Matrix Operators / Observers]
+   end
+
+   Client -->|monitor| HubAPI[Matrix-Hub API]
+   Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
+   Guardian -->|/status,/apps,...| HubAPI
+   HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]
+
+   subgraph LLM [LLM Providers fallback cascade]
+     GROQ[Groq<br/>llama-3.1-8b-instant]
+     GEM[Google Gemini<br/>gemini-2.5-flash]
+     HF[Hugging Face Router<br/>Zephyr → Mistral]
+   end
+
+   AI -->|primary| GROQ
+   AI -->|fallback| GEM
+   AI -->|final| HF
+
+   classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
+   classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
+   class Guardian,AI,HubAPI svc
+   class DB db
+ ```
+
+ ### Sequence: `POST /v1/plan` (planning)
+
+ ```mermaid
+ sequenceDiagram
+   participant G as Matrix-Guardian
+   participant A as matrix-ai
+   participant P as Provider Cascade
+
+   G->>A: POST /v1/plan { context, constraints }
+   A->>A: redact PII, validate payload (schema)
+   A->>P: generate plan (timeouts, retries)
+   alt Provider available
+     P-->>A: model output text
+   else Provider unavailable/limited
+     P-->>A: fallback to next provider
+   end
+   A->>A: parse → strict JSON plan (safe defaults if needed)
+   A-->>G: 200 { plan_id, steps[], risk, explanation }
  ```

+ ### Sequence: `GET/POST /v1/chat/stream` (SSE chat)

  ```mermaid
  sequenceDiagram
+   participant C as Client (UI)
+   participant A as matrix-ai (SSE-safe middleware)
+   participant P as Provider Cascade
+
+   C->>A: GET /v1/chat/stream?query=...
+   A->>P: chat(messages, stream=True)
+   loop token chunks
+     P-->>A: delta (text)
+     A-->>C: SSE data: {"delta": "..."}
+   end
+   A-->>C: SSE data: [DONE]
  ```

  ---

  ## Quick Start (Local Development)
+
  ```bash
  # 1) Create venv
  python3 -m venv .venv

  pip install -r requirements.txt

  # 3) Configure env (local only; use Space Secrets in prod)
+ cp configs/.env.example configs/.env
+ # Edit configs/.env with your keys (do NOT commit):
+ # GROQ_API_KEY=...
+ # GOOGLE_API_KEY=...
+ # HF_TOKEN=...

  # 4) Run
  uvicorn app.main:app --host 0.0.0.0 --port 7860
  ```

+ OpenAPI docs: [http://localhost:7860/docs](http://localhost:7860/docs)

  ---

+ ## Provider Cascade (GROQ → Gemini → HF Router)
+
+ **matrix-ai** uses a production-ready multi-provider orchestrator:
+
+ 1. **Groq** (`llama-3.1-8b-instant`) – free, fast, great latency
+ 2. **Gemini** (`gemini-2.5-flash`) – free tier
+ 3. **HF Router** – `HuggingFaceH4/zephyr-7b-beta` → `mistralai/Mistral-7B-Instruct-v0.2`
+
+ Order is configurable via `provider_order`. Providers are skipped automatically if misconfigured or if quotas/credits are exceeded (see the sketch below).
+
+ **Streaming:** Groq streams true tokens; Gemini/HF may yield one chunk (normalized to SSE).
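
To make the failover concrete, here is a minimal sketch of the cascade loop. The helper names (`call_groq`, `call_gemini`, `call_hf_router`) are hypothetical stand-ins for the real SDK calls; only the skip-and-fall-through logic is the point:

```python
import logging
import os
from typing import Callable

log = logging.getLogger("matrix-ai.cascade")

def call_groq(prompt: str) -> str:
    # Stand-in: a real implementation would call the Groq SDK here.
    raise RuntimeError("groq quota exceeded")

def call_gemini(prompt: str) -> str:
    # Stand-in: a real implementation would call the Gemini SDK here.
    return f"(gemini) answer for: {prompt}"

def call_hf_router(prompt: str) -> str:
    # Stand-in: a real implementation would call the HF Inference Router here.
    return f"(hf-router) answer for: {prompt}"

PROVIDERS: dict[str, Callable[[str], str]] = {
    "groq": call_groq,
    "gemini": call_gemini,
    "router": call_hf_router,
}

def generate(prompt: str) -> str:
    """Try providers in PROVIDER_ORDER; skip any that are missing or failing."""
    order = os.getenv("PROVIDER_ORDER", "groq,gemini,router").split(",")
    last_err: Exception | None = None
    for name in (n.strip() for n in order):
        provider = PROVIDERS.get(name)
        if provider is None:
            continue  # unknown name in the order string: skip it
        try:
            return provider(prompt)
        except Exception as err:  # quota, auth, timeout, ...
            log.warning("provider %r failed (%s); falling back", name, err)
            last_err = err
    raise RuntimeError("all providers failed") from last_err

print(generate("restart the degraded service?"))  # falls back groq -> gemini
```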

  ---

  ## Configuration

+ All options can be set via environment variables (Space Secrets in HF), `.env` for local use, and/or `configs/settings.yaml`.
+
+ ### `configs/settings.yaml` (excerpt)
+
+ ```yaml
+ model:
+   # HF router defaults (used at the last step)
+   name: "HuggingFaceH4/zephyr-7b-beta"
+   fallback: "mistralai/Mistral-7B-Instruct-v0.2"
+   provider: "featherless-ai"
+   max_new_tokens: 256
+   temperature: 0.2
+
+   # Provider-specific defaults (free-tier friendly)
+   groq_model: "llama-3.1-8b-instant"
+   gemini_model: "gemini-2.5-flash"
+
+   # Try providers in this order
+   provider_order:
+     - groq
+     - gemini
+     - router
+
+   # Switch to the multi-provider path
+   chat_backend: "multi"
+   chat_stream: true
+
+ limits:
+   rate_per_min: 60
+   cache_size: 256
+
+ rag:
+   index_dataset: ""
+   top_k: 4
+
+ matrixhub:
+   base_url: "https://api.matrixhub.io"
+
+ security:
+   admin_token: ""
+ ```
+
+ ### Environment variables
+
+ | Variable | Default | Purpose |
+ |---|---:|---|
+ | `GROQ_API_KEY` | — | API key for Groq (primary) |
+ | `GOOGLE_API_KEY` | — | API key for Gemini |
+ | `HF_TOKEN` | — | Token for Hugging Face Inference Router |
+ | `GROQ_MODEL` | `llama-3.1-8b-instant` | Override Groq model |
+ | `GEMINI_MODEL` | `gemini-2.5-flash` | Override Gemini model |
+ | `MODEL_NAME` | `HuggingFaceH4/zephyr-7b-beta` | HF Router primary model |
+ | `MODEL_FALLBACK` | `mistralai/Mistral-7B-Instruct-v0.2` | HF Router fallback |
+ | `MODEL_PROVIDER` | `featherless-ai` | HF provider tag (`model:provider`) |
+ | `PROVIDER_ORDER` | `groq,gemini,router` | Comma-separated cascade order |
+ | `CHAT_STREAM` | `true` | Enable streaming where available |
+ | `RATE_LIMITS` | `60` | Per-IP requests/min (middleware) |
+ | `ADMIN_TOKEN` | — | Gate `/v1/plan` & `/v1/chat*` (Bearer) |
+ | `RAG_KB_PATH` | `data/kb.jsonl` | Path to KB (if present) |
+ | `RAG_RERANK` | `true` | Enable CrossEncoder re-ranker (GPU-aware) |
+ | `LOG_LEVEL` | `INFO` | Structured JSON logs level |
+
+ > Never commit real API keys. Use Space Secrets / Vault in production.
 
216
  ---
217
 
 
219
 
220
  ### `POST /v1/plan`
221
 
222
+ **Description:** Generate a short, low-risk remediation plan from a compact app health context.
223
 
224
  **Headers**
225
 
226
  ```
227
  Content-Type: application/json
228
+ Authorization: Bearer <ADMIN_TOKEN> # required if ADMIN_TOKEN set
229
  ```
230
 
231
+ **Request (example)**
232
 
233
  ```json
234
  {
235
  "context": {
236
  "entity_uid": "matrix-ai",
237
+ "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
238
  "recent_checks": [
239
+ {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
240
  ]
241
  },
242
  "constraints": {"max_steps": 3, "risk": "low"}
 
258
  ```
259
 
260
  **Status codes**
261
+
262
  * `200` – plan generated
263
  * `400` – invalid payload (schema)
264
  * `401/403` – missing/invalid bearer (only if `ADMIN_TOKEN` configured)
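
A minimal client call against the endpoint above, assuming the service runs locally on port 7860; the response fields follow the planning sequence diagram:

```python
import requests  # pip install requests

resp = requests.post(
    "http://localhost:7860/v1/plan",
    headers={"Authorization": "Bearer <ADMIN_TOKEN>"},  # omit if no ADMIN_TOKEN is set
    json={
        "context": {
            "entity_uid": "matrix-ai",
            "health": {"score": 0.64, "status": "degraded",
                       "last_checked": "2025-10-01T00:00:00Z"},
            "recent_checks": [
                {"check": "http", "result": "fail", "latency_ms": 900,
                 "ts": "2025-10-01T00:00:00Z"},
            ],
        },
        "constraints": {"max_steps": 3, "risk": "low"},
    },
    timeout=30,
)
resp.raise_for_status()
plan = resp.json()  # { plan_id, steps[], risk, explanation }
print(plan["plan_id"], plan["risk"])
for step in plan["steps"]:
    print("-", step)
```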
 
  ### `POST /v1/chat`

+ Given a query about MatrixHub, returns an answer with citations **if** a local KB is configured at `RAG_KB_PATH`. Uses the same provider cascade.
+
+ ### `GET /v1/chat/stream` & `POST /v1/chat/stream`
+
+ Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (`Cache-Control: no-cache`, `X-Trace-Id`, `X-Process-Time-Ms`, `Server-Timing`).
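
A sketch of consuming the stream from Python, based on the SSE frames shown in the sequence diagram (`data: {"delta": ...}` frames terminated by `data: [DONE]`):

```python
import json

import requests  # pip install requests

# Stream the SSE endpoint and print token deltas as they arrive.
with requests.get(
    "http://localhost:7860/v1/chat/stream",
    params={"query": "What is MatrixHub?"},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw or not raw.startswith("data:"):
            continue  # skip blank separators / keep-alives
        payload = raw[len("data:"):].strip()
        if payload == "[DONE]":
            break
        print(json.loads(payload)["delta"], end="", flush=True)
```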

  ---

  ## Safety & Reliability

+ * **PII redaction** – tokens/emails removed from prompts as a pre-filter (see the sketch below)
+ * **Strict schema** – JSON plan parsing with safe defaults; rejects unsafe shapes
+ * **Time-boxed** – short timeouts and bounded retries to providers
+ * **Rate-limited** – per-IP fixed window (configurable)
+ * **Structured logs** – JSON logs with `trace_id` for correlation
+ * **SSE-safe middleware** – never consumes streaming bodies; avoids Starlette “No response returned” pitfalls
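
The redaction pre-filter could look like the following sketch. The patterns are illustrative (email addresses plus HF/Groq/Google-style key prefixes), not the service's exact rules:

```python
import re

# Illustrative patterns only; the service's actual redaction rules may differ.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TOKEN = re.compile(r"\b(?:hf_|gsk_|AIza)[A-Za-z0-9_-]{10,}\b")  # HF/Groq/Google-style keys

def redact(text: str) -> str:
    """Replace emails and API-key-looking strings before prompting a model."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return TOKEN.sub("[REDACTED_TOKEN]", text)

assert "hf_" not in redact("token hf_abcdefghijklmnop and bob@example.com")
```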
+
+ ---
+
+ ## RAG (Optional)
+
+ * **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (GPU-aware)
+ * **Re-ranking:** optional `cross-encoder/ms-marco-MiniLM-L-2-v2` (GPU-aware)
+ * **KB:** `data/kb.jsonl` (one JSON object per line: `{ "text": "...", "source": "..." }`)
+ * **Tunable:** `rag.top_k`, `RAG_RERANK`, `RAG_KB_PATH` (a retrieval sketch follows)
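
A compact sketch of the retrieval path under these defaults (MiniLM embeddings over `data/kb.jsonl`, cosine ranking); the optional CrossEncoder re-rank step is omitted:

```python
import json
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Load the KB: one JSON object per line, e.g. {"text": "...", "source": "..."}
docs = [json.loads(line)
        for line in Path("data/kb.jsonl").read_text().splitlines()
        if line.strip()]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = model.encode([d["text"] for d in docs], normalize_embeddings=True)

def retrieve(query: str, top_k: int = 4) -> list[dict]:
    """Return the top_k KB entries by cosine similarity (vectors are normalized)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    best = np.argsort(-scores)[:top_k]
    return [docs[i] | {"score": float(scores[i])} for i in best]

print(retrieve("How do I register an app in MatrixHub?"))
```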
+
+ ---
+
+ ## Deployments
+
+ ### Hugging Face Spaces (recommended for demo)
+
+ 1. Push the repo to a new **Space** (FastAPI).
+ 2. In **Settings → Secrets**, add:
+    * `GROQ_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN` (as needed by the cascade)
+    * `ADMIN_TOKEN` (optional; gates `/v1/plan` & `/v1/chat*`)
+ 3. Choose hardware (CPU is fine; GPU improves RAG throughput and the cross-encoder).
+ 4. The Space runs `uvicorn` and exposes all endpoints.
+
+ ### Containers / Cloud
+
+ * Use a minimal Python base image; install with `pip install -r requirements.txt`.
+ * Expose port `7860` (configurable).
+ * Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
+ * Scale with multiple Uvicorn workers (see the entrypoint sketch below); put behind an HTTP proxy that supports streaming (e.g., nginx with `proxy_buffering off` for SSE).
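
For a programmatic container entrypoint, equivalent to the CLI above (`PORT` and `WEB_CONCURRENCY` are illustrative env names, not ones the repo necessarily defines):

```python
import os

import uvicorn  # pip install uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",               # import string is required when workers > 1
        host="0.0.0.0",
        port=int(os.getenv("PORT", "7860")),
        workers=int(os.getenv("WEB_CONCURRENCY", "2")),  # scale via env in containers
    )
```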

  ---

  ## Observability

+ * **Trace IDs** (`X-Trace-Id`) attached per request and logged
+ * **Timing headers:** `X-Process-Time-Ms`, `Server-Timing`
+ * Provider selection logs (e.g., `Provider 'groq' succeeded in 0.82s`)
+ * Metrics endpoints can be added behind an auth wall (Prometheus-friendly); a middleware sketch follows
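
An illustrative FastAPI middleware that attaches the trace and timing headers above without touching response bodies (so SSE keeps streaming); a sketch, not the service's exact code:

```python
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def observability_headers(request: Request, call_next):
    """Attach a trace ID and timing headers without buffering the body."""
    trace_id = request.headers.get("X-Trace-Id") or uuid.uuid4().hex
    start = time.perf_counter()
    response = await call_next(request)  # body streams through untouched
    elapsed_ms = (time.perf_counter() - start) * 1000  # time to response start
    response.headers["X-Trace-Id"] = trace_id
    response.headers["X-Process-Time-Ms"] = f"{elapsed_ms:.1f}"
    response.headers["Server-Timing"] = f"app;dur={elapsed_ms:.1f}"
    return response
```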

  ---

  * Keep `/v1/plan` **internal** behind a network boundary or `ADMIN_TOKEN`.
  * Validate payloads rigorously (Pydantic) and write contract tests for the plan schema.
+ * If you switch models, re-run golden tests to guard against plan drift.
+ * Avoid logging sensitive data; logs are structured JSON only.

  ---

  ## License

+ Apache-2.0
+
+ ---
+
+ **Tip:** The cascade order is controlled by `provider_order` (`groq,gemini,router`). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.