ArunKr committed
Commit 024301f · verified · 1 Parent(s): 9849ff1

Upload folder using huggingface_hub
.env.example CHANGED
@@ -21,3 +21,6 @@ SSH_KNOWN_HOSTS=
 # Optional: Provider CLI auth (prefer env vars, not files)
 GEMINI_API_KEY=
 ANTHROPIC_API_KEY=
+
+# Optional: GitHub indexing (private repos / higher rate limits)
+GITHUB_TOKEN=
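For reference, the backend reads the token with a simple fallback chain (`GITHUB_TOKEN` first, then `GITHUB_PAT`); a minimal sketch of how the commit's `_github_headers` helper assembles GitHub API headers from these variables:

```python
import os


def github_headers() -> dict[str, str]:
    # Prefer GITHUB_TOKEN; fall back to GITHUB_PAT. Without a token,
    # requests still work for public repos, just at lower rate limits.
    token = (os.environ.get("GITHUB_TOKEN") or os.environ.get("GITHUB_PAT") or "").strip()
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "autonomy-labs/1.0",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

In deployment, both variables are supplied as HF Secrets rather than committed to `.env`.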
PLANS.md CHANGED
@@ -3,6 +3,7 @@
 This file is the repo-level roadmap for `autonomy-labs`. It’s intentionally opinionated and ordered by risk reduction first, then maintainability, then feature expansion.

 ## P0 — Security + correctness (blockers)
+
 - Gate **all dangerous endpoints** server-side (not just UI):
   - `/ws/terminal`
   - `/api/codex*`
@@ -14,6 +15,7 @@
 - Add `SECURITY.md` with threat model + safe deployment guidance.

 ## P1 — Backend refactor + lifecycle
+
 - Split `main.py` into routers/services:
   - `app/auth.py`, `app/chat.py`, `app/terminal.py`, `app/codex.py`, `app/mcp.py`, `app/settings.py`, `app/admin.py`, `app/indexing.py`
 - Add FastAPI lifespan management:
@@ -23,6 +25,7 @@
 - Standardize API error schema (UI should not parse strings to detect failure modes).

 ## P2 — UI/UX, settings, admin, landing
+
 - Split `static/dashboard.html` into modules:
   - `static/dashboard.js`, `static/terminal.js`, `static/agent.js`, `static/settings.js`, `static/admin.js`, `static/mcp.js`, `static/rag.js`
   - `static/theme.css`
@@ -38,6 +41,7 @@
   - keep `/login` and `/app` as dedicated routes (or similar)

 ## P2 — Provider auth parity (Codex/Gemini/Claude)
+
 - Keep provider auth out of git; source from env/HF Secrets.
 - Support “Codex-like” auth file generation when a CLI requires it:
   - Codex: `~/.codex/.auth.json` and `~/.codex/auth.json` from `CODEX_*` (or fallback envs).
@@ -46,21 +50,25 @@
   - `SSH_PRIVATE_KEY` (+ optional `SSH_PUBLIC_KEY`, `SSH_KNOWN_HOSTS`)

 ## P2 — Codex workspace directory (UI)
+
 - Add a per-user “workspace directory” setting.
 - Enforce an allowlisted root (e.g. `/data/codex/workspace/<user>`), prevent traversal, ensure it exists.

 ## P2 — Stream Codex events in Agent mode
+
 - Use `/api/codex/cli/stream` for agent execution.
 - UI: render streaming events progressively (agent text, tool events, final summary + usage).
 - Add stop/reconnect handling.

 ## P2/P3 — MCP registry
+
 - Add a first-class MCP registry:
   - per-user servers + optional global templates
   - “test connection”, “list tools”, allow/deny tool lists
   - import/export `mcp.json`

 ## P3 — RAG + indexing (docs/web/GitHub) + “password manager”
+
 - Clarify “password manager” scope:
   - secure vault for secrets (high-risk; encryption + audit required), or
   - indexed notes (lower-risk but still private)
@@ -73,6 +81,7 @@
 Note: see `docs/PASSWORD_MANAGER_SCOPE.md` for the current (non-vault) stance and recommended path forward.

 ## P3 — P2P pubsub chat + account manager
+
 - Implement account manager concepts:
   - identities/devices, room/topic membership, permissions, moderation tools
 - Transport:
@@ -82,7 +91,12 @@
   - rooms, presence, delivery status, network mode indicators

 ## Engineering hygiene (ongoing)
+
 - Add `.env.example`, `docs/TROUBLESHOOTING.md`, `docs/ARCHITECTURE.md`, `docs/SECURITY_DEPLOYMENT.md`
 - Add lint/tests + CI:
   - Python: `ruff`, `pytest`
   - basic security smoke tests for endpoint gating
+
+## Feature suggestions (By User)
+
+- Support GitHub token auth via HF Secrets (`GITHUB_TOKEN`/`GITHUB_PAT`) and document it in `.env.example`.
TASKS.md CHANGED
@@ -43,7 +43,7 @@ Legend:
 ## P2/P3 — MCP registry
 - [~] First-class MCP registry storage (per-user persistence via backend).
 - [~] Admin-managed MCP templates (server-side persisted).
-- [~] “Test connection” (browser/CORS; SSRF-safe server proxy pending).
+- [x] “Test connection” (server-side, SSRF-safe).
 - [x] “List tools” (via `/api/mcp/tools`).
 - [x] Tool allow/deny policy (UI + server-side enforcement).
 - [x] Import/export `mcp.json` via UI with validation.
@@ -52,7 +52,7 @@ Legend:
 - [x] Clarify “password manager” scope and threat model (`docs/PASSWORD_MANAGER_SCOPE.md`).
 - [x] Document upload indexing connector (MVP: text-only, keyword search).
 - [~] Website crawler indexing (MVP: same-origin crawl with depth/pages, basic robots, private-host blocking).
-- [ ] GitHub repo indexing connector (branch/path filters + token support).
+- [~] GitHub repo indexing connector (MVP: owner/repo + ref + path prefix; token via env).
 - [~] Jobs UI (MVP: start/cancel/list crawl jobs).

 ## P3 — P2P pubsub chat + account manager
app/indexing_jobs.py CHANGED
@@ -2,8 +2,8 @@ from __future__ import annotations
 import asyncio
 import json
+import os
 import re
-import socket
 import uuid
 from dataclasses import asdict, dataclass, field
 from datetime import UTC, datetime
@@ -15,6 +15,7 @@ from urllib.parse import urljoin, urlparse
 import httpx
 from fastapi import HTTPException

+from app.net_safety import is_public_host, validate_public_http_url
 from app.storage import user_data_dir

@@ -26,54 +27,8 @@ def _now_iso() -> str:
     return datetime.now(UTC).strftime("%Y-%m-%dT%H:%M:%SZ")


-def _is_public_host(hostname: str) -> bool:
-    host = (hostname or "").strip().lower()
-    if not host:
-        return False
-    if host in {"localhost", "localhost.localdomain"}:
-        return False
-    if host.endswith(".local") or host.endswith(".internal"):
-        return False
-
-    try:
-        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
-    except Exception:
-        return False
-
-    import ipaddress
-
-    for info in infos:
-        addr = info[4][0]
-        try:
-            ip = ipaddress.ip_address(addr)
-        except Exception:
-            return False
-        if (
-            ip.is_private
-            or ip.is_loopback
-            or ip.is_link_local
-            or ip.is_multicast
-            or ip.is_reserved
-            or ip.is_unspecified
-        ):
-            return False
-    return True
-
-
 def _normalize_url(url: str) -> str:
-    u = (url or "").strip()
-    if not u:
-        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Missing URL"})
-    p = urlparse(u)
-    if p.scheme not in {"https", "http"}:
-        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "URL must be http(s)"})
-    if not p.netloc:
-        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "URL must include a host"})
-    if not _is_public_host(p.hostname or ""):
-        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Host is not allowed"})
-    # Normalize: strip fragment, keep query.
-    normalized = p._replace(fragment="").geturl()
-    return normalized
+    return validate_public_http_url(url)


 def _extract_links(html: str) -> list[str]:
@@ -174,6 +129,68 @@ def add_rag_document(user_id: str, *, name: str, text: str, source: str | None = None):
     return {"id": doc_id, "chunks": len(chunks)}


+def _parse_github_repo(repo: str) -> tuple[str, str]:
+    raw = (repo or "").strip()
+    if not raw:
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Missing repo"})
+
+    if raw.startswith(("https://", "http://")):
+        p = urlparse(raw)
+        host = (p.hostname or "").lower()
+        if host != "github.com":
+            raise HTTPException(
+                status_code=400,
+                detail={"code": "invalid_request", "message": "Repo host must be github.com"},
+            )
+        parts = [x for x in (p.path or "").split("/") if x]
+        if len(parts) < 2:
+            raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Invalid repo URL"})
+        owner, name = parts[0], parts[1]
+    else:
+        if "/" not in raw:
+            raise HTTPException(
+                status_code=400,
+                detail={"code": "invalid_request", "message": "Repo must be owner/name"},
+            )
+        owner, name = raw.split("/", 1)
+
+    owner = owner.strip()
+    name = name.strip()
+    if name.endswith(".git"):
+        name = name[:-4]
+
+    if not re.fullmatch(r"[A-Za-z0-9_.-]{1,100}", owner or ""):
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Invalid owner"})
+    if not re.fullmatch(r"[A-Za-z0-9_.-]{1,100}", name or ""):
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Invalid repo name"})
+    return owner, name
+
+
+def _github_headers() -> dict[str, str]:
+    token = (os.environ.get("GITHUB_TOKEN") or os.environ.get("GITHUB_PAT") or "").strip()
+    headers = {
+        "Accept": "application/vnd.github+json",
+        "User-Agent": "autonomy-labs/1.0",
+    }
+    if token:
+        headers["Authorization"] = f"Bearer {token}"
+    return headers
+
+
+def _looks_binary(data: bytes) -> bool:
+    if not data:
+        return False
+    if b"\x00" in data[:4096]:
+        return True
+    return False
+
+
+_GITHUB_TEXT_FILE_RE = re.compile(
+    r"\.(md|markdown|txt|rst|py|js|ts|jsx|tsx|json|yaml|yml|toml|go|rs|java|kt|c|cc|cpp|h|hpp|sh)$",
+    re.IGNORECASE,
+)
+
+
 @dataclass
 class IndexJob:
     id: str
@@ -282,6 +299,43 @@ class IndexJobStore:
         self._task_map(user_id)[job.id] = task
         return job

+    async def create_github_repo_job(
+        self,
+        user_id: str,
+        *,
+        repo: str,
+        ref: str | None = None,
+        path_prefix: str | None = None,
+        max_files: int = 60,
+        max_file_bytes: int = 200_000,
+        max_total_bytes: int = 2_000_000,
+    ) -> IndexJob:
+        owner, name = _parse_github_repo(repo)
+        ref_s = (ref or "").strip() or None
+        prefix = (path_prefix or "").strip().lstrip("/")
+        max_files = max(1, min(int(max_files), 400))
+        max_file_bytes = max(1_000, min(int(max_file_bytes), 1_000_000))
+        max_total_bytes = max(10_000, min(int(max_total_bytes), 15_000_000))
+
+        job = IndexJob(
+            id=str(uuid.uuid4()),
+            type="github_repo",
+            createdAt=_now_iso(),
+            params={
+                "owner": owner,
+                "repo": name,
+                "ref": ref_s,
+                "pathPrefix": prefix,
+                "maxFiles": max_files,
+                "maxFileBytes": max_file_bytes,
+                "maxTotalBytes": max_total_bytes,
+            },
+        )
+        await self._update_job(user_id, job)
+        task = asyncio.create_task(self._run_github_repo(user_id, job))
+        self._task_map(user_id)[job.id] = task
+        return job
+
     async def _run_web_crawl(self, user_id: str, job: IndexJob) -> None:
         job.status = "running"
         job.progress = {"visited": 0, "indexedPages": 0, "queued": 0}
@@ -300,7 +354,11 @@ class IndexJobStore:
         robots_disallow_all = False
         if respect_robots:
             try:
-                async with httpx.AsyncClient(timeout=10.0, follow_redirects=False, headers={"User-Agent": "autonomy-labs/1.0"}) as c:
+                async with httpx.AsyncClient(
+                    timeout=10.0,
+                    follow_redirects=False,
+                    headers={"User-Agent": "autonomy-labs/1.0"},
+                ) as c:
                     r = await c.get(f"{base}/robots.txt")
                     if r.status_code == 200:
                         txt = (r.text or "")
@@ -331,7 +389,11 @@ class IndexJobStore:
         visited: set[str] = set()
         pages: list[tuple[str, str]] = []

-        async with httpx.AsyncClient(timeout=15.0, follow_redirects=False, headers={"User-Agent": "autonomy-labs/1.0"}) as client:
+        async with httpx.AsyncClient(
+            timeout=15.0,
+            follow_redirects=False,
+            headers={"User-Agent": "autonomy-labs/1.0"},
+        ) as client:
             try:
                 while queue and len(visited) < max_pages:
                     url, depth = queue.pop(0)
@@ -346,7 +408,7 @@ class IndexJobStore:
                         continue
                     if parsed.netloc != allowed_netloc:
                         continue
-                    if not _is_public_host(parsed.hostname or ""):
+                    if not is_public_host(parsed.hostname or ""):
                         continue

                     resp = await client.get(url)
@@ -416,3 +478,170 @@ class IndexJobStore:
         job.result = {"pages": len(pages), "ragDoc": result}
         job.progress = {"visited": len(visited), "indexedPages": len(pages), "queued": 0}
         await self._update_job(user_id, job)
+
+    async def _run_github_repo(self, user_id: str, job: IndexJob) -> None:
+        job.status = "running"
+        job.progress = {"files": 0, "indexedFiles": 0, "bytes": 0}
+        await self._update_job(user_id, job)
+
+        owner = str(job.params.get("owner") or "")
+        repo = str(job.params.get("repo") or "")
+        ref = (job.params.get("ref") or "").strip() or None
+        prefix = (job.params.get("pathPrefix") or "").strip().lstrip("/")
+        max_files = int(job.params.get("maxFiles") or 60)
+        max_file_bytes = int(job.params.get("maxFileBytes") or 200_000)
+        max_total_bytes = int(job.params.get("maxTotalBytes") or 2_000_000)
+
+        api_base = "https://api.github.com"
+        headers = _github_headers()
+
+        def api_url(path: str) -> str:
+            return f"{api_base}{path}"
+
+        files_text: list[tuple[str, str]] = []
+        total_bytes = 0
+
+        async with httpx.AsyncClient(timeout=20.0, follow_redirects=False) as client:
+            try:
+                # Resolve default branch if ref is not provided.
+                if not ref:
+                    r = await client.get(api_url(f"/repos/{owner}/{repo}"), headers=headers)
+                    if r.status_code == 404:
+                        raise HTTPException(status_code=404, detail={"code": "not_found", "message": "Repo not found"})
+                    if r.status_code == 401 or r.status_code == 403:
+                        raise HTTPException(
+                            status_code=401,
+                            detail={"code": "unauthorized", "message": "GitHub auth required"},
+                        )
+                    r.raise_for_status()
+                    ref = str(r.json().get("default_branch") or "main")
+
+                # Resolve commit -> tree sha.
+                c = await client.get(api_url(f"/repos/{owner}/{repo}/commits/{ref}"), headers=headers)
+                if c.status_code == 404:
+                    raise HTTPException(status_code=404, detail={"code": "not_found", "message": "Ref not found"})
+                if c.status_code == 401 or c.status_code == 403:
+                    raise HTTPException(
+                        status_code=401,
+                        detail={"code": "unauthorized", "message": "GitHub auth required"},
+                    )
+                c.raise_for_status()
+                commit = c.json()
+                tree_sha = ((commit.get("commit") or {}).get("tree") or {}).get("sha")
+                if not tree_sha:
+                    raise HTTPException(
+                        status_code=500,
+                        detail={"code": "github_error", "message": "Failed to resolve tree"},
+                    )
+
+                t = await client.get(
+                    api_url(f"/repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1"),
+                    headers=headers,
+                )
+                if t.status_code == 401 or t.status_code == 403:
+                    raise HTTPException(
+                        status_code=401,
+                        detail={"code": "unauthorized", "message": "GitHub auth required"},
+                    )
+                t.raise_for_status()
+                tree = t.json().get("tree")
+                if not isinstance(tree, list):
+                    raise HTTPException(
+                        status_code=500,
+                        detail={"code": "github_error", "message": "Invalid tree response"},
+                    )
+
+                candidates = []
+                for node in tree:
+                    if not isinstance(node, dict):
+                        continue
+                    if node.get("type") != "blob":
+                        continue
+                    path = str(node.get("path") or "")
+                    if not path:
+                        continue
+                    if prefix and not path.startswith(prefix.rstrip("/") + "/") and path != prefix.rstrip("/"):
+                        continue
+                    size = int(node.get("size") or 0)
+                    if size <= 0 or size > max_file_bytes:
+                        continue
+                    # Simple extension filter.
+                    lower = path.lower()
+                    if not _GITHUB_TEXT_FILE_RE.search(lower):
+                        continue
+                    candidates.append((path, size))
+
+                candidates.sort(key=lambda x: x[1])
+                candidates = candidates[:max_files]
+
+                job.progress = {"files": len(candidates), "indexedFiles": 0, "bytes": 0}
+                await self._update_job(user_id, job)
+
+                for i, (path, size) in enumerate(candidates, start=1):
+                    if total_bytes + size > max_total_bytes:
+                        break
+                    # Contents endpoint; request raw bytes.
+                    content_url = api_url(f"/repos/{owner}/{repo}/contents/{path}")
+                    r = await client.get(
+                        content_url,
+                        headers={**headers, "Accept": "application/vnd.github.raw"},
+                        params={"ref": ref},
+                    )
+                    if r.status_code == 404:
+                        continue
+                    if r.status_code == 401 or r.status_code == 403:
+                        raise HTTPException(
+                            status_code=401,
+                            detail={"code": "unauthorized", "message": "GitHub auth required"},
+                        )
+                    r.raise_for_status()
+                    data = r.content[: max_file_bytes + 1]
+                    if len(data) > max_file_bytes:
+                        continue
+                    if _looks_binary(data):
+                        continue
+                    text = data.decode("utf-8", errors="ignore").strip()
+                    if not text:
+                        continue
+                    files_text.append((path, text))
+                    total_bytes += len(data)
+                    job.progress = {"files": len(candidates), "indexedFiles": len(files_text), "bytes": total_bytes}
+                    await self._update_job(user_id, job)
+                    await asyncio.sleep(0.05)
+
+            except asyncio.CancelledError:
+                job.status = "canceled"
+                job.error = None
+                await self._update_job(user_id, job)
+                return
+            except HTTPException as e:
+                job.status = "failed"
+                job.error = str((e.detail or {}).get("message") or e.detail or "GitHub indexing failed")
+                await self._update_job(user_id, job)
+                return
+            except Exception as e:
+                job.status = "failed"
+                job.error = str(e)
+                await self._update_job(user_id, job)
+                return
+
+        if not files_text:
+            job.status = "failed"
+            job.error = "No indexable files found"
+            await self._update_job(user_id, job)
+            return
+
+        combined = []
+        for path, text in files_text:
+            combined.append(f"FILE: {path}\n\n{text}\n\n---\n")
+        combined_text = "\n".join(combined).strip()
+        result = add_rag_document(
+            user_id,
+            name=f"GitHub: {owner}/{repo}@{ref}",
+            text=combined_text,
+            source=f"https://github.com/{owner}/{repo}",
+        )
+        job.status = "succeeded"
+        job.result = {"files": len(files_text), "ragDoc": result}
+        job.progress = {"files": job.progress.get("files", 0), "indexedFiles": len(files_text), "bytes": total_bytes}
+        await self._update_job(user_id, job)
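The new `_parse_github_repo` helper accepts both `owner/name` shorthand and full `github.com` URLs, and normalizes a trailing `.git`. A standalone sketch of the same normalization rules (raising `ValueError` instead of `HTTPException` so it runs without FastAPI):

```python
import re
from urllib.parse import urlparse


def parse_github_repo(repo: str) -> tuple[str, str]:
    # Accept "owner/name" or "https://github.com/owner/name[.git]".
    raw = (repo or "").strip()
    if not raw:
        raise ValueError("missing repo")
    if raw.startswith(("https://", "http://")):
        p = urlparse(raw)
        if (p.hostname or "").lower() != "github.com":
            raise ValueError("repo host must be github.com")
        parts = [x for x in (p.path or "").split("/") if x]
        if len(parts) < 2:
            raise ValueError("invalid repo URL")
        owner, name = parts[0], parts[1]
    elif "/" in raw:
        owner, name = raw.split("/", 1)
    else:
        raise ValueError("repo must be owner/name")
    owner, name = owner.strip(), name.strip()
    if name.endswith(".git"):
        name = name[:-4]  # drop the clone suffix
    pattern = r"[A-Za-z0-9_.-]{1,100}"
    if not re.fullmatch(pattern, owner) or not re.fullmatch(pattern, name):
        raise ValueError("invalid owner/name")
    return owner, name
```

Both input forms resolve to the same `(owner, name)` pair, which is what the job params store.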
app/net_safety.py ADDED
@@ -0,0 +1,62 @@
+from __future__ import annotations
+
+import socket
+from urllib.parse import urlparse
+
+from fastapi import HTTPException
+
+
+def is_public_host(hostname: str) -> bool:
+    host = (hostname or "").strip().lower()
+    if not host:
+        return False
+    if host in {"localhost", "localhost.localdomain"}:
+        return False
+    if host.endswith(".local") or host.endswith(".internal"):
+        return False
+
+    try:
+        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
+    except Exception:
+        return False
+
+    import ipaddress
+
+    for info in infos:
+        addr = info[4][0]
+        try:
+            ip = ipaddress.ip_address(addr)
+        except Exception:
+            return False
+        if (
+            ip.is_private
+            or ip.is_loopback
+            or ip.is_link_local
+            or ip.is_multicast
+            or ip.is_reserved
+            or ip.is_unspecified
+        ):
+            return False
+    return True
+
+
+def validate_public_http_url(
+    url: str,
+    *,
+    allowed_hosts: set[str] | None = None,
+) -> str:
+    u = (url or "").strip()
+    if not u:
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Missing URL"})
+    p = urlparse(u)
+    if p.scheme not in {"https", "http"}:
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "URL must be http(s)"})
+    if not p.netloc:
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "URL must include a host"})
+    host = (p.hostname or "").strip().lower()
+    if allowed_hosts is not None and host not in allowed_hosts:
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Host is not allowed"})
+    if not is_public_host(host):
+        raise HTTPException(status_code=400, detail={"code": "invalid_request", "message": "Host is not allowed"})
+    return p._replace(fragment="").geturl()
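The SSRF filter in `is_public_host` resolves the hostname and then classifies each address with stdlib `ipaddress` flags. A minimal sketch of just the per-address check, applied to already-resolved IPs (no DNS), to illustrate which ranges get rejected:

```python
import ipaddress


def ip_is_public(addr: str) -> bool:
    # Reject the same address classes app/net_safety.py filters
    # after DNS resolution: private, loopback, link-local, multicast,
    # reserved, and unspecified ranges.
    ip = ipaddress.ip_address(addr)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

So `127.0.0.1`, `10.0.0.1`, and `169.254.1.1` are all rejected, while ordinary public addresses pass. Note the real helper returns `False` on any resolution failure, failing closed.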
app/routes/indexing.py CHANGED
@@ -17,6 +17,15 @@ class WebCrawlRequest(BaseModel):
     respectRobots: bool = True


+class GitHubRepoRequest(BaseModel):
+    repo: str  # owner/name or https://github.com/owner/name
+    ref: str | None = None  # branch, tag, or sha
+    pathPrefix: str | None = None  # optional subdirectory
+    maxFiles: int = 60
+    maxFileBytes: int = 200_000
+    maxTotalBytes: int = 2_000_000
+
+
 @router.get("/api/indexing/jobs")
 async def list_indexing_jobs(http_request: Request):
     if not feature_enabled("indexing"):
@@ -48,6 +57,25 @@ async def start_web_crawl(body: WebCrawlRequest, http_request: Request):
     return {"ok": True, "job": job.__dict__}


+@router.post("/api/indexing/jobs/github-repo")
+async def start_github_repo(body: GitHubRepoRequest, http_request: Request):
+    if not feature_enabled("indexing"):
+        raise HTTPException(status_code=403, detail={"code": "feature_disabled", "message": "Indexing is disabled"})
+    user = await require_user_from_request(http_request)
+    user_id = str(user.get("id") or "")
+    store = http_request.app.state.index_job_store
+    job = await store.create_github_repo_job(
+        user_id,
+        repo=body.repo,
+        ref=body.ref,
+        path_prefix=body.pathPrefix,
+        max_files=body.maxFiles,
+        max_file_bytes=body.maxFileBytes,
+        max_total_bytes=body.maxTotalBytes,
+    )
+    return {"ok": True, "job": job.__dict__}
+
+
 @router.post("/api/indexing/jobs/{job_id}/cancel")
 async def cancel_job(job_id: str, http_request: Request):
     if not feature_enabled("indexing"):
@@ -57,4 +85,3 @@ async def cancel_job(job_id: str, http_request: Request):
     store = http_request.app.state.index_job_store
     ok = await store.cancel_job(user_id, job_id)
     return {"ok": True, "canceled": ok}
-
app/routes/mcp.py CHANGED
@@ -1,10 +1,14 @@
 from __future__ import annotations

+from time import monotonic
+
+import httpx
 from fastapi import APIRouter, HTTPException, Request
 from pydantic import BaseModel

 from app.auth import require_user_from_request
 from app.mcp_policy import load_mcp_policy, tool_allowed
+from app.net_safety import validate_public_http_url
 from app.settings import feature_enabled

 router = APIRouter()
@@ -43,6 +47,12 @@ class McpCallRequest(BaseModel):
     arguments: dict


+class McpTestRequest(BaseModel):
+    url: str
+    headers: dict[str, str] | None = None
+    timeoutSec: float = 3.0
+
+
 @router.post("/api/mcp/call")
 async def mcp_tools_call(request: McpCallRequest, http_request: Request):
     if not feature_enabled("mcp"):
@@ -61,3 +71,57 @@ async def mcp_tools_call(request: McpCallRequest, http_request: Request):
         raise
     except Exception as e:
         raise HTTPException(status_code=500, detail={"code": "mcp_error", "message": str(e)}) from e
+
+
+@router.post("/api/mcp/test")
+async def mcp_test(request: McpTestRequest, http_request: Request):
+    if not feature_enabled("mcp"):
+        raise HTTPException(status_code=403, detail={"code": "feature_disabled", "message": "MCP is disabled"})
+    _ = await require_user_from_request(http_request)
+
+    url = validate_public_http_url(request.url)
+    timeout = float(request.timeoutSec or 3.0)
+    timeout = max(0.5, min(timeout, 8.0))
+
+    raw_headers = request.headers if isinstance(request.headers, dict) else {}
+    headers: dict[str, str] = {}
+    for k, v in raw_headers.items():
+        key = str(k).strip()
+        if not key:
+            continue
+        lk = key.lower()
+        if lk in {"host", "content-length", "connection"}:
+            continue
+        if len(key) > 80:
+            continue
+        val = str(v).strip()
+        if len(val) > 2000:
+            val = val[:2000]
+        headers[key] = val
+
+    start = monotonic()
+    try:
+        async with httpx.AsyncClient(timeout=timeout, follow_redirects=False) as client:
+            async with client.stream("GET", url, headers=headers) as resp:
+                limit = 1024
+                buf = bytearray()
+                async for chunk in resp.aiter_bytes():
+                    if not chunk:
+                        continue
+                    buf.extend(chunk)
+                    if len(buf) >= limit:
+                        break
+            elapsed_ms = int((monotonic() - start) * 1000)
+            preview = bytes(buf[:1024]).decode("utf-8", errors="ignore").strip()
+            return {
+                "ok": True,
+                "url": url,
+                "statusCode": resp.status_code,
+                "contentType": resp.headers.get("content-type"),
+                "elapsedMs": elapsed_ms,
+                "preview": preview[:500],
+            }
+    except HTTPException:
+        raise
+    except Exception as e:
+        raise HTTPException(status_code=500, detail={"code": "mcp_test_failed", "message": str(e)}) from e
docs/TROUBLESHOOTING.md CHANGED
@@ -34,6 +34,10 @@ Set `ENABLE_INDEXING=1` in your environment and restart the container.
 Website indexing blocks private/localhost targets to reduce SSRF risk.
 Use a public `http(s)` URL and keep indexing within the same origin.
 
+## GitHub indexing fails for private repos
+
+Set `GITHUB_TOKEN` (or `GITHUB_PAT`) as an environment variable / HF Secret, then retry.
+
 ## Terminal shows vertical/1-column text
 
 This usually means the terminal “fit” ran while the terminal view was hidden or at size 0.
static/dashboard.html CHANGED
@@ -521,6 +521,34 @@
   <div id="indexing-status" class="text-xs text-gray-500"></div>
   <div id="indexing-jobs" class="space-y-2"></div>
 </div>
+<div class="bg-gray-900/30 border border-gray-700 rounded-lg p-3 space-y-2">
+  <div class="text-xs font-semibold text-gray-300 uppercase">GitHub Repo Indexing</div>
+  <div class="text-xs text-gray-400">Indexes repository text files into RAG (uses `GITHUB_TOKEN`/`GITHUB_PAT` for private repos).</div>
+  <div class="grid grid-cols-1 md:grid-cols-3 gap-2">
+    <input id="gh-repo" type="text" placeholder="owner/repo or https://github.com/owner/repo"
+           class="md:col-span-2 bg-gray-700 text-sm rounded border border-gray-600 p-2 text-white outline-none focus:border-blue-500">
+    <button onclick="startGithubRepoJob()"
+            class="bg-blue-600 hover:bg-blue-700 text-white px-3 py-2 rounded text-xs">Start</button>
+  </div>
+  <div class="grid grid-cols-1 md:grid-cols-3 gap-2">
+    <div>
+      <label class="block text-xs font-semibold text-gray-400 mb-1 uppercase">Ref</label>
+      <input id="gh-ref" type="text" placeholder="main"
+             class="w-full bg-gray-700 text-sm rounded border border-gray-600 p-2 text-white outline-none focus:border-blue-500">
+    </div>
+    <div>
+      <label class="block text-xs font-semibold text-gray-400 mb-1 uppercase">Path Prefix</label>
+      <input id="gh-path" type="text" placeholder="(optional) src/"
+             class="w-full bg-gray-700 text-sm rounded border border-gray-600 p-2 text-white outline-none focus:border-blue-500">
+    </div>
+    <div>
+      <label class="block text-xs font-semibold text-gray-400 mb-1 uppercase">Max Files</label>
+      <input id="gh-max-files" type="number" min="1" max="400" value="60"
+             class="w-full bg-gray-700 text-sm rounded border border-gray-600 p-2 text-white outline-none focus:border-blue-500">
+    </div>
+  </div>
+  <div id="github-indexing-status" class="text-xs text-gray-500"></div>
+</div>
 <div>
   <div class="text-xs font-semibold text-gray-400 mb-1 uppercase">Documents</div>
   <div id="rag-documents" class="space-y-2"></div>
static/dashboard.js CHANGED
@@ -374,6 +374,11 @@ let supabase;
     if (el) el.textContent = text || '';
   }
 
+  function setGithubIndexingStatus(text) {
+    const el = document.getElementById('github-indexing-status');
+    if (el) el.textContent = text || '';
+  }
+
   function renderIndexingJobs(jobs) {
     const el = document.getElementById('indexing-jobs');
     if (!el) return;
@@ -401,7 +406,22 @@ let supabase;
       const p = j?.progress || {};
       const visited = p?.visited ?? 0;
       const indexed = p?.indexedPages ?? 0;
-      meta.textContent = `${j?.createdAt || ''} • visited ${visited} • indexed ${indexed}`;
+      const params = j?.params || {};
+      const detail = (() => {
+        if (j?.type === 'web_crawl') return params?.startUrl || '';
+        if (j?.type === 'github_repo') {
+          const ref = params?.ref ? `@${params.ref}` : '';
+          const pref = params?.pathPrefix ? ` • ${params.pathPrefix}` : '';
+          return `${params?.owner || ''}/${params?.repo || ''}${ref}${pref}`.trim();
+        }
+        return '';
+      })();
+      const indexedFiles = p?.indexedFiles ?? 0;
+      const bytes = p?.bytes ?? 0;
+      const progressBits = j?.type === 'github_repo'
+        ? `files ${indexedFiles} • ${Math.round(bytes / 1024)} KB`
+        : `visited ${visited} • indexed ${indexed}`;
+      meta.textContent = `${j?.createdAt || ''}${detail ? ' • ' + detail : ''} • ${progressBits}`;
       left.appendChild(title);
       left.appendChild(meta);
 
@@ -460,6 +480,31 @@ let supabase;
     }
   }
 
+  async function startGithubRepoJob() {
+    const repo = (document.getElementById('gh-repo')?.value || '').trim();
+    const ref = (document.getElementById('gh-ref')?.value || '').trim() || null;
+    const pathPrefix = (document.getElementById('gh-path')?.value || '').trim() || null;
+    const maxFiles = Number(document.getElementById('gh-max-files')?.value || 60);
+    if (!repo) {
+      setGithubIndexingStatus('Enter a repo (owner/repo).');
+      return;
+    }
+    try {
+      setGithubIndexingStatus('Starting job...');
+      const res = await authFetch('/api/indexing/jobs/github-repo', {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({ repo, ref, pathPrefix, maxFiles }),
+      });
+      if (!res.ok) throw new Error(await res.text());
+      await res.json();
+      setGithubIndexingStatus('Job started.');
+      loadIndexingJobs();
+    } catch (e) {
+      setGithubIndexingStatus(`Failed to start job: ${e?.message || e}`);
+    }
+  }
+
   async function cancelIndexingJob(jobId) {
     const id = (jobId || '').trim();
     if (!id) return;
@@ -2521,8 +2566,16 @@ let supabase;
     if (server.token) headers['Authorization'] = `Bearer ${server.token}`;
 
     try {
-      const res = await fetch(server.url, { method: 'GET', headers });
-      if (statusEl) statusEl.textContent = `HTTP ${res.status}`;
+      const res = await authFetch('/api/mcp/test', {
+        method: 'POST',
+        headers: { 'Content-Type': 'application/json' },
+        body: JSON.stringify({ url: server.url, headers, timeoutSec: 3.0 }),
+      });
+      if (!res.ok) throw new Error(await res.text());
+      const data = await res.json();
+      const code = data?.statusCode ?? '—';
+      const ctype = data?.contentType ? ` • ${data.contentType}` : '';
+      if (statusEl) statusEl.textContent = `HTTP ${code}${ctype}`;
     } catch (e) {
       if (statusEl) statusEl.textContent = `Error: ${e.message}`;
    }
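The last hunk routes the MCP connectivity check through the backend, which validates the target with `validate_public_http_url` and disables redirects so the browser never probes arbitrary URLs itself. The repo doesn't show that validator's body; a hedged sketch of the checks such a validator might perform on literal-IP URLs (a real implementation must also resolve DNS and re-check the resolved address, which this sketch omits):

```python
import ipaddress
from urllib.parse import urlparse

def check_public_http_url(url: str) -> bool:
    """Reject non-http(s) schemes, localhost names, and literal private IPs.
    DNS resolution of hostnames is intentionally out of scope here."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if not host or host == "localhost" or host.endswith(".local"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return True  # hostname, not a literal IP; resolution check omitted
    return not (addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved)
```

Pairing this kind of check with `follow_redirects=False` matters: allowing redirects would let a "public" URL bounce the probe to an internal address after validation.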