AdithyaVardan committed on
Commit 2f22e68 · 1 Parent(s): 1fbfa0e

Add text/markdown parser for file agent; add sample knowledge base documents

data_sources/api_reference.md ADDED
@@ -0,0 +1,132 @@
+ # Godspeed API Reference
+
+ ## POST /agent/query
+
+ The main chat endpoint. Returns a Server-Sent Events (SSE) stream.
+
+ **Request body:**
+ ```json
+ {
+   "query": "What caused the last incident?",
+   "team_id": "default",
+   "session_id": "unique-session-id"
+ }
+ ```
+
+ **SSE Events:**
+
+ | Event | Data | Description |
+ |-------|------|-------------|
+ | plan_ready | {tasks, reasoning} | Planner's execution plan |
+ | agent_started | {agent} | A retrieval agent has started |
+ | agent_done | {agent, chunks, confidence} | Agent finished, chunks found |
+ | synthesis_started | {} | Synthesiser is generating the answer |
+ | answer_chunk | {chunk} | One token/phrase of the answer |
+ | guardrail_result | {score, escalate} | Safety score (0-1) |
+ | error | {message} | Something went wrong |
+ | done | {} | Stream complete |
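A client has to split the stream into events before it can react to `plan_ready`, `answer_chunk` and the rest of the table above. The sketch below shows one way to do that in Python. It assumes the server uses standard SSE framing (`event:` and `data:` lines with JSON payloads, separated by blank lines); that wire format and the helper names `parse_sse` and `collect_answer` are assumptions, not something this reference states.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from raw SSE lines.

    Assumes each event is an `event:` line plus one or more `data:` lines
    holding JSON, terminated by a blank line (standard SSE framing).
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


def collect_answer(lines: Iterable[str]) -> str:
    """Concatenate `answer_chunk` payloads until `done` arrives."""
    parts = []
    for event, payload in parse_sse(lines):
        if event == "answer_chunk":
            parts.append(payload["chunk"])
        elif event == "error":
            raise RuntimeError(payload.get("message", "unknown error"))
        elif event == "done":
            break
    return "".join(parts)
```

Any iterable of text lines works as input, for example the decoded lines of a streaming HTTP response.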
+
+ **Confidence levels:**
+ - `high` — reranker score ≥ 0.6, answer is reliable
+ - `medium` — reranker score ≥ 0.4 and < 0.6, answer may have gaps
+ - `low` — reranker score < 0.4, answer is best-effort
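The thresholds above translate directly into a small lookup. This is only an illustration of the documented cut-offs; the function name `confidence_level` is hypothetical and the project's own implementation may differ.

```python
def confidence_level(reranker_score: float) -> str:
    """Map a top reranker score to the documented confidence label."""
    if reranker_score >= 0.6:
        return "high"
    if reranker_score >= 0.4:
        return "medium"
    return "low"
```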
+
+ ---
+
+ ## POST /confluence/sync/{space_key}
+
+ Triggers a full sync of a Confluence space. Runs as a background Celery task.
+
+ ```bash
+ curl -X POST http://localhost:8000/confluence/sync/ENG
+ ```
+
+ **Response:**
+ ```json
+ {"status": "accepted", "task_id": "abc123", "space_key": "ENG"}
+ ```
+
+ ---
+
+ ## POST /jira/sync/{project_key}
+
+ Syncs all issues in a Jira project.
+
+ ```bash
+ curl -X POST http://localhost:8000/jira/sync/BACKEND
+ ```
+
+ ---
+
+ ## POST /api/ingest/file
+
+ Upload a file for ingestion. Supports: PDF, DOCX, DOC, CSV, XLSX, XLS, HTML, HTM, XML, TXT, MD.
+
+ ```bash
+ curl -X POST http://localhost:8000/api/ingest/file \
+   -F "file=@report.pdf" \
+   -F "team_id=default"
+ ```
+
+ **Response:**
+ ```json
+ {"status": "accepted", "task_id": "xyz789", "filename": "report.pdf"}
+ ```
+
+ ---
+
+ ## GET /ingest/jobs/{job_id}
+
+ Check the status of an ingestion job.
+
+ ```bash
+ curl http://localhost:8000/ingest/jobs/abc123
+ ```
+
+ **Response:**
+ ```json
+ {
+   "job_id": "abc123",
+   "status": "completed",
+   "chunks_ingested": 42,
+   "source_type": "confluence",
+   "team_id": "default"
+ }
+ ```
+
+ Status values: `pending`, `running`, `completed`, `failed`
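Because sync and upload requests only return `accepted`, a caller usually polls this endpoint until the job reaches a terminal status. The sketch below is one possible polling loop; `wait_for_job`, its parameters, and the choice of `completed` and `failed` as the terminal states are assumptions layered on the status values listed above.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches `completed` or `failed`.

    `fetch_status` is a caller-supplied function that performs
    GET /ingest/jobs/{job_id} and returns the decoded JSON body.
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job["status"] in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

Keeping the HTTP call behind `fetch_status` lets the loop be exercised without a running server; in practice it could wrap `urllib.request.urlopen` or any HTTP client.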
+
+ ---
+
+ ## GET /graph/traverse
+
+ Query the knowledge graph for related entities.
+
+ ```bash
+ curl "http://localhost:8000/graph/traverse?entity_name=AuthService&entity_type=Service&team_id=default"
+ ```
+
+ **Parameters:**
+ - `entity_name` — name of the entity to start from
+ - `entity_type` — one of: Service, Library, Incident, Team
+ - `team_id` — your team identifier
+
+ ---
+
+ ## POST /webhooks/jira
+
+ Receives Jira Cloud webhooks. Configure in Jira → Project Settings → Webhooks.
+
+ - URL: `https://your-domain.com/webhooks/jira`
+ - Events: Issue Created, Issue Updated
+ - Signature: HMAC-SHA256, keyed with `JIRA_WEBHOOK_SECRET`
+
+ ---
+
+ ## POST /webhooks/confluence
+
+ Receives Confluence webhooks. Configure in Confluence → Space Settings → Integrations.
+
+ - URL: `https://your-domain.com/webhooks/confluence`
+ - Events: page_created, page_updated
+ - Signature: HMAC-SHA256, keyed with `CONFLUENCE_WEBHOOK_SECRET`
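Both webhook endpoints rely on an HMAC-SHA256 signature keyed with a shared secret. A receiver can check such a signature roughly as sketched below. The reference does not say which header carries the signature or how it is encoded, so the hex encoding, the optional `sha256=` prefix and the helper names are assumptions.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of a raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: str, body: bytes, received: str) -> bool:
    """Check a received signature in constant time.

    Tolerates an optional "sha256=" prefix (an assumption about the sender).
    """
    received = received.removeprefix("sha256=")
    return hmac.compare_digest(sign_body(secret, body), received)
```

The comparison uses `hmac.compare_digest` rather than `==` so that timing differences do not leak how much of the signature matched. The digest must be computed over the raw request bytes, before any JSON parsing.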
data_sources/architecture.md ADDED
@@ -0,0 +1,89 @@
+ # Godspeed Architecture
+
+ ## System Components
+
+ ### 1. Ingestion Pipeline
+
+ Documents flow through a 5-stage pipeline:
+
+ 1. **Fetch** — Source adapters pull raw content (Confluence REST API, Jira REST API, GitHub API, file upload)
+ 2. **Chunk** — Semantic chunker uses spaCy sentence boundaries to split into 512-token chunks with 15% overlap
+ 3. **PII Mask** — GLiNER model redacts person names, emails, phone numbers, SSNs, credit cards, addresses
+ 4. **Embed** — BGE-M3 produces 1024-dim dense vectors and sparse lexical weights per chunk
+ 5. **Store** — Chunks upserted into Qdrant (vectors) and Supabase (metadata)
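To make the chunk-size arithmetic in stage 2 concrete, the sketch below slides a 512-token window with 15% overlap over a token list. It is a deliberate simplification: it ignores the spaCy sentence boundaries the real chunker respects, and `chunk_windows` with its parameters is a hypothetical name used only for illustration.

```python
from __future__ import annotations


def chunk_windows(tokens: list[str], size: int = 512, overlap: float = 0.15) -> list[list[str]]:
    """Split tokens into fixed-size windows that share `overlap` of their length."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # With the defaults: 512 - int(76.8) = 436 tokens between window starts.
    step = max(1, size - int(size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With the defaults, consecutive chunks share 76 tokens, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk in most cases.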
+
+ ### 2. Agent Graph (LangGraph)
+
+ ```
+ User Query
+
+
+ [planner_node] — Gemini 2.5 Pro decides which agents to invoke
+
+ ├──► [doc_search_node] — Hybrid Qdrant search + BM25 + reranker
+ ├──► [ticket_lookup_node] — Jira issue lookup
+ └──► [live_docs_node] — Live web fetch
+
+
+ [join_node] — Fan-in, waits for all retrieval agents
+
+
+ [synthesiser_node] — Gemini 2.5 Pro streams the answer with citations
+
+
+ [guardrail_node] — Gemini 2.5 Flash scores safety (0–1)
+ ```
+
+ ### 3. Retrieval (Hybrid Search)
+
+ For every query, doc_search runs three retrieval methods in parallel:
+ - **Dense search**: cosine similarity on 1024-dim BGE-M3 vectors
+ - **Sparse search**: lexical overlap using BGE-M3 sparse weights
+ - **BM25**: keyword matching on the full chunk corpus
+
+ Results are merged using Reciprocal Rank Fusion (RRF, k=60), then top candidates are reranked by BGE-reranker-v2-m3. Confidence is high (≥0.6), medium (≥0.4), or low (<0.4) based on the top reranker score.
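Reciprocal Rank Fusion scores each chunk as the sum of `1 / (k + rank)` over every ranked list it appears in, so chunks ranked well by several retrievers rise to the top. The sketch below shows that formula with `k=60`; the function name `rrf_merge` and the use of plain string IDs are assumptions for illustration, not the project's code.

```python
from __future__ import annotations

from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Only ranks enter the formula, so the dense, sparse and BM25 scores never need to be normalised against each other before fusion.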
+
+ ### 4. Knowledge Graph (Neo4j)
+
+ After ingestion, Gemini Flash extracts entities and relationships from each chunk:
+ - **Entity types**: Service, Library, Incident, Team
+ - **Relationship types**: MENTIONS, REFERENCES, DEPENDS_ON, OWNED_BY, CAUSED_BY, DOCUMENTS, HAS_CHUNK
+
+ Graph is queryable via GET /graph/traverse.
+
+ ### 5. CAG Snapshots
+
+ Every night at 2am UTC, a Celery beat task fetches recent Jira activity and GitHub commits for each team, summarises them with Gemini 2.5 Pro, and stores a 50k-token snapshot in the teams table. This snapshot is injected into the synthesiser context for time-aware answers.
+
+ ## API Endpoints
+
+ | Method | Path | Description |
+ |--------|------|-------------|
+ | POST | /agent/query | SSE streaming chat query |
+ | POST | /ingest/confluence | Ingest a Confluence space |
+ | POST | /ingest/github | Ingest a GitHub repo |
+ | POST | /ingest/upload | Upload a PDF |
+ | GET | /ingest/jobs/{id} | Check ingestion job status |
+ | POST | /confluence/sync/{space} | Sync a Confluence space |
+ | POST | /webhooks/confluence | Confluence webhook handler |
+ | POST | /jira/sync/{project} | Sync a Jira project |
+ | POST | /webhooks/jira | Jira webhook handler |
+ | POST | /api/ingest/file | Upload any file (PDF/DOCX/CSV/XLSX/HTML/XML/TXT/MD) |
+ | POST | /api/ingest/folder | Ingest all files in a folder |
+ | POST | /graph/ingest | Re-run graph extraction for a team |
+ | GET | /graph/traverse | Traverse the knowledge graph |
+
+ ## Data Flow Diagram
+
+ ```
+ Confluence  ──┐
+ Jira        ──┤──► Ingestion Pipeline ──► Qdrant (vectors)
+ GitHub      ──┤                       ──► Supabase (metadata)
+ File Upload ──┘                       ──► Neo4j (graph)
+                                       ──► BM25 index (pkl)
+
+ User Query ──► LangGraph Agent Graph ──────────┘
+
+
+ SSE Streaming Answer
+ ```
data_sources/godspeed_overview.md ADDED
@@ -0,0 +1,38 @@
+ # Godspeed — Enterprise Knowledge Copilot
+
+ ## What is Godspeed?
+
+ Godspeed is an AI-powered Enterprise Knowledge Copilot built by a team of engineers to help organizations search, retrieve, and reason over their internal knowledge. It ingests documents from Confluence, Jira, GitHub, and uploaded files, then uses a multi-agent LangGraph pipeline backed by Google Gemini to answer natural language questions with citations.
+
+ ## Core Problem
+
+ Enterprise teams lose hours every week hunting for information scattered across Confluence wikis, Jira tickets, GitHub repos, and internal PDFs. Godspeed unifies all of this into a single queryable knowledge base with real-time answers.
+
+ ## Key Features
+
+ - **Hybrid Search**: Combines dense vector search (BGE-M3), sparse lexical search, and BM25 for maximum recall
+ - **Reranking**: BGE-reranker-v2-m3 scores retrieved chunks by relevance before synthesis
+ - **PII Masking**: GLiNER model automatically redacts personal data before any external API call
+ - **Multi-Agent Orchestration**: LangGraph graph with parallel retrieval agents (doc_search, ticket_lookup, live_docs)
+ - **Streaming Answers**: Server-Sent Events (SSE) stream the answer token by token to the client
+ - **Knowledge Graph**: Neo4j graph of entities (Services, Libraries, Incidents, Teams) extracted from documents
+ - **Guardrail**: Gemini Flash scores every answer for safety before delivery
+ - **CAG Snapshots**: Nightly Celery job builds a Context-Augmented Generation snapshot per team
+
+ ## Tech Stack
+
+ - **Orchestration**: LangGraph + LangChain + Google Gemini 2.5 Pro/Flash
+ - **Embeddings**: BAAI/bge-m3 (1024-dim dense + sparse lexical weights)
+ - **Reranker**: BAAI/bge-reranker-v2-m3
+ - **Vector DB**: Qdrant (hybrid dense + sparse collection)
+ - **Metadata DB**: Supabase (PostgreSQL with RLS)
+ - **Graph DB**: Neo4j (entity relationships)
+ - **Task Queue**: Celery + Redis
+ - **API**: FastAPI with SSE streaming
+ - **PII**: GLiNER mediumv2.1
+
+ ## Team
+
+ - Adithya Vardan — Backend architecture, agent orchestration, ingestion pipeline
+ - Samyuktha — Project coordination, documentation
+ - Ananth Shyam — Jira agent, Confluence agent, File agent, integration
data_sources/incident_runbook.md ADDED
@@ -0,0 +1,106 @@
+ # Incident Runbook — Godspeed Platform
+
+ ## INC-001: Qdrant Connection Refused
+
+ **Symptoms:** Agent queries return 0 chunks, logs show "Qdrant search failed"
+
+ **Root cause:** Qdrant Docker container stopped (usually after system restart)
+
+ **Resolution:**
+ ```bash
+ docker ps -a | grep qdrant
+ docker start qdrant
+ curl http://localhost:6333/healthz # should return "healthz check passed"
+ ```
+
+ **Prevention:** Add Qdrant to Docker restart policy:
+ ```bash
+ docker update --restart=always qdrant
+ ```
+
+ ---
+
+ ## INC-002: Supabase RLS Policy Violation
+
+ **Symptoms:** Ingestion fails with "new row violates row-level security policy"
+
+ **Root cause:** Using anon key instead of service_role key in SUPABASE_KEY env var
+
+ **Resolution:**
+ 1. Go to Supabase → Project Settings → API
+ 2. Copy the `service_role` (secret) key
+ 3. Update `.env`: `SUPABASE_KEY=<service_role_key>`
+ 4. Restart the server
+
+ ---
+
+ ## INC-003: BGE-M3 Model OOM on Low-RAM Machine
+
+ **Symptoms:** Server crashes with MemoryError or process killed during first query
+
+ **Root cause:** BGE-M3 model requires ~4GB RAM. Machines with <8GB RAM may OOM.
+
+ **Resolution:**
+ - Upgrade to a machine with ≥16GB RAM for production
+ - For development: set `use_fp16=True` (already set) to halve memory usage
+ - Or reduce embed_batch_size in `.env`: `EMBED_BATCH_SIZE=8`
+
+ ---
+
+ ## INC-004: Celery Tasks Not Processing
+
+ **Symptoms:** Webhook triggers return "accepted" but chunks never appear in Supabase
+
+ **Root cause:** Celery worker not running
+
+ **Resolution:**
+ ```bash
+ # Check if worker is running
+ ps aux | grep celery
+
+ # Start worker
+ celery -A ingestion.jobs.celery_app worker --loglevel=info
+
+ # Check Redis
+ redis-cli ping # should return PONG
+ ```
+
+ ---
+
+ ## INC-005: Jira Sync Returns 0 Issues
+
+ **Symptoms:** /jira/sync/{project} returns task accepted but 0 chunks stored
+
+ **Root cause options:**
+ 1. Project has no issues yet
+ 2. Wrong project key
+ 3. API token expired or wrong email
+
+ **Resolution:**
+ ```bash
+ # Verify auth
+ curl -u "your-email:your-token" \
+   "https://your-org.atlassian.net/rest/api/3/myself"
+
+ # List available projects
+ curl -u "your-email:your-token" \
+   "https://your-org.atlassian.net/rest/api/3/project/search"
+ ```
+
+ ---
+
+ ## INC-006: Confluence Sync 404 Error
+
+ **Symptoms:** Confluence sync fails with "404 Not Found", URL has double `/wiki/wiki/`
+
+ **Root cause:** CONFLUENCE_BASE_URL incorrectly includes `/wiki` suffix
+
+ **Resolution:**
+ In `.env`, set:
+ ```
+ CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+ ```
+ Not:
+ ```
+ CONFLUENCE_BASE_URL=https://your-org.atlassian.net/wiki ← WRONG
+ ```
data_sources/setup_guide.md ADDED
@@ -0,0 +1,108 @@
+ # Godspeed Setup Guide
+
+ ## Prerequisites
+
+ - Python 3.11+
+ - Docker (for Qdrant)
+ - Redis (`brew install redis` on macOS)
+ - Supabase account
+ - Google AI Studio API key (Gemini)
+
+ ## Quick Start
+
+ ### 1. Clone and install
+
+ ```bash
+ git clone https://github.com/samyuktha2004/Godspeed.git
+ cd Godspeed
+ python -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+ python -m spacy download en_core_web_sm
+ ```
+
+ ### 2. Start infrastructure
+
+ ```bash
+ # Redis (macOS)
+ brew services start redis
+
+ # Qdrant
+ docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
+ ```
+
+ ### 3. Configure environment
+
+ Copy `.env.example` to `.env` and fill in:
+
+ ```
+ GOOGLE_API_KEY=your-gemini-key
+ SUPABASE_URL=https://your-project.supabase.co
+ SUPABASE_KEY=your-service-role-key
+ JIRA_BASE_URL=https://your-org.atlassian.net
+ JIRA_EMAIL=you@your-org.com
+ JIRA_API_TOKEN=your-atlassian-token
+ CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+ CONFLUENCE_EMAIL=you@your-org.com
+ CONFLUENCE_TOKEN=your-atlassian-token
+ CONFLUENCE_SPACES=YOUR_SPACE_KEY
+ ```
+
+ ### 4. Run Supabase schema
+
+ Open your Supabase project → SQL Editor → paste and run `supabase/schema.sql`.
+
+ Also run this to add the qdrant_id column:
+ ```sql
+ ALTER TABLE chunks ADD COLUMN IF NOT EXISTS qdrant_id text;
+ ```
+
+ ### 5. Start the server
+
+ ```bash
+ uvicorn main:app --port 8000
+ ```
+
+ ### 6. Start Celery worker (optional, for background jobs)
+
+ ```bash
+ # Run each command in its own terminal; both stay in the foreground
+ celery -A ingestion.jobs.celery_app worker --loglevel=info
+ celery -A ingestion.jobs.celery_app beat --loglevel=info
+ ```
+
+ ## Testing the System
+
+ ### Ingest Confluence
+ ```bash
+ curl -X POST http://localhost:8000/confluence/sync/YOUR_SPACE_KEY
+ ```
+
+ ### Query the agent
+ ```bash
+ curl -X POST http://localhost:8000/agent/query \
+   -H "Content-Type: application/json" \
+   -d '{"query": "What is our deployment process?", "team_id": "default", "session_id": "s1"}'
+ ```
+
+ ## Troubleshooting
+
+ ### Server won't start
+ - Check all env vars are set in `.env`
+ - Make sure Redis and Qdrant are running
+ - Run `python -c "import main"` to check for import errors
+
+ ### Agent returns low confidence
+ - The knowledge base may not have enough relevant content
+ - Run a Confluence sync to ingest more pages
+ - Check Supabase chunks table has rows
+
+ ### Qdrant connection refused
+ - Start Docker and run: `docker start qdrant`
+ - Or: `docker run -d --name qdrant -p 6333:6333 qdrant/qdrant`
+
+ ### Supabase RLS error
+ - Use the `service_role` key, not the `anon` key
+
+ ### First query is slow (30-60s)
+ - BGE-M3 and reranker models download on first use (~1.5GB total)
+ - Subsequent queries are fast (models cached in memory)
src/file_agent/parsers/__init__.py CHANGED
@@ -32,4 +32,4 @@ def dispatch(path: str, fmt: str) -> list[Block]:
  
  
  # Import parsers to trigger registration
- from src.file_agent.parsers import pdf, docx, xml, csv, html # noqa: F401, E402
+ from src.file_agent.parsers import pdf, docx, xml, csv, html, text # noqa: F401, E402
src/file_agent/parsers/text.py ADDED
@@ -0,0 +1,12 @@
+ from __future__ import annotations
+
+ from src.file_agent.parsers import Block, register
+
+
+ @register("text")
+ def parse_text(path: str) -> list[Block]:
+     """Parse a plain-text or Markdown file into a single text block."""
+     with open(path, encoding="utf-8", errors="replace") as f:
+         content = f.read().strip()
+     if not content:
+         return []
+     return [{"type": "text", "content": content}]
+ return [{"type": "text", "content": content}]