Spaces: AdithyaVardan/GodSpeed

AdithyaVardan committed 17 days ago

Commit 2f22e68 (1 parent: 1fbfa0e)

Add text/markdown parser for file agent; add sample knowledge base documents

Files changed (7)

data_sources/api_reference.md +132 -0
data_sources/architecture.md +89 -0
data_sources/godspeed_overview.md +38 -0
data_sources/incident_runbook.md +106 -0
data_sources/setup_guide.md +108 -0
src/file_agent/parsers/__init__.py +1 -1
src/file_agent/parsers/text.py +12 -0

data_sources/api_reference.md ADDED

	@@ -0,0 +1,132 @@

+# Godspeed API Reference
+## POST /agent/query
+The main chat endpoint. Returns a Server-Sent Events (SSE) stream.
+**Request body:**
+```json
+{
+  "query": "What caused the last incident?",
+  "team_id": "default",
+  "session_id": "unique-session-id"
+}
+```
+**SSE Events:**
+| Event | Data | Description |
+|-------|------|-------------|
+| plan_ready | {tasks, reasoning} | Planner's execution plan |
+| agent_started | {agent} | A retrieval agent has started |
+| agent_done | {agent, chunks, confidence} | Agent finished, chunks found |
+| synthesis_started | {} | Synthesiser is generating the answer |
+| answer_chunk | {chunk} | One token/phrase of the answer |
+| guardrail_result | {score, escalate} | Safety score (0-1) |
+| error | {message} | Something went wrong |
+| done | {} | Stream complete |
+**Confidence levels:**
+- `high`: reranker score ≥ 0.6; answer is reliable
+- `medium`: reranker score ≥ 0.4 and < 0.6; answer may have gaps
+- `low`: reranker score < 0.4; answer is best-effort
+---
+## POST /confluence/sync/{space_key}
+Triggers a full sync of a Confluence space. Runs as a background Celery task.
+```bash
+curl -X POST http://localhost:8000/confluence/sync/ENG
+```
+**Response:**
+```json
+{"status": "accepted", "task_id": "abc123", "space_key": "ENG"}
+```
+---
+## POST /jira/sync/{project_key}
+Syncs all issues in a Jira project.
+```bash
+curl -X POST http://localhost:8000/jira/sync/BACKEND
+```
+---
+## POST /api/ingest/file
+Upload a file for ingestion. Supports: PDF, DOCX, DOC, CSV, XLSX, XLS, HTML, HTM, XML, TXT, MD.
+```bash
+curl -X POST http://localhost:8000/api/ingest/file \
+  -F "file=@report.pdf" \
+  -F "team_id=default"
+```
+**Response:**
+```json
+{"status": "accepted", "task_id": "xyz789", "filename": "report.pdf"}
+```
+---
+## GET /ingest/jobs/{job_id}
+Check the status of an ingestion job.
+```bash
+curl http://localhost:8000/ingest/jobs/abc123
+```
+**Response:**
+```json
+{
+  "job_id": "abc123",
+  "status": "completed",
+  "chunks_ingested": 42,
+  "source_type": "confluence",
+  "team_id": "default"
+}
+```
+Status values: `pending`, `running`, `completed`, `failed`
+---
+## GET /graph/traverse
+Query the knowledge graph for related entities.
+```bash
+curl "http://localhost:8000/graph/traverse?entity_name=AuthService&entity_type=Service&team_id=default"
+```
+**Parameters:**
+- `entity_name` — name of the entity to start from
+- `entity_type` — one of: Service, Library, Incident, Team
+- `team_id` — your team identifier
+---
+## POST /webhooks/jira
+Receives Jira Cloud webhooks. Configure in Jira → Project Settings → Webhooks.
+- URL: `https://your-domain.com/webhooks/jira`
+- Events: Issue Created, Issue Updated
+- Signature: requests are signed with HMAC-SHA256 using `JIRA_WEBHOOK_SECRET`
+---
+## POST /webhooks/confluence
+Receives Confluence webhooks. Configure in Confluence → Space Settings → Integrations.
+- URL: `https://your-domain.com/webhooks/confluence`
+- Events: page_created, page_updated
+- Signature: requests are signed with HMAC-SHA256 using `CONFLUENCE_WEBHOOK_SECRET`
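
The webhook sections above say requests are signed with HMAC-SHA256 using a shared secret. A minimal sketch of the receiving-side check follows. It assumes the signature arrives as a hex digest of the raw request body. The function name, the header handling, and the digest encoding are illustrative assumptions, not taken from this commit.

```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 hex digest of body.

    Hypothetical helper: the real handler may read the signature from a
    specific header or expect a different encoding.
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking how much of the signature matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

The check must run on the raw bytes of the request body, before any JSON parsing, because re-serialising the payload can change the bytes and invalidate the digest.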

data_sources/architecture.md ADDED

	@@ -0,0 +1,89 @@

+# Godspeed Architecture
+## System Components
+### 1. Ingestion Pipeline
+Documents flow through a 5-stage pipeline:
+1. **Fetch** — Source adapters pull raw content (Confluence REST API, Jira REST API, GitHub API, file upload)
+2. **Chunk** — Semantic chunker uses spaCy sentence boundaries to split into 512-token chunks with 15% overlap
+3. **PII Mask** — GLiNER model redacts person names, emails, phone numbers, SSNs, credit cards, addresses
+4. **Embed** — BGE-M3 produces 1024-dim dense vectors and sparse lexical weights per chunk
+5. **Store** — Chunks upserted into Qdrant (vectors) and Supabase (metadata)
+### 2. Agent Graph (LangGraph)
+```
+User Query
+    │
+    ▼
+[planner_node] — Gemini 2.5 Pro decides which agents to invoke
+    │
+    ├──► [doc_search_node]    — Hybrid Qdrant search + BM25 + reranker
+    ├──► [ticket_lookup_node] — Jira issue lookup
+    └──► [live_docs_node]     — Live web fetch
+    │
+    ▼
+[join_node] — Fan-in, waits for all retrieval agents
+    │
+    ▼
+[synthesiser_node] — Gemini 2.5 Pro streams the answer with citations
+    │
+    ▼
+[guardrail_node] — Gemini 2.5 Flash scores safety (0–1)
+```
+### 3. Retrieval (Hybrid Search)
+For every query, doc_search runs three retrieval methods in parallel:
+- **Dense search**: cosine similarity on 1024-dim BGE-M3 vectors
+- **Sparse search**: lexical overlap using BGE-M3 sparse weights
+- **BM25**: keyword matching on the full chunk corpus
+Results are merged using Reciprocal Rank Fusion (RRF, k=60), then top candidates are reranked by BGE-reranker-v2-m3. Confidence is high (≥0.6), medium (≥0.4), or low (<0.4) based on the top reranker score.
+### 4. Knowledge Graph (Neo4j)
+After ingestion, Gemini Flash extracts entities and relationships from each chunk:
+- **Entity types**: Service, Library, Incident, Team
+- **Relationship types**: MENTIONS, REFERENCES, DEPENDS_ON, OWNED_BY, CAUSED_BY, DOCUMENTS, HAS_CHUNK
+Graph is queryable via GET /graph/traverse.
+### 5. CAG Snapshots
+Every night at 2am UTC, a Celery beat task fetches recent Jira activity and GitHub commits for each team, summarises them with Gemini 2.5 Pro, and stores a 50k-token snapshot in the teams table. This snapshot is injected into the synthesiser context for time-aware answers.
+## API Endpoints
+| Method | Path | Description |
+|--------|------|-------------|
+| POST | /agent/query | SSE streaming chat query |
+| POST | /ingest/confluence | Ingest a Confluence space |
+| POST | /ingest/github | Ingest a GitHub repo |
+| POST | /ingest/upload | Upload a PDF |
+| GET  | /ingest/jobs/{id} | Check ingestion job status |
+| POST | /confluence/sync/{space} | Sync a Confluence space |
+| POST | /webhooks/confluence | Confluence webhook handler |
+| POST | /jira/sync/{project} | Sync a Jira project |
+| POST | /webhooks/jira | Jira webhook handler |
+| POST | /api/ingest/file | Upload a supported file (PDF/DOCX/CSV/XLSX/HTML/XML/TXT/MD) |
+| POST | /api/ingest/folder | Ingest all files in a folder |
+| POST | /graph/ingest | Re-run graph extraction for a team |
+| GET  | /graph/traverse | Traverse the knowledge graph |
+## Data Flow Diagram
+```
+Confluence ──┐
+Jira        ──┤──► Ingestion Pipeline ──► Qdrant (vectors)
+GitHub      ──┤                       ──► Supabase (metadata)
+File Upload ──┘                       ──► Neo4j (graph)
+                                      ──► BM25 index (pkl)
+                                               │
+User Query ──► LangGraph Agent Graph ──────────┘
+                      │
+                      ▼
+              SSE Streaming Answer
+```
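
The retrieval section above merges three ranked lists with Reciprocal Rank Fusion (k=60) and then buckets the top reranker score into a confidence level. The sketch below shows one way this could look, assuming each retriever returns chunk IDs ordered best first. The function names and sample IDs are hypothetical, and the reranking step itself is omitted.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1 for the best hit.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def confidence_label(top_reranker_score: float) -> str:
    """Map the top reranker score to the documented thresholds."""
    if top_reranker_score >= 0.6:
        return "high"
    if top_reranker_score >= 0.4:
        return "medium"
    return "low"


# Illustrative rankings from the dense, sparse, and BM25 retrievers.
dense = ["c1", "c2", "c3"]
sparse = ["c2", "c1", "c4"]
bm25 = ["c2", "c5", "c1"]
fused = rrf_merge([dense, sparse, bm25])
```

Because RRF uses only ranks, the three retrievers do not need comparable score scales, which is why it suits mixing cosine similarity, sparse lexical weights, and BM25.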

data_sources/godspeed_overview.md ADDED

	@@ -0,0 +1,38 @@

+# Godspeed — Enterprise Knowledge Copilot
+## What is Godspeed?
+Godspeed is an AI-powered Enterprise Knowledge Copilot that helps organizations search, retrieve, and reason over their internal knowledge. It ingests documents from Confluence, Jira, GitHub, and uploaded files, then uses a multi-agent LangGraph pipeline backed by Google Gemini to answer natural-language questions with citations.
+## Core Problem
+Enterprise teams lose hours every week hunting for information scattered across Confluence wikis, Jira tickets, GitHub repos, and internal PDFs. Godspeed unifies all of this into a single queryable knowledge base with real-time answers.
+## Key Features
+- **Hybrid Search**: Combines dense vector search (BGE-M3), sparse lexical search, and BM25 to improve recall
+- **Reranking**: BGE-reranker-v2-m3 scores retrieved chunks by relevance before synthesis
+- **PII Masking**: GLiNER model automatically redacts personal data before any external API call
+- **Multi-Agent Orchestration**: LangGraph graph with parallel retrieval agents (doc_search, ticket_lookup, live_docs)
+- **Streaming Answers**: Server-Sent Events (SSE) stream the answer token by token to the client
+- **Knowledge Graph**: Neo4j graph of entities (Services, Libraries, Incidents, Teams) extracted from documents
+- **Guardrail**: Gemini Flash scores every answer for safety before delivery
+- **CAG Snapshots**: Nightly Celery job builds a Context-Augmented Generation snapshot per team
+## Tech Stack
+- **Orchestration**: LangGraph + LangChain + Google Gemini 2.5 Pro/Flash
+- **Embeddings**: BAAI/bge-m3 (1024-dim dense + sparse lexical weights)
+- **Reranker**: BAAI/bge-reranker-v2-m3
+- **Vector DB**: Qdrant (hybrid dense + sparse collection)
+- **Metadata DB**: Supabase (PostgreSQL with RLS)
+- **Graph DB**: Neo4j (entity relationships)
+- **Task Queue**: Celery + Redis
+- **API**: FastAPI with SSE streaming
+- **PII**: GLiNER mediumv2.1
+## Team
+- Adithya Vardan — Backend architecture, agent orchestration, ingestion pipeline
+- Samyuktha — Project coordination, documentation
+- Ananth Shyam — Jira agent, Confluence agent, File agent, integration

data_sources/incident_runbook.md ADDED

	@@ -0,0 +1,106 @@

+# Incident Runbook — Godspeed Platform
+## INC-001: Qdrant Connection Refused
+**Symptoms:** Agent queries return 0 chunks, logs show "Qdrant search failed"
+**Root cause:** Qdrant Docker container stopped (usually after system restart)
+**Resolution:**
+```bash
+docker ps -a | grep qdrant
+docker start qdrant
+curl http://localhost:6333/healthz  # should return "healthz check passed"
+```
+**Prevention:** Set a restart policy on the Qdrant container so Docker restarts it automatically:
+```bash
+docker update --restart=always qdrant
+```
+---
+## INC-002: Supabase RLS Policy Violation
+**Symptoms:** Ingestion fails with "new row violates row-level security policy"
+**Root cause:** Using anon key instead of service_role key in SUPABASE_KEY env var
+**Resolution:**
+1. Go to Supabase → Project Settings → API
+2. Copy the `service_role` (secret) key
+3. Update `.env`: `SUPABASE_KEY=<service_role_key>`
+4. Restart the server
+---
+## INC-003: BGE-M3 Model OOM on Low-RAM Machine
+**Symptoms:** Server crashes with MemoryError or process killed during first query
+**Root cause:** BGE-M3 model requires ~4GB RAM. Machines with <8GB RAM may OOM.
+**Resolution:**
+- Upgrade to a machine with ≥16GB RAM for production
+- For development: set `use_fp16=True` (already set) to halve memory usage
+- Or reduce embed_batch_size in `.env`: `EMBED_BATCH_SIZE=8`
+---
+## INC-004: Celery Tasks Not Processing
+**Symptoms:** Webhook triggers return "accepted" but chunks never appear in Supabase
+**Root cause:** Celery worker not running
+**Resolution:**
+```bash
+# Check if worker is running
+ps aux | grep celery
+# Start worker
+celery -A ingestion.jobs.celery_app worker --loglevel=info
+# Check Redis
+redis-cli ping  # should return PONG
+```
+---
+## INC-005: Jira Sync Returns 0 Issues
+**Symptoms:** /jira/sync/{project} returns task accepted but 0 chunks stored
+**Possible root causes:**
+1. Project has no issues yet
+2. Wrong project key
+3. API token expired or wrong email
+**Resolution:**
+```bash
+# Verify auth
+curl -u "your-email:your-token" \
+  "https://your-org.atlassian.net/rest/api/3/myself"
+# List available projects
+curl -u "your-email:your-token" \
+  "https://your-org.atlassian.net/rest/api/3/project/search"
+```
+---
+## INC-006: Confluence Sync 404 Error
+**Symptoms:** Confluence sync fails with "404 Not Found", URL has double `/wiki/wiki/`
+**Root cause:** CONFLUENCE_BASE_URL incorrectly includes `/wiki` suffix
+**Resolution:**
+In `.env`, set:
+```
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+```
+Not:
+```
+# WRONG: do not include the /wiki suffix
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net/wiki
+```
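
INC-006 above comes from a base URL that already ends in `/wiki`. One possible guard is to normalise the value when the configuration is loaded. The sketch below assumes the client appends `/wiki` itself. The helper name is hypothetical and is not part of this commit.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_confluence_base_url(url: str) -> str:
    """Drop trailing slashes and a trailing /wiki path segment.

    Hypothetical helper; any query string or fragment is discarded.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    if path.endswith("/wiki"):
        path = path[: -len("/wiki")]
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Parsing the URL instead of trimming the raw string keeps the check on the path only, so a hostname that happens to end in `wiki` is left alone.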

data_sources/setup_guide.md ADDED

	@@ -0,0 +1,108 @@

+# Godspeed Setup Guide
+## Prerequisites
+- Python 3.11+
+- Docker (for Qdrant)
+- Redis (on macOS: `brew install redis`)
+- Supabase account
+- Google AI Studio API key (Gemini)
+## Quick Start
+### 1. Clone and install
+```bash
+git clone https://github.com/samyuktha2004/Godspeed.git
+cd Godspeed
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+### 2. Start infrastructure
+```bash
+# Redis (macOS)
+brew services start redis
+# Qdrant
+docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
+```
+### 3. Configure environment
+Copy `.env.example` to `.env` and fill in:
+```
+GOOGLE_API_KEY=your-gemini-key
+SUPABASE_URL=https://your-project.supabase.co
+SUPABASE_KEY=your-service-role-key
+JIRA_BASE_URL=https://your-org.atlassian.net
+JIRA_EMAIL=you@your-org.com
+JIRA_API_TOKEN=your-atlassian-token
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+CONFLUENCE_EMAIL=you@your-org.com
+CONFLUENCE_TOKEN=your-atlassian-token
+CONFLUENCE_SPACES=YOUR_SPACE_KEY
+```
+### 4. Run Supabase schema
+Open your Supabase project → SQL Editor → paste and run `supabase/schema.sql`.
+Also run this to add the qdrant_id column:
+```sql
+ALTER TABLE chunks ADD COLUMN IF NOT EXISTS qdrant_id text;
+```
+### 5. Start the server
+```bash
+uvicorn main:app --port 8000
+```
+### 6. Start Celery worker (optional, for background jobs)
+```bash
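+# Worker: runs the background ingestion jobs
+celery -A ingestion.jobs.celery_app worker --loglevel=info
+# Beat scheduler: run in a second terminal, since both commands block
+celery -A ingestion.jobs.celery_app beat --loglevel=info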
+```
+## Testing the System
+### Ingest Confluence
+```bash
+curl -X POST http://localhost:8000/confluence/sync/YOUR_SPACE_KEY
+```
+### Query the agent
+```bash
+curl -X POST http://localhost:8000/agent/query \
+  -H "Content-Type: application/json" \
+  -d '{"query": "What is our deployment process?", "team_id": "default", "session_id": "s1"}'
+```
+## Troubleshooting
+### Server won't start
+- Check all env vars are set in `.env`
+- Make sure Redis and Qdrant are running
+- Run `python -c "import main"` to check for import errors
+### Agent returns low confidence
+- The knowledge base may not have enough relevant content
+- Run a Confluence sync to ingest more pages
+- Check Supabase chunks table has rows
+### Qdrant connection refused
+- Start Docker and run: `docker start qdrant`
+- Or: `docker run -d --name qdrant -p 6333:6333 qdrant/qdrant`
+### Supabase RLS error
+- Use the `service_role` key, not the `anon` key
+### First query is slow (30-60s)
+- BGE-M3 and reranker models download on first use (~1.5GB total)
+- Subsequent queries are fast (models cached in memory)

src/file_agent/parsers/__init__.py CHANGED

@@ -32,4 +32,4 @@ def dispatch(path: str, fmt: str) -> list[Block]:
 # Import parsers to trigger registration
-from src.file_agent.parsers import pdf, docx, xml, csv, html  # noqa: F401, E402
+from src.file_agent.parsers import pdf, docx, xml, csv, html, text  # noqa: F401, E402

src/file_agent/parsers/text.py ADDED

	@@ -0,0 +1,12 @@

+from __future__ import annotations
+
+from src.file_agent.parsers import Block, register
+
+
+@register("text")
+def parse_text(path: str) -> list[Block]:
+    with open(path, encoding="utf-8", errors="replace") as f:
+        content = f.read().strip()
+    if not content:
+        return []
+    return [{"type": "text", "content": content}]
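
The new parser relies on `register` and `dispatch` from `src/file_agent/parsers/__init__.py`, whose bodies are not shown in this diff. The self-contained sketch below shows how such a decorator-based registry could behave. The `Block` alias, the `_PARSERS` dict, the error handling, and the `demo` helper are assumptions for illustration only. `parse_text` is copied from the commit.

```python
import tempfile
from pathlib import Path
from typing import Callable

Block = dict[str, str]  # assumed shape: {"type": ..., "content": ...}
Parser = Callable[[str], list[Block]]

_PARSERS: dict[str, Parser] = {}  # hypothetical registry: format name -> parser


def register(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that records a parser under the given format name."""
    def decorator(func: Parser) -> Parser:
        _PARSERS[fmt] = func
        return func
    return decorator


def dispatch(path: str, fmt: str) -> list[Block]:
    """Look up the parser for fmt and run it on path."""
    try:
        parser = _PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parser(path)


@register("text")
def parse_text(path: str) -> list[Block]:
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="replace") as f:
        content = f.read().strip()
    if not content:
        return []
    return [{"type": "text", "content": content}]


def demo() -> list[Block]:
    """Write a small Markdown file and parse it through the registry."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "notes.md"
        path.write_text("# Title\n\nBody text.\n", encoding="utf-8")
        return dispatch(str(path), "text")
```

Registration happens at import time, which is why the `__init__.py` change adds `text` to the import list: without that import the decorator never runs and the format stays unregistered. Note that the parser returns the whole file as one block, so any splitting is left to the downstream chunker.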