Commit 2f22e68
Parent(s): 1fbfa0e
Add text/markdown parser for file agent; add sample knowledge base documents
- data_sources/api_reference.md +132 -0
- data_sources/architecture.md +89 -0
- data_sources/godspeed_overview.md +38 -0
- data_sources/incident_runbook.md +106 -0
- data_sources/setup_guide.md +108 -0
- src/file_agent/parsers/__init__.py +1 -1
- src/file_agent/parsers/text.py +12 -0
data_sources/api_reference.md
ADDED
@@ -0,0 +1,132 @@
+# Godspeed API Reference
+
+## POST /agent/query
+
+The main chat endpoint. Returns a Server-Sent Events (SSE) stream.
+
+**Request body:**
+```json
+{
+  "query": "What caused the last incident?",
+  "team_id": "default",
+  "session_id": "unique-session-id"
+}
+```
+
+**SSE Events:**
+
+| Event | Data | Description |
+|-------|------|-------------|
+| plan_ready | {tasks, reasoning} | Planner's execution plan |
+| agent_started | {agent} | A retrieval agent has started |
+| agent_done | {agent, chunks, confidence} | Agent finished, chunks found |
+| synthesis_started | {} | Synthesiser is generating the answer |
+| answer_chunk | {chunk} | One token/phrase of the answer |
+| guardrail_result | {score, escalate} | Safety score (0-1) |
+| error | {message} | Something went wrong |
+| done | {} | Stream complete |
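A client consumes this stream by splitting it on blank lines and decoding each event's payload. The sketch below is one way to do that, assuming the server emits the standard SSE `event:` and `data:` fields with a JSON object in `data:`. The reference does not show the wire format, so that is an assumption, and `parse_sse` and the sample stream are hypothetical.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, data) pairs from raw SSE lines.

    Assumes standard `event:` / `data:` fields and a JSON object in `data:`.
    """
    event, data_parts = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if data_parts:
                yield event, json.loads("\n".join(data_parts))
            event, data_parts = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())


# Hypothetical stream content, for illustration only.
sample = [
    "event: agent_started", 'data: {"agent": "doc_search"}', "",
    "event: answer_chunk", 'data: {"chunk": "Hello"}', "",
    "event: done", "data: {}", "",
]
events = list(parse_sse(sample))
```

A real client would feed the function the decoded lines of the HTTP response body and stop reading once it sees `done` or `error`.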
+
+**Confidence levels:**
+- `high` — reranker score ≥ 0.6, answer is reliable
+- `medium` — reranker score ≥ 0.4 and < 0.6, answer may have gaps
+- `low` — reranker score < 0.4, answer is best-effort
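The thresholds above map directly onto a small classification function. This is a sketch of that mapping only; the function name is hypothetical and the project's actual code is not shown in this commit.

```python
def confidence_level(reranker_score: float) -> str:
    """Map a top reranker score to the documented confidence label."""
    if reranker_score >= 0.6:
        return "high"
    if reranker_score >= 0.4:
        return "medium"
    return "low"
```

Checking the higher threshold first keeps the `medium` band implicitly bounded to scores from 0.4 up to, but not including, 0.6.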
+
+---
+
+## POST /confluence/sync/{space_key}
+
+Triggers a full sync of a Confluence space. Runs as a background Celery task.
+
+```bash
+curl -X POST http://localhost:8000/confluence/sync/ENG
+```
+
+**Response:**
+```json
+{"status": "accepted", "task_id": "abc123", "space_key": "ENG"}
+```
+
+---
+
+## POST /jira/sync/{project_key}
+
+Syncs all issues in a Jira project.
+
+```bash
+curl -X POST http://localhost:8000/jira/sync/BACKEND
+```
+
+---
+
+## POST /api/ingest/file
+
+Upload a file for ingestion. Supports: PDF, DOCX, DOC, CSV, XLSX, XLS, HTML, HTM, XML, TXT, MD.
+
+```bash
+curl -X POST http://localhost:8000/api/ingest/file \
+  -F "file=@report.pdf" \
+  -F "team_id=default"
+```
+
+**Response:**
+```json
+{"status": "accepted", "task_id": "xyz789", "filename": "report.pdf"}
+```
+
+---
+
+## GET /ingest/jobs/{job_id}
+
+Check the status of an ingestion job.
+
+```bash
+curl http://localhost:8000/ingest/jobs/abc123
+```
+
+**Response:**
+```json
+{
+  "job_id": "abc123",
+  "status": "completed",
+  "chunks_ingested": 42,
+  "source_type": "confluence",
+  "team_id": "default"
+}
+```
+
+Status values: `pending`, `running`, `completed`, `failed`
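Because sync and upload calls only return `accepted`, a caller typically polls this endpoint until the job reaches a terminal status. The sketch below shows one way to do that. `wait_for_job` and its parameters are hypothetical, and the HTTP call is injected as `fetch_status` so the example runs without a server.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    poll_seconds: float = 2.0,
    max_polls: int = 30,
) -> dict:
    """Poll until the job reaches `completed` or `failed`, then return its record.

    `fetch_status` is a caller-supplied function (hypothetical here) that performs
    GET /ingest/jobs/{job_id} and returns the decoded JSON body.
    """
    for _ in range(max_polls):
        job = fetch_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still not finished after {max_polls} polls")


# Stand-in for the HTTP call so the sketch runs without a server.
_responses = iter([
    {"job_id": "abc123", "status": "pending"},
    {"job_id": "abc123", "status": "running"},
    {"job_id": "abc123", "status": "completed", "chunks_ingested": 42},
])
result = wait_for_job(lambda _job_id: next(_responses), "abc123", poll_seconds=0)
```

Capping the number of polls keeps a stuck job from blocking the caller indefinitely.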
+
+---
+
+## GET /graph/traverse
+
+Query the knowledge graph for related entities.
+
+```bash
+curl "http://localhost:8000/graph/traverse?entity_name=AuthService&entity_type=Service&team_id=default"
+```
+
+**Parameters:**
+- `entity_name` — name of the entity to start from
+- `entity_type` — one of: Service, Library, Incident, Team
+- `team_id` — your team identifier
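Entity names can contain spaces or other characters that need escaping, so it is safer to build the query string programmatically than by hand. A minimal sketch using only the standard library follows. `traverse_url` is a hypothetical helper, and the client-side check of `entity_type` simply mirrors the list above.

```python
from urllib.parse import urlencode

ENTITY_TYPES = {"Service", "Library", "Incident", "Team"}


def traverse_url(base_url: str, entity_name: str, entity_type: str, team_id: str) -> str:
    """Build a GET /graph/traverse URL with properly escaped query parameters."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"entity_type must be one of {sorted(ENTITY_TYPES)}")
    query = urlencode(
        {"entity_name": entity_name, "entity_type": entity_type, "team_id": team_id}
    )
    return f"{base_url.rstrip('/')}/graph/traverse?{query}"


url = traverse_url("http://localhost:8000", "Auth Service", "Service", "default")
```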
+
+---
+
+## POST /webhooks/jira
+
+Receives Jira Cloud webhooks. Configure in Jira → Project Settings → Webhooks.
+
+- URL: `https://your-domain.com/webhooks/jira`
+- Events: Issue Created, Issue Updated
+- Signature: HMAC-SHA256, keyed with `JIRA_WEBHOOK_SECRET`
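Verifying an HMAC-SHA256 signature means recomputing the digest over the raw request body and comparing it with the value sent alongside the request. The reference does not say how the signature is transmitted (header name, hex or base64, any prefix), so the sketch below assumes a bare hex digest. `verify_signature` is a hypothetical helper, not the project's handler.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, received_hex: str) -> bool:
    """Return True if `received_hex` is the HMAC-SHA256 of `body` under `secret`."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking the digest via timing
    return hmac.compare_digest(expected, received_hex)


secret = "example-secret"  # would come from JIRA_WEBHOOK_SECRET
payload = b'{"webhookEvent": "jira:issue_updated"}'
good = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
```

The digest has to be computed over the exact bytes received, before any JSON parsing, because re-serialising the payload can change whitespace or key order.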
+
+---
+
+## POST /webhooks/confluence
+
+Receives Confluence webhooks. Configure in Confluence → Space Settings → Integrations.
+
+- URL: `https://your-domain.com/webhooks/confluence`
+- Events: page_created, page_updated
+- Signature: HMAC-SHA256, keyed with `CONFLUENCE_WEBHOOK_SECRET`
data_sources/architecture.md
ADDED
@@ -0,0 +1,89 @@
+# Godspeed Architecture
+
+## System Components
+
+### 1. Ingestion Pipeline
+
+Documents flow through a 5-stage pipeline:
+
+1. **Fetch** — Source adapters pull raw content (Confluence REST API, Jira REST API, GitHub API, file upload)
+2. **Chunk** — Semantic chunker uses spaCy sentence boundaries to split into 512-token chunks with 15% overlap
+3. **PII Mask** — GLiNER model redacts person names, emails, phone numbers, SSNs, credit cards, addresses
+4. **Embed** — BGE-M3 produces 1024-dim dense vectors and sparse lexical weights per chunk
+5. **Store** — Chunks upserted into Qdrant (vectors) and Supabase (metadata)
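To make the chunk size and overlap concrete: with 512-token chunks and 15% overlap, each new chunk starts 436 tokens after the previous one and repeats its last 76 tokens. The sketch below uses a plain sliding window over a token list. It deliberately ignores the spaCy sentence boundaries used by the real chunker, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: float = 0.15) -> list[list[str]]:
    """Split `tokens` into windows of `size` that share `overlap` of their length."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, size - int(size * overlap))  # 512 - 76 = 436 with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

A sentence-aware chunker would move each boundary to the nearest sentence end, so real chunk lengths vary around these figures.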
+
+### 2. Agent Graph (LangGraph)
+
+```
+User Query
+    │
+    ▼
+[planner_node] — Gemini 2.5 Pro decides which agents to invoke
+    │
+    ├──► [doc_search_node] — Hybrid Qdrant search + BM25 + reranker
+    ├──► [ticket_lookup_node] — Jira issue lookup
+    └──► [live_docs_node] — Live web fetch
+    │
+    ▼
+[join_node] — Fan-in, waits for all retrieval agents
+    │
+    ▼
+[synthesiser_node] — Gemini 2.5 Pro streams the answer with citations
+    │
+    ▼
+[guardrail_node] — Gemini 2.5 Flash scores safety (0–1)
+```
+
+### 3. Retrieval (Hybrid Search)
+
+For every query, doc_search runs three retrieval methods in parallel:
+- **Dense search**: cosine similarity on 1024-dim BGE-M3 vectors
+- **Sparse search**: lexical overlap using BGE-M3 sparse weights
+- **BM25**: keyword matching on the full chunk corpus
+
+Results are merged using Reciprocal Rank Fusion (RRF, k=60), then top candidates are reranked by BGE-reranker-v2-m3. Confidence is high (≥0.6), medium (≥0.4), or low (<0.4) based on the top reranker score.
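Reciprocal Rank Fusion scores each chunk as the sum of `1 / (k + rank)` over every ranked list it appears in, so chunks that rank well in several lists rise to the top. A minimal sketch with k=60 follows. The function name and chunk IDs are illustrative, and the reranking step is not shown.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["c1", "c2", "c3"]
sparse = ["c2", "c1", "c4"]
bm25 = ["c2", "c5", "c1"]
fused = rrf_merge([dense, sparse, bm25])
```

Here `c2` comes out first because it is ranked first in two of the three lists. RRF only uses ranks, so the three retrievers' raw scores never need to be put on a common scale.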
+
+### 4. Knowledge Graph (Neo4j)
+
+After ingestion, Gemini Flash extracts entities and relationships from each chunk:
+- **Entity types**: Service, Library, Incident, Team
+- **Relationship types**: MENTIONS, REFERENCES, DEPENDS_ON, OWNED_BY, CAUSED_BY, DOCUMENTS, HAS_CHUNK
+
+The graph is queryable via `GET /graph/traverse`.
+
+### 5. CAG Snapshots
+
+Every night at 2am UTC, a Celery beat task fetches recent Jira activity and GitHub commits for each team, summarises them with Gemini 2.5 Pro, and stores a 50k-token snapshot in the teams table. This snapshot is injected into the synthesiser context for time-aware answers.
+
+## API Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| POST | /agent/query | SSE streaming chat query |
+| POST | /ingest/confluence | Ingest a Confluence space |
+| POST | /ingest/github | Ingest a GitHub repo |
+| POST | /ingest/upload | Upload a PDF |
+| GET | /ingest/jobs/{id} | Check ingestion job status |
+| POST | /confluence/sync/{space} | Sync a Confluence space |
+| POST | /webhooks/confluence | Confluence webhook handler |
+| POST | /jira/sync/{project} | Sync a Jira project |
+| POST | /webhooks/jira | Jira webhook handler |
+| POST | /api/ingest/file | Upload any file (PDF/DOCX/CSV/XLSX/HTML/XML/TXT/MD) |
+| POST | /api/ingest/folder | Ingest all files in a folder |
+| POST | /graph/ingest | Re-run graph extraction for a team |
+| GET | /graph/traverse | Traverse the knowledge graph |
+
+## Data Flow Diagram
+
+```
+Confluence  ──┐
+Jira        ──┤──► Ingestion Pipeline ──► Qdrant (vectors)
+GitHub      ──┤                       ──► Supabase (metadata)
+File Upload ──┘                       ──► Neo4j (graph)
+                                      ──► BM25 index (pkl)
+                                               │
+User Query ──► LangGraph Agent Graph ──────────┘
+                    │
+                    ▼
+           SSE Streaming Answer
+```
data_sources/godspeed_overview.md
ADDED
@@ -0,0 +1,38 @@
+# Godspeed — Enterprise Knowledge Copilot
+
+## What is Godspeed?
+
+Godspeed is an AI-powered Enterprise Knowledge Copilot built by a team of engineers to help organizations search, retrieve, and reason over their internal knowledge. It ingests documents from Confluence, Jira, GitHub, and uploaded files, then uses a multi-agent LangGraph pipeline backed by Google Gemini to answer natural language questions with citations.
+
+## Core Problem
+
+Enterprise teams lose hours every week hunting for information scattered across Confluence wikis, Jira tickets, GitHub repos, and internal PDFs. Godspeed unifies all of this into a single queryable knowledge base with real-time answers.
+
+## Key Features
+
+- **Hybrid Search**: Combines dense vector search (BGE-M3), sparse lexical search, and BM25 for maximum recall
+- **Reranking**: BGE-reranker-v2-m3 scores retrieved chunks by relevance before synthesis
+- **PII Masking**: GLiNER model automatically redacts personal data before any external API call
+- **Multi-Agent Orchestration**: LangGraph graph with parallel retrieval agents (doc_search, ticket_lookup, live_docs)
+- **Streaming Answers**: Server-Sent Events (SSE) stream the answer token by token to the client
+- **Knowledge Graph**: Neo4j graph of entities (Services, Libraries, Incidents, Teams) extracted from documents
+- **Guardrail**: Gemini Flash scores every answer for safety before delivery
+- **CAG Snapshots**: Nightly Celery job builds a Context-Augmented Generation snapshot per team
+
+## Tech Stack
+
+- **Orchestration**: LangGraph + LangChain + Google Gemini 2.5 Pro/Flash
+- **Embeddings**: BAAI/bge-m3 (1024-dim dense + sparse lexical weights)
+- **Reranker**: BAAI/bge-reranker-v2-m3
+- **Vector DB**: Qdrant (hybrid dense + sparse collection)
+- **Metadata DB**: Supabase (PostgreSQL with RLS)
+- **Graph DB**: Neo4j (entity relationships)
+- **Task Queue**: Celery + Redis
+- **API**: FastAPI with SSE streaming
+- **PII**: GLiNER mediumv2.1
+
+## Team
+
+- Adithya Vardan — Backend architecture, agent orchestration, ingestion pipeline
+- Samyuktha — Project coordination, documentation
+- Ananth Shyam — Jira agent, Confluence agent, File agent, integration
data_sources/incident_runbook.md
ADDED
@@ -0,0 +1,106 @@
+# Incident Runbook — Godspeed Platform
+
+## INC-001: Qdrant Connection Refused
+
+**Symptoms:** Agent queries return 0 chunks, logs show "Qdrant search failed"
+
+**Root cause:** Qdrant Docker container stopped (usually after system restart)
+
+**Resolution:**
+```bash
+docker ps -a | grep qdrant
+docker start qdrant
+curl http://localhost:6333/healthz  # should return "healthz check passed"
+```
+
+**Prevention:** Add Qdrant to Docker restart policy:
+```bash
+docker update --restart=always qdrant
+```
+
+---
+
+## INC-002: Supabase RLS Policy Violation
+
+**Symptoms:** Ingestion fails with "new row violates row-level security policy"
+
+**Root cause:** Using anon key instead of service_role key in SUPABASE_KEY env var
+
+**Resolution:**
+1. Go to Supabase → Project Settings → API
+2. Copy the `service_role` (secret) key
+3. Update `.env`: `SUPABASE_KEY=<service_role_key>`
+4. Restart the server
+
+---
+
+## INC-003: BGE-M3 Model OOM on Low-RAM Machine
+
+**Symptoms:** Server crashes with MemoryError or process killed during first query
+
+**Root cause:** BGE-M3 model requires ~4GB RAM. Machines with <8GB RAM may OOM.
+
+**Resolution:**
+- Upgrade to a machine with ≥16GB RAM for production
+- For development: set `use_fp16=True` (already set) to halve memory usage
+- Or reduce embed_batch_size in `.env`: `EMBED_BATCH_SIZE=8`
+
+---
+
+## INC-004: Celery Tasks Not Processing
+
+**Symptoms:** Webhook triggers return "accepted" but chunks never appear in Supabase
+
+**Root cause:** Celery worker not running
+
+**Resolution:**
+```bash
+# Check if worker is running
+ps aux | grep celery
+
+# Start worker
+celery -A ingestion.jobs.celery_app worker --loglevel=info
+
+# Check Redis
+redis-cli ping  # should return PONG
+```
+
+---
+
+## INC-005: Jira Sync Returns 0 Issues
+
+**Symptoms:** /jira/sync/{project} returns task accepted but 0 chunks stored
+
+**Root cause options:**
+1. Project has no issues yet
+2. Wrong project key
+3. API token expired or wrong email
+
+**Resolution:**
+```bash
+# Verify auth
+curl -u "your-email:your-token" \
+  "https://your-org.atlassian.net/rest/api/3/myself"
+
+# List available projects
+curl -u "your-email:your-token" \
+  "https://your-org.atlassian.net/rest/api/3/project/search"
+```
+
+---
+
+## INC-006: Confluence Sync 404 Error
+
+**Symptoms:** Confluence sync fails with "404 Not Found", URL has double `/wiki/wiki/`
+
+**Root cause:** CONFLUENCE_BASE_URL incorrectly includes `/wiki` suffix
+
+**Resolution:**
+In `.env`, set:
+```
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+```
+Not:
+```
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net/wiki  ← WRONG
+```
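The double `/wiki/wiki/` in INC-006 can also be guarded against in code by normalising the configured base URL before any path is appended. This is a sketch of that idea only. `normalize_confluence_base_url` is a hypothetical helper, and nothing in this commit indicates that the project does this.

```python
def normalize_confluence_base_url(url: str) -> str:
    """Strip trailing slashes and a trailing `/wiki` so it is not added twice."""
    url = url.strip().rstrip("/")
    if url.endswith("/wiki"):
        url = url[: -len("/wiki")]
    return url
```

With this in place, both `https://your-org.atlassian.net` and `https://your-org.atlassian.net/wiki/` resolve to the same base.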
data_sources/setup_guide.md
ADDED
@@ -0,0 +1,108 @@
+# Godspeed Setup Guide
+
+## Prerequisites
+
+- Python 3.11+
+- Docker (for Qdrant)
+- Redis (on macOS: `brew install redis`)
+- Supabase account
+- Google AI Studio API key (Gemini)
+
+## Quick Start
+
+### 1. Clone and install
+
+```bash
+git clone https://github.com/samyuktha2004/Godspeed.git
+cd Godspeed
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+
+### 2. Start infrastructure
+
+```bash
+# Redis (macOS)
+brew services start redis
+
+# Qdrant
+docker run -d --name qdrant -p 6333:6333 qdrant/qdrant
+```
+
+### 3. Configure environment
+
+Copy `.env.example` to `.env` and fill in:
+
+```
+GOOGLE_API_KEY=your-gemini-key
+SUPABASE_URL=https://your-project.supabase.co
+SUPABASE_KEY=your-service-role-key
+JIRA_BASE_URL=https://your-org.atlassian.net
+JIRA_EMAIL=you@your-org.com
+JIRA_API_TOKEN=your-atlassian-token
+CONFLUENCE_BASE_URL=https://your-org.atlassian.net
+CONFLUENCE_EMAIL=you@your-org.com
+CONFLUENCE_TOKEN=your-atlassian-token
+CONFLUENCE_SPACES=YOUR_SPACE_KEY
+```
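A quick way to catch a half-filled `.env` before starting the server is to check that every variable listed above is present and non-empty. The sketch below assumes the variables are already loaded into the process environment, however the project does that. `missing_env_vars` is a hypothetical helper.

```python
import os

REQUIRED_VARS = [
    "GOOGLE_API_KEY", "SUPABASE_URL", "SUPABASE_KEY",
    "JIRA_BASE_URL", "JIRA_EMAIL", "JIRA_API_TOKEN",
    "CONFLUENCE_BASE_URL", "CONFLUENCE_EMAIL", "CONFLUENCE_TOKEN",
    "CONFLUENCE_SPACES",
]


def missing_env_vars(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no argument checks the real environment, and passing a dict makes the check easy to test.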
+
+### 4. Run Supabase schema
+
+Open your Supabase project → SQL Editor → paste and run `supabase/schema.sql`.
+
+Also run this to add the `qdrant_id` column:
+```sql
+ALTER TABLE chunks ADD COLUMN IF NOT EXISTS qdrant_id text;
+```
+
+### 5. Start the server
+
+```bash
+uvicorn main:app --port 8000
+```
+
+### 6. Start Celery worker (optional, for background jobs)
+
+```bash
+celery -A ingestion.jobs.celery_app worker --loglevel=info
+celery -A ingestion.jobs.celery_app beat --loglevel=info
+```
+
+## Testing the System
+
+### Ingest Confluence
+```bash
+curl -X POST http://localhost:8000/confluence/sync/YOUR_SPACE_KEY
+```
+
+### Query the agent
+```bash
+curl -X POST http://localhost:8000/agent/query \
+  -H "Content-Type: application/json" \
+  -d '{"query": "What is our deployment process?", "team_id": "default", "session_id": "s1"}'
+```
+
+## Troubleshooting
+
+### Server won't start
+- Check that all env vars are set in `.env`
+- Make sure Redis and Qdrant are running
+- Run `python -c "import main"` to check for import errors
+
+### Agent returns low confidence
+- The knowledge base may not have enough relevant content
+- Run a Confluence sync to ingest more pages
+- Check that the Supabase `chunks` table has rows
+
+### Qdrant connection refused
+- Start Docker and run: `docker start qdrant`
+- Or: `docker run -d --name qdrant -p 6333:6333 qdrant/qdrant`
+
+### Supabase RLS error
+- Use the `service_role` key, not the `anon` key
+
+### First query is slow (30-60s)
+- BGE-M3 and reranker models download on first use (~1.5GB total)
+- Subsequent queries are fast (models cached in memory)
src/file_agent/parsers/__init__.py
CHANGED
@@ -32,4 +32,4 @@ def dispatch(path: str, fmt: str) -> list[Block]:
 
 
 # Import parsers to trigger registration
-from src.file_agent.parsers import pdf, docx, xml, csv, html  # noqa: F401, E402
+from src.file_agent.parsers import pdf, docx, xml, csv, html, text  # noqa: F401, E402
src/file_agent/parsers/text.py
ADDED
@@ -0,0 +1,12 @@
+from __future__ import annotations
+
+from src.file_agent.parsers import Block, register
+
+
+@register("text")
+def parse_text(path: str) -> list[Block]:
+    with open(path, encoding="utf-8", errors="replace") as f:
+        content = f.read().strip()
+    if not content:
+        return []
+    return [{"type": "text", "content": content}]