# Document Ingestion System

## Overview
The backend now supports a comprehensive document ingestion system that matches the system prompt specification. This allows AI agents to automatically detect and ingest various document types (PDF, DOCX, TXT, URLs, raw text) with full metadata support.
## Endpoints

### 1. Legacy Endpoint (Backward Compatible)

```
POST /rag/ingest
```

**Headers:**

```
x-tenant-id: <tenant_id>
```

**Body:**

```json
{
  "content": "text content to ingest"
}
```
### 2. Enhanced Document Ingestion Endpoint

```
POST /rag/ingest-document
```

**Headers:**

```
x-tenant-id: <tenant_id>   (optional if in body)
```

**Body:**

```jsonc
{
  "action": "ingest_document",
  "tenant_id": "<tenant_id>",  // Optional if in header
  "source_type": "pdf | docx | txt | url | raw_text | markdown",  // Auto-detected if not provided
  "content": "text content or URL",
  "metadata": {
    "filename": "document.pdf",
    "url": "https://example.com/doc",
    "doc_id": "unique-document-id"
  }
}
```
## Features

### Automatic Source Type Detection

- **PDF**: detected from a `.pdf` extension or filename
- **DOCX**: detected from a `.docx` or `.doc` extension
- **TXT**: detected from a `.txt` or `.text` extension
- **Markdown**: detected from a `.md` or `.markdown` extension
- **URL**: detected from a URL in the content or metadata
- **Raw Text**: default fallback for plain text
### URL Processing
- Automatically fetches content from URLs
- Strips HTML tags and scripts
- Normalizes whitespace
- Handles redirects and timeouts
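The fetch itself would typically use an HTTP client with a timeout and redirect handling; the clean-up step might look like this regex-based sketch (assumed, not the backend's actual code — a real implementation might prefer an HTML parser):

```python
import re

def strip_html(html: str) -> str:
    """Drop <script>/<style> blocks, remove tags, normalize whitespace."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```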
### Text Normalization
- Removes excessive whitespace
- Strips control characters
- Sanitizes input before ingestion
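A minimal sketch of those normalization rules (illustrative only):

```python
import re

def normalize_text(text: str) -> str:
    # Strip control characters (keep \t and \n, which collapse below).
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```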
### Metadata Support

- `filename`: Original filename
- `url`: Source URL
- `doc_id`: Unique document identifier (auto-generated if not provided)
- Custom metadata can be added to the `metadata` object
## Usage Examples

### Example 1: Ingest Raw Text

```json
{
  "action": "ingest_document",
  "tenant_id": "tenant123",
  "source_type": "raw_text",
  "content": "This is a company policy document...",
  "metadata": {
    "filename": "policy.txt",
    "doc_id": "policy-2024-01"
  }
}
```
### Example 2: Ingest from URL

```json
{
  "action": "ingest_document",
  "tenant_id": "tenant123",
  "source_type": "url",
  "content": "https://example.com/documentation",
  "metadata": {
    "url": "https://example.com/documentation",
    "doc_id": "docs-example-com"
  }
}
```
### Example 3: Ingest PDF (with extracted text)

```json
{
  "action": "ingest_document",
  "tenant_id": "tenant123",
  "source_type": "pdf",
  "content": "<extracted PDF text>",
  "metadata": {
    "filename": "manual.pdf",
    "doc_id": "manual-2024"
  }
}
```
## Response Format

```json
{
  "status": "ok",
  "message": "Document ingested successfully. 5 chunk(s) stored.",
  "tenant_id": "tenant123",
  "source_type": "raw_text",
  "doc_id": "policy-2024-01",
  "chunks_stored": 5,
  "metadata": {
    "filename": "policy.txt",
    "doc_id": "policy-2024-01"
  }
}
```
## Integration with AI Agents
The system is designed to work with AI agents that follow the system prompt specification:
1. Agent detects document/URL/pasted content
2. Agent prepares the ingestion payload with the proper structure
3. Agent sends it to `POST /rag/ingest-document`
4. Backend processes the request:
   - Detects/validates the source type
   - Fetches URL content if needed
   - Normalizes the text
   - Sends it to the RAG MCP server for chunking/embedding
   - Stores the result in pgvector
5. Agent confirms ingestion to the user
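The backend steps above can be sketched end to end. Every name here (`process_ingest`, `fetch_url`, `rag_ingest`) is an illustrative stand-in for whatever the backend actually calls, and the source-type check is simplified to the URL case:

```python
import re
import uuid

def process_ingest(payload: dict, fetch_url, rag_ingest) -> dict:
    """Sketch of the backend pipeline: validate, fetch, normalize, store."""
    tenant_id = payload.get("tenant_id")
    content = payload.get("content", "")
    if not tenant_id or not content.strip():
        raise ValueError("missing tenant_id or empty content")  # -> 400
    metadata = payload.get("metadata", {})
    # Simplified detection: explicit source_type, else URL vs. raw text.
    source_type = payload.get("source_type") or (
        "url" if content.strip().startswith(("http://", "https://")) else "raw_text"
    )
    if source_type == "url":
        content = fetch_url(content.strip())        # fetch + strip HTML
    content = re.sub(r"\s+", " ", content).strip()  # normalize
    doc_id = metadata.get("doc_id") or str(uuid.uuid4())
    chunks = rag_ingest(tenant_id, content, {**metadata, "doc_id": doc_id})
    return {
        "status": "ok",
        "message": f"Document ingested successfully. {chunks} chunk(s) stored.",
        "tenant_id": tenant_id,
        "source_type": source_type,
        "doc_id": doc_id,
        "chunks_stored": chunks,
        "metadata": {**metadata, "doc_id": doc_id},
    }
```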
## Error Handling
- **400 Bad Request**: missing `tenant_id`, invalid payload, or empty content
- **500 Internal Server Error**: RAG MCP server error, database error, or URL fetch failure
## Notes
- The legacy `/rag/ingest` endpoint remains for backward compatibility
- Source type is auto-detected if not provided
- URL fetching is async and handles timeouts gracefully
- All content is normalized before ingestion
- Metadata is preserved and stored with chunks