# Document Ingestion System

## Overview

The backend now supports a comprehensive document ingestion system that matches the system prompt specification. It allows AI agents to automatically detect and ingest various document types (PDF, DOCX, TXT, Markdown, URLs, raw text), with per-document metadata.

## Endpoints

### 1. Legacy Endpoint (Backward Compatible)
```
POST /rag/ingest
Headers:
  x-tenant-id: <tenant_id>
Body:
  {
    "content": "text content to ingest"
  }
```

### 2. Enhanced Document Ingestion Endpoint
```
POST /rag/ingest-document
Headers:
  x-tenant-id: <tenant_id> (optional if in body)
Body:
  {
    "action": "ingest_document",
    "tenant_id": "<tenant_id>",  // Optional if in header
    "source_type": "pdf | docx | txt | url | raw_text | markdown",  // Auto-detected if not provided
    "content": "text content or URL",
    "metadata": {
      "filename": "document.pdf",
      "url": "https://example.com/doc",
      "doc_id": "unique-document-id"
    }
  }
```
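As an illustration, the request body above can be assembled client-side before posting. This sketch only builds the JSON body; the field names match the spec above, but the `build_ingest_payload` helper itself is hypothetical, not part of the backend.

```python
def build_ingest_payload(tenant_id, content, source_type=None, metadata=None):
    """Assemble a request body for POST /rag/ingest-document (hypothetical helper)."""
    payload = {
        "action": "ingest_document",
        "tenant_id": tenant_id,
        "content": content,
        "metadata": dict(metadata or {}),
    }
    # Omit source_type entirely so the backend auto-detects it.
    if source_type is not None:
        payload["source_type"] = source_type
    return payload
```

The resulting dict can then be serialized and sent with any HTTP client, along with the `x-tenant-id` header.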

## Features

### Automatic Source Type Detection
- **PDF**: Detected from a `.pdf` extension on the `filename` metadata field
- **DOCX**: Detected from a `.docx` or `.doc` extension
- **TXT**: Detected from `.txt` or `.text` extension
- **Markdown**: Detected from `.md` or `.markdown` extension
- **URL**: Detected from URL in content or metadata
- **Raw Text**: Default fallback for plain text
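The detection rules above can be sketched as a pure function. This is an illustrative reimplementation, not the backend's actual code; the `detect_source_type` name and signature are assumptions.

```python
def detect_source_type(content, metadata=None):
    """Guess a source type from the filename extension or a URL (illustrative sketch)."""
    metadata = metadata or {}
    filename = (metadata.get("filename") or "").lower()
    extension_map = [
        ((".pdf",), "pdf"),
        ((".docx", ".doc"), "docx"),
        ((".txt", ".text"), "txt"),
        ((".md", ".markdown"), "markdown"),
    ]
    for extensions, source_type in extension_map:
        if filename.endswith(extensions):
            return source_type
    # A URL in the content or metadata wins over the raw-text fallback.
    if content.strip().startswith(("http://", "https://")) or metadata.get("url"):
        return "url"
    return "raw_text"
```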

### URL Processing
- Automatically fetches content from URLs
- Strips HTML tags and scripts
- Normalizes whitespace
- Handles redirects and timeouts
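The tag-stripping step might look like the sketch below, which drops `<script>`/`<style>` blocks and any remaining tags with stdlib regexes. The real backend may well use a proper HTML parser instead; this is only an approximation of the behavior described above.

```python
import re

def strip_html(html):
    """Reduce an HTML page to its visible text (illustrative sketch)."""
    # Remove script and style blocks entirely, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()
```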

### Text Normalization
- Removes excessive whitespace
- Strips control characters
- Sanitizes input before ingestion
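A minimal normalization sketch, assuming control characters are dropped and whitespace runs (including newlines) collapse to single spaces; whether the backend preserves line breaks is an implementation detail not specified here.

```python
import re

def normalize_text(text):
    """Strip control characters and collapse whitespace runs (sketch)."""
    # Drop ASCII control characters (whitespace is handled separately below).
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Collapse all whitespace, including newlines, to single spaces.
    return re.sub(r"\s+", " ", text).strip()
```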

### Metadata Support
- `filename`: Original filename
- `url`: Source URL
- `doc_id`: Unique document identifier (auto-generated if not provided)
- Custom metadata can be added to the metadata object
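For instance, a missing `doc_id` could be auto-generated as below. The UUID scheme is an assumption; the backend may use a different generator.

```python
import uuid

def ensure_doc_id(metadata):
    """Fill in metadata['doc_id'] when the caller did not supply one (sketch)."""
    metadata = dict(metadata or {})
    metadata.setdefault("doc_id", str(uuid.uuid4()))
    return metadata
```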

## Usage Examples

### Example 1: Ingest Raw Text
```json
{
  "action": "ingest_document",
  "tenant_id": "tenant123",
  "source_type": "raw_text",
  "content": "This is a company policy document...",
  "metadata": {
    "filename": "policy.txt",
    "doc_id": "policy-2024-01"
  }
}
```

### Example 2: Ingest from URL
```json
{
  "action": "ingest_document",
  "tenant_id": "tenant123",
  "source_type": "url",
  "content": "https://example.com/documentation",
  "metadata": {
    "url": "https://example.com/documentation",
    "doc_id": "docs-example-com"
  }
}
```

### Example 3: Ingest PDF (with extracted text)
```json
{
  "action": "ingest_document",
  "tenant_id": "tenant123",
  "source_type": "pdf",
  "content": "<extracted PDF text>",
  "metadata": {
    "filename": "manual.pdf",
    "doc_id": "manual-2024"
  }
}
```

## Response Format

```json
{
  "status": "ok",
  "message": "Document ingested successfully. 5 chunk(s) stored.",
  "tenant_id": "tenant123",
  "source_type": "raw_text",
  "doc_id": "policy-2024-01",
  "chunks_stored": 5,
  "metadata": {
    "filename": "policy.txt",
    "doc_id": "policy-2024-01"
  }
}
```

## Integration with AI Agents

The system is designed to work with AI agents that follow the system prompt specification:

1. **Agent detects** document/URL/pasted content
2. **Agent prepares** ingestion payload with proper structure
3. **Agent sends** to `POST /rag/ingest-document`
4. **Backend processes**:
   - Detects/validates source type
   - Fetches URL content if needed
   - Normalizes text
   - Sends to RAG MCP server for chunking/embedding
   - Stores in pgvector
5. **Agent confirms** ingestion to user
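The backend's processing step (4) could be orchestrated roughly as below. All logic here is a simplified stand-in: the detection and normalization are abbreviated, the URL fetch is skipped, and the RAG MCP chunking/embedding call is replaced with a stub chunk count.

```python
import re

def process_document(content, metadata):
    """Rough orchestration of the backend's processing step (sketch; storage is stubbed)."""
    filename = (metadata.get("filename") or "").lower()
    if filename.endswith(".pdf"):
        source_type = "pdf"
    elif content.startswith(("http://", "https://")):
        source_type = "url"  # real backend would fetch the URL here
    else:
        source_type = "raw_text"
    normalized = re.sub(r"\s+", " ", content).strip()
    # Stub: the RAG MCP server would chunk, embed, and store in pgvector.
    chunks_stored = max(1, len(normalized) // 500)
    return {"status": "ok", "source_type": source_type, "chunks_stored": chunks_stored}
```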

## Error Handling

- **400 Bad Request**: Missing tenant_id, invalid payload, empty content
- **500 Internal Server Error**: RAG MCP server error, database error, URL fetch failure
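The 400-level validation could be expressed as a check like the following; the function name and return shape are illustrative, not the backend's actual interface.

```python
def validate_ingest_request(payload, headers):
    """Return (status, message) mirroring the documented 400 cases (sketch)."""
    tenant_id = payload.get("tenant_id") or headers.get("x-tenant-id")
    if not tenant_id:
        return 400, "Missing tenant_id"
    content = payload.get("content")
    if not isinstance(content, str) or not content.strip():
        return 400, "Empty or missing content"
    return 200, "ok"
```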

## Notes

- The legacy `/rag/ingest` endpoint remains for backward compatibility
- Source type is auto-detected if not provided
- URL fetching is async and handles timeouts gracefully
- All content is normalized before ingestion
- Metadata is preserved and stored with chunks