Spaces: nothingworry/IntegraChat

nothingworry committed on Nov 18, 2025

Commit

eb29e58

1 Parent(s): 73fd1fc

update the readme file

Files changed (1)

README.md +192 -60

README.md CHANGED

@@ -77,8 +77,22 @@ Agents can intelligently:
 - **Strict multi-tenant isolation** with tenant_id filtering
 - **Intelligent text chunking** (~300 words per chunk)
 - **Vector similarity search** using cosine distance
-- **Async document ingestion** via Celery workers (PDF, DOCX, TXT support)
-- **Knowledge base management UI** for document upload and search
 ### 3. 🌐 Live Web Search Tool
@@ -126,7 +140,27 @@ Comprehensive insights for:
 - **Real-time analytics panel** in the frontend UI
 - **Async analytics processing** via Celery workers
-### 7. 🏢 Multi-Tenant Isolation
 Each tenant gets:
@@ -155,6 +189,9 @@ Isolation is guaranteed via **Supabase Row-Level Security (RLS)**.
 | **Slack / Email** | Alerting system |
 | **Celery** | Async task queue for document ingestion and analytics |
 | **Redis / RabbitMQ** | Message broker for Celery workers |
 ### Frontend
@@ -228,29 +265,33 @@ IntegraChat/
 │   │   │   ├── intent_classifier.py
 │   │   │   ├── redflag_detector.py
 │   │   │   ├── tool_selector.py
 │   │   │   ├── prompt_builder.py
-│   │   │   └── llm_client.py
 │   │   ├── mcp_clients/
 │   │   │   ├── rag_client.py
 │   │   │   ├── web_client.py
 │   │   │   └── admin_client.py
 │   │   ├── models/
-│   │   │   ├── requests.py
-│   │   │   ├── responses.py
-│   │   │   ├── agent_decision.py
-│   │   │   └── embeddings.py
 │   │   ├── utils/
-│   │   │   ├── logging.py
-│   │   │   ├── supabase_client.py
-│   │   │   ├── tenant_context.py
-│   │   │   ├── security.py
 │   │   │   └── text_extractor.py
 │   │   └── config.py
 │   │
 │   ├── mcp_servers/
 │   │   ├── main.py              # RAG MCP Server (FastAPI)
-│   │   ├── database.py          # Supabase/PostgreSQL connection
-│   │   └── embeddings.py        # Sentence transformers embeddings
 │   │
 │   ├── workers/
 │   │   ├── ingestion_worker.py    # Celery tasks for document ingestion
@@ -258,10 +299,6 @@ IntegraChat/
 │   │   ├── scheduler.py           # Scheduled task definitions
 │   │   └── celeryconfig.py        # Celery app configuration
 │   │
-│   ├── api/
-│   │   ├── ingestion/
-│   │   │   └── pdf.py             # PDF processing utilities
-│   │
 │   ├── tests/
 │   │   ├── test_agent.py
 │   │   ├── test_rag.py
@@ -277,12 +314,11 @@ IntegraChat/
 │   │   ├── layout.tsx              # Root layout
 │   │   ├── globals.css            # Global styles
 │   │   └── knowledge-base/
-│   │       └── page.tsx           # Knowledge base management page
 │   ├── components/
 │   │   ├── chat-panel.tsx         # Chat interface component
 │   │   ├── analytics-panel.tsx    # Analytics dashboard component
 │   │   ├── knowledge-base-panel.tsx # Knowledge base search/ingest UI
-│   │   ├── ingestion-card.tsx     # Document ingestion card
 │   │   ├── hero.tsx               # Hero section
 │   │   ├── feature-grid.tsx       # Feature showcase grid
 │   │   └── footer.tsx             # Footer component
@@ -333,12 +369,13 @@ IntegraChat/
 Before you begin, ensure you have the following installed:
 - ✅ **Python 3.10+**
-- ✅ **Node.js 18+** (for frontend)
 - ✅ **Supabase project** (with pgvector extension enabled)
 - ✅ **PostgreSQL connection string** (from Supabase)
-- ✅ **Ollama** (for local LLM) or **Groq API key** (for cloud LLM)
-- ✅ **DuckDuckGo Search key** (optional, if configured)
 - ✅ **Slack/Email webhook** for alerts (optional)
 ### Backend Setup
@@ -368,23 +405,45 @@ Before you begin, ensure you have the following installed:
    # LLM Configuration
    OLLAMA_URL=http://localhost:11434
-   OLLAMA_MODEL=llama3
-   # Or use Groq instead:
-   # GROQ_API_KEY=your_groq_api_key
    # Celery Configuration (for async workers)
    CELERY_BROKER_URL=redis://localhost:6379/0
    CELERY_RESULT_BACKEND=redis://localhost:6379/0
    ```
-4. **Start the RAG MCP Server**
    ```bash
    cd backend/mcp_servers
    python main.py
    ```
-   The server will automatically initialize the database schema on startup.
    - Server runs on `http://localhost:8001`
-   - API docs available at `http://localhost:8001/docs`
 5. **Start Celery workers** (for async document ingestion and analytics)
    ```bash
@@ -398,60 +457,101 @@ Before you begin, ensure you have the following installed:
 6. **Run the main API server**
    ```bash
    cd backend
-   uvicorn api.main:app --reload
    ```
-### RAG MCP Server API
-The RAG MCP Server provides two main endpoints:
-**Ingest Documents:**
 ```bash
-curl -X POST http://localhost:8001/ingest \
   -H "Content-Type: application/json" \
   -d '{
-    "tenant_id": "tenant123",
-    "content": "Your document text here..."
   }'
 ```
-**Semantic Search:**
 ```bash
-curl -X POST http://localhost:8001/search \
   -H "Content-Type: application/json" \
   -d '{
-    "tenant_id": "tenant123",
     "query": "What are the HR policies?"
   }'
 ```
-### Document Ingestion via Celery Workers
-Documents can be ingested asynchronously using Celery workers:
-**Via API endpoint (triggers Celery task):**
 ```bash
 curl -X POST http://localhost:8000/rag/ingest \
   -H "Content-Type: application/json" \
   -H "x-tenant-id: tenant123" \
   -d '{
-    "content": "Your document text here...",
-    "doc_id": "doc_001"
   }'
 ```
 **Supported formats:**
-- Raw text content
-- PDF files (via file upload)
-- DOCX files (via file upload)
-- TXT files (via file upload)
-- URLs (web page content)
-The ingestion worker automatically:
-- Extracts text from files
-- Chunks text with configurable overlap
-- Generates embeddings using Sentence-Transformers
-- Stores chunks and embeddings in Supabase/pgvector
 ### Frontend Setup
@@ -478,9 +578,39 @@ The ingestion worker automatically:
    ```
    The app will be available at `http://localhost:3000` with:
-   - **Main landing page** with hero, features, and chat panel
-   - **Knowledge base page** (`/knowledge-base`) for document management
-   - **Analytics panel** showing query metrics and tool usage
 ### Quick Start with Docker
@@ -501,7 +631,9 @@ docker-compose up -d
 | 🌐 **English Web Search** | Forces English language results for better accuracy |
 | 🏢 **Production-Grade** | Multi-tenant design with strict Supabase RLS |
 | 📊 **Full Observability** | Logs, analytics, tool events, violations |
-| 📚 **Knowledge Base UI** | Complete document management interface with search and ingestion |
 | ⚡ **Async Processing** | Celery workers for scalable document ingestion and analytics |
 | 🎯 **Demo-Ready** | Perfect for enterprise presentations |

 - **Strict multi-tenant isolation** with tenant_id filtering
 - **Intelligent text chunking** (~300 words per chunk)
 - **Vector similarity search** using cosine distance
+- **Multi-format document ingestion**:
+  - **PDF files** - Server-side parsing with PyPDF2
+  - **DOCX files** - Server-side parsing with python-docx
+  - **TXT/Markdown files** - Direct text ingestion
+  - **URLs** - Automatic content fetching and extraction
+  - **Raw text** - Direct paste and ingest
+- **File upload endpoint** (`/rag/ingest-file`) for binary file processing
+- **Enhanced ingestion API** (`/rag/ingest-document`) with metadata support
+- **Document listing** (`/rag/list`) with pagination and filtering
+- **Knowledge base management UI**:
+  - Search interface with semantic search
+  - File upload with drag-and-drop support
+  - Source type selection (PDF, DOCX, TXT, URL, raw text)
+  - Document library page showing all ingested content
+  - Filter by document type (PDF, FAQ, Link, Text)
+- **Async document ingestion** via Celery workers (optional)
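The tenant-filtered cosine search listed above can be illustrated in plain Python. This is a minimal sketch, not the project's implementation: the in-memory `chunks` list, the `search` helper, and the record keys (`tenant_id`, `text`, `embedding`) are hypothetical stand-ins for the pgvector-backed store.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as maximally unrelated.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def search(chunks, tenant_id, query_vec, top_k=3):
    """Rank one tenant's chunks by cosine distance to the query vector.

    Filtering on tenant_id happens before ranking, so another tenant's
    chunks can never appear in the result. (Hypothetical record layout.)
    """
    candidates = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(
        candidates,
        key=lambda c: cosine_distance(c["embedding"], query_vec),
    )
    return ranked[:top_k]
```

Filtering before ranking keeps the isolation guarantee independent of `top_k`: a chunk from another tenant is excluded even when it is the closest vector overall.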
 ### 3. 🌐 Live Web Search Tool
 - **Real-time analytics panel** in the frontend UI
 - **Async analytics processing** via Celery workers
+### 7. 📄 Document Ingestion System
+Complete document management workflow:
+- **Multiple ingestion methods**:
+  - File upload (PDF, DOCX, TXT, MD)
+  - URL fetching with HTML extraction
+  - Raw text pasting
+  - Programmatic API ingestion
+- **Automatic type detection** from filename or content
+- **Metadata support** (filename, URL, doc_id, custom fields)
+- **Server-side parsing** for binary files (PDF/DOCX)
+- **Text normalization** and sanitization
+- **Knowledge base library** page with:
+  - Document grid view
+  - Type-based filtering (PDF, FAQ, Link, Text)
+  - Search functionality
+  - Document metadata display
+  - Creation date tracking
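The automatic type detection described in this section could work roughly as follows. This is a sketch under stated assumptions: the function name `detect_source_type` and the labels other than `raw_text` (which appears in the ingestion API example) are hypothetical, and the actual detection rules in the repository may differ.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a source-type label.
EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".txt": "txt",
    ".md": "markdown",
}


def detect_source_type(filename=None, content=None):
    """Guess the source type from the filename first, then from the content."""
    if filename:
        extension = Path(filename).suffix.lower()
        if extension in EXTENSION_TYPES:
            return EXTENSION_TYPES[extension]
    if content:
        stripped = content.strip()
        # A single token that starts with an HTTP scheme is treated as a URL.
        is_single_token = len(stripped.split()) == 1
        if is_single_token and stripped.startswith(("http://", "https://")):
            return "url"
    # Anything else falls back to plain text ingestion.
    return "raw_text"
```

Checking the filename before the content means an uploaded file with a known extension is never misread as a URL or raw text.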
+### 8. 🏢 Multi-Tenant Isolation
 Each tenant gets:
 | **Slack / Email** | Alerting system |
 | **Celery** | Async task queue for document ingestion and analytics |
 | **Redis / RabbitMQ** | Message broker for Celery workers |
+| **PyPDF2** | PDF text extraction |
+| **python-docx** | DOCX text extraction |
+| **python-multipart** | File upload handling |
 ### Frontend
 │   │   │   ├── intent_classifier.py
 │   │   │   ├── redflag_detector.py
 │   │   │   ├── tool_selector.py
+│   │   │   ├── tool_scoring.py
+│   │   │   ├── semantic_encoder.py
 │   │   │   ├── prompt_builder.py
+│   │   │   ├── llm_client.py
+│   │   │   └── document_ingestion.py
 │   │   ├── mcp_clients/
 │   │   │   ├── rag_client.py
 │   │   │   ├── web_client.py
 │   │   │   └── admin_client.py
 │   │   ├── models/
+│   │   │   ├── agent.py
+│   │   │   └── redflag.py
 │   │   ├── utils/
 │   │   │   └── text_extractor.py
 │   │   └── config.py
 │   │
 │   ├── mcp_servers/
 │   │   ├── main.py              # RAG MCP Server (FastAPI)
+│   │   ├── rag_server.py        # Alternative RAG server implementation
+│   │   ├── web_server.py        # Web search MCP server
+│   │   ├── admin_server.py      # Admin governance MCP server
+│   │   ├── database.py          # Supabase/PostgreSQL connection + pgvector
+│   │   ├── embeddings.py        # Sentence transformers embeddings
+│   │   └── models/
+│   │       ├── rag.py
+│   │       ├── web.py
+│   │       └── admin.py
 │   │
 │   ├── workers/
 │   │   ├── ingestion_worker.py    # Celery tasks for document ingestion
 │   │   ├── scheduler.py           # Scheduled task definitions
 │   │   └── celeryconfig.py        # Celery app configuration
 │   │
 │   ├── tests/
 │   │   ├── test_agent.py
 │   │   ├── test_rag.py
 │   │   ├── layout.tsx              # Root layout
 │   │   ├── globals.css            # Global styles
 │   │   └── knowledge-base/
+│   │       └── page.tsx           # Knowledge base library page
 │   ├── components/
 │   │   ├── chat-panel.tsx         # Chat interface component
 │   │   ├── analytics-panel.tsx    # Analytics dashboard component
 │   │   ├── knowledge-base-panel.tsx # Knowledge base search/ingest UI
 │   │   ├── hero.tsx               # Hero section
 │   │   ├── feature-grid.tsx       # Feature showcase grid
 │   │   └── footer.tsx             # Footer component
 Before you begin, ensure you have the following installed:
 - ✅ **Python 3.10+**
+- ✅ **Node.js 20+ (64-bit)** (for frontend - required for Next.js)
 - ✅ **Supabase project** (with pgvector extension enabled)
 - ✅ **PostgreSQL connection string** (from Supabase)
+- ✅ **Ollama** (for local LLM) - [Installation Guide](#llm-setup)
+- ✅ **DuckDuckGo Search** (built-in, no key required)
 - ✅ **Slack/Email webhook** for alerts (optional)
+- ✅ **Redis/RabbitMQ** (for Celery workers, optional)
 ### Backend Setup
    # LLM Configuration
    OLLAMA_URL=http://localhost:11434
+   OLLAMA_MODEL=llama3.1:latest
+   LLM_BACKEND=ollama
+   # Note: Install Ollama from https://ollama.ai and run: ollama pull llama3.1:latest
    # Celery Configuration (for async workers)
    CELERY_BROKER_URL=redis://localhost:6379/0
    CELERY_RESULT_BACKEND=redis://localhost:6379/0
    ```
+4. **Start the MCP Servers** (in separate terminals, or use the start script)
+   **RAG MCP Server:**
    ```bash
    cd backend/mcp_servers
    python main.py
+   # Or: uvicorn main:app --reload --port 8001
    ```
    - Server runs on `http://localhost:8001`
+   - Automatically initializes database schema on startup
+   - API docs: `http://localhost:8001/docs`
+   **Web MCP Server:**
+   ```bash
+   cd backend/mcp_servers
+   uvicorn web_server:web_app --reload --port 8002
+   ```
+   **Admin MCP Server:**
+   ```bash
+   cd backend/mcp_servers
+   uvicorn admin_server:admin_app --reload --port 8003
+   ```
+   **Or use the start script:**
+   ```bash
+   ./start.bat  # Windows
+   # or
+   ./start.sh   # Linux/Mac
+   ```
 5. **Start Celery workers** (for async document ingestion and analytics)
    ```bash
 6. **Run the main API server**
    ```bash
    cd backend
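+   # The import path backend.api.main resolves from the repository root;
+   # if the import fails after `cd backend`, run this command from the root instead.
+   uvicorn backend.api.main:app --reload --port 8000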
    ```
+   - Server runs on `http://localhost:8000`
+   - API docs: `http://localhost:8000/docs`
+### RAG API Endpoints
+The RAG system provides multiple endpoints:
+**1. Enhanced Document Ingestion (with metadata):**
 ```bash
+curl -X POST http://localhost:8000/rag/ingest-document \
   -H "Content-Type: application/json" \
+  -H "x-tenant-id: tenant123" \
   -d '{
+    "action": "ingest_document",
+    "source_type": "raw_text",
+    "content": "Your document text here...",
+    "metadata": {
+      "filename": "policy.txt",
+      "doc_id": "policy-001"
+    }
   }'
 ```
+**2. File Upload (PDF, DOCX, TXT, MD):**
 ```bash
+curl -X POST http://localhost:8000/rag/ingest-file \
+  -H "x-tenant-id: tenant123" \
+  -F "file=@document.pdf"
+```
+**3. Semantic Search:**
+```bash
+curl -X POST http://localhost:8000/rag/search \
   -H "Content-Type: application/json" \
+  -H "x-tenant-id: tenant123" \
   -d '{
     "query": "What are the HR policies?"
   }'
 ```
+**4. List All Documents:**
+```bash
+curl -X GET "http://localhost:8000/rag/list?limit=100&offset=0" \
+  -H "x-tenant-id: tenant123"
+```
+**5. Legacy Simple Ingestion:**
 ```bash
 curl -X POST http://localhost:8000/rag/ingest \
   -H "Content-Type: application/json" \
   -H "x-tenant-id: tenant123" \
   -d '{
+    "content": "Your document text here..."
   }'
 ```
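The same calls can be made from Python. The sketch below mirrors the `curl` examples for `/rag/ingest-document` and `/rag/search` using only the standard library. The helper names are hypothetical, and the assumption that responses are JSON is not confirmed by the README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # main API server from the setup steps


def _post_json(path, tenant_id, payload):
    """Build a POST request carrying a JSON body and the x-tenant-id header."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-tenant-id": tenant_id,
        },
        method="POST",
    )


def build_ingest_request(tenant_id, content, filename=None, doc_id=None):
    """Request for /rag/ingest-document with optional metadata fields."""
    metadata = {}
    if filename:
        metadata["filename"] = filename
    if doc_id:
        metadata["doc_id"] = doc_id
    payload = {
        "action": "ingest_document",
        "source_type": "raw_text",
        "content": content,
        "metadata": metadata,
    }
    return _post_json("/rag/ingest-document", tenant_id, payload)


def build_search_request(tenant_id, query):
    """Request for /rag/search scoped to one tenant."""
    return _post_json("/rag/search", tenant_id, {"query": query})


def send(request):
    """Send a prepared request; assumes the server replies with JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, `send(build_search_request("tenant123", "What are the HR policies?"))` would issue the same request as the semantic search example above.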
+### Document Ingestion System
+Documents can be ingested through multiple methods:
 **Supported formats:**
+- **PDF files** - Server-side parsing with PyPDF2
+- **DOCX files** - Server-side parsing with python-docx
+- **TXT/Markdown files** - Direct text ingestion
+- **URLs** - Automatic content fetching and HTML extraction
+- **Raw text** - Direct paste and ingest
+**Ingestion methods:**
+1. **File Upload** (recommended for PDF/DOCX):
+   - Use the frontend UI or `/rag/ingest-file` endpoint
+   - Files are parsed server-side automatically
+2. **Enhanced API** (with metadata):
+   - Use `/rag/ingest-document` for structured ingestion
+   - Supports filename, URL, doc_id, and custom metadata
+3. **Simple API** (legacy):
+   - Use `/rag/ingest` for quick text ingestion
+**The ingestion process automatically:**
+- Detects document type from filename or content
+- Extracts text (PDF/DOCX parsed server-side)
+- Normalizes and sanitizes text
+- Chunks text with configurable overlap (~300 words)
+- Generates embeddings using Sentence-Transformers (MiniLM)
+- Stores chunks and embeddings in pgvector
+- Preserves metadata (filename, URL, doc_id)
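The chunking step in the list above can be sketched as a sliding window over words. The README gives a chunk size of roughly 300 words and says the overlap is configurable. The default overlap of 50 words and the function name `chunk_words` are assumptions made for illustration.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears intact in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            # The window already reached the end of the text.
            break
    return chunks
```

For a 700-word document with these defaults, the windows start at words 0, 250, and 500, giving three chunks in which each neighbouring pair shares 50 words.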
+**Optional: Async Processing via Celery**
+- For large-scale ingestion, use Celery workers
+- Configure `CELERY_BROKER_URL` in `.env`
+- Workers process ingestion tasks asynchronously
 ### Frontend Setup
    ```
    The app will be available at `http://localhost:3000` with:
+   - **Main landing page** (`/`) with:
+     - Hero section and feature overview
+     - **Knowledge Base Panel** - Search and ingest documents
+     - **Chat Panel** - Interact with the AI agent
+     - **Analytics Panel** - View metrics and tool usage
+   - **Knowledge Base Library** (`/knowledge-base`) - Browse all ingested documents with filtering
+### LLM Setup
+**Ollama (Recommended for Local Development):**
+1. **Install Ollama:**
+   - Download from https://ollama.ai
+   - Install and start the service
+2. **Pull a model:**
+   ```bash
+   ollama pull llama3.1:latest
+   ```
+3. **Verify it's running:**
+   ```bash
+   curl http://localhost:11434/api/tags
+   ```
+4. **Configure in `.env`:**
+   ```env
+   OLLAMA_URL=http://localhost:11434
+   OLLAMA_MODEL=llama3.1:latest
+   LLM_BACKEND=ollama
+   ```
+**Note:** If Ollama is not running, the system reports an error that includes setup instructions.
 ### Quick Start with Docker
 | 🌐 **English Web Search** | Forces English language results for better accuracy |
 | 🏢 **Production-Grade** | Multi-tenant design with strict Supabase RLS |
 | 📊 **Full Observability** | Logs, analytics, tool events, violations |
+| 📚 **Knowledge Base UI** | Complete document management with search, ingestion, and library view |
+| 📄 **Multi-Format Ingestion** | PDF, DOCX, TXT, URL, and raw text support with server-side parsing |
+| 🔍 **Document Library** | Browse, filter, and search all ingested documents |
 | ⚡ **Async Processing** | Celery workers for scalable document ingestion and analytics |
 | 🎯 **Demo-Ready** | Perfect for enterprise presentations |