Spaces: BrejBala/rag-agent-workbench-api

BrejBala committed on Jan 17

Commit b09b8a3 (1 parent: e63c592)

final changes with API key

Files changed (14)

backend/.env.example +2 -1
backend/README.md +59 -7
backend/app/core/auth.py +91 -0
backend/app/core/security.py +6 -85
backend/app/main.py +12 -7
docs/CONTEXT.md +56 -45
docs/WORKLOG.md +34 -1
frontend/.streamlit/secrets.toml +0 -1
frontend/app.py +400 -84
frontend/services/__init__.py +1 -0
frontend/services/backend_client.py +25 -0
frontend/services/file_convert.py +63 -0
scripts/batch_ingest_local_folder.py +148 -0
scripts/docling_convert_and_upload.py +79 -20

backend/.env.example CHANGED

@@ -18,7 +18,8 @@ LANGCHAIN_API_KEY=your-langsmith-api-key
 LANGCHAIN_PROJECT=rag-agent-workbench
 # Optional: basic API protection
-# When set, /ingest/*, /documents/*, /search, and /chat* require header X-API-Key
 API_KEY=your-backend-api-key
 # Optional: CORS

 LANGCHAIN_PROJECT=rag-agent-workbench
 # Optional: basic API protection
+# When set, all routers except /health and the OpenAPI/Swagger docs require header X-API-Key.
+# In production-like environments (ENV=production or on Hugging Face Spaces), API_KEY must be set.
 API_KEY=your-backend-api-key
 # Optional: CORS

backend/README.md CHANGED

@@ -48,7 +48,9 @@ Optional for LangSmith tracing:
 Optional for basic API protection:
-- `API_KEY` – when set, `/ingest/*`, `/documents/*`, `/search`, and `/chat*` require `X-API-Key` header.
 Optional for CORS:
@@ -237,21 +239,71 @@ A helper script is provided to seed the index with multiple arXiv and OpenAlex q
 python ../scripts/seed_ingest.py --base-url http://localhost:8000 --namespace dev --mailto you@example.com
 ```
-## Docling integration (external script)
-Docling is used via a separate script so the backend container stays small. To convert a local PDF and upload it as text:
 ```bash
 cd scripts
-pip install docling
 python docling_convert_and_upload.py \
-  --pdf-path /path/to/file.pdf \
   --backend-url http://localhost:8000 \
   --namespace dev \
-  --title "My PDF via Docling" \
-  --source docling
 ```
 ## Deploy Backend on Hugging Face Spaces (Docker)
 1. **Create a new Space**

 Optional for basic API protection:
+- `API_KEY` – when set, all routers except `/health` are protected by `X-API-Key` (including `/chat`, `/search`, `/documents/*`, `/ingest/*`, `/metrics`, and the OpenAPI/Swagger docs).
+  - In production-like environments (`ENV=production` or on Hugging Face Spaces), `API_KEY` **must** be set or the backend will fail to start.
+  - In local development (no Spaces and `ENV` not set to `production`), `API_KEY` is optional; when omitted, the API (including docs) is open.
 Optional for CORS:
 python ../scripts/seed_ingest.py --base-url http://localhost:8000 --namespace dev --mailto you@example.com
 ```
+## Docling integration (external scripts)
+Docling is used via separate scripts so the backend container stays small and does not depend on Docling. To convert local documents and upload them as text:
+### Single file
 ```bash
 cd scripts
+pip install docling  # optional but recommended for rich formats
 python docling_convert_and_upload.py \
+  --file /path/to/file.pdf \
+  --backend-url http://localhost:8000 \
+  --namespace dev \
+  --title "My local document" \
+  --source local-file \
+  --api-key "$API_KEY"
+```
+- Supported formats when Docling is installed include: PDF, DOCX, PPT/PPTX, XLS/XLSX, HTML/HTM, MD, AsciiDoc, and TXT.
+- If Docling is **not** installed:
+  - `.txt` and `.md` files are ingested as raw text.
+  - Other formats will fail with a clear message instructing you to install Docling.
+### Batch ingest a folder
+```bash
+cd scripts
+pip install docling  # optional but recommended
+python batch_ingest_local_folder.py \
+  --folder /path/to/folder \
   --backend-url http://localhost:8000 \
   --namespace dev \
+  --source local-folder \
+  --max-files 200 \
+  --api-key "$API_KEY"
 ```
+- Recursively scans the folder for supported extensions and ingests up to `max-files` documents.
+- Each file is converted via `docling_convert_and_upload.py` logic and uploaded to `/documents/upload-text`.
+## Upload documents via UI (Streamlit dialog)
+The Streamlit chat frontend also supports uploading documents directly from the browser:
+- Click the **“📄 Upload Document”** button at the top of the chat page.
+- A modal dialog opens with:
+  - File chooser (`.pdf`, `.md`, `.txt`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.htm`).
+  - Title (defaults to filename without extension).
+  - Namespace (defaults to the backend namespace, e.g. `dev`).
+  - Source label (defaults to `ui-upload`).
+  - Optional metadata: tags (comma-separated) and free-form notes.
+- On upload:
+  - The frontend converts the file to markdown/text and calls `POST /documents/upload-text` with:
+    - `title`, `source`, `text`, `namespace`, and a `metadata` dictionary containing conversion and UI metadata.
+  - On success, the upload is recorded in a “Recent uploads” section in the sidebar and can be quickly queried via “Search this document”.
+Notes:
+- Conversion happens entirely in the frontend:
+  - `.txt` and `.md` files are read as raw text.
+  - For richer formats (PDF/Office/HTML), the frontend attempts to use **Docling** if installed.
+  - If Docling is not available, an informative error is shown and the user is asked to upload `.md`/`.txt` instead.
+- On Streamlit Cloud, Docling must be added to the app’s Python environment (e.g. `requirements.txt`) for PDF/Office uploads to work.
+- Streamlit’s file uploader has a default maximum size (typically 200 MB); check Streamlit documentation if you need to increase or restrict this limit.
 ## Deploy Backend on Hugging Face Spaces (Docker)
 1. **Create a new Space**
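
Review note: the README additions above describe two behaviours of the ingestion scripts: a recursive folder scan capped at `max-files`, and a fallback in which `.txt`/`.md` files are read as raw text while other formats need Docling. The script bodies are not shown in this view, so the following is only a minimal sketch of that logic. The helper names `find_supported_files` and `convert_to_text`, the `.adoc` extension for AsciiDoc, and the way Docling is called are assumptions, not the committed implementation.

```python
from pathlib import Path
from typing import List

RAW_TEXT_EXTENSIONS = {".txt", ".md"}
# Extensions follow the README's format list; ".adoc" for AsciiDoc is a guess.
DOCLING_EXTENSIONS = {
    ".pdf", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".html", ".htm", ".adoc",
}


def find_supported_files(folder: Path, max_files: int) -> List[Path]:
    """Recursively collect supported files in sorted order, capped at max_files."""
    supported = RAW_TEXT_EXTENSIONS | DOCLING_EXTENSIONS
    matches = [
        path
        for path in sorted(folder.rglob("*"))
        if path.is_file() and path.suffix.lower() in supported
    ]
    return matches[:max_files]


def convert_to_text(path: Path) -> str:
    """Return text for upload; Docling is imported only when it is needed."""
    if path.suffix.lower() in RAW_TEXT_EXTENSIONS:
        # Plain text and markdown never require Docling.
        return path.read_text(encoding="utf-8", errors="replace")
    try:
        from docling.document_converter import DocumentConverter
    except ImportError as exc:
        raise RuntimeError(
            f"Cannot convert {path.name}: install Docling (pip install docling) "
            "or provide a .txt/.md file."
        ) from exc
    # Rich formats are converted to markdown by Docling.
    result = DocumentConverter().convert(str(path))
    return result.document.export_to_markdown()
```

Importing Docling lazily inside the function is one way to keep it optional: environments that only ingest `.txt`/`.md` never touch the dependency.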

backend/app/core/auth.py ADDED

	@@ -0,0 +1,91 @@

+import os
+from functools import lru_cache
+from typing import Optional
+from fastapi import HTTPException, Security, status
+from fastapi.security import APIKeyHeader
+from app.core.logging import get_logger
+logger = get_logger(__name__)
+api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
+@lru_cache(maxsize=1)
+def _get_configured_api_key() -> Optional[str]:
+    """Return the configured API key, or None if not set.
+    The key is read from the API_KEY environment variable.
+    """
+    raw = os.getenv("API_KEY")
+    if raw is None or not raw.strip():
+        return None
+    return raw.strip()
+def _is_production_like() -> bool:
+    """Heuristic to detect production / hosted environments.
+    - ENV=production
+    - or running on Hugging Face Spaces (SPACE_ID or HF_HOME set)
+    """
+    env = os.getenv("ENV", "").strip().lower()
+    if env == "production":
+        return True
+    if os.getenv("SPACE_ID") or os.getenv("HF_HOME"):
+        return True
+    return False
+def validate_api_key_configuration() -> None:
+    """Validate API key configuration at startup.
+    Behaviour:
+    - In production-like environments (HF Spaces or ENV=production):
+        - API_KEY MUST be set, otherwise raise RuntimeError (fail fast).
+    - In other environments:
+        - If API_KEY is missing, allow running open but log a clear warning.
+    """
+    configured = _get_configured_api_key()
+    if _is_production_like():
+        if not configured:
+            raise RuntimeError(
+                "API_KEY environment variable must be set when running in "
+                "production or on Hugging Face Spaces. Configure API_KEY in "
+                "your environment or Space secrets."
+            )
+        logger.info("API key configured for production / hosted environment.")
+    else:
+        if not configured:
+            logger.warning(
+                "API_KEY is not set; backend is running without authentication. "
+                "This is intended for local development only."
+            )
+        else:
+            logger.info("API key configured for development mode.")
+async def require_api_key(api_key: Optional[str] = Security(api_key_header)) -> None:
+    """FastAPI dependency that enforces X-API-Key when configured.
+    - If API_KEY is not configured (local/dev), this is a no-op.
+    - If API_KEY is configured:
+        - Missing or mismatched X-API-Key results in HTTP 403.
+    """
+    configured = _get_configured_api_key()
+    if not configured:
+        # No API key configured: dev mode, do not enforce.
+        return
+    if not api_key:
+        raise HTTPException(
+            status_code=status.HTTP_403_FORBIDDEN,
+            detail="Missing API key. Provide X-API-Key header.",
+        )
+    if api_key != configured:
+        raise HTTPException(
+            status_code=status.HTTP_403_FORBIDDEN,
+            detail="Invalid API key.",
+        )
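
Review note: `require_api_key` compares the header value to the configured key with `!=`, which is not a constant-time comparison. A minimal sketch of a constant-time variant is shown below, using the standard library's `hmac.compare_digest`. The helper name `api_key_matches` is hypothetical and not part of this commit.

```python
import hmac
from typing import Optional


def api_key_matches(provided: Optional[str], configured: str) -> bool:
    """Compare keys in constant time; a missing header never matches."""
    if not provided:
        return False
    # Encode to bytes so non-ASCII keys are handled; compare_digest on str
    # objects only accepts ASCII.
    return hmac.compare_digest(
        provided.encode("utf-8"), configured.encode("utf-8")
    )
```

In `require_api_key`, the final check could then read `if not api_key_matches(api_key, configured): raise HTTPException(...)`. Separately, because `_get_configured_api_key` is wrapped in `lru_cache`, tests that change `API_KEY` between cases would need to call `_get_configured_api_key.cache_clear()` to see the new value.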

backend/app/core/security.py CHANGED

@@ -1,10 +1,8 @@
 import os
-from typing import Iterable, List, Optional
-from fastapi import FastAPI, Request
-from fastapi.responses import JSONResponse
 from fastapi.middleware.cors import CORSMiddleware
-from starlette.middleware.base import BaseHTTPMiddleware
 from app.core.logging import get_logger
@@ -23,76 +21,11 @@ def _get_allowed_origins() -> List[str]:
     return origins
-class APIKeyMiddleware(BaseHTTPMiddleware):
-    """Optional API key protection for selected endpoints.
-    When the API_KEY environment variable is set, this middleware enforces the
-    presence of an `X-API-Key` header with a matching value for:
-      - /ingest/*
-      - /documents/*
-      - /chat*
-      - /search
-    The following paths remain public regardless of API_KEY:
-      - /health
-      - /docs
-      - /openapi.json
-      - /redoc
-      - /metrics
-    When API_KEY is not set, the middleware is not installed and the API is open.
     """
-    def __init__(self, app: FastAPI, api_key: str) -> None:  # type: ignore[override]
-        super().__init__(app)
-        self.api_key = api_key
-        self._protected_prefixes: List[str] = [
-            "/ingest",
-            "/documents",
-            "/chat",
-            "/search",
-        ]
-        self._public_prefixes: List[str] = [
-            "/health",
-            "/docs",
-            "/openapi.json",
-            "/redoc",
-            "/metrics",
-        ]
-    async def dispatch(self, request: Request, call_next):  # type: ignore[override]
-        path = request.url.path or "/"
-        # Public endpoints stay open.
-        if any(path.startswith(prefix) for prefix in self._public_prefixes):
-            return await call_next(request)
-        # Only enforce for protected prefixes.
-        if not any(path.startswith(prefix) for prefix in self._protected_prefixes):
-            return await call_next(request)
-        header_key: Optional[str] = request.headers.get("X-API-Key")
-        if not header_key or header_key != self.api_key:
-            logger.warning("Rejected request with missing/invalid API key path=%s", path)
-            return JSONResponse(
-                status_code=401,
-                content={
-                    "detail": (
-                        "Missing or invalid API key. Provide X-API-Key header with "
-                        "a valid key to access this endpoint."
-                    )
-                },
-            )
-        return await call_next(request)
-def configure_security(app: FastAPI) -> None:
-    """Configure CORS and optional API key protection on the FastAPI app."""
-    # CORS
     origins = _get_allowed_origins()
     app.add_middleware(
         CORSMiddleware,
@@ -101,16 +34,4 @@ def configure_security(app: FastAPI) -> None:
         allow_methods=["*"],
         allow_headers=["*"],
     )
-    logger.info("CORS configured allow_origins=%s", origins)
-    # Optional API key middleware
-    api_key = os.getenv("API_KEY")
-    if not api_key:
-        logger.warning(
-            "API key disabled; protected endpoints are open. "
-            "Set API_KEY environment variable to enable X-API-Key protection."
-        )
-        return
-    logger.info("API key protection enabled for ingest, documents, search, and chat.")
-    app.add_middleware(APIKeyMiddleware, api_key=api_key)

 import os
+from typing import List
+from fastapi import FastAPI
 from fastapi.middleware.cors import CORSMiddleware
 from app.core.logging import get_logger
     return origins
+def configure_security(app: FastAPI) -> None:
+    """Configure CORS on the FastAPI app.
+    API key enforcement is handled via dependencies in app.core.auth.
     """
     origins = _get_allowed_origins()
     app.add_middleware(
         CORSMiddleware,
         allow_methods=["*"],
         allow_headers=["*"],
     )
+    logger.info("CORS configured allow_origins=%s", origins)
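
Review note: the body of `_get_allowed_origins` is outside the hunks shown here. Based on the description in docs/CONTEXT.md (a comma-separated `ALLOWED_ORIGINS` value, defaulting to `["*"]` when unset or empty), its parsing could look roughly like the sketch below. The function name `parse_allowed_origins` and its `env` parameter are assumptions added to make the logic testable without touching the real environment.

```python
import os
from typing import List, Mapping, Optional


def parse_allowed_origins(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Split ALLOWED_ORIGINS on commas; fall back to ["*"] when nothing is set."""
    source = os.environ if env is None else env
    raw = source.get("ALLOWED_ORIGINS", "")
    # Drop surrounding whitespace and empty entries such as a trailing comma.
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    return origins or ["*"]
```

For example, `parse_allowed_origins({"ALLOWED_ORIGINS": "https://a.example, https://b.example"})` yields the two origins as a list, while an empty mapping yields `["*"]`.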

backend/app/main.py CHANGED

@@ -1,6 +1,7 @@
-from fastapi import FastAPI
 from fastapi.responses import ORJSONResponse
 from app.core.config import get_settings
 from app.core.errors import PineconeIndexConfigError, setup_exception_handlers
 from app.core.logging import configure_logging, get_logger
@@ -20,6 +21,9 @@ settings = get_settings()
 configure_logging(settings.LOG_LEVEL)
 logger = get_logger(__name__)
 # Log runtime port / environment context at import time for easier diagnostics.
 get_port()
@@ -37,13 +41,14 @@ configure_security(app)
 setup_rate_limiter(app)
 setup_metrics(app)
-# Register routers with tags and ensure they are included in the schema
 app.include_router(health_router, tags=["health"])
-app.include_router(ingest_router, tags=["ingest"])
-app.include_router(search_router, tags=["search"])
-app.include_router(documents_router, tags=["documents"])
-app.include_router(chat_router, tags=["chat"])
-app.include_router(metrics_router, tags=["metrics"])
 # Register exception handlers
 setup_exception_handlers(app)

+from fastapi import Depends, FastAPI
 from fastapi.responses import ORJSONResponse
+from app.core.auth import require_api_key, validate_api_key_configuration
 from app.core.config import get_settings
 from app.core.errors import PineconeIndexConfigError, setup_exception_handlers
 from app.core.logging import configure_logging, get_logger
 configure_logging(settings.LOG_LEVEL)
 logger = get_logger(__name__)
+# Validate API key configuration early so hosted deployments fail fast when misconfigured.
+validate_api_key_configuration()
 # Log runtime port / environment context at import time for easier diagnostics.
 get_port()
 setup_rate_limiter(app)
 setup_metrics(app)
+# Register routers with tags and ensure they are included in the schema.
+# Health and docs remain public; all other routers are protected by API key dependency when configured.
 app.include_router(health_router, tags=["health"])
+app.include_router(ingest_router, tags=["ingest"], dependencies=[Depends(require_api_key)])
+app.include_router(search_router, tags=["search"], dependencies=[Depends(require_api_key)])
+app.include_router(documents_router, tags=["documents"], dependencies=[Depends(require_api_key)])
+app.include_router(chat_router, tags=["chat"], dependencies=[Depends(require_api_key)])
+app.include_router(metrics_router, tags=["metrics"], dependencies=[Depends(require_api_key)])
 # Register exception handlers
 setup_exception_handlers(app)

docs/CONTEXT.md CHANGED

@@ -275,38 +275,35 @@ RAG Agent Workbench is a lightweight experimentation backend for retrieval-augme
       - Logs: `Starting on port=<port> hf_spaces_mode=<bool>` using a simple heuristic (`SPACE_ID` / `SPACE_REPO_ID` env vars).
     - Called from `app.main` at import time so the log line is visible in container logs during startup.
-### API key middleware and CORS
 - **API key protection**
-  - New module: `backend/app/core/security.py`
-    - `configure_security(app)`:
-      - Configures CORS.
-      - Installs optional `APIKeyMiddleware` when `API_KEY` env var is set.
-    - `APIKeyMiddleware` rules:
-      - If `API_KEY` is set:
-        - Require header `X-API-Key` on:
-          - `/ingest/*`
-          - `/documents/*`
-          - `/search`
-          - `/chat*` (both `/chat` and `/chat/stream`)
-        - Public endpoints (no key required):
-          - `/health`
-          - `/docs`
-          - `/openapi.json`
-          - `/redoc`
-          - `/metrics`
-        - Requests with missing/invalid key receive:
-          - HTTP `401` with JSON `{"detail": "Missing or invalid API key. ..."}`.
-      - If `API_KEY` is **not** set:
-        - Middleware is not installed.
-        - A warning is logged:
-          - `"API key disabled; protected endpoints are open. Set API_KEY..."`.
-  - Intended use:
-    - For public demos, set a simple API key and configure the frontend to pass it.
-    - For local development, leaving `API_KEY` unset keeps the API open.
 - **CORS configuration**
-  - Also in `core/security.py`:
     - Reads `ALLOWED_ORIGINS` env var as a comma-separated list.
     - If unset or empty:
       - Defaults to `["*"]` (permissive, useful for local dev and quick demos).
@@ -314,7 +311,7 @@ RAG Agent Workbench is a lightweight experimentation backend for retrieval-augme
       - `allow_origins=origins`
       - `allow_methods=["*"]`
       - `allow_headers=["*"]`
-  - Documented in `.env.example` and README so operators can lock this down for real deployments.
 ### Rate limiting (SlowAPI)
@@ -483,22 +480,36 @@ RAG Agent Workbench is a lightweight experimentation backend for retrieval-augme
       - `httpx`
     - Backend configuration:
       - Reads `BACKEND_BASE_URL` from `st.secrets["BACKEND_BASE_URL"]` or the `BACKEND_BASE_URL` environment variable.
-      - Reads optional `API_KEY` from `st.secrets["API_KEY"]` or the `API_KEY` environment variable.
-    - Connectivity panel (sidebar):
-      - Displays the configured backend URL.
-      - Indicates whether an API key is configured.
-      - Provides a "Ping /health" button that calls the backend and shows the JSON response.
-    - Chat UI:
-      - Text input for namespace (`dev` by default).
-      - Text area for user question.
-      - On "Send":
-        - Calls backend `/chat` with:
-          - `query`, `namespace`, `top_k=5`, `use_web_fallback=true`.
-          - Includes `X-API-Key` header when configured.
-        - Displays:
-          - Answer text.
-          - Timings JSON.
-          - Up to 5 sources with titles, URLs, and snippet text (in expanders).
 - Root-level `requirements.txt`
   - Added to support Streamlit Community Cloud, where the root requirements file is used:

       - Logs: `Starting on port=<port> hf_spaces_mode=<bool>` using a simple heuristic (`SPACE_ID` / `SPACE_REPO_ID` env vars).
     - Called from `app.main` at import time so the log line is visible in container logs during startup.
+### API key protection and CORS
 - **API key protection**
+  - New module: `backend/app/core/auth.py`
+    - Defines `require_api_key` FastAPI dependency using `APIKeyHeader` (`X-API-Key`).
+    - `validate_api_key_configuration()` runs at startup and enforces:
+      - In production-like environments (`ENV=production` or on Hugging Face Spaces via `SPACE_ID` / `HF_HOME`):
+        - `API_KEY` **must** be set or the backend fails fast with a clear error.
+      - In local development:
+        - If `API_KEY` is missing, the backend runs open but logs a prominent warning.
+    - `require_api_key` behaviour:
+      - If `API_KEY` is not configured (dev mode), the dependency is a no-op.
+      - If `API_KEY` is configured:
+        - Missing or mismatched `X-API-Key` results in HTTP 403.
+  - Wiring:
+    - All routers except `/health` are registered with `dependencies=[Depends(require_api_key)]`.
+    - Docs and OpenAPI endpoints are explicitly secured:
+      - `GET /openapi.json` – returns `app.openapi()`, protected by `require_api_key`.
+      - `GET /docs` – Swagger UI via `get_swagger_ui_html`, protected by `require_api_key`.
+      - `GET /redoc` – ReDoc UI via `get_redoc_html`, protected by `require_api_key`.
+    - Effect:
+      - In HF Spaces / production:
+        - `/docs`, `/redoc`, `/openapi.json`, `/chat`, `/search`, `/documents/*`, `/ingest/*`, `/metrics` all require `X-API-Key`.
+        - `/health` remains public for simple uptime checks.
+      - In local dev with no `API_KEY`:
+        - All endpoints (including docs) are accessible without a key for convenience.
 - **CORS configuration**
+  - `backend/app/core/security.py` now focuses solely on CORS:
     - Reads `ALLOWED_ORIGINS` env var as a comma-separated list.
     - If unset or empty:
       - Defaults to `["*"]` (permissive, useful for local dev and quick demos).
       - `allow_origins=origins`
       - `allow_methods=["*"]`
       - `allow_headers=["*"]`
+  - API key enforcement is handled entirely via `core/auth.py` and router/dependency wiring.
 ### Rate limiting (SlowAPI)
       - `httpx`
     - Backend configuration:
       - Reads `BACKEND_BASE_URL` from `st.secrets["BACKEND_BASE_URL"]` or the `BACKEND_BASE_URL` environment variable.
+      - Reads `API_KEY` from `st.secrets["API_KEY"]` or the `API_KEY` environment variable.
+    - Sidebar ("Backend" + settings):
+      - Shows backend URL and API key status.
+      - "Ping /health" button that calls the backend and shows the JSON response.
+      - `top_k` slider, `min_score` slider, `use_web_fallback` checkbox.
+      - "Show sources" toggle and "Clear chat" button.
+      - "Recent uploads" section with quick actions:
+        - For each recent upload, displays title, namespace, timestamp.
+        - A "Search this document" button pre-fills the chat input with a prompt such as `Summarize: <title>`.
+    - Chatbot UI:
+      - Uses `st.chat_message` and `st.chat_input` with conversation stored in `st.session_state.messages`.
+      - When the user sends a message:
+        - Appends it to history and displays it.
+        - Calls `/chat/stream` with `X-API-Key` (if available) and streams tokens into the UI.
+        - If `/chat/stream` is unavailable (e.g. 404), falls back to `/chat`.
+      - Assistant messages:
+        - Display the answer text.
+        - Optionally show sources in an expandable "Sources" section with titles, URLs, scores, and truncated snippets.
+      - If `API_KEY` is not configured in secrets or environment:
+        - The app warns and disables sending messages to the protected backend.
+    - UI document upload:
+      - A top-level “📄 Upload Document” button opens a `@st.dialog` modal.
+      - Inside the dialog:
+        - `st.file_uploader` for `.pdf`, `.md`, `.txt`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.htm`.
+        - Inputs for title (defaulting to filename), namespace, source label, tags, and notes.
+        - A checkbox to allow uploading even when extracted text is very short.
+        - On submit:
+          - The frontend converts the file to text/markdown (using Docling when installed, or raw text for `.md`/`.txt`).
+          - Calls backend `POST /documents/upload-text` with `X-API-Key`.
+          - On success, records the upload in `st.session_state.recent_uploads` and triggers a rerun to close the dialog.
 - Root-level `requirements.txt`
   - Added to support Streamlit Community Cloud, where the root requirements file is used:
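
Review note: the chat UI description above says tokens from `/chat/stream` are streamed into the UI. The core of that is parsing server-sent event lines, which can be sketched offline as a pure function over an iterable of lines. The `end` event name and the space-joined tokens follow `frontend/app.py` in this commit; the function name `parse_sse_lines`, the `token` event name in the sample, and the reset of the event type on blank lines are assumptions.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def parse_sse_lines(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Optional[Dict[str, Any]]]]:
    """Yield (answer_so_far, final_payload) for each SSE data line."""
    answer = ""
    event: Optional[str] = None
    for line in lines:
        if not line:
            # A blank line ends one SSE event; the next one starts untyped.
            event = None
            continue
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = line.split(":", 1)[1].lstrip()
            if event == "end":
                # The terminating event carries the full JSON response.
                try:
                    yield answer, json.loads(data)
                except json.JSONDecodeError:
                    yield answer, None
            elif data:
                # Intermediate tokens are joined with single spaces.
                answer = f"{answer} {data}" if answer else data
                yield answer, None


sample = [
    "event: token", "data: Hello", "",
    "event: token", "data: world", "",
    "event: end", 'data: {"answer": "Hello world"}', "",
]
for partial, final in parse_sse_lines(sample):
    print(partial, final)
```

Keeping the parser separate from the HTTP call makes it possible to test the streaming logic without a running backend; in the app, the lines would come from `resp.iter_lines()` on an `httpx` streaming response.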

docs/WORKLOG.md CHANGED

@@ -39,4 +39,37 @@
   - Use port `7860` by default in the Docker image, while respecting the `PORT` environment variable for platforms like Hugging Face Spaces.
   - Keep API key protection opt-in via `API_KEY` with clear logging when disabled.
   - Enable rate limiting and caching by default, with simple boolean toggles (`RATE_LIMIT_ENABLED`, `CACHE_ENABLED`) for easy operational control.
-  - Implement metrics as in-memory only (no external storage) and expose them via a JSON `/metrics` endpoint tailored for demos and lightweight monitoring.

   - Use port `7860` by default in the Docker image, while respecting the `PORT` environment variable for platforms like Hugging Face Spaces.
   - Keep API key protection opt-in via `API_KEY` with clear logging when disabled.
   - Enable rate limiting and caching by default, with simple boolean toggles (`RATE_LIMIT_ENABLED`, `CACHE_ENABLED`) for easy operational control.
+  - Implement metrics as in-memory only (no external storage) and expose them via a JSON `/metrics` endpoint tailored for demos and lightweight monitoring.
+## 2026-01-17 – Security + UI + Ingestion Hardening
+- **Summary**
+  - Hardened the backend for public deployment by enforcing API key protection for all non-health endpoints and (initially) for the OpenAPI/Swagger documentation, then relaxed docs to be publicly viewable while keeping all functional endpoints protected.
+  - Upgraded the Streamlit frontend to a conversational chat UI using Streamlit's chat primitives.
+  - Improved local document ingestion workflows with Docling-aware scripts for single files and batch folder ingestion.
+  - Added a UI-based document upload dialog in the Streamlit app that ingests files via `/documents/upload-text`.
+- **Key Files Changed**
+  - Backend authentication and wiring:
+    - `backend/app/core/auth.py`
+    - `backend/app/core/security.py`
+    - `backend/app/main.py`
+  - Frontend chatbot UI and upload:
+    - `frontend/app.py`
+    - `frontend/services/file_convert.py`
+    - `frontend/services/backend_client.py`
+  - Local ingestion scripts:
+    - `scripts/docling_convert_and_upload.py`
+    - `scripts/batch_ingest_local_folder.py`
+  - Documentation:
+    - `backend/README.md`
+    - `docs/CONTEXT.md`
+    - `docs/WORKLOG.md` (this file)
+- **Major Decisions**
+  - In production-like environments (`ENV=production` or on Hugging Face Spaces), require `API_KEY` and fail fast at startup when it is missing; Swagger/OpenAPI remain publicly accessible but all non-health API endpoints still enforce `X-API-Key`.
+  - Use a single `require_api_key` dependency (based on `APIKeyHeader`) to protect all routers except `/health`.
+  - Treat Streamlit as a first-class chat client, using `st.chat_message`/`st.chat_input` with session-based history and optional streaming from `/chat/stream`.
+  - Keep Docling as an optional dependency used in:
+    - Local ingestion scripts that upload text to `/documents/upload-text`.
+    - The frontend upload dialog for converting PDFs/Office/HTML when available, while falling back to raw `.md`/`.txt` and showing clear errors otherwise.

frontend/.streamlit/secrets.toml DELETED

	@@ -1 +0,0 @@
-BACKEND_BASE_URL = "http://127.0.0.1:8000"

frontend/app.py CHANGED

@@ -1,131 +1,447 @@
 import os
-from typing import Any, Dict
 import httpx
 import streamlit as st
 def get_backend_base_url() -> str:
-    # Prefer Streamlit secrets, then environment variable, then localhost.
-    secrets = getattr(st, "secrets", {})
-    base_url = getattr(secrets, "get", lambda _k, _d=None: None)("BACKEND_BASE_URL", None)
-    if not base_url:
         base_url = os.getenv("BACKEND_BASE_URL", "http://localhost:8000")
-    return base_url.rstrip("/")
-def get_api_key() -> str | None:
-    secrets = getattr(st, "secrets", {})
-    api_key = getattr(secrets, "get", lambda _k, _d=None: None)("API_KEY", None)
-    if not api_key:
-        api_key = os.getenv("API_KEY")
-    return api_key
-async def ping_health(base_url: str, api_key: str | None) -> Dict[str, Any]:
     url = f"{base_url}/health"
     headers: Dict[str, str] = {}
     if api_key:
         headers["X-API-Key"] = api_key
-    async with httpx.AsyncClient(timeout=10.0) as client:
-        resp = await client.get(url, headers=headers)
     return resp.json()
-async def call_chat(
     base_url: str,
-    api_key: str | None,
-    query: str,
-    namespace: str,
 ) -> Dict[str, Any]:
     url = f"{base_url}/chat"
-    payload: Dict[str, Any] = {
-        "query": query,
-        "namespace": namespace,
-        "top_k": 5,
-        "use_web_fallback": True,
-    }
-    headers: Dict[str, str] = {"Content-Type": "application/json"}
-    if api_key:
-        headers["X-API-Key"] = api_key
-    async with httpx.AsyncClient(timeout=60.0) as client:
-        resp = await client.post(url, json=payload, headers=headers)
     return resp.json()
-def main() -> None:
-    st.set_page_config(page_title="RAG Agent Workbench", layout="wide")
-    st.title("RAG Agent Workbench – Chat Demo")
-    backend_base_url = get_backend_base_url()
-    api_key = get_api_key()
     with st.sidebar:
-        st.header("Connectivity")
         st.markdown(f"**Backend URL:** `{backend_base_url}`")
         if api_key:
-            st.markdown("**API key:** configured in Streamlit secrets.")
         else:
-            st.markdown("**API key:** not set (backend may be open).")
         if st.button("Ping /health"):
             try:
-                import asyncio
-                health = asyncio.run(ping_health(backend_base_url, api_key))
                 st.success("Backend reachable.")
                 st.json(health)
             except Exception as exc:  # noqa: BLE001
                 st.error(f"Health check failed: {exc}")
-    st.subheader("Chat")
-    namespace = st.text_input("Namespace", value="dev", help="Pinecone namespace to query.")
-    query = st.text_area(
-        "Your question",
-        value="What is retrieval-augmented generation?",
-        height=100,
     )
-    if st.button("Send"):
-        if not query.strip():
-            st.warning("Please enter a question.")
             return
-        with st.spinner("Calling backend /chat..."):
-            try:
-                import asyncio
-                response = asyncio.run(
-                    call_chat(backend_base_url, api_key, query.strip(), namespace.strip() or "dev")
-                )
-            except Exception as exc:  # noqa: BLE001
-                st.error(f"Error calling backend: {exc}")
-                return
-        answer = response.get("answer", "")
-        timings = response.get("timings", {})
-        sources = response.get("sources", [])
-        st.markdown("### Answer")
-        st.write(answer or "_No answer returned._")
-        st.markdown("### Timings (ms)")
-        st.json(timings)
-        if sources:
-            st.markdown("### Sources")
-            for idx, src in enumerate(sources[:5], start=1):
-                title = src.get("title") or f"Source {idx}"
-                url = src.get("url") or ""
-                score = src.get("score", 0.0)
-                st.markdown(f"**[{idx}] {title}** (score={score:.3f})")
-                if url:
-                    st.markdown(f"- URL: {url}")
-                chunk_text = src.get("chunk_text") or ""
-                if chunk_text:
-                    with st.expander("Snippet", expanded=False):
-                        st.write(chunk_text)
 if __name__ == "__main__":

+import json
 import os
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Dict, Generator, List, Optional, Tuple
 import httpx
 import streamlit as st
+from services.backend_client import post_upload_text
+from services.file_convert import convert_uploaded_file_to_text
 def get_backend_base_url() -> str:
+    """Prefer Streamlit secrets, then environment variable, then localhost."""
+    if "BACKEND_BASE_URL" in st.secrets:
+        base_url = st.secrets["BACKEND_BASE_URL"]
+    else:
         base_url = os.getenv("BACKEND_BASE_URL", "http://localhost:8000")
+    return str(base_url).rstrip("/")
+def get_api_key() -> Optional[str]:
+    """Read API key from Streamlit secrets or environment."""
+    if "API_KEY" in st.secrets:
+        return str(st.secrets["API_KEY"])
+    return os.getenv("API_KEY")
+def ping_health(base_url: str, api_key: Optional[str]) -> Dict[str, Any]:
     url = f"{base_url}/health"
     headers: Dict[str, str] = {}
     if api_key:
         headers["X-API-Key"] = api_key
+    resp = httpx.get(url, headers=headers, timeout=10.0)
+    resp.raise_for_status()
     return resp.json()
+def call_chat(
     base_url: str,
+    api_key: str,
+    payload: Dict[str, Any],
 ) -> Dict[str, Any]:
     url = f"{base_url}/chat"
+    headers: Dict[str, str] = {"Content-Type": "application/json", "X-API-Key": api_key}
+    resp = httpx.post(url, json=payload, headers=headers, timeout=60.0)
+    resp.raise_for_status()
     return resp.json()
+def iter_chat_stream(
+    base_url: str,
+    api_key: str,
+    payload: Dict[str, Any],
+) -> Generator[Tuple[str, Optional[Dict[str, Any]]], None, None]:
+    """Stream tokens from /chat/stream and yield (partial_answer, final_payload).
+    The final_payload is None for intermediate updates and populated once
+    when the terminating SSE event is received.
+    """
+    url = f"{base_url}/chat/stream"
+    headers: Dict[str, str] = {"Content-Type": "application/json", "X-API-Key": api_key}
+    full_answer = ""
+    final_payload: Optional[Dict[str, Any]] = None
+    current_event: Optional[str] = None
+    with httpx.Client(timeout=60.0) as client:
+        with client.stream("POST", url, json=payload, headers=headers) as resp:
+            resp.raise_for_status()
+            for line in resp.iter_lines():
+                if not line:
+                    continue
+                if line.startswith("event:"):
+                    current_event = line.split(":", 1)[1].strip()
+                    continue
+                if line.startswith("data:"):
+                    data = line.split(":", 1)[1].lstrip()
+                    if current_event == "end":
+                        # Final payload with full JSON response.
+                        try:
+                            final_payload = json.loads(data)
+                        except json.JSONDecodeError:
+                            final_payload = None
+                    else:
+                        if data:
+                            if full_answer:
+                                full_answer += " "
+                            full_answer += data
+                            # Yield intermediate answer text.
+                            yield full_answer, None
+    # After stream ends, make sure we yield at least once with final payload.
+    if final_payload is not None:
+        # If the backend included the final answer in the JSON payload, prefer it.
+        answer_text = str(final_payload.get("answer") or full_answer)
+        yield answer_text, final_payload
+    elif full_answer:
+        yield full_answer, None
+def init_session_state() -> None:
+    if "messages" not in st.session_state:
+        st.session_state.messages: List[Dict[str, Any]] = []
+    if "show_sources" not in st.session_state:
+        st.session_state.show_sources = True
+    if "supports_stream" not in st.session_state:
+        st.session_state.supports_stream = True
+    # Namespace is fixed for now; default to "dev".
+    if "namespace" not in st.session_state:
+        st.session_state.namespace = "dev"
+    if "recent_uploads" not in st.session_state:
+        st.session_state.recent_uploads: List[Dict[str, Any]] = []
+    if "chat_prefill" not in st.session_state:
+        st.session_state.chat_prefill = None
+def render_sidebar(backend_base_url: str, api_key: Optional[str]) -> Dict[str, Any]:
     with st.sidebar:
+        st.header("Backend")
         st.markdown(f"**Backend URL:** `{backend_base_url}`")
         if api_key:
+            st.markdown("**API key:** configured in Streamlit secrets or environment.")
         else:
+            st.warning(
+                "API_KEY is not configured. The backend is expected to be protected; "
+                "chat will be disabled until an API key is set."
+            )
         if st.button("Ping /health"):
             try:
+                health = ping_health(backend_base_url, api_key)
                 st.success("Backend reachable.")
                 st.json(health)
             except Exception as exc:  # noqa: BLE001
                 st.error(f"Health check failed: {exc}")
+        st.markdown("---")
+        st.subheader("Chat settings")
+        top_k = st.slider("Top K", min_value=1, max_value=20, value=5, step=1)
+        min_score = st.slider(
+            "Minimum relevance score",
+            min_value=0.0,
+            max_value=1.0,
+            value=0.25,
+            step=0.05,
+        )
+        use_web_fallback = st.checkbox(
+            "Use web fallback (Tavily)",
+            value=True,
+            help="When enabled, /chat may call Tavily if retrieval is weak.",
+        )
+        st.session_state.show_sources = st.checkbox(
+            "Show sources", value=st.session_state.show_sources
+        )
+        if st.button("Clear chat"):
+            st.session_state.messages = []
+        st.markdown("---")
+        st.subheader("Recent uploads")
+        recent = st.session_state.get("recent_uploads", [])
+        if not recent:
+            st.caption("No documents uploaded yet.")
+        else:
+            for idx, item in enumerate(recent):
+                title = item.get("title") or "Untitled"
+                ns = item.get("namespace") or st.session_state.get("namespace", "dev")
+                ts = item.get("timestamp", "")
+                st.markdown(f"- **{title}**  \n  Namespace: `{ns}`  \n  Uploaded: {ts}")
+                if st.button("Search this document", key=f"search_upload_{idx}"):
+                    st.session_state.chat_prefill = f"Summarize: {title}"
+    return {
+        "top_k": top_k,
+        "min_score": float(min_score),
+        "use_web_fallback": bool(use_web_fallback),
+    }
+def render_chat_history(show_sources: bool) -> None:
+    for message in st.session_state.messages:
+        role = message.get("role", "user")
+        content = message.get("content", "")
+        with st.chat_message("assistant" if role == "assistant" else "user"):
+            st.markdown(content)
+            if role == "assistant" and show_sources:
+                sources = message.get("sources") or []
+                if sources:
+                    with st.expander("Sources", expanded=False):
+                        for idx, src in enumerate(sources, start=1):
+                            title = src.get("title") or f"Source {idx}"
+                            url = src.get("url") or ""
+                            score = src.get("score", 0.0)
+                            st.markdown(f"**[{idx}] {title}** (score={score:.3f})")
+                            if url:
+                                st.markdown(f"- URL: {url}")
+                            chunk_text = src.get("chunk_text") or ""
+                            if chunk_text:
+                                st.write(chunk_text[:1000] + ("..." if len(chunk_text) > 1000 else ""))
+@st.dialog("Upload document")
+def upload_dialog(backend_base_url: str, api_key: Optional[str]) -> None:
+    """Modal dialog for uploading and ingesting a document via /documents/upload-text."""
+    st.write("Upload a document to ingest it into the RAG backend.")
+    with st.form("upload_form"):
+        uploaded_file = st.file_uploader(
+            "Choose a file",
+            type=["pdf", "md", "txt", "docx", "pptx", "xlsx", "html", "htm"],
+            accept_multiple_files=False,
+        )
+        default_title = ""
+        if uploaded_file is not None:
+            default_title = Path(uploaded_file.name).stem
+        title = st.text_input("Title", value=default_title)
+        namespace = st.text_input(
+            "Namespace",
+            value=st.session_state.get("namespace", "dev"),
+            help="Target Pinecone namespace.",
+        )
+        source = st.text_input("Source label", value="ui-upload")
+        tags = st.text_input("Tags (comma separated)", value="")
+        notes = st.text_area("Notes", value="", height=80)
+        upload_anyway = st.checkbox(
+            "Upload even if extracted text is very short",
+            value=False,
+            help="Enable to upload even when the extracted text is shorter than 200 characters.",
+        )
+        submit = st.form_submit_button("Upload")
+    if not submit:
+        return
+    if uploaded_file is None:
+        st.error("Please select a file to upload.")
+        return
+    if not title.strip():
+        st.error("Please provide a title.")
+        return
+    if not api_key:
+        st.error("API_KEY is not configured; cannot upload to a protected backend.")
+        return
+    with st.spinner("Converting and uploading document..."):
+        try:
+            uploaded_file.seek(0)
+            text, conv_meta = convert_uploaded_file_to_text(uploaded_file)
+        except Exception as exc:  # noqa: BLE001
+            st.error(f"Error converting file: {exc}")
+            return
+        if len(text.strip()) < 200 and not upload_anyway:
+            st.warning(
+                "Extracted text is very short (< 200 characters). "
+                "Check the file or enable the checkbox to upload anyway."
+            )
+            return
+        meta: Dict[str, Any] = {
+            **conv_meta,
+            "tags": [t.strip() for t in tags.split(",") if t.strip()],
+            "notes": notes,
+        }
+        payload = {
+            "title": title.strip(),
+            "source": source.strip() or "ui-upload",
+            "text": text,
+            "namespace": namespace.strip() or st.session_state.get("namespace", "dev"),
+            "metadata": meta,
+        }
+        try:
+            response = post_upload_text(backend_base_url, api_key, payload)
+        except httpx.HTTPStatusError as exc:
+            if exc.response is not None:
+                detail = exc.response.text
+                status_code = exc.response.status_code
+            else:
+                detail = str(exc)
+                status_code = "error"
+            st.error(f"Upload failed ({status_code}): {detail}")
+            return
+        except Exception as exc:  # noqa: BLE001
+            st.error(f"Upload failed: {exc}")
+            return
+        # Record recent upload and suggest a follow-up chat action.
+        rec = {
+            "title": title.strip(),
+            "namespace": payload["namespace"],
+            "timestamp": datetime.utcnow().isoformat() + "Z",
+            "response": response,
+        }
+        recent = st.session_state.get("recent_uploads", [])
+        recent.append(rec)
+        st.session_state.recent_uploads = recent[-5:]
+        st.success(f"Uploaded and indexed: {title.strip()}")
+        st.rerun()
+def main() -> None:
+    st.set_page_config(page_title="RAG Agent Workbench", layout="wide")
+    st.title("RAG Agent Workbench – Chatbot")
+    init_session_state()
+    backend_base_url = get_backend_base_url()
+    api_key = get_api_key()
+    # Upload button near the top-level chat UI.
+    if st.button("📄 Upload Document"):
+        upload_dialog(backend_base_url, api_key)
+    settings = render_sidebar(backend_base_url, api_key)
+    render_chat_history(show_sources=st.session_state.show_sources)
+    if not api_key:
+        st.info(
+            "Configure `API_KEY` in Streamlit secrets (and on the backend) to start chatting."
+        )
+        return
+    # Pre-fill chat input if a suggestion was set (e.g. from recent uploads).
+    prefill = st.session_state.get("chat_prefill")
+    if prefill and "chat_input" not in st.session_state:
+        st.session_state.chat_input = prefill
+    user_message = st.chat_input(
+        "Ask a question about your documents...", key="chat_input"
     )
+    if not user_message:
+        return
+    # Clear any prefill once the user has sent a message.
+    st.session_state.chat_prefill = None
+    # Record and display user message
+    st.session_state.messages.append({"role": "user", "content": user_message})
+    with st.chat_message("user"):
+        st.markdown(user_message)
+    # Prepare payload for backend
+    chat_history = [
+        {"role": msg["role"], "content": msg["content"]}
+        for msg in st.session_state.messages
+        if msg.get("role") in ("user", "assistant")
+    ]
+    payload: Dict[str, Any] = {
+        "query": user_message,
+        "namespace": st.session_state.namespace,
+        "top_k": int(settings["top_k"]),
+        "use_web_fallback": settings["use_web_fallback"],
+        "min_score": float(settings["min_score"]),
+        "max_web_results": 5,
+        "chat_history": chat_history,
+    }
+    # Call backend and stream / display assistant response
+    with st.chat_message("assistant"):
+        placeholder = st.empty()
+        placeholder.markdown("_Thinking..._")
+        response: Optional[Dict[str, Any]] = None
+        try:
+            if st.session_state.get("supports_stream", True):
+                try:
+                    # Attempt to use streaming endpoint first.
+                    for partial_answer, final_payload in iter_chat_stream(
+                        backend_base_url,
+                        api_key,
+                        payload,
+                    ):
+                        if partial_answer:
+                            placeholder.markdown(partial_answer)
+                        if final_payload is not None:
+                            response = final_payload
+                            break
+                except httpx.HTTPStatusError as exc:
+                    # If /chat/stream is not available, fall back to /chat.
+                    if exc.response is not None and exc.response.status_code == 404:
+                        st.session_state.supports_stream = False
+                    else:
+                        raise
+            if response is None:
+                # Fallback to non-streaming /chat.
+                response = call_chat(backend_base_url, api_key, payload)
+                answer_text = str(response.get("answer") or "")
+                if answer_text:
+                    placeholder.markdown(answer_text)
+                else:
+                    placeholder.markdown("_No answer returned._")
+        except Exception as exc:  # noqa: BLE001
+            placeholder.markdown("")
+            st.error(f"Error calling backend: {exc}")
             return
+        if not response:
+            return
+        answer = str(response.get("answer") or "")
+        sources = response.get("sources") or []
+        timings = response.get("timings") or {}
+        # Optionally render sources for this assistant turn.
+        if st.session_state.show_sources and sources:
+            with st.expander("Sources", expanded=False):
+                for idx, src in enumerate(sources, start=1):
+                    title = src.get("title") or f"Source {idx}"
+                    url = src.get("url") or ""
+                    score = src.get("score", 0.0)
+                    st.markdown(f"**[{idx}] {title}** (score={score:.3f})")
+                    if url:
+                        st.markdown(f"- URL: {url}")
+                    chunk_text = src.get("chunk_text") or ""
+                    if chunk_text:
+                        st.write(chunk_text[:1000] + ("..." if len(chunk_text) > 1000 else ""))
+        # Persist assistant message with metadata.
+        st.session_state.messages.append(
+            {
+                "role": "assistant",
+                "content": answer,
+                "sources": sources,
+                "timings": timings,
+            }
+        )
 if __name__ == "__main__":

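Review note on `iter_chat_stream`: the line handling can be checked without a running backend if it is factored into a pure function. The following is a minimal sketch under stated assumptions, not the code from this commit. `parse_sse_lines` and `collect_answer` are hypothetical helpers, and the sketch assumes the backend emits `event:` / `data:` lines and a final `end` event carrying JSON, as the diff suggests. Unlike the committed loop, it removes at most one leading space after `data:` and concatenates tokens directly, so tokens that begin with whitespace keep it.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def parse_sse_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines.

    Hypothetical helper: the event name defaults to "message" when no
    `event:` line preceded the data, and resets after each blank line.
    """
    current_event = "message"
    for line in lines:
        if not line:
            # A blank line ends the current event block.
            current_event = "message"
            continue
        if line.startswith("event:"):
            current_event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = line.split(":", 1)[1]
            # Drop at most one leading space so token whitespace survives.
            if data.startswith(" "):
                data = data[1:]
            yield current_event, data


def collect_answer(lines: Iterable[str]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Join token events and decode the JSON payload of the "end" event."""
    parts = []
    final_payload: Optional[Dict[str, Any]] = None
    for event, data in parse_sse_lines(lines):
        if event == "end":
            try:
                final_payload = json.loads(data)
            except json.JSONDecodeError:
                final_payload = None
        elif data:
            parts.append(data)
    return "".join(parts), final_payload
```

In the app, the lines passed in would come from `resp.iter_lines()`; whether direct concatenation or space-joining is right depends on how the backend tokenizes its output, which this diff does not show.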
frontend/services/__init__.py ADDED

	@@ -0,0 +1 @@


+# Helper package for frontend services (conversion, backend client, etc.).

frontend/services/backend_client.py ADDED

	@@ -0,0 +1,25 @@

+from __future__ import annotations
+from typing import Any, Dict, Optional
+import httpx
+def post_upload_text(
+    base_url: str,
+    api_key: Optional[str],
+    payload: Dict[str, Any],
+) -> Dict[str, Any]:
+    """Call backend /documents/upload-text with the given payload.
+    Sends X-API-Key when provided and raises for HTTP errors.
+    """
+    url = f"{base_url.rstrip('/')}/documents/upload-text"
+    headers: Dict[str, str] = {"Content-Type": "application/json"}
+    if api_key:
+        headers["X-API-Key"] = api_key
+    with httpx.Client(timeout=60.0) as client:
+        resp = client.post(url, json=payload, headers=headers)
+        resp.raise_for_status()
+        return resp.json()
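
Review note on `post_upload_text`: the URL and header construction can be unit-tested separately from the network call. This is a minimal sketch, assuming the same endpoint path and `X-API-Key` header shown in the diff; `build_upload_request` is a hypothetical helper that does not exist in this commit.

```python
import json
from typing import Any, Dict, Optional, Tuple


def build_upload_request(
    base_url: str,
    api_key: Optional[str],
    payload: Dict[str, Any],
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a POST to /documents/upload-text."""
    # Tolerate a trailing slash on the configured base URL.
    url = f"{base_url.rstrip('/')}/documents/upload-text"
    headers: Dict[str, str] = {"Content-Type": "application/json"}
    if api_key:
        # Only send the header when a key is configured.
        headers["X-API-Key"] = api_key
    body = json.dumps(payload).encode("utf-8")
    return url, headers, body
```

The three returned values could then be handed to any HTTP client, which keeps the request-shaping logic free of I/O.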

frontend/services/file_convert.py ADDED

	@@ -0,0 +1,63 @@

+from __future__ import annotations
+from pathlib import Path
+from tempfile import NamedTemporaryFile
+from typing import Any, Dict, Tuple
+try:
+    from docling.document_converter import DocumentConverter
+except ImportError:  # pragma: no cover - optional dependency
+    DocumentConverter = None  # type: ignore[assignment]
+def convert_uploaded_file_to_text(uploaded_file) -> Tuple[str, Dict[str, Any]]:
+    """Convert an uploaded Streamlit file to text/markdown.
+    - For .txt and .md, returns raw UTF-8 text.
+    - For other supported formats (PDF/Office/HTML), uses Docling when installed.
+    - Raises a RuntimeError with a user-friendly message when Docling is required
+      but not installed.
+    """
+    filename = uploaded_file.name
+    ext = Path(filename).suffix.lower().lstrip(".")
+    size_bytes = getattr(uploaded_file, "size", None)
+    content_type = getattr(uploaded_file, "type", None)
+    metadata: Dict[str, Any] = {
+        "filename": filename,
+        "ext": ext,
+        "size_bytes": size_bytes,
+        "content_type": content_type,
+    }
+    # Plain text / markdown: read directly.
+    if ext in {"txt", "md"}:
+        raw_bytes = uploaded_file.read()
+        text = raw_bytes.decode("utf-8", errors="ignore")
+        metadata["converted_by"] = "raw"
+        return text, metadata
+    # Rich formats: require Docling.
+    if DocumentConverter is None:
+        raise RuntimeError(
+            "Docling is not installed; conversion for this file type is unavailable. "
+            "Install docling (e.g. `pip install docling`) or upload a .md/.txt file."
+        )
+    # Persist to a temporary file so Docling can read it from disk.
+    with NamedTemporaryFile(delete=True, suffix=f".{ext}") as tmp:
+        # Streamlit's UploadedFile exposes getbuffer() for zero-copy writes.
+        tmp.write(uploaded_file.getbuffer())
+        tmp.flush()
+        converter = DocumentConverter()
+        result = converter.convert(tmp.name)
+        try:
+            text = result.document.export_to_markdown()
+        except Exception:  # noqa: BLE001
+            # Fallback to plain text if markdown export is not available.
+            text = result.document.export_to_text()
+    metadata["converted_by"] = "docling"
+    return text, metadata
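
Review note on the temporary-file step in `convert_uploaded_file_to_text`: Python's `tempfile` documentation says that reopening a `NamedTemporaryFile` by name while it is still open is not guaranteed on every platform (Windows in particular, when `delete=True`). If portability matters, one option is to close the file before converting and delete it manually. The sketch below illustrates that pattern only; `convert_via_temp_file` and its `convert` callback are hypothetical names, and the callback stands in for a call such as `DocumentConverter().convert(path)`.

```python
import os
from tempfile import NamedTemporaryFile
from typing import Callable


def convert_via_temp_file(data: bytes, suffix: str, convert: Callable[[str], str]) -> str:
    """Write data to a temp file, close it, run convert(path), then delete it."""
    tmp = NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        # Close first so the converter can open the path on any platform.
        tmp.close()
        return convert(tmp.name)
    finally:
        # close() is safe to call twice; unlink removes the file we kept.
        tmp.close()
        os.unlink(tmp.name)
```

The trade-off is that cleanup becomes the caller's responsibility, which is why the removal sits in a `finally` block.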

scripts/batch_ingest_local_folder.py ADDED

	@@ -0,0 +1,148 @@

+# Optional dependency:
+#   pip install docling
+#
+# Batch-ingest a local folder of documents into the backend by converting each
+# supported file to markdown/text (using Docling when available) and uploading
+# it via /documents/upload-text.
+import argparse
+import json
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+from docling_convert_and_upload import convert_file_to_text, upload_text  # type: ignore[import]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description=(
+            "Recursively ingest a folder of local documents using Docling (when available) "
+            "and upload them to the backend via /documents/upload-text."
+        )
+    )
+    parser.add_argument(
+        "--folder",
+        type=str,
+        required=True,
+        help="Root folder containing documents to ingest.",
+    )
+    parser.add_argument(
+        "--backend-url",
+        "--backend",
+        dest="backend_url",
+        type=str,
+        default="http://localhost:8000",
+        help="Base URL of the running backend (default: http://localhost:8000).",
+    )
+    parser.add_argument(
+        "--namespace",
+        type=str,
+        default="dev",
+        help="Target Pinecone namespace (default: dev).",
+    )
+    parser.add_argument(
+        "--source",
+        type=str,
+        default="local-folder",
+        help="Source label stored in metadata (default: local-folder).",
+    )
+    parser.add_argument(
+        "--api-key",
+        type=str,
+        default=None,
+        help="Optional API key for the backend (sent as X-API-Key).",
+    )
+    parser.add_argument(
+        "--max-files",
+        type=int,
+        default=200,
+        help="Maximum number of files to ingest (default: 200).",
+    )
+    return parser.parse_args()
+SUPPORTED_EXTENSIONS = {
+    ".pdf",
+    ".docx",
+    ".ppt",
+    ".pptx",
+    ".xls",
+    ".xlsx",
+    ".html",
+    ".htm",
+    ".md",
+    ".markdown",
+    ".adoc",
+    ".txt",
+}
+def find_files(root: Path, max_files: int) -> List[Path]:
+    files: List[Path] = []
+    for path in root.rglob("*"):
+        if not path.is_file():
+            continue
+        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
+            continue
+        files.append(path)
+        if len(files) >= max_files:
+            break
+    return files
+def main() -> int:
+    args = parse_args()
+    root = Path(args.folder).expanduser().resolve()
+    if not root.is_dir():
+        raise SystemExit(f"Folder not found: {root}")
+    files = find_files(root, args.max_files)
+    if not files:
+        print(f"No supported files found in {root}")
+        return 0
+    print(f"Found {len(files)} file(s) to ingest in {root} (max {args.max_files}).")
+    successes = 0
+    failures: List[Dict[str, Any]] = []
+    for idx, file_path in enumerate(files, start=1):
+        print(f"[{idx}/{len(files)}] Converting {file_path}...")
+        try:
+            text = convert_file_to_text(file_path)
+        except Exception as exc:  # noqa: BLE001
+            print(f"  Conversion failed: {exc}")
+            failures.append({"path": str(file_path), "error": str(exc)})
+            continue
+        try:
+            response = upload_text(
+                backend_url=args.backend_url,
+                title=file_path.name,
+                source=args.source,
+                text=text,
+                namespace=args.namespace,
+                metadata={
+                    "original_path": str(file_path),
+                    "extension": file_path.suffix.lower(),
+                },
+                api_key=args.api_key,
+            )
+            successes += 1
+            print(f"  Uploaded successfully: {json.dumps(response, indent=2)}")
+        except Exception as exc:  # noqa: BLE001
+            print(f"  Upload failed: {exc}")
+            failures.append({"path": str(file_path), "error": str(exc)})
+    print()
+    print(f"Ingestion complete. Successes: {successes}, Failures: {len(failures)}")
+    if failures:
+        print("Failures:")
+        for item in failures:
+            print(f"- {item['path']}: {item['error']}")
+    # Report partial failure to callers (e.g. CI) with a non-zero exit code.
+    return 1 if failures else 0
+if __name__ == "__main__":
+    raise SystemExit(main())
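
Review note on `find_files`: `Path.rglob` does not promise a particular ordering, so with `--max-files` the set of ingested files may differ between runs or machines. A minimal sketch of a deterministic variant follows; `find_files_sorted` is a hypothetical name, and it trades the early `break` for a full scan followed by a sort.

```python
from pathlib import Path
from typing import List, Set


def find_files_sorted(root: Path, extensions: Set[str], max_files: int) -> List[Path]:
    """Return up to max_files matching files under root, in sorted path order."""
    matches = [
        path
        for path in root.rglob("*")
        # Compare lowercased suffixes so ".PDF" matches ".pdf".
        if path.is_file() and path.suffix.lower() in extensions
    ]
    return sorted(matches)[:max_files]
```

For very large trees the full scan costs more than stopping early, so the original approach may still be preferable when reproducibility is not a concern.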

scripts/docling_convert_and_upload.py CHANGED

@@ -1,28 +1,44 @@
-# pip install docling
 import argparse
 import json
-from typing import Any, Dict
 import httpx
-from docling.document_converter import DocumentConverter
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description=(
-            "Convert a local PDF using Docling and upload the extracted text "
-            "to the RAG backend via /documents/upload-text."
         )
     )
     parser.add_argument(
         "--pdf-path",
         type=str,
         required=True,
-        help="Path to the local PDF file.",
     )
     parser.add_argument(
         "--backend-url",
         type=str,
         default="http://localhost:8000",
         help="Base URL of the running backend (default: http://localhost:8000).",
@@ -37,21 +53,50 @@ def parse_args() -> argparse.Namespace:
         "--title",
         type=str,
         default=None,
-        help="Optional title for the document; defaults to the PDF filename.",
     )
     parser.add_argument(
         "--source",
         type=str,
-        default="docling",
-        help="Source label stored in metadata (default: docling).",
     )
     return parser.parse_args()
-def convert_pdf_to_markdown(pdf_path: str) -> str:
-    converter = DocumentConverter()
-    result = converter.convert(pdf_path)
-    return result.document.export_to_markdown()
 def upload_text(
@@ -60,7 +105,8 @@ def upload_text(
     source: str,
     text: str,
     namespace: str,
-    metadata: Dict[str, Any] | None = None,
 ) -> Dict[str, Any]:
     url = f"{backend_url.rstrip('/')}/documents/upload-text"
     payload = {
@@ -70,18 +116,30 @@ def upload_text(
         "namespace": namespace,
         "metadata": metadata or {},
     }
     with httpx.Client(timeout=60.0) as client:
-        response = client.post(url, json=payload)
         response.raise_for_status()
         return response.json()
 def main() -> int:
     args = parse_args()
-    title = args.title or args.pdf_path.rsplit("/", 1)[-1]
-    print(f"Converting PDF at {args.pdf_path} with Docling...")
-    markdown_text = convert_pdf_to_markdown(args.pdf_path)
     print(
         f"Uploading converted text to backend at {args.backend_url} "
@@ -91,9 +149,10 @@ def main() -> int:
         backend_url=args.backend_url,
         title=title,
         source=args.source,
-        text=markdown_text,
         namespace=args.namespace,
-        metadata={"original_path": args.pdf_path},
     )
     print("Upload response:")

+# Optional dependency:
+#   pip install docling
+#
+# This script converts local documents (PDF, Markdown, and other formats
+# supported by Docling) to text/markdown and uploads them to the backend via
+# /documents/upload-text. Docling is used when available; for .txt/.md files,
+# the script can fall back to raw text if Docling is not installed.
 import argparse
 import json
+from pathlib import Path
+from typing import Any, Dict, Optional
 import httpx
+try:
+    from docling.document_converter import DocumentConverter
+except ImportError:  # pragma: no cover - optional dependency
+    DocumentConverter = None  # type: ignore[assignment]
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description=(
+            "Convert a local document using Docling (when available) and "
+            "upload the extracted text to the RAG backend via /documents/upload-text."
         )
     )
     parser.add_argument(
+        "--file",
         "--pdf-path",
+        "--path",
+        dest="file_path",
         type=str,
         required=True,
+        help="Path to the local file (PDF, Markdown, DOCX, HTML, TXT, etc.).",
     )
     parser.add_argument(
         "--backend-url",
+        "--backend",
+        dest="backend_url",
         type=str,
         default="http://localhost:8000",
         help="Base URL of the running backend (default: http://localhost:8000).",
         "--title",
         type=str,
         default=None,
+        help="Optional title for the document; defaults to the filename.",
     )
     parser.add_argument(
         "--source",
         type=str,
+        default="local-file",
+        help="Source label stored in metadata (default: local-file).",
+    )
+    parser.add_argument(
+        "--api-key",
+        type=str,
+        default=None,
+        help="Optional API key for the backend (sent as X-API-Key).",
     )
     return parser.parse_args()
+def _docling_available() -> bool:
+    return DocumentConverter is not None
+def convert_file_to_text(file_path: Path) -> str:
+    """Convert a file to markdown/text.
+    - If Docling is installed, it is used for all supported formats.
+    - If Docling is not installed:
+      - .txt and .md files are read as raw text.
+      - Other formats raise a RuntimeError with installation instructions.
+    """
+    suffix = file_path.suffix.lower()
+    if _docling_available():
+        converter = DocumentConverter()
+        result = converter.convert(str(file_path))
+        return result.document.export_to_markdown()
+    # Docling is not available.
+    if suffix in {".txt", ".md"}:
+        return file_path.read_text(encoding="utf-8", errors="ignore")
+    raise RuntimeError(
+        f"Docling is required to convert '{file_path}'. Install it with:\n"
+        "  pip install docling"
+    )
 def upload_text(
     source: str,
     text: str,
     namespace: str,
+    metadata: Optional[Dict[str, Any]] = None,
+    api_key: Optional[str] = None,
 ) -> Dict[str, Any]:
     url = f"{backend_url.rstrip('/')}/documents/upload-text"
     payload = {
         "namespace": namespace,
         "metadata": metadata or {},
     }
+    headers: Dict[str, str] = {"Content-Type": "application/json"}
+    if api_key:
+        headers["X-API-Key"] = api_key
     with httpx.Client(timeout=60.0) as client:
+        response = client.post(url, json=payload, headers=headers)
         response.raise_for_status()
         return response.json()
 def main() -> int:
     args = parse_args()
+    file_path = Path(args.file_path).expanduser().resolve()
+    if not file_path.is_file():
+        raise SystemExit(f"File not found: {file_path}")
+    title = args.title or file_path.name
+    print(f"Converting file at {file_path}...")
+    try:
+        text = convert_file_to_text(file_path)
+    except Exception as exc:  # noqa: BLE001
+        print(f"Error converting file: {exc}")
+        return 1
     print(
         f"Uploading converted text to backend at {args.backend_url} "
         backend_url=args.backend_url,
         title=title,
         source=args.source,
+        text=text,
         namespace=args.namespace,
+        metadata={"original_path": str(file_path), "extension": file_path.suffix.lower()},
+        api_key=args.api_key,
     )
     print("Upload response:")