Spaces: Param20h/PDF-Assit_RAG

Paramjit Singh committed about 15 hours ago

Commit

5926dae

unverified ·

2 Parent(s): f618282 3dfe460

Merge pull request #336 from Srushti-Kamble14/feat/celery-redis-pdf-processing

Files changed (13)

.env.example +10 -0
README.md +5 -1
backend/app/celery_app.py +23 -0
backend/app/config.py +5 -0
backend/app/routes/documents.py +16 -151
backend/app/schemas.py +1 -0
backend/app/services/document_ingestion.py +27 -3
backend/app/tasks.py +22 -0
backend/requirements.txt +1 -0
backend/tests/test_document_upload_validation.py +7 -3
backend/tests/test_documents.py +5 -5
docker-compose.yml +46 -2
docs/ARCHITECTURE.md +8 -6

.env.example CHANGED

@@ -55,6 +55,16 @@ ALLOWED_ORIGINS=http://localhost:3000,http://localhost:7860
 # Optional — required only for Google sign-in.
 # NEXT_PUBLIC_GOOGLE_CLIENT_ID=your_google_oauth_client_id.apps.googleusercontent.com
+# ── Celery / Redis Background Processing ───────────────────
+# Redis URL used by FastAPI to enqueue PDF processing jobs.
+# Optional — defaults to redis://localhost:6379/0
+# CELERY_BROKER_URL=redis://localhost:6379/0
+# Redis URL used by Celery to store task results/status.
+# Optional — defaults to redis://localhost:6379/1
+# CELERY_RESULT_BACKEND=redis://localhost:6379/1
 # ── File Upload ─────────────────────────────────────────────
 # Directory where uploaded documents (PDFs, DOCXs, etc.) are stored.

README.md CHANGED

@@ -362,6 +362,8 @@ DATABASE_URL=sqlite:///./data/app.db
 HF_TOKEN=hf_your_huggingface_token_here
 UPLOAD_DIR=./data/uploads
 CHROMA_PERSIST_DIR=./data/chroma_db
+CELERY_BROKER_URL=redis://localhost:6379/0
+CELERY_RESULT_BACKEND=redis://localhost:6379/1
 ```
 > Get your free HuggingFace token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
@@ -410,7 +412,7 @@ npm run dev
 ```bash
 docker compose up --build
-# → Full stack at http://localhost:7860
+# → FastAPI, Redis, Celery worker, and Postgres at http://localhost:7860
 ```
 <br/>
@@ -503,6 +505,8 @@ docker compose up --build
 | `JWT_EXPIRY_HOURS` | ❌ | `72` | JWT token lifetime in hours before re-login is required. | — |
 | `GOOGLE_CLIENT_ID` | ❌ | — | Google OAuth web client ID used by FastAPI to verify ID tokens. | [Google Cloud Console](https://console.cloud.google.com/apis/credentials) |
 | `NEXT_PUBLIC_GOOGLE_CLIENT_ID` | ❌ | — | Google OAuth web client ID exposed to the Next.js Google sign-in button. | [Google Cloud Console](https://console.cloud.google.com/apis/credentials) |
+| `CELERY_BROKER_URL` | ❌ | `redis://localhost:6379/0` | Redis broker URL used by FastAPI to queue document ingestion jobs. | Redis |
+| `CELERY_RESULT_BACKEND` | ❌ | `redis://localhost:6379/1` | Redis backend URL used by Celery to store task state/results. | Redis |
 | `UPLOAD_DIR` | ❌ | `./data/uploads` | Local directory for storing uploaded documents. | — |
 | `MAX_FILE_SIZE_MB` | ❌ | `50` | Maximum allowed upload file size in MB. | — |
 | `ALLOWED_EXTENSIONS` | ❌ | `pdf,docx,txt,md` | Comma-separated list of permitted file extensions. | — |

backend/app/celery_app.py ADDED

@@ -0,0 +1,23 @@
+"""Celery application configured for Redis-backed background jobs."""
+from celery import Celery
+from app.config import get_settings
+settings = get_settings()
+celery_app = Celery(
+    "pdf_assistant_rag",
+    broker=settings.CELERY_BROKER_URL,
+    backend=settings.CELERY_RESULT_BACKEND,
+    include=["app.tasks"],
+)
+celery_app.conf.update(
+    task_track_started=settings.CELERY_TASK_TRACK_STARTED,
+    task_serializer="json",
+    result_serializer="json",
+    accept_content=["json"],
+    timezone="UTC",
+)

backend/app/config.py CHANGED

@@ -33,6 +33,11 @@ class Settings(BaseSettings):
     DRIVE_SYNC_INTERVAL_MINUTES: int = 60
     GOOGLE_SERVICE_ACCOUNT_FILE: str = ""
+    # Celery / Redis background processing
+    CELERY_BROKER_URL: str = "redis://localhost:6379/0"
+    CELERY_RESULT_BACKEND: str = "redis://localhost:6379/1"
+    CELERY_TASK_TRACK_STARTED: bool = True
     # ── File Upload ──────────────────────────────────────
     UPLOAD_DIR: str = "./data/uploads"
     MAX_UPLOAD_SIZE_MB: int = 20
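
The three Celery fields above resolve from environment variables and fall back to the listed defaults. A minimal stdlib-only sketch of that lookup follows. The project itself uses a `BaseSettings` class, so `CelerySettings`, `load_celery_settings`, and the boolean parsing rule here are hypothetical names and assumptions made for illustration only.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CelerySettings:
    """Stdlib stand-in for the three Celery fields added to Settings."""

    broker_url: str
    result_backend: str
    task_track_started: bool


def load_celery_settings(env: Mapping[str, str] = os.environ) -> CelerySettings:
    """Read the Celery variables, falling back to the defaults in the diff."""
    raw_flag = env.get("CELERY_TASK_TRACK_STARTED", "true")
    return CelerySettings(
        broker_url=env.get("CELERY_BROKER_URL", "redis://localhost:6379/0"),
        result_backend=env.get("CELERY_RESULT_BACKEND", "redis://localhost:6379/1"),
        # Assumed parsing rule; the real settings class may accept other spellings.
        task_track_started=raw_flag.strip().lower() in {"1", "true", "yes", "on"},
    )


if __name__ == "__main__":
    # Local development: nothing set, so the localhost defaults apply.
    print(load_celery_settings({}))
    # Docker Compose: the service name replaces localhost.
    print(load_celery_settings({"CELERY_BROKER_URL": "redis://redis:6379/0"}))
```

Using separate Redis databases (`/0` and `/1`) keeps queued messages and stored task results apart, which matches the defaults in this commit.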

backend/app/routes/documents.py CHANGED

@@ -1,6 +1,6 @@
 """
 Document management routes — upload, list, delete, and serve PDF files.
-Background ingestion via FastAPI BackgroundTasks.
+Background ingestion via Celery workers.
 """
 import os
 import sys
@@ -14,7 +14,7 @@ from pathlib import Path
 import shutil
 import tempfile
 from urllib.parse import urlparse
-from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, BackgroundTasks, status, Query
+from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, status, Query
 from fastapi.responses import FileResponse
 from sqlalchemy.orm import Session
@@ -23,8 +23,7 @@ from app.models import User, Document
 from app.schemas import DocumentResponse, DocumentListResponse, DocumentStatusResponse, ChunkSettings, UploadUrl
 from app.auth import get_current_user
 from app.config import get_settings
-from app.rag.chunker import chunk_document, get_page_count
-from app.rag.vectorstore import store_chunks
+from app.tasks import process_document
 try:
     from crawl4ai import AsyncWebCrawler
@@ -130,133 +129,6 @@ async def validate_upload(file: UploadFile):
         pass
-def _ingest_document(document_id: str, filepath: str, original_name: str, user_id: str):
-    """
-    Process a document in the background: chunk document, generate embeddings, and store in ChromaDB,
-    calls document summary function, and update the database record.
-    This function is intended to be run as a background task.
-    It creates its own database session, updates the
-    document status, extracts text, splits into chunks, generates embeddings,
-    stores everything in ChromaDB, calls summary function, updates the document record with page count,
-    chunk count, and summary, and marks the document as 'ready'.
-    On failure, it sets status to 'failed' and records the error message.
-    Args:
-        document_id: Unique identifier of the document in the database.
-        filepath: Absolute or relative path to the uploaded file on disk.
-        original_name: original filename provided by the user (for logging and metadata).
-        user_id: Identifier of the user who owns the document.
-    Returns:
-        None
-    Note:
-        This function does not raise exceptions to the caller;
-        all errors are logged and the database record is updated accordingly.
-    """
-    from app.database import SessionLocal
-    db = SessionLocal()
-    try:
-        doc = (
-            db.query(Document)
-            .filter(Document.id == document_id, Document.is_deleted.is_(False))
-            .first()
-        )
-        if not doc:
-            logger.error(f"Document {document_id} not found for ingestion")
-            return
-        # Update status to processing
-        doc.status = "processing"
-        db.commit()
-        # Get page count
-        page_count = get_page_count(filepath)
-        doc.page_count = page_count
-        # Chunk document with optional chunk size and overlap parameters from the document record, falling back to global defaults if not set
-        chunk_size = doc.chunk_size
-        chunk_overlap = doc.chunk_overlap
-        try:
-            kwargs = {}
-            if chunk_size is not None:
-                kwargs["chunk_size"] = chunk_size
-            if chunk_overlap is not None:
-                kwargs["chunk_overlap"] = chunk_overlap
-            if kwargs:
-                chunks = chunk_document(filepath, **kwargs)
-            else:
-                chunks = chunk_document(filepath)
-        except TypeError:
-            # Backward-compatible fallback for chunk_document implementations/tests
-            # that only accept (filepath)
-            chunks = chunk_document(filepath)
-        if not chunks:
-            doc.status = "failed"
-            doc.error_message = "No text could be extracted from the document"
-            db.commit()
-            return
-        # Build and persist a lightweight entity co-occurrence graph for GraphRAG.
-        try:
-            from app.rag.graph_builder import build_graph, save_graph
-            graph = build_graph(chunks)
-            save_graph(graph, user_id=user_id, document_id=document_id)
-        except Exception as e:
-            logger.warning(f"Could not build knowledge graph for document {document_id}: {e}")
-        # Store embeddings in ChromaDB
-        chunk_count = store_chunks(
-            chunks=chunks,
-            document_id=document_id,
-            filename=original_name,
-            user_id=user_id,
-        )
-        # Generate summary and update document record
-        try:
-            from app.rag.summarizer import generate_document_summary
-            summary = generate_document_summary(filepath, max_sentences=2)
-            if summary:
-                doc.summary = summary
-                db.commit() # Update document record with summary
-        except Exception as e:
-            logger.warning(f"Could not import summarizer for document {document_id}: {e}")
-            doc.summary = None
-        # Update document record
-        doc.chunk_count = chunk_count
-        doc.status = "ready"
-        db.commit()
-        logger.info(f"Document {document_id} ingested: {page_count} pages, {chunk_count} chunks")
-    except Exception as e:
-        logger.error(f"Ingestion error for {document_id}: {e}")
-        try:
-            doc = (
-                db.query(Document)
-                .filter(Document.id == document_id, Document.is_deleted.is_(False))
-                .first()
-            )
-            if doc:
-                doc.status = "failed"
-                doc.error_message = str(e)[:500]
-                db.commit()
-        except Exception:
-            pass
-    finally:
-        db.close()
 def _crawl_in_new_loop(url: str) -> str:
     """Run the async crawler in a fresh event loop on a worker thread.
     On Windows this must be a ProactorEventLoop to support subprocesses.
@@ -288,7 +160,6 @@ def _crawl_in_new_loop(url: str) -> str:
 @router.post("/upload", response_model=DocumentResponse, status_code=status.HTTP_202_ACCEPTED)
 async def upload_document(
-    background_tasks: BackgroundTasks,
     file: UploadFile = File(...),
     user: User = Depends(get_current_user),
     db: Session = Depends(get_db),
@@ -298,12 +169,11 @@ async def upload_document(
     Validates the uploaded file (extension, size, MIME type, integrity),
     saves it to the user's directory, creates a database record with status
-    'pending', schedules a background task for chunking and embedding, and
-    returns 202 Accepted immediately so large documents do not block the API
-    request while embeddings are generated.
+    'pending', queues a Celery task for chunking and embedding, and returns
+    202 Accepted immediately so large documents do not block the API request
+    while embeddings are generated.
     Args:
-        background_tasks: FastAPI BackgroundTasks instance to run the ingestion process asynchronously.
         file: The uploaded file, provided as a multipart/form-data field in the request.
         user: The currently authenticated user, injected by the `get_current_user` dependency.
         db: Database session, injected by the `get_db` dependency.
@@ -357,21 +227,19 @@ async def upload_document(
     db.commit()
     db.refresh(document)
-    # ── Trigger background ingestion ─────────────────
-    background_tasks.add_task(
-        _ingest_document,
+    # ── Queue background ingestion ─────────────────
+    task = process_document.delay(
         document_id=document.id,
         filepath=filepath,
         original_name=file.filename,
         user_id=user.id,
     )
-    return DocumentResponse.model_validate(document)
+    return DocumentResponse.model_validate(document).model_copy(update={"task_id": task.id})
 @router.post("/urlupload", status_code=status.HTTP_202_ACCEPTED)
 async def upload_document_url(
         payload: UploadUrl,
-        background_tasks: BackgroundTasks,
         user: User = Depends(get_current_user),
         db: Session = Depends(get_db),
 ):
@@ -443,16 +311,15 @@ async def upload_document_url(
         db.commit()
         db.refresh(document)
-        # ── Trigger background ingestion ───────────────────────
-        background_tasks.add_task(
-            _ingest_document,
+        # ── Queue background ingestion ───────────────────────
+        task = process_document.delay(
             document_id=document.id,
             filepath=filepath,
             original_name=original_name,
             user_id=user.id,
         )
-        return DocumentResponse.model_validate(document)
+        return DocumentResponse.model_validate(document).model_copy(update={"task_id": task.id})
     except HTTPException:
         raise
@@ -681,7 +548,6 @@ def delete_document(
 def update_chunk_settings(
     document_id: str,
     settings_update: ChunkSettings,
-    background_tasks: BackgroundTasks,
     user: User = Depends(get_current_user),
     db: Session = Depends(get_db),
 ):
@@ -692,7 +558,6 @@ def update_chunk_settings(
     Args:
         document_id: The unique identifier of the document to update.
         settings_update: A ChunkSettings object containing the chunk_size and chunk_overlap values.
-        background_tasks: FastAPI BackgroundTasks instance to run the ingestion process asynchronously.
         user: The currently authenticated user, injected by the `get_current_user` dependency.
         db: Database session, injected by the `get_db` dependency.
@@ -733,13 +598,13 @@ def update_chunk_settings(
     doc.summary = None
     db.commit()
-    # Trigger background ingestion with updated chunk settings. The _ingest_document function will read the new chunk settings from the document record and re-chunk the document accordingly.
-    background_tasks.add_task(
-        _ingest_document,
+    # Queue ingestion with updated chunk settings. The worker reads the new
+    # settings from the document record before re-chunking.
+    task = process_document.delay(
         document_id=doc.id,
         filepath=os.path.join(settings.UPLOAD_DIR, user.id, doc.filename),
         original_name=doc.original_name,
         user_id=user.id,
     )
     # Return the updated document record with new chunk settings
-    return DocumentResponse.model_validate(doc)
+    return DocumentResponse.model_validate(doc).model_copy(update={"task_id": task.id})

backend/app/schemas.py CHANGED

@@ -119,6 +119,7 @@ class DocumentResponse(BaseModel):
     error_message: Optional[str] = None
     uploaded_at: datetime
     summary: Optional[str] = None # New field for document summary
+    task_id: Optional[str] = None
     class Config:
         from_attributes = True

backend/app/services/document_ingestion.py CHANGED

@@ -17,18 +17,31 @@ def ingest_document(document_id: str, filepath: str, original_name: str, user_id
     db = SessionLocal()
     try:
-        doc = db.query(Document).filter(Document.id == document_id).first()
+        doc = db.query(Document).filter(
+            Document.id == document_id,
+            Document.is_deleted.is_(False),
+        ).first()
         if not doc:
             logger.error("Document %s not found for ingestion", document_id)
             return
         doc.status = "processing"
+        doc.error_message = None
         db.commit()
         page_count = get_page_count(filepath)
         doc.page_count = page_count
-        chunks = chunk_document(filepath)
+        try:
+            chunk_kwargs = {}
+            if doc.chunk_size is not None:
+                chunk_kwargs["chunk_size"] = doc.chunk_size
+            if doc.chunk_overlap is not None:
+                chunk_kwargs["chunk_overlap"] = doc.chunk_overlap
+            chunks = chunk_document(filepath, **chunk_kwargs)
+        except TypeError:
+            # Preserve compatibility with patched/test implementations.
+            chunks = chunk_document(filepath)
         if not chunks:
             doc.status = "failed"
@@ -36,6 +49,14 @@ def ingest_document(document_id: str, filepath: str, original_name: str, user_id
             db.commit()
             return
+        try:
+            from app.rag.graph_builder import build_graph, save_graph
+            graph = build_graph(chunks)
+            save_graph(graph, user_id=user_id, document_id=document_id)
+        except Exception as e:
+            logger.warning("Could not build knowledge graph for document %s: %s", document_id, e)
         chunk_count = store_chunks(
             chunks=chunks,
             document_id=document_id,
@@ -69,7 +90,10 @@ def ingest_document(document_id: str, filepath: str, original_name: str, user_id
     except Exception as e:
         logger.error("Ingestion error for %s: %s", document_id, e)
         try:
-            doc = db.query(Document).filter(Document.id == document_id).first()
+            doc = db.query(Document).filter(
+                Document.id == document_id,
+                Document.is_deleted.is_(False),
+            ).first()
             if doc:
                 doc.status = "failed"
                 doc.error_message = str(e)[:500]
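
The chunking change in this file forwards `chunk_size` and `chunk_overlap` only when the document record sets them, and retries with the bare call on `TypeError` (the tests in this commit patch `chunk_document` with a one-argument lambda). The pattern can be sketched in isolation as below. `chunk_with_overrides`, `full_chunker`, and `stub_chunker` are hypothetical names for illustration, not project code, and the default sizes are made up.

```python
from typing import Callable, Optional


def chunk_with_overrides(
    chunk_fn: Callable[..., list],
    filepath: str,
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
) -> list:
    """Pass per-document overrides only when set; retry bare on TypeError."""
    kwargs = {}
    if chunk_size is not None:
        kwargs["chunk_size"] = chunk_size
    if chunk_overlap is not None:
        kwargs["chunk_overlap"] = chunk_overlap
    try:
        return chunk_fn(filepath, **kwargs)
    except TypeError:
        # A stub that only accepts (filepath) rejects the keyword arguments.
        return chunk_fn(filepath)


def full_chunker(filepath, chunk_size=1000, chunk_overlap=200):
    """Hypothetical chunker that honours both overrides."""
    return [{"path": filepath, "size": chunk_size, "overlap": chunk_overlap}]


def stub_chunker(filepath):
    """Hypothetical test double with the narrow (filepath) signature."""
    return [{"path": filepath}]


if __name__ == "__main__":
    print(chunk_with_overrides(full_chunker, "a.pdf", chunk_size=500))
    print(chunk_with_overrides(stub_chunker, "a.pdf", chunk_size=500))
```

One trade-off worth keeping in mind: a `TypeError` raised inside a chunker that does accept the keywords would also trigger the retry, so the overrides could be dropped silently.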

backend/app/tasks.py ADDED

@@ -0,0 +1,22 @@
+"""Celery tasks for document processing."""
+from app.celery_app import celery_app
+from app.services.document_ingestion import ingest_document
+@celery_app.task(bind=True, name="app.tasks.process_document")
+def process_document(
+    self,
+    document_id: str,
+    filepath: str,
+    original_name: str,
+    user_id: str,
+) -> dict[str, str]:
+    """Run the RAG ingestion pipeline for a stored document."""
+    ingest_document(
+        document_id=document_id,
+        filepath=filepath,
+        original_name=original_name,
+        user_id=user_id,
+    )
+    return {"document_id": document_id, "status": "completed"}

backend/requirements.txt CHANGED

@@ -56,6 +56,7 @@ huggingface-hub
 gunicorn
 slowapi
 prometheus-fastapi-instrumentator
+celery[redis]
 # File Validation
 #sudo apt-get install libmagic1 // for Debian/Ubuntu

backend/tests/test_document_upload_validation.py CHANGED

@@ -6,7 +6,7 @@ import uuid
 from pathlib import Path
 import pytest
-from fastapi import BackgroundTasks, HTTPException, UploadFile
+from fastapi import HTTPException, UploadFile
 from pypdf import PdfWriter
 from sqlalchemy import create_engine
 from sqlalchemy.orm import sessionmaker
@@ -141,10 +141,14 @@ def test_upload_document_handles_duplicate_original_names(
     monkeypatch.setattr(documents, "validate_upload", fake_validate_upload)
     monkeypatch.setattr(documents.settings, "UPLOAD_DIR", str(tmp_path / "uploads"))
     monkeypatch.setattr(documents.uuid, "uuid4", lambda: next(uuid_values))
+    monkeypatch.setattr(
+        documents.process_document,
+        "delay",
+        lambda **_kwargs: types.SimpleNamespace(id="queued-task"),
+    )
     first = _run(
         documents.upload_document(
-            BackgroundTasks(),
             file=_upload_file("same-name.pdf", b"first"),
             user=user,
             db=session,
@@ -152,7 +156,6 @@ def test_upload_document_handles_duplicate_original_names(
     )
     second = _run(
         documents.upload_document(
-            BackgroundTasks(),
             file=_upload_file("same-name.pdf", b"second"),
             user=user,
             db=session,
@@ -164,6 +167,7 @@ def test_upload_document_handles_duplicate_original_names(
     assert [doc.original_name for doc in stored_docs] == ["same-name.pdf", "same-name.pdf"]
     assert len({doc.filename for doc in stored_docs}) == 2
     assert first.original_name == second.original_name == "same-name.pdf"
+    assert first.task_id == second.task_id == "queued-task"
     assert (tmp_path / "uploads" / user.id / f"{first_hex}.pdf").exists()
     assert (tmp_path / "uploads" / user.id / f"{second_hex}.pdf").exists()
     assert all(not path.exists() for path in temp_files)
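
The test change replaces `process_document.delay` with a lambda that returns an object carrying only an `id`, so the upload route can run without a Redis broker. A stdlib-only sketch of the same idea follows, using `unittest.mock.patch.object` in place of pytest's `monkeypatch`. `FakeTask`, `enqueue`, and `run_with_stubbed_delay` are hypothetical stand-ins, not project code.

```python
import types
from unittest import mock


class FakeTask:
    """Hypothetical stand-in for a task object such as process_document."""

    def delay(self, **kwargs):
        # Unpatched, enqueueing would need a reachable broker.
        raise RuntimeError("no broker available in unit tests")


process_document = FakeTask()


def enqueue(document_id: str) -> str:
    """Mimics the route: enqueue the job and hand back the task id."""
    task = process_document.delay(document_id=document_id)
    return task.id


def run_with_stubbed_delay() -> str:
    """Swap in a stub for the duration of the call, then restore the original."""
    fake_result = types.SimpleNamespace(id="queued-task")
    with mock.patch.object(process_document, "delay", lambda **_kwargs: fake_result):
        return enqueue("doc-1")


if __name__ == "__main__":
    print(run_with_stubbed_delay())
```

Because the stub returns the same `id` on every call, two uploads in one test share a task id, which is what the new `first.task_id == second.task_id` assertion relies on.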

backend/tests/test_documents.py CHANGED

@@ -1,7 +1,7 @@
 import types
 from app.models import Document
-from app.routes.documents import _ingest_document
+from app.services.document_ingestion import ingest_document
 def test_api_health(client):
@@ -56,9 +56,9 @@ def test_ingest_document_builds_and_saves_graph(db_session, monkeypatch, tmp_pat
     chunks = [{"text": "OpenAI works with Microsoft.", "page": 1, "chunk_index": 0}]
     saved = {}
-    monkeypatch.setattr("app.routes.documents.get_page_count", lambda filepath: 1)
-    monkeypatch.setattr("app.routes.documents.chunk_document", lambda filepath: chunks)
-    monkeypatch.setattr("app.routes.documents.store_chunks", lambda **kwargs: len(chunks))
+    monkeypatch.setattr("app.services.document_ingestion.get_page_count", lambda filepath: 1)
+    monkeypatch.setattr("app.services.document_ingestion.chunk_document", lambda filepath: chunks)
+    monkeypatch.setattr("app.services.document_ingestion.store_chunks", lambda **kwargs: len(chunks))
     monkeypatch.setattr("app.database.SessionLocal", lambda: db_session)
     fake_summary = types.ModuleType("app.rag.summarizer")
@@ -76,7 +76,7 @@ def test_ingest_document_builds_and_saves_graph(db_session, monkeypatch, tmp_pat
         ),
     )
-    _ingest_document(
+    ingest_document(
         document_id=document_id,
         filepath=str(tmp_path / "graph.txt"),
         original_name=document.original_name,

docker-compose.yml CHANGED

@@ -1,6 +1,20 @@
 version: '3.8'
 services:
+  # Redis broker/result backend for Celery document processing
+  redis:
+    image: redis:7-alpine
+    container_name: pdf_rag_redis
+    restart: unless-stopped
+    ports:
+      - "6379:6379"
+    healthcheck:
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+      start_period: 5s
   # ── PostgreSQL Database ──────────────────────────────────
   postgres:
     image: postgres:16-alpine
@@ -34,11 +48,16 @@ services:
       - SECRET_KEY=${SECRET_KEY:-dev-secret-key-change-me}
       - HF_TOKEN=${HF_TOKEN}
       - DATABASE_URL=postgresql://${POSTGRES_USER:-pdf_rag_user}:${POSTGRES_PASSWORD:-pdf_rag_pass}@postgres:5432/${POSTGRES_DB:-pdf_rag}
-      - UPLOAD_DIR=./data/uploads
-      - CHROMA_PERSIST_DIR=./data/chroma_db
+      - UPLOAD_DIR=/app/data/uploads
+      - CHROMA_PERSIST_DIR=/app/data/chroma_db
+      - GRAPH_PERSIST_DIR=/app/data/graphs
+      - CELERY_BROKER_URL=redis://redis:6379/0
+      - CELERY_RESULT_BACKEND=redis://redis:6379/1
     depends_on:
       postgres:
         condition: service_healthy
+      redis:
+        condition: service_healthy
     restart: unless-stopped
     healthcheck:
       test: ["CMD", "curl", "-f", "http://localhost:7860/api/health"]
@@ -47,6 +66,31 @@ services:
       retries: 3
       start_period: 60s
+  # Celery worker for document extraction, chunking, embeddings, and vector storage
+  worker:
+    build: .
+    container_name: pdf_rag_worker
+    command: >
+      sh -c "cd /app/backend &&
+      celery -A app.celery_app.celery_app worker --loglevel=info"
+    volumes:
+      - app_data:/app/data
+    environment:
+      - SECRET_KEY=${SECRET_KEY:-dev-secret-key-change-me}
+      - HF_TOKEN=${HF_TOKEN}
+      - DATABASE_URL=postgresql://${POSTGRES_USER:-pdf_rag_user}:${POSTGRES_PASSWORD:-pdf_rag_pass}@postgres:5432/${POSTGRES_DB:-pdf_rag}
+      - UPLOAD_DIR=/app/data/uploads
+      - CHROMA_PERSIST_DIR=/app/data/chroma_db
+      - GRAPH_PERSIST_DIR=/app/data/graphs
+      - CELERY_BROKER_URL=redis://redis:6379/0
+      - CELERY_RESULT_BACKEND=redis://redis:6379/1
+    depends_on:
+      postgres:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
+    restart: unless-stopped
   # ── pgAdmin (optional — for local DB inspection) ─────────
   pgadmin:
     image: dpage/pgadmin4:latest

docs/ARCHITECTURE.md CHANGED

@@ -52,7 +52,8 @@ sequenceDiagram
     participant UI as Frontend
     participant API as FastAPI documents route
     participant DB as SQL metadata
-    participant Worker as Background task
+    participant Redis as Redis broker
+    participant Worker as Celery worker
     participant Files as Upload storage
     participant Vector as ChromaDB
@@ -60,8 +61,9 @@ sequenceDiagram
     API->>API: Validate filename, extension, size, MIME, and parser readability
     API->>Files: Persist original file under the user's upload directory
     API->>DB: Create document row with processing status
-    API-->>UI: 202 Accepted with document metadata
-    API->>Worker: Queue ingestion task
+    API->>Redis: Queue Celery ingestion task
+    API-->>UI: 202 Accepted with document metadata and task_id
+    Redis->>Worker: Deliver ingestion task
     Worker->>Files: Read saved document
     Worker->>Worker: Extract pages, chunk text, build graph summary data
     Worker->>Vector: Store chunks with document and user metadata
@@ -70,9 +72,9 @@ sequenceDiagram
 The upload route is intentionally strict before it writes long-lived state:
 extension checks, size checks, MIME checks, and parser checks happen before the
-file is moved into permanent storage. The background task owns expensive work
-such as text extraction, chunking, embedding, graph building, and summary
-generation.
+file is moved into permanent storage. Celery uses Redis as the broker/result
+backend, and the worker owns expensive work such as text extraction, chunking,
+embedding, graph building, and summary generation.
 ## Chat And Retrieval Flow
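
The updated sequence (API records the document, enqueues a job, answers 202 with a `task_id`, and a separate worker does the heavy lifting) can be approximated with stdlib primitives. In this sketch a `queue.Queue` stands in for the Redis broker, a dict for the SQL metadata table, and a thread for the Celery worker. Every name is hypothetical and nothing here calls Celery or Redis.

```python
import queue
import threading
import uuid

broker = queue.Queue()  # stands in for the Redis broker
documents = {}          # stands in for the SQL metadata table


def upload(name: str) -> dict:
    """API side: record the document, enqueue the job, return at once."""
    doc_id = uuid.uuid4().hex
    task_id = uuid.uuid4().hex
    documents[doc_id] = {"name": name, "status": "pending"}
    broker.put({"task_id": task_id, "document_id": doc_id})
    return {"document_id": doc_id, "task_id": task_id, "status": "pending"}


def worker() -> None:
    """Worker side: pull jobs until a None sentinel arrives."""
    while True:
        job = broker.get()
        if job is None:
            break
        doc = documents[job["document_id"]]
        doc["status"] = "processing"
        # Extraction, chunking, embedding, and graph building would run here.
        doc["status"] = "ready"


def demo() -> tuple:
    """Run one upload through the queue and report both views of its status."""
    thread = threading.Thread(target=worker)
    thread.start()
    response = upload("report.pdf")
    broker.put(None)  # stop the worker once the queue is drained
    thread.join()
    return response["status"], documents[response["document_id"]]["status"]


if __name__ == "__main__":
    print(demo())
```

The response the client receives still says `pending` even after the worker has finished, which is why the route hands back an identifier the client can use to look up progress later.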