Commit 6a91298
Parent(s): f12045e
upload the project

- .dockerignore +34 -0
- .gitignore +17 -0
- README.md +105 -10
- app/Dockerfile +25 -0
- app/main.py +159 -0
- app/requirements.txt +199 -0
- app/services/database.py +63 -0
- app/services/document_ingester.py +79 -0
- app/services/generation.py +64 -0
- app/services/retriever.py +67 -0
- compose.yaml +29 -0
- requirements.txt +0 -0
- ui/Dockerfile +25 -0
- ui/Home.py +85 -0
- ui/pages/Chat.py +55 -0
- ui/pages/Documents.py +111 -0
- ui/requirements.txt +52 -0
.dockerignore ADDED
@@ -0,0 +1,34 @@
+# Include any files or directories that you don't want to be copied to your
+# container here (e.g., local build artifacts, temporary files, etc.).
+#
+# For more help, visit the .dockerignore file reference guide at
+# https://docs.docker.com/go/build-context-dockerignore/
+
+**/.DS_Store
+**/__pycache__
+**/.venv
+**/.classpath
+**/.dockerignore
+**/.env
+**/.git
+**/.gitignore
+**/.project
+**/.settings
+**/.toolstarget
+**/.vs
+**/.vscode
+**/*.*proj.user
+**/*.dbmdl
+**/*.jfm
+**/bin
+**/charts
+**/docker-compose*
+**/compose.y*ml
+**/Dockerfile*
+**/node_modules
+**/npm-debug.log
+**/obj
+**/secrets.dev.yaml
+**/values.dev.yaml
+LICENSE
+README.md
.gitignore ADDED
@@ -0,0 +1,17 @@
+# python venv
+.venv/
+
+# docs folder
+data/
+
+# test folder
+test/
+
+# pycache files
+**/__pycache__/
+
+# github files
+.github/
+
+# env file
+.env
README.md CHANGED
@@ -1,10 +1,105 @@
-
-
-
-
-
-
-
--
-
-
+# 🧠 Knowledge Management RAG System
+
+A powerful, local-first Retrieval-Augmented Generation (RAG) system designed to manage your personal knowledge base. Built with a modern client-server architecture, it allows you to upload documents, persist them in a vector database, and chat with your data using Google's Gemini models.
+
+
+
+
+
+
+## ✨ Key Features
+
+- **📄 Document Ingestion**: Seamlessly upload PDF, DOCX, and TXT files.
+- **🤖 Advanced Parsing**: Powered by [Docling](https://github.com/DS4SD/docling) for high-fidelity document parsing and chunking.
+- **🧠 Smart Retrieval**: Uses `sentence-transformers/all-MiniLM-L6-v2` embeddings stored in a local ChromaDB instance.
+- **💬 Context-Aware Chat**: Chat interface powered by Google Gemini 2.5 Flash Lite.
+- **💾 Storage**: Uses **ChromaDB** for vector storage and **SQLite** for state management.
+- **⚡ High Performance**: Optimized architecture with model caching (LRU) to prevent redundant reloading.
+- **🧹 Management**: View and delete uploaded documents directly from the UI.
+
+## 🛠️ Architecture
+
+The project follows a clean separation of concerns:
+
+```
+/
+├── 📁 app/                      # FastAPI Backend
+│   ├── main.py                  # API entry point & dependency injection
+│   └── 📁 services/             # Core business logic
+│       ├── document_ingester.py # Docling + ChromaDB ingestion
+│       ├── retriever.py         # Semantic search logic
+│       └── generation.py        # Gemini LLM interface
+├── 📁 ui/                       # Streamlit Frontend
+│   ├── Home.py                  # Landing page
+│   └── 📁 pages/                # Chat & document management modules
+├── 📁 data/                     # Persistent storage
+│   ├── 📁 chroma_db/            # Vector database
+│   ├── 📁 sqlite_db/            # State management (metadata)
+│   └── 📁 uploads/              # Raw files
+└── requirements.txt             # Dependencies
+```
+
+## 🚀 Getting Started
+
+### Prerequisites
+
+- Python 3.10 or higher
+- A Google AI Studio API key
+
+### Installation
+
+1. **Clone the repository**
+   ```bash
+   git clone https://github.com/yourusername/rag-knowledge-management.git
+   cd rag-knowledge-management
+   ```
+
+2. **Create a virtual environment**
+   ```bash
+   python -m venv .venv
+   # Windows
+   .venv\Scripts\activate
+   # Mac/Linux
+   source .venv/bin/activate
+   ```
+
+3. **Install dependencies**
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+4. **Configure the environment**
+   Create a `.env` file in the root directory:
+   ```env
+   GOOGLE_API_KEY=your_google_api_key_here
+   API_URL=http://localhost:8000/
+   DATA_DIR=data/
+   ```
+
+### Running the Application
+
+You will need two terminal windows:
+
+**Terminal 1: Backend (API)**
+```bash
+uvicorn app.main:app --reload --port 8000
+```
+
+**Terminal 2: Frontend (UI)**
+```bash
+streamlit run ui/Home.py
+```
+
+## 📚 Usage Guide
+
+1. **Upload**: Go to the **Documents** page and upload your PDFs or text files. The system parses and vectorizes them automatically.
+2. **Verify**: Check the file list to confirm your documents are indexed.
+3. **Chat**: Switch to the **Chat** page. Ask questions like "Summarize the document I just uploaded" or ask about specific details contained in your files.
+
+## 🔮 Roadmap
+
+- [ ] Multiple chat histories
+- [ ] Docker & Docker Compose support
+
+---
+*Built with ❤️ by logan*
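The upload/chat flow in the Usage Guide can also be driven directly against the FastAPI backend instead of the Streamlit UI. A minimal, hypothetical client sketch (stdlib only; assumes the API is running at `http://localhost:8000` and that `/chat` accepts the `question`/`history` fields defined in `app/main.py`):

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumption: matches API_URL in .env


def chat_payload(question: str, history: str = "") -> dict:
    # /chat expects both fields; history is a plain string in this design.
    return {"question": question, "history": history}


def ask(question: str, history: str = "") -> str:
    # POST the JSON payload and unwrap the {"response": ...} envelope.
    req = urllib.request.Request(
        f"{API_URL}/chat",
        data=json.dumps(chat_payload(question, history)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running backend):
#   print(ask("Summarize the document I just uploaded"))
```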
app/Dockerfile ADDED
@@ -0,0 +1,25 @@
+FROM python:3.13-slim
+
+# Environment settings
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+
+WORKDIR /app
+
+# System dependencies (safe default)
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Python dependencies
+# (COPY cannot reach above the build context, so requirements.txt must be inside it)
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application code
+COPY . .
+
+# Expose FastAPI port
+EXPOSE 8000
+
+# Run FastAPI with Uvicorn
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
app/main.py ADDED
@@ -0,0 +1,159 @@
+from fastapi import FastAPI, UploadFile, Request, HTTPException, BackgroundTasks, File, Depends
+from fastapi.responses import JSONResponse
+from contextlib import asynccontextmanager
+from functools import lru_cache
+from langchain_huggingface.embeddings import HuggingFaceEmbeddings
+import shutil
+from services.document_ingester import Ingester
+from services.retriever import Retriever
+from services.generation import Generation
+from services.database import Database
+from pydantic import BaseModel
+from werkzeug.utils import secure_filename
+from dotenv import load_dotenv
+import os
+import time
+import logging
+import json
+
+# --- Structured Logging Setup ---
+class JSONFormatter(logging.Formatter):
+    def format(self, record):
+        log_record = {
+            "timestamp": self.formatTime(record, self.datefmt),
+            "level": record.levelname,
+            "message": record.getMessage(),
+            "module": record.module,
+            "function": record.funcName,
+        }
+        if hasattr(record, "extra"):
+            log_record.update(record.extra)
+        if record.exc_info:
+            log_record["exception"] = self.formatException(record.exc_info)
+        return json.dumps(log_record)
+
+logger = logging.getLogger("app")
+logger.setLevel(logging.INFO)
+handler = logging.StreamHandler()
+handler.setFormatter(JSONFormatter())
+logger.addHandler(handler)
+
+# --- Lifecycle Management ---
+database = Database()
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    logger.info("Starting up the application...")
+    logger.info("Database connection established.")
+    load_dotenv()
+    ingest_uploaded_docs()
+    yield
+    database.disconnect()
+    logger.info("Database connection closed. Application shutdown complete.")
+    logger.info("Application has been stopped.")
+
+# --- FastAPI App ---
+app = FastAPI(lifespan=lifespan)
+
+# --- Middleware for Request Logging ---
+@app.middleware("http")
+async def log_requests(request: Request, call_next):
+    start_time = time.perf_counter()
+    response = await call_next(request)
+    process_time = (time.perf_counter() - start_time) * 1000
+    logger.info(
+        f"{request.method} {request.url.path}",
+        extra={"extra": {"method": request.method, "path": request.url.path, "status_code": response.status_code, "duration_ms": round(process_time, 2)}}
+    )
+    return response
+
+# --- Global Exception Handler ---
+@app.exception_handler(Exception)
+async def global_exception_handler(request: Request, exc: Exception):
+    logger.error(f"Unhandled exception: {exc}", exc_info=True)
+    return JSONResponse(
+        status_code=500,
+        content={"message": "An unexpected internal server error occurred."}
+    )
+
+@app.exception_handler(HTTPException)
+async def http_exception_handler(request: Request, exc: HTTPException):
+    return JSONResponse(status_code=exc.status_code, content={"message": exc.detail})
+
+# --- Dependency Injection ---
+embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+
+@lru_cache()
+def get_ingester():
+    return Ingester(embedding_model=embed_model)
+
+@lru_cache()
+def get_retriever():
+    return Retriever(embedding_model=embed_model)
+
+@lru_cache()
+def get_generator():
+    return Generation()
+
+# --- Background Tasks ---
+def ingest_documents(path: str):
+    ingester = get_ingester()
+    logger.info(f"Starting document ingestion for {path}", extra={"extra": {"document_path": path}})
+    ingester.ingest_documents(path)
+    logger.info(f"Document ingestion completed for {path}", extra={"extra": {"document_path": path}})
+    database.update_document_status(path, "ingested")
+    logger.info(f"Document status updated to 'ingested' for {path}", extra={"extra": {"document_path": path}})
+
+def ingest_uploaded_docs():
+    to_be_ingested = database.list_documents()
+    for doc in to_be_ingested:
+        if doc[1] == "uploaded":
+            ingest_documents(doc[3])
+            logger.info(f"Background ingestion completed for {doc[3]}", extra={"extra": {"document_path": doc[3]}})
+
+# --- API Endpoints ---
+@app.get("/")
+async def health_check():
+    return {"status": "ok"}
+
+@app.post("/document")
+async def upload_file(background_tasks: BackgroundTasks, file: UploadFile = File(...)):
+    upload_dir = os.path.join(os.getenv("DATA_DIR"), "uploads")
+    os.makedirs(upload_dir, exist_ok=True)
+
+    safe_filename = secure_filename(file.filename)
+    file_path = os.path.join(upload_dir, f"{os.path.splitext(safe_filename)[0]}_{int(time.time())}{os.path.splitext(safe_filename)[1]}")
+
+    with open(file_path, "wb") as buffer:
+        shutil.copyfileobj(file.file, buffer)
+    database.add_document(filename=safe_filename, path=file_path)
+    logger.info(f"Uploading file: {file.filename}", extra={"extra": {"original_filename": file.filename, "safe_path": file_path}})
+    background_tasks.add_task(ingest_documents, path=file_path)
+    return {"filename": file.filename, "message": "File uploaded successfully."}
+
+@app.get("/documents")
+def list_documents():
+    documents = database.list_documents()
+    logger.info("Fetched document list", extra={"extra": {"document_count": len(documents)}})
+    return {"documents": documents}
+
+class DeleteRequest(BaseModel):
+    source: str
+
+@app.delete("/document")
+def clear_document(payload: DeleteRequest, ingester: Ingester = Depends(get_ingester)):
+    logger.info(f"Deleting document: {payload.source}")
+    message = ingester.delete_document(payload.source)
+    logger.info(f"Vector deletion completed for: {payload.source} ({message})")
+    db_msg = database.delete_document(payload.source)
+    logger.info(f"Document deletion completed for: {payload.source}")
+    return {"message": message, "db_msg": db_msg}
+
+class ChatRequest(BaseModel):
+    question: str
+    history: str
+
+@app.post("/chat")
+async def chat_endpoint(request: ChatRequest, retriever: Retriever = Depends(get_retriever), generator: Generation = Depends(get_generator)):
+    logger.info("Chat request received", extra={"extra": {"question_length": len(request.question)}})
+    context = retriever.retrieve_context(request.question)
+    response = generator.generate_response(request.question, context, request.history)
+    return {"response": response}
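The `JSONFormatter` above emits one JSON object per log line, which is what makes the request-logging middleware's `extra={"extra": {...}}` convention machine-readable. A small self-contained sketch of the same pattern, independent of FastAPI, showing that an emitted line parses back into the expected fields (the `format_record` helper is illustrative, not part of the project):

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    # Mirrors the formatter in app/main.py: fixed structured fields plus optional extras.
    def format(self, record):
        log_record = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
        }
        if hasattr(record, "extra"):
            log_record.update(record.extra)
        return json.dumps(log_record)

def format_record(message: str, **extra) -> str:
    # Build a LogRecord by hand; logger.info(..., extra={"extra": {...}}) attaches
    # the nested dict as record.extra, which we emulate directly here.
    record = logging.LogRecord(
        name="app", level=logging.INFO, pathname="main.py", lineno=0,
        msg=message, args=(), exc_info=None,
    )
    if extra:
        record.extra = extra
    return JSONFormatter().format(record)

line = format_record("GET /documents", status_code=200, duration_ms=1.2)
parsed = json.loads(line)
```

Because each line is valid JSON, the output can be piped straight into log tooling without a custom parser.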
app/requirements.txt ADDED
@@ -0,0 +1,199 @@
+accelerate==1.12.0
+annotated-doc==0.0.4
+annotated-types==0.7.0
+antlr4-python3-runtime==4.9.3
+anyio==4.12.1
+asn1crypto==1.5.1
+attrs==25.4.0
+backoff==2.2.1
+bcrypt==5.0.0
+boto3==1.42.41
+botocore==1.42.41
+build==1.4.0
+cachetools==6.2.6
+certifi==2026.1.4
+cffi==2.0.0
+charset-normalizer==3.4.4
+chromadb==1.4.1
+click==8.3.1
+cloudpickle==3.1.1
+coloredlogs==15.0.1
+colorlog==6.10.1
+cryptography==46.0.4
+dill==0.4.1
+distro==1.9.0
+dnspython==2.8.0
+docling==2.71.0
+docling-core==2.62.0
+docling-ibm-models==3.11.0
+docling-parse==4.7.3
+docopt==0.6.2
+dotenv==0.9.9
+durationpy==0.10
+email-validator==2.3.0
+entrypoints==0.4
+fastapi==0.128.0
+fastapi-cli==0.0.20
+fastapi-cloud-cli==0.11.0
+fastar==0.8.0
+filelock==3.20.3
+filetype==1.2.0
+flatbuffers==25.12.19
+fsspec==2026.1.0
+gitdb==4.0.12
+GitPython==3.1.46
+google-ai-generativelanguage==0.6.15
+google-api-core==2.29.0
+google-api-python-client==2.188.0
+google-auth==2.48.0
+google-auth-httplib2==0.3.0
+google-genai==1.61.0
+google-generativeai==0.8.6
+googleapis-common-protos==1.72.0
+grpcio==1.76.0
+grpcio-status==1.71.2
+h11==0.16.0
+httpcore==1.0.9
+httplib2==0.31.2
+httptools==0.7.1
+httpx==0.28.1
+huggingface-hub==0.36.0
+humanfriendly==10.0
+idna==3.11
+importlib_metadata==8.7.1
+importlib_resources==6.5.2
+Jinja2==3.1.6
+jmespath==1.1.0
+joblib==1.5.3
+jsonlines==4.0.0
+jsonpatch==1.33
+jsonpointer==3.0.0
+jsonref==1.1.0
+jsonschema==4.26.0
+jsonschema-specifications==2025.9.1
+kubernetes==35.0.0
+langchain==1.2.7
+langchain-chroma==1.1.0
+langchain-core==1.2.7
+langchain-docling==2.0.0
+langchain-google-genai==4.2.0
+langchain-huggingface==1.2.0
+langgraph==1.0.7
+langgraph-checkpoint==4.0.0
+langgraph-prebuilt==1.0.7
+langgraph-sdk==0.3.3
+langsmith==0.6.7
+latex2mathml==3.78.1
+lxml==6.0.2
+MarkupSafe==3.0.3
+mmh3==5.2.0
+mpire==2.10.2
+mpmath==1.3.0
+multiprocess==0.70.19
+networkx==3.6.1
+numpy==2.4.2
+oauthlib==3.3.1
+omegaconf==2.3.0
+onnxruntime==1.23.2
+opentelemetry-api==1.39.1
+opentelemetry-exporter-otlp-proto-common==1.39.1
+opentelemetry-exporter-otlp-proto-grpc==1.39.1
+opentelemetry-proto==1.39.1
+opentelemetry-sdk==1.39.1
+opentelemetry-semantic-conventions==0.60b1
+orjson==3.11.6
+ormsgpack==1.12.2
+overrides==7.7.0
+packaging==25.0
+pandas==2.3.3
+pipreqs==0.4.13
+platformdirs==4.5.1
+pluggy==1.6.0
+polyfactory==3.2.0
+posthog==5.4.0
+prometheus_client==0.24.1
+proto-plus==1.27.1
+protobuf==5.29.5
+psutil==7.2.2
+pyarrow==23.0.0
+pyasn1==0.6.2
+pyasn1_modules==0.4.2
+pybase64==1.4.3
+pycparser==3.0
+pydantic==2.12.5
+pydantic-extra-types==2.11.0
+pydantic-settings==2.12.0
+pydantic_core==2.41.5
+PyJWT==2.11.0
+pyOpenSSL==25.3.0
+pypdfium2==5.3.0
+PyPika==0.50.0
+pyproject_hooks==1.2.0
+python-dateutil==2.9.0.post0
+python-docx==1.2.0
+python-dotenv==1.2.1
+python-multipart==0.0.22
+python-pptx==1.0.2
+pytz==2025.2
+PyYAML==6.0.3
+rapidocr==3.6.0
+referencing==0.37.0
+regex==2026.1.15
+requests==2.32.5
+requests-oauthlib==2.0.0
+requests-toolbelt==1.0.0
+rignore==0.7.6
+rpds-py==0.30.0
+rsa==4.9.1
+rtree==1.4.1
+s3transfer==0.16.0
+safetensors==0.7.0
+scikit-learn==1.8.0
+scipy==1.17.0
+semchunk==2.2.2
+sentence-transformers==5.2.2
+sentry-sdk==2.51.0
+setuptools==80.10.2
+shapely==2.1.2
+six==1.17.0
+smmap==5.0.2
+sniffio==1.3.1
+snowflake-connector-python==4.2.0
+snowflake-snowpark-python==1.45.0
+sortedcontainers==2.4.0
+soupsieve==2.8.3
+starlette==0.50.0
+sympy==1.14.0
+tabulate==0.9.0
+tenacity==9.1.2
+threadpoolctl==3.6.0
+tokenizers==0.22.2
+toml==0.10.2
+tomlkit==0.14.0
+torch==2.10.0
+torchvision==0.25.0
+tqdm==4.67.2
+transformers==4.57.6
+tree-sitter==0.25.2
+tree-sitter-c==0.24.1
+tree-sitter-javascript==0.25.0
+tree-sitter-python==0.25.0
+tree-sitter-typescript==0.23.2
+typer==0.21.1
+typing-inspection==0.4.2
+typing_extensions==4.15.0
+tzdata==2025.3
+tzlocal==5.3.1
+uritemplate==4.2.0
+urllib3==2.6.3
+uuid_utils==0.14.0
+uvicorn==0.40.0
+watchfiles==1.1.1
+websocket-client==1.9.0
+websockets==15.0.1
+Werkzeug==3.1.5
+wheel==0.46.3
+xxhash==3.6.0
+yarg==0.1.10
+zipp==3.23.0
+zstandard==0.25.0
app/services/database.py ADDED
@@ -0,0 +1,63 @@
+import sqlite3
+from dotenv import load_dotenv
+import os
+
+class Database:
+    def __init__(self):
+        load_dotenv()
+        self.db_path = os.path.join(os.getenv("DATA_DIR"), "sqlite_db/sqlite.db")
+        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
+        self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
+        self.create_tables()
+
+    def create_tables(self):
+        with self.conn:
+            self.conn.execute("""
+                CREATE TABLE IF NOT EXISTS documents (
+                    id INTEGER PRIMARY KEY AUTOINCREMENT,
+                    filename TEXT NOT NULL,
+                    path TEXT NOT NULL,
+                    status VARCHAR(20) NOT NULL
+                        CHECK(status IN ('uploaded', 'ingested')),
+                    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
+                )
+            """)
+
+    def add_document(self, filename, path, status="uploaded"):
+        with self.conn:
+            self.conn.execute("""
+                INSERT INTO documents (filename, path, status) VALUES (?, ?, ?)
+            """, (filename, path, status))
+
+    def update_document_status(self, path, status):
+        with self.conn:
+            self.conn.execute("""
+                UPDATE documents SET status = ? WHERE path = ?
+            """, (status, path))
+
+    def list_documents(self):
+        with self.conn:
+            cursor = self.conn.execute("""
+                SELECT filename, status, timestamp, path FROM documents ORDER BY timestamp DESC
+            """)
+            return cursor.fetchall()
+
+    def delete_document(self, path):
+        with self.conn:
+            self.conn.execute("""
+                DELETE FROM documents WHERE path = ?
+            """, (path,))
+
+    def disconnect(self):
+        self.conn.close()
+
+
+if __name__ == "__main__":
+    db = Database()
+    # db.create_tables()
+    # db.add_document("sample.pdf", "data/uploads/sample.pdf")
+    # print(db.list_documents())
+    # db.update_document_status("sample.pdf", "ingested")
+    # print(db.list_documents())
+    db.delete_document("sample.pdf")
+    print(db.list_documents())
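The `status` column with its `CHECK` constraint is what drives the startup re-ingestion in `app/main.py`: anything still marked `uploaded` gets picked up by `ingest_uploaded_docs()`. A minimal in-memory sketch of that lifecycle using the same schema (no `.env` or file system involved):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        filename TEXT NOT NULL,
        path TEXT NOT NULL,
        status VARCHAR(20) NOT NULL CHECK(status IN ('uploaded', 'ingested')),
        timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
    )
""")

# A fresh upload starts in the 'uploaded' state.
with conn:
    conn.execute(
        "INSERT INTO documents (filename, path, status) VALUES (?, ?, ?)",
        ("sample.pdf", "data/uploads/sample_1.pdf", "uploaded"),
    )

# Pending work = rows still marked 'uploaded' (what startup re-ingestion scans for).
pending = conn.execute(
    "SELECT path FROM documents WHERE status = 'uploaded'"
).fetchall()

# After ingestion completes, the row is flipped to 'ingested'.
with conn:
    conn.execute(
        "UPDATE documents SET status = ? WHERE path = ?",
        ("ingested", "data/uploads/sample_1.pdf"),
    )

rows = conn.execute("SELECT filename, status FROM documents").fetchall()
```

The `CHECK` constraint rejects any other status value at insert time, so the two-state machine cannot drift.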
app/services/document_ingester.py ADDED
@@ -0,0 +1,79 @@
+from langchain_chroma.vectorstores import Chroma
+from langchain_huggingface.embeddings import HuggingFaceEmbeddings
+from langchain_core.documents import Document
+from docling.chunking import HybridChunker
+from docling.document_converter import DocumentConverter
+from pathlib import Path
+from dotenv import load_dotenv
+import os
+import logging
+
+class Ingester:
+    def __init__(self, embedding_model: HuggingFaceEmbeddings = None):
+        self.embedding_model = embedding_model if embedding_model else HuggingFaceEmbeddings(
+            model_name="sentence-transformers/all-MiniLM-L6-v2"
+        )
+        load_dotenv()
+        self.DATA_DIR = os.getenv("DATA_DIR")
+        self.vector_store = Chroma(
+            collection_name="documents_collection",
+            embedding_function=self.embedding_model,
+            persist_directory=os.path.join(self.DATA_DIR, "chroma_db")
+        )
+
+        self.converter = DocumentConverter()
+        self.chunker = HybridChunker(max_tokens=400, overlap=50)
+        self.logger = logging.getLogger(__name__)
+
+    def ingest_documents(self, documents_path):
+        source_path = Path(documents_path)
+        converted = self.converter.convert(source=source_path).document
+        chunks = self.chunker.chunk(dl_doc=converted)
+        lc_docs = [Document(page_content=chunk.text, metadata={"source": str(source_path.resolve())}) for chunk in chunks]
+        self.logger.info(f"Ingesting {len(lc_docs)} chunks from document '{source_path}' into the vector store.")
+        self.vector_store.add_documents(documents=lc_docs)
+
+    def delete_document(self, source: str):
+        source_path = Path(source)
+        try:
+            os.remove(source_path)
+        except FileNotFoundError:
+            self.logger.warning(f"File {source_path} not found for deletion.")
+            pass  # If the file does not exist, we can ignore the error
+        # Attempt to delete by the given source value and also by the resolved absolute path.
+        abs_source = str(source_path.resolve())  # resolved up front so the warning below never hits an unbound name
+        deleted_any = False
+        try:
+            self.vector_store.delete(where={"source": source})
+            deleted_any = True
+        except Exception as e:
+            self.logger.debug(f"Vector delete by provided source failed: {e}")
+
+        try:
+            # If abs_source equals the original, this will just repeat; that's fine.
+            self.vector_store.delete(where={"source": abs_source})
+            deleted_any = True
+        except Exception as e:
+            self.logger.debug(f"Vector delete by absolute source failed: {e}")
+
+        if not deleted_any:
+            self.logger.warning(f"No vector entries deleted for source '{source}' or '{abs_source}'.")
+
+        return f"Documents from source '{source}' have been cleared from the vector store. deleted={deleted_any}"
+
+    def clear_document(self):
+        return self.vector_store.reset_collection()
+
+    def list_chunks(self):
+        return self.vector_store.get(limit=100000, offset=0)
+
+
+if __name__ == "__main__":
+    ingester = Ingester()
+    # ingester.ingest_documents("E:/Coding/AIMl/Rag/data/test_doc/sample.pdf")
+    # print("Document ingestion completed.")
+    ingester.delete_document("sample_1770465383.pdf")
+    print("Deleted document and its chunks from the vector store.")
+    ingester.clear_document()
+    print("Cleared documents from the vector store.")
+    print(ingester.list_chunks())
app/services/generation.py
ADDED
@@ -0,0 +1,64 @@
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
import os

class Generation:
    def __init__(self):
        load_dotenv()
        self.GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY")
        self.llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite", temperature=0.7)

    def generate_response(self, prompt: str, content: str, history: str) -> str:
        template = """Answer the following question based on this context:
{context}
Question: {question}
History: {history}
"""
        prompt_template = ChatPromptTemplate.from_template(template)
        chain = (prompt_template
                 | self.llm
                 | StrOutputParser()
                 )
        response = chain.invoke({"context": content, "question": prompt, "history": history})
        return response


if __name__ == "__main__":
    generator = Generation()
    sample_context = """Document 1:
Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.
--------------------------------------------------Document 2:
Docling provides optional support for OCR, for example to cover scanned PDFs or content in bitmaps images embedded on a page. In our initial release, we rely on EasyOCR [1], a popular thirdparty OCR library with support for many languages. Docling, by default, feeds a high-resolution page image (216 dpi) to the OCR engine, to allow capturing small print detail in decent quality. While EasyOCR delivers reasonable transcription quality, we observe that it runs fairly slow on CPU (upwards of 30 seconds per page).
We are actively seeking collaboration from the open-source community to extend Docling with additional OCR backends and speed improvements.
--------------------------------------------------Document 3:
Converting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.
With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.
Here is what Docling delivers today:
- Converts PDF documents to JSON or Markdown format, stable and lightning fast
- Understands detailed page layout, reading order, locates figures and recovers table structures
- Extracts metadata from the document, such as title, authors, references and language
- Optionally applies OCR, e.g. for scanned PDFs
- Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)
- Can leverage different accelerators (GPU, MPS, etc).
--------------------------------------------------Document 4:
- [1] J. AI. Easyocr: Ready-to-use ocr with 80+ supported languages. https://github.com/ JaidedAI/EasyOCR , 2024. Version: 1.7.0.
- [2] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, and S. Chintala. Pytorch 2: Faster
--------------------------------------------------Document 5:
Docling provides optional support for OCR, for example to cover scanned PDFs or content in bitmaps images embedded on a page. In our initial release, we rely on EasyOCR [1], a popular thirdparty OCR library with support for many languages. Docling, by default, feeds a high-resolution page image (216 dpi) to the OCR engine, to allow capturing small print detail in decent quality. While EasyOCR delivers reasonable transcription quality, we observe that it runs fairly slow on CPU (upwards of 30 seconds per page).
We are actively seeking collaboration from the open-source community to extend Docling with additional OCR backends and speed improvements.
--------------------------------------------------Document 6:
Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.
--------------------------------------------------Document 7:
Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.
--------------------------------------------------Document 8:
Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.
We encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.
--------------------------------------------------Document 9:
In the final pipeline stage, Docling assembles all prediction results produced on each page into a well-defined datatype that encapsulates a converted document, as defined in the auxiliary package docling-core . The generated document object is passed through a post-processing model which leverages several algorithms to augment features, such as detection of the document language, correcting the reading order, matching figures with captions and labelling metadata such as title, authors and references. The final output can then be serialized to JSON or transformed into a Markdown representation at the users request.
--------------------------------------------------"""
    sample_question = "How does the OCR work in Docling?"
    # generate_response requires (prompt, content, history); pass an empty history for this demo
    response = generator.generate_response(sample_question, sample_context, "")
    print("Generated Response:")
    print(response)
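Before `generate_response` hands the prompt to Gemini, `ChatPromptTemplate` substitutes the three variables into the template string. A quick stand-in for that substitution using plain `str.format` (no LangChain required; the context and question values are illustrative):

```python
# The template mirrors the one in Generation.generate_response.
template = """Answer the following question based on this context:
{context}
Question: {question}
History: {history}
"""

# Fill the template the way chain.invoke({...}) does before calling the LLM.
filled = template.format(
    context="Docling feeds a 216 dpi page image to EasyOCR.",
    question="How does the OCR work in Docling?",
    history="",
)
print(filled)
```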
app/services/retriever.py
ADDED
@@ -0,0 +1,67 @@
from langchain_chroma.vectorstores import Chroma
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
import os

class Retriever:
    def __init__(self, embedding_model: HuggingFaceEmbeddings = None):
        self.embed = embedding_model if embedding_model else HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        load_dotenv()
        self.DATA_DIR = os.getenv("DATA_DIR")
        self.vector_store = Chroma(
            collection_name="documents_collection",
            embedding_function=self.embed,
            persist_directory=os.path.join(self.DATA_DIR, "chroma_db")
        )

        self.GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY")
        if self.GEMINI_API_KEY is None:
            raise ValueError("GOOGLE_API_KEY not found in environment variables.")

    def _retrieve_chunks(self, query: str):
        retrieved_chunks = self.vector_store.similarity_search(query, k=3)
        return retrieved_chunks

    def _query_transformer(self, query: str):
        template = """You are an AI language model assistant. Your task is to generate three
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines. Original question: {question}"""
        prompt = ChatPromptTemplate.from_template(template)
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite", temperature=0.7)
        chain = (prompt
                 | llm
                 | StrOutputParser()
                 | (lambda x: x.strip().split("\n"))  # Split the output into a list of questions
                 )
        response = chain.invoke({"question": query})
        return response

    def retrieve_context(self, query: str):
        transformed_queries = self._query_transformer(query)
        all_retrieved_chunks = []
        for tq in transformed_queries:
            chunks = self._retrieve_chunks(tq)
            for chunk in chunks:
                if chunk not in all_retrieved_chunks:
                    all_retrieved_chunks.append(chunk)

        context = ""
        for idx, doc in enumerate(all_retrieved_chunks):
            context += f"Context {idx + 1}:\n{doc.page_content}\n{'-' * 50}\n"
        return context


if __name__ == "__main__":
    retriever_instance = Retriever()
    # results = retriever_instance._retrieve_chunks("Sample query")
    # print(results)
    # transformed_response = retriever_instance._query_transformer("tell me about the history of AI and its applications in healthcare and finance")
    # print(transformed_response)
    context = retriever_instance.retrieve_context("how does the ocr work in docling?")
    print(context)
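`retrieve_context` fans one question out into several query variants, retrieves chunks for each, deduplicates while preserving first-seen order, and concatenates the survivors into a numbered context string. A pure-Python sketch of that merge-and-format logic with plain strings standing in for LangChain `Document` objects (the chunk values are made up for illustration):

```python
def merge_chunks(chunk_lists):
    """Merge chunks from several query variants, keeping first-seen order."""
    merged = []
    for chunks in chunk_lists:
        for chunk in chunks:
            if chunk not in merged:
                merged.append(chunk)
    return merged

def build_context(chunks):
    """Format merged chunks the way retrieve_context does."""
    return "".join(
        f"Context {idx + 1}:\n{chunk}\n{'-' * 50}\n"
        for idx, chunk in enumerate(chunks)
    )

# "chunk B" appears under both query variants but is kept only once.
merged = merge_chunks([["chunk A", "chunk B"], ["chunk B", "chunk C"]])
print(build_context(merged))
```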
compose.yaml
ADDED
@@ -0,0 +1,29 @@
# Comments are provided throughout this file to help you get started.
# If you need more help, visit the Docker Compose reference guide at
# https://docs.docker.com/go/compose-spec-reference/

# Here the instructions define your application as a service called "server".
# This service is built from the Dockerfile in the current directory.
# You can add other services your application may depend on here, such as a
# database or a cache. For examples, see the Awesome Compose repository:
# https://github.com/docker/awesome-compose
services:
  server:
    build:
      context: ./app
    ports:
      - 8000:8000
    env_file:
      - .env

  frontend:
    build:
      context: ./ui
    ports:
      - 3000:3000
    depends_on:
      - server
    env_file:
      - .env
    environment:
      - API_URL=http://server:8000/
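Inside the compose network the frontend reaches the backend via the service name `server` (Docker's built-in DNS), so `API_URL` is set to `http://server:8000/`; a local run outside Docker falls back to localhost. A sketch of how the UI pages resolve that, mirroring the pattern in `ui/pages/Chat.py` and `ui/pages/Documents.py`:

```python
import os

# In the frontend container API_URL comes from the compose "environment" block;
# outside Docker the fallback targets the locally running FastAPI server.
api_url = os.getenv("API_URL", "http://localhost:8000/")

# Endpoints are built by simple concatenation, so API_URL must end with "/".
chat_endpoint = api_url + "chat"
print(chat_endpoint)
```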
requirements.txt
ADDED
Binary file (10.4 kB)
ui/Dockerfile
ADDED
@@ -0,0 +1,25 @@
FROM python:3.13-slim

# Prevent Python from writing pyc files & buffering stdout
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Install system dependencies (optional but safe)
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose the Streamlit port (must match --server.port below and the compose port mapping)
EXPOSE 3000

# Run Streamlit
CMD ["streamlit", "run", "Home.py", "--server.port=3000", "--server.address=0.0.0.0"]
ui/Home.py
ADDED
@@ -0,0 +1,85 @@
import streamlit as st

st.set_page_config(
    page_title="Knowledge Management RAG",
    page_icon="🧠",
    layout="wide"
)

# Hero Section
st.title("🧠 Knowledge Management System")
st.markdown("""
### Your Personal AI-Powered Knowledge Base

Welcome to a local-first Retrieval-Augmented Generation (RAG) system designed to help you
**organize**, **search**, and **chat** with your documents. Powered by advanced AI models,
this tool transforms your static files into an interactive knowledge engine.
""")

st.divider()

# Key Features Section
st.header("✨ Key Features")

col1, col2, col3 = st.columns(3)

with col1:
    st.subheader("📄 Smart Ingestion")
    st.markdown("""
    - **Advanced Parsing**: Uses [Docling](https://github.com/DS4SD/docling) for high-fidelity PDF & document processing.
    - **Async Processing**: Upload large files without blocking the UI.
    - **State Management**: Track document status from upload to full indexing.
    """)

with col2:
    st.subheader("🤖 Intelligent Retrieval")
    st.markdown("""
    - **Semantic Search**: Powered by `all-MiniLM-L6-v2` embeddings.
    - **Vector Database**: Fast and scalable storage using ChromaDB (local) or Qdrant.
    - **Query Expansion**: Generates multiple perspectives on each question for better recall.
    """)

with col3:
    st.subheader("💬 Context-Aware Chat")
    st.markdown("""
    - **Gemini Powered**: Uses Google's Gemini 2.5 Flash Lite for accurate reasoning.
    - **History Aware**: Remembers conversation context for fluid interaction.
    - **Source Citations**: Know exactly where the answer came from (Coming Soon).
    """)

st.divider()

# How It Works / Getting Started
st.header("🚀 Getting Started")

step1, step2, step3 = st.columns(3)

with step1:
    st.markdown("#### 1. Upload Documents")
    st.info("Go to the **Documents** page and upload your PDFs, DOCX, or TXT files.")

with step2:
    st.markdown("#### 2. Process & Index")
    st.warning("The system automatically processes files in the background. Watch the status change to 'Ingested'.")

with step3:
    st.markdown("#### 3. Chat with Data")
    st.success("Switch to the **Chat** page and ask questions about your knowledge base.")

st.divider()

# Tech Stack Footer
with st.expander("🛠️ Under the Hood"):
    st.markdown("""
    This project is built with a modern, robust tech stack:
    - **Backend**: FastAPI (Python)
    - **Frontend**: Streamlit
    - **LLM**: Google Gemini 2.5 Flash Lite
    - **Embeddings**: HuggingFace (`sentence-transformers`)
    - **Vector Store**: ChromaDB (local)
    - **State Management**: SQLite
    - **Parsing**: Docling
    """)

st.markdown("---")
st.caption("Built with ❤️ by Logan | version 2.0.0")
ui/pages/Chat.py
ADDED
@@ -0,0 +1,55 @@
import streamlit as st
import requests
import os
import json
from dotenv import load_dotenv
load_dotenv()

API_URL = os.getenv("API_URL", "http://localhost:8000/")


st.set_page_config(page_title="Chat App", page_icon="💬")
st.title("💬 Chat Interface")

if "messages" not in st.session_state:
    st.session_state.messages = []


for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])


if prompt := st.chat_input("Type your message..."):
    st.session_state.messages.append(
        {"role": "user", "content": prompt}
    )

    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            res = None
            try:
                res = requests.post(
                    API_URL + "chat",
                    json={"question": prompt, "history": json.dumps(st.session_state.messages)}
                )
            except requests.exceptions.RequestException:
                st.error("⚠️ Could not connect to the backend. Please try again later.")
                st.stop()

        if res is None:
            st.stop()

        if res.status_code == 200:
            reply = res.json()["response"]
        else:
            reply = "Sorry, something went wrong. Please try again later."

        st.session_state.messages.append(
            {"role": "assistant", "content": reply}
        )
        st.markdown(reply)
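The chat page ships the whole running transcript to the backend as a JSON string in the `"history"` field, which the server can turn back into a list of role/content dicts. A quick sketch of that round-trip (the message contents are illustrative):

```python
import json

# The transcript as Chat.py keeps it in st.session_state.messages.
messages = [
    {"role": "user", "content": "How does the OCR work in Docling?"},
    {"role": "assistant", "content": "Docling feeds 216 dpi page images to EasyOCR."},
]

# Client side: serialize the history into the request payload.
payload = {"question": "Is it fast on CPU?", "history": json.dumps(messages)}

# Server side: recover the structured history with json.loads.
restored = json.loads(payload["history"])
print(restored[0]["role"])
```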
ui/pages/Documents.py
ADDED
@@ -0,0 +1,111 @@
import streamlit as st
import requests
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

API_URL = os.getenv("API_URL", "http://localhost:8000/")

st.set_page_config(page_title="Documents", page_icon="📄")
st.title("📄 Document Management")

def fetch_documents():
    try:
        res = requests.get(API_URL + "documents")
        if res.status_code == 200:
            return res.json().get("documents", [])
    except Exception:
        st.error("Failed to fetch documents from the server.")
    return []

def upload_document(file):
    files = {"file": (file.name, file.getvalue(), file.type)}
    try:
        res = requests.post(API_URL + "document", files=files)
    except requests.exceptions.RequestException:
        st.error("⚠️ Could not connect to the backend. Please try again later.")
        return None
    return res

def delete_document(name):
    try:
        res = requests.delete(API_URL + "document", json={"source": name})
    except requests.exceptions.RequestException:
        st.error("⚠️ Could not connect to the backend. Please try again later.")
        return None
    return res

st.subheader("📤 Upload Document")

uploaded_file = st.file_uploader(
    "Choose a file",
    type=["pdf", "docx", "txt"]
)

if uploaded_file:
    if st.button("Upload and Ingest"):
        with st.spinner("Uploading and ingesting document..."):
            res = upload_document(uploaded_file)

        if res is None:
            pass  # Error already shown by upload_document
        elif res.status_code == 200:
            st.success("✅ Document uploaded successfully")
        else:
            st.error("❌ Failed to upload document")

st.divider()

st.subheader("📄 Available Documents")

search_query = st.text_input(
    "",
    placeholder="🔍 Search documents"
)

with st.spinner("Fetching documents..."):
    documents = fetch_documents()

if search_query:
    documents = [
        doc for doc in documents
        if search_query.lower() in doc[0].lower()
    ]
if not documents:
    st.info("No documents available.")
else:
    # Table header
    header_cols = st.columns([3, 2, 2, 1])
    header_cols[0].markdown("**Filename**")
    header_cols[1].markdown("**Status**")
    header_cols[2].markdown("**Uploaded At**")
    header_cols[3].markdown("**Actions**")

    st.divider()

    for idx, doc in enumerate(documents):
        filename, status, timestamp, path = doc

        cols = st.columns([3, 2, 2, 1])

        cols[0].write(filename)
        cols[1].write(status)
        cols[2].write(timestamp)

        if cols[3].button(
            "Delete",
            key=f"delete_{idx}",
            type="secondary"
        ):
            with st.spinner("Deleting document..."):
                res = delete_document(path)

            if res is None:
                pass
            elif res.status_code == 200:
                st.success(f"✅ Deleted `{filename}`")
                st.rerun()
            else:
                st.error("❌ Failed to delete document")
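The Documents page filters rows client-side with a case-insensitive substring match on the first tuple field (the filename). The same filter in isolation, with sample rows standing in for what the `/documents` endpoint returns:

```python
# (filename, status, uploaded_at, path) tuples as rendered in the table;
# the rows here are illustrative, not real API output.
documents = [
    ("sample.pdf", "Ingested", "2025-01-01", "/data/sample.pdf"),
    ("notes.txt", "Uploaded", "2025-01-02", "/data/notes.txt"),
]

# Lowercasing both sides makes the match case-insensitive.
search_query = "SAMPLE"
filtered = [doc for doc in documents if search_query.lower() in doc[0].lower()]
print([doc[0] for doc in filtered])
```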
ui/requirements.txt
ADDED
@@ -0,0 +1,52 @@
altair==6.0.0
altex==0.2.0
beautifulsoup4==4.14.3
blinker==1.9.0
colorama==0.4.6
contourpy==1.3.3
cycler==0.12.1
et_xmlfile==2.0.0
Faker==40.1.2
favicon==0.7.0
fonttools==4.61.1
htbuilder==0.9.0
kiwisolver==1.4.9
Markdown==3.10.1
markdown-it-py==4.0.0
markdownlit==0.0.7
marko==2.2.2
matplotlib==3.10.8
mdurl==0.1.2
narwhals==2.15.0
opencv-python==4.13.0.90
openpyxl==3.1.5
pillow==11.3.0
plotly==6.5.2
pyclipper==1.4.0
pydeck==0.9.1
Pygments==2.19.2
pylatexenc==2.10
pymdown-extensions==10.20.1
pyparsing==3.3.2
pyreadline3==3.5.4
rich==14.3.2
rich-toolkit==0.18.1
st-annotated-text==4.0.2
st-theme==1.2.3
streamlit==1.53.1
streamlit-avatar==0.1.3
streamlit-camera-input-live==0.2.0
streamlit-card==1.0.2
streamlit-embedcode==0.1.2
streamlit-extras==0.7.8
streamlit-image-coordinates==0.4.0
streamlit-keyup==0.3.0
streamlit-notify==0.3.1
streamlit-shadcn-ui==0.1.19
streamlit-toggle-switch==1.0.2
streamlit-vertical-slider==2.5.5
streamlit_faker==0.0.4
tornado==6.5.4
validators==0.35.0
watchdog==6.0.0
xlsxwriter==3.2.9