logan-codes committed on
Commit 6a91298 · 1 Parent(s): f12045e

upload the project

.dockerignore ADDED
@@ -0,0 +1,34 @@
1
+ # Include any files or directories that you don't want to be copied to your
2
+ # container here (e.g., local build artifacts, temporary files, etc.).
3
+ #
4
+ # For more help, visit the .dockerignore file reference guide at
5
+ # https://docs.docker.com/go/build-context-dockerignore/
6
+
7
+ **/.DS_Store
8
+ **/__pycache__
9
+ **/.venv
10
+ **/.classpath
11
+ **/.dockerignore
12
+ **/.env
13
+ **/.git
14
+ **/.gitignore
15
+ **/.project
16
+ **/.settings
17
+ **/.toolstarget
18
+ **/.vs
19
+ **/.vscode
20
+ **/*.*proj.user
21
+ **/*.dbmdl
22
+ **/*.jfm
23
+ **/bin
24
+ **/charts
25
+ **/docker-compose*
26
+ **/compose.y*ml
27
+ **/Dockerfile*
28
+ **/node_modules
29
+ **/npm-debug.log
30
+ **/obj
31
+ **/secrets.dev.yaml
32
+ **/values.dev.yaml
33
+ LICENSE
34
+ README.md
.gitignore ADDED
@@ -0,0 +1,17 @@
1
+ # python venv
2
+ .venv/
3
+
4
+ # docs folder
5
+ data/
6
+
7
+ # test folder
8
+ test/
9
+
10
+ # pycache files
11
+ **/__pycache__/
12
+
13
+ # github files
14
+ .github/
15
+
16
+ # env file
17
+ .env
README.md CHANGED
@@ -1,10 +1,105 @@
1
- ---
2
- title: Knowledge Mangaement System Using RAG
3
- emoji: 🌍
4
- colorFrom: purple
5
- colorTo: blue
6
- sdk: docker
7
- pinned: false
8
- ---
9
-
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # 🧠 Knowledge Management RAG System
2
+
3
+ A powerful, local-first Retrieval-Augmented Generation (RAG) system designed to manage your personal knowledge base. Built with a modern client-server architecture, it allows you to upload documents, persist them in a vector database, and chat with your data using Google's Gemini models.
4
+
5
+ ![Python](https://img.shields.io/badge/Python-3.10%2B-blue)
6
+ ![FastAPI](https://img.shields.io/badge/FastAPI-0.128-009688)
7
+ ![Streamlit](https://img.shields.io/badge/Streamlit-1.53-FF4B4B)
8
+ ![LangChain](https://img.shields.io/badge/LangChain-1.2-green)
9
+
10
+ ## Key Features
11
+
12
+ - **📄 Document Ingestion**: Seamlessly upload PDF, DOCX, and TXT files.
13
+ - **🤖 Advanced Parsing**: Powered by [Docling](https://github.com/DS4SD/docling) for high-fidelity document parsing and chunking.
14
+ - **🧠 Smart Retrieval**: Uses `sentence-transformers/all-MiniLM-L6-v2` embeddings stored in a local ChromaDB instance.
15
+ - **💬 Context-Aware Chat**: Chat interface powered by Google Gemini 2.5 Flash Lite.
16
+ - **💾 Storage**: Uses ChromaDB for vector storage and **SQLite** for state management.
17
+ - **⚡ High Performance**: Optimized architecture with model caching (LRU) to prevent redundant reloading (see the sketch below).
18
+ - **🧹 Management**: View and delete uploaded documents directly from the UI.
19
+
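The LRU model caching mentioned above follows FastAPI's cached-dependency pattern (the same approach appears in `app/main.py` later in this commit). A minimal, self-contained sketch of the idea, using hypothetical stand-in names rather than the project's real services:

```python
from functools import lru_cache

from fastapi import Depends, FastAPI

app = FastAPI()


class HeavyService:
    """Hypothetical stand-in for an expensive-to-build service (embeddings, vector store, ...)."""

    def __init__(self) -> None:
        self.ready = True  # imagine loading models here


@lru_cache()
def get_service() -> HeavyService:
    # Built on first use only; every later Depends(get_service) reuses the same instance.
    return HeavyService()


@app.get("/ping")
def ping(service: HeavyService = Depends(get_service)):
    return {"status": "ok", "service_ready": service.ready}
```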
20
+ ## 🛠️ Architecture
21
+
22
+ The project follows a clean segregation of duties:
23
+
24
+ ```
25
+ /
26
+ ├── 📁 app/ # FastAPI Backend
27
+ │ ├── main.py # API Entry point & Dependency Injection
28
+ │ └── 📁 services/ # Core Business Logic
29
+ │ ├── document_ingester.py # Docling + ChromaDB ingestion
30
+ │ ├── retriever.py # Semantic Search Logic
31
+ │ └── generation.py # Gemini LLM Interface
32
+ ├── 📁 ui/ # Streamlit Frontend
33
+ │ ├── Home.py # Landing Page
34
+ │ └── 📁 pages/ # Chat & Document Management Modules
35
+ ├── 📁 data/ # Persistent Storage
36
+ │ ├── 📁 chroma_db/ # Vector Database
37
+ │ ├── 📁 sqlite_db/ # State Management (Metadata)
38
+ │ └── 📁 uploads/ # Raw Files
39
+ └── requirements.txt # Dependencies
40
+ ```
41
+
42
+ ## 🚀 Getting Started
43
+
44
+ ### Prerequisites
45
+
46
+ - Python 3.10 or higher
47
+ - A Google AI Studio API Key
48
+
49
+ ### Installation
50
+
51
+ 1. **Clone the repository**
52
+ ```bash
53
+ git clone https://github.com/yourusername/rag-knowledge-management.git
54
+ cd rag-knowledge-management
55
+ ```
56
+
57
+ 2. **Create a virtual environment**
58
+ ```bash
59
+ python -m venv .venv
60
+ # Windows
61
+ .venv\Scripts\activate
62
+ # Mac/Linux
63
+ source .venv/bin/activate
64
+ ```
65
+
66
+ 3. **Install dependencies**
67
+ ```bash
68
+ pip install -r requirements.txt
69
+ ```
70
+
71
+ 4. **Configure Environment**
72
+ Create a `.env` file in the root directory:
73
+ ```env
74
+ GOOGLE_API_KEY=your_google_api_key_here
75
+ API_URL=http://localhost:8000/
76
+ DATA_DIR=data/
77
+ ```
78
+
79
+ ### Running the Application
80
+
81
+ You will need two terminal windows:
82
+
83
+ **Terminal 1: Backend (API)**
84
+ ```bash
85
+ uvicorn app.main:app --reload --port 8000
86
+ ```
87
+
88
+ **Terminal 2: Frontend (UI)**
89
+ ```bash
90
+ streamlit run ui/Home.py
91
+ ```
92
+
93
+ ## 📚 Usage Guide
94
+
95
+ 1. **Upload Documents**: Go to the **Documents** page and upload your PDF, DOCX, or TXT files. The system will parse and vectorise them automatically.
96
+ 2. **Verify**: Check the file list to ensure your documents are indexed.
97
+ 3. **Chat**: Switch to the **Chat** page and ask questions like "Summarize the document I just uploaded" or query specific details contained in your files. (The same flow over the raw HTTP API is sketched after this list.)
98
+
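Besides the Streamlit UI, the same flow can be driven against the raw HTTP API. A minimal sketch with `requests`, using the endpoint paths from `app/main.py` below and the base URL from the sample `.env` above (the PDF name is a placeholder):

```python
import requests

API_URL = "http://localhost:8000/"

# 1. Upload a document; ingestion then runs as a FastAPI background task.
with open("sample.pdf", "rb") as f:  # placeholder file name
    resp = requests.post(
        API_URL + "document",
        files={"file": ("sample.pdf", f, "application/pdf")},
    )
print(resp.json())  # e.g. {"filename": "sample.pdf", "message": "File uploaded successfully."}

# 2. Check the document list until the status flips from "uploaded" to "ingested".
print(requests.get(API_URL + "documents").json())

# 3. Ask a question once ingestion has finished.
resp = requests.post(
    API_URL + "chat",
    json={"question": "Summarize the document I just uploaded", "history": ""},
)
print(resp.json()["response"])
```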
99
+ ## 🔮 Roadmap
100
+
101
+ - [ ] Support for multiple chat histories
102
+ - [ ] Docker & Docker Compose support
103
+
104
+ ---
105
+ *Built with ❤️ by logan*
app/Dockerfile ADDED
@@ -0,0 +1,25 @@
1
+ FROM python:3.13-slim
2
+
3
+ # Environment settings
4
+ ENV PYTHONDONTWRITEBYTECODE=1
5
+ ENV PYTHONUNBUFFERED=1
6
+
7
+ WORKDIR /app
8
+
9
+ # System dependencies (safe default)
10
+ RUN apt-get update && apt-get install -y \
11
+ build-essential \
12
+ && rm -rf /var/lib/apt/lists/*
13
+
14
+ # Install Python dependencies
15
+ COPY requirements.txt .
16
+ RUN pip install --no-cache-dir -r requirements.txt
17
+
18
+ # Copy application code
19
+ COPY . .
20
+
21
+ # Expose FastAPI port
22
+ EXPOSE 8000
23
+
24
+ # Run FastAPI with Uvicorn
25
+ CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
app/main.py ADDED
@@ -0,0 +1,159 @@
1
+ from fastapi import FastAPI, UploadFile, Request, HTTPException, BackgroundTasks, File, Depends
2
+ from fastapi.responses import JSONResponse
3
+ from contextlib import asynccontextmanager
4
+ from functools import lru_cache
5
+ from langchain_huggingface.embeddings import HuggingFaceEmbeddings
6
+ import shutil
7
+ from services.document_ingester import Ingester
8
+ from services.retriever import Retriever
9
+ from services.generation import Generation
10
+ from services.database import Database
11
+ from pydantic import BaseModel
12
+ from werkzeug.utils import secure_filename
13
+ from dotenv import load_dotenv
14
+ import os
15
+ import time
16
+ import logging
17
+ import json
18
+
19
+ # --- Structured Logging Setup ---
20
+ class JSONFormatter(logging.Formatter):
21
+ def format(self, record):
22
+ log_record = {
23
+ "timestamp": self.formatTime(record, self.datefmt),
24
+ "level": record.levelname,
25
+ "message": record.getMessage(),
26
+ "module": record.module,
27
+ "function": record.funcName,
28
+ }
29
+ if hasattr(record, "extra"):
30
+ log_record.update(record.extra)
31
+ if record.exc_info:
32
+ log_record["exception"] = self.formatException(record.exc_info)
33
+ return json.dumps(log_record)
34
+
35
+ logger = logging.getLogger("app")
36
+ logger.setLevel(logging.INFO)
37
+ handler = logging.StreamHandler()
38
+ handler.setFormatter(JSONFormatter())
39
+ logger.addHandler(handler)
40
+ #--- Lifecycle Management ---
41
+ database = Database()
42
+ @asynccontextmanager
43
+ async def lifespan(app: FastAPI):
44
+ logger.info("Starting up the application...")
45
+ logger.info("Database connection established.")
46
+ load_dotenv()
47
+ ingest_uploaded_docs()
48
+ yield
49
+ database.disconnect()
50
+ logger.info("Database connection closed. Application shutdown complete.")
51
+ logger.info("Application has been stopped.")
52
+
53
+ # --- FastAPI App ---
54
+ app = FastAPI(lifespan=lifespan)
55
+
56
+ # --- Middleware for Request Logging ---
57
+ @app.middleware("http")
58
+ async def log_requests(request: Request, call_next):
59
+ start_time = time.perf_counter()
60
+ response = await call_next(request)
61
+ process_time = (time.perf_counter() - start_time) * 1000
62
+ logger.info(
63
+ f"{request.method} {request.url.path}",
64
+ extra={"extra": {"method": request.method, "path": request.url.path, "status_code": response.status_code, "duration_ms": round(process_time, 2)}}
65
+ )
66
+ return response
67
+
68
+ # --- Global Exception Handler ---
69
+ @app.exception_handler(Exception)
70
+ async def global_exception_handler(request: Request, exc: Exception):
71
+ logger.error(f"Unhandled exception: {exc}", exc_info=True)
72
+ return JSONResponse(
73
+ status_code=500,
74
+ content={"message": "An unexpected internal server error occurred."}
75
+ )
76
+
77
+ @app.exception_handler(HTTPException)
78
+ async def http_exception_handler(request: Request, exc: HTTPException):
79
+ return JSONResponse(status_code=exc.status_code, content={"message": exc.detail})
80
+
81
+ # --- Dependency Injection ---
82
+ embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
83
+
84
+ @lru_cache()
85
+ def get_ingester():
86
+ return Ingester(embedding_model=embed_model)
87
+
88
+ @lru_cache()
89
+ def get_retriever():
90
+ return Retriever(embedding_model=embed_model)
91
+
92
+ @lru_cache()
93
+ def get_generator():
94
+ return Generation()
95
+
96
+ # --- Background Tasks ---
97
+ def ingest_documents(path:str):
98
+ ingester=get_ingester()
99
+ logger.info(f"Starting document ingestion for {path}", extra={"extra": {"document_path": path}})
100
+ ingester.ingest_documents(path)
101
+ logger.info(f"Document ingestion completed for {path}", extra={"extra": {"document_path": path}})
102
+ database.update_document_status(path, "ingested")
103
+ logger.info(f"Document status updated to 'ingested' for {path}", extra={"extra": {"document_path": path}})
104
+
105
+ def ingest_uploaded_docs():
106
+ to_be_ingested = database.list_documents()
107
+ for doc in to_be_ingested:
108
+ if doc[1] == "uploaded":
109
+ ingest_documents(doc[3])
110
+ logger.info(f"Background ingestion completed for {doc[3]}", extra={"extra": {"document_path": doc[3]}})
111
+
112
+ # --- API Endpoints ---
113
+ @app.get("/")
114
+ async def health_check():
115
+ return {"status": "ok"}
116
+
117
+ @app.post("/document")
118
+ async def upload_file(background_tasks: BackgroundTasks, file: UploadFile = File(...)):
119
+ upload_dir = os.path.join(os.getenv("DATA_DIR"), "uploads")
120
+ os.makedirs(upload_dir, exist_ok=True)
121
+
122
+ safe_filename = secure_filename(file.filename)
123
+ file_path = os.path.join(upload_dir, f"{os.path.splitext(safe_filename)[0]}_{int(time.time())}{os.path.splitext(safe_filename)[1]}")
124
+
125
+ with open(file_path, "wb") as buffer:
126
+ shutil.copyfileobj(file.file, buffer)
127
+ database.add_document(filename=safe_filename, path=file_path)
128
+ logger.info(f"Uploading file: {file.filename}", extra={"extra": {"original_filename": file.filename, "safe_path": file_path}})
129
+ background_tasks.add_task(ingest_documents, path=file_path)
130
+ return {"filename": file.filename, "message": "File uploaded successfully."}
131
+
132
+ @app.get("/documents")
133
+ def list_documents():
134
+ documents = database.list_documents()
135
+ logger.info("Fetched document list", extra={"extra": {"document_count": len(documents)}})
136
+ return {"documents": documents}
137
+
138
+ class DeleteRequest(BaseModel):
139
+ source: str
140
+
141
+ @app.delete("/document")
142
+ def clear_document(payload: DeleteRequest, ingester: Ingester = Depends(get_ingester)):
143
+ logger.info(f"Deleting document: {payload.source}")
144
+ message = ingester.delete_document(payload.source)
145
+ logger.info(f"Vector deletion completed for: {payload.source,message}")
146
+ db_msg=database.delete_document(payload.source)
147
+ logger.info(f"Document deletion completed for: {payload.source}")
148
+ return {"message": message, "db_msg": db_msg}
149
+
150
+ class ChatRequest(BaseModel):
151
+ question: str
152
+ history: str
153
+
154
+ @app.post("/chat")
155
+ async def chat_endpoint(request: ChatRequest, retriever: Retriever = Depends(get_retriever), generator: Generation = Depends(get_generator)):
156
+ logger.info(f"Chat request received", extra={"extra": {"question_length": len(request.question)}})
157
+ context = retriever.retrieve_context(request.question)
158
+ response = generator.generate_response(request.question, context, request.history)
159
+ return {"response": response}
app/requirements.txt ADDED
@@ -0,0 +1,199 @@
1
+ accelerate==1.12.0
2
+ annotated-doc==0.0.4
3
+ annotated-types==0.7.0
4
+ antlr4-python3-runtime==4.9.3
5
+ anyio==4.12.1
6
+ asn1crypto==1.5.1
7
+ attrs==25.4.0
8
+ backoff==2.2.1
9
+ bcrypt==5.0.0
10
+ boto3==1.42.41
11
+ botocore==1.42.41
12
+ build==1.4.0
13
+ cachetools==6.2.6
14
+ certifi==2026.1.4
15
+ cffi==2.0.0
16
+ charset-normalizer==3.4.4
17
+ chromadb==1.4.1
18
+ click==8.3.1
19
+ cloudpickle==3.1.1
20
+ coloredlogs==15.0.1
21
+ colorlog==6.10.1
22
+ cryptography==46.0.4
23
+ dill==0.4.1
24
+ distro==1.9.0
25
+ dnspython==2.8.0
26
+ docling==2.71.0
27
+ docling-core==2.62.0
28
+ docling-ibm-models==3.11.0
29
+ docling-parse==4.7.3
30
+ docopt==0.6.2
31
+ dotenv==0.9.9
32
+ durationpy==0.10
33
+ email-validator==2.3.0
34
+ entrypoints==0.4
35
+ fastapi==0.128.0
36
+ fastapi-cli==0.0.20
37
+ fastapi-cloud-cli==0.11.0
38
+ fastar==0.8.0
39
+ filelock==3.20.3
40
+ filetype==1.2.0
41
+ flatbuffers==25.12.19
42
+ fsspec==2026.1.0
43
+ gitdb==4.0.12
44
+ GitPython==3.1.46
45
+ google-ai-generativelanguage==0.6.15
46
+ google-api-core==2.29.0
47
+ google-api-python-client==2.188.0
48
+ google-auth==2.48.0
49
+ google-auth-httplib2==0.3.0
50
+ google-genai==1.61.0
51
+ google-generativeai==0.8.6
52
+ googleapis-common-protos==1.72.0
53
+ grpcio==1.76.0
54
+ grpcio-status==1.71.2
55
+ h11==0.16.0
56
+ httpcore==1.0.9
57
+ httplib2==0.31.2
58
+ httptools==0.7.1
59
+ httpx==0.28.1
60
+ huggingface-hub==0.36.0
61
+ humanfriendly==10.0
62
+ idna==3.11
63
+ importlib_metadata==8.7.1
64
+ importlib_resources==6.5.2
65
+ Jinja2==3.1.6
66
+ jmespath==1.1.0
67
+ joblib==1.5.3
68
+ jsonlines==4.0.0
69
+ jsonpatch==1.33
70
+ jsonpointer==3.0.0
71
+ jsonref==1.1.0
72
+ jsonschema==4.26.0
73
+ jsonschema-specifications==2025.9.1
74
+ kubernetes==35.0.0
75
+ langchain==1.2.7
76
+ langchain-chroma==1.1.0
77
+ langchain-core==1.2.7
78
+ langchain-docling==2.0.0
79
+ langchain-google-genai==4.2.0
80
+ langchain-huggingface==1.2.0
81
+ langgraph==1.0.7
82
+ langgraph-checkpoint==4.0.0
83
+ langgraph-prebuilt==1.0.7
84
+ langgraph-sdk==0.3.3
85
+ langsmith==0.6.7
86
+ latex2mathml==3.78.1
87
+ lxml==6.0.2
88
+ MarkupSafe==3.0.3
89
+ mmh3==5.2.0
90
+ mpire==2.10.2
91
+ mpmath==1.3.0
92
+ multiprocess==0.70.19
93
+ networkx==3.6.1
94
+ numpy==2.4.2
95
+ oauthlib==3.3.1
96
+ omegaconf==2.3.0
97
+ onnxruntime==1.23.2
98
+ opentelemetry-api==1.39.1
99
+ opentelemetry-exporter-otlp-proto-common==1.39.1
100
+ opentelemetry-exporter-otlp-proto-grpc==1.39.1
101
+ opentelemetry-proto==1.39.1
102
+ opentelemetry-sdk==1.39.1
103
+ opentelemetry-semantic-conventions==0.60b1
104
+ orjson==3.11.6
105
+ ormsgpack==1.12.2
106
+ overrides==7.7.0
107
+ packaging==25.0
108
+ pandas==2.3.3
109
+ pipreqs==0.4.13
110
+ platformdirs==4.5.1
111
+ pluggy==1.6.0
112
+ polyfactory==3.2.0
113
+ posthog==5.4.0
114
+ prometheus_client==0.24.1
115
+ proto-plus==1.27.1
116
+ protobuf==5.29.5
117
+ psutil==7.2.2
118
+ pyarrow==23.0.0
119
+ pyasn1==0.6.2
120
+ pyasn1_modules==0.4.2
121
+ pybase64==1.4.3
122
+ pycparser==3.0
123
+ pydantic==2.12.5
124
+ pydantic-extra-types==2.11.0
125
+ pydantic-settings==2.12.0
126
+ pydantic_core==2.41.5
127
+ PyJWT==2.11.0
128
+ pyOpenSSL==25.3.0
129
+ pypdfium2==5.3.0
130
+ PyPika==0.50.0
131
+ pyproject_hooks==1.2.0
132
+ python-dateutil==2.9.0.post0
133
+ python-docx==1.2.0
134
+ python-dotenv==1.2.1
135
+ python-multipart==0.0.22
136
+ python-pptx==1.0.2
137
+ pytz==2025.2
138
+ PyYAML==6.0.3
139
+ rapidocr==3.6.0
140
+ referencing==0.37.0
141
+ regex==2026.1.15
142
+ requests==2.32.5
143
+ requests-oauthlib==2.0.0
144
+ requests-toolbelt==1.0.0
145
+ rignore==0.7.6
146
+ rpds-py==0.30.0
147
+ rsa==4.9.1
148
+ rtree==1.4.1
149
+ s3transfer==0.16.0
150
+ safetensors==0.7.0
151
+ scikit-learn==1.8.0
152
+ scipy==1.17.0
153
+ semchunk==2.2.2
154
+ sentence-transformers==5.2.2
155
+ sentry-sdk==2.51.0
156
+ setuptools==80.10.2
157
+ shapely==2.1.2
158
+ six==1.17.0
159
+ smmap==5.0.2
160
+ sniffio==1.3.1
161
+ snowflake-connector-python==4.2.0
162
+ snowflake-snowpark-python==1.45.0
163
+ sortedcontainers==2.4.0
164
+ soupsieve==2.8.3
165
+ starlette==0.50.0
166
+ sympy==1.14.0
167
+ tabulate==0.9.0
168
+ tenacity==9.1.2
169
+ threadpoolctl==3.6.0
170
+ tokenizers==0.22.2
171
+ toml==0.10.2
172
+ tomlkit==0.14.0
173
+ torch==2.10.0
174
+ torchvision==0.25.0
175
+ tqdm==4.67.2
176
+ transformers==4.57.6
177
+ tree-sitter==0.25.2
178
+ tree-sitter-c==0.24.1
179
+ tree-sitter-javascript==0.25.0
180
+ tree-sitter-python==0.25.0
181
+ tree-sitter-typescript==0.23.2
182
+ typer==0.21.1
183
+ typing-inspection==0.4.2
184
+ typing_extensions==4.15.0
185
+ tzdata==2025.3
186
+ tzlocal==5.3.1
187
+ uritemplate==4.2.0
188
+ urllib3==2.6.3
189
+ uuid_utils==0.14.0
190
+ uvicorn==0.40.0
191
+ watchfiles==1.1.1
192
+ websocket-client==1.9.0
193
+ websockets==15.0.1
194
+ Werkzeug==3.1.5
195
+ wheel==0.46.3
196
+ xxhash==3.6.0
197
+ yarg==0.1.10
198
+ zipp==3.23.0
199
+ zstandard==0.25.0
app/services/database.py ADDED
@@ -0,0 +1,63 @@
1
+ import sqlite3
2
+ from dotenv import load_dotenv
3
+ import os
4
+
5
+ class Database:
6
+ def __init__(self):
7
+ load_dotenv()
8
+ self.db_path = os.path.join(os.getenv("DATA_DIR"),"sqlite_db/sqlite.db")
9
+ os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
10
+ self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
11
+ self.create_tables()
12
+
13
+ def create_tables(self):
14
+ with self.conn:
15
+ self.conn.execute("""
16
+ CREATE TABLE IF NOT EXISTS documents (
17
+ id INTEGER PRIMARY KEY AUTOINCREMENT,
18
+ filename TEXT NOT NULL,
19
+ path TEXT NOT NULL,
20
+ status VARCHAR(20) NOT NULL
21
+ CHECK(status IN ('uploaded', 'ingested')),
22
+ timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
23
+ )
24
+ """)
25
+
26
+ def add_document(self, filename, path, status="uploaded"):
27
+ with self.conn:
28
+ self.conn.execute("""
29
+ INSERT INTO documents (filename, path, status) VALUES (?, ?, ?)
30
+ """, (filename, path, status))
31
+
32
+ def update_document_status(self, path, status):
33
+ with self.conn:
34
+ self.conn.execute("""
35
+ UPDATE documents SET status = ? WHERE path = ?
36
+ """, (status, path))
37
+
38
+ def list_documents(self):
39
+ with self.conn:
40
+ cursor = self.conn.execute("""
41
+ SELECT filename, status, timestamp, path FROM documents ORDER BY timestamp DESC
42
+ """)
43
+ return cursor.fetchall()
44
+
45
+ def delete_document(self, path):
46
+ with self.conn:
47
+ self.conn.execute("""
48
+ DELETE FROM documents WHERE path = ?
49
+ """, (path,))
50
+
51
+ def disconnect(self):
52
+ self.conn.close()
53
+
54
+
55
+ if __name__ == "__main__":
56
+ db = Database()
57
+ # db.create_tables()
58
+ # db.add_document("sample.pdf", "data/uploads/sample.pdf")
59
+ # print(db.list_documents())
60
+ # db.update_document_status("sample.pdf", "ingested")
61
+ # print(db.list_documents())
62
+ db.delete_document("sample.pdf")
63
+ print(db.list_documents())
app/services/document_ingester.py ADDED
@@ -0,0 +1,79 @@
1
+ from langchain_chroma.vectorstores import Chroma
2
+ from langchain_huggingface.embeddings import HuggingFaceEmbeddings
3
+ from langchain_core.documents import Document
4
+ from docling.chunking import HybridChunker
5
+ from docling.document_converter import DocumentConverter
6
+ from pathlib import Path
7
+ from dotenv import load_dotenv
8
+ import os
9
+ import logging
10
+
11
+ class Ingester:
12
+ def __init__(self, embedding_model:HuggingFaceEmbeddings=None):
13
+ self.embedding_model= embedding_model if embedding_model else HuggingFaceEmbeddings(
14
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
15
+ )
16
+ load_dotenv()
17
+ self.DATA_DIR = os.getenv("DATA_DIR")
18
+ self.vector_store= Chroma(
19
+ collection_name="documents_collection",
20
+ embedding_function=self.embedding_model,
21
+ persist_directory=os.path.join(self.DATA_DIR,"chroma_db")
22
+ )
23
+
24
+ self.converter= DocumentConverter()
25
+ self.chunker= HybridChunker(max_tokens=400, overlap=50)
26
+ self.logger = logging.getLogger(__name__)
27
+
28
+ def ingest_documents(self,documents_path):
29
+ source_path = Path(documents_path)
30
+ converted=self.converter.convert(source=source_path).document
31
+ chunks= self.chunker.chunk(dl_doc=converted)
32
+ lc_docs= [Document(page_content=chunk.text,metadata={"source": str(source_path.resolve())}) for chunk in chunks]
33
+ self.logger.info(f"Ingesting {len(lc_docs)} chunks from document '{source_path}' into the vector store.")
34
+ self.vector_store.add_documents(documents=lc_docs)
35
+
36
+ def delete_document(self,source: str):
37
+ source_path = Path(source)
38
+ try:
39
+ os.remove(source_path)
40
+ except FileNotFoundError:
41
+ self.logger.warning(f"File {source_path} not found for deletion.")
42
+ pass # If the file does not exist, we can ignore the error
43
+ # Attempt to delete by the given source value and also by the resolved absolute path.
44
+ deleted_any = False
45
+ try:
46
+ self.vector_store.delete(where={"source": source})
47
+ deleted_any = True
48
+ except Exception as e:
49
+ self.logger.debug(f"Vector delete by provided source failed: {e}")
50
+
51
+ try:
52
+ abs_source = str(source_path.resolve())
53
+ # If abs_source equals the original, this will just repeat; that's fine.
54
+ self.vector_store.delete(where={"source": abs_source})
55
+ deleted_any = True
56
+ except Exception as e:
57
+ self.logger.debug(f"Vector delete by absolute source failed: {e}")
58
+
59
+ if not deleted_any:
60
+ self.logger.warning(f"No vector entries deleted for source '{source}' or '{abs_source}'.")
61
+
62
+ return f"Documents from source '{source}' have been cleared from the vector store. deleted={deleted_any}"
63
+
64
+ def clear_document(self):
65
+ return self.vector_store.reset_collection()
66
+
67
+ def list_chunks(self):
68
+ return self.vector_store.get(limit=100000,offset=0)
69
+
70
+
71
+ if __name__ == "__main__":
72
+ ingester = Ingester()
73
+ # ingester.ingest_documents("E:/Coding/AIMl/Rag/data/test_doc/sample.pdf")
74
+ # print("Document ingestion completed.")
75
+ ingester.delete_document("sample_1770465383.pdf")
76
+ print("Deleted document and its chunks from the vector store.")
77
+ ingester.clear_document()
78
+ print("Cleared documents from the vector store.")
79
+ print(ingester.list_chunks())
app/services/generation.py ADDED
@@ -0,0 +1,64 @@
1
+ from langchain_google_genai import ChatGoogleGenerativeAI
2
+ from langchain_core.prompts import ChatPromptTemplate
3
+ from langchain_core.output_parsers import StrOutputParser
4
+ from dotenv import load_dotenv
5
+ import os
6
+
7
+ class Generation:
8
+ def __init__(self):
9
+ load_dotenv()
10
+ self.GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY")
11
+ self.llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite", temperature=0.7)
12
+
13
+ def generate_response(self, prompt: str, content:str, history:str) -> str:
14
+ template="""Answer the following question based on this context:
15
+ {context}
16
+ Question: {question}
17
+ History: {history}
18
+ """
19
+ prompt_template = ChatPromptTemplate.from_template(template)
20
+ chain = (prompt_template
21
+ | self.llm
22
+ | StrOutputParser()
23
+ )
24
+ response = chain.invoke({"context": content, "question": prompt, "history": history})
25
+ return response
26
+
27
+
28
+ if __name__ == "__main__":
29
+ generator = Generation()
30
+ sample_context ="""Document 1:
31
+ Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.
32
+ --------------------------------------------------Document 2:
33
+ Docling provides optional support for OCR, for example to cover scanned PDFs or content in bitmaps images embedded on a page. In our initial release, we rely on EasyOCR [1], a popular thirdparty OCR library with support for many languages. Docling, by default, feeds a high-resolution page image (216 dpi) to the OCR engine, to allow capturing small print detail in decent quality. While EasyOCR delivers reasonable transcription quality, we observe that it runs fairly slow on CPU (upwards of 30 seconds per page).
34
+ We are actively seeking collaboration from the open-source community to extend Docling with additional OCR backends and speed improvements.
35
+ --------------------------------------------------Document 3:
36
+ Converting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.
37
+ With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.
38
+ Here is what Docling delivers today:
39
+ - Converts PDF documents to JSON or Markdown format, stable and lightning fast
40
+ - Understands detailed page layout, reading order, locates figures and recovers table structures
41
+ - Extracts metadata from the document, such as title, authors, references and language
42
+ - Optionally applies OCR, e.g. for scanned PDFs
43
+ - Can be configured to be optimal for batch-mode (i.e high throughput, low time-to-solution) or interactive mode (compromise on efficiency, low time-to-solution)
44
+ - Can leverage different accelerators (GPU, MPS, etc).
45
+ --------------------------------------------------Document 4:
46
+ - [1] J. AI. Easyocr: Ready-to-use ocr with 80+ supported languages. https://github.com/ JaidedAI/EasyOCR , 2024. Version: 1.7.0.
47
+ - [2] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, and S. Chintala. Pytorch 2: Faster
48
+ --------------------------------------------------Document 5:
49
+ Docling provides optional support for OCR, for example to cover scanned PDFs or content in bitmaps images embedded on a page. In our initial release, we rely on EasyOCR [1], a popular thirdparty OCR library with support for many languages. Docling, by default, feeds a high-resolution page image (216 dpi) to the OCR engine, to allow capturing small print detail in decent quality. While EasyOCR delivers reasonable transcription quality, we observe that it runs fairly slow on CPU (upwards of 30 seconds per page).
50
+ We are actively seeking collaboration from the open-source community to extend Docling with additional OCR backends and speed improvements.
51
+ --------------------------------------------------Document 6:
52
+ Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.
53
+ --------------------------------------------------Document 7:
54
+ Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.
55
+ --------------------------------------------------Document 8:
56
+ Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.
57
+ We encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.
58
+ --------------------------------------------------Document 9:
59
+ In the final pipeline stage, Docling assembles all prediction results produced on each page into a well-defined datatype that encapsulates a converted document, as defined in the auxiliary package docling-core . The generated document object is passed through a post-processing model which leverages several algorithms to augment features, such as detection of the document language, correcting the reading order, matching figures with captions and labelling metadata such as title, authors and references. The final output can then be serialized to JSON or transformed into a Markdown representation at the users request.
60
+ --------------------------------------------------"""
61
+ sample_question = "How does the OCR work in Docling?"
62
+ response = generator.generate_response(sample_question, sample_context, history="")
63
+ print("Generated Response:")
64
+ print(response)
app/services/retriever.py ADDED
@@ -0,0 +1,67 @@
1
+ from langchain_chroma.vectorstores import Chroma
2
+ from langchain_huggingface.embeddings import HuggingFaceEmbeddings
3
+ from langchain_google_genai import ChatGoogleGenerativeAI
4
+ from langchain_core.prompts import ChatPromptTemplate
5
+ from langchain_core.output_parsers import StrOutputParser
6
+ from dotenv import load_dotenv
7
+ import os
8
+
9
+ class Retriever:
10
+ def __init__(self, embedding_model:HuggingFaceEmbeddings=None):
11
+ self.embed= embedding_model if embedding_model else HuggingFaceEmbeddings(
12
+ model_name="sentence-transformers/all-MiniLM-L6-v2"
13
+ )
14
+ load_dotenv()
15
+ self.DATA_DIR = os.getenv("DATA_DIR")
16
+ self.vector_store=Chroma(
17
+ collection_name="documents_collection",
18
+ embedding_function=self.embed,
19
+ persist_directory=os.path.join(self.DATA_DIR,"chroma_db")
20
+ )
21
+
22
+ self.GEMINI_API_KEY = os.getenv("GOOGLE_API_KEY")
23
+ if self.GEMINI_API_KEY is None:
24
+ raise ValueError("GOOGLE_API_KEY not found in environment variables.")
25
+
26
+ def _retrieve_chunks(self,query:str):
27
+ retrieved_chunks = self.vector_store.similarity_search(query,k=3)
28
+ return retrieved_chunks
29
+
30
+ def _query_transformer(self,query:str):
31
+ template= """You are an AI language model assistant. Your task is to generate three
32
+ different versions of the given user question to retrieve relevant documents from a vector
33
+ database. By generating multiple perspectives on the user question, your goal is to help
34
+ the user overcome some of the limitations of the distance-based similarity search.
35
+ Provide these alternative questions separated by newlines. Original question: {question}"""
36
+ prompt = ChatPromptTemplate.from_template(template)
37
+ llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite",temperature=0.7)
38
+ chain= (prompt
39
+ | llm
40
+ | StrOutputParser()
41
+ | (lambda x: x.strip().split("\n")) # Split the output into a list of questions
42
+ )
43
+ response= chain.invoke({"question": query})
44
+ return response
45
+
46
+ def retrieve_context(self, query: str):
47
+ transformed_queries = self._query_transformer(query)
48
+ all_retrieved_chunks = []
49
+ for tq in transformed_queries:
50
+ chunks = self._retrieve_chunks(tq)
51
+ for chunk in chunks:
52
+ if chunk not in all_retrieved_chunks:
53
+ all_retrieved_chunks.append(chunk)
54
+
55
+ context=""
56
+ for idx, doc in enumerate(all_retrieved_chunks):
57
+ context+=(f"Context {idx+1}:\n{doc.page_content}\n{'-'*50}\n")
58
+ return context
59
+
60
+ if __name__ == "__main__":
61
+ retriever_instance = Retriever()
62
+ # results = retriever_instance.retrieve_chunks("Sample query")
63
+ # print(results)
64
+ # transformed_response = retriever_instance.query_transformer("tell me about the history of AI and its applications in healthcare and finance")
65
+ # print(transformed_response)
66
+ context = retriever_instance.retrieve_context("how does the ocr work in docling?")
67
+ print(context)
compose.yaml ADDED
@@ -0,0 +1,29 @@
1
+ # Comments are provided throughout this file to help you get started.
2
+ # If you need more help, visit the Docker Compose reference guide at
3
+ # https://docs.docker.com/go/compose-spec-reference/
4
+
5
+ # Here the instructions define your application as a service called "server".
6
+ # This service is built from the Dockerfile in the current directory.
7
+ # You can add other services your application may depend on here, such as a
8
+ # database or a cache. For examples, see the Awesome Compose repository:
9
+ # https://github.com/docker/awesome-compose
10
+ services:
11
+ server:
12
+ build:
13
+ context: ./app
14
+ ports:
15
+ - 8000:8000
16
+ env_file:
17
+ - .env
18
+
19
+ frontend:
20
+ build:
21
+ context: ./ui
22
+ ports:
23
+ - 3000:3000
24
+ depends_on:
25
+ - server
26
+ env_file:
27
+ - .env
28
+ environment:
29
+ - API_URL=http://server:8000/
requirements.txt ADDED
Binary file (10.4 kB).
 
ui/Dockerfile ADDED
@@ -0,0 +1,25 @@
1
+ FROM python:3.13-slim
2
+
3
+ # Prevent Python from writing pyc files & buffering stdout
4
+ ENV PYTHONDONTWRITEBYTECODE=1
5
+ ENV PYTHONUNBUFFERED=1
6
+
7
+ WORKDIR /app
8
+
9
+ # Install system dependencies (optional but safe)
10
+ RUN apt-get update && apt-get install -y \
11
+ build-essential \
12
+ && rm -rf /var/lib/apt/lists/*
13
+
14
+ # Install Python dependencies
15
+ COPY requirements.txt .
16
+ RUN pip install --no-cache-dir -r requirements.txt
17
+
18
+ # Copy application code
19
+ COPY . .
20
+
21
+ # Expose Streamlit port
22
+ EXPOSE 3000
23
+
24
+ # Run Streamlit
25
+ CMD ["streamlit", "run", "Home.py", "--server.port=3000", "--server.address=0.0.0.0"]
ui/Home.py ADDED
@@ -0,0 +1,85 @@
1
+ import streamlit as st
2
+
3
+ st.set_page_config(
4
+ page_title="Knowledge Management RAG",
5
+ page_icon="🧠",
6
+ layout="wide"
7
+ )
8
+
9
+ # Hero Section
10
+ st.title("🧠 Knowledge Management System")
11
+ st.markdown("""
12
+ ### Your Personal AI-Powered Knowledge Base
13
+
14
+ Welcome to a local-first Retrieval-Augmented Generation (RAG) system designed to help you
15
+ **organize**, **search**, and **chat** with your documents. Powered by advanced AI models,
16
+ this tool transforms your static files into an interactive knowledge engine.
17
+ """)
18
+
19
+ st.divider()
20
+
21
+ # Key Features Section
22
+ st.header("✨ Key Features")
23
+
24
+ col1, col2, col3 = st.columns(3)
25
+
26
+ with col1:
27
+ st.subheader("📄 Smart Ingestion")
28
+ st.markdown("""
29
+ - **Advanced Parsing**: Uses [Docling](https://github.com/DS4SD/docling) for high-fidelity PDF & document processing.
30
+ - **Async Processing**: Upload large files without blocking the UI.
31
+ - **State Management**: Track document status from upload to full indexing.
32
+ """)
33
+
34
+ with col2:
35
+ st.subheader("🤖 Intelligent Retrieval")
36
+ st.markdown("""
37
+ - **Semantic Search**: Powered by `all-MiniLM-L6-v2` embeddings.
38
+ - **Vector Database**: Fast, local vector storage using ChromaDB.
39
+ - **Query Expansion**: Generates multiple perspectives for better recall.
40
+ """)
41
+
42
+ with col3:
43
+ st.subheader("💬 Context-Aware Chat")
44
+ st.markdown("""
45
+ - **Gemini Powered**: Uses Google's Gemini 2.5 Flash Lite for accurate reasoning.
46
+ - **History Aware**: Remembers conversation context for fluid interaction.
47
+ - **Source Citations**: Know exactly where the answer came from (Coming Soon).
48
+ """)
49
+
50
+ st.divider()
51
+
52
+ # How It Works / Getting Started
53
+ st.header("🚀 Getting Started")
54
+
55
+ step1, step2, step3 = st.columns(3)
56
+
57
+ with step1:
58
+ st.markdown("#### 1. Upload Documents")
59
+ st.info("Go to the **Documents** page and upload your PDFs, DOCX, or TXT files.")
60
+
61
+ with step2:
62
+ st.markdown("#### 2. Process & Index")
63
+ st.warning("The system automatically processes files in the background. Watch the status change to 'Ingested'.")
64
+
65
+ with step3:
66
+ st.markdown("#### 3. Chat with Data")
67
+ st.success("Switch to the **Chat** page and ask questions about your knowledge base.")
68
+
69
+ st.divider()
70
+
71
+ # Tech Stack Footer
72
+ with st.expander("🛠️ Under the Hood"):
73
+ st.markdown("""
74
+ This project is built with a modern, robust tech stack:
75
+ - **Backend**: FastAPI (Python)
76
+ - **Frontend**: Streamlit
77
+ - **LLM**: Google Gemini 2.5 Flash Lite
78
+ - **Embeddings**: HuggingFace (`sentence-transformers`)
79
+ - **Vector Store**: ChromaDB (local)
80
+ - **State Management**: SQLite
81
+ - **Parsing**: Docling
82
+ """)
83
+
84
+ st.markdown("---")
85
+ st.caption("Built with ❤️ by Logan | version 2.0.0")
ui/pages/Chat.py ADDED
@@ -0,0 +1,55 @@
1
+ import streamlit as st
2
+ import requests
3
+ import os
4
+ import json
5
+ from dotenv import load_dotenv
6
+ load_dotenv()
7
+
8
+ API_URL = os.getenv("API_URL", "http://localhost:8000/")
9
+
10
+
11
+ st.set_page_config(page_title="Chat App", page_icon="💬")
12
+ st.title("💬 Chat Interface")
13
+
14
+ if "messages" not in st.session_state:
15
+ st.session_state.messages = []
16
+
17
+
18
+ for message in st.session_state.messages:
19
+ with st.chat_message(message["role"]):
20
+ st.markdown(message["content"])
21
+
22
+
23
+ if prompt := st.chat_input("Type your message..."):
24
+ st.session_state.messages.append(
25
+ {"role": "user", "content": prompt}
26
+ )
27
+
28
+ with st.chat_message("user"):
29
+ st.markdown(prompt)
30
+
31
+ with st.chat_message("assistant"):
32
+ with st.spinner("Thinking..."):
33
+ res = None
34
+ try:
35
+ res = requests.post(
36
+ API_URL+"chat",
37
+ json={"question": prompt, "history": json.dumps(st.session_state.messages)}
38
+ )
39
+ except requests.exceptions.RequestException:
40
+ st.error("⚠️ Could not connect to the backend. Please try again later.")
41
+ st.stop()
42
+
43
+ if res is None:
44
+ st.stop()
45
+
46
+ if res.status_code == 200:
47
+ reply = res.json()["response"]
48
+ else:
49
+ reply = "Sorry, something went wrong. Please try again later."
50
+
51
+ st.session_state.messages.append(
52
+ {"role": "assistant", "content": reply}
53
+ )
54
+ st.markdown(reply)
55
+
ui/pages/Documents.py ADDED
@@ -0,0 +1,111 @@
1
+ import streamlit as st
2
+ import requests
3
+ import os
4
+ import pandas as pd
5
+ from dotenv import load_dotenv
6
+
7
+ load_dotenv()
8
+
9
+ API_URL = os.getenv("API_URL", "http://localhost:8000/")
10
+
11
+ st.set_page_config(page_title="Documents", page_icon="📄")
12
+ st.title("📄 Document Management")
13
+
14
+ def fetch_documents():
15
+ try:
16
+ res = requests.get(API_URL + "documents")
17
+ if res.status_code == 200:
18
+ return res.json().get("documents", [])
19
+ except Exception:
20
+ st.error("Failed to fetch documents from the server.")
21
+ return []
22
+
23
+ def upload_document(file):
24
+ files = {"file": (file.name, file.getvalue(), file.type)}
25
+ try:
26
+ res=requests.post(API_URL + "document", files=files)
27
+ except requests.exceptions.RequestException:
28
+ st.error("⚠️ Could not connect to the backend. Please try again later.")
29
+ return None
30
+ return res
31
+
32
+ def delete_document(name):
33
+ try:
34
+ res=requests.delete(API_URL + f"document", json={"source": name})
35
+ except requests.exceptions.RequestException:
36
+ st.error("⚠️ Could not connect to the backend. Please try again later.")
37
+ return None
38
+ return res
39
+
40
+ st.subheader("📤 Upload Document")
41
+
42
+ uploaded_file = st.file_uploader(
43
+ "Choose a file",
44
+ type=["pdf", "docx", "txt"]
45
+ )
46
+
47
+ if uploaded_file:
48
+ if st.button("Upload and Ingest"):
49
+ with st.spinner("Uploading and ingesting document..."):
50
+ res = upload_document(uploaded_file)
51
+
52
+ if res is None:
53
+ pass # Error already shown by upload_document
54
+ elif res.status_code == 200:
55
+ st.success("✅ Document uploaded successfully")
56
+ else:
57
+ st.error("❌ Failed to upload document")
58
+
59
+ st.divider()
60
+
61
+ st.subheader("📄 Available Documents")
62
+
63
+ search_query = st.text_input(
64
+ "",
65
+ placeholder="🔍 Search documents"
66
+ )
67
+
68
+ with st.spinner("Fetching documents..."):
69
+ documents = fetch_documents()
70
+
71
+ if search_query:
72
+ documents = [
73
+ doc for doc in documents
74
+ if search_query.lower() in doc[0].lower()
75
+ ]
76
+ if not documents:
77
+ st.info("No documents available.")
78
+ else:
79
+ # Table header
80
+ header_cols = st.columns([3, 2, 2, 1])
81
+ header_cols[0].markdown("**Filename**")
82
+ header_cols[1].markdown("**Status**")
83
+ header_cols[2].markdown("**Uploaded At**")
84
+ header_cols[3].markdown("**Actions**")
85
+
86
+ st.divider()
87
+
88
+ for idx, doc in enumerate(documents):
89
+ filename, status, timestamp, path = doc
90
+
91
+ cols = st.columns([3, 2, 2, 1])
92
+
93
+ cols[0].write(filename)
94
+ cols[1].write(status)
95
+ cols[2].write(timestamp)
96
+
97
+ if cols[3].button(
98
+ "Delete",
99
+ key=f"delete_{idx}",
100
+ type="secondary"
101
+ ):
102
+ with st.spinner("Deleting document..."):
103
+ res = delete_document(path)
104
+
105
+ if res is None:
106
+ pass
107
+ elif res.status_code == 200:
108
+ st.success(f"✅ Deleted `{filename}`")
109
+ st.rerun()
110
+ else:
111
+ st.error("❌ Failed to delete document")
ui/requirements.txt ADDED
@@ -0,0 +1,52 @@
1
+ altair==6.0.0
2
+ altex==0.2.0
3
+ beautifulsoup4==4.14.3
4
+ blinker==1.9.0
5
+ colorama==0.4.6
6
+ contourpy==1.3.3
7
+ cycler==0.12.1
8
+ et_xmlfile==2.0.0
9
+ Faker==40.1.2
10
+ favicon==0.7.0
11
+ fonttools==4.61.1
12
+ htbuilder==0.9.0
13
+ kiwisolver==1.4.9
14
+ Markdown==3.10.1
15
+ markdown-it-py==4.0.0
16
+ markdownlit==0.0.7
17
+ marko==2.2.2
18
+ matplotlib==3.10.8
19
+ mdurl==0.1.2
20
+ narwhals==2.15.0
21
+ opencv-python==4.13.0.90
22
+ openpyxl==3.1.5
23
+ pillow==11.3.0
24
+ plotly==6.5.2
25
+ pyclipper==1.4.0
26
+ pydeck==0.9.1
27
+ Pygments==2.19.2
28
+ pylatexenc==2.10
29
+ pymdown-extensions==10.20.1
30
+ pyparsing==3.3.2
31
+ pyreadline3==3.5.4
32
+ rich==14.3.2
33
+ rich-toolkit==0.18.1
34
+ st-annotated-text==4.0.2
35
+ st-theme==1.2.3
36
+ streamlit==1.53.1
37
+ streamlit-avatar==0.1.3
38
+ streamlit-camera-input-live==0.2.0
39
+ streamlit-card==1.0.2
40
+ streamlit-embedcode==0.1.2
41
+ streamlit-extras==0.7.8
42
+ streamlit-image-coordinates==0.4.0
43
+ streamlit-keyup==0.3.0
44
+ streamlit-notify==0.3.1
45
+ streamlit-shadcn-ui==0.1.19
46
+ streamlit-toggle-switch==1.0.2
47
+ streamlit-vertical-slider==2.5.5
48
+ streamlit_faker==0.0.4
49
+ tornado==6.5.4
50
+ validators==0.35.0
51
+ watchdog==6.0.0
52
+ xlsxwriter==3.2.9