nikhile-galileo committed on
Commit e68d535 · 1 Parent(s): dfb540e

Adding finance protect demo
.gitignore ADDED
@@ -0,0 +1,30 @@
+ # Mac stuff
+ .DS_Store
+
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # Environments
+ .env
+ .venv
+ venv/
+
+ # PyCharm
+ .idea/
+
+ backend/backend-venv
+
+ # Database files
+ *.db
+ *.db.lock
+ *.pdf
+
+ # Processed data files
+ rfp-data/processed/*.jsonl
+ fin-data/processed/*.jsonl
+
+ # Notes
+ notes
Dockerfile ADDED
@@ -0,0 +1,34 @@
+ # Use Python 3.12 base image and install uv
+ FROM python:3.12-slim
+
+ # Install uv
+ RUN pip install uv
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     UV_COMPILE_BYTECODE=1 \
+     UV_LINK_MODE=copy \
+     PYTHONPATH=/app
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy pyproject.toml and uv.lock first for better layer caching
+ COPY pyproject.toml uv.lock ./
+
+ # Install dependencies using uv
+ RUN uv sync --frozen --no-cache
+
+ # Copy the application code
+ COPY backend/ ./backend/
+ COPY fin-data/ ./fin-data/
+
+ # Create static directory if it doesn't exist
+ RUN mkdir -p backend/api/static
+
+ # Expose the port
+ EXPOSE 8000
+
+ # Command to run the application
+ CMD ["uv", "run", "uvicorn", "backend.api.main:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,10 +1,95 @@
- ---
- title: Demos
- emoji: 🌖
- colorFrom: pink
- colorTo: yellow
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Galileo POC
+
+ A Python-based RAG (Retrieval-Augmented Generation) application for processing and analyzing RFP PDF documents.
+
+ ## Project Structure
+
+ ```
+ api/                  # FastAPI endpoints
+ backend/              # Core application logic
+ ├── classes/          # Core application classes
+ ├── conf/             # Configuration files
+ ├── main/             # Main application modules
+ ├── models/           # Data models
+ └── backend-venv/     # Python virtual environment
+
+ ui/                   # Streamlit UI application
+ ```
+
+ ## Setup Instructions
+
+ 1. Create and activate a Python virtual environment:
+    ```bash
+    python3.12 -m venv backend-venv
+    source backend-venv/bin/activate
+    ```
+
+ 2. Install dependencies:
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 3. Set up environment variables:
+    - Create a `.env` file in the root directory
+    - Configure environment-specific settings
+    - Use python-dotenv for loading environment variables
+
+ ## API
+
+ ### Running the FastAPI Server
+
+ ```bash
+ uvicorn api.api:app --reload
+ ```
+
+ Access documentation:
+ - Swagger UI: http://localhost:8000/docs
+ - ReDoc: http://localhost:8000/redoc
+
+ ## UI
+
+ ### Running the Streamlit App
+
+ ```bash
+ cd ui
+ streamlit run app.py
+ ```
+
+ Access the UI at: http://localhost:8501
+
+ ## Configuration
+
+ - Environment variables via `.env` file
+ - YAML configuration in `conf/config.yaml`
+ - Environment-specific settings through `APP_ENV`
+
+ ## Development
+
+ The project follows a modular structure:
+ - Backend: Core RAG functionality
+ - API: REST endpoints for RAG operations
+ - UI: Streamlit-based interface for user interaction
+
+ ## License
+
+ MIT License
+
+ ## Contributing
+
+ Contributions are welcome! Please follow the standard GitHub workflow.
+
+
+ # Questions
+
+ - why did fairfield cdc issue the rfp
+
+ Reliable
+ The Fairfield CDC issued the RFP to find a banking institution that shares their commitment to community development, redevelopment, and economic development activities. They aim to enhance the physical, economic, health, safety, welfare, and social aspects of life for all residents. The RFP is for banking services.
+
+ ?
+ Fairfield CDC issued the RFP to solicit proposals from banking institutions to provide financial services. The RFP aims to find a bank that offers competitive rates and fees while ensuring deposit collateral. The selected bank will partner with Fairfield CDC to advance community development initiatives.
+
+ Hallucination
+ - Fairfield CDC issued the RFP to solicit proposals for a new banking partner. The RFP outlines specific requirements, including competitive rates, FDIC coverage, and customer support. The goal is to find a bank that can provide comprehensive services and support the CDC's community development initiatives.
backend/.env.sample ADDED
@@ -0,0 +1,5 @@
+ APP_ENV= # local_ks or local_ne
+ GOOGLE_GEMINI_API_KEY= # Replace with your own key
+ GALILEO_API_KEY=
+ GALILEO_CONSOLE_URL=
+ GALILEO_API_ACCESS_TOKEN=
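The sample above shows the dotenv format the app expects; the application itself loads it with python-dotenv's `load_dotenv()`. As a rough stdlib sketch of what that loading amounts to — the parser and the sample values here are illustrative, not part of the repo:

```python
import os

def parse_env_sample(text: str) -> dict:
    """Parse simple KEY=value lines, ignoring blanks and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Hypothetical sample values for illustration only
sample = """\
APP_ENV=local_ks  # local_ks or local_ne
GALILEO_API_KEY=abc123
"""
parsed = parse_env_sample(sample)
os.environ.update(parsed)  # make the values visible to the process
```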
backend/Dockerfile ADDED
@@ -0,0 +1,19 @@
+ # Stage 1: Build stage
+ FROM python:3.13-slim as builder
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements
+ COPY requirements.txt .
+
+ # Install dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY backend .
+
+ # Command to run the application
+ CMD ["echo", "this is backend code"]
backend/api/__init__.py ADDED
@@ -0,0 +1 @@
+ from .main import app
backend/api/main-test.py ADDED
@@ -0,0 +1,36 @@
+ import uvicorn
+ from fastapi import FastAPI, Form
+ from fastapi.requests import Request
+ from fastapi.responses import HTMLResponse, JSONResponse
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.templating import Jinja2Templates
+ from dotenv import load_dotenv
+
+ app = FastAPI()
+
+ app.mount("/static", StaticFiles(directory="static"), name="static")
+ templates = Jinja2Templates(directory="templates")
+
+ @app.get("/", response_class=HTMLResponse)
+ async def read_root(request: Request):
+     return templates.TemplateResponse("index.html", {"request": request})
+
+ @app.post("/search")
+ async def search(
+     query: str = Form(...),
+     top_k: int = Form(5),
+     protection: bool = Form(False)
+ ):
+     # Simulate processing
+     return JSONResponse({
+         "message": "Search response here!",
+         "query": query,
+         "top_k": top_k,
+         "protection": protection
+     })
+
+
+ if __name__ == "__main__":
+     load_dotenv()
+     uvicorn.run(app, host="0.0.0.0", port=8000)
backend/api/main.py ADDED
@@ -0,0 +1,139 @@
+ from pathlib import Path
+
+ import uvicorn
+ from dotenv import load_dotenv
+ from fastapi import FastAPI, Form
+ from fastapi.requests import Request
+ from fastapi.responses import HTMLResponse, JSONResponse
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.templating import Jinja2Templates
+
+ from backend.classes.embedding_model import EmbeddingModelConfig, EmbeddingModel
+ from backend.classes.galileo_platform import GalileoPlatformConfig, GalileoPlatform
+ from backend.classes.generative_model import GeminiModelConfig, GeminiModel, OpenAIModelConfig, OpenAIModel
+ from backend.classes.rag_application import RAGApplicationConfig, RAGApplication
+ from backend.classes.vector_database.milvus_vector_database import (
+     MilvusVectorDatabaseConfig,
+     MilvusVectorDatabase,
+ )
+ from backend.utils.utils import (
+     initialize_logger,
+     read_config,
+     set_env_variables,
+     create_vector_database,
+     get_embedding_model,
+     get_generative_model,
+ )
+
+ app = FastAPI()
+
+ app.mount("/static", StaticFiles(directory="backend/api/static"), name="static")
+ templates = Jinja2Templates(directory="backend/api/templates")
+
+ load_dotenv()
+
+ logger = initialize_logger()
+
+ # Resolve the config file relative to this module
+ config = read_config(str(Path(Path(__file__).parent.parent, "conf/config.yaml")))
+
+ # Check that the required environment variables are set
+ env_variables = set_env_variables(config["env_variables"])
+
+ app_config = config[env_variables["APP_ENV"]]
+ app_config["env_vars"] = env_variables
+
+ # Create embedding model object
+ embedding_model_config = EmbeddingModelConfig(
+     model_name=app_config["embedding_model"]["model_name"],
+     batch_size=app_config["embedding_model"]["batch_size"],
+ )
+ embedding_model = get_embedding_model(EmbeddingModel, embedding_model_config)
+
+ # Create vector database object
+ vector_db_config = MilvusVectorDatabaseConfig(
+     db_path=app_config["vector_database"]["db_path"],
+     collection_name=app_config["vector_database"]["collection_name"],
+     vector_dimensions=app_config["vector_database"]["dimensions"],
+     drop_if_exists=False,
+ )
+ vector_db = create_vector_database(MilvusVectorDatabase, vector_db_config)
+
+ # Create generative model objects
+ gemini_generative_model_config = GeminiModelConfig(
+     model_name=app_config["gemini_generative_model"]["model_name"],
+     api_keys=[env_variables["GOOGLE_GEMINI_API_KEY"], env_variables["GOOGLE_GEMINI_BACKUP_API_KEY"]],
+     temperature=app_config["gemini_generative_model"]["temperature"],
+ )
+ gemini_generative_model = get_generative_model(GeminiModel, gemini_generative_model_config)
+
+ openai_generative_model_config = OpenAIModelConfig(
+     model_name=app_config["openai_generative_model"]["model_name"],
+     api_key=env_variables["OPENAI_API_KEY"],
+     temperature=app_config["openai_generative_model"]["temperature"],
+ )
+ openai_generative_model = get_generative_model(OpenAIModel, openai_generative_model_config)
+
+ # Create Galileo platform object
+ galileo_platform_config = GalileoPlatformConfig(
+     evaluate_project_name=app_config["galileo_platform"]["evaluate_project_name"],
+     observe_project_name=app_config["galileo_platform"]["observe_project_name"],
+     protect_project_name=app_config["galileo_platform"]["protect_project_name"],
+     protect_stage_name=app_config["galileo_platform"]["protect_stage_name"],
+ )
+ galileo_platform = GalileoPlatform(galileo_platform_config)
+
+ # Initialize RAG application
+ rag_application_config = RAGApplicationConfig(
+     embedding_model=embedding_model,
+     vector_db=vector_db,
+     # gemini_generative_model=gemini_generative_model,
+     generative_model=openai_generative_model,
+     galileo_platform=galileo_platform,
+ )
+ rag_app = RAGApplication(rag_application_config)
+
+
+ @app.get("/", response_class=HTMLResponse)
+ async def read_root(request: Request):
+     return templates.TemplateResponse("index.html", {"request": request})
+
+
+ # TODO: Nikhil
+ # @app.post("/other-metrics")
+ # async def search(
+
+
+ @app.post("/search")
+ async def search(
+     query: str = Form(...),
+     top_k: int = Form(5),
+     protection: bool = Form(False),
+     hallucination_detection: bool = Form(False),
+     induce_hallucination: bool = Form(False),
+ ):
+     response, redacted_response, original_response, context_adherence_score, pii_flag = rag_app.run(
+         query,
+         protect_enabled=protection,
+         top_k=top_k,
+         hallucination_detection=hallucination_detection,
+         induce_hallucination=induce_hallucination,
+     )
+
+     return JSONResponse(
+         {
+             "message": response,
+             "redacted_message": redacted_response,
+             "original_message": original_response,
+             "metrics": {
+                 "context_adherence": context_adherence_score,
+                 "pii_flag": pii_flag,
+             },
+         }
+     )
+
+
+ if __name__ == "__main__":
+     uvicorn.run(app, host="0.0.0.0", port=8000)
backend/api/routers/__init__.py ADDED
File without changes
backend/api/routers/home_router.py ADDED
@@ -0,0 +1,29 @@
+ # home_router.py
+ from fastapi import FastAPI, Request, Form
+ from fastapi.responses import HTMLResponse
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.templating import Jinja2Templates
+
+ app = FastAPI()
+
+ # Serve static files and templates
+ app.mount("/static", StaticFiles(directory="static"), name="static")
+ templates = Jinja2Templates(directory="templates")
+
+ @app.get("/", response_class=HTMLResponse)
+ async def read_form(request: Request):
+     return templates.TemplateResponse("index.html", {"request": request})
+
+ @app.post("/search")
+ async def handle_search(
+     query: str = Form(...),
+     top_k: int = Form(5),
+     protection: bool = Form(False),
+ ):
+     # Handle search logic here
+     result = {
+         "query": query,
+         "top_k": top_k,
+         "protection": protection,
+     }
+     return result
backend/api/templates/index.html ADDED
@@ -0,0 +1,369 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+     <meta charset="UTF-8">
+     <title>Finance Q/A Bot</title>
+     <script src="https://cdn.tailwindcss.com"></script>
+     <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
+ </head>
+ <body class="bg-gray-100 min-h-screen flex">
+
+     <!-- Main Content -->
+     <div class="w-3/4 p-10">
+         <h1 class="text-3xl font-bold mb-6">Finance Q/A Bot</h1>
+         <form id="searchForm" class="space-y-4">
+             <input
+                 type="text"
+                 id="query"
+                 name="query"
+                 placeholder="Ask a question"
+                 class="p-2 border w-3/4 rounded"
+                 required
+             />
+             <br />
+
+             <div class="flex items-center space-x-2">
+                 <button
+                     type="submit"
+                     class="bg-blue-500 text-white px-4 py-2 rounded hover:bg-blue-600"
+                 >
+                     Submit
+                 </button>
+                 <!-- Loading spinner -->
+                 <div id="loadingSpinner" class="hidden">
+                     <svg class="animate-spin h-5 w-5 text-blue-500" xmlns="http://www.w3.org/2000/svg" fill="none" viewBox="0 0 24 24">
+                         <circle class="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" stroke-width="4"></circle>
+                         <path class="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4zm2 5.291A7.962 7.962 0 014 12H0c0 3.042 1.135 5.824 3 7.938l3-2.647z"></path>
+                     </svg>
+                 </div>
+             </div>
+         </form>
+
+         <!-- Context Adherence Message (above the result box) -->
+         <div id="adherenceMessage" class="mt-6 p-3 rounded hidden"></div>
+
+         <!-- Result area -->
+         <div id="result" class="mt-4 p-4 bg-white shadow rounded hidden"></div>
+     </div>
+
+     <!-- Sidebar on the right -->
+     <div class="w-1/4 bg-white shadow p-6">
+         <h2 class="text-xl font-bold mb-4">Options</h2>
+         <div class="flex flex-col space-y-4">
+             <label class="block">
+                 <span class="text-gray-700">Top K:</span>
+                 <input
+                     type="number"
+                     id="top_k"
+                     name="top_k"
+                     value="5"
+                     class="mt-1 p-2 w-full border rounded"
+                 />
+             </label>
+             <label class="flex items-center space-x-2">
+                 <input
+                     type="checkbox"
+                     id="protection"
+                     name="protection"
+                     class="form-checkbox text-green-600 focus:ring-green-500"
+                 />
+                 <span>Enable Galileo Protection</span>
+             </label>
+             <label class="flex items-center space-x-2">
+                 <input
+                     type="checkbox"
+                     id="hallucination_detection"
+                     name="hallucination_detection"
+                     class="form-checkbox text-blue-600 focus:ring-blue-500"
+                 />
+                 <span>Enable Hallucination Detection</span>
+             </label>
+             <label class="flex items-center space-x-2">
+                 <input
+                     type="checkbox"
+                     id="induce_hallucination"
+                     name="induce_hallucination"
+                     class="form-checkbox text-red-600 focus:ring-red-500"
+                 />
+                 <span>Induce Hallucination</span>
+             </label>
+         </div>
+     </div>
+
+     <script>
+     $(document).ready(function () {
+         // Check for URL parameters and pre-fill form
+         const urlParams = new URLSearchParams(window.location.search);
+
+         if (urlParams.has('query')) {
+             $('#query').val(urlParams.get('query'));
+         }
+         if (urlParams.has('top_k')) {
+             $('#top_k').val(urlParams.get('top_k'));
+         }
+         if (urlParams.has('protection')) {
+             $('#protection').prop('checked', urlParams.get('protection') === 'true');
+         }
+         if (urlParams.has('hallucination_detection')) {
+             $('#hallucination_detection').prop('checked', urlParams.get('hallucination_detection') === 'true');
+         }
+         if (urlParams.has('induce_hallucination')) {
+             $('#induce_hallucination').prop('checked', urlParams.get('induce_hallucination') === 'true');
+         }
+
+         $('#searchForm').on('submit', function (e) {
+             e.preventDefault();
+
+             const query = $('#query').val();
+             const top_k = $('#top_k').val();
+             const protection = $('#protection').is(':checked');
+             const hallucination_detection = $('#hallucination_detection').is(':checked');
+             const induce_hallucination = $('#induce_hallucination').is(':checked');
+
+             // Show loading spinner
+             $('#loadingSpinner').removeClass('hidden');
+
+             // Hide previous results
+             $('#adherenceMessage').addClass('hidden');
+             $('#result').addClass('hidden');
+
+             $.ajax({
+                 type: 'POST',
+                 url: '/search',
+                 data: {
+                     query: query,
+                     top_k: top_k,
+                     protection: protection,
+                     hallucination_detection: hallucination_detection,
+                     induce_hallucination: induce_hallucination
+                 },
+                 success: function (response) {
+                     // Hide loading spinner
+                     $('#loadingSpinner').addClass('hidden');
+
+                     // Check for PII flag first
+                     const piiFlag = response.metrics && response.metrics.pii_flag;
+
+                     // Check if any PII types are detected
+                     const piiDetected = piiFlag && Object.values(piiFlag).some(value => value === true);
+
+                     // If PII is detected, display a specific PII warning
+                     if (piiDetected) {
+                         // Build the PII warning message
+                         const detectedTypes = [];
+                         if (piiFlag.phone_number) detectedTypes.push('phone number');
+                         if (piiFlag.email) detectedTypes.push('<span style="color:yellow; font-weight: bold">email address</span>');
+                         if (piiFlag.name) detectedTypes.push('<span style="color:yellow; font-weight: bold">personal name</span>');
+                         if (piiFlag.company) detectedTypes.push('<span style="color:yellow; font-weight: bold">company name</span>');
+
+                         let piiMessage = 'Sensitive personally identifiable information detected! The following types of PII were found: ';
+                         if (detectedTypes.length === 1) {
+                             piiMessage += detectedTypes[0];
+                         } else if (detectedTypes.length === 2) {
+                             piiMessage += detectedTypes.join(' and ');
+                         } else {
+                             piiMessage += detectedTypes.slice(0, -1).join(', ') + ', and ' + detectedTypes.slice(-1);
+                         }
+
+                         // Display the PII warning and response
+                         $('#result')
+                             .removeClass('hidden')
+                             .html(`
+                                 <div class="space-y-4">
+                                     <!-- PII Warning Message with red background -->
+                                     <div class="bg-red-500 text-white p-3 rounded-lg">
+                                         <div class="flex items-start">
+                                             <div class="flex-shrink-0">
+                                                 <svg class="h-5 w-5 text-white" viewBox="0 0 20 20" fill="currentColor">
+                                                     <path fill-rule="evenodd" d="M10 18a8 8 0 100-16 8 8 0 000 16zM8.707 7.293a1 1 0 00-1.414 1.414L8.586 10l-1.293 1.293a1 1 0 101.414 1.414L10 11.414l1.293 1.293a1 1 0 001.414-1.414L11.414 10l1.293-1.293a1 1 0 00-1.414-1.414L10 8.586 8.707 7.293z" clip-rule="evenodd" />
+                                                 </svg>
+                                             </div>
+                                             <div class="ml-3">
+                                                 <p class="font-medium">${piiMessage}</p>
+                                             </div>
+                                         </div>
+                                     </div>
+
+                                     <!-- Original Message, with detected PII highlighted -->
+                                     <div class="bg-white border border-gray-200 rounded-lg p-4">
+                                         <p class="font-medium text-gray-700 mb-2">Original Message:</p>
+                                         <p class="text-black-600">
+                                             <style>
+                                                 tag {
+                                                     font-weight: bold;
+                                                     background-color: yellow;
+                                                 }
+                                             </style>
+                                             ${response.message}</p>
+                                     </div>
+
+                                     <!-- Redacted Message, with redactions struck through -->
+                                     <div class="bg-white border border-gray-200 rounded-lg p-4">
+                                         <p class="font-medium text-gray-700 mb-2">Redacted Message:</p>
+                                         <p class="text-black-600">
+                                             <style>
+                                                 pii {
+                                                     font-weight: bold;
+                                                     text-decoration: line-through;
+                                                     background-color: yellow;
+                                                 }
+                                             </style>
+                                             ${response.redacted_message || 'No redacted version available'}
+                                         </p>
+                                     </div>
+                                 </div>
+                             `);
+                     } else if ((induce_hallucination && response.original_message && response.message) ||
+                                (response.metrics && response.metrics.context_adherence < 0.8)) {
+                         // Display hallucination warning and both responses
+                         const adherenceScore = response.metrics ? response.metrics.context_adherence : 1;
+                         const isInducedHallucination = induce_hallucination && response.original_message && response.message;
+
+                         let warningMessage = '';
+                         if (isInducedHallucination) {
+                             warningMessage = 'Hallucination induced for demonstration purposes! Comparing original vs safe response.';
+                         } else {
+                             warningMessage = 'Potential hallucination detected! Comparing original vs safe response.';
+                         }
+
+                         $('#result')
+                             .removeClass('hidden')
+                             .html(`
+                                 <div class="space-y-4">
+                                     <!-- Hallucination Warning Message with orange background -->
+                                     <div class="bg-orange-500 text-white p-3 rounded-lg">
+                                         <div class="flex items-start">
+                                             <div class="flex-shrink-0">
+                                                 <svg class="h-5 w-5 text-white" viewBox="0 0 20 20" fill="currentColor">
+                                                     <path fill-rule="evenodd" d="M8.257 3.099c.765-1.36 2.722-1.36 3.486 0l5.58 9.92c.75 1.334-.213 2.98-1.742 2.98H4.42c-1.53 0-2.493-1.646-1.743-2.98l5.58-9.92zM11 13a1 1 0 11-2 0 1 1 0 012 0zm-1-8a1 1 0 00-1 1v3a1 1 0 002 0V6a1 1 0 00-1-1z" clip-rule="evenodd" />
+                                                 </svg>
+                                             </div>
+                                             <div class="ml-3">
+                                                 <p class="font-medium">${warningMessage}</p>
+                                             </div>
+                                         </div>
+                                     </div>
+
+                                     <!-- Potentially Hallucinatory Response -->
+                                     <div class="bg-white border border-red-200 rounded-lg p-4">
+                                         <p class="font-medium text-red-700 mb-2">Original Hallucinatory Response:</p>
+                                         <p class="text-black-600">${response.message}</p>
+                                     </div>
+
+                                     <!-- Fallback Response -->
+                                     <div class="bg-white border border-green-200 rounded-lg p-4">
+                                         <p class="font-medium text-green-700 mb-2">Safe Response:</p>
+                                         <p class="text-black-600">I cannot provide a reliable answer to this question based on the available information! Please try again.</p>
+                                     </div>
+
+                                     <!-- Try Again Option -->
+                                     <div class="bg-blue-50 border border-blue-200 rounded-lg p-4">
+                                         <p class="font-medium text-blue-700 mb-3">Retry with different search parameters:</p>
+                                         <div class="flex items-center space-x-3">
+                                             <label class="text-sm text-blue-600">
+                                                 Top K:
+                                                 <input
+                                                     type="number"
+                                                     id="retry_top_k"
+                                                     value="5"
+                                                     min="1"
+                                                     max="100"
+                                                     class="ml-2 p-1 w-16 border border-blue-300 rounded text-sm"
+                                                 />
+                                             </label>
+                                             <button id="retry_button" class="bg-blue-500 text-white px-3 py-1 rounded text-sm hover:bg-blue-600">
+                                                 Try Again
+                                             </button>
+                                         </div>
+                                     </div>
+
+                                 </div>
+                             `);
+                     } else {
+                         // Display the main result in normal black color
+                         $('#result')
+                             .removeClass('hidden')
+                             .html(`
+                                 <p class="text-black font-bold">${response.message}</p>
+                             `);
+                     }
+
+                     // Always display context adherence message below the response (regardless of PII detection)
+                     if (response.metrics && response.metrics.context_adherence !== undefined) {
+                         const adherenceScore = response.metrics.context_adherence;
+
+                         // Only show adherence message if score is not exactly 1 (default value)
+                         if (adherenceScore !== 1.0 || hallucination_detection) {
+                             let adherenceMessage = '';
+                             let adherenceClass = '';
+
+                             if (adherenceScore >= 0.8) {
+                                 adherenceMessage = 'No hallucination detected - The answer is reliable';
+                                 adherenceClass = 'bg-green-100 border border-green-300 text-green-800';
+                             } else if (adherenceScore >= 0.3) {
+                                 adherenceMessage = 'Potential hallucination detected - The answer is unreliable';
+                                 adherenceClass = 'bg-orange-100 border border-orange-300 text-orange-800';
+                             } else {
+                                 adherenceMessage = 'High hallucination detected - The answer is unusable';
+                                 adherenceClass = 'bg-red-100 border border-red-300 text-red-800';
+                             }
+
+                             $('#adherenceMessage')
+                                 .removeClass('hidden bg-green-100 bg-yellow-100 bg-orange-100 bg-red-100 border-green-300 border-yellow-300 border-orange-300 border-red-300 text-green-800 text-yellow-800 text-orange-800 text-red-800')
+                                 .addClass(adherenceClass)
+                                 .html(`
+                                     <div class="flex items-center justify-between">
+                                         <span class="font-medium">${adherenceMessage}</span>
+                                         ${adherenceScore <= 0.8 ? `<span class="text-sm opacity-75">${((1-adherenceScore) * 100).toFixed(1)}% Hallucination detected</span>` : ''}
+                                     </div>
+                                 `);
+                         } else {
+                             $('#adherenceMessage').addClass('hidden');
+                         }
+                     } else {
+                         $('#adherenceMessage').addClass('hidden');
+                     }
+                 },
+                 error: function () {
+                     // Hide loading spinner
+                     $('#loadingSpinner').addClass('hidden');
+
+                     $('#adherenceMessage').addClass('hidden');
+                     $('#result')
+                         .removeClass('hidden')
+                         .html('<p class="text-red-600 font-bold">An error occurred while searching.</p>');
+                 }
+             });
+         });
+
+         // Handle retry button click
+         $(document).on('click', '#retry_button', function() {
+             const query = $('#query').val();
+             const retry_top_k = $('#retry_top_k').val();
+             const protection = $('#protection').is(':checked');
+             const hallucination_detection = $('#hallucination_detection').is(':checked');
+             const induce_hallucination = $('#induce_hallucination').is(':checked');
+
+             // Create URL parameters to reload with form pre-filled
+             const params = new URLSearchParams();
+             params.set('query', query);
+             params.set('top_k', retry_top_k);
+             params.set('retry', 'true'); // Flag to indicate this is a retry attempt
+             if (protection) params.set('protection', 'true');
+             if (hallucination_detection) params.set('hallucination_detection', 'true');
+             if (induce_hallucination) params.set('induce_hallucination', 'true');
+
+             // Reload the page with parameters
+             window.location.href = window.location.pathname + '?' + params.toString();
+         });
+
+         // Auto-submit if this is a retry attempt (only once) - do this after all handlers are attached
+         if (urlParams.has('retry') && urlParams.get('retry') === 'true') {
+             $('#searchForm').submit();
+         }
+     });
+     </script>
+
+ </body>
+ </html>
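The template's JavaScript buckets the `context_adherence` score with two thresholds. The same decision logic, sketched in Python for clarity — the label names are mine, but the 0.8 and 0.3 thresholds and the percentage formula come directly from the script above:

```python
def classify_adherence(score: float) -> str:
    """Bucket a context-adherence score the way the page's script does."""
    if score >= 0.8:
        return "reliable"    # green banner: no hallucination detected
    if score >= 0.3:
        return "potential"   # orange banner: potential hallucination
    return "unusable"        # red banner: high hallucination

def hallucination_pct(score: float) -> float:
    """The '% Hallucination detected' figure shown for low scores."""
    return round((1 - score) * 100, 1)
```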
backend/classes/chunker/__init__.py ADDED
File without changes
backend/classes/chunker/text_chunker.py ADDED
@@ -0,0 +1,49 @@
+ from typing import Optional, List
+
+ from langchain_text_splitters import RecursiveCharacterTextSplitter
+ from pydantic import BaseModel
+
+
+ class RecursiveCharacterTextChunkerConfig(BaseModel):
+     chunk_size: int = 500
+     chunk_overlap: int = 100
+
+
+ class RecursiveCharacterTextChunker:
+     def __init__(self, config: RecursiveCharacterTextChunkerConfig):
+         self.config = config
+
+     def chunk_text(self, text: str, separators: Optional[List[str]] = None) -> List[str]:
+         """
+         Chunks a single text string using Langchain's RecursiveCharacterTextSplitter.
+
+         This function is designed to be easily used with pandas DataFrame.apply().
+
+         Args:
+             text (str): The input text string to be chunked.
+             separators (Optional[List[str]]): A list of characters/strings to use as
+                 split points. Defaults to common markdown-friendly separators.
+
+         Returns:
+             List[str]: A list of chunked text strings. If the input text is empty
+                 or None, returns an empty list.
+         """
+         if not text:
+             return []
+
+         # Initialize the splitter inside the method so each call gets a fresh
+         # instance. Initializing it once outside would be more efficient, but
+         # this keeps the method self-contained for df.apply() usage.
+         text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=self.config.chunk_size,
+             chunk_overlap=self.config.chunk_overlap,
+             separators=separators or ["\n\n", "\n", " ", ""],  # Default separators
+             length_function=len,  # Use character length
+             is_separator_regex=False,
+         )
+
+         # split_text returns a list of strings
+         chunked_texts = text_splitter.split_text(text)
+
+         return chunked_texts
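`RecursiveCharacterTextSplitter` tries the separators in order before falling back to raw character windows. As a rough intuition for what `chunk_size` and `chunk_overlap` control, here is a naive stdlib windowing sketch — this is *not* the langchain algorithm (which prefers splitting at `"\n\n"`, `"\n"`, etc.), just the simplest model of overlapping fixed-size chunks:

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list:
    """Naive character windowing: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so neighbours share chunk_overlap chars."""
    if not text:
        return []
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```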
backend/classes/company_pii_scorer.py ADDED
@@ -0,0 +1,23 @@
+ from typing import Any, Union, Type, List
+ import re
+
+ # Define regex patterns for each keyword, allowing for flexible spacing and separators
+ keyword_patterns = [
+     r"Fairfield\sCDC",
+     r"Fairfield",
+ ]
+
+ def scorer_fn(*, index: Union[int, str], node_input: str, node_output: str, **kwargs: Any) -> Union[float, int, bool, str, None]:
+     # Return 1 if any of the regex patterns match in the output
+     for pattern in keyword_patterns:
+         if re.search(pattern, node_output, re.IGNORECASE):
+             return 1
+     # No patterns found
+     return 0
+
+ def score_type() -> Type:
+     return int
+
+ def scoreable_node_types_fn() -> List[str]:
+     return ["llm", "chat"]
backend/classes/data_preparer.py ADDED
@@ -0,0 +1,62 @@
+ from typing import List, Any
+ from pathlib import Path
+
+ from pydantic import BaseModel
+
+ from backend.classes.pdf_extractor import BasePDFExtractorConfig, PyMuPDFExtractor, PyMuPDFExtractorConfig
+ from backend.utils.utils import create_pdf_extractor
+
+
+ class DataPreparerConfig(BaseModel):
+     input_data_path: str
+     output_data_path: str
+     output_file: str
+     pdf_extractor: BasePDFExtractorConfig
+
+
+ class DataPreparer:
+     def __init__(self, config: DataPreparerConfig):
+         self.config = config
+         self.input_data_path = self.config.input_data_path
+         self.output_data_path = self.config.output_data_path
+         self.output_file = self.config.output_file
+
+         self.pdf_extractor_config = PyMuPDFExtractorConfig()
+         self.pdf_extractor = create_pdf_extractor(PyMuPDFExtractor, self.pdf_extractor_config)
+
+     def get_pdf_files(self) -> list:
+         # Recursively collect all PDF files under the input folder
+         pdf_files = []
+         for path in Path(self.input_data_path).rglob("*.pdf"):
+             pdf_files.append(path)
+
+         return pdf_files
+
+     def save_data_to_jsonl(self, data: List[Any], file_path: str):
+         try:
+             # Write one JSON object per line
+             with open(file_path, "w", encoding="utf-8") as f:
+                 for entry in data:
+                     f.write(entry.model_dump_json() + "\n")
+         except Exception as e:
+             print(f"Error saving data to file: {e}")
+
+     def prepare_data(self):
+         # Read PDF files from the input folder
+         pdf_files = self.get_pdf_files()
+
+         # Extract text from each PDF file
+         for pdf_file in pdf_files:
+             # Extract PDF data as markdown
+             pdf_data = self.pdf_extractor.extract(pdf_file)
+
+             # Construct the output file name from the PDF stem
+             file_name = pdf_file.stem.replace(" ", "_")
+             output_file = self.output_file.format(file_name=file_name)
+
+             # Save PDF data as JSONL
+             self.save_data_to_jsonl(pdf_data, str(Path(self.output_data_path) / output_file))
+
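`save_data_to_jsonl` writes one `model_dump_json()` line per entry; the JSONL round-trip itself reduces to stdlib `json` (plain dicts stand in for the Pydantic models here, and `io.StringIO` stands in for the output file):

```python
import io
import json

entries = [
    {"id": "doc_page_1", "markdown_text": "# Title", "metadata": {"page_number": 1}},
    {"id": "doc_page_2", "markdown_text": "Body", "metadata": {"page_number": 2}},
]

buf = io.StringIO()
for entry in entries:
    buf.write(json.dumps(entry) + "\n")  # one JSON object per line

# Reading back: parse each non-empty line independently
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(loaded[1]["metadata"]["page_number"])  # 2
```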
backend/classes/embedding_model.py ADDED
@@ -0,0 +1,17 @@
+ from typing import List
+
+ from pydantic import BaseModel
+ from sentence_transformers import SentenceTransformer
+
+
+ class EmbeddingModelConfig(BaseModel):
+     model_name: str
+     batch_size: int
+
+
+ class EmbeddingModel:
+     def __init__(self, config: EmbeddingModelConfig):
+         self.config = config
+         self._model = SentenceTransformer(self.config.model_name)
+
+     def encode(self, texts: List[str], convert_to_tensor: bool = False):
+         return self._model.encode(texts, convert_to_tensor=convert_to_tensor, batch_size=self.config.batch_size)
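`encode` hands batching off to SentenceTransformers via `batch_size`; the batching itself is just fixed-size slicing, which can be sketched in pure Python (`batched` is an illustrative helper, not part of the library):

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

batches = batched(list(range(70)), 32)
print([len(b) for b in batches])  # [32, 32, 6]
```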
backend/classes/galileo_platform.py ADDED
@@ -0,0 +1,117 @@
+ from galileo_observe import ObserveWorkflows
+ import galileo_protect as gp
+ from pydantic import BaseModel
+ import promptquality as pq
+ from dotenv import load_dotenv
+ import os
+ from typing import Optional
+
+ load_dotenv()
+
+
+ class GalileoPlatformConfig(BaseModel):
+     """Base configuration for the Galileo platform."""
+     evaluate_project_name: str
+     observe_project_name: str
+     protect_project_name: str
+     protect_stage_name: str
+
+
+ class GalileoPlatform:
+     """Implementation of Galileo features."""
+
+     def __init__(self, config: GalileoPlatformConfig):
+         self.config = config
+         pq.login(api_key=os.getenv("GALILEO_API_KEY"))
+         self.evaluate_run = self.create_evaluate_run()
+         self.observe_logger = ObserveWorkflows(project_name=config.observe_project_name)
+         self.protect_stage_id = self.get_protect_stage()
+
+     def create_evaluate_run(self):
+         """Create a Galileo Evaluate run."""
+         scorers = [
+             pq.Scorers.context_adherence_luna,
+             pq.Scorers.chunk_attribution_utilization_luna,
+             pq.Scorers.completeness_luna
+         ]
+         evaluate_run = pq.EvaluateRun(
+             project_name=self.config.evaluate_project_name,
+             scorers=scorers,
+         )
+         return evaluate_run
+
+     def get_protect_stage(self):
+         """Get or create a Galileo Protect stage."""
+         try:
+             protect_project = gp.get_project(
+                 project_name=self.config.protect_project_name
+             )
+         except Exception:
+             protect_project = gp.create_project(name=self.config.protect_project_name)
+
+         protect_project_id = protect_project.id
+
+         try:
+             protect_stage = gp.get_stage(
+                 project_id=protect_project_id, stage_name=self.config.protect_stage_name
+             )
+         except Exception:
+             protect_stage = gp.create_stage(
+                 project_id=protect_project_id,
+                 name=self.config.protect_stage_name,
+             )
+
+         return protect_stage.id
+
+     def run_protect(self, prompt: str, output: str, workflow: Optional[ObserveWorkflows] = None) -> dict:
+         """Run Galileo Protect on the input and output."""
+         response = gp.invoke(
+             payload=gp.Payload(input=prompt, output=output),
+             prioritized_rulesets=[
+                 gp.Ruleset(
+                     rules=[
+                         gp.Rule(
+                             metric=gp.RuleMetrics.context_adherence_luna,
+                             operator=gp.RuleOperator.lte,
+                             target_value=0.01,
+                         ),
+                     ],
+                     action=gp.OverrideAction(
+                         choices=["Sorry, the input is hallucinatory."]
+                     ),
+                 ),
+                 gp.Ruleset(
+                     rules=[
+                         gp.Rule(
+                             metric=gp.RuleMetrics.pii,
+                             operator=gp.RuleOperator.any,
+                             target_value=["email", "phone_number", "name"],
+                         )
+                     ],
+                     action=gp.OverrideAction(
+                         choices=["Sorry, the output contains PII."]
+                     ),
+                 ),
+                 # gp.Ruleset(
+                 #     rules=[
+                 #         gp.Rule(
+                 #             metric="deutsche_bank_company_pii_0",
+                 #             operator=gp.RuleOperator.gte,
+                 #             target_value=0.1,
+                 #         )
+                 #     ],
+                 #     action=gp.OverrideAction(
+                 #         choices=["Sorry, the output contains PII."]
+                 #     ),
+                 # )
+             ],
+             stage_id=self.protect_stage_id,
+         )
+
+         if workflow:
+             workflow.add_protect(
+                 payload=gp.Payload(input=prompt, output=output),
+                 response=response,
+             )
+
+         return dict(response)
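Stripped of the Galileo API, the two rulesets above express a simple gate: override when context adherence falls at or below 0.01, or when any of the listed PII types is detected. A pure-Python sketch of that decision logic (`protect_gate` and its signature are illustrative only; the thresholds and messages mirror the rulesets above):

```python
from typing import Optional

def protect_gate(context_adherence: float, pii_types: set[str]) -> Optional[str]:
    """Return an override message when a rule would fire, else None."""
    if context_adherence <= 0.01:  # context_adherence_luna lte 0.01
        return "Sorry, the input is hallucinatory."
    if pii_types & {"email", "phone_number", "name"}:  # pii operator=any
        return "Sorry, the output contains PII."
    return None

print(protect_gate(0.9, {"email"}))  # Sorry, the output contains PII.
```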
backend/classes/generative_model.py ADDED
@@ -0,0 +1,66 @@
+ from google import genai
+ from pydantic import BaseModel
+ from typing import List, Any, Optional
+ import itertools
+ from google.genai.types import GenerateContentConfig
+ from openai import OpenAI
+
+
+ class GenerativeModelConfig(BaseModel):
+     """Base configuration for generative models."""
+
+     model_name: str
+
+
+ class GenerativeModel:
+     """Abstract base class for generative models."""
+
+     def __init__(self, config: Any):
+         self.config = config
+
+
+ class GeminiModelConfig(BaseModel):
+     model_name: str
+     api_keys: List[str]
+     temperature: float = 0.0
+
+
+ class OpenAIModelConfig(BaseModel):
+     model_name: str
+     api_key: str
+     temperature: float = 0.0
+
+
+ class GeminiModel(GenerativeModel):
+     def __init__(self, config: GeminiModelConfig):
+         super().__init__(config)
+         # De-duplicate keys and rotate across clients to spread quota usage
+         self.config.api_keys = list(set(config.api_keys))
+         self.clients = [genai.Client(api_key=api_key) for api_key in self.config.api_keys]
+         self._client_cycle = itertools.cycle(self.clients)
+
+     def generate_response(self, prompt: str, temperature: Optional[float] = None) -> str:
+         """Generate a response by calling the selected model."""
+         if temperature is None:
+             temperature = self.config.temperature
+
+         client = next(self._client_cycle)
+         response = client.models.generate_content(
+             model=self.config.model_name,
+             contents=prompt,
+             config=GenerateContentConfig(temperature=temperature),
+         )
+         return response.text
+
+
+ class OpenAIModel(GenerativeModel):
+     def __init__(self, config: OpenAIModelConfig):
+         super().__init__(config)
+         self.client = OpenAI(api_key=config.api_key)
+
+     def generate_response(self, prompt: str, temperature: Optional[float] = None) -> str:
+         if temperature is None:
+             temperature = self.config.temperature
+
+         response = self.client.chat.completions.create(
+             model=self.config.model_name,
+             messages=[{"role": "user", "content": prompt}],
+             temperature=temperature,
+         )
+         return response.choices[0].message.content
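`GeminiModel` round-robins requests over its clients with `itertools.cycle`; the rotation pattern in isolation looks like this (the sketch de-duplicates with `dict.fromkeys` to keep a deterministic order, whereas the class uses `set()`, which does not guarantee order):

```python
import itertools

api_keys = ["key-a", "key-b", "key-a"]  # duplicates collapse to two keys
unique_keys = list(dict.fromkeys(api_keys))  # order-preserving de-duplication
key_cycle = itertools.cycle(unique_keys)

# Each call to next() advances the rotation, wrapping around indefinitely
picked = [next(key_cycle) for _ in range(5)]
print(picked)  # ['key-a', 'key-b', 'key-a', 'key-b', 'key-a']
```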
backend/classes/pdf_extractor.py ADDED
@@ -0,0 +1,89 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import fitz
2
+ import pymupdf4llm
3
+ from pydantic import BaseModel
4
+ from pathlib import Path
5
+ from typing import List, Optional
6
+ import logging
7
+
8
+ logger = logging.getLogger(__name__)
9
+
10
+
11
+ class PDFMetadata(BaseModel):
12
+ """Metadata for extracted PDF content."""
13
+ source: str
14
+ page_number: int
15
+ num_words: int
16
+ document_title: Optional[str] = None
17
+
18
+
19
+ class PDFEntry(BaseModel):
20
+ """Represents a single page of extracted PDF content."""
21
+ id: str
22
+ markdown_text: str
23
+ metadata: PDFMetadata
24
+
25
+
26
+ class BasePDFExtractorConfig(BaseModel):
27
+ """Base configuration for PDF extractors."""
28
+ extension: str = "pdf"
29
+
30
+
31
+ class PyMuPDFExtractorConfig(BasePDFExtractorConfig):
32
+ """Configuration for PyMuPDF-based extractor."""
33
+ name: str = "pymupdf"
34
+
35
+
36
+ class BasePDFExtractor:
37
+ """Base class for PDF extractors."""
38
+ def __init__(self, config: BasePDFExtractorConfig):
39
+ """Initialize the PDF extractor with configuration."""
40
+ self.config = config
41
+
42
+ def extract(self, pdf_path: Path) -> List[PDFEntry]:
43
+ """Extract text from a PDF file."""
44
+ raise NotImplementedError("This method should be implemented by subclasses")
45
+
46
+
47
+ class PyMuPDFExtractor(BasePDFExtractor):
48
+ """PDF extractor using PyMuPDF library."""
49
+ def __init__(self, config: PyMuPDFExtractorConfig):
50
+ super().__init__(config)
51
+
52
+ def extract(self, pdf_path: Path) -> List[PDFEntry]:
53
+ """Extract text from PDF using PyMuPDF."""
54
+ pdf_file_path = str(pdf_path)
55
+ try:
56
+ doc = fitz.open(pdf_file_path)
57
+
58
+ pdf_name = pdf_path.name
59
+ entries = []
60
+ logger.info(f"Extracting content from {pdf_file_path}")
61
+ total_pages = len(doc)
62
+ processed_count = 0
63
+ for page_num in range(len(doc)):
64
+ # page = doc[page_num]
65
+ logger.info(f"Processing page: {page_num + 1}/{total_pages}")
66
+ markdown_text = pymupdf4llm.to_markdown(doc, pages=[page_num])
67
+
68
+ metadata = PDFMetadata(
69
+ source=pdf_file_path,
70
+ page_number=page_num + 1,
71
+ num_words=len(markdown_text.split()),
72
+ document_title=pdf_name
73
+ )
74
+
75
+ entry = PDFEntry(
76
+ id=f"{pdf_name}_page_{page_num + 1}",
77
+ markdown_text=markdown_text,
78
+ metadata=metadata
79
+ )
80
+
81
+ entries.append(entry)
82
+ processed_count += 1
83
+
84
+ return entries
85
+ except fitz.FileNotFoundError:
86
+ print(f"Error: PDF file not found at '{pdf_file_path}'")
87
+ except Exception as e:
88
+ print(f"An error occurred: {e}")
89
+
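Each `PDFEntry` records `num_words` as a whitespace-delimited count of the page's markdown; that counting is a one-liner worth pinning down (`count_words` is an illustrative helper):

```python
def count_words(markdown_text: str) -> int:
    """Whitespace-delimited word count, as used for PDFMetadata.num_words."""
    return len(markdown_text.split())

print(count_words("# Heading\n\nSome body text here."))  # 6
```

Note that markdown punctuation such as `#` counts as a "word" under this scheme, so the figure is an approximation.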
backend/classes/rag_application.py ADDED
@@ -0,0 +1,216 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from pydantic import BaseModel
2
+ import json
3
+ import time
4
+ from promptquality import Models
5
+ from backend.classes.embedding_model import EmbeddingModel
6
+ from backend.classes.vector_database.milvus_vector_database import MilvusVectorDatabase
7
+ from backend.classes.galileo_platform import GalileoPlatform
8
+ from backend.classes.generative_model import GeminiModel, OpenAIModel
9
+ from typing import Union
10
+
11
+ def strike(text):
12
+ return ''.join([char + '\u0336' for char in text])
13
+
14
+ class RAGApplicationConfig(BaseModel):
15
+ embedding_model: EmbeddingModel
16
+ vector_db: MilvusVectorDatabase
17
+ generative_model: Union[GeminiModel, OpenAIModel]
18
+ galileo_platform: GalileoPlatform
19
+ top_k: int = 5
20
+ prompt_template: str = """
21
+ You are an expert-level Question-Answering system. Your sole function is to provide a direct answer to the 'QUESTION'.
22
+
23
+ You MUST synthesize information from the provided 'CONTEXT' and your internal knowledge base to formulate the answer. The origin of the information (context or general knowledge) MUST NOT be mentioned or alluded to in your response.
24
+
25
+ **Crucial Rules:**
26
+ 1. NEVER apologize, hedge, or express uncertainty (e.g., "based on the context," "it seems," "I cannot find").
27
+ 2. NEVER mention the context using phrases like "provided context", "context provided",
28
+ 3. Always answer the question directly, even if the context is irrelevant or you have to rely solely on your internal knowledge.
29
+ 4. If the context is relevant, use it to answer the question.
30
+ 5. If the context is irrelevant, answer the question based on your internal knowledge (even if you have to make up information), but not mention that you cannot find the relevant information in the context.
31
+ 6. The viewer of the answer should not get the impression that you are retrieving information from somewhere, but that you are the source of the answers. Never make any reference to the provided context in your response.
32
+ 7. The response MUST be a maximum of three sentences.
33
+ """
34
+ hallucinatory_prompt_template: str = """You are an assistant for question-answering tasks. Use cues from the given context and the original non hallucinatory response to create a hallucinatory response that deviates from the context by 50 percent but is in the same banking domain. Use three sentences and keep the answer concise.
35
+ Question: {question}
36
+ Original Response: {original_response}
37
+ Context: {context}
38
+
39
+ Answer:"""
40
+ redacted_prompt_template: str = """You are an assistant for PII identification. Based on the categories mentioned below that are true, output a response with the PII wrapped in <pii></pii> tags. Other than the tag addition for PII, keep the response the same as the original response.
41
+ The following are the categories that need to be redacted:
42
+ - Phone numbers
43
+ - Email addresses
44
+ - Names
45
+ - Company names
46
+ For every PII that needs to be redacted, wrap it in <pii></pii> tags.
47
+
48
+ Categories: {pii_flag}
49
+ Response: {response}
50
+ Modified Response: """
51
+
52
+
53
+ hallucinatory_chunks: list[str] = [
54
+ "Fairfield CDC is issuing this RFP to select a banking partner for its ambitious new program to fund the city's first dragon-powered public transportation system.",
55
+ "Merchant services must include psychic energy transfer gateways for multi-reality donation collection.",
56
+ "Technological capabilities must include temporal online banking for pre-cognitive transaction approvals.",
57
+ "Nonprofit expertise should cover managing pixie dust endowments and alchemical transmutations for the Fairfield CDC.",
58
+ "Deposit collateral for amounts over $250,000 can include moon rock deeds or dragon scale parchments.",
59
+ "Community advancement efforts may include an Elven Kingdom portal to enhance residents' magical aspects of life.",
60
+ ]
61
+
62
+ class Config:
63
+ arbitrary_types_allowed = True
64
+
65
+
66
+ class RAGApplication:
67
+ def __init__(self, config: RAGApplicationConfig):
68
+ self.config = config
69
+
70
+ def run(
71
+ self,
72
+ query: str,
73
+ prompt_template: str = None,
74
+ protect_enabled: bool = False,
75
+ top_k: int = 5,
76
+ hallucination_detection: bool = False,
77
+ induce_hallucination: bool = False,
78
+ ) -> str:
79
+ # Create a workflow to track this query
80
+ observe_workflow = self.config.galileo_platform.observe_logger.add_workflow(
81
+ name="RAG Workflow", input={"query": query}
82
+ )
83
+
84
+ evaluate_workflow = self.config.galileo_platform.evaluate_run.add_workflow(
85
+ name="RAG Workflow", input={"query": query}
86
+ )
87
+
88
+ context_adherence_score = 1
89
+ pii_flag = {
90
+ "phone_number": False,
91
+ "email": False,
92
+ "name": False,
93
+ "company": False,
94
+ }
95
+ redacted_result = ""
96
+ original_result = ""
97
+ try:
98
+ start_time = time.time()
99
+
100
+ # Get query embedding
101
+ query_embedding = self.config.embedding_model.encode([query])
102
+
103
+ # Get top-k similar texts
104
+ retrieved_documents = [
105
+ str(text["text"])
106
+ for text in self.config.vector_db.search_similar_texts(
107
+ query_embedding, limit=top_k
108
+ )
109
+ ]
110
+
111
+ # Log retriever step to Galileo Observe
112
+ observe_workflow.add_retriever(
113
+ name="Milvus Retrieval",
114
+ input=query,
115
+ documents=retrieved_documents,
116
+ duration_ns=int((time.time() - start_time) * 1e9),
117
+ )
118
+
119
+ evaluate_workflow.add_retriever(
120
+ name="Milvus Retrieval",
121
+ input=query,
122
+ documents=retrieved_documents,
123
+ # documents=[
124
+ # Document(content=doc, metadata={"length": len(doc)}) for doc in retrieved_documents],
125
+ duration_ns=int((time.time() - start_time) * 1e9),
126
+ )
127
+
128
+ start_time = time.time()
129
+
130
+ if not retrieved_documents:
131
+ return "There is nothing to return", redacted_result, context_adherence_score, pii_flag
132
+
133
+ # Create context by combining the retrieved documents
134
+ context = "\n\n".join(retrieved_documents)
135
+
136
+ # Set prompt template
137
+ prompt = (
138
+ self.config.prompt_template
139
+ if not prompt_template
140
+ else prompt_template
141
+ )
142
+
143
+ # Construct prompt
144
+ formatted_prompt = f"{prompt}\n\nQUESTION: {query}\n\nCONTEXT: {context}"
145
+
146
+ # Generate response
147
+ result = self.config.generative_model.generate_response(
148
+ formatted_prompt
149
+ )
150
+
151
+ if induce_hallucination:
152
+ original_result = result
153
+ hallucinatory_prompt = self.config.hallucinatory_prompt_template.format(question=query, context=context, original_response=result)
154
+ result = self.config.generative_model.generate_response(
155
+ hallucinatory_prompt,
156
+ temperature=1.0,
157
+ )
158
+
159
+ # Log LLM call to Galileo Observe
160
+ observe_workflow.add_llm(
161
+ name="Answer Generation",
162
+ input=retrieved_documents,
163
+ output=result,
164
+ model=self.config.generative_model.config.model_name,
165
+ duration_ns=int((time.time() - start_time) * 1e9),
166
+ )
167
+
168
+ evaluate_workflow.add_llm(
169
+ # input=Message(content=prompt, role=MessageRole.user),
170
+ # output=Message(content=result, role=MessageRole.assistant),
171
+ name="Answer Generation",
172
+ input=prompt,
173
+ output=result,
174
+ model=Models.gpt_4o,
175
+ duration_ns=int((time.time() - start_time) * 1e9),
176
+ )
177
+
178
+ start_time = time.time()
179
+
180
+ protect_response = self.config.galileo_platform.run_protect(
181
+ context, result, observe_workflow
182
+ )
183
+
184
+ if protect_enabled and protect_response["text"] != result:
185
+ pii_flag["phone_number"] = "phone_number" in protect_response["metric_results"]["pii"]["value"]
186
+ pii_flag["email"] = "email" in protect_response["metric_results"]["pii"]["value"]
187
+ pii_flag["name"] = "name" in protect_response["metric_results"]["pii"]["value"]
188
+ # pii_flag["company"] = protect_response["metric_results"]["deutsche_bank_company_pii_0"]["value"]>0.1
189
+ redacted_result = self.get_redacted_result(result, pii_flag)
190
+ result = redacted_result.replace("<pii>", "<tag>").replace("</pii>", "</tag>")
191
+
192
+ if hallucination_detection:
193
+ context_adherence_score = protect_response["metric_results"]["context_adherence_luna"]["value"]
194
+ # print(context_adherence_score)
195
+
196
+ # Conclude the workflow with the final result and set output
197
+ observe_workflow.conclude(output=result)
198
+ evaluate_workflow.output = result
199
+ self.config.galileo_platform.observe_logger.upload_workflows()
200
+
201
+ # Start evaluation in separate thread
202
+ self.config.galileo_platform.evaluate_run.finish(wait=True, silent=True)
203
+ # print(self.config.galileo_platform.evaluate_run)
204
+
205
+ return result, redacted_result, original_result, context_adherence_score, pii_flag
206
+
207
+ except Exception as e:
208
+ # Log errors to Galileo Observe
209
+ observe_workflow.conclude(output={"error": str(e)})
210
+ self.config.galileo_platform.observe_logger.upload_workflows()
211
+ raise e
212
+
213
+ def get_redacted_result(self, result, pii_flag):
214
+ prompt = self.config.redacted_prompt_template.format(pii_flag=pii_flag, response=result)
215
+ redacted_result = self.config.generative_model.generate_response(prompt)
216
+ return redacted_result
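When Protect flags PII, the pipeline asks the LLM to wrap sensitive spans in `<pii></pii>` tags and then swaps those for `<tag></tag>` markers in the user-facing result. The string transform itself is plain `str.replace` (`swap_pii_tags` is an illustrative name for the inline expression above):

```python
def swap_pii_tags(redacted: str) -> str:
    """Convert model-emitted <pii> markers into display-facing <tag> markers."""
    return redacted.replace("<pii>", "<tag>").replace("</pii>", "</tag>")

redacted = "Contact <pii>jane@example.com</pii> or <pii>555-0100</pii>."
print(swap_pii_tags(redacted))
# Contact <tag>jane@example.com</tag> or <tag>555-0100</tag>.
```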
backend/classes/vector_database/__init__.py ADDED
File without changes
backend/classes/vector_database/base_vector_database.py ADDED
@@ -0,0 +1,31 @@
+ from pydantic import BaseModel
+ import pandas as pd
+
+
+ class VectorDatabaseConfig(BaseModel):
+     """Base configuration for vector databases."""
+     collection_name: str
+
+     class Config:
+         arbitrary_types_allowed = True
+
+
+ class VectorDatabase:
+     """Abstract base class for vector databases."""
+     def __init__(self, config: VectorDatabaseConfig):
+         self.config = config
+
+     def add_texts(self, df: pd.DataFrame, embeddings: list):
+         """Add texts and their embeddings to the collection."""
+         raise NotImplementedError
+
+     def search_similar_texts(self, query_embedding: list, limit: int = 5):
+         """Search for similar texts based on embeddings."""
+         raise NotImplementedError
+
+     def drop_collection(self):
+         """Drop the collection."""
+         raise NotImplementedError
+
+     def count_entities(self) -> int:
+         """Get the number of entities in the collection."""
+         raise NotImplementedError
+
+     def count_entities(self) -> int: pass
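The base class relies on `raise NotImplementedError` stubs rather than `abc.ABC`, so a subclass is only obligated to override the methods it actually calls. A minimal sketch of that pattern with a toy subclass (`InMemoryDB` is purely illustrative):

```python
class VectorStore:
    """Minimal abstract-base sketch mirroring the diff's NotImplementedError pattern."""
    def search_similar_texts(self, query_embedding: list, limit: int = 5):
        raise NotImplementedError

class InMemoryDB(VectorStore):
    def __init__(self, rows: list):
        self.rows = rows

    def search_similar_texts(self, query_embedding: list, limit: int = 5):
        # Toy implementation: ignore the embedding, return the first `limit` rows
        return self.rows[:limit]

db = InMemoryDB([{"text": "x", "distance": 0.1}])
print(db.search_similar_texts([0.0], limit=5))  # [{'text': 'x', 'distance': 0.1}]
```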
backend/classes/vector_database/milvus_vector_database.py ADDED
@@ -0,0 +1,210 @@
+ import os
+ import shutil
+ from typing import List
+
+ import pandas as pd
+ from pymilvus import MilvusClient, FieldSchema, CollectionSchema, DataType
+ import logging
+
+ from backend.classes.vector_database.base_vector_database import VectorDatabaseConfig, VectorDatabase
+
+ logger = logging.getLogger(__name__)
+
+
+ class MilvusVectorDatabaseConfig(VectorDatabaseConfig):
+     """Configuration for the Milvus vector database."""
+     db_path: str
+     collection_name: str
+     vector_dimensions: int
+     drop_if_exists: bool = True
+
+     class Config:
+         arbitrary_types_allowed = True
+
+
+ class MilvusVectorDatabase(VectorDatabase):
+     """Implementation of a vector database using Milvus."""
+     def __init__(self, config: MilvusVectorDatabaseConfig):
+         super().__init__(config)
+
+         # Connect to the database
+         self.client = self.connect()
+
+         self.create_collection(config.drop_if_exists)
+
+     def connect(self):
+         logger.info(f"Connecting to Milvus at {self.config.db_path}...")
+         client = MilvusClient(self.config.db_path)
+         logger.info("Connected to Milvus.")
+         return client
+
+     def _define_schema(self) -> List[FieldSchema]:
+         """
+         Defines the Milvus collection schema for hybrid search.
+
+         - `id`: Primary key for unique chunk identification.
+         - `text`: Stores the chunked text, suitable for keyword filtering using `LIKE` or equality.
+         - `embedding`: Stores the dense vector embedding for similarity search.
+         - `metadata`: A JSON field to store additional, flexible metadata for filtering.
+         """
+         fields = [
+             FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
+             FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=1024),
+             FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=self.config.vector_dimensions),
+             FieldSchema(name="metadata", dtype=DataType.JSON, description="Flexible JSON metadata for the document")
+         ]
+         return fields
+
+     def create_collection(self, drop_if_exists: bool = True):
+         """
+         Creates the Milvus collection with the defined schema and necessary indexes.
+
+         Args:
+             drop_if_exists (bool): If True, drops the collection if it already exists
+                 before creating a new one. Defaults to True.
+         """
+         if drop_if_exists and self.client.has_collection(collection_name=self.config.collection_name):
+             logger.info(f"Dropping existing collection '{self.config.collection_name}'...")
+             self.client.drop_collection(collection_name=self.config.collection_name)
+
+         # Create a vector index on 'embedding' for similarity search
+         logger.info("Creating vector index on 'embedding'...")
+         index_params = self.client.prepare_index_params()
+         index_params.add_index(
+             field_name="embedding",
+             metric_type="COSINE",
+             index_type="IVF_FLAT",  # IVF_FLAT is a good general-purpose vector index
+             params={"nlist": 128}
+         )
+
+         fields = self._define_schema()
+         milvus_schema = CollectionSchema(
+             fields=fields,
+             description="Hybrid search collection for finance documents"
+         )
+
+         logger.info(f"Creating collection '{self.config.collection_name}'...")
+         self.client.create_collection(
+             collection_name=self.config.collection_name,
+             schema=milvus_schema,
+             index_params=index_params,
+             dimension=self.config.vector_dimensions
+         )
+
+     def add_texts(self, df: pd.DataFrame, embeddings: list):
+         """
+         Add texts and their embeddings to the collection.
+
+         Args:
+             df: DataFrame containing the text data columns
+             embeddings: List of embeddings corresponding to each text
+         """
+         # Prepare data: one dict per row, with the matching embedding attached
+         data = []
+         for index, row in df.iterrows():
+             row["embedding"] = embeddings[index]
+             data.append(row.to_dict())
+
+         # Insert data
+         self.client.insert(collection_name=self.config.collection_name, data=data)
+
+     def hybrid_search(self, query_embedding: list, query_text: str, limit: int = 5,
+                       text_weight: float = 0.4, embedding_weight: float = 0.6) -> list:
+         """
+         Perform hybrid search combining text-based and vector similarity search.
+
+         Args:
+             query_embedding: Embedding vector for similarity search
+             query_text: Text query for text-based search
+             limit: Number of results to return
+             text_weight: Weight for the text-based search score
+             embedding_weight: Weight for the embedding similarity score
+
+         Returns:
+             List of search results with combined scores
+         """
+         output_fields = ["text", "metadata"]
+
+         # Vector similarity search
+         search_results = self.client.search(
+             collection_name=self.config.collection_name,
+             data=[query_embedding],
+             anns_field="embedding",
+             param={"metric_type": "COSINE", "params": {"nprobe": 10}},
+             limit=limit * 2,  # Get more candidates to combine with text search
+             output_fields=output_fields
+         )
+
+         # Process embedding results
+         formatted_results = []
+         if search_results and search_results[0]:
+             for hit in search_results[0]:
+                 result = {
+                     "id": hit['id'],
+                     "distance": hit['distance'],
+                     "text": hit.get('text', 'N/A'),
+                     "metadata": hit.get('metadata', {})
+                 }
+                 # Add any other requested output fields
+                 for field in output_fields:
+                     if field not in result:  # Avoid overwriting fields already handled
+                         result[field] = hit.get(field)
+                 formatted_results.append(result)
+         return formatted_results
+
+     def search_similar_texts(self, query_embedding: list, limit: int = 5):
+         """
+         Search for similar texts based on embeddings.
+
+         Args:
+             query_embedding: Embedding vector to search for
+             limit: Number of results to return
+
+         Returns:
+             List of similar texts and their distances
+         """
+         output_fields = ["text"]
+         search_results = self.client.search(
+             collection_name=self.config.collection_name,
+             data=query_embedding,
+             anns_field="embedding",
+             limit=limit,
+             output_fields=output_fields
+         )
+
+         return [{
+             "text": result.get("text"),
+             "distance": result["distance"]
+         } for result in search_results[0]]
+
+     def drop_collection(self):
+         """Drop the collection and remove local Milvus Lite data."""
+         if os.path.exists(self.config.db_path):
+             logger.info(f"Removing local Milvus Lite data at: {self.config.db_path}...")
+             if os.path.isdir(self.config.db_path):
+                 shutil.rmtree(self.config.db_path)
+             else:
+                 os.remove(self.config.db_path)  # Milvus Lite db_path is typically a single .db file
+             logger.info("Local data removed.")
+         else:
+             logger.info(f"Local data path '{self.config.db_path}' not found, nothing to clean.")
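The collection's index uses the COSINE metric, so ranking is by cosine similarity, highest first. A pure-Python sketch of that ranking over toy vectors (`cosine` and `top_k` are illustrative helpers, not pymilvus APIs):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query, highest first."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)[:k]

docs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], docs, k=2))  # ['a', 'b']
```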
backend/conf/config.yaml ADDED
@@ -0,0 +1,122 @@
+ env_variables:
+   - APP_ENV
+   - GOOGLE_GEMINI_API_KEY
+   - GOOGLE_GEMINI_BACKUP_API_KEY
+   - GALILEO_API_KEY
+   - GALILEO_CONSOLE_URL
+   - GALILEO_API_ACCESS_TOKEN
+   - OPENAI_API_KEY
+
+ local_ks:
+   data:
+     input_data_path: "/Users/kannan/Documents/2025/galileo-poc/rfp-data/input"
+     output_data_path: "/Users/kannan/Documents/2025/galileo-poc/rfp-data/processed"
+     output_file: "{file_name}.jsonl"
+
+   chunker:
+     chunk_size: 500
+     chunk_overlap: 100
+
+   vector_database:
+     db_path: "/Users/kannan/Documents/2025/galileo-poc/rfp-data/processed/vector_db/rfp_data_milvus.db"
+     collection_name: "rfp_data"
+     dimensions: 768
+
+   embedding_model:
+     model_name: "philschmid/bge-base-financial-matryoshka"
+     batch_size: 32
+
+   pdf_extractor:
+     extension: ".pdf"
+
+   generative_model:
+     model_name: "gemini-2.0-flash-lite"
+     api_key: "{GOOGLE_GEMINI_API_KEY}"
+     backup_api_key: "{GOOGLE_GEMINI_BACKUP_API_KEY}"
+
+   galileo_platform:
+     evaluate_project_name: "deutsche-bank-evaluate"
+     observe_project_name: "deutsche-bank-test"
+     observe_project_id: "185841b9-fe41-4fe3-ad75-c5217f7d554d"
+     protect_project_name: "protect-test-project"
+     protect_stage_name: "protect-test-stage"
+
+ local_ne:
+   data:
+     input_data_path: "/Users/nikhile/Work/repos/galileo-poc/fin-data/input"
+     output_data_path: "/Users/nikhile/Work/repos/galileo-poc/fin-data/processed"
+     output_file: "{file_name}.jsonl"
+
+   chunker:
+     chunk_size: 500
+     chunk_overlap: 100
+
+   vector_database:
+     db_path: "/Users/nikhile/Work/repos/galileo-poc/fin-data/processed/vector_db/fin_data_milvus.db"
+     collection_name: "fin_data"
+     dimensions: 768
+
+   embedding_model:
+     model_name: "philschmid/bge-base-financial-matryoshka"
+     batch_size: 32
+
+   pdf_extractor:
+     extension: ".pdf"
+
+   gemini_generative_model:
+     model_name: "gemini-2.0-flash"
+     api_key: "{GOOGLE_GEMINI_API_KEY}"
+     backup_api_key: "{GOOGLE_GEMINI_BACKUP_API_KEY}"
+     temperature: 1.0
+
+   openai_generative_model:
+     model_name: "gpt-4.1-nano-2025-04-14"
+     api_key: "{OPENAI_API_KEY}"
+     temperature: 0.0
+
+   galileo_platform:
+     evaluate_project_name: "lseg-qa-evaluate"
+     observe_project_name: "lseq-qa-observe"
+     observe_project_id: "185841b9-fe41-4fe3-ad75-c5217f7d554d"
+     protect_project_name: "protect-test-project"
+     protect_stage_name: "protect-test-stage"
+
+ local_docker:
+   data:
+     input_data_path: "/app/fin-data/input"
+     output_data_path: "/app/fin-data/processed"
+     output_file: "{file_name}.jsonl"
+
+   chunker:
+     chunk_size: 500
+     chunk_overlap: 100
+
+   vector_database:
+     db_path: "/app/fin-data/processed/vector_db/fin_data_milvus.db"
+     collection_name: "fin_data"
+     dimensions: 768
+
+   embedding_model:
+     model_name: "philschmid/bge-base-financial-matryoshka"
+     batch_size: 32
+
+   pdf_extractor:
+     extension: ".pdf"
+
+   gemini_generative_model:
107
+ model_name: "gemini-2.0-flash"
108
+ api_key: "{GOOGLE_GEMINI_API_KEY}"
109
+ backup_api_key: "{GOOGLE_GEMINI_BACKUP_API_KEY}"
110
+ temperature: 1.0
111
+
112
+ openai_generative_model:
113
+ model_name: "gpt-4.1-nano-2025-04-14"
114
+ api_key: "{OPENAI_API_KEY}"
115
+ temperature: 0.0
116
+
117
+ galileo_platform:
118
+ evaluate_project_name: "lseg-qa-evaluate"
119
+ observe_project_name: "lseq-qa-observe"
120
+ observe_project_id: "185841b9-fe41-4fe3-ad75-c5217f7d554d"
121
+ protect_project_name: "protect-test-project"
122
+ protect_stage_name: "protect-test-stage"
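As a sketch of how these environment-keyed sections are consumed (mirroring the `APP_ENV` selection in `backend/main/*.py`; the inline dict below stands in for the parsed YAML and its values are illustrative):

```python
import os

# Stand-in for yaml.safe_load() of backend/conf/config.yaml
config = {
    "env_variables": ["APP_ENV"],
    "local_ne": {"vector_database": {"collection_name": "fin_data", "dimensions": 768}},
    "local_docker": {"vector_database": {"collection_name": "fin_data", "dimensions": 768}},
}

# APP_ENV picks which top-level section becomes the active app config
os.environ["APP_ENV"] = "local_docker"
app_config = config[os.environ["APP_ENV"]]
```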
backend/main/chunk_and_save_to_vector_db.py ADDED
@@ -0,0 +1,107 @@
1
+ from pathlib import Path
2
+ import pandas as pd
3
+
4
+ from backend.classes.chunker.text_chunker import RecursiveCharacterTextChunkerConfig, RecursiveCharacterTextChunker
5
+ from backend.classes.embedding_model import EmbeddingModelConfig, EmbeddingModel
6
+ from pydantic import BaseModel
7
+ import json
8
+ import dotenv
9
+
10
+ from backend.classes.vector_database.milvus_vector_database import MilvusVectorDatabaseConfig, MilvusVectorDatabase
11
+ from backend.utils.utils import get_embedding_model, read_config, set_env_variables, create_vector_database, \
12
+ create_text_chunker, initialize_logger
13
+
14
+ dotenv.load_dotenv()
+
+ # Module-level logger so functions work when this module is imported
+ logger = initialize_logger()
15
+
16
+ def get_files(folder_path: str, extension: str = "jsonl") -> list:
17
+ # Recursively collect all files with the given extension using pathlib.Path
18
+ files = []
19
+ for path in Path(folder_path).rglob(f"*.{extension}"):
20
+ files.append(path)
21
+
22
+ return files
23
+
24
+
25
+ class ChunkerVectorDbConfig(BaseModel):
26
+ folder_path: str
27
+ chunker: RecursiveCharacterTextChunker
28
+ vector_database: MilvusVectorDatabase
29
+ embedding_model: EmbeddingModel
30
+
31
+ class Config:
32
+ arbitrary_types_allowed = True
33
+
34
+
35
+ def get_file_data(file_path: str) -> pd.DataFrame:
36
+ try:
37
+ return pd.read_json(file_path, lines=True)
38
+ except Exception as e:
39
+ logger.exception(e)
40
+ raise e
41
+
42
+ def chunk_and_save_to_vector_db(config: ChunkerVectorDbConfig):
43
+ # Read files from folder
44
+ file_paths = get_files(config.folder_path)
45
+ logger.info(f"There are {len(file_paths)} to process")
46
+
47
+ # Chunk and embed the extracted text from each processed file
48
+ for file_path in file_paths:
49
+ # Load the extracted markdown records from the jsonl file
50
+ logger.info(f"Processing {file_path}")
51
+ data_df = get_file_data(str(file_path))
52
+
53
+ # There are a few rows that are empty due to images not being extracted
54
+ # Remove them
55
+ data_df = data_df[data_df["markdown_text"] != ""]
56
+
57
+ data_df["text_chunks"] = data_df["markdown_text"].apply(config.chunker.chunk_text)
58
+ data_df = data_df.explode("text_chunks").rename(columns={"text_chunks": "text"})
59
+ data_df["chunk_id"] = data_df.groupby("id").cumcount() + 1
60
+ data_df["row_chunk_id"] = data_df["id"] + data_df["chunk_id"].astype(str)
61
+
62
+ data_df["metadata_json"] = data_df["metadata"].apply(lambda d: json.dumps(d))
63
+ data_df = data_df.drop(columns=["metadata", "id", "row_chunk_id", "markdown_text", "chunk_id"]).rename(columns={"metadata_json": "metadata"})
64
+
65
+ embeddings = config.embedding_model.encode(data_df.text.tolist())
66
+ config.vector_database.add_texts(data_df, embeddings)
67
+
68
+
69
+ def run(config: dict):
70
+ # Create embedding model object
71
+ embedding_model_config = EmbeddingModelConfig(model_name=config["embedding_model"]["model_name"],
72
+ batch_size=config["embedding_model"]["batch_size"])
73
+ embedding_model = get_embedding_model(EmbeddingModel, embedding_model_config)
74
+
75
+ # Create vector db model object
76
+ vector_db_config = MilvusVectorDatabaseConfig(db_path=config["vector_database"]["db_path"],
77
+ collection_name=config["vector_database"]["collection_name"],
78
+ vector_dimensions=config["vector_database"]["dimensions"])
79
+ vector_db = create_vector_database(MilvusVectorDatabase, vector_db_config)
80
+
81
+ text_chunker_config = RecursiveCharacterTextChunkerConfig(chunk_size=config["chunker"]["chunk_size"],
82
+ chunk_overlap=config["chunker"]["chunk_overlap"])
83
+ text_chunker = create_text_chunker(RecursiveCharacterTextChunker, text_chunker_config)
84
+
85
+ chunker_vector_db_config = ChunkerVectorDbConfig(folder_path=config["data"]["output_data_path"],
86
+ chunker=text_chunker,
87
+ vector_database=vector_db,
88
+ embedding_model=embedding_model)
89
+
90
+ chunk_and_save_to_vector_db(chunker_vector_db_config)
91
+
92
+
93
+ if __name__ == "__main__":
94
+ logger = initialize_logger()
95
+
96
+ # get current file path using Path
97
+ config = read_config(str(Path(Path(__file__).parent, "../conf/config.yaml")))
98
+
99
+ # check if environment variables are set
100
+ env_variables = set_env_variables(config["env_variables"])
101
+
102
+ app_config = config[env_variables["APP_ENV"]]
103
+ app_config["env_vars"] = env_variables
104
+
105
+ run(app_config)
106
+
107
+
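The explode/cumcount pattern used above to assign per-document chunk ids can be sketched on its own (toy data, not from the repo):

```python
import pandas as pd

# Each row holds a document id and its list of text chunks
df = pd.DataFrame({"id": ["doc1", "doc2"], "text_chunks": [["a", "b"], ["c"]]})

# One row per chunk, then a 1-based chunk counter within each document
df = df.explode("text_chunks").rename(columns={"text_chunks": "text"})
df["chunk_id"] = df.groupby("id").cumcount() + 1
df["row_chunk_id"] = df["id"] + df["chunk_id"].astype(str)
```

`row_chunk_id` here becomes `doc11`, `doc12`, `doc21`, matching the id-plus-counter scheme in `chunk_and_save_to_vector_db`.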
backend/main/prepare_data.py ADDED
@@ -0,0 +1,41 @@
1
+ from pathlib import Path
2
+
3
+ from backend.classes.data_preparer import DataPreparerConfig, DataPreparer
4
+ from backend.utils.utils import initialize_logger, read_config, set_env_variables
5
+ from dotenv import load_dotenv
6
+
7
+ load_dotenv()
8
+
9
+
10
+ def run(config: dict):
11
+ """
12
+ Prepare the data for the RAG application: extract, process, and save documents.
13
+ :param config: Configuration dictionary
14
+ """
15
+ logger.info("Prepare data")
16
+ data_preparer_config = DataPreparerConfig(input_data_path=config["data"]["input_data_path"],
17
+ output_data_path=config["data"]["output_data_path"],
18
+ output_file=config["data"]["output_file"],
19
+ pdf_extractor=config["pdf_extractor"],
20
+ vector_database=config["vector_database"],
21
+ embedding_model=config["embedding_model"])
22
+ data_preparer = DataPreparer(data_preparer_config)
23
+ data_preparer.prepare_data()
24
+ logger.info("Data prepared")
25
+
26
+
27
+
28
+ if __name__ == '__main__':
29
+ logger = initialize_logger()
30
+
31
+ # get current file path using Path
32
+ config = read_config(str(Path(Path(__file__).parent, "../conf/config.yaml")))
33
+
34
+ # check if environment variables are set
35
+ env_variables = set_env_variables(config["env_variables"])
36
+
37
+ app_config = config[env_variables["APP_ENV"]]
38
+ app_config["env_vars"] = env_variables
39
+
40
+ run(app_config)
41
+
backend/main/run_rag.py ADDED
@@ -0,0 +1,36 @@
1
+ # from classes.rag_application import RAGApplication
2
+ # from classes.pdf_extractor import PDFExtractor
3
+ # from classes.generative_model import GenerativeModel, GenerativeModelConfig
4
+ # from utils import initialize_logger, read_config
5
+ # from pathlib import Path
6
+ # import logging
7
+ #
8
+ # logging.basicConfig(level=logging.INFO)
9
+ # logger = logging.getLogger(__name__)
10
+ #
11
+ # logger = initialize_logger()
12
+ #
13
+ # # Initialize required components from config
14
+ # config = read_config("conf/config.yaml")
15
+ # pdf_extractor = PDFExtractor(config["pdf_extractor"])
16
+ # vector_db = VectorDatabase(config["vector_database"])
17
+ #
18
+ # # Get PDF path from config
19
+ # pdf_path = config["data"]["input_data_path"]
20
+ #
21
+ # def run(config: dict):
22
+ # """
23
+ # Run the RAG application.
24
+ # :param config: Configuration dictionary
25
+ # """
26
+ # logger.info("Create generative model")
27
+ # generative_model_config = GenerativeModelConfig(model_name=config['generative_model']['model_name'])
28
+ # generative_model = GenerativeModel(generative_model_config)
29
+ #
30
+ # logger.info("Create RAG application")
31
+ # rag_app = RAGApplication(vector_db, pdf_extractor, generative_model)
32
+ # rag_app.run(pdf_path, 'Give me a summary of the document')
33
+ # logger.info("RAG application completed")
34
+ #
35
+ # if __name__ == '__main__':
36
+ # run(config)
backend/models/__init__.py ADDED
@@ -0,0 +1 @@
1
+
backend/models/rag_models.py ADDED
@@ -0,0 +1,10 @@
1
+ from pydantic import BaseModel
2
+
3
+
4
+ class RAGRequest(BaseModel):
5
+ query: str
6
+ protect_enabled: bool = False
7
+
8
+
9
+ class RAGResponse(BaseModel):
10
+ response: str
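A client would post a JSON body matching `RAGRequest` and get back a `RAGResponse`-shaped object; a minimal stdlib sketch of the payload shapes (the endpoint itself is not shown in this diff, so the shapes below are inferred from the models):

```python
import json

# Request body matching RAGRequest (protect_enabled defaults to False)
payload = {"query": "Summarize the latest filing", "protect_enabled": True}
body = json.dumps(payload)

# A RAGResponse serializes to a single "response" field
response_body = json.loads('{"response": "..."}')
```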
backend/requirements.txt ADDED
@@ -0,0 +1,15 @@
1
+ # Requirements for RAG Application
2
+ python-dotenv
3
+ PyMuPDF
4
+ chromadb
5
+ google-genai
6
+ pandas
7
+ fastapi
8
+ uvicorn
9
+ requests
10
+ streamlit
11
+ pymilvus
12
+ docling
13
+ sentence-transformers
14
+ torch
15
+ langchain-text-splitters
backend/utils/utils.py ADDED
@@ -0,0 +1,49 @@
1
+ import os
2
+ import yaml
3
+ import logging
4
+ from typing import List, Any
5
+ from dotenv import load_dotenv
6
+ load_dotenv()
7
+
8
+ def set_env_variables(env_vars: List):
9
+ env_vars_dict = {}
10
+ for env_var in env_vars:
11
+ if env_var not in os.environ or not os.environ[env_var]:
12
+ raise ValueError(f"ERROR: Please set {env_var}.")
13
+ env_vars_dict[env_var] = os.environ[env_var]
14
+ return env_vars_dict
15
+
16
+
17
+ def initialize_logger():
18
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
19
+ logger = logging.getLogger(__name__)
20
+ return logger
21
+
22
+
23
+ def read_config(config_path: str) -> dict:
24
+ with open(config_path, 'r') as config_file:
25
+ config = yaml.safe_load(config_file)
26
+ return config
27
+
28
+
29
+ def create_vector_database(db: Any, config: Any) -> Any:
30
+ vector_db = db(config)
31
+ return vector_db
32
+
33
+
34
+ def create_text_chunker(text_chunker: Any, text_chunker_config: Any) -> Any:
35
+ return text_chunker(text_chunker_config)
36
+
37
+
38
+ def create_pdf_extractor(pdf_extractor: Any, pdf_extractor_config: Any) -> Any:
39
+ return pdf_extractor(pdf_extractor_config)
40
+
41
+
42
+ def get_embedding_model(model: Any, config: Any) -> Any:
43
+ embedding_model = model(config)
44
+ return embedding_model
45
+
46
+ def get_generative_model(model: Any, config: Any) -> Any:
47
+ generative_model = model(config)
48
+ return generative_model
49
+
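The behaviour of `set_env_variables` (it reads and validates rather than sets, despite the name) can be exercised directly; `DEMO_VAR` and `MISSING_VAR_XYZ` below are illustrative names:

```python
import os

def set_env_variables(env_vars):
    # Mirrors backend/utils/utils.py: validate presence, return name -> value
    env_vars_dict = {}
    for env_var in env_vars:
        if env_var not in os.environ or not os.environ[env_var]:
            raise ValueError(f"ERROR: Please set {env_var}.")
        env_vars_dict[env_var] = os.environ[env_var]
    return env_vars_dict

os.environ["DEMO_VAR"] = "value"
result = set_env_variables(["DEMO_VAR"])
```

An unset (or empty) variable raises `ValueError`, which is what makes missing configuration fail fast at startup.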
documentation/backend.puml ADDED
@@ -0,0 +1,43 @@
1
+ @startuml backend
2
+ class VectorDatabase {
3
+ - chroma_client
4
+ - collection
5
+ + __init__(config: VectorDatabaseConfig)
6
+ + store_text_as_vector(df: pd.DataFrame)
7
+ }
8
+
9
+ class PDFExtractor {
10
+ + __init__(config: PDFExtractorConfig)
11
+ + extract_text_from_pdf(pdf_path: str)
12
+ }
13
+
14
+ class GenerativeModel {
15
+ - client
16
+ + __init__(config: GenerativeModelConfig)
17
+ + generate_response(query: str)
18
+ }
19
+
20
+ class RAGApplicationConfig {
21
+ + vector_db: VectorDatabase
22
+ + pdf_extractor: PDFExtractor
23
+ + generative_model: GenerativeModel
24
+ }
25
+
26
+ class DataPreparerConfig {
27
+ + input_data_path: str
28
+ + output_data_path: str
29
+ + output_file: str
30
+ + pdf_extractor: PDFExtractorConfig
31
+ + vector_database: VectorDatabaseConfig
32
+ + embedding_model: EmbeddingModelConfig
33
+ }
34
+
35
+ class RAGApplication {
36
+ - vector_db
37
+ - pdf_extractor
38
+ - generative_model
39
+ + __init__(config: RAGApplicationConfig)
40
+ + run(pdf_path: str, query: str)
41
+ }
42
+
43
+ @enduml
documentation/workflow.excalidraw ADDED
@@ -0,0 +1,1563 @@
1
+ {
2
+ "type": "excalidraw",
3
+ "version": 2,
4
+ "source": "https://marketplace.visualstudio.com/items?itemName=pomdtr.excalidraw-editor",
5
+ "elements": [
6
+ {
7
+ "id": "ahbQwKAsijgSdg9jT0hg3",
8
+ "type": "rectangle",
9
+ "x": 113.92242891508351,
10
+ "y": 546.2754497232347,
11
+ "width": 145.73046875,
12
+ "height": 60,
13
+ "angle": 0,
14
+ "strokeColor": "#1e1e1e",
15
+ "backgroundColor": "transparent",
16
+ "fillStyle": "solid",
17
+ "strokeWidth": 2,
18
+ "strokeStyle": "solid",
19
+ "roughness": 1,
20
+ "opacity": 100,
21
+ "groupIds": [],
22
+ "frameId": null,
23
+ "index": "a0",
24
+ "roundness": {
25
+ "type": 3
26
+ },
27
+ "seed": 732491180,
28
+ "version": 134,
29
+ "versionNonce": 2128788504,
30
+ "isDeleted": false,
31
+ "boundElements": [
32
+ {
33
+ "id": "l2Bhq9WJokYdEsOu-kqdv",
34
+ "type": "arrow"
35
+ },
36
+ {
37
+ "type": "text",
38
+ "id": "UFlJSimF8SzkCx_fogXYo"
39
+ },
40
+ {
41
+ "id": "FPhrCFjAj_hPudKjQjMHb",
42
+ "type": "arrow"
43
+ }
44
+ ],
45
+ "updated": 1747664916081,
46
+ "link": null,
47
+ "locked": false
48
+ },
49
+ {
50
+ "id": "UFlJSimF8SzkCx_fogXYo",
51
+ "type": "text",
52
+ "x": 122.77770693022023,
53
+ "y": 563.7754497232347,
54
+ "width": 128.01991271972656,
55
+ "height": 25,
56
+ "angle": 0,
57
+ "strokeColor": "#1e1e1e",
58
+ "backgroundColor": "transparent",
59
+ "fillStyle": "solid",
60
+ "strokeWidth": 2,
61
+ "strokeStyle": "solid",
62
+ "roughness": 1,
63
+ "opacity": 100,
64
+ "groupIds": [],
65
+ "frameId": null,
66
+ "index": "a1",
67
+ "roundness": null,
68
+ "seed": 1377196076,
69
+ "version": 150,
70
+ "versionNonce": 274954520,
71
+ "isDeleted": false,
72
+ "boundElements": [],
73
+ "updated": 1747664916081,
74
+ "link": null,
75
+ "locked": false,
76
+ "text": "Save as jsonl",
77
+ "fontSize": 20,
78
+ "fontFamily": 5,
79
+ "textAlign": "center",
80
+ "verticalAlign": "middle",
81
+ "containerId": "ahbQwKAsijgSdg9jT0hg3",
82
+ "originalText": "Save as jsonl",
83
+ "autoResize": true,
84
+ "lineHeight": 1.25
85
+ },
86
+ {
87
+ "id": "l2Bhq9WJokYdEsOu-kqdv",
88
+ "type": "arrow",
89
+ "x": 181.26889178080395,
90
+ "y": 481.2445801020841,
91
+ "width": 0.47992077927702326,
92
+ "height": 64.64388544319729,
93
+ "angle": 0,
94
+ "strokeColor": "#1e1e1e",
95
+ "backgroundColor": "transparent",
96
+ "fillStyle": "solid",
97
+ "strokeWidth": 2,
98
+ "strokeStyle": "solid",
99
+ "roughness": 1,
100
+ "opacity": 100,
101
+ "groupIds": [],
102
+ "frameId": null,
103
+ "index": "a2",
104
+ "roundness": {
105
+ "type": 2
106
+ },
107
+ "seed": 1282568876,
108
+ "version": 214,
109
+ "versionNonce": 297608040,
110
+ "isDeleted": false,
111
+ "boundElements": [],
112
+ "updated": 1747666564671,
113
+ "link": null,
114
+ "locked": false,
115
+ "points": [
116
+ [
117
+ 0,
118
+ 0
119
+ ],
120
+ [
121
+ 0.47992077927702326,
122
+ 64.64388544319729
123
+ ]
124
+ ],
125
+ "lastCommittedPoint": null,
126
+ "startBinding": {
127
+ "elementId": "S2mRdzEdWZomP_4VMODv9",
128
+ "focus": 0.03491911256120813,
129
+ "gap": 1.7043524039286808
130
+ },
131
+ "endBinding": {
132
+ "elementId": "ahbQwKAsijgSdg9jT0hg3",
133
+ "focus": -0.06586435538969795,
134
+ "gap": 1
135
+ },
136
+ "startArrowhead": null,
137
+ "endArrowhead": "arrow",
138
+ "elbowed": false
139
+ },
140
+ {
141
+ "id": "8qw2U9U9RrOSm8H7KwMOK",
142
+ "type": "rectangle",
143
+ "x": 86.36768348636292,
144
+ "y": 665.4955189693495,
145
+ "width": 209.10886312467028,
146
+ "height": 126.70472854491413,
147
+ "angle": 0,
148
+ "strokeColor": "#1e1e1e",
149
+ "backgroundColor": "#ffec99",
150
+ "fillStyle": "solid",
151
+ "strokeWidth": 2,
152
+ "strokeStyle": "solid",
153
+ "roughness": 1,
154
+ "opacity": 100,
155
+ "groupIds": [],
156
+ "frameId": null,
157
+ "index": "a4",
158
+ "roundness": {
159
+ "type": 3
160
+ },
161
+ "seed": 1709894932,
162
+ "version": 286,
163
+ "versionNonce": 143242008,
164
+ "isDeleted": false,
165
+ "boundElements": [
166
+ {
167
+ "id": "FPhrCFjAj_hPudKjQjMHb",
168
+ "type": "arrow"
169
+ },
170
+ {
171
+ "id": "VfDy5P7I3DdOSK-tio8_Z",
172
+ "type": "text"
173
+ }
174
+ ],
175
+ "updated": 1747665906821,
176
+ "link": null,
177
+ "locked": false
178
+ },
179
+ {
180
+ "id": "VfDy5P7I3DdOSK-tio8_Z",
181
+ "type": "text",
182
+ "x": 98.4121892064129,
183
+ "y": 691.3478832418066,
184
+ "width": 185.0198516845703,
185
+ "height": 75,
186
+ "angle": 0,
187
+ "strokeColor": "#1e1e1e",
188
+ "backgroundColor": "transparent",
189
+ "fillStyle": "solid",
190
+ "strokeWidth": 2,
191
+ "strokeStyle": "solid",
192
+ "roughness": 1,
193
+ "opacity": 100,
194
+ "groupIds": [],
195
+ "frameId": null,
196
+ "index": "a5",
197
+ "roundness": null,
198
+ "seed": 2108959380,
199
+ "version": 454,
200
+ "versionNonce": 1477677080,
201
+ "isDeleted": false,
202
+ "boundElements": [],
203
+ "updated": 1747665906821,
204
+ "link": null,
205
+ "locked": false,
206
+ "text": "Create embeddings,\nuse Matryoshka\nembeddings",
207
+ "fontSize": 20,
208
+ "fontFamily": 5,
209
+ "textAlign": "center",
210
+ "verticalAlign": "middle",
211
+ "containerId": "8qw2U9U9RrOSm8H7KwMOK",
212
+ "originalText": "Create embeddings, use Matryoshka embeddings",
213
+ "autoResize": true,
214
+ "lineHeight": 1.25
215
+ },
216
+ {
217
+ "id": "FPhrCFjAj_hPudKjQjMHb",
218
+ "type": "arrow",
219
+ "x": 183.93007917700277,
220
+ "y": 606.662433901188,
221
+ "width": 0.3427040579839229,
222
+ "height": 58.31253716582853,
223
+ "angle": 0,
224
+ "strokeColor": "#1e1e1e",
225
+ "backgroundColor": "transparent",
226
+ "fillStyle": "solid",
227
+ "strokeWidth": 2,
228
+ "strokeStyle": "solid",
229
+ "roughness": 1,
230
+ "opacity": 100,
231
+ "groupIds": [],
232
+ "frameId": null,
233
+ "index": "a6",
234
+ "roundness": {
235
+ "type": 2
236
+ },
237
+ "seed": 343521300,
238
+ "version": 398,
239
+ "versionNonce": 878300440,
240
+ "isDeleted": false,
241
+ "boundElements": [],
242
+ "updated": 1747665906821,
243
+ "link": null,
244
+ "locked": false,
245
+ "points": [
246
+ [
247
+ 0,
248
+ 0
249
+ ],
250
+ [
251
+ -0.3427040579839229,
252
+ 58.31253716582853
253
+ ]
254
+ ],
255
+ "lastCommittedPoint": null,
256
+ "startBinding": {
257
+ "elementId": "ahbQwKAsijgSdg9jT0hg3",
258
+ "focus": 0.03801391972015758,
259
+ "gap": 1
260
+ },
261
+ "endBinding": {
262
+ "elementId": "8qw2U9U9RrOSm8H7KwMOK",
263
+ "focus": -0.07396043014036746,
264
+ "gap": 1
265
+ },
266
+ "startArrowhead": null,
267
+ "endArrowhead": "arrow",
268
+ "elbowed": false
269
+ },
270
+ {
271
+ "id": "S2mRdzEdWZomP_4VMODv9",
272
+ "type": "rectangle",
273
+ "x": 110.72815104552075,
274
+ "y": 420.5777198758129,
275
+ "width": 145.73046875,
276
+ "height": 64,
277
+ "angle": 0,
278
+ "strokeColor": "#1e1e1e",
279
+ "backgroundColor": "transparent",
280
+ "fillStyle": "solid",
281
+ "strokeWidth": 2,
282
+ "strokeStyle": "solid",
283
+ "roughness": 1,
284
+ "opacity": 100,
285
+ "groupIds": [],
286
+ "frameId": null,
287
+ "index": "a7",
288
+ "roundness": {
289
+ "type": 3
290
+ },
291
+ "seed": 1766607916,
292
+ "version": 176,
293
+ "versionNonce": 712793960,
294
+ "isDeleted": false,
295
+ "boundElements": [
296
+ {
297
+ "id": "4CsqH5PyXh3iccuPZ0I0L",
298
+ "type": "arrow"
299
+ },
300
+ {
301
+ "type": "text",
302
+ "id": "gzAFKkWZ8OVdvSP3NptvV"
303
+ },
304
+ {
305
+ "id": "l2Bhq9WJokYdEsOu-kqdv",
306
+ "type": "arrow"
307
+ }
308
+ ],
309
+ "updated": 1747666605404,
310
+ "link": null,
311
+ "locked": false
312
+ },
313
+ {
314
+ "id": "gzAFKkWZ8OVdvSP3NptvV",
315
+ "type": "text",
316
+ "x": 128.96345683165356,
317
+ "y": 425.5777198758129,
318
+ "width": 109.25985717773438,
319
+ "height": 54,
320
+ "angle": 0,
321
+ "strokeColor": "#1e1e1e",
322
+ "backgroundColor": "transparent",
323
+ "fillStyle": "solid",
324
+ "strokeWidth": 2,
325
+ "strokeStyle": "solid",
326
+ "roughness": 1,
327
+ "opacity": 100,
328
+ "groupIds": [],
329
+ "frameId": null,
330
+ "index": "a8",
331
+ "roundness": null,
332
+ "seed": 1892279980,
333
+ "version": 268,
334
+ "versionNonce": 1504654872,
335
+ "isDeleted": false,
336
+ "boundElements": [],
337
+ "updated": 1747666617437,
338
+ "link": null,
339
+ "locked": false,
340
+ "text": "Extract data\nfrom pdfs",
341
+ "fontSize": 20,
342
+ "fontFamily": 6,
343
+ "textAlign": "center",
344
+ "verticalAlign": "middle",
345
+ "containerId": "S2mRdzEdWZomP_4VMODv9",
346
+ "originalText": "Extract data from pdfs",
347
+ "autoResize": true,
348
+ "lineHeight": 1.35
349
+ },
350
+ {
351
+ "id": "4CsqH5PyXh3iccuPZ0I0L",
352
+ "type": "arrow",
353
+ "x": 178.40108773648336,
354
+ "y": 358.8316261258129,
355
+ "width": 0.7712439836965075,
356
+ "height": 61.294579681627624,
357
+ "angle": 0,
358
+ "strokeColor": "#1e1e1e",
359
+ "backgroundColor": "transparent",
360
+ "fillStyle": "solid",
361
+ "strokeWidth": 2,
362
+ "strokeStyle": "solid",
363
+ "roughness": 1,
364
+ "opacity": 100,
365
+ "groupIds": [],
366
+ "frameId": null,
367
+ "index": "a9",
368
+ "roundness": {
369
+ "type": 2
370
+ },
371
+ "seed": 447754540,
372
+ "version": 290,
373
+ "versionNonce": 2117562472,
374
+ "isDeleted": false,
375
+ "boundElements": [],
376
+ "updated": 1747666564671,
377
+ "link": null,
378
+ "locked": false,
379
+ "points": [
380
+ [
381
+ 0,
382
+ 0
383
+ ],
384
+ [
385
+ -0.7712439836965075,
386
+ 61.294579681627624
387
+ ]
388
+ ],
389
+ "lastCommittedPoint": null,
390
+ "startBinding": null,
391
+ "endBinding": {
392
+ "elementId": "S2mRdzEdWZomP_4VMODv9",
393
+ "focus": -0.08665299440759953,
394
+ "gap": 1.163726888744634
395
+ },
396
+ "startArrowhead": null,
397
+ "endArrowhead": "arrow",
398
+ "elbowed": false
399
+ },
400
+ {
401
+ "id": "sQ3a2n9urbtaCts05NIgi",
402
+ "type": "text",
403
+ "x": 152.56951496517308,
404
+ "y": 329.55127711710816,
405
+ "width": 53.63995361328125,
406
+ "height": 25,
407
+ "angle": 0,
408
+ "strokeColor": "#1e1e1e",
409
+ "backgroundColor": "transparent",
410
+ "fillStyle": "solid",
411
+ "strokeWidth": 2,
412
+ "strokeStyle": "solid",
413
+ "roughness": 1,
414
+ "opacity": 100,
415
+ "groupIds": [],
416
+ "frameId": null,
417
+ "index": "aA",
418
+ "roundness": null,
419
+ "seed": 720358572,
420
+ "version": 45,
421
+ "versionNonce": 975122968,
422
+ "isDeleted": false,
423
+ "boundElements": [],
424
+ "updated": 1747664916082,
425
+ "link": null,
426
+ "locked": false,
427
+ "text": "PDFs",
428
+ "fontSize": 20,
429
+ "fontFamily": 5,
430
+ "textAlign": "left",
431
+ "verticalAlign": "top",
432
+ "containerId": null,
433
+ "originalText": "PDFs",
434
+ "autoResize": true,
435
+ "lineHeight": 1.25
436
+ },
437
+ {
438
+ "id": "bHSkFr3jNSgNI1uBSw_YH",
439
+ "type": "rectangle",
440
+ "x": 87.6826415406075,
441
+ "y": 861.2677483936271,
442
+ "width": 209.10886312467028,
443
+ "height": 126.70472854491413,
444
+ "angle": 0,
445
+ "strokeColor": "#1e1e1e",
446
+ "backgroundColor": "transparent",
447
+ "fillStyle": "solid",
448
+ "strokeWidth": 2,
449
+ "strokeStyle": "solid",
450
+ "roughness": 1,
451
+ "opacity": 100,
452
+ "groupIds": [],
453
+ "frameId": null,
454
+ "index": "aB",
455
+ "roundness": {
456
+ "type": 3
457
+ },
458
+ "seed": 988513964,
459
+ "version": 266,
460
+ "versionNonce": 1486958360,
461
+ "isDeleted": false,
462
+ "boundElements": [
463
+ {
464
+ "id": "zXok0v3zHlKbCmoSypY8o",
465
+ "type": "arrow"
466
+ },
467
+ {
468
+ "type": "text",
469
+ "id": "sg3SXgncifiwZc-zwXYjd"
470
+ }
471
+ ],
472
+ "updated": 1747664916082,
473
+ "link": null,
474
+ "locked": false
475
+ },
476
+ {
477
+ "id": "sg3SXgncifiwZc-zwXYjd",
478
+ "type": "text",
479
+ "x": 96.8871356639778,
480
+ "y": 887.1201126660842,
481
+ "width": 190.6998748779297,
482
+ "height": 75,
483
+ "angle": 0,
484
+ "strokeColor": "#1e1e1e",
485
+ "backgroundColor": "transparent",
486
+ "fillStyle": "solid",
487
+ "strokeWidth": 2,
488
+ "strokeStyle": "solid",
489
+ "roughness": 1,
490
+ "opacity": 100,
491
+ "groupIds": [],
492
+ "frameId": null,
493
+ "index": "aC",
494
+ "roundness": null,
495
+ "seed": 1872613676,
496
+ "version": 352,
497
+ "versionNonce": 802917400,
498
+ "isDeleted": false,
499
+ "boundElements": [],
500
+ "updated": 1747664916082,
501
+ "link": null,
502
+ "locked": false,
503
+ "text": "Create and save to\ncollection in Chroma\nvector db",
504
+ "fontSize": 20,
505
+ "fontFamily": 5,
506
+ "textAlign": "center",
507
+ "verticalAlign": "middle",
508
+ "containerId": "bHSkFr3jNSgNI1uBSw_YH",
509
+ "originalText": "Create and save to collection in Chroma vector db",
510
+ "autoResize": true,
511
+ "lineHeight": 1.25
512
+ },
513
+ {
514
+ "id": "zXok0v3zHlKbCmoSypY8o",
515
+ "type": "arrow",
516
+ "x": 185.16187324544518,
517
+ "y": 799.2927796444268,
518
+ "width": 0.4120303976070261,
519
+ "height": 61.368755427712586,
520
+ "angle": 0,
521
+ "strokeColor": "#1e1e1e",
522
+ "backgroundColor": "transparent",
523
+ "fillStyle": "solid",
524
+ "strokeWidth": 2,
525
+ "strokeStyle": "solid",
526
+ "roughness": 1,
527
+ "opacity": 100,
528
+ "groupIds": [],
529
+ "frameId": null,
530
+ "index": "aD",
531
+ "roundness": {
532
+ "type": 2
533
+ },
534
+ "seed": 480431020,
535
+ "version": 426,
536
+ "versionNonce": 1082347112,
537
+ "isDeleted": false,
538
+ "boundElements": [],
539
+ "updated": 1747664916199,
540
+ "link": null,
541
+ "locked": false,
542
+ "points": [
543
+ [
544
+ 0,
545
+ 0
546
+ ],
547
+ [
548
+ -0.4120303976070261,
549
+ 61.368755427712586
550
+ ]
551
+ ],
552
+ "lastCommittedPoint": null,
553
+ "startBinding": null,
554
+ "endBinding": {
555
+ "elementId": "bHSkFr3jNSgNI1uBSw_YH",
556
+ "focus": -0.07541117678427305,
557
+ "gap": 1
558
+ },
559
+ "startArrowhead": null,
560
+ "endArrowhead": "arrow",
561
+ "elbowed": false
562
+ },
563
+ {
564
+ "id": "d9zvLKXktuu4KKoNwDHks",
565
+ "type": "text",
566
+ "x": -207.9106402991797,
567
+ "y": 670.8478145278248,
568
+ "width": 257.6997985839844,
569
+ "height": 125,
570
+ "angle": 0,
571
+ "strokeColor": "#1e1e1e",
572
+ "backgroundColor": "transparent",
573
+ "fillStyle": "solid",
574
+ "strokeWidth": 2,
575
+ "strokeStyle": "solid",
576
+ "roughness": 1,
577
+ "opacity": 100,
578
+ "groupIds": [],
579
+ "frameId": null,
580
+ "index": "aF",
581
+ "roundness": null,
582
+ "seed": 297698072,
583
+ "version": 205,
584
+ "versionNonce": 1692721944,
585
+ "isDeleted": false,
586
+ "boundElements": null,
587
+ "updated": 1747664916082,
588
+ "link": null,
589
+ "locked": false,
590
+ "text": "Matryoshka embeddings\ncan be used with different\ndimensions and used with\nGalileo evals to show \nimprovement ",
591
+ "fontSize": 20,
592
+ "fontFamily": 5,
593
+ "textAlign": "left",
594
+ "verticalAlign": "top",
595
+ "containerId": null,
596
+ "originalText": "Matryoshka embeddings\ncan be used with different\ndimensions and used with\nGalileo evals to show \nimprovement ",
597
+ "autoResize": true,
598
+ "lineHeight": 1.25
599
+ },
600
+ {
601
+ "id": "LUAd0l_VFSMzCn477Y5tU",
602
+ "type": "text",
603
+ "x": -32.44826338674477,
604
+ "y": 246.50406452782477,
605
+ "width": 419.84990437825513,
606
+ "height": 46.875,
607
+ "angle": 0,
608
+ "strokeColor": "#1e1e1e",
609
+ "backgroundColor": "transparent",
610
+ "fillStyle": "solid",
611
+ "strokeWidth": 2,
612
+ "strokeStyle": "solid",
613
+ "roughness": 1,
614
+ "opacity": 100,
615
+ "groupIds": [],
616
+ "frameId": null,
617
+ "index": "aG",
618
+ "roundness": null,
619
+ "seed": 220549992,
620
+ "version": 96,
621
+ "versionNonce": 1636352024,
622
+ "isDeleted": false,
623
+ "boundElements": null,
624
+ "updated": 1747664916082,
625
+ "link": null,
626
+ "locked": false,
627
+ "text": "BACKEND DATA PREP",
628
+ "fontSize": 37.49999999999999,
629
+ "fontFamily": 5,
630
+ "textAlign": "left",
631
+ "verticalAlign": "top",
632
+ "containerId": null,
633
+ "originalText": "BACKEND DATA PREP",
634
+ "autoResize": true,
635
+ "lineHeight": 1.25
636
+ },
637
+ {
638
+ "id": "LCcahyR5mZ-VrhSSJ0RrX",
639
+ "type": "text",
640
+ "x": 844.4690950116927,
641
+ "y": 238.00406452782477,
642
+ "width": 263.5499572753906,
643
+ "height": 46.87499999999999,
644
+ "angle": 0,
645
+ "strokeColor": "#1e1e1e",
646
+ "backgroundColor": "transparent",
647
+ "fillStyle": "solid",
648
+ "strokeWidth": 2,
649
+ "strokeStyle": "solid",
650
+ "roughness": 1,
651
+ "opacity": 100,
652
+ "groupIds": [],
653
+ "frameId": null,
654
+ "index": "aH",
655
+ "roundness": null,
656
+ "seed": 1455065880,
657
+ "version": 235,
658
+ "versionNonce": 933821720,
659
+ "isDeleted": false,
660
+ "boundElements": [],
661
+ "updated": 1747664916082,
662
+ "link": null,
663
+ "locked": false,
664
+ "text": "UI WITH RAG",
665
+ "fontSize": 37.49999999999999,
666
+ "fontFamily": 5,
667
+ "textAlign": "left",
668
+ "verticalAlign": "top",
669
+ "containerId": null,
670
+ "originalText": "UI WITH RAG",
671
+ "autoResize": true,
672
+ "lineHeight": 1.25
673
+ },
674
+ {
675
+ "id": "EdUn3nnIEvlF6aBa1kj7M",
676
+ "type": "text",
677
+ "x": 921.6245159508203,
678
+ "y": 313.83609577782477,
679
+ "width": 134.23989868164062,
680
+ "height": 25,
681
+ "angle": 0,
682
+ "strokeColor": "#1e1e1e",
683
+ "backgroundColor": "transparent",
684
+ "fillStyle": "solid",
685
+ "strokeWidth": 2,
686
+ "strokeStyle": "solid",
687
+ "roughness": 1,
688
+ "opacity": 100,
689
+ "groupIds": [],
690
+ "frameId": null,
691
+ "index": "aK",
692
+ "roundness": null,
693
+ "seed": 1834382440,
694
+ "version": 103,
695
+ "versionNonce": 951687704,
696
+ "isDeleted": false,
697
+ "boundElements": null,
698
+ "updated": 1747664916082,
699
+ "link": null,
700
+ "locked": false,
701
+ "text": "User question",
702
+ "fontSize": 20,
703
+ "fontFamily": 5,
704
+ "textAlign": "left",
705
+ "verticalAlign": "top",
706
+ "containerId": null,
707
+ "originalText": "User question",
708
+ "autoResize": true,
709
+ "lineHeight": 1.25
710
+ },
711
+ {
712
+ "id": "ZN_mKooYb6wL_YlpOin8R",
713
+ "type": "rectangle",
714
+ "x": 918.8139690758203,
715
+ "y": 423.5783258824798,
716
+ "width": 145.73046875,
717
+ "height": 60,
718
+ "angle": 0,
719
+ "strokeColor": "#1e1e1e",
720
+ "backgroundColor": "transparent",
721
+ "fillStyle": "solid",
722
+ "strokeWidth": 2,
723
+ "strokeStyle": "solid",
724
+ "roughness": 1,
725
+ "opacity": 100,
726
+ "groupIds": [],
727
+ "frameId": null,
728
+ "index": "aL",
729
+ "roundness": {
730
+ "type": 3
731
+ },
732
+ "seed": 585607704,
733
+ "version": 285,
734
+ "versionNonce": 1300833896,
735
+ "isDeleted": false,
736
+ "boundElements": [
737
+ {
738
+ "id": "XV1DRpL8zYaqsEKP0Z_nl",
739
+ "type": "arrow"
740
+ },
741
+ {
742
+ "type": "text",
743
+ "id": "-NyTkinb1h4vbr60eEtwZ"
744
+ }
745
+ ],
746
+ "updated": 1747665565753,
747
+ "link": null,
748
+ "locked": false
749
+ },
750
+ {
751
+ "id": "-NyTkinb1h4vbr60eEtwZ",
752
+ "type": "text",
753
+ "x": 934.9592556358789,
754
+ "y": 428.5783258824798,
755
+ "width": 113.43989562988281,
756
+ "height": 50,
757
+ "angle": 0,
758
+ "strokeColor": "#1e1e1e",
759
+ "backgroundColor": "transparent",
760
+ "fillStyle": "solid",
761
+ "strokeWidth": 2,
762
+ "strokeStyle": "solid",
763
+ "roughness": 1,
764
+ "opacity": 100,
765
+ "groupIds": [],
766
+ "frameId": null,
767
+ "index": "aM",
768
+ "roundness": null,
769
+ "seed": 26912536,
770
+ "version": 364,
771
+ "versionNonce": 963123560,
772
+ "isDeleted": false,
773
+ "boundElements": [],
774
+ "updated": 1747665565753,
775
+ "link": null,
776
+ "locked": false,
777
+ "text": "Streamlit\nUI/backend",
778
+ "fontSize": 20,
779
+ "fontFamily": 5,
780
+ "textAlign": "center",
781
+ "verticalAlign": "middle",
782
+ "containerId": "ZN_mKooYb6wL_YlpOin8R",
783
+ "originalText": "Streamlit UI/backend",
784
+ "autoResize": true,
785
+ "lineHeight": 1.25
786
+ },
787
+ {
788
+ "id": "XV1DRpL8zYaqsEKP0Z_nl",
789
+ "type": "arrow",
790
+ "x": 986.4869057667829,
791
+ "y": 361.8478571324798,
792
+ "width": 0.7711825485696409,
793
+ "height": 61.27895468162757,
794
+ "angle": 0,
795
+ "strokeColor": "#1e1e1e",
796
+ "backgroundColor": "transparent",
797
+ "fillStyle": "solid",
798
+ "strokeWidth": 2,
799
+ "strokeStyle": "solid",
800
+ "roughness": 1,
801
+ "opacity": 100,
802
+ "groupIds": [],
803
+ "frameId": null,
804
+ "index": "aN",
805
+ "roundness": {
806
+ "type": 2
807
+ },
808
+ "seed": 1793459224,
809
+ "version": 509,
810
+ "versionNonce": 1391504488,
811
+ "isDeleted": false,
812
+ "boundElements": [],
813
+ "updated": 1747665565754,
814
+ "link": null,
815
+ "locked": false,
816
+ "points": [
817
+ [
818
+ 0,
819
+ 0
820
+ ],
821
+ [
822
+ -0.7711825485696409,
823
+ 61.27895468162757
824
+ ]
825
+ ],
826
+ "lastCommittedPoint": null,
827
+ "startBinding": null,
828
+ "endBinding": {
829
+ "elementId": "ZN_mKooYb6wL_YlpOin8R",
830
+ "focus": -0.08665299440759792,
831
+ "gap": 1.163726888744634
832
+ },
833
+ "startArrowhead": null,
834
+ "endArrowhead": "arrow",
835
+ "elbowed": false
836
+ },
837
+ {
838
+ "id": "VaeeYk5sVGivRgHn8KaCo",
839
+ "type": "rectangle",
840
+ "x": 922.3256878258203,
841
+ "y": 547.8040436970491,
842
+ "width": 145.73046875,
843
+ "height": 85,
844
+ "angle": 0,
845
+ "strokeColor": "#1e1e1e",
846
+ "backgroundColor": "transparent",
847
+ "fillStyle": "solid",
848
+ "strokeWidth": 2,
849
+ "strokeStyle": "solid",
850
+ "roughness": 1,
851
+ "opacity": 100,
852
+ "groupIds": [],
853
+ "frameId": null,
854
+ "index": "aO",
855
+ "roundness": {
856
+ "type": 3
857
+ },
858
+ "seed": 1355576856,
859
+ "version": 314,
860
+ "versionNonce": 1845381912,
861
+ "isDeleted": false,
862
+ "boundElements": [
863
+ {
864
+ "id": "kOt0lJ7fT7uuJmHCxsxGM",
865
+ "type": "arrow"
866
+ },
867
+ {
868
+ "type": "text",
869
+ "id": "1sXCYRnQofixcmBJbRUYk"
870
+ }
871
+ ],
872
+ "updated": 1747664916082,
873
+ "link": null,
874
+ "locked": false
875
+ },
876
+ {
877
+ "id": "1sXCYRnQofixcmBJbRUYk",
878
+ "type": "text",
879
+ "x": 938.7609524132226,
880
+ "y": 552.8040436970491,
881
+ "width": 112.85993957519531,
882
+ "height": 75,
883
+ "angle": 0,
884
+ "strokeColor": "#1e1e1e",
885
+ "backgroundColor": "transparent",
886
+ "fillStyle": "solid",
887
+ "strokeWidth": 2,
888
+ "strokeStyle": "solid",
889
+ "roughness": 1,
890
+ "opacity": 100,
891
+ "groupIds": [],
892
+ "frameId": null,
893
+ "index": "aP",
894
+ "roundness": null,
895
+ "seed": 822473496,
896
+ "version": 423,
897
+ "versionNonce": 1395992600,
898
+ "isDeleted": false,
899
+ "boundElements": [],
900
+ "updated": 1747664916082,
901
+ "link": null,
902
+ "locked": false,
903
+ "text": "Convert\nquestion to\nembedding",
904
+ "fontSize": 20,
905
+ "fontFamily": 5,
906
+ "textAlign": "center",
907
+ "verticalAlign": "middle",
908
+ "containerId": "VaeeYk5sVGivRgHn8KaCo",
909
+ "originalText": "Convert question to embedding",
910
+ "autoResize": true,
911
+ "lineHeight": 1.25
912
+ },
913
+ {
914
+ "id": "kOt0lJ7fT7uuJmHCxsxGM",
915
+ "type": "arrow",
916
+ "x": 989.9986245167829,
917
+ "y": 486.3079499470491,
918
+ "width": 0.7612717529503357,
919
+ "height": 60.332366861255366,
920
+ "angle": 0,
921
+ "strokeColor": "#1e1e1e",
922
+ "backgroundColor": "transparent",
923
+ "fillStyle": "solid",
924
+ "strokeWidth": 2,
925
+ "strokeStyle": "solid",
926
+ "roughness": 1,
927
+ "opacity": 100,
928
+ "groupIds": [],
929
+ "frameId": null,
930
+ "index": "aQ",
931
+ "roundness": {
932
+ "type": 2
933
+ },
934
+ "seed": 654823448,
935
+ "version": 564,
936
+ "versionNonce": 461427816,
937
+ "isDeleted": false,
938
+ "boundElements": [],
939
+ "updated": 1747664916199,
940
+ "link": null,
941
+ "locked": false,
942
+ "points": [
943
+ [
944
+ 0,
945
+ 0
946
+ ],
947
+ [
948
+ -0.7612717529503357,
949
+ 60.332366861255366
950
+ ]
951
+ ],
952
+ "lastCommittedPoint": null,
953
+ "startBinding": null,
954
+ "endBinding": {
955
+ "elementId": "VaeeYk5sVGivRgHn8KaCo",
956
+ "focus": -0.08861558743629552,
957
+ "gap": 1.163726888744577
958
+ },
959
+ "startArrowhead": null,
960
+ "endArrowhead": "arrow",
961
+ "elbowed": false
962
+ },
963
+ {
964
+ "id": "Xfbn92OkPEM6ZzK149DxS",
965
+ "type": "rectangle",
966
+ "x": 924.7045940758203,
967
+ "y": 697.1442769037981,
968
+ "width": 145.73046875,
969
+ "height": 85,
970
+ "angle": 0,
971
+ "strokeColor": "#1e1e1e",
972
+ "backgroundColor": "transparent",
973
+ "fillStyle": "solid",
974
+ "strokeWidth": 2,
975
+ "strokeStyle": "solid",
976
+ "roughness": 1,
977
+ "opacity": 100,
978
+ "groupIds": [],
979
+ "frameId": null,
980
+ "index": "aR",
981
+ "roundness": {
982
+ "type": 3
983
+ },
984
+ "seed": 242743832,
985
+ "version": 359,
986
+ "versionNonce": 81002264,
987
+ "isDeleted": false,
988
+ "boundElements": [
989
+ {
990
+ "id": "TFd3QnaQ0gbgIILWoCbfK",
991
+ "type": "arrow"
992
+ },
993
+ {
994
+ "type": "text",
995
+ "id": "AANZteIq6UfeoE2UXHRri"
996
+ }
997
+ ],
998
+ "updated": 1747664916082,
999
+ "link": null,
1000
+ "locked": false
1001
+ },
1002
+ {
1003
+ "id": "AANZteIq6UfeoE2UXHRri",
1004
+ "type": "text",
1005
+ "x": 937.3798641563867,
1006
+ "y": 702.1442769037981,
1007
+ "width": 120.37992858886719,
1008
+ "height": 75,
1009
+ "angle": 0,
1010
+ "strokeColor": "#1e1e1e",
1011
+ "backgroundColor": "transparent",
1012
+ "fillStyle": "solid",
1013
+ "strokeWidth": 2,
1014
+ "strokeStyle": "solid",
1015
+ "roughness": 1,
1016
+ "opacity": 100,
1017
+ "groupIds": [],
1018
+ "frameId": null,
1019
+ "index": "aS",
1020
+ "roundness": null,
1021
+ "seed": 916518680,
1022
+ "version": 512,
1023
+ "versionNonce": 1832147992,
1024
+ "isDeleted": false,
1025
+ "boundElements": [],
1026
+ "updated": 1747664916082,
1027
+ "link": null,
1028
+ "locked": false,
1029
+ "text": "Get top-k\nfrom Chroma\nvector db",
1030
+ "fontSize": 20,
1031
+ "fontFamily": 5,
1032
+ "textAlign": "center",
1033
+ "verticalAlign": "middle",
1034
+ "containerId": "Xfbn92OkPEM6ZzK149DxS",
1035
+ "originalText": "Get top-k from Chroma vector db",
1036
+ "autoResize": true,
1037
+ "lineHeight": 1.25
1038
+ },
1039
+ {
1040
+ "id": "TFd3QnaQ0gbgIILWoCbfK",
1041
+ "type": "arrow",
1042
+ "x": 992.3775307667829,
1043
+ "y": 635.6481831537981,
1044
+ "width": 0.7612717529503357,
1045
+ "height": 60.332366861255366,
1046
+ "angle": 0,
1047
+ "strokeColor": "#1e1e1e",
1048
+ "backgroundColor": "transparent",
1049
+ "fillStyle": "solid",
1050
+ "strokeWidth": 2,
1051
+ "strokeStyle": "solid",
1052
+ "roughness": 1,
1053
+ "opacity": 100,
1054
+ "groupIds": [],
1055
+ "frameId": null,
1056
+ "index": "aT",
1057
+ "roundness": {
1058
+ "type": 2
1059
+ },
1060
+ "seed": 1252152344,
1061
+ "version": 652,
1062
+ "versionNonce": 138020712,
1063
+ "isDeleted": false,
1064
+ "boundElements": [],
1065
+ "updated": 1747664916200,
1066
+ "link": null,
1067
+ "locked": false,
1068
+ "points": [
1069
+ [
1070
+ 0,
1071
+ 0
1072
+ ],
1073
+ [
1074
+ -0.7612717529503357,
1075
+ 60.332366861255366
1076
+ ]
1077
+ ],
1078
+ "lastCommittedPoint": null,
1079
+ "startBinding": null,
1080
+ "endBinding": {
1081
+ "elementId": "Xfbn92OkPEM6ZzK149DxS",
1082
+ "focus": -0.08861558743629806,
1083
+ "gap": 1.163726888744577
1084
+ },
1085
+ "startArrowhead": null,
1086
+ "endArrowhead": "arrow",
1087
+ "elbowed": false
1088
+ },
1089
+ {
1090
+ "id": "2iNtOdIxMDmUgNRqT2SHv",
1091
+ "type": "rectangle",
1092
+ "x": 929.2280315758203,
1093
+ "y": 846.8381009518404,
1094
+ "width": 145.73046875,
1095
+ "height": 85,
1096
+ "angle": 0,
1097
+ "strokeColor": "#1e1e1e",
1098
+ "backgroundColor": "#ffec99",
1099
+ "fillStyle": "solid",
1100
+ "strokeWidth": 2,
1101
+ "strokeStyle": "solid",
1102
+ "roughness": 1,
1103
+ "opacity": 100,
1104
+ "groupIds": [],
1105
+ "frameId": null,
1106
+ "index": "aU",
1107
+ "roundness": {
1108
+ "type": 3
1109
+ },
1110
+ "seed": 418077208,
1111
+ "version": 411,
1112
+ "versionNonce": 959427352,
1113
+ "isDeleted": false,
1114
+ "boundElements": [
1115
+ {
1116
+ "id": "mt3dUdBKJks2RtTSQqE1X",
1117
+ "type": "arrow"
1118
+ },
1119
+ {
1120
+ "type": "text",
1121
+ "id": "PO5LIHoHfyLfjDsOD0DAw"
1122
+ }
1123
+ ],
1124
+ "updated": 1747664916082,
1125
+ "link": null,
1126
+ "locked": false
1127
+ },
1128
+ {
1129
+ "id": "PO5LIHoHfyLfjDsOD0DAw",
1130
+ "type": "text",
1131
+ "x": 947.3633007408594,
1132
+ "y": 864.3381009518404,
1133
+ "width": 109.45993041992188,
1134
+ "height": 50,
1135
+ "angle": 0,
1136
+ "strokeColor": "#1e1e1e",
1137
+ "backgroundColor": "transparent",
1138
+ "fillStyle": "solid",
1139
+ "strokeWidth": 2,
1140
+ "strokeStyle": "solid",
1141
+ "roughness": 1,
1142
+ "opacity": 100,
1143
+ "groupIds": [],
1144
+ "frameId": null,
1145
+ "index": "aV",
1146
+ "roundness": null,
1147
+ "seed": 1537759000,
1148
+ "version": 610,
1149
+ "versionNonce": 66687000,
1150
+ "isDeleted": false,
1151
+ "boundElements": [],
1152
+ "updated": 1747664916082,
1153
+ "link": null,
1154
+ "locked": false,
1155
+ "text": "Add Galileo\nevals",
1156
+ "fontSize": 20,
1157
+ "fontFamily": 5,
1158
+ "textAlign": "center",
1159
+ "verticalAlign": "middle",
1160
+ "containerId": "2iNtOdIxMDmUgNRqT2SHv",
1161
+ "originalText": "Add Galileo evals",
1162
+ "autoResize": true,
1163
+ "lineHeight": 1.25
1164
+ },
1165
+ {
1166
+ "id": "mt3dUdBKJks2RtTSQqE1X",
1167
+ "type": "arrow",
1168
+ "x": 996.6509682667829,
1169
+ "y": 785.3420072018404,
1170
+ "width": 0.7612717529503357,
1171
+ "height": 60.332366861255366,
1172
+ "angle": 0,
1173
+ "strokeColor": "#1e1e1e",
1174
+ "backgroundColor": "transparent",
1175
+ "fillStyle": "solid",
1176
+ "strokeWidth": 2,
1177
+ "strokeStyle": "solid",
1178
+ "roughness": 1,
1179
+ "opacity": 100,
1180
+ "groupIds": [],
1181
+ "frameId": null,
1182
+ "index": "aW",
1183
+ "roundness": {
1184
+ "type": 2
1185
+ },
1186
+ "seed": 398308376,
1187
+ "version": 754,
1188
+ "versionNonce": 223664744,
1189
+ "isDeleted": false,
1190
+ "boundElements": [],
1191
+ "updated": 1747664916200,
1192
+ "link": null,
1193
+ "locked": false,
1194
+ "points": [
1195
+ [
1196
+ 0,
1197
+ 0
1198
+ ],
1199
+ [
1200
+ -0.7612717529503357,
1201
+ 60.332366861255366
1202
+ ]
1203
+ ],
1204
+ "lastCommittedPoint": null,
1205
+ "startBinding": null,
1206
+ "endBinding": {
1207
+ "elementId": "2iNtOdIxMDmUgNRqT2SHv",
1208
+ "focus": -0.09202151247941287,
1209
+ "gap": 1.1637268887446908
1210
+ },
1211
+ "startArrowhead": null,
1212
+ "endArrowhead": "arrow",
1213
+ "elbowed": false
1214
+ },
1215
+ {
1216
+ "id": "_l8h4crDy_ePJQQpee6SU",
1217
+ "type": "arrow",
1218
+ "x": 998.8268626084431,
1219
+ "y": 931.5477478779942,
1220
+ "width": 0.7612717529503357,
1221
+ "height": 60.332366861255366,
1222
+ "angle": 0,
1223
+ "strokeColor": "#1e1e1e",
1224
+ "backgroundColor": "transparent",
1225
+ "fillStyle": "solid",
1226
+ "strokeWidth": 2,
1227
+ "strokeStyle": "solid",
1228
+ "roughness": 1,
1229
+ "opacity": 100,
1230
+ "groupIds": [],
1231
+ "frameId": null,
1232
+ "index": "aX",
1233
+ "roundness": {
1234
+ "type": 2
1235
+ },
1236
+ "seed": 1835770648,
1237
+ "version": 731,
1238
+ "versionNonce": 403793688,
1239
+ "isDeleted": false,
1240
+ "boundElements": [],
1241
+ "updated": 1747664916082,
1242
+ "link": null,
1243
+ "locked": false,
1244
+ "points": [
1245
+ [
1246
+ 0,
1247
+ 0
1248
+ ],
1249
+ [
1250
+ -0.7612717529503357,
1251
+ 60.332366861255366
1252
+ ]
1253
+ ],
1254
+ "lastCommittedPoint": null,
1255
+ "startBinding": null,
1256
+ "endBinding": null,
1257
+ "startArrowhead": null,
1258
+ "endArrowhead": "arrow",
1259
+ "elbowed": false
1260
+ },
1261
+ {
1262
+ "id": "uOvtezevF7hRnF5IhUdTk",
1263
+ "type": "text",
1264
+ "x": 877.2417034508203,
1265
+ "y": 1150.0626582778248,
1266
+ "width": 230.87986755371094,
1267
+ "height": 25,
1268
+ "angle": 0,
1269
+ "strokeColor": "#1e1e1e",
1270
+ "backgroundColor": "transparent",
1271
+ "fillStyle": "solid",
1272
+ "strokeWidth": 2,
1273
+ "strokeStyle": "solid",
1274
+ "roughness": 1,
1275
+ "opacity": 100,
1276
+ "groupIds": [],
1277
+ "frameId": null,
1278
+ "index": "aY",
1279
+ "roundness": null,
1280
+ "seed": 1347031320,
1281
+ "version": 146,
1282
+ "versionNonce": 2103029272,
1283
+ "isDeleted": false,
1284
+ "boundElements": null,
1285
+ "updated": 1747665658987,
1286
+ "link": null,
1287
+ "locked": false,
1288
+ "text": "Show top-k + eval in UI",
1289
+ "fontSize": 20,
1290
+ "fontFamily": 5,
1291
+ "textAlign": "left",
1292
+ "verticalAlign": "top",
1293
+ "containerId": null,
1294
+ "originalText": "Show top-k + eval in UI",
1295
+ "autoResize": true,
1296
+ "lineHeight": 1.25
1297
+ },
1298
+ {
1299
+ "id": "95VkcTxIBcZMz2gr1CcCo",
1300
+ "type": "text",
1301
+ "x": 398.624890941522,
1302
+ "y": 479.95474812157477,
1303
+ "width": 389.2597351074219,
1304
+ "height": 150,
1305
+ "angle": 0,
1306
+ "strokeColor": "#1e1e1e",
1307
+ "backgroundColor": "transparent",
1308
+ "fillStyle": "solid",
1309
+ "strokeWidth": 2,
1310
+ "strokeStyle": "solid",
1311
+ "roughness": 1,
1312
+ "opacity": 100,
1313
+ "groupIds": [],
1314
+ "frameId": null,
1315
+ "index": "ac",
1316
+ "roundness": null,
1317
+ "seed": 1184416536,
1318
+ "version": 117,
1319
+ "versionNonce": 251684456,
1320
+ "isDeleted": false,
1321
+ "boundElements": null,
1322
+ "updated": 1747664921990,
1323
+ "link": null,
1324
+ "locked": false,
1325
+ "text": "- Phase 1: Simple RAG with Galileo evals\n\n- Phase 2: Agentic RAG with ability to \nself-correct and/with Galileo evals\n\n- Phase 3: TBD",
1326
+ "fontSize": 20,
1327
+ "fontFamily": 5,
1328
+ "textAlign": "left",
1329
+ "verticalAlign": "top",
1330
+ "containerId": null,
1331
+ "originalText": "- Phase 1: Simple RAG with Galileo evals\n\n- Phase 2: Agentic RAG with ability to \nself-correct and/with Galileo evals\n\n- Phase 3: TBD",
1332
+ "autoResize": true,
1333
+ "lineHeight": 1.25
1334
+ },
1335
+ {
1336
+ "id": "X0ww4FIuAjwX6E_HhWLpX",
1337
+ "type": "text",
1338
+ "x": 524.1010784508203,
1339
+ "y": 398.76187702782477,
1340
+ "width": 77.95994567871094,
1341
+ "height": 25,
1342
+ "angle": 0,
1343
+ "strokeColor": "#1e1e1e",
1344
+ "backgroundColor": "transparent",
1345
+ "fillStyle": "solid",
1346
+ "strokeWidth": 2,
1347
+ "strokeStyle": "solid",
1348
+ "roughness": 1,
1349
+ "opacity": 100,
1350
+ "groupIds": [],
1351
+ "frameId": null,
1352
+ "index": "ae",
1353
+ "roundness": null,
1354
+ "seed": 142996072,
1355
+ "version": 8,
1356
+ "versionNonce": 1670002536,
1357
+ "isDeleted": false,
1358
+ "boundElements": null,
1359
+ "updated": 1747664926713,
1360
+ "link": null,
1361
+ "locked": false,
1362
+ "text": "PHASES",
1363
+ "fontSize": 20,
1364
+ "fontFamily": 5,
1365
+ "textAlign": "left",
1366
+ "verticalAlign": "top",
1367
+ "containerId": null,
1368
+ "originalText": "PHASES",
1369
+ "autoResize": true,
1370
+ "lineHeight": 1.25
1371
+ },
1372
+ {
1373
+ "id": "umtwNm9fcxd1bsz6uIFH3",
1374
+ "type": "text",
1375
+ "x": -196.4262652991797,
1376
+ "y": 894.7618770278248,
1377
+ "width": 216.8598175048828,
1378
+ "height": 25,
1379
+ "angle": 0,
1380
+ "strokeColor": "#1e1e1e",
1381
+ "backgroundColor": "transparent",
1382
+ "fillStyle": "solid",
1383
+ "strokeWidth": 2,
1384
+ "strokeStyle": "solid",
1385
+ "roughness": 1,
1386
+ "opacity": 100,
1387
+ "groupIds": [],
1388
+ "frameId": null,
1389
+ "index": "af",
1390
+ "roundness": null,
1391
+ "seed": 1217313560,
1392
+ "version": 41,
1393
+ "versionNonce": 728632680,
1394
+ "isDeleted": false,
1395
+ "boundElements": null,
1396
+ "updated": 1747665493854,
1397
+ "link": null,
1398
+ "locked": false,
1399
+ "text": "Fuzzy search in Milvus",
1400
+ "fontSize": 20,
1401
+ "fontFamily": 5,
1402
+ "textAlign": "left",
1403
+ "verticalAlign": "top",
1404
+ "containerId": null,
1405
+ "originalText": "Fuzzy search in Milvus",
1406
+ "autoResize": true,
1407
+ "lineHeight": 1.25
1408
+ },
1409
+ {
1410
+ "id": "7wPfS9EWewDNqNV2lGfYX",
1411
+ "type": "rectangle",
1412
+ "x": 927.0092815758203,
1413
+ "y": 998.382970777825,
1414
+ "width": 145.73046875,
1415
+ "height": 110,
1416
+ "angle": 0,
1417
+ "strokeColor": "#1e1e1e",
1418
+ "backgroundColor": "#ffec99",
1419
+ "fillStyle": "solid",
1420
+ "strokeWidth": 2,
1421
+ "strokeStyle": "solid",
1422
+ "roughness": 1,
1423
+ "opacity": 100,
1424
+ "groupIds": [],
1425
+ "frameId": null,
1426
+ "index": "ag",
1427
+ "roundness": {
1428
+ "type": 3
1429
+ },
1430
+ "seed": 974709272,
1431
+ "version": 484,
1432
+ "versionNonce": 334137624,
1433
+ "isDeleted": false,
1434
+ "boundElements": [
1435
+ {
1436
+ "type": "text",
1437
+ "id": "lYwockrM6JTcy0vqjVEyE"
1438
+ }
1439
+ ],
1440
+ "updated": 1747665654880,
1441
+ "link": null,
1442
+ "locked": false
1443
+ },
1444
+ {
1445
+ "id": "lYwockrM6JTcy0vqjVEyE",
1446
+ "type": "text",
1447
+ "x": 941.7745632530664,
1448
+ "y": 1003.382970777825,
1449
+ "width": 116.19990539550781,
1450
+ "height": 100,
1451
+ "angle": 0,
1452
+ "strokeColor": "#1e1e1e",
1453
+ "backgroundColor": "transparent",
1454
+ "fillStyle": "solid",
1455
+ "strokeWidth": 2,
1456
+ "strokeStyle": "solid",
1457
+ "roughness": 1,
1458
+ "opacity": 100,
1459
+ "groupIds": [],
1460
+ "frameId": null,
1461
+ "index": "ah",
1462
+ "roundness": null,
1463
+ "seed": 233103128,
1464
+ "version": 743,
1465
+ "versionNonce": 545124632,
1466
+ "isDeleted": false,
1467
+ "boundElements": [],
1468
+ "updated": 1747665655863,
1469
+ "link": null,
1470
+ "locked": false,
1471
+ "text": "Check\ncontextual\nrelevance,\nplan actions",
1472
+ "fontSize": 20,
1473
+ "fontFamily": 5,
1474
+ "textAlign": "center",
1475
+ "verticalAlign": "middle",
1476
+ "containerId": "7wPfS9EWewDNqNV2lGfYX",
1477
+ "originalText": "Check contextual relevance, plan actions",
1478
+ "autoResize": true,
1479
+ "lineHeight": 1.25
1480
+ },
1481
+ {
1482
+ "id": "q8QUVhGqw9gk_AYyPZ-7P",
1483
+ "type": "text",
1484
+ "x": -193.6059527991797,
1485
+ "y": 415.57828327782477,
1486
+ "width": 204.9798583984375,
1487
+ "height": 75,
1488
+ "angle": 0,
1489
+ "strokeColor": "#1e1e1e",
1490
+ "backgroundColor": "transparent",
1491
+ "fillStyle": "solid",
1492
+ "strokeWidth": 2,
1493
+ "strokeStyle": "solid",
1494
+ "roughness": 1,
1495
+ "opacity": 100,
1496
+ "groupIds": [],
1497
+ "frameId": null,
1498
+ "index": "aj",
1499
+ "roundness": null,
1500
+ "seed": 797414424,
1501
+ "version": 105,
1502
+ "versionNonce": 626327832,
1503
+ "isDeleted": false,
1504
+ "boundElements": null,
1505
+ "updated": 1747666036988,
1506
+ "link": null,
1507
+ "locked": false,
1508
+ "text": "- Try docling\nbetter than pymupdf\n- export in markdown",
1509
+ "fontSize": 20,
1510
+ "fontFamily": 5,
1511
+ "textAlign": "left",
1512
+ "verticalAlign": "top",
1513
+ "containerId": null,
1514
+ "originalText": "- Try docling\nbetter than pymupdf\n- export in markdown",
1515
+ "autoResize": true,
1516
+ "lineHeight": 1.25
1517
+ },
1518
+ {
1519
+ "id": "vrvvrH4qCb-xGQbERHh1C",
1520
+ "type": "text",
1521
+ "x": -173.8207965491797,
1522
+ "y": 809.8126582778248,
1523
+ "width": 153.69985961914062,
1524
+ "height": 50,
1525
+ "angle": 0,
1526
+ "strokeColor": "#1e1e1e",
1527
+ "backgroundColor": "transparent",
1528
+ "fillStyle": "solid",
1529
+ "strokeWidth": 2,
1530
+ "strokeStyle": "solid",
1531
+ "roughness": 1,
1532
+ "opacity": 100,
1533
+ "groupIds": [],
1534
+ "frameId": null,
1535
+ "index": "ak",
1536
+ "roundness": null,
1537
+ "seed": 1771655704,
1538
+ "version": 40,
1539
+ "versionNonce": 2071710824,
1540
+ "isDeleted": false,
1541
+ "boundElements": null,
1542
+ "updated": 1747666171125,
1543
+ "link": null,
1544
+ "locked": false,
1545
+ "text": "Jina embeddings\n1024",
1546
+ "fontSize": 20,
1547
+ "fontFamily": 5,
1548
+ "textAlign": "left",
1549
+ "verticalAlign": "top",
1550
+ "containerId": null,
1551
+ "originalText": "Jina embeddings\n1024",
1552
+ "autoResize": true,
1553
+ "lineHeight": 1.25
1554
+ }
1555
+ ],
1556
+ "appState": {
1557
+ "gridSize": 20,
1558
+ "gridStep": 5,
1559
+ "gridModeEnabled": false,
1560
+ "viewBackgroundColor": "#ffffff"
1561
+ },
1562
+ "files": {}
1563
+ }
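The Excalidraw notes above describe the retrieval loop: convert the user question to an embedding, fetch the top-k chunks from the vector DB, then layer Galileo evals on top. A minimal, dependency-free sketch of the retrieval step — toy 3-dimensional vectors stand in for the real 1024-dimensional Jina/sentence-transformer embeddings, and the in-memory list stands in for Chroma/Milvus:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    # chunks: list of (text, embedding) pairs, as a vector DB would store them.
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Hypothetical corpus; real chunks would come from the processed fin-data files.
chunks = [
    ("revenue grew 10%", [0.9, 0.1, 0.0]),
    ("office dog policy", [0.0, 0.2, 0.9]),
    ("net income fell",   [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], chunks))  # the two finance chunks rank first
```

In the actual pipeline the query embedding and chunk embeddings come from the same model (the notes mention Jina at dimension 1024, with Matryoshka truncation as an option), and the vector DB performs this nearest-neighbour search server-side.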
fin-data/processed/test_file ADDED
File without changes
fin-data/processed/vector_db/test_file ADDED
File without changes
main.py ADDED
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from galileo-poc!")
+
+
+ if __name__ == "__main__":
+     main()
pyproject.toml ADDED
@@ -0,0 +1,27 @@
+ [project]
+ name = "galileo-poc"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12.10"
+ dependencies = [
+     "fastapi>=0.115.9",
+     "galileo-observe>=1.23.0",
+     "galileo-protect>=0.17.1",
+     "google-genai>=1.19.0",
+     "langchain-text-splitters>=0.3.8",
+     "openai>=1.95.1",
+     "pandas>=2.3.0",
+     "promptquality>=1.11.3",
+     "pymilvus>=2.5.11",
+     "pymupdf>=1.26.0",
+     "pymupdf4llm>=0.0.24",
+     "python-dotenv>=1.1.0",
+     "python-multipart>=0.0.20",
+     "requests>=2.32.4",
+     "sentence-transformers>=4.1.0",
+     "streamlit>=1.45.1",
+     "torch>=2.7.1",
+     "uvicorn>=0.34.3",
+     "watchdog>=6.0.0",
+ ]
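With this `pyproject.toml` and the committed `uv.lock`, local setup mirrors what the Dockerfile does (commands assume `uv` is installed; the uvicorn entrypoint matches the Dockerfile's CMD, and the Streamlit path is this repo's `ui/app.py`):

```shell
# Install locked dependencies into a local virtual environment
uv sync --frozen

# Run the FastAPI backend (same entrypoint as the Dockerfile CMD)
uv run uvicorn backend.api.main:app --host 0.0.0.0 --port 8000

# In another terminal, run the Streamlit UI against the backend
uv run streamlit run ui/app.py
```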
requirements.txt ADDED
@@ -0,0 +1,20 @@
+ fastapi
+ uvicorn
+ requests
+ streamlit
+ python-dotenv
+ PyMuPDF
+ chromadb
+ google-genai
+ pandas
+ pymilvus
+ docling
+ sentence-transformers
+ torch
+ langchain-text-splitters
+ watchdog
+ galileo-observe>=1.23.0
+ galileo-protect>=0.17.1
+ pymupdf4llm
+ python-multipart
+ promptquality
ui/app.py ADDED
@@ -0,0 +1,52 @@
+ import streamlit as st
+ import requests
+ from dotenv import load_dotenv
+
+ # Load environment variables
+ load_dotenv()
+
+
+ if __name__ == "__main__":
+     # Streamlit UI
+     st.title("RFP Q/A")
+
+     # Sidebar for settings
+     with st.sidebar:
+         st.title("Settings")
+         st.divider()
+         protect_enabled = st.toggle("Galileo Protect", value=False,
+                                     help="Enable content protection with Galileo")
+
+     # Callback run when Enter is pressed in the text input
+     def on_enter():
+         st.session_state["run_search"] = True
+
+     # User input
+     st.text_input("Ask your question:", key="query", on_change=on_enter)
+
+     # Search button
+     if st.button("Search"):
+         st.session_state["run_search"] = True
+
+     # Run the search if triggered by the button or by Enter
+     if st.session_state.get("run_search"):
+         st.session_state["run_search"] = False  # reset so the search does not repeat on every rerun
+         try:
+             query = st.session_state.get("query", "").strip()
+             # Make request to the FastAPI endpoint
+             response = requests.post(
+                 "http://localhost:8000/rag/search",
+                 json={"query": query, "protect_enabled": protect_enabled},
+                 timeout=60,
+             )
+
+             if response.status_code == 200:
+                 result = response.json()
+                 st.success("Search Results:")
+                 st.write(result["response"])
+             else:
+                 st.error(f"Error: {response.text}")
+
+         except Exception as e:
+             st.error(f"Error: {e}")
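The UI above posts `{"query": ..., "protect_enabled": ...}` to `/rag/search` and expects a JSON body with a `response` field. The real handler lives in `backend/api/main.py`, which is not shown in this diff; the following is only a framework-free sketch of that request/response contract, with all names and behaviour assumed for illustration:

```python
def rag_search(payload: dict) -> dict:
    """Hypothetical stand-in for the POST /rag/search handler the UI calls."""
    query = payload.get("query", "").strip()
    if not query:
        return {"response": "Please enter a question."}
    # The real pipeline would: embed the query, fetch top-k chunks from the
    # vector DB, generate an answer, and (when protect_enabled is set) pass
    # the answer through Galileo Protect before returning it.
    answer = f"(stub answer for: {query})"
    return {"response": answer}

print(rag_search({"query": "What was Q4 revenue?", "protect_enabled": False}))
```

Keeping the contract this small means the Streamlit client only ever needs `result["response"]`, which is exactly what `st.write` renders above.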
uv.lock ADDED
The diff for this file is too large to render. See raw diff