tmdeptrai3012 committed on
Commit
11057a5
·
verified ·
1 Parent(s): f459b26

deploy 2025-08-09 18:22:12

Browse files
README.md CHANGED
@@ -10,6 +10,7 @@ A MLOps project of an AI-powered RAG Chatbot for understanding and querying lega
10
  - [Getting Started](#getting-started)
11
  - [API Endpoints](#api-endpoints)
12
  - [Monitoring](#monitoring)
 
13
  - [Project Structure](#project-structure)
14
  - [License](#license)
15
  - [Acknowledgements](#acknowledgements)
@@ -42,7 +43,7 @@ Demo on Render: https://legalcontractanalyzer.onrender.com/
42
  - [X] Real-time streaming response.
43
  - [X] Contextual retrieving + querying via ChromaDB.
44
 - [X] CI pipeline with GitHub Actions.
45
- - [ ] CD pipeline with HuggingFace Space.
46
  - [X] Monitoring with Prometheus & Grafana.
47
  - [ ] Evaluation of the system (automated tests, LLM-as-judge).
48
 
@@ -196,6 +197,133 @@ In Grafana, I've built a dedicated **Queries Dashboard** to give you real-time i
196
  └── requirements.txt
197
  ```
198
 
199
  ## Licence
200
 
201
  [Apache 2.0](./LICENSE)
 
10
  - [Getting Started](#getting-started)
11
  - [API Endpoints](#api-endpoints)
12
  - [Monitoring](#monitoring)
13
+ - [Models Serving](#models-serving)
14
  - [Project Structure](#project-structure)
15
  - [License](#license)
16
  - [Acknowledgements](#acknowledgements)
 
43
  - [X] Real-time streaming response.
44
  - [X] Contextual retrieving + querying via ChromaDB.
45
 - [X] CI pipeline with GitHub Actions.
46
+ - [X] CD pipeline with Render.
47
  - [X] Monitoring with Prometheus & Grafana.
48
  - [ ] Evaluation of the system (automated tests, LLM-as-judge).
49
 
 
197
  └── requirements.txt
198
  ```
199
 
200
+ ## Models Serving (optional)
201
+
202
+ If you dig into the code, you will find the link https://glowing-workable-arachnid.ngrok-free.app/docs pointing to an OpenAI-compatible API server. That's because I deploy the models on my school's server and tunnel them out via ngrok xD.
203
+
204
+ So if you want to start your own model-serving server (assuming you have a really strong DGX, H100, A100, or just 3 RTX 3090s like me xD), here are the steps:
205
+
206
+ ### 1. Installation
207
+
208
+ #### 1.1 Install FastChat
209
+
210
+ FastChat is the backend server that runs multiple model workers and serves them via an OpenAI-compatible API.
211
+
212
+ ```bash
213
+ # Create and activate virtual environment (optional but recommended)
214
+ conda create -n fastchat python=3.10 -y
215
+ conda activate fastchat
216
+
217
+ # Install FastChat
218
+ pip install fschat
219
+ ```
220
+
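+ To sanity-check the install, a quick version print (assuming the package exposes `__version__`, which recent releases do):
+
+ ```python
+ import fastchat
+
+ # Prints the installed FastChat version, e.g. 0.2.x
+ print(fastchat.__version__)
+ ```
+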
221
+ > **Tip:** If you want GPU acceleration, make sure PyTorch with CUDA is installed before installing FastChat:
222
+ >
223
+ > ```bash
224
+ > pip install torch --index-url https://download.pytorch.org/whl/cu121
225
+ > ```
226
+
227
+ #### 1.2 Install ngrok
228
+
229
+ ngrok lets you expose your FastChat API to the internet.
230
+
231
+ ```bash
232
+ curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
233
+ | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
234
+ && echo "deb https://ngrok-agent.s3.amazonaws.com bookworm main" \
235
+ | sudo tee /etc/apt/sources.list.d/ngrok.list \
236
+ && sudo apt update \
237
+ && sudo apt install ngrok
238
+ ```
239
+
240
+ If you have trouble installing ngrok this way, grab a build from the official downloads page: https://ngrok.com/downloads/
241
+
242
+ Log into [ngrok](https://dashboard.ngrok.com/get-started) and get your auth token:
243
+
244
+ ```bash
245
+ ngrok config add-authtoken <YOUR_AUTH_TOKEN>
246
+ ```
247
+
248
+ ---
249
+
250
+ ### 2. 🖥️ Configurable FastChat Run Script
251
+
252
+ In the `model_serving/` folder, check out the file `serve_models.sh` and make it executable:
253
+
254
+ ```bash
255
+ chmod +x serve_models.sh
256
+ ```
257
+
258
+ ---
259
+
260
+ ### 3. Usage Examples
261
+
262
+ #### Run with defaults (Qwen3-0.6B + Qwen3-Embedding-0.6B)
263
+
264
+ ```bash
265
+ ./model_serving/serve_models.sh
266
+ ```
267
+
268
+ #### Run with custom models, ports, and ngrok URL
269
+
270
+ ```bash
271
+ ./model_serving/serve_models.sh Qwen/Qwen2-7B Qwen2-7B 21010 \
272
+ Qwen/Qwen2-Embedding Qwen2-Embedding 21011 \
273
+ 8000 https://mycustomtunnel.ngrok-free.app
274
+ ```
275
+
276
+ This will:
277
+
278
+ * Run `Qwen2-7B` chat model on port `21010`.
279
+ * Run `Qwen2-Embedding` embedding model on port `21011`.
280
+ * Serve API on port `8000`.
281
+ * Tunnel via the given ngrok URL.
282
+
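+ Because the workers take a little while to load the weights, it can help to poll the API before pointing anything at it. A small readiness check (hypothetical helper, not part of the repo):
+
+ ```python
+ import time
+
+ import requests
+
+ URL = "https://YOUR_NGROK_URL/v1/models"  # replace with your tunnel
+
+ for attempt in range(30):
+     try:
+         if requests.get(URL, timeout=5).ok:
+             print("API is up")
+             break
+     except requests.RequestException:
+         pass  # server not reachable yet
+     time.sleep(2)
+ else:
+     print("API did not come up in time")
+ ```
+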
283
+ ---
284
+
285
+ ### 4. 🔍 Testing the API
286
+
287
+ List all models:
288
+
289
+ ```bash
290
+ curl https://YOUR_NGROK_URL/v1/models
291
+ ```
292
+
293
+ Or you can open it in a browser, for example: https://glowing-workable-arachnid.ngrok-free.app/v1/models
294
+
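+ Since the server mimics the OpenAI API, the official `openai` Python client works as well; a minimal sketch (the placeholder URL and dummy key are assumptions — FastChat doesn't verify the key unless you configure one on the server):
+
+ ```python
+ from openai import OpenAI
+
+ # Point the client at the tunnel instead of api.openai.com.
+ client = OpenAI(api_key="placeholder", base_url="https://YOUR_NGROK_URL/v1")
+
+ for model in client.models.list().data:
+     print(model.id)  # e.g. Qwen3-0.6B, Qwen3-Embedding-0.6B
+ ```
+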
295
+ Get embeddings:
296
+
297
+ ```bash
298
+ curl https://YOUR_NGROK_URL/v1/embeddings \
299
+ -H "Content-Type: application/json" \
300
+ -d '{
301
+ "model": "Qwen3-Embedding-0.6B",
302
+ "input": "FastChat is running two models now!"
303
+ }'
304
+ ```
305
+
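+ The same request from Python, mirroring what `backend/ingest.py` does (URL placeholder assumed):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(api_key="placeholder", base_url="https://YOUR_NGROK_URL/v1")
+
+ resp = client.embeddings.create(
+     model="Qwen3-Embedding-0.6B",
+     input=["FastChat is running two models now!"],
+ )
+ print(len(resp.data[0].embedding))  # embedding dimensionality
+ ```
+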
306
+ Chat completion:
307
+
308
+ ```bash
309
+ curl https://YOUR_NGROK_URL/v1/chat/completions \
310
+ -H "Content-Type: application/json" \
311
+ -d '{
312
+ "model": "Qwen3-0.6B",
313
+ "messages": [{"role": "user", "content": "Hello from FastChat!"}]
314
+ }'
315
+ ```
316
+
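+ And the Python equivalent, essentially how the app talks to the chat model (non-streaming sketch; the app itself streams):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(api_key="placeholder", base_url="https://YOUR_NGROK_URL/v1")
+
+ chat = client.chat.completions.create(
+     model="Qwen3-0.6B",
+     messages=[{"role": "user", "content": "Hello from FastChat!"}],
+ )
+ print(chat.choices[0].message.content)
+ ```
+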
317
+ ---
318
+
319
+ ### 5. Notes
320
+
321
+ * Always **set different ports** for each worker.
322
+ * `--worker-address` **must match** the worker's host:port so FastChat doesn't overwrite registrations.
323
+ * The ngrok **free plan** requires reserving a subdomain before you can set a fixed `--url`. You can claim your own free static subdomain on the ngrok dashboard; otherwise, every tunnel you start gets a random public URL.
324
+ * Contact me if you get stuck ;) I'll be glad to help.
325
+
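+ One extra wiring note: `backend/query.py` currently hardcodes `BASE_URL`, while the previous version read it from a `FASTCHAT_URL` environment variable. If you run your own tunnel, an override along those lines (a sketch, not what the file does today) saves you from editing code:
+
+ ```python
+ import os
+
+ # Fall back to the public demo tunnel when FASTCHAT_URL is unset.
+ BASE_URL = os.getenv("FASTCHAT_URL", "https://glowing-workable-arachnid.ngrok-free.app/v1")
+ ```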
326
+
327
  ## Licence
328
 
329
  [Apache 2.0](./LICENSE)
backend/chroma_vector_db/chroma.sqlite3 CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:bedaaed932573c2da9f3c8ad609773f192cb3e027c9aabc8927f9567f35b288b
3
- size 3514368
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:46cc306774b0946061527a0d3673f6c5aa63d2111d8a69d176a8216390b2e62c
3
+ size 5554176
backend/ingest.py CHANGED
@@ -53,7 +53,7 @@ def chunk_paragraph(paragraph):
53
  # ─── 4) EMBEDDING VIA OPENAI ────────────────────────────────────────────────────
54
  def embed_via_openai(text_chunks):
55
  resp = openai_client.embeddings.create(
56
- model="Qwen3-0.6B",
57
  input=text_chunks
58
  )
59
  # resp.data is a list of objects with .index and .embedding
 
53
  # ─── 4) EMBEDDING VIA OPENAI ────────────────────────────────────────────────────
54
  def embed_via_openai(text_chunks):
55
  resp = openai_client.embeddings.create(
56
+ model="Qwen3-Embedding-0.6B",
57
  input=text_chunks
58
  )
59
  # resp.data is a list of objects with .index and .embedding
backend/query.py CHANGED
@@ -1,63 +1,39 @@
1
- # backend/query.py
2
- import numpy as np
3
  import chromadb
4
  from openai import OpenAI
5
  from dotenv import load_dotenv
6
  from backend.config import CHROMA_DB_PATH
7
- import os
8
-
9
  load_dotenv()
10
- API_KEY = os.getenv("OPENAI_API_KEY")
11
- BASE_URL = os.getenv("FASTCHAT_URL", "https://glowing-workable-arachnid.ngrok-free.app/v1")
12
  openai_client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
13
 
 
14
  chroma_client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
15
  collection = chroma_client.get_or_create_collection("legal_docs")
16
 
17
- def embed_texts(texts):
 
18
  resp = openai_client.embeddings.create(
19
- model="Qwen3-0.6B",
20
- input=texts
21
  )
22
- # ensure order
23
- return [item.embedding for item in sorted(resp.data, key=lambda d: d.index)]
24
-
25
- def normalize(vec):
26
- arr = np.array(vec, dtype=np.float32)
27
- return arr / (np.linalg.norm(arr) + 1e-10)
28
-
29
- def query_top_k(query_text, k=10, rerank_top_n=5):
30
- # 1) embed
31
- q_emb = embed_texts([query_text])[0]
32
- q_norm = normalize(q_emb)
33
-
34
- # 2) dense retrieval (get more candidates)
35
- results = collection.query(query_embeddings=[q_emb], n_results=k)
36
- docs = results.get('documents', [[]])[0]
37
- dists = results.get('distances', [[]])[0]
38
-
39
- # Note: Chroma distances are lower = better. We'll compute cosine from stored embeddings if available.
40
- # If you stored embeddings in collection, pull them (some Chroma versions allow include=['embeddings'])
41
- # Here we fallback to converting distance -> similarity (if the metric is cosine)
42
- sims = []
43
- for idx, doc in enumerate(docs):
44
- # try to get the stored embedding if available:
45
- try:
46
- emb = results['embeddings'][0][idx]
47
- sim = float(np.dot(q_norm, normalize(emb)))
48
- except Exception:
49
- # fallback: invert distance (only approximate)
50
- dist = dists[idx] if idx < len(dists) else 1.0
51
- sim = 1.0 - float(dist)
52
- sims.append((doc, sim))
53
-
54
- # sort by similarity desc
55
- sims.sort(key=lambda x: x[1], reverse=True)
56
-
57
- # optional: rerank top candidates with a cross-encoder here
58
-
59
- return sims[:rerank_top_n] # return top rerank_top_n with similarity
60
-
61
 
62
  # Example usage:
63
  if __name__ == "__main__":
 
1
+ import os
 
2
  import chromadb
3
  from openai import OpenAI
4
  from dotenv import load_dotenv
5
  from backend.config import CHROMA_DB_PATH
6
+ # ─── ENVIRONMENT ──────────────────────────────────────────────────────────────
 
7
  load_dotenv()
8
+ API_KEY = os.getenv("OPENAI_API_KEY", "TRANMINHDUONGDEPTRAI")
9
+ BASE_URL = "https://glowing-workable-arachnid.ngrok-free.app/v1" # or ngrok URL
10
  openai_client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
11
 
12
+ # ─── CHROMA SETUP ─────────────────────────────────────────────────────────────
13
  chroma_client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
14
  collection = chroma_client.get_or_create_collection("legal_docs")
15
 
16
+ # ─── EMBEDDING FUNCTION ───────────────────────────────────────────────────────
17
+ def embed_query(query_text):
18
  resp = openai_client.embeddings.create(
19
+ model="Qwen3-Embedding-0.6B",
20
+ input=[query_text]
21
  )
22
+ return resp.data[0].embedding
23
+
24
+ # ─── TOP-K RETRIEVAL ──────────────────────────────────────────────────────────
25
+ def query_top_k(query_text, k=5):
26
+ query_emb = embed_query(query_text)
27
+ results = collection.query(
28
+ query_embeddings=[query_emb],
29
+ n_results=k
30
+ )
31
+ # results['documents'] is a list of lists (one per query)
32
+ # results['distances'] is a list of lists (one per query)
33
+ # We'll return a list of (chunk, distance) tuples
34
+ docs = results['documents'][0] if results['documents'] else []
35
+ dists = results['distances'][0] if results['distances'] else []
36
+ return list(zip(docs, dists))
 
37
 
38
  # Example usage:
39
  if __name__ == "__main__":
model_serving/serve_models.sh ADDED
@@ -0,0 +1,67 @@
1
+ #!/bin/bash
2
+
3
+ # ==============================
4
+ # Customizable parameters
5
+ # ==============================
6
+ CHAT_MODEL_PATH=${1:-Qwen/Qwen3-0.6B} # First argument or default
7
+ CHAT_MODEL_NAME=${2:-Qwen3-0.6B} # Second argument or default
8
+ CHAT_PORT=${3:-21002}
9
+
10
+ EMBED_MODEL_PATH=${4:-Qwen/Qwen3-Embedding-0.6B} # Fourth argument or default
11
+ EMBED_MODEL_NAME=${5:-Qwen3-Embedding-0.6B} # Fifth argument or default
12
+ EMBED_PORT=${6:-21003}
13
+
14
+ API_PORT=${7:-8000}
15
+ NGROK_URL=${8:-https://example-tunnel.ngrok-free.app} # Eighth argument or default
16
+
17
+ # ==============================
18
+ # Start services
19
+ # ==============================
20
+ echo "Starting controller..."
21
+ nohup python3 -m fastchat.serve.controller \
22
+ --host localhost \
23
+ --port 21001 \
24
+ > controller.log 2>&1 &
25
+ sleep 3
26
+
27
+ echo "Starting $CHAT_MODEL_NAME worker..."
28
+ nohup python3 -m fastchat.serve.model_worker \
29
+ --model-path "$CHAT_MODEL_PATH" \
30
+ --model-name "$CHAT_MODEL_NAME" \
31
+ --host localhost \
32
+ --port $CHAT_PORT \
33
+ --worker-address "http://localhost:$CHAT_PORT" \
34
+ --controller-address http://localhost:21001 \
35
+ > worker_chat.log 2>&1 &
36
+ sleep 5
37
+
38
+ echo "Starting $EMBED_MODEL_NAME worker..."
39
+ nohup python3 -m fastchat.serve.model_worker \
40
+ --model-path "$EMBED_MODEL_PATH" \
41
+ --model-name "$EMBED_MODEL_NAME" \
42
+ --host localhost \
43
+ --port $EMBED_PORT \
44
+ --worker-address "http://localhost:$EMBED_PORT" \
45
+ --controller-address http://localhost:21001 \
46
+ > worker_embed.log 2>&1 &
47
+ sleep 5
48
+
49
+ echo "Starting OpenAI API server on port $API_PORT..."
50
+ nohup python3 -m fastchat.serve.openai_api_server \
51
+ --host 0.0.0.0 \
52
+ --port $API_PORT \
53
+ --controller-address http://localhost:21001 \
54
+ --allowed-origins '["*"]' \
55
+ > api_server.log 2>&1 &
56
+
57
+ echo "βœ… All servers started!"
58
+ echo "Logs: controller.log, worker_chat.log, worker_embed.log, api_server.log"
59
+
60
+ # ==============================
61
+ # Start ngrok tunnel
62
+ # ==============================
63
+ while true; do
64
+ ngrok http $API_PORT --url "$NGROK_URL" --log=stdout
65
+ echo "ngrok exited unexpectedly, restarting in 5s…" >&2
66
+ sleep 5
67
+ done