Akash-Dragon committed
Commit 0000719 · verified · 1 Parent(s): 3579588

Upload 5 files

Files changed (5)
  1. README.md +34 -6
  2. app.py +334 -0
  3. fdic_section_3_2_chunks_refined.json +0 -0
  4. packages.txt +2 -0
  5. requirements.txt +7 -0
README.md CHANGED
@@ -1,13 +1,41 @@
  ---
- title: Regulatory Bot
- emoji: 🏒
+ title: Regulatory Loan Evaluation Assistant
+ emoji: 🏦
  colorFrom: blue
- colorTo: gray
+ colorTo: green
  sdk: gradio
- sdk_version: 6.3.0
+ sdk_version: "4.0"
  app_file: app.py
  pinned: false
- short_description: Prompt Engineering Regulatory bot based on section 3.2
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ## 📄 Regulatory Loan Evaluation Assistant
+
+ This application is a **prompt-engineered regulatory reasoning system** designed for
+ loan evaluation in accordance with the **FDIC RMS Manual of Examination Policies – Section 3.2 (Loans)**.
+
+ ### 🔍 What this system does
+ - Extracts **structured loan facts** from uploaded loan documents using OCR
+ - Answers user questions using **only**:
+   - Extracted loan facts, and
+   - FDIC Section 3.2 regulatory guidance
+ - Refuses non-loan or out-of-scope questions
+ - Avoids approvals, rejections, or predictions
+
+ ### 🧠 Key Design Principles
+ - **Prompt engineering only** (no model training or fine-tuning)
+ - **Single source of truth** for regulatory reasoning
+ - **Audit-ready**, document-grounded responses
+ - **Regulatory tone** aligned with examiner expectations
+
+ ### 📌 Inputs
+ - Optional loan documents (PDF / image)
+ - User regulatory or loan-related questions
+
+ ### 🚫 Explicitly excluded
+ - Credit scoring
+ - Automated decisions
+ - OCR beyond basic text extraction
+ - External data sources
+
+ This project is intended for **educational and regulatory analysis purposes only**.
app.py ADDED
@@ -0,0 +1,334 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ # =========================================================
+ # 1. IMPORTS & ENV
+ # =========================================================
+ import os
+ import json
+ import re
+ import hashlib
+ from dotenv import load_dotenv
+ from PIL import Image
+
+ import gradio as gr
+ import pytesseract
+ from pdf2image import convert_from_path
+ from groq import Groq
+
+ load_dotenv()
+
+ # =========================================================
+ # 2. LOAD FDIC SECTION 3.2 ONCE (GLOBAL)
+ # =========================================================
+ with open("fdic_section_3_2_chunks_refined.json", encoding="utf-8") as f:
+     FDIC_CHUNKS = json.load(f)
+
+ # =========================================================
+ # 3. GROQ CLIENT & MODELS
+ # =========================================================
+ client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
+
+ MODEL_LLM1 = "llama-3.1-8b-instant"  # OCR → loan summary
+ MODEL_LLM2 = "llama-3.1-8b-instant"  # topic indexing
+ MODEL_LLM4 = "meta-llama/llama-4-scout-17b-16e-instruct"  # reasoning
+
+ # =========================================================
+ # 4. SESSION STATE
+ # =========================================================
+ SESSION_STATE = {
+     "ocr_text": "",
+     "loan_summary": None
+ }
+
+ OCR_CACHE = {}
+
+ # =========================================================
+ # 5. GUARDRAILS
+ # =========================================================
+ NON_LOAN_KEYWORDS = [
+     "movie", "music", "sports", "weather", "joke", "recipe",
+     "health", "cold", "fever", "doctor", "medicine",
+     "politics", "election"
+ ]
+
+ def sanitize_user_input(text):
+     return text.strip()[:5000] if text else ""
+
+ def is_non_loan_question(text):
+     return any(k in text.lower() for k in NON_LOAN_KEYWORDS)
+
+ # =========================================================
+ # 6. SAFE JSON PARSER
+ # =========================================================
+ def safe_json_loads(text, stage):
+     if not text:
+         raise ValueError(f"{stage} returned empty response")
+
+     text = re.sub(r"```json|```", "", text).strip()
+     match = re.search(r"\{.*\}", text, re.DOTALL)
+
+     if not match:
+         raise ValueError(f"{stage} returned no JSON:\n{text}")
+
+     return json.loads(match.group())
+
+ # =========================================================
+ # 7. OCR HELPERS
+ # =========================================================
+ MAX_PAGES = 5
+
+ def file_hash(path, max_bytes=1024 * 1024):
+     h = hashlib.md5()
+     with open(path, "rb") as f:
+         h.update(f.read(max_bytes))
+     return h.hexdigest()
+
+ def ocr_file(path):
+     if path.lower().endswith(".pdf"):
+         text = ""
+         pages = convert_from_path(path, dpi=200)[:MAX_PAGES]
+         for p in pages:
+             text += pytesseract.image_to_string(p.convert("L")) + "\n"
+         return text.strip()
+     else:
+         img = Image.open(path).convert("L")
+         return pytesseract.image_to_string(img).strip()
+
+ def run_ocr_pipeline(uploaded_files):
+     texts = []
+     for f in uploaded_files:
+         path = str(f)
+         key = file_hash(path)
+         if key not in OCR_CACHE:
+             OCR_CACHE[key] = ocr_file(path)
+         texts.append(OCR_CACHE[key])
+     return "\n".join(texts)
+
+ # =========================================================
+ # 8. LOAN SCHEMA
+ # =========================================================
+ LOAN_SCHEMA = """<same as your original schema>"""
+
+ # =========================================================
+ # 9. SYSTEM PROMPTS
+ # =========================================================
+ LLM1_SYSTEM_PROMPT = f"""
+ You are an information extraction engine for bank loan documents.
+
+ Task:
+ - Extract ONLY facts that are explicitly stated in the text.
+ - Do NOT infer, assume, normalize, or calculate anything.
+ - If a value is missing or unclear, use null or "unknown".
+
+ Rules:
+ - Use ONLY the provided OCR text.
+ - Do NOT add explanations.
+ - Do NOT reference regulations.
+ - Output MUST strictly match the schema below.
+ - Return ONLY valid JSON.
+
+ Schema:
+ {LOAN_SCHEMA}
+ """
+
+ LLM2_SYSTEM_PROMPT = """
+ You are a regulatory topic indexing assistant.
+
+ Inputs:
+ - A user question
+ - A list of FDIC RMS Manual Section 3.2 headings with chunk_ids
+
+ Task:
+ - Select ONLY the chunk_ids whose headings are directly relevant
+   to answering the user question.
+ - Base your decision ONLY on the heading titles.
+ - Do NOT interpret or summarize policy text.
+
+ Rules:
+ - Select between 1 and 6 chunk_ids.
+ - If no headings are relevant, return an empty list.
+ - Do NOT explain your reasoning.
+ - Return ONLY valid JSON.
+
+ Output format:
+ {
+   "selected_chunk_ids": ["string"]
+ }
+ """
+
+ LLM4_SYSTEM_PROMPT = """
+ You are a regulatory-aligned loan evaluation assistant.
+
+ You are given TWO authoritative sources:
+
+ SOURCE A — Loan Summary
+ • Structured facts extracted from uploaded loan documents
+ • This is the ONLY source for borrower name, loan type, interest rate,
+   amounts, collateral, and other loan-specific details
+
+ SOURCE B — FDIC RMS Manual Section 3.2 (Loans)
+ • This is the ONLY source for regulatory objectives, examiner expectations,
+   loan review systems, risk management, and policy intent
+
+ RULES (STRICT):
+ 1. If the user asks for loan details → answer ONLY from SOURCE A
+ 2. If the user asks regulatory or examiner questions → answer ONLY from SOURCE B
+ 3. If the user asks a mixed question → clearly separate:
+    • factual loan details (SOURCE A)
+    • regulatory interpretation (SOURCE B)
+ 4. Do NOT infer or assume missing facts
+ 5. Do NOT use general banking knowledge
+ 6. Do NOT approve, reject, or predict loan outcomes
+ 7. If required information is missing, explicitly state that it is not available
+
+ Tone:
+ Professional, neutral, examiner-style.
+ No markdown. No speculation.
+
+ """
+
+ NO_DOC_PROMPT = f"""
+ You are creating a placeholder loan summary.
+
+ Rules:
+ - Use ONLY the schema provided.
+ - Do NOT infer or fabricate details.
+ - Populate fields only if explicitly stated in the user input.
+ - Otherwise, use null or "unknown".
+ - Return ONLY valid JSON.
+
+ Schema:
+ {LOAN_SCHEMA}
+ """
+
+ # =========================================================
+ # 10. LLM CALL
+ # =========================================================
+ def call_llm(system_prompt, user_prompt, model, temperature=0):
+     r = client.chat.completions.create(
+         model=model,
+         temperature=temperature,
+         messages=[
+             {"role": "system", "content": system_prompt},
+             {"role": "user", "content": user_prompt}
+         ]
+     )
+     return r.choices[0].message.content.strip()
+
+ # =========================================================
+ # 11. MAIN LOGIC (FINAL)
+ # =========================================================
+ def process_request(user_text, uploaded_files):
+     user_text = sanitize_user_input(user_text)
+
+     # 🚫 NON-LOAN GUARDRAIL
+     if is_non_loan_question(user_text):
+         return "", "⚠️ Only FDIC Section 3.2 loan and regulatory questions are supported."
+
+     # ======================================================
+     # LLM-1: OCR → Loan Summary (ONLY if files exist)
+     # ======================================================
+     if uploaded_files:
+         ocr_text = run_ocr_pipeline(uploaded_files)
+
+         loan_summary = safe_json_loads(
+             call_llm(
+                 LLM1_SYSTEM_PROMPT,
+                 ocr_text,
+                 MODEL_LLM1
+             ),
+             "LLM-1"
+         )
+
+         SESSION_STATE["ocr_text"] = ocr_text
+         SESSION_STATE["loan_summary"] = loan_summary
+
+     else:
+         # Follow-up or regulatory-only question
+         ocr_text = SESSION_STATE.get("ocr_text", "")
+         loan_summary = SESSION_STATE.get("loan_summary")
+
+         # ❗ Do NOT force NO-DOC extraction for regulatory questions
+         if loan_summary is None:
+             loan_summary = {
+                 "note": "No loan documents uploaded. Loan-specific facts unavailable."
+             }
+
+     # ======================================================
+     # LLM-2: FDIC Section 3.2 Topic Indexing (HEADINGS ONLY)
+     # ======================================================
+     headings_payload = {
+         "user_question": user_text,
+         "fdic_headings": [
+             {
+                 "chunk_id": c["chunk_id"],
+                 "heading": c.get("subtopic") or c.get("title")
+             }
+             for c in FDIC_CHUNKS
+         ]
+     }
+
+     selected_ids = safe_json_loads(
+         call_llm(
+             LLM2_SYSTEM_PROMPT,
+             json.dumps(headings_payload),
+             MODEL_LLM2
+         ),
+         "LLM-2"
+     ).get("selected_chunk_ids", [])
+
+     selected_chunks = [
+         {
+             "chunk_id": c["chunk_id"],
+             "heading": c.get("subtopic") or c.get("title")
+         }
+         for c in FDIC_CHUNKS
+         if c["chunk_id"] in set(selected_ids)
+     ][:6]  # 🔒 HARD CAP (very important)
+
+
+     # ======================================================
+     # LLM-4: FINAL REGULATORY + FACTUAL ANSWER
+     # ======================================================
+     llm4_payload = {
+         "loan_summary": loan_summary,
+         "fdic_section_3_2": selected_chunks,
+         "user_question": user_text
+     }
+
+     answer = call_llm(
+         LLM4_SYSTEM_PROMPT,
+         json.dumps(llm4_payload),
+         MODEL_LLM4,
+         temperature=0.2
+     )
+
+     return ocr_text, answer
+
+ # =========================================================
+ # 12. GRADIO UI
+ # =========================================================
+ def chat_handler(user_text, uploaded_files, chat_history):
+     chat_history = chat_history or []
+     _, answer = process_request(user_text, uploaded_files)
+     chat_history.append({"role": "user", "content": user_text})
+     chat_history.append({"role": "assistant", "content": answer})
+     return chat_history
+
+ with gr.Blocks(title="Regulatory Loan Evaluation Assistant") as demo:
+     gr.Markdown("## 📄 Regulatory Loan Evaluation Assistant")
+     chat = gr.Chatbot(height=450)
+     files = gr.File(
+         label="Upload Loan Documents (Optional)",
+         file_types=[".pdf", ".png", ".jpg", ".jpeg"],
+         file_count="multiple"
+     )
+     user_input = gr.Textbox(placeholder="Ask a regulatory or loan question")
+     gr.Button("Send").click(
+         fn=chat_handler,
+         inputs=[user_input, files, chat],
+         outputs=[chat]
+     )
+
+ demo.launch()
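
Review note on the guardrail in section 5 of app.py: `is_non_loan_question` uses a plain substring test, so a keyword such as "cold" also fires inside an unrelated word like "scolded". A minimal sketch of a whole-word variant follows. The names `_NON_LOAN_RE` and `is_non_loan_question_strict` are hypothetical and not part of this commit; the keyword list is copied from the diff.

```python
import re

# Same keywords as NON_LOAN_KEYWORDS in app.py.
NON_LOAN_KEYWORDS = [
    "movie", "music", "sports", "weather", "joke", "recipe",
    "health", "cold", "fever", "doctor", "medicine",
    "politics", "election",
]

# Hypothetical helper: one compiled pattern that matches any keyword
# only when it appears as a whole word, ignoring case.
_NON_LOAN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in NON_LOAN_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_non_loan_question_strict(text):
    """Return True if text contains a blocked keyword as a whole word."""
    return bool(_NON_LOAN_RE.search(text or ""))
```

This only narrows accidental matches. A loan question that legitimately uses a listed word on its own, for example "financial health of the borrower", would still be refused, so the keyword list itself may need review.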
fdic_section_3_2_chunks_refined.json ADDED
The diff for this file is too large to render. See raw diff
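
Review note: the JSON body is not rendered here, so its structure can only be inferred from how app.py reads it, namely `c["chunk_id"]` and `c.get("subtopic") or c.get("title")` for each chunk. The sketch below checks just those fields at load time. `validate_chunks` and `load_chunks` are hypothetical helpers, and requiring a non-empty heading is an assumption that is stricter than the committed code.

```python
import json


def validate_chunks(chunks):
    """Check the fields app.py reads from each chunk; return the chunk count."""
    if not isinstance(chunks, list):
        raise ValueError("expected a JSON array of chunk objects")
    for i, c in enumerate(chunks):
        if not isinstance(c, dict) or "chunk_id" not in c:
            raise ValueError(f"chunk at index {i} has no chunk_id")
        # app.py falls back from "subtopic" to "title" for the heading.
        if not (c.get("subtopic") or c.get("title")):
            raise ValueError(f"chunk {c['chunk_id']!r} has no subtopic or title")
    return len(chunks)


def load_chunks(path):
    """Load the chunk file and validate it before use."""
    with open(path, encoding="utf-8") as f:
        chunks = json.load(f)
    validate_chunks(chunks)
    return chunks
```

Failing at startup with a named chunk is usually easier to diagnose than a `KeyError` raised in the middle of a user request.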
 
packages.txt ADDED
@@ -0,0 +1,2 @@
+ tesseract-ocr
+ poppler-utils
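
Review note: these two system packages supply the executables that `pytesseract` and `pdf2image` shell out to (`tesseract`, and `pdftoppm` from Poppler). A small startup check, sketched below, would surface a missing binary before the first upload. `REQUIRED_BINARIES` and `missing_binaries` are hypothetical names, not part of this commit.

```python
import shutil

# Executables assumed to be provided by the packages in packages.txt:
# tesseract-ocr installs "tesseract", poppler-utils installs "pdftoppm".
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_binaries(names=REQUIRED_BINARIES):
    """Return the names from `names` that are not found on PATH."""
    return [n for n in names if shutil.which(n) is None]
```

One possible use is to call `missing_binaries()` once before `demo.launch()` and raise a `RuntimeError` listing whatever it returns.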
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio>=4.0
+ pytesseract
+ pdf2image
+ pillow
+ python-dotenv
+ groq
+ opencv-python-headless