Yosef Skolnick committed on
Commit
bbde628
·
1 Parent(s): 5843f2c

Enhance RAG processing and UI components


1. Updated `run_retrieval_step` to accept an optional `original_query` parameter for improved document retrieval.
2. Refactored `retrieve_documents` to be asynchronous, enhancing performance during document queries.
3. Introduced a new `format_source_html` function for consistent HTML formatting of source documents in the chat UI.
4. Improved mixed language text handling in the chat display, ensuring proper RTL support for Hebrew content.
5. Added logging for better debugging and error handling in the RAG processing pipeline.
6. Enhanced sidebar functionality with improved language and font selection management.
7. Updated task tracking guidelines in `memorybank/tasks.md` to reflect recent project developments.
8. Removed outdated `rules_compliance.md` file to streamline documentation.
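
Items 1 and 2 can be pictured with a small sketch. The commit message does not show the real signatures, so everything here is an assumption made for illustration: the `_blocking_search` helper, the parameter names, and the choice to prefer `original_query` over a rewritten query when both are present.

```python
import asyncio
from typing import Any, Dict, List, Optional


def _blocking_search(query: str, n_retrieve: int) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a synchronous vector-store query."""
    return [{"id": i, "query": query} for i in range(n_retrieve)]


async def retrieve_documents(query: str, n_retrieve: int = 3) -> List[Dict[str, Any]]:
    """Run the blocking search in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(_blocking_search, query, n_retrieve)


async def run_retrieval_step(
    query: str,
    n_retrieve: int = 3,
    original_query: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Search with the user's original wording when it is supplied (assumed behavior)."""
    search_text = original_query if original_query else query
    return await retrieve_documents(search_text, n_retrieve)
```

Calling `asyncio.run(run_retrieval_step("rewritten", 2, original_query="original"))` would search with the original wording; leaving `original_query` unset falls back to `query`.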

.cursor/rules/task-tracking.mdc ADDED
@@ -0,0 +1,118 @@
+ ---
+ description:
+ globs:
+ alwaysApply: true
+ ---
+ # Task Tracking Guidelines
+
+ ## Overview
+ This rule provides guidelines for maintaining and updating [memorybank/tasks.md](mdc:memorybank/tasks.md) throughout the development process. The task list should be updated after each significant process step to maintain accurate project status.
+
+ ## Update Requirements
+
+ ### After Each Process Step
+ 1. Update "Last Updated" date in Executive Summary
+ 2. Adjust task counts in Executive Summary
+ 3. Move completed tasks to appropriate sections
+ 4. Update progress indicators (✓, [ ], [x])
+ 5. Add any new tasks discovered during the process
+
+ ### Section-Specific Updates
+
+ #### Executive Summary
+ - Update Project Status if phase changes
+ - Recalculate Pending and Completed task counts
+ - Update Complexity Level if assessment changes
+
+ #### Branch Achievements
+ - Add new achievements with ✓ prefix
+ - Group achievements by category
+ - Include specific implementation details
+
+ #### Active Development Tasks
+ - Update priority levels if changed
+ - Add new technical debt items discovered
+ - Update security improvements needed
+ - Add new UX improvement requirements
+ - Maintain clear issue references (e.g., RET-DYNAMIC-IMPORTS)
+
+ #### Project Notes
+ - Add timestamped entries for significant changes
+ - Update rules compliance status
+ - Document new issues identified
+ - Track implementation progress
+
+ #### Task Archive
+ - Move completed tasks with completion timestamps
+ - Maintain categorization of archived tasks
+ - Include brief completion notes if relevant
+
+ ## Update Process
+
+ ### Pre-Update Checks
+ 1. Review current task status
+ 2. Identify completed tasks
+ 3. List new tasks discovered
+ 4. Note any priority changes
+
+ ### Update Steps
+ 1. Update Executive Summary metrics
+ 2. Add new achievements if applicable
+ 3. Update Active Development Tasks
+    - Mark completed tasks
+    - Add new tasks
+    - Update priorities
+ 4. Add relevant Project Notes
+ 5. Move completed tasks to Archive
+
+ ### Post-Update Validation
+ 1. Verify all sections are properly formatted
+ 2. Check timestamp accuracy
+ 3. Validate task counts match
+ 4. Ensure proper categorization
+
+ ## Format Requirements
+
+ ### Task Format
+ ```markdown
+ - [ ] **Task Name** (ISSUE-REFERENCE)
+   - File: `filename.py` (Line XX)
+   - Issue: Brief description
+   - Solution: Proposed solution
+ ```
+
+ ### Achievement Format
+ ```markdown
+ - ✓ Achievement description with specific details
+ ```
+
+ ### Project Note Format
+ ```markdown
+ - Note description (YYYY-MM-DD HH:MM:SS)
+ ```
+
+ ## Automation Hooks
+
+ ### Pre-Commit
+ - Validate task.md format
+ - Update task counts
+ - Verify timestamps
+
+ ### Post-Process
+ - Check for completed tasks
+ - Update achievement list
+ - Recalculate metrics
+
+ ## Error Prevention
+ - Maintain consistent formatting
+ - Use proper timestamp format
+ - Keep issue references consistent
+ - Preserve existing categories
+ - Maintain proper indentation
+
+ ## Best Practices
+ 1. Update tasks.md immediately after process completion
+ 2. Include specific file references when relevant
+ 3. Maintain clear issue tracking references
+ 4. Keep achievement descriptions specific and measurable
+ 5. Use consistent formatting throughout the document
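
The "Pre-Commit" hooks named in this rule file (format validation, task counts, timestamp checks) are described but not implemented in the commit. A minimal sketch of how such checks might look, assuming the Task Format and Project Note Format shown above; the function names and regular expressions are hypothetical, not part of the repository:

```python
import re
from datetime import datetime
from typing import Dict

# Matches "- [ ] **Task Name** (ISSUE-REFERENCE)" or the "[x]" variant.
TASK_RE = re.compile(r"^- \[( |x)\] \*\*(.+?)\*\* \(([A-Z0-9-]+)\)\s*$")
# Matches a trailing "(YYYY-MM-DD HH:MM:SS)" timestamp on a project note.
NOTE_TS_RE = re.compile(r"\((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\)\s*$")


def count_tasks(markdown: str) -> Dict[str, int]:
    """Count pending and completed tasks that follow the Task Format."""
    counts = {"pending": 0, "completed": 0}
    for line in markdown.splitlines():
        match = TASK_RE.match(line.strip())
        if match:
            key = "completed" if match.group(1) == "x" else "pending"
            counts[key] += 1
    return counts


def has_valid_timestamp(note_line: str) -> bool:
    """Check that a note ends with a real calendar timestamp."""
    match = NOTE_TS_RE.search(note_line)
    if not match:
        return False
    try:
        datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return True
```

The counts returned by `count_tasks` could then be compared against the numbers written in the Executive Summary, which is the "Validate task counts match" step.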
components/chat.py CHANGED
@@ -3,113 +3,40 @@ from typing import Dict, Any, List
3
  import asyncio
4
  import logging
5
  import traceback
6
 
7
  # Setup logger
8
  logger = logging.getLogger(__name__)
9
 
10
- def display_chat_message(message: Dict[str, Any]):
11
- """
12
- Display a chat message with proper formatting.
13
-
14
- Args:
15
- message (Dict[str, Any]): Message object with role, content, and optional metadata
16
- """
17
- # Import here to avoid circular imports
18
- from i18n import get_direction, get_text
19
- from utils.sanitization import sanitize_html
20
- from utils import clean_source_text
21
-
22
- text_direction = get_direction()
23
- role = message.get("role", "assistant")
24
-
25
- # Ensure the current font is used
26
- hebrew_font = st.session_state.hebrew_font
27
-
28
- with st.chat_message(role):
29
- # Sanitize the message content for HTML rendering
30
- content = message.get('content', '')
31
- if isinstance(content, str):
32
- content = sanitize_html(content)
33
-
34
- # Add font-family style if not already present
35
- if 'font-family' not in content and role == "assistant" and text_direction == "rtl":
36
- content = f"<div style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important; direction: {text_direction};'>{content}</div>"
37
-
38
- st.markdown(content, unsafe_allow_html=True)
39
-
40
- if role == "assistant" and message.get("final_docs"):
41
- docs = message["final_docs"]
42
- # Use a simple text title for the expander
43
- with st.expander(f"{get_text('sources_title')} ({len(docs)})", expanded=False):
44
- # Add the rich HTML content inside the expander (static HTML is safe)
45
- expander_title = f"<div class='expander-title {text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{get_text('sources_text').format(len(docs))}</div>"
46
- st.markdown(expander_title, unsafe_allow_html=True)
47
- st.markdown(f"<div dir='{text_direction}' class='expander-content' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>", unsafe_allow_html=True)
48
- for i, doc in enumerate(docs, start=1):
49
- source = doc.get('source_name', '') or get_text('unknown_source')
50
- # Sanitize the source text
51
- source = sanitize_html(source)
52
-
53
- # Clean and sanitize the Hebrew text
54
- text = doc.get('hebrew_text', '')
55
- text = clean_source_text(text)
56
- text = sanitize_html(text)
57
-
58
- source_html = f"<div class='source-info {text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'><strong>{get_text('source_label').format(i)}</strong> {source}</div>"
59
- text_html = f"<div class='hebrew-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{text}</div>"
60
-
61
- st.markdown(source_html, unsafe_allow_html=True)
62
- st.markdown(text_html, unsafe_allow_html=True)
63
- st.markdown("</div>", unsafe_allow_html=True)
64
-
65
-
66
- def display_status_updates(status_log: List[str]):
67
- """
68
- Display processing status log in an expander.
69
-
70
- Args:
71
- status_log (List[str]): List of status update messages
72
- """
73
- # Import here to avoid circular imports
74
- from i18n import get_direction, get_text
75
-
76
- text_direction = get_direction()
77
- hebrew_font = st.session_state.hebrew_font
78
-
79
- if status_log:
80
- # Use a simple text title for the expander
81
- with st.expander(get_text('processing_details'), expanded=False):
82
- # Add the rich HTML content inside the expander
83
- st.markdown(f"<div class='expander-title {text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{get_text('processing_log')}</div>", unsafe_allow_html=True)
84
- for u in status_log:
85
- st.markdown(
86
- f"<code class='status-update {text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>- {u}</code>",
87
- unsafe_allow_html=True
88
- )
89
-
90
 
91
  def process_prompt(prompt: str, rag_params: Dict[str, Any]):
92
  """
93
  Process a user prompt and generate a response.
94
 
95
  Args:
96
- prompt (str): User input prompt
97
  rag_params (Dict[str, Any]): RAG parameters from sidebar
98
  """
99
  # Import here to avoid circular imports
100
  from i18n import get_direction, get_text
101
  from utils.sanitization import sanitize_html
102
- from utils import clean_source_text
103
- from services.openai_service import extract_citations_with_openai
104
- from rag_processor import execute_validate_generate_pipeline, PIPELINE_VALIDATE_GENERATE_GPT4O
105
- import nest_asyncio
 
106
 
107
  text_direction = get_direction()
108
  hebrew_font = st.session_state.hebrew_font
109
 
110
- # Ensure we can run async code
111
- nest_asyncio.apply()
112
-
113
  st.session_state.messages.append({"role": "user", "content": prompt})
114
  display_chat_message(st.session_state.messages[-1])
115
 
@@ -120,21 +47,23 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
120
  try:
121
  def status_cb(m): status_container.update(label=f"{get_text('processing_step')} {m}")
122
  def stream_cb(c):
123
- # Sanitize the chunk before appending and displaying
124
  if isinstance(c, str):
125
  c = sanitize_html(c)
126
  chunks.append(c)
127
- # Sanitize the entire response for rendering
128
- sanitized_response = sanitize_html(''.join(chunks))
129
- msg_placeholder.markdown(
130
- f"<div dir='{text_direction}' class='{text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{sanitized_response}▌</div>",
131
- unsafe_allow_html=True
132
- )
133
 
134
  try:
135
- loop = asyncio.get_event_loop()
 
 
136
  final_rag = loop.run_until_complete(
137
- execute_validate_generate_pipeline(
138
  history=st.session_state.messages,
139
  params=rag_params,
140
  status_callback=status_cb,
@@ -142,21 +71,23 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
142
  )
143
  )
144
 
145
- # Try to get cited IDs after getting the response
146
  cited_ids = []
147
  if isinstance(final_rag, dict):
148
  raw = final_rag.get("final_response", "")
149
  # Only attempt to extract citations if we have a valid response
150
- if raw:
151
- try:
152
- cited_ids = loop.run_until_complete(extract_citations_with_openai(raw))
153
- except Exception as citation_err:
154
- # Log citation extraction error but don't fail the entire request
155
- st.error(f"Citation extraction failed: {citation_err}", icon="⚠️")
156
- cited_ids = []
157
  except (RuntimeError, asyncio.CancelledError, asyncio.TimeoutError) as loop_err:
158
  st.error(f"{get_text('error_async')} {loop_err}", icon="⚠️")
159
- err_html = f"<div dir='{text_direction}' class='{text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'><strong>{get_text('request_error')}</strong><br>{get_text('error_async')}<br>{type(loop_err).__name__}</div>"
160
  # Sanitize error HTML
161
  err_html = sanitize_html(err_html)
162
  msg_placeholder.markdown(err_html, unsafe_allow_html=True)
@@ -181,15 +112,14 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
181
  docs = final_rag.get("generator_input_documents", [])
182
  pipeline = final_rag.get("pipeline_used", PIPELINE_VALIDATE_GENERATE_GPT4O)
183
 
184
- # wrap result
185
- final = raw
186
- if not (err and final.strip().startswith("<div")) and not final.strip().startswith((
187
- '<div', '<p', '<ul', '<ol', '<strong'
188
- )):
189
- final = f"<div dir='{text_direction}' class='{text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{final or get_text('no_response')}</div>"
 
190
 
191
- # Final sanitization before rendering
192
- final = sanitize_html(final)
193
  msg_placeholder.markdown(final, unsafe_allow_html=True)
194
 
195
  # Use the citations extracted earlier instead of making another async call
@@ -202,25 +132,25 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
202
  if docs_to_show:
203
  # Use a simple text title for the expander
204
  with st.expander(f"{get_text('sources_title')} ({len(docs_to_show)})", expanded=False):
205
- # Add the rich HTML content inside the expander
206
- expander_title = f"<div class='expander-title {text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{get_text('sources_text').format(len(docs_to_show))}</div>"
207
- st.markdown(sanitize_html(expander_title), unsafe_allow_html=True)
208
- st.markdown(sanitize_html(f"<div dir='{text_direction}' class='expander-content' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>"), unsafe_allow_html=True)
209
  for idx, doc in docs_to_show:
210
- source = doc.get('source_name', '') or get_text('unknown_source')
211
- # Sanitize the source text
212
- source = sanitize_html(source)
213
-
214
- # Clean and sanitize the Hebrew text
215
- text = doc.get('hebrew_text', '')
216
- text = clean_source_text(text)
217
- text = sanitize_html(text)
218
-
219
- source_html = f"<div class='source-info {text_direction}-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'><strong>{get_text('source_label').format(idx)}</strong> {source}</div>"
220
- text_html = f"<div class='hebrew-text' style='font-family: \"{hebrew_font}\", \"Open Sans Hebrew\", \"Alef Hebrew\", \"Arial Hebrew\", sans-serif !important;'>{text}</div>"
221
-
222
  st.markdown(source_html, unsafe_allow_html=True)
223
  st.markdown(text_html, unsafe_allow_html=True)
 
224
  st.markdown("</div>", unsafe_allow_html=True)
225
 
226
  # store message
@@ -239,10 +169,14 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
239
  else:
240
  status_container.update(label=get_text('processing_complete'), state="complete", expanded=False)
241
  else:
242
- msg_placeholder.markdown(
243
- f"<div dir='{text_direction}' class='{text_direction}-text'><strong>{get_text('communication_error')}</strong></div>",
244
- unsafe_allow_html=True
245
- )
 
 
 
 
246
  st.session_state.messages.append({
247
  "role": "assistant",
248
  "content": get_text('communication_error'),
@@ -254,12 +188,19 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
254
  status_container.update(label=f"{get_text('error')}!", state="error", expanded=False)
255
 
256
  except Exception as e:
257
- logger.error(f"Unhandled exception in RAG processing: {e}", exc_info=True)
258
  traceback.print_exc()
259
- err_html = (
260
- f"<div dir='{text_direction}' class='{text_direction}-text'><strong>{get_text('critical_error')}</strong><br>{get_text('reload')}"
261
- f"<details><summary>{get_text('details')}</summary><pre>{sanitize_html(traceback.format_exc())}</pre></details></div>"
262
- )
263
  # Sanitize error HTML
264
  err_html = sanitize_html(err_html)
265
  msg_placeholder.error(err_html, icon="🔥")
@@ -271,4 +212,4 @@ def process_prompt(prompt: str, rag_params: Dict[str, Any]):
271
  "status_log": [f"Critical: {type(e).__name__}"],
272
  "error": str(e)
273
  })
274
- status_container.update(label=str(e), state="error", expanded=False)
 
3
  import asyncio
4
  import logging
5
  import traceback
6
+ import nest_asyncio
7
+
8
+ # Apply nest_asyncio once during module import
9
+ nest_asyncio.apply()
10
 
11
  # Setup logger
12
  logger = logging.getLogger(__name__)
13
 
14
+ # Import our refactored modules
15
+ from ui.hebrew import handle_mixed_language_text
16
+ from ui.chat_render import display_chat_message, display_status_updates, format_source_html
17
+ from pipeline.rag import process_rag_request, create_async_execution_context, extract_citations
18
 
19
  def process_prompt(prompt: str, rag_params: Dict[str, Any]):
20
  """
21
  Process a user prompt and generate a response.
22
 
23
  Args:
24
+ prompt (str): User input prompt (may contain template)
25
  rag_params (Dict[str, Any]): RAG parameters from sidebar
26
  """
27
  # Import here to avoid circular imports
28
  from i18n import get_direction, get_text
29
  from utils.sanitization import sanitize_html
30
+ from rag_processor import PIPELINE_VALIDATE_GENERATE_GPT4O
31
+
32
+ # Initialize session state if needed
33
+ if 'messages' not in st.session_state:
34
+ st.session_state.messages = []
35
 
36
  text_direction = get_direction()
37
  hebrew_font = st.session_state.hebrew_font
38
 
39
+ # Add the visible prompt to chat history (what the user sees)
 
 
40
  st.session_state.messages.append({"role": "user", "content": prompt})
41
  display_chat_message(st.session_state.messages[-1])
42
 
 
47
  try:
48
  def status_cb(m): status_container.update(label=f"{get_text('processing_step')} {m}")
49
  def stream_cb(c):
50
+ # Sanitize the chunk before appending
51
  if isinstance(c, str):
52
  c = sanitize_html(c)
53
  chunks.append(c)
54
+ # Process the entire response with mixed language handler first
55
+ joined_text = ''.join(chunks) + "▌" # Add cursor
56
+ display_html = handle_mixed_language_text(joined_text, hebrew_font)
57
+ # Sanitize the final HTML to prevent injection
58
+ safe_html = sanitize_html(display_html)
59
+ msg_placeholder.markdown(safe_html, unsafe_allow_html=True)
60
 
61
  try:
62
+ # Get the current event loop or create a new one
63
+ loop = create_async_execution_context()
64
+
65
  final_rag = loop.run_until_complete(
66
+ process_rag_request(
67
  history=st.session_state.messages,
68
  params=rag_params,
69
  status_callback=status_cb,
 
71
  )
72
  )
73
 
74
+ # Extract citations after getting the response
75
  cited_ids = []
76
  if isinstance(final_rag, dict):
77
  raw = final_rag.get("final_response", "")
78
  # Only attempt to extract citations if we have a valid response
79
+ if raw and raw.strip():
80
+ cited_ids = extract_citations(raw)
81
  except (RuntimeError, asyncio.CancelledError, asyncio.TimeoutError) as loop_err:
82
  st.error(f"{get_text('error_async')} {loop_err}", icon="⚠️")
83
+ # Format error message with RTL support
84
+ err_html = f"""
85
+ <div dir='{text_direction}' class='{text_direction}-text hebrew-font'>
86
+ <strong>{get_text('request_error')}</strong><br>
87
+ {get_text('error_async')}<br>
88
+ {type(loop_err).__name__}
89
+ </div>
90
+ """
91
  # Sanitize error HTML
92
  err_html = sanitize_html(err_html)
93
  msg_placeholder.markdown(err_html, unsafe_allow_html=True)
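
The `create_async_execution_context()` call in this hunk comes from `pipeline.rag`, whose body this diff does not show. A minimal sketch of what such a helper might do, assuming it only needs to return a reusable, open event loop for the current thread; the thread-local cache is an illustrative choice and not the project's confirmed implementation:

```python
import asyncio
import threading

# One cached loop per thread; a script runner may use several threads.
_local = threading.local()


def create_async_execution_context() -> asyncio.AbstractEventLoop:
    """Return an open event loop for this thread, creating one if needed."""
    loop = getattr(_local, "loop", None)
    if loop is None or loop.is_closed():
        loop = asyncio.new_event_loop()
        _local.loop = loop
    return loop
```

The returned loop can be driven with `loop.run_until_complete(...)` as the diff does. Reusing the loop avoids creating a new one on every prompt, and the `is_closed()` check recovers if something closed it in between.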
 
112
  docs = final_rag.get("generator_input_documents", [])
113
  pipeline = final_rag.get("pipeline_used", PIPELINE_VALIDATE_GENERATE_GPT4O)
114
 
115
+ # Process final content with mixed language handler
116
+ if not (err and raw.strip().startswith("<div")):
117
+ final = handle_mixed_language_text(raw, hebrew_font)
118
+ # Final sanitization after processing
119
+ final = sanitize_html(final)
120
+ else:
121
+ final = sanitize_html(raw)
122
 
 
 
123
  msg_placeholder.markdown(final, unsafe_allow_html=True)
124
 
125
  # Use the citations extracted earlier instead of making another async call
 
132
  if docs_to_show:
133
  # Use a simple text title for the expander
134
  with st.expander(f"{get_text('sources_title')} ({len(docs_to_show)})", expanded=False):
135
+ # Add RTL Hebrew wrapper with embed
136
+ expander_title = f"""
137
+ <div class='expander-title rtl-text hebrew-font' dir='rtl' lang="he">
138
+ {get_text('sources_text').format(len(docs_to_show))}
139
+ </div>
140
+ """
141
+ st.markdown(expander_title, unsafe_allow_html=True)
142
+
143
+ # Container for all sources with RTL direction
144
+ st.markdown(f"""
145
+ <div dir='rtl' lang="he" class='expander-content rtl-text hebrew-font'>
146
+ """, unsafe_allow_html=True)
147
+
148
+ # Format each source consistently using our helper function
149
  for idx, doc in docs_to_show:
150
+ source_html, text_html = format_source_html(doc, idx, hebrew_font, get_text)
151
  st.markdown(source_html, unsafe_allow_html=True)
152
  st.markdown(text_html, unsafe_allow_html=True)
153
+
154
  st.markdown("</div>", unsafe_allow_html=True)
155
 
156
  # store message
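
`format_source_html` is imported from `ui.chat_render`, and its body is not part of this diff. Judging from the inline block it replaces in the old `process_prompt`, it might look roughly like the sketch below. This is an assumption: `html.escape` stands in for the project's own `sanitize_html` and `clean_source_text`, and the exact class names and markup are illustrative.

```python
import html
from typing import Any, Callable, Dict, Tuple


def format_source_html(
    doc: Dict[str, Any],
    idx: int,
    hebrew_font: str,
    get_text: Callable[[str], str],
) -> Tuple[str, str]:
    """Build the (source line, text body) HTML pair for one retrieved document."""
    # Escape everything that originates from data before placing it in markup.
    source = html.escape(doc.get("source_name", "") or get_text("unknown_source"))
    text = html.escape(doc.get("hebrew_text", "") or "")
    label = html.escape(get_text("source_label").format(idx))
    font = html.escape(hebrew_font, quote=True)
    style = f"font-family: '{font}', sans-serif;"

    source_html = (
        f"<div class='source-info rtl-text' dir='rtl' lang='he' style=\"{style}\">"
        f"<strong>{label}</strong> {source}</div>"
    )
    text_html = (
        f"<div class='hebrew-text' dir='rtl' lang='he' style=\"{style}\">{text}</div>"
    )
    return source_html, text_html
```

Returning the two fragments separately matches how the caller in this hunk renders them with two `st.markdown` calls.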
 
169
  else:
170
  status_container.update(label=get_text('processing_complete'), state="complete", expanded=False)
171
  else:
172
+ # Format communication error message with proper RTL support
173
+ err_msg = f"""
174
+ <div dir='{text_direction}' class='{text_direction}-text hebrew-font'>
175
+ <strong>{get_text('communication_error')}</strong>
176
+ </div>
177
+ """
178
+ msg_placeholder.markdown(sanitize_html(err_msg), unsafe_allow_html=True)
179
+
180
  st.session_state.messages.append({
181
  "role": "assistant",
182
  "content": get_text('communication_error'),
 
188
  status_container.update(label=f"{get_text('error')}!", state="error", expanded=False)
189
 
190
  except Exception as e:
191
+ logger.exception("Unhandled exception in RAG processing")
192
  traceback.print_exc()
193
+ # Format critical error with RTL support
194
+ err_html = f"""
195
+ <div dir='{text_direction}' class='{text_direction}-text hebrew-font'>
196
+ <strong>{get_text('critical_error')}</strong><br>
197
+ {get_text('reload')}
198
+ <details>
199
+ <summary>{get_text('details')}</summary>
200
+ <pre>{sanitize_html(traceback.format_exc())}</pre>
201
+ </details>
202
+ </div>
203
+ """
204
  # Sanitize error HTML
205
  err_html = sanitize_html(err_html)
206
  msg_placeholder.error(err_html, icon="🔥")
 
212
  "status_log": [f"Critical: {type(e).__name__}"],
213
  "error": str(e)
214
  })
215
+ status_container.update(label=get_text('processing_error'), state="error", expanded=False)
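
The new `components/chat.py` routes both streamed and final text through `handle_mixed_language_text(text, hebrew_font)` from `ui.hebrew`, which this diff does not include. One plausible approach, sketched here under stated assumptions: pick the block direction from the ratio of Hebrew to Latin letters, and isolate Latin runs inside RTL text with `<bdi>` so they keep left-to-right order. The heuristic, the regular expressions, and the markup are all hypothetical, and the sketch assumes plain-text input with no HTML tags in it.

```python
import re

HEBREW_RE = re.compile(r"[\u0590-\u05FF]")
LATIN_RE = re.compile(r"[A-Za-z]")
# A Latin run: starts with a letter, may contain digits and light punctuation,
# and ends on a letter or digit (so trailing spaces are not captured).
LATIN_RUN_RE = re.compile(r"[A-Za-z][A-Za-z0-9 _.,:/-]*[A-Za-z0-9]|[A-Za-z]")


def detect_direction(text: str) -> str:
    """Return 'rtl' when Hebrew letters are present and at least as common as Latin ones."""
    hebrew = len(HEBREW_RE.findall(text))
    latin = len(LATIN_RE.findall(text))
    return "rtl" if hebrew > 0 and hebrew >= latin else "ltr"


def handle_mixed_language_text(text: str, hebrew_font: str) -> str:
    """Wrap text in a direction-aware div; isolate Latin runs inside RTL text."""
    direction = detect_direction(text)
    body = text
    if direction == "rtl":
        body = LATIN_RUN_RE.sub(lambda m: f"<bdi dir='ltr'>{m.group(0)}</bdi>", text)
    # The caller is expected to sanitize the result, as the diff does.
    return (
        f"<div dir='{direction}' class='{direction}-text hebrew-font' "
        f"style=\"font-family: '{hebrew_font}', sans-serif;\">{body}</div>"
    )
```

As in the diff, the result would still be passed through `sanitize_html` before rendering; the sketch itself performs no escaping.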
components/sidebar.py CHANGED
@@ -1,6 +1,9 @@
1
  import streamlit as st
2
  from typing import Dict, Any
3
- import time
 
 
 
4
 
5
  def display_sidebar() -> Dict[str, Any]:
6
  """
@@ -17,89 +20,85 @@ def display_sidebar() -> Dict[str, Any]:
17
 
18
  text_direction = get_direction()
19
 
20
- # Use Streamlit's native sidebar instead of custom toggle
21
  with st.sidebar:
22
- st.markdown(f"<h3 class='{text_direction}-text'>{get_text('settings_title')}</h3>", unsafe_allow_html=True)
23
-
24
- # Language selection
25
- st.markdown(f"<h4 class='{text_direction}-text'>{get_text('display_settings')}</h4>", unsafe_allow_html=True)
26
 
27
- # Convert language codes to display names for the selector
28
  language_options = {code: name for code, name in LANGUAGES.items()}
 
29
 
30
- # Language selector
31
  selected_language = st.selectbox(
32
  get_text('language_setting'),
33
  options=list(language_options.keys()),
34
  format_func=lambda x: language_options.get(x, x),
35
- index=list(language_options.keys()).index(st.session_state.language) if st.session_state.language in language_options else 0
36
  )
37
 
38
- # Update language in session state when changed
39
  if selected_language != st.session_state.language:
 
40
  st.session_state.language = selected_language
41
- st.rerun() # Rerun to apply the new language
 
 
 
 
 
42
 
43
- # Get appropriate font options for current language
44
  font_options = get_font_options()
 
45
 
46
- # Font selector
47
  selected_font = st.selectbox(
48
  get_text('font_setting'),
49
  options=list(font_options.keys()),
50
  format_func=lambda x: font_options.get(x, x),
51
- index=list(font_options.keys()).index(st.session_state.hebrew_font) if st.session_state.hebrew_font in font_options else 0
52
  )
53
 
54
- # Update font in session state when changed
55
  if selected_font != st.session_state.hebrew_font:
56
- # Store the new font in session state
57
  st.session_state.hebrew_font = selected_font
58
- # Add a reload flag to force CSS regeneration - use a random number to ensure it's always different
59
- st.session_state.css_reload_trigger = int(time.time() * 1000) # Use milliseconds for better uniqueness
60
- # Show notification about font change
61
- st.success(f"Font changed to {font_options.get(selected_font, selected_font)}. Applying changes...", icon="✅")
62
- # Force a complete rerun to apply the new CSS with the updated font
63
  st.rerun()
64
 
65
- # Show font preview with improved styling
66
  font_preview_text = get_text('font_preview').format(font_options.get(selected_font, selected_font))
67
- preview_style = f"""
68
- font-family: "{selected_font}", "Open Sans Hebrew", "Alef Hebrew", "Arial Hebrew", sans-serif !important;
69
- direction: {text_direction};
70
- text-align: {("right" if text_direction == "rtl" else "left")};
71
- padding: 10px;
72
- border: 1px solid #ddd;
73
- border-radius: 4px;
74
- margin-top: 10px;
75
- font-size: 1.2em;
76
- line-height: 1.5;
77
- background-color: #f8f8f8;
78
- """
79
- st.markdown(f"<div style='{preview_style}'>{font_preview_text}</div>", unsafe_allow_html=True)
80
 
81
- # Separator
82
- st.markdown("<hr>", unsafe_allow_html=True)
 
 
83
 
84
- # Service status
 
 
85
  retriever_ready, _ = get_retriever_status()
86
  openai_ready, _ = get_openai_status()
87
 
88
- st.markdown(
89
- f"<p class='{text_direction}-text'><strong>{get_text('retriever_status')}</strong> {'✅' if retriever_ready else '❌'}</p>",
90
- unsafe_allow_html=True
91
- )
 
 
92
  if not retriever_ready:
93
- st.sidebar.error(get_text('retriever_error'), icon="🛑")
94
  st.stop()
 
 
 
 
 
 
95
 
96
- st.markdown(
97
- f"<p class='{text_direction}-text'><strong>{get_text('openai_status')}</strong> {'✅' if openai_ready else '❌'} ({config.OPENAI_VALIDATION_MODEL} / {config.OPENAI_GENERATION_MODEL})</p>",
98
- unsafe_allow_html=True
99
- )
100
  if not openai_ready:
101
- st.sidebar.error(get_text('openai_error'), icon="⚠️")
102
-
103
  # RAG parameters
104
  n_retrieve = st.slider(get_text('retrieval_count'), 1, 300, config.DEFAULT_N_RETRIEVE)
105
  max_validate = min(n_retrieve, 100)
@@ -112,19 +111,21 @@ def display_sidebar() -> Dict[str, Any]:
112
  )
113
  st.info(get_text('validation_info'), icon="ℹ️")
114
 
115
- # Prompt editors
116
  with st.expander(get_text('edit_prompts'), expanded=False):
117
- gen_prompt = st.text_area(
118
  get_text('system_prompt'),
119
  value=config.OPENAI_SYSTEM_PROMPT,
120
  height=200
121
  )
122
- val_prompt = st.text_area(
123
  get_text('validation_prompt'),
124
  value=config.VALIDATION_PROMPT_TEMPLATE,
125
  height=200
126
  )
127
- config.OPENAI_SYSTEM_PROMPT = gen_prompt
128
- config.VALIDATION_PROMPT_TEMPLATE = val_prompt
129
 
130
- return {"n_retrieve": n_retrieve, "n_validate": n_validate, "services_ready": (retriever_ready and openai_ready)}
1
  import streamlit as st
2
  from typing import Dict, Any
3
+ import logging
4
+
5
+ # Setup logger
6
+ logger = logging.getLogger(__name__)
7
 
8
  def display_sidebar() -> Dict[str, Any]:
9
  """
 
20
 
21
  text_direction = get_direction()
22
 
23
+ # Use Streamlit's native sidebar
24
  with st.sidebar:
25
+ # All sidebar header items
26
+ st.header(get_text('settings_title'))
27
+ st.subheader(get_text('display_settings'))
 
28
 
29
+ # Language selector
30
  language_options = {code: name for code, name in LANGUAGES.items()}
31
+ current_language_index = list(language_options.keys()).index(st.session_state.language) if st.session_state.language in language_options else 0
32
 
 
33
  selected_language = st.selectbox(
34
  get_text('language_setting'),
35
  options=list(language_options.keys()),
36
  format_func=lambda x: language_options.get(x, x),
37
+ index=current_language_index
38
  )
39
 
40
+ # Update language if changed
41
  if selected_language != st.session_state.language:
42
+ # Update session state and clear template
43
  st.session_state.language = selected_language
44
+ if "active_template" in st.session_state:
45
+ st.session_state.active_template = None
46
+
47
+ # Notification and rerun
48
+ st.success(f"Language changed to {language_options.get(selected_language, selected_language)}", icon="✅")
49
+ st.rerun()
50
 
51
+ # Font selector
52
  font_options = get_font_options()
53
+ font_index = list(font_options.keys()).index(st.session_state.hebrew_font) if st.session_state.hebrew_font in font_options else 0
54
 
 
55
  selected_font = st.selectbox(
56
  get_text('font_setting'),
57
  options=list(font_options.keys()),
58
  format_func=lambda x: font_options.get(x, x),
59
+ index=font_index
60
  )
61
 
62
+ # Update font if changed
63
  if selected_font != st.session_state.hebrew_font:
 
64
  st.session_state.hebrew_font = selected_font
65
+ st.success(f"Font changed to {font_options.get(selected_font, selected_font)}", icon="✅")
 
 
 
 
66
  st.rerun()
67
 
68
+ # Font preview
69
  font_preview_text = get_text('font_preview').format(font_options.get(selected_font, selected_font))
70
+ special_class = "rashi-text" if selected_font == "Noto Rashi Hebrew" else ""
71
 
72
+ # Use streamlit container instead of custom HTML for preview
73
+ preview = st.container(border=True)
74
+ with preview:
75
+ st.markdown(f"<div class='{special_class}' style='font-family: \"{selected_font}\", sans-serif;direction:{text_direction};text-align:{'right' if text_direction=='rtl' else 'left'}'>{font_preview_text}</div>", unsafe_allow_html=True)
76
 
77
+ st.divider() # Native divider instead of HTML hr
78
+
79
+ # Service status with native components
80
  retriever_ready, _ = get_retriever_status()
81
  openai_ready, _ = get_openai_status()
82
 
83
+ status_col1, status_col2 = st.columns(2)
84
+ with status_col1:
85
+ st.write(f"**{get_text('retriever_status')}**")
86
+ with status_col2:
87
+ st.write("✅" if retriever_ready else "❌")
88
+
89
  if not retriever_ready:
90
+ st.error(get_text('retriever_error'), icon="🛑")
91
  st.stop()
92
+
93
+ status_col3, status_col4 = st.columns(2)
94
+ with status_col3:
95
+ st.write(f"**{get_text('openai_status')}**")
96
+ with status_col4:
97
+ st.write(f"{'✅' if openai_ready else '❌'}")
98
 
 
 
 
 
99
  if not openai_ready:
100
+ st.warning(get_text('openai_error'), icon="⚠️")
101
+
102
  # RAG parameters
103
  n_retrieve = st.slider(get_text('retrieval_count'), 1, 300, config.DEFAULT_N_RETRIEVE)
104
  max_validate = min(n_retrieve, 100)
 
111
  )
112
  st.info(get_text('validation_info'), icon="ℹ️")
113
 
114
+ # Prompt editors in expander
115
  with st.expander(get_text('edit_prompts'), expanded=False):
116
+ config.OPENAI_SYSTEM_PROMPT = st.text_area(
117
  get_text('system_prompt'),
118
  value=config.OPENAI_SYSTEM_PROMPT,
119
  height=200
120
  )
121
+ config.VALIDATION_PROMPT_TEMPLATE = st.text_area(
122
  get_text('validation_prompt'),
123
  value=config.VALIDATION_PROMPT_TEMPLATE,
124
  height=200
125
  )
 
 
126
 
127
+ return {
128
+ "n_retrieve": n_retrieve,
129
+ "n_validate": n_validate,
130
+ "services_ready": (retriever_ready and openai_ready)
131
+ }
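Note on the prompt editors in this hunk: the `st.text_area` results are assigned straight back onto `config`, which is the pattern the deleted compliance report lists as SBAR-CONFIG-MODIFICATION, with the recommendation to keep edited values in session state. A minimal sketch of that idea follows. The names `DEFAULT_PROMPTS`, `init_prompt_state`, and `update_prompt` are hypothetical, the default strings are placeholders, and a plain dict stands in for `st.session_state`; none of this is the code in the commit.

```python
from typing import MutableMapping

# Hypothetical defaults; in the app these would be read from config.py.
DEFAULT_PROMPTS = {
    "system_prompt": "You are a helpful assistant.",
    "validation_prompt": "Is this passage relevant? {passage}",
}


def init_prompt_state(state: MutableMapping[str, str]) -> None:
    """Seed per-session prompt values once, leaving module globals untouched."""
    for key, default in DEFAULT_PROMPTS.items():
        if key not in state:
            state[key] = default


def update_prompt(state: MutableMapping[str, str], key: str, new_value: str) -> None:
    """Store an edited prompt for this session only."""
    if key not in DEFAULT_PROMPTS:
        raise KeyError(f"Unknown prompt key: {key}")
    state[key] = new_value


# Two independent "sessions": editing one does not leak into the other.
session_a: dict = {}
session_b: dict = {}
init_prompt_state(session_a)
init_prompt_state(session_b)
update_prompt(session_a, "system_prompt", "edited prompt")
```

In a Streamlit app the `state` argument would presumably be `st.session_state`, so an edit made by one user would not change the prompt seen by another session sharing the same process.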
memorybank/projectDocs/rules_compliance.md DELETED
@@ -1,169 +0,0 @@
1
- # Rules Compliance Report Guide
2
-
3
- ## Overview
4
- The `rules_compliance_report.md` file documents the evaluation of Python files in the Divrei Yoel AI Chat project against established coding rules and best practices. This report serves as a quality assessment tool to identify potential issues, anti-patterns, and areas for improvement across the codebase.
5
-
6
- ## Purpose and Value
7
- - **Quality Assurance**: Identifies potential code quality issues before they impact production
8
- - **Consistency**: Ensures adherence to project-wide coding standards
9
- - **Documentation**: Provides a record of technical debt and improvement opportunities
10
- - **Onboarding**: Helps new developers understand code quality expectations
11
-
12
- ## Report Structure
13
-
14
- ### File Organization
15
- The report evaluates 19 Python files across multiple directories:
16
- - Main application files (`app.py`, `config.py`, `i18n.py`, `rag_processor.py`)
17
- - Component modules (`components/*.py`)
18
- - CSS modules (`css/*.py`)
19
- - Prompt templates (`prompts/*.py`)
20
- - Service integrations (`services/*.py`)
21
- - Utility functions (`utils/*.py`)
22
-
23
- ### Evaluation Categories
24
- Each file is evaluated against several rule categories:
25
- - **`python.mdc`**: General Python best practices
26
- - **`streamlit.mdc`**: Streamlit-specific guidelines
27
- - **`openai.mdc`**: OpenAI API interaction standards
28
- - **`asyncio.mdc`**: Asynchronous programming patterns
29
- - **`langsmith.mdc`**: LangSmith tracing and monitoring
30
- - **`langchain.mdc`**: LangChain development patterns
31
- - **`pydantic.mdc`**: Pydantic usage recommendations
32
-
33
- ### Issue Documentation Format
34
- For each identified issue, the report provides:
35
- - **Issue ID**: Unique identifier (e.g., `APP-CSS-HACK`)
36
- - **File**: Location of the issue
37
- - **Lines**: Line numbers where the issue appears
38
- - **Description**: Explanation of the problem
39
- - **Rule Violation**: Reference to the specific rule being violated
40
- - **Severity**: Impact level (Critical, Medium, Minor, Informational)
41
- - **Suggestion**: Recommended fix (when applicable)
42
-
43
- ## Key Findings Summary
44
-
45
- ### Critical Issues
46
- - **Dynamic Imports (RET-DYNAMIC-IMPORTS, UTIL-DYNAMIC-CONFIG-IMPORT)**:
47
- - Issue in `services/retriever.py` (lines 13-22) and `utils/__init__.py` (lines 9-13)
48
- - Using `sys.path.insert` and `importlib.util` for importing modules
49
- - Impact: Breaks static analysis and standard Python import behaviors
50
- - Recommendation: Restructure imports to use standard Python import system
51
-
52
- - **Global State (OPENAI-GLOBAL-CLIENT, RET-GLOBAL-CLIENT)**:
53
- - Issue in `services/openai_service.py` (lines 15-17) and `services/retriever.py` (lines 28-31)
54
- - Managing service clients using global variables
55
- - Impact: Makes testing and configuration management less flexible
56
- - Recommendation: Use class-based services or dependency injection
57
-
58
- - **Config Modification (SBAR-CONFIG-MODIFICATION)**:
59
- - Issue in `components/sidebar.py` (lines 128-129)
60
- - Directly modifies `config.OPENAI_SYSTEM_PROMPT` and `config.VALIDATION_PROMPT_TEMPLATE`
61
- - Impact: Creates unpredictable application state
62
- - Recommendation: Store modified values in session state instead
63
-
64
- ### Medium Issues
65
- - **Async/Sync Mixing (RAG-ASYNC-SYNC-RETRIEVER, UTIL-SYNC-EMBEDDING)**:
66
- - Issue in `rag_processor.py` (line 20) and `utils/__init__.py` (in `get_embedding` function)
67
- - Calling synchronous functions with blocking I/O from async contexts
68
- - Impact: Blocks the event loop, degrading performance
69
- - Recommendation: Make retrieval functions async or use thread pool executors
70
-
71
- - **LangSmith Tracing (RAG-LS-MISSING-TAGS-META, OPENAI-LS-MISSING-TAGS-META)**:
72
- - Issue in multiple `@traceable` decorators across files
73
- - Lack of tags (e.g., `model:gpt-4o`) and metadata in traces
74
- - Impact: Reduced observability and filtering capabilities
75
- - Recommendation: Add consistent tags and metadata to all traced functions
76
-
77
- - **Input Sanitization (RAG-LS-INPUT-SANITIZATION, OPENAI-LS-INPUT-SANITIZATION)**:
78
- - Issue in traced functions receiving user input
79
- - Potential exposure of sensitive information in traces
80
- - Impact: Privacy and security concerns
81
- - Recommendation: Sanitize inputs before tracing and add "sanitized: True" to metadata
82
-
83
- - **CSS in Python (CSS-IN-PYTHON)**:
84
- - Issue in `css/styles.py` and UI components
85
- - Large blocks of CSS generated as Python f-strings
86
- - Impact: Makes CSS editing, linting, and maintenance difficult
87
- - Recommendation: Use separate .css files when possible
88
-
89
- ### Minor Issues
90
- - **Print vs Logging (CFG-PRINT-LOGGING, RAG-PRINT-LOGGING, etc.)**:
91
- - Issue in multiple files using `print()` for status updates and errors
92
- - Impact: Inconsistent logging and reduced control over output
93
- - Recommendation: Use the `logging` module consistently throughout
94
-
95
- - **Long Functions (RAG-LONG-FUNCTION)**:
96
- - Issue in `rag_processor.py` (`execute_validate_generate_pipeline`, lines 107-202)
97
- - Impact: Reduced readability and maintainability
98
- - Recommendation: Refactor into smaller, focused helper functions
99
-
100
- - **Pydantic Settings (CFG-PYDANTIC-SETTINGS)**:
101
- - Issue in `config.py` (manual environment variable handling)
102
- - Impact: Less robust configuration management
103
- - Recommendation: Use Pydantic's `BaseSettings` for validation and typing
104
-
105
- - **Local Imports (CHAT-LOCAL-IMPORTS, PGALLERY-LOCAL-IMPORTS, SBAR-LOCAL-IMPORTS)**:
106
- - Issue in component modules to avoid circular dependencies
107
- - Impact: Reduces clarity and can mask design issues
108
- - Recommendation: Restructure dependencies for top-level imports
109
-
110
- - **HTML Generation (CHAT-HTML-IN-PYTHON, PGALLERY-HTML-IN-PYTHON, SBAR-HTML-IN-PYTHON)**:
111
- - Issue in UI components generating HTML with inline styles
112
- - Impact: Reduced maintainability and separation of concerns
113
- - Recommendation: Extract styling to CSS and use simpler markup generation
114
-
115
- ## How to Use This Report
116
-
117
- ### Prioritizing Issues
118
- 1. **Critical Issues**: Must be addressed immediately as they affect stability or security
119
- 2. **Medium Issues**: Should be scheduled for remediation in the next development cycle
120
- 3. **Minor Issues**: Can be addressed during regular maintenance or refactoring
121
-
122
- ### Implementing Fixes
123
- For each issue category, consider the following approaches:
124
-
125
- #### Dynamic Imports
126
- - Restructure imports to use standard Python import system
127
- - Fix PYTHONPATH issues rather than manipulating `sys.path`
128
-
129
- #### Global State
130
- - Refactor to use dependency injection or service classes
131
- - Create proper initialization and shutdown procedures
132
-
133
- #### Logging
134
- - Replace all `print()` calls with appropriate `logging` module usage
135
- - Implement centralized logging configuration
136
-
137
- #### CSS Management
138
- - Extract CSS to dedicated CSS files
139
- - Consider using Streamlit's theming capabilities
140
-
141
- #### LangSmith Tracing
142
- - Add consistent tags, metadata, and sanitization
143
- - Follow the patterns outlined in `langsmith.mdc`
144
-
145
- ## Integration with Development Workflow
146
-
147
- ### Code Review
148
- Use this report as a reference during code reviews to ensure new code doesn't introduce known issues.
149
-
150
- ### Continuous Improvement
151
- - Update the report regularly as the codebase evolves
152
- - Use it to track progress in resolving identified issues
153
- - Reference issue IDs in commit messages when fixing problems
154
-
155
- ### Documentation
156
- Keep the report synchronized with other project documentation to maintain a consistent view of code quality standards.
157
-
158
- ## Relationship to Other Documentation
159
- This report complements other project documentation:
160
- - **Technical Documentation**: Provides a quality assessment perspective
161
- - **Development Workflow Guide**: Informs development standards
162
- - **Module Guides**: Identifies specific areas for improvement within modules
163
-
164
- ## Recent Updates
165
- - Comprehensive evaluation of all Python files
166
- - Addition of severity ratings for each issue
167
- - Enhanced guidance for resolving critical issues
168
- - Improved issue categorization and naming conventions
169
- - Detailed examples of specific code locations needing attention
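Note on the async/sync item in the report above (RAG-ASYNC-SYNC-RETRIEVER): the recommendation is to make retrieval async or push blocking calls onto a thread pool, and the updated `memorybank/tasks.md` in this commit records that `retrieve_documents` now uses `asyncio.to_thread`. The sketch below shows the general shape of that pattern only. `blocking_query` is a hypothetical stand-in for a synchronous vector-store call, not the project's retriever code.

```python
import asyncio
import time
from typing import Dict, List


def blocking_query(query_text: str, n_results: int) -> List[Dict]:
    """Hypothetical synchronous call that would block the event loop if awaited directly."""
    time.sleep(0.05)  # simulate network latency
    return [{"id": f"doc-{i}", "query": query_text} for i in range(n_results)]


async def retrieve_documents(query_text: str, n_results: int) -> List[Dict]:
    """Run the blocking call in a worker thread so other coroutines keep running."""
    return await asyncio.to_thread(blocking_query, query_text, n_results)


docs = asyncio.run(retrieve_documents("example query", 3))
```

`asyncio.to_thread` requires Python 3.9 or later; on older versions `loop.run_in_executor` plays the same role.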
memorybank/tasks.md CHANGED
@@ -1,12 +1,114 @@
1
- # Tasks
2
 
3
- ## Current Status
4
  - [x] VAN mode initialization (TIMESTAMP: 2025-05-06 16:31:19)
5
  - [x] System files verification
6
  - [x] Project structure evaluation
7
  - [x] Technology stack assessment
8
  - [x] Memory Bank initialization
9
  - [x] Creation of core Memory Bank files
 
 
10
  - [x] Initial technical validation
11
  - [x] Detailed code review of app.py
12
  - [x] Detailed code review of rag_processor.py
@@ -17,21 +119,17 @@
17
  - [x] Performance optimization assessment
18
  - [x] Complexity assessment finalization
19
  - [x] Rules compliance analysis completed
20
- - [x] Product documentation enhancement (2025-05-06 16:35:20)
21
-
22
- ## System Implementation Tasks
23
  - [x] Complete VAN mode validation
 
 
24
  - [x] Determine appropriate next mode based on complexity
25
  - [x] Execute mode transition
26
  - [x] Update Memory Bank with comprehensive project analysis
27
-
28
- ## Next Mode Preparation
29
- - [ ] If Level 1 complexity → Prepare for IMPLEMENT mode
30
  - [x] If Level 2-4 complexity → Prepare for PLAN mode
31
  - [x] Document evaluation results in Memory Bank
32
  - [x] Check for missing dependencies or environment variables
33
 
34
- ## PLAN Mode Tasks (Current Phase)
35
  - [x] Define optimization priorities
36
  - [x] Identify potential enhancements
37
  - [x] Draft architectural improvements
@@ -39,47 +137,33 @@
39
  - [x] Schedule task execution
40
  - [x] Document technical specifications
41
 
42
- ## Implementation Tasks (Completed)
43
  - [x] Improved citation detection using OpenAI instead of regex
44
  - [x] Implemented grid-based prompt gallery with Hebrew examples
45
  - [x] Enhanced CSS for RTL text display
46
  - [x] Updated UI labels for better Hebrew support
47
  - [x] Added collapsible sidebar with settings button
48
  - [x] Improved overall user experience
 
 
 
 
 
 
 
 
49
 
50
- ## Newly Identified Issues (Current Analysis)
51
  - [x] Fix missing input variable handling in app.py (Line 171: Using `prompt` without initializing)
52
  - [x] Implement error handling for asyncio loop failures
53
  - [x] Enhance exception handling in the main RAG processing block
54
  - [x] Review and secure HTML rendering (multiple instances of `unsafe_allow_html=True`)
55
- - [ ] Implement proper internationalization for Hebrew text
56
  - [x] Remove debug print statements from production code
57
- - [ ] Optimize performance by making UI rendering calls asynchronous
58
- - [ ] Add retry mechanism for service initialization failures
59
-
60
- ## Code Quality Improvement Tasks (From Rules Compliance Report)
61
- ### Critical Priority
62
- - [ ] Fix dynamic imports in retriever.py (RET-DYNAMIC-IMPORTS) - Lines 13-22
63
- - [ ] Fix dynamic imports in utils/__init__.py (UTIL-DYNAMIC-CONFIG-IMPORT) - Lines 9-13
64
- - [ ] Refactor global client state in openai_service.py (OPENAI-GLOBAL-CLIENT) - Lines 15-17
65
- - [ ] Refactor global client state in retriever.py (RET-GLOBAL-CLIENT) - Lines 28-31
66
- - [ ] Fix runtime config modification in sidebar.py (SBAR-CONFIG-MODIFICATION) - Lines 128-129
67
-
68
- ### Medium Priority
69
- - [ ] Fix async/sync mixing in rag_processor.py (RAG-ASYNC-SYNC-RETRIEVER) - Line 20
70
- - [ ] Update get_embedding() in utils/__init__.py to use async (UTIL-SYNC-EMBEDDING)
71
- - [ ] Add tags and metadata to traced functions (RAG-LS-MISSING-TAGS-META) in multiple files
72
- - [ ] Implement input sanitization for LangSmith tracing (RAG-LS-INPUT-SANITIZATION)
73
- - [ ] Refactor CSS generation in css/styles.py (CSS-IN-PYTHON) to use external files
74
-
75
- ### Minor Priority
76
- - [ ] Replace print() with logging in all files (CFG-PRINT-LOGGING, RAG-PRINT-LOGGING, etc.)
77
- - [ ] Refactor long function in rag_processor.py (RAG-LONG-FUNCTION) - Lines 107-202
78
- - [ ] Use Pydantic BaseSettings in config.py (CFG-PYDANTIC-SETTINGS)
79
- - [ ] Fix local imports in component files (CHAT-LOCAL-IMPORTS, PGALLERY-LOCAL-IMPORTS)
80
- - [ ] Extract HTML generation to separate functions or templates (CHAT-HTML-IN-PYTHON)
81
-
82
- ## Documentation Enhancements
83
  - [x] Improve product description in projectbrief.md
84
  - [x] Expand user experience workflow in productContext.md
85
  - [x] Add detailed corpus and knowledge base information
@@ -87,17 +171,4 @@
87
  - [x] Enhance technical architecture documentation
88
  - [x] Clarify core functionality and capabilities
89
  - [x] Update activeContext.md with documentation improvement tracking
90
- - [x] Create comprehensive rules compliance report with specific issue tracking
91
-
92
- ## Notes
93
- - Memory Bank system initialized via VAN mode
94
- - VAN mode technical validation completed
95
- - Rules compliance analysis completed (2025-05-06)
96
- - Product documentation enhanced (2025-05-06 16:35:20)
97
- - Final assessment confirms moderate complexity (Level 2)
98
- - Core RAG functionality is well-implemented with good architecture
99
- - PLAN mode completed with successful implementation of enhancements
100
- - Project has robust error handling and good separation of concerns
101
- - New analysis (2025-05-06) identified 8 issues that require addressing
102
- - Implementation completed for 4 of 8 issues (2025-05-06)
103
- - Comprehensive rules compliance report added with 15 prioritized tasks (2025-05-07)
 
1
+ # DivreiYoelChezky Project Task Tracker
2
 
3
+ ## 📋 Executive Summary
4
+ - **Project Status**: Phase 2 (Implementation)
5
+ - **Complexity Level**: 2 (Moderate)
6
+ - **Last Updated**: 2025-05-08
7
+ - **Pending Tasks**: 8
8
+ - **Completed Tasks**: 43
9
+
10
+ ## 🏆 Branch Achievements Summary
11
+
12
+ ### Core Application Development
13
+ - ✓ Built Streamlit-based chat application for Rabbi Yoel Teitelbaum's teachings using RAG
14
+ - ✓ Implemented bilingual support (Hebrew/English) with RTL text handling
15
+ - ✓ Created customizable Hebrew font system with Google Fonts integration
16
+ - ✓ Developed modular component architecture with reusable UI elements
17
+ - ✓ Added source citation functionality with OpenAI-based extraction
18
+ - ✓ Optimized bidi text handling with proper unicode-bidi embedding
19
+
20
+ ### RAG Pipeline Implementation
21
+ - ✓ Designed three-stage RAG pipeline (Retrieve → Validate → Generate)
22
+ - ✓ Integrated Pinecone for vector retrieval of relevant documents
23
+ - ✓ Added OpenAI GPT-4o validation to filter relevant passages
24
+ - ✓ Implemented streaming responses for better user experience
25
+ - ✓ Added tracing with LangSmith for observability and debugging
26
+
27
+ ### Code Quality & Architecture
28
+ - ✓ Refactored codebase into modular components and services
29
+ - ✓ Implemented comprehensive error handling throughout application
30
+ - ✓ Created centralized configuration system for environment variables
31
+ - ✓ Added HTML sanitization for security with bleach
32
+ - ✓ Improved session state management for Streamlit
33
+ - ✓ Refactored UI components for better separation of concerns
34
+ - ✓ Moved CSS to external files to reduce duplication
35
+
36
+ ### Documentation & Best Practices
37
+ - ✓ Created extensive module guides for all components
38
+ - ✓ Developed comprehensive rules compliance report
39
+ - ✓ Documented development workflow with best practices
40
+ - ✓ Added internationalization system with translation dictionaries
41
+ - ✓ Created product documentation with user experience flow
42
+
43
+ ### Process & Project Management
44
+ - ✓ Set up Memory Bank system with VAN mode initialization
45
+ - ✓ Completed technical validation of all components
46
+ - ✓ Executed PLAN mode with optimization priorities
47
+ - ✓ Created detailed implementation plans
48
+ - ✓ Fixed 4 of 8 identified issues during review
49
+
50
+ ## 🚀 Active Development Tasks
51
+
52
+ ### 🔧 Technical Debt (Critical)
53
+
54
+ ### ⚡ Performance Optimization (High Priority)
55
+
56
+ ### 🔒 Security Improvements (High Priority)
57
+ - [ ] **Add Input Sanitization for Tracing** (OPENAI-LS-INPUT-SANITIZATION, RET-LS-QUERY-SANITIZATION)
58
+ - Files: `openai_service.py`, `retriever.py`
59
+ - Issue: Missing input sanitization for LangSmith tracing of potentially sensitive user queries
60
+ - Solution: Add sanitization before tracing for security
61
+
62
+ ### 🌐 User Experience Improvements
63
+ - [ ] **Custom Prompt Templates** (UI-PROMPT-TEMPLATES)
64
+ - Feature: Add system for custom prompt template management
65
+ - Priority: Medium
66
+ - Components: Template editor, save/load functionality, template validation
67
+
68
+ - [ ] **Enhance Service Reliability**
69
+ - Feature: Add retry mechanism for service initialization failures
70
+ - Priority: Medium
71
+
72
+ ### 📈 Future Enhancements
73
+ - [ ] **Add Comprehensive API Retries** (OPENAI-NO-RETRIES)
74
+ - Feature: Enhanced retry mechanism for API calls with exponential backoff
75
+ - Priority: Low
76
+
77
+ - [ ] **Improve Observability**
78
+ - Feature: Add tags to LangSmith tracing decorators
79
+ - Files: `rag_processor.py`, `openai_service.py`
80
+ - Priority: Low
81
+
82
+
83
+ ## 📝 Project Notes
84
+ - Memory Bank system initialized via VAN mode
85
+ - VAN mode technical validation completed
86
+ - Rules compliance analysis completed (2025-05-06)
87
+ - Product documentation enhanced (2025-05-06 16:35:20)
88
+ - Final assessment confirms moderate complexity (Level 2)
89
+ - Core RAG functionality is well-implemented with good architecture
90
+ - PLAN mode completed with successful implementation of enhancements
91
+ - Project has robust error handling and good separation of concerns
92
+ - New analysis (2025-05-06) identified 8 issues that require addressing
93
+ - Implementation completed for 4 of 8 issues (2025-05-06)
94
+ - Comprehensive rules compliance report added with 15 prioritized tasks (2025-05-07)
95
+ - UI components refactored for better separation of concerns (2025-05-08)
96
+ - Moved CSS to external files and reduced duplication (2025-05-08)
97
+ - Improved HTML sanitization and Unicode bidirectional text handling (2025-05-08)
98
+
99
+ ---
100
+
101
+ # 📚 Task Archive (Completed)
102
+
103
+ ## ✅ System Initialization (2025-05-06)
104
  - [x] VAN mode initialization (TIMESTAMP: 2025-05-06 16:31:19)
105
  - [x] System files verification
106
  - [x] Project structure evaluation
107
  - [x] Technology stack assessment
108
  - [x] Memory Bank initialization
109
  - [x] Creation of core Memory Bank files
110
+
111
+ ## ✅ Technical Validation
112
  - [x] Initial technical validation
113
  - [x] Detailed code review of app.py
114
  - [x] Detailed code review of rag_processor.py
 
119
  - [x] Performance optimization assessment
120
  - [x] Complexity assessment finalization
121
  - [x] Rules compliance analysis completed
 
 
 
122
  - [x] Complete VAN mode validation
123
+
124
+ ## ✅ Project Management
125
  - [x] Determine appropriate next mode based on complexity
126
  - [x] Execute mode transition
127
  - [x] Update Memory Bank with comprehensive project analysis
 
 
 
128
  - [x] If Level 2-4 complexity → Prepare for PLAN mode
129
  - [x] Document evaluation results in Memory Bank
130
  - [x] Check for missing dependencies or environment variables
131
 
132
+ ## Planning Phase
133
  - [x] Define optimization priorities
134
  - [x] Identify potential enhancements
135
  - [x] Draft architectural improvements
 
137
  - [x] Schedule task execution
138
  - [x] Document technical specifications
139
 
140
+ ## Implementation Achievements
141
  - [x] Improved citation detection using OpenAI instead of regex
142
  - [x] Implemented grid-based prompt gallery with Hebrew examples
143
  - [x] Enhanced CSS for RTL text display
144
  - [x] Updated UI labels for better Hebrew support
145
  - [x] Added collapsible sidebar with settings button
146
  - [x] Improved overall user experience
147
+ - [x] Completed dropdown menu reorganization and renaming (UI-DROPDOWN-RENAME)
148
+ - [x] Implemented advanced citation filtering system (UI-CITATION-FILTER)
149
+ - [x] Optimized RTL citation display with proper Hebrew support (UI-RTL-CITATIONS)
150
+ - [x] Fixed dynamic imports in retriever.py (RET-DYNAMIC-IMPORTS)
151
+ - [x] Fixed dynamic imports in utils/__init__.py (UTIL-DYNAMIC-CONFIG-IMPORT)
152
+ - [x] Converted retriever.retrieve_documents to async with asyncio.to_thread for Pinecone queries
153
+ - [x] Updated embedding function to use AsyncOpenAI client and asyncio.sleep
154
+ - [x] Completed Streamlit UI Enhancement (UI-STREAMLIT-ENHANCE) - Refactored components, improved CSS, optimized RTL handling (2025-05-08)
155
 
156
+ ## Bug Fixes
157
  - [x] Fix missing input variable handling in app.py (Line 171: Using `prompt` without initializing)
158
  - [x] Implement error handling for asyncio loop failures
159
  - [x] Enhance exception handling in the main RAG processing block
160
  - [x] Review and secure HTML rendering (multiple instances of `unsafe_allow_html=True`)
 
161
  - [x] Remove debug print statements from production code
162
+ - [x] Fixed unnecessary citation-extraction API calls on empty responses (2025-05-08)
163
+ - [x] Added unicode bidirectional text optimization to prevent browser cursor jumps (2025-05-08)
164
+ - [x] Improved HTML sanitization to prevent XSS with better timing (2025-05-08)
165
+
166
+ ## Documentation
167
  - [x] Improve product description in projectbrief.md
168
  - [x] Expand user experience workflow in productContext.md
169
  - [x] Add detailed corpus and knowledge base information
 
171
  - [x] Enhance technical architecture documentation
172
  - [x] Clarify core functionality and capabilities
173
  - [x] Update activeContext.md with documentation improvement tracking
174
+ - [x] Create comprehensive rules compliance report with specific issue tracking
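Note on the Executive Summary counts: the task-tracking rule added in this commit asks for the "Pending Tasks" and "Completed Tasks" numbers to be adjusted after each process step. A small helper can recount the checkboxes instead of relying on manual edits. The sketch below is an assumption about how that could be done; `count_tasks` is a hypothetical name, and it only counts `- [ ]` and `- [x]` items, so the `✓` bullets in the achievements summary are ignored.

```python
import re
from typing import Dict

# Matches markdown task-list items such as "- [ ] todo" or "  - [x] done".
_TASK_RE = re.compile(r"^[ \t]*[-*][ \t]+\[( |x|X)\][ \t]+", re.MULTILINE)


def count_tasks(markdown: str) -> Dict[str, int]:
    """Count pending and completed checkbox items in a markdown document."""
    counts = {"pending": 0, "completed": 0}
    for match in _TASK_RE.finditer(markdown):
        key = "pending" if match.group(1) == " " else "completed"
        counts[key] += 1
    return counts


sample = "- [x] VAN mode initialization\n- [ ] Custom Prompt Templates\n  - [x] nested item\n- ✓ summary bullet\n"
result = count_tasks(sample)
```

Running it over the real file and comparing against the summary would show whether the stated counts still match the checkboxes.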
pipeline/__init__.py ADDED
@@ -0,0 +1,5 @@
+ # Pipeline components package
+ """
+ Contains pipeline processing components:
+ - rag.py: RAG pipeline wrapper and processing
+ """
pipeline/rag.py ADDED
@@ -0,0 +1,105 @@
+ import asyncio
+ import logging
+ import traceback
+ from typing import Dict, Any, List, Callable, Optional
+
+ import streamlit as st
+
+ # Setup logger
+ logger = logging.getLogger(__name__)
+
+ async def process_rag_request(
+     history: List[Dict[str, Any]],
+     params: Dict[str, Any],
+     status_callback: Optional[Callable[[str], None]] = None,
+     stream_callback: Optional[Callable[[str], None]] = None
+ ) -> Dict[str, Any]:
+     """
+     Process a RAG request asynchronously.
+
+     Args:
+         history (List[Dict[str, Any]]): Message history
+         params (Dict[str, Any]): RAG parameters
+         status_callback (Optional[Callable]): Callback for status updates
+         stream_callback (Optional[Callable]): Callback for streaming response chunks
+
+     Returns:
+         Dict[str, Any]: Response data including final response, documents, and logs
+     """
+     from i18n import get_text
+     from rag_processor import execute_validate_generate_pipeline
+
+     try:
+         return await execute_validate_generate_pipeline(
+             history=history,
+             params=params,
+             status_callback=status_callback,
+             stream_callback=stream_callback
+         )
+     except asyncio.CancelledError:
+         logger.warning("RAG request was cancelled")
+         return {
+             "final_response": get_text('request_cancelled'),
+             "error": "Request cancelled",
+             "status_log": ["Request cancelled by user or system"],
+             "generator_input_documents": [],
+             "pipeline_used": "Cancelled"
+         }
+     except asyncio.TimeoutError:
+         logger.error("RAG request timed out")
+         return {
+             "final_response": get_text('request_timeout'),
+             "error": "Request timed out",
+             "status_log": ["Request exceeded maximum allowed time"],
+             "generator_input_documents": [],
+             "pipeline_used": "Timeout"
+         }
+     except Exception as e:
+         logger.exception("Error in RAG processing")
+         return {
+             "final_response": f"{get_text('processing_error')}: {type(e).__name__}",
+             "error": str(e),
+             "status_log": [f"Error: {type(e).__name__}", traceback.format_exc()],
+             "generator_input_documents": [],
+             "pipeline_used": "Error"
+         }
+
+ def create_async_execution_context():
+     """
+     Create or get the appropriate asyncio execution context.
+
+     Returns:
+         asyncio.AbstractEventLoop: The event loop to use
+     """
+     try:
+         # Try to get the current running loop
+         loop = asyncio.get_running_loop()
+     except RuntimeError:
+         # No running loop, create a new one
+         loop = asyncio.new_event_loop()
+         asyncio.set_event_loop(loop)
+
+     return loop
+
+ def extract_citations(response: str) -> List[str]:
+     """
+     Extract citation IDs from a response text.
+
+     Args:
+         response (str): Response text with potential citations
+
+     Returns:
+         List[str]: List of citation IDs
+     """
+     # Early return if empty response
+     if not response or not response.strip():
+         return []
+
+     try:
+         from services.openai_service import extract_citations_with_openai
+         loop = create_async_execution_context()
+         return loop.run_until_complete(extract_citations_with_openai(response))
+     except Exception as e:
+         logger.exception("Failed to extract citations")
+         st.error(f"Citation extraction failed: {e}", icon="⚠️")
+         return []
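Note on `create_async_execution_context` and `extract_citations`: when `asyncio.get_running_loop()` succeeds, the returned loop is already running, and calling `run_until_complete` on a running loop raises `RuntimeError`. The `except Exception` branch would then report a citation failure rather than a result. One possible way to call a coroutine from synchronous code in both situations is sketched below. `run_coro_sync` and `fake_extract` are hypothetical names for illustration, not part of this commit.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine, List, TypeVar

T = TypeVar("T")


def run_coro_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: asyncio.run can own a fresh loop.
        return asyncio.run(coro)
    # A loop is already running here, so execute the coroutine on a separate
    # thread with its own loop. This blocks the caller until the result is ready.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fake_extract(text: str) -> List[str]:
    """Hypothetical stand-in for an async citation extractor."""
    await asyncio.sleep(0)
    return [word for word in text.split() if word.startswith("#")]


# Called with no running loop.
plain_result = run_coro_sync(fake_extract("see #12 and #34"))


# Called from inside a running loop.
async def outer() -> List[str]:
    return run_coro_sync(fake_extract("see #56"))


nested_result = asyncio.run(outer())
```

The thread-based branch blocks the calling loop while it waits, so in code that is already async it is usually simpler to `await` the coroutine directly.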
prompts/__init__.py CHANGED
@@ -4,4 +4,5 @@ Provides access to system and validation prompts.
  """
 
  from .system_prompt import OPENAI_SYSTEM_PROMPT
- from .validation_prompt import VALIDATION_PROMPT_TEMPLATE
+ from .validation_prompt import VALIDATION_PROMPT_TEMPLATE
+ from .templates import get_templates_for_language, get_template_by_id
prompts/templates.py ADDED
@@ -0,0 +1,140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Prompt templates module for the DivreiYoel app.
3
+ Provides structured templates for common query patterns.
4
+ """
5
+ from typing import Dict, Any, List
6
+ import logging
7
+
8
+ # Setup logger
9
+ logger = logging.getLogger(__name__)
10
+
11
+ # Template structure:
12
+ # - id: Unique identifier for the template
13
+ # - name: Display name for the template (localized)
14
+ # - template: The template text that prefixes user input
15
+ # - isolate_query: Whether to isolate the user's query when searching Pinecone
16
+ # - description: Optional description of what the template does
17
+ # - language_code: The language code this template belongs to (added for verification)
18
+
19
+ PROMPT_TEMPLATES = {
20
+ "he": [
21
+ {
22
+ "id": "dvar_torah_support",
23
+ "name": "גיבוי לרעיון לדבר תורה",
24
+ "template": """תפקידך הוא לעזור לי לגבות את רעיונות דברי התורה שלי ולתמוך בהם עם מקורות שיכולים לתמוך במושג או במושגים שאני מבסס עליהם את דבר התורה שלי.
25
+
26
+ עליך להעריך בקפידה כל דבר תורה כדי לראות אם וכיצד הוא יכול לעזור לגבות את דבר התורה שלי, עליך להשתמש ביצירתיות והיגיון כדי להגיע לכך.
27
+
28
+ הנה הרעיון שאני מחפש לגבות:
29
+ """,
30
+ "isolate_query": True,
31
+ "description": "מציאת מקורות לתמיכה ברעיון דבר תורה",
32
+ "language_code": "he"
33
+ },
34
+ {
35
+ "id": "html_template_he",
36
+ "name": "תבנית HTML - מציאת מקורות",
37
+ "template": """<div dir="rtl" lang="he" class="rtl-text mixed-content hebrew-font">
38
+ תפקידך הוא לעזור לי לגבות את רעיונות דברי התורה שלי ולתמוך בהם עם מקורות שיכולים לתמוך במושג או במושג שאני מבסס עליו את הדרשה.
39
+
40
+ <ul>
41
+ <li>יש לאתר <strong>מקורות רלוונטיים</strong> מהחומר הנלמד</li>
42
+ <li>יש להביא <strong>ציטוטים מדויקים</strong> מהמקורות</li>
43
+ <li>יש להסביר את <strong>הקשר בין המקורות</strong> לרעיון המרכזי</li>
44
+ </ul>
45
+
46
+ הנה הרעיון שאני מחפש לגבות:
47
+ </div>""",
48
+ "isolate_query": True,
49
+ "description": "תבנית עם תגיות HTML",
50
+ "language_code": "he"
51
+ }
52
+ ],
53
+ "en": [
54
+ {
55
+ "id": "dvar_torah_support",
56
+ "name": "Dvar Torah Support",
57
+ "template": """Your job is to help me back up my dvar torah ideas that I come up with and back it up with sources that can support the concept or concepts that I'm basing my dvar torah on.
58
+
59
+ You have to carefully evaluate each dvar torah to see if and how it can help back up this dvar torah of mine and how, you have to use some creativity and reasoning in order to get to that.
60
+
61
+ Here's the idea that I'm looking to backup:
62
+ """,
63
+ "isolate_query": True,
64
+ "description": "Find sources to support a dvar torah concept",
65
+ "language_code": "en"
66
+ },
67
+ {
68
+ "id": "html_template_en",
69
+ "name": "HTML Template - Finding Sources",
70
+ "template": """<div dir="ltr" lang="en" class="ltr-text">
71
+ Your task is to help me support my Torah insights by finding relevant sources that can back up the concepts I'm developing in my teaching.
72
+
73
+ <ul>
74
+ <li>Find <strong>relevant sources</strong> from the studied material</li>
75
+ <li>Provide <strong>exact quotations</strong> from these sources</li>
76
+ <li>Explain the <strong>connection between the sources</strong> and the central idea</li>
77
+ </ul>
78
+
79
+ Here's the concept I'm looking to support:
80
+ </div>""",
81
+ "isolate_query": True,
82
+ "description": "Template with HTML tags",
83
+ "language_code": "en"
84
+ }
85
+ ]
86
+ }
87
+
88
+ def get_templates_for_language(language_code: str) -> List[Dict[str, Any]]:
89
+ """
90
+ Get all prompt templates for a specific language.
91
+
92
+ Args:
93
+ language_code (str): Language code ('he' or 'en')
94
+
95
+ Returns:
96
+ List[Dict[str, Any]]: List of template objects for the language
97
+ """
98
+ # Log the requested language code to help debug
99
+ logger.info(f"Getting templates for language code: {language_code}")
100
+
101
+ # Strict language matching - only return templates for the exact language
102
+ templates = PROMPT_TEMPLATES.get(language_code, [])
103
+
104
+ # Fallback to English if no templates found for the requested language
105
+ if not templates and language_code != "en":
106
+ logger.warning(f"No templates found for {language_code}, falling back to English templates")
107
+ templates = PROMPT_TEMPLATES.get("en", [])
108
+
109
+ # Add language_code to each template if not already present
110
+ for template in templates:
111
+ if "language_code" not in template:
112
+ template["language_code"] = language_code
113
+
114
+ # Log the number of templates found
115
+ logger.info(f"Found {len(templates)} templates for language: {language_code}")
116
+
117
+ return templates
118
+
119
+ def get_template_by_id(template_id: str, language_code: str) -> Dict[str, Any]:
120
+ """
121
+ Get a specific template by ID for a language.
122
+
123
+ Args:
124
+ template_id (str): Template identifier
125
+ language_code (str): Language code ('he' or 'en')
126
+
127
+ Returns:
128
+ Dict[str, Any]: Template object or empty dict if not found
129
+ """
130
+ # Log the requested template lookup
131
+ logger.info(f"Looking up template ID '{template_id}' for language: {language_code}")
132
+
133
+ templates = get_templates_for_language(language_code)
134
+ for template in templates:
135
+ if template.get("id") == template_id:
136
+ logger.info(f"Found template: {template.get('name', 'Unnamed')}")
137
+ return template
138
+
139
+ logger.warning(f"Template with ID '{template_id}' not found for language: {language_code}")
140
+ return {}
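The lookup behaviour of the two helpers above can be sketched in isolation. This is a minimal sketch, not the module itself: `TEMPLATES`, `templates_for`, and `template_by_id` are hypothetical stand-ins for `PROMPT_TEMPLATES`, `get_templates_for_language`, and `get_template_by_id`, with logging omitted.

```python
from typing import Any, Dict, List

# Hypothetical, trimmed-down stand-in for PROMPT_TEMPLATES (English only)
TEMPLATES: Dict[str, List[Dict[str, Any]]] = {
    "en": [{"id": "html_template_en", "name": "HTML Template - Finding Sources"}],
}

def templates_for(language_code: str) -> List[Dict[str, Any]]:
    """Return templates for a language, falling back to English when none exist."""
    templates = TEMPLATES.get(language_code, [])
    if not templates and language_code != "en":
        templates = TEMPLATES.get("en", [])
    return templates

def template_by_id(template_id: str, language_code: str) -> Dict[str, Any]:
    """Return the matching template, or an empty dict when the ID is unknown."""
    for template in templates_for(language_code):
        if template.get("id") == template_id:
            return template
    return {}

# No "fr" entry exists here, so the English list is searched instead
found = template_by_id("html_template_en", "fr")
missing = template_by_id("does_not_exist", "en")
```

Because the fallback hands back English templates for another language code, a caller that needs strict matching would have to compare each template's `language_code` with the one it requested.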
rag_processor.py CHANGED
@@ -18,10 +18,29 @@ StatusCallback = Callable[[str], None]
18
  # --- Step Functions ---
19
 
20
  @traceable(name="rag-step-retrieve")
21
- async def run_retrieval_step(query: str, n_retrieve: int, update_status: StatusCallback) -> List[Dict]:
22
  update_status(get_text("retrieving_docs").format(n_retrieve))
23
  start_time = time.time()
24
- retrieved_docs = retriever.retrieve_documents(query_text=query, n_results=n_retrieve)
25
  retrieval_time = time.time() - start_time
26
  update_status(get_text("retrieved_docs").format(len(retrieved_docs), f"{retrieval_time:.2f}"))
27
  if not retrieved_docs:
@@ -143,9 +162,12 @@ async def execute_validate_generate_pipeline(
143
  return result
144
 
145
  try:
 
 
 
146
  # 1. Retrieval
147
  retrieved_docs = await run_retrieval_step(
148
- current_query_text, params['n_retrieve'], update_status_and_log
149
  )
150
  if not retrieved_docs:
151
  result["error"] = get_text("no_docs_found")
 
18
  # --- Step Functions ---
19
 
20
  @traceable(name="rag-step-retrieve")
21
+ async def run_retrieval_step(query: str, n_retrieve: int, update_status: StatusCallback, original_query: str = None) -> List[Dict]:
22
+ """
23
+ Retrieve documents from the vector store.
24
+
25
+ Args:
26
+ query (str): The full query text (may include template)
27
+ n_retrieve (int): Number of documents to retrieve
28
+ update_status (StatusCallback): Status update callback function
29
+ original_query (str, optional): The original user query without template
30
+
31
+ Returns:
32
+ List[Dict]: List of retrieved documents
33
+ """
34
+ # Import inside function to avoid circular imports
35
+ from i18n import get_text
36
+ from services.retriever import retrieve_documents
37
+
38
+ # Use original query for Pinecone search if provided
39
+ search_query = original_query if original_query else query
40
+
41
  update_status(get_text("retrieving_docs").format(n_retrieve))
42
  start_time = time.time()
43
+ retrieved_docs = await retrieve_documents(query_text=search_query, n_results=n_retrieve)
44
  retrieval_time = time.time() - start_time
45
  update_status(get_text("retrieved_docs").format(len(retrieved_docs), f"{retrieval_time:.2f}"))
46
  if not retrieved_docs:
 
162
  return result
163
 
164
  try:
165
+ # Extract original query for search if present
166
+ original_query = params.get('original_query')
167
+
168
  # 1. Retrieval
169
  retrieved_docs = await run_retrieval_step(
170
+ current_query_text, params['n_retrieve'], update_status_and_log, original_query
171
  )
172
  if not retrieved_docs:
173
  result["error"] = get_text("no_docs_found")
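The effect of the new `original_query` parameter can be illustrated with a small sketch. The helper name `pick_search_query` and the sample strings are assumptions made for illustration; only the selection expression mirrors the diff.

```python
from typing import Optional

def pick_search_query(query: str, original_query: Optional[str] = None) -> str:
    """Prefer the bare user query for vector search; fall back to the full prompt."""
    return original_query if original_query else query

# Hypothetical inputs: a template prefix plus the user's own text
template = "Here's the concept I'm looking to support:\n"
user_text = "kindness to strangers"
full_prompt = template + user_text

search_with_original = pick_search_query(full_prompt, user_text)
search_without = pick_search_query(full_prompt)
```

Searching on the user's text alone presumably keeps the template wording from influencing the query embedding. An empty `original_query` falls back to the full prompt because the check is truthiness, not `is None`.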
services/retriever.py CHANGED
@@ -3,33 +3,21 @@
3
  # It correctly uses config and utils.
4
  import time
5
  import traceback
6
- import sys
7
  import os
8
- import importlib.util
9
  from typing import List, Dict, Optional, Tuple
10
  from pinecone import Pinecone, Index
11
  from langsmith import traceable
12
 
13
- try:
14
- # Add parent directory to path to ensure imports work
15
- parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
16
- sys.path.insert(0, parent_dir)
17
-
18
- # Import config.py directly from parent directory
19
- print(f"Loading config from: {os.path.join(parent_dir, 'config.py')}")
20
- spec = importlib.util.spec_from_file_location("config", os.path.join(parent_dir, "config.py"))
21
- config = importlib.util.module_from_spec(spec)
22
- spec.loader.exec_module(config)
23
-
24
- # Import from utils package
25
- print(f"Importing from utils package")
26
- from utils import clean_source_text, get_embedding
27
-
28
- except Exception as e:
29
- print(f"Error loading modules: {e}")
30
- print(f"Current working directory: {os.getcwd()}")
31
- traceback.print_exc()
32
- raise SystemExit("Failed to load required modules in retriever.py")
33
 
34
  # --- Globals ---
35
  pinecone_client: Optional[Pinecone] = None
@@ -42,16 +30,16 @@ def init_retriever() -> Tuple[bool, str]:
42
  """Initializes the Pinecone client and index connection."""
43
  global pinecone_client, pinecone_index, is_retriever_ready, retriever_status_message
44
  if is_retriever_ready: return True, retriever_status_message
45
- if not config.PINECONE_API_KEY:
46
  retriever_status_message = "Error: PINECONE_API_KEY not found in Secrets."
47
  is_retriever_ready = False; return False, retriever_status_message
48
- if not config.OPENAI_API_KEY:
49
  retriever_status_message = "Error: OPENAI_API_KEY not found (needed for query embeddings)."
50
  is_retriever_ready = False; return False, retriever_status_message
51
  try:
52
  print("Retriever: Initializing Pinecone client...")
53
- pinecone_client = Pinecone(api_key=config.PINECONE_API_KEY)
54
- index_name = config.PINECONE_INDEX_NAME
55
  print(f"Retriever: Checking for Pinecone index '{index_name}'...")
56
  available_indexes = [idx.name for idx in pinecone_client.list_indexes().indexes]
57
  if index_name not in available_indexes:
@@ -64,7 +52,7 @@ def init_retriever() -> Tuple[bool, str]:
64
  if stats.total_vector_count == 0:
65
  retriever_status_message = f"Retriever connected, but index '{index_name}' is empty."
66
  else:
67
- retriever_status_message = f"Retriever ready (Index: {index_name}, Embed Model: {config.EMBEDDING_MODEL})."
68
  is_retriever_ready = True
69
  return True, retriever_status_message
70
  except Exception as e:
@@ -78,16 +66,22 @@ def get_retriever_status() -> Tuple[bool, str]:
78
 
79
  # --- Core Function ---
80
  @traceable(name="pinecone-retrieve-documents")
81
- def retrieve_documents(query_text: str, n_results: int) -> List[Dict]:
82
  global pinecone_index
83
  ready, message = get_retriever_status()
84
  if not ready or pinecone_index is None:
85
  print(f"Retriever not ready: {message}"); return []
86
  print(f"Retriever: Retrieving top {n_results} docs for query: '{query_text[:100]}...'"); start_time = time.time()
87
  try:
88
- query_embedding = get_embedding(query_text, model=config.EMBEDDING_MODEL)
89
  if query_embedding is None: print("Retriever: Failed query embedding."); return []
90
- response = pinecone_index.query(vector=query_embedding, top_k=n_results, include_metadata=True)
 
 
 
 
 
 
91
  formatted_results = []
92
  if not response or not response.matches: print("Retriever: No results found."); return []
93
  for match in response.matches:
 
3
  # It correctly uses config and utils.
4
  import time
5
  import traceback
 
6
  import os
7
+ import asyncio
8
  from typing import List, Dict, Optional, Tuple
9
  from pinecone import Pinecone, Index
10
  from langsmith import traceable
11
 
12
+ # Absolute imports (replace the earlier importlib/sys.path loading of config)
13
+ import config
14
+ from config import (
15
+ PINECONE_API_KEY,
16
+ OPENAI_API_KEY,
17
+ PINECONE_INDEX_NAME,
18
+ EMBEDDING_MODEL
19
+ )
20
+ from utils import clean_source_text, get_embedding
 
 
 
 
 
 
 
 
 
 
 
21
 
22
  # --- Globals ---
23
  pinecone_client: Optional[Pinecone] = None
 
30
  """Initializes the Pinecone client and index connection."""
31
  global pinecone_client, pinecone_index, is_retriever_ready, retriever_status_message
32
  if is_retriever_ready: return True, retriever_status_message
33
+ if not PINECONE_API_KEY:
34
  retriever_status_message = "Error: PINECONE_API_KEY not found in Secrets."
35
  is_retriever_ready = False; return False, retriever_status_message
36
+ if not OPENAI_API_KEY:
37
  retriever_status_message = "Error: OPENAI_API_KEY not found (needed for query embeddings)."
38
  is_retriever_ready = False; return False, retriever_status_message
39
  try:
40
  print("Retriever: Initializing Pinecone client...")
41
+ pinecone_client = Pinecone(api_key=PINECONE_API_KEY)
42
+ index_name = PINECONE_INDEX_NAME
43
  print(f"Retriever: Checking for Pinecone index '{index_name}'...")
44
  available_indexes = [idx.name for idx in pinecone_client.list_indexes().indexes]
45
  if index_name not in available_indexes:
 
52
  if stats.total_vector_count == 0:
53
  retriever_status_message = f"Retriever connected, but index '{index_name}' is empty."
54
  else:
55
+ retriever_status_message = f"Retriever ready (Index: {index_name}, Embed Model: {EMBEDDING_MODEL})."
56
  is_retriever_ready = True
57
  return True, retriever_status_message
58
  except Exception as e:
 
66
 
67
  # --- Core Function ---
68
  @traceable(name="pinecone-retrieve-documents")
69
+ async def retrieve_documents(query_text: str, n_results: int) -> List[Dict]:
70
  global pinecone_index
71
  ready, message = get_retriever_status()
72
  if not ready or pinecone_index is None:
73
  print(f"Retriever not ready: {message}"); return []
74
  print(f"Retriever: Retrieving top {n_results} docs for query: '{query_text[:100]}...'"); start_time = time.time()
75
  try:
76
+ query_embedding = await get_embedding(query_text, model=EMBEDDING_MODEL)
77
  if query_embedding is None: print("Retriever: Failed query embedding."); return []
78
+ # Run Pinecone query in a thread to avoid blocking
79
+ response = await asyncio.to_thread(
80
+ pinecone_index.query,
81
+ vector=query_embedding,
82
+ top_k=n_results,
83
+ include_metadata=True
84
+ )
85
  formatted_results = []
86
  if not response or not response.matches: print("Retriever: No results found."); return []
87
  for match in response.matches:
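The `asyncio.to_thread` pattern used for the Pinecone query can be tried without any external service. In this sketch, `blocking_query` is a hypothetical stand-in for a synchronous client call; it is not part of the Pinecone API. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import Dict, List

def blocking_query(top_k: int) -> List[Dict]:
    """Hypothetical stand-in for a synchronous client call such as an index query."""
    time.sleep(0.05)  # simulate network latency
    return [{"id": f"doc-{i}"} for i in range(top_k)]

async def retrieve(top_k: int) -> List[Dict]:
    # asyncio.to_thread forwards keyword arguments to the wrapped function
    return await asyncio.to_thread(blocking_query, top_k=top_k)

async def main() -> List[List[Dict]]:
    # Both lookups run concurrently because neither blocks the event loop
    return await asyncio.gather(retrieve(2), retrieve(3))

results = asyncio.run(main())
```

The wrapped call still occupies a worker thread for its full duration. The gain is that the event loop stays free for other coroutines while it waits.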
ui/__init__.py ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ # UI components package
2
+ """
3
+ Contains UI components for the app:
4
+ - hebrew.py: Hebrew text handling
5
+ - chat_render.py: Chat UI rendering functions
6
+ """
ui/chat_render.py ADDED
@@ -0,0 +1,164 @@
1
+ import streamlit as st
2
+ from typing import Dict, Any, List
3
+ import logging
4
+
5
+ # Setup logger
6
+ logger = logging.getLogger(__name__)
7
+
8
+ def format_source_html(doc: Dict[str, Any], i: int, hebrew_font: str, get_text: callable) -> tuple:
9
+ """
10
+ Format a single source document as HTML with proper RTL styling.
11
+
12
+ Args:
13
+ doc (Dict): Source document
14
+ i (int): Source index
15
+ hebrew_font (str): Hebrew font to use
16
+ get_text (callable): Function to get translated text
17
+
18
+ Returns:
19
+ tuple: (source_html, text_html) formatted HTML strings
20
+ """
21
+ from utils.sanitization import sanitize_html
22
+ from utils import clean_source_text
23
+
24
+ source = doc.get('source_name', '') or get_text('unknown_source')
25
+ source = sanitize_html(source)
26
+
27
+ # Clean and sanitize the Hebrew text
28
+ text = doc.get('hebrew_text', '')
29
+ if not text:  # covers a missing key (the '' default) as well as an explicit None
30
+ text = get_text('no_text_available')
31
+ text = clean_source_text(text)
32
+ text = sanitize_html(text)
33
+
34
+ # Force RTL and proper Hebrew font styling for sources
35
+ source_html = f"""
36
+ <div class='source-info rtl-text hebrew-font' dir='rtl' lang="he">
37
+ <strong>{get_text('source_label').format(i)}</strong> {source}
38
+ </div>
39
+ """
40
+
41
+ text_html = f"""
42
+ <div class='hebrew-text rtl-text hebrew-font' dir='rtl' lang="he">
43
+ {text}
44
+ </div>
45
+ """
46
+
47
+ return source_html, text_html
48
+
49
+ def display_chat_message(message: Dict[str, Any]):
50
+ """
51
+ Display a chat message with proper formatting.
52
+
53
+ Args:
54
+ message (Dict[str, Any]): Message object with role, content, and optional metadata
55
+ """
56
+ # Import here to avoid circular imports
57
+ from i18n import get_direction, get_text
58
+ from utils.sanitization import sanitize_html, escape_html
59
+ from ui.hebrew import handle_mixed_language_text
60
+
61
+ text_direction = get_direction()
62
+ role = message.get("role", "assistant")
63
+
64
+ # Ensure we have a valid font setting
65
+ if 'hebrew_font' not in st.session_state:
66
+ st.session_state.hebrew_font = "Noto Rashi Hebrew"
67
+
68
+ # Use the current font
69
+ hebrew_font = st.session_state.hebrew_font
70
+
71
+ with st.chat_message(role):
72
+ # Sanitize the message content for HTML rendering
73
+ content = message.get('content', '')
74
+ if isinstance(content, str):
75
+ # Escape any HTML tags in the original content
76
+ content = escape_html(content)
77
+ content = sanitize_html(content)
78
+
79
+ # Process with the mixed language handler
80
+ content = handle_mixed_language_text(content, hebrew_font)
81
+ # Final sanitization after processing
82
+ content = sanitize_html(content)
83
+
84
+ # If this is a user message (prompt) and it's longer than a threshold, show it in an expandable section
85
+ if role == "user" and len(content) > 150:
86
+ # Create a preview of the content
87
+ preview_content = content[:147] + "..."
88
+ preview_content = handle_mixed_language_text(preview_content, hebrew_font)
89
+ preview_content = sanitize_html(preview_content)
90
+
91
+ # Show the preview
92
+ st.markdown(preview_content, unsafe_allow_html=True)
93
+
94
+ # Show the full content in an expander
95
+ with st.expander(get_text('show_full_prompt'), expanded=False):
96
+ st.markdown(f"""<div dir="{text_direction}"
97
+ class="prompt-full-text {text_direction}-text">
98
+ {content}
99
+ </div>""", unsafe_allow_html=True)
100
+ else:
101
+ # Display normal content
102
+ st.markdown(content, unsafe_allow_html=True)
103
+
104
+ if role == "assistant" and message.get("final_docs"):
105
+ docs = message["final_docs"]
106
+ # Use a simple text title for the expander
107
+ with st.expander(f"{get_text('sources_title')} ({len(docs)})", expanded=False):
108
+ # Add the rich HTML content inside the expander (static HTML is safe)
109
+ expander_title = f"""
110
+ <div class='expander-title rtl-text hebrew-font' dir="rtl" lang="he">
111
+ {get_text('sources_text').format(len(docs))}
112
+ </div>
113
+ """
114
+ st.markdown(expander_title, unsafe_allow_html=True)
115
+ st.markdown(f"""
116
+ <div dir='rtl' lang="he" class='expander-content rtl-text hebrew-font'>
117
+ """, unsafe_allow_html=True)
118
+
119
+ for i, doc in enumerate(docs, start=1):
120
+ source_html, text_html = format_source_html(doc, i, hebrew_font, get_text)
121
+ st.markdown(source_html, unsafe_allow_html=True)
122
+ st.markdown(text_html, unsafe_allow_html=True)
123
+
124
+ st.markdown("</div>", unsafe_allow_html=True)
125
+
126
+
127
+ def display_status_updates(status_log: List[str]):
128
+ """
129
+ Display processing status log in an expander.
130
+
131
+ Args:
132
+ status_log (List[str]): List of status update messages
133
+ """
134
+ # Import here to avoid circular imports
135
+ from i18n import get_direction, get_text
136
+
137
+ text_direction = get_direction()
138
+
139
+ # Ensure we have a valid font setting
140
+ if 'hebrew_font' not in st.session_state:
141
+ st.session_state.hebrew_font = "Noto Rashi Hebrew"
142
+
143
+ hebrew_font = st.session_state.hebrew_font
144
+
145
+ # Text alignment based on direction
146
+ text_align = "right" if text_direction == "rtl" else "left"
147
+
148
+ if status_log:
149
+ # Use a simple text title for the expander
150
+ with st.expander(get_text('processing_details'), expanded=False):
151
+ # Add the rich HTML content inside the expander
152
+ st.markdown(f"""<div
153
+ class='expander-title {text_direction}-text hebrew-font'
154
+ dir="{text_direction}"
155
+ >{get_text('processing_log')}</div>""", unsafe_allow_html=True)
156
+
157
+ for u in status_log:
158
+ st.markdown(
159
+ f"""<code
160
+ class='status-update {text_direction}-text hebrew-font'
161
+ dir="{text_direction}"
162
+ >- {u}</code>""",
163
+ unsafe_allow_html=True
164
+ )
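One thing to watch in `display_chat_message`: the preview slices `content[:147]` after the content has already been escaped and wrapped in a `<div>`, so the cut can land inside a tag or an entity. A possible alternative ordering is sketched below. The helper `make_preview` is hypothetical and uses the standard library's `html.escape` in place of the project's own helpers.

```python
import html

def make_preview(raw_text: str, limit: int = 150) -> str:
    """Truncate plain text first, then escape, so no tag or entity is cut in half."""
    if len(raw_text) > limit:
        raw_text = raw_text[: limit - 3] + "..."
    return html.escape(raw_text)

# A long message that starts with markup-like characters
preview = make_preview("<b>" + "a" * 200)
```

In this ordering the length threshold applies to the user's raw text rather than to the generated HTML, which changes when the expander appears.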
ui/hebrew.py ADDED
@@ -0,0 +1,80 @@
1
+ import re
2
+ from typing import Optional
3
+
4
+ def contains_hebrew(text: str) -> bool:
5
+ """
6
+ Check if text contains Hebrew characters.
7
+
8
+ Args:
9
+ text (str): Text to check
10
+
11
+ Returns:
12
+ bool: True if text contains Hebrew characters
13
+ """
14
+ # Hebrew block (U+0590-U+05FF) plus Hebrew presentation forms (U+FB1D-U+FB4F)
15
+ hebrew_pattern = re.compile(r'[\u0590-\u05FF\uFB1D-\uFB4F]')
16
+ return bool(hebrew_pattern.search(text))
17
+
18
+ def contains_english(text: str) -> bool:
19
+ """
20
+ Check if text contains English characters.
21
+
22
+ Args:
23
+ text (str): Text to check
24
+
25
+ Returns:
26
+ bool: True if text contains English characters
27
+ """
28
+ english_pattern = re.compile(r'[a-zA-Z]')
29
+ return bool(english_pattern.search(text))
30
+
31
+ def handle_mixed_language_text(text: str, hebrew_font: str) -> str:
32
+ """
33
+ Process text that may contain both Hebrew and English to display correctly.
34
+
35
+ Args:
36
+ text (str): The text to process
37
+ hebrew_font (str): The Hebrew font to use
38
+
39
+ Returns:
40
+ str: Processed HTML with proper bidirectional handling
41
+ """
42
+ has_hebrew = contains_hebrew(text)
43
+ has_english = contains_english(text)
44
+
45
+ if has_hebrew and has_english:
46
+ # For mixed language content, we need to identify and wrap English words
47
+ # Split by whitespace to find words
48
+ words = text.split()
49
+ processed_words = []
50
+
51
+ for word in words:
52
+ # If word contains English but not Hebrew, wrap it in LTR span
53
+ if contains_english(word) and not contains_hebrew(word):
54
+ processed_words.append(f'<span dir="ltr">{word}</span>')
55
+ else:
56
+ processed_words.append(word)
57
+
58
+ # Join words back with spaces
59
+ processed_text = ' '.join(processed_words)
60
+
61
+ # Wrap everything in RTL container
62
+ return f"""
63
+ <div dir="rtl" lang="he" class="rtl-text mixed-content hebrew-font">
64
+ {processed_text}
65
+ </div>
66
+ """
67
+ elif has_hebrew:
68
+ # For Hebrew-only content, use RTL direction
69
+ return f"""
70
+ <div dir="rtl" lang="he" class="rtl-text hebrew-font">
71
+ {text}
72
+ </div>
73
+ """
74
+ else:
75
+ # For English-only content, use LTR direction
76
+ return f"""
77
+ <div dir="ltr" lang="en" class="ltr-text hebrew-font">
78
+ {text}
79
+ </div>
80
+ """
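The word-level wrapping in `handle_mixed_language_text` can be sketched on its own. The constant and function names below are illustrative, and the outer RTL `<div>` is left out to keep the output short.

```python
import re

# Same character classes as contains_hebrew / contains_english in the diff
HEBREW = re.compile(r'[\u0590-\u05FF\uFB1D-\uFB4F]')
LATIN = re.compile(r'[a-zA-Z]')

def wrap_ltr_words(text: str) -> str:
    """Wrap purely Latin words in an LTR span; leave every other word untouched."""
    out = []
    for word in text.split():
        if LATIN.search(word) and not HEBREW.search(word):
            out.append(f'<span dir="ltr">{word}</span>')
        else:
            out.append(word)
    return ' '.join(out)

wrapped = wrap_ltr_words("שלום world")
```

`str.split()` with no arguments collapses every run of whitespace, newlines included, so multi-paragraph input comes back as a single line.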
utils/__init__.py CHANGED
@@ -1,18 +1,13 @@
1
  from .sanitization import sanitize_html
2
  import re
3
- import sys
4
  import os
5
- import time
6
- import openai
7
  from typing import List, Dict, Optional
 
8
 
9
- # Add config to the path if needed
10
- sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
11
- try:
12
- import config
13
- except ImportError:
14
- print("Error: Failed to import config in utils/__init__.py")
15
- raise
16
 
17
  def clean_source_text(text: str) -> str:
18
  """
@@ -30,9 +25,9 @@ def clean_source_text(text: str) -> str:
30
  text = re.sub(r'\s+', ' ', text).strip()
31
  return text
32
 
33
- def get_embedding(text: str, model: str = None, max_retries: int = 3) -> Optional[List[float]]:
34
  """
35
- Get embedding for text using OpenAI's API
36
 
37
  Args:
38
  text (str): Text to get embedding for
@@ -43,9 +38,9 @@ def get_embedding(text: str, model: str = None, max_retries: int = 3) -> Optiona
43
  List[float]: Embedding vector or None if failed
44
  """
45
  if model is None:
46
- model = config.EMBEDDING_MODEL
47
 
48
- openai_client = openai.OpenAI(api_key=config.OPENAI_API_KEY)
49
 
50
  if not text or not isinstance(text, str):
51
  print("Error: Invalid input text for embedding.")
@@ -59,16 +54,16 @@ def get_embedding(text: str, model: str = None, max_retries: int = 3) -> Optiona
59
  attempt = 0
60
  while attempt < max_retries:
61
  try:
62
- response = openai_client.embeddings.create(input=[cleaned_text], model=model)
63
  return response.data[0].embedding
64
  except openai.RateLimitError as e:
65
  wait_time = (2 ** attempt)
66
  print(f"Rate limit embedding. Retrying in {wait_time}s...")
67
- time.sleep(wait_time)
68
  attempt += 1
69
  except openai.APIConnectionError as e:
70
  print(f"Connection error embedding. Retrying...")
71
- time.sleep(2)
72
  attempt += 1
73
  except Exception as e:
74
  print(f"Error generating embedding (Attempt {attempt + 1}/{max_retries}): {type(e).__name__}")
 
1
  from .sanitization import sanitize_html
2
  import re
 
3
  import os
4
+ import asyncio
 
5
  from typing import List, Dict, Optional
6
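+ import openai  # still required by the openai.RateLimitError / openai.APIConnectionError handlers below
+ from openai import AsyncOpenAI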
7
 
8
+ # Absolute imports (replace the earlier sys.path manipulation)
9
+ import config
10
+ from config import OPENAI_API_KEY, EMBEDDING_MODEL
 
 
 
 
11
 
12
  def clean_source_text(text: str) -> str:
13
  """
 
25
  text = re.sub(r'\s+', ' ', text).strip()
26
  return text
27
 
28
+ async def get_embedding(text: str, model: str = None, max_retries: int = 3) -> Optional[List[float]]:
29
  """
30
+ Get embedding for text using OpenAI's API asynchronously
31
 
32
  Args:
33
  text (str): Text to get embedding for
 
38
  List[float]: Embedding vector or None if failed
39
  """
40
  if model is None:
41
+ model = EMBEDDING_MODEL
42
 
43
+ openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
44
 
45
  if not text or not isinstance(text, str):
46
  print("Error: Invalid input text for embedding.")
 
54
  attempt = 0
55
  while attempt < max_retries:
56
  try:
57
+ response = await openai_client.embeddings.create(input=[cleaned_text], model=model)
58
  return response.data[0].embedding
59
  except openai.RateLimitError as e:
60
  wait_time = (2 ** attempt)
61
  print(f"Rate limit embedding. Retrying in {wait_time}s...")
62
+ await asyncio.sleep(wait_time)
63
  attempt += 1
64
  except openai.APIConnectionError as e:
65
  print(f"Connection error embedding. Retrying...")
66
+ await asyncio.sleep(2)
67
  attempt += 1
68
  except Exception as e:
69
  print(f"Error generating embedding (Attempt {attempt + 1}/{max_retries}): {type(e).__name__}")
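The retry loop in the async `get_embedding` follows an exponential-backoff shape that can be exercised without the OpenAI client. In this sketch, `TransientError`, `with_backoff`, and `flaky_embedding` are hypothetical names, and the delays are shortened so it runs quickly.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit or connection error."""

async def with_backoff(
    call: Callable[[], Awaitable[List[float]]],
    max_retries: int = 3,
    base_delay: float = 0.01,
) -> Optional[List[float]]:
    """Await `call`, sleeping base_delay * 2**attempt after each transient failure."""
    for attempt in range(max_retries):
        try:
            return await call()
        except TransientError:
            # asyncio.sleep yields to the event loop instead of blocking it
            await asyncio.sleep(base_delay * (2 ** attempt))
    return None

calls = {"n": 0}

async def flaky_embedding() -> List[float]:
    """Fail twice, then return a small fake vector."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("try again")
    return [0.1, 0.2]

vector = asyncio.run(with_backoff(flaky_embedding))
```

Returning `None` once the retries are exhausted matches the `Optional[List[float]]` contract of `get_embedding`, so callers must check for it.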
utils/sanitization.py CHANGED
@@ -1,31 +1,83 @@
1
  import bleach # For HTML sanitization
 
2
 
3
  def sanitize_html(html_content: str) -> str:
4
  """
5
- Sanitize HTML content to prevent XSS attacks but preserve necessary tags
6
- for Hebrew text and RTL formatting.
7
  """
8
  if not isinstance(html_content, str):
9
  return str(html_content)
10
-
 
11
  allowed_tags = [
12
- 'div', 'p', 'span', 'br', 'hr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
13
- 'ul', 'ol', 'li', 'dl', 'dt', 'dd', 'img', 'a', 'strong', 'em',
14
- 'b', 'i', 'u', 'code', 'pre', 'blockquote', 'table', 'thead',
15
- 'tbody', 'tr', 'th', 'td', 'sup', 'sub', 'details', 'summary'
 
 
 
 
 
 
 
 
 
 
 
16
  ]
 
 
17
  allowed_attributes = {
18
- '*': ['class', 'style', 'dir'],
 
 
 
19
  'a': ['href', 'title', 'target'],
20
  'img': ['src', 'alt', 'width', 'height'],
21
- 'div': ['dir', 'class', 'style'],
22
- 'span': ['dir', 'class', 'style'],
23
  }
24
 
25
- # Use bleach to sanitize the HTML - only using supported parameters
26
  return bleach.clean(
27
  html_content,
28
  tags=allowed_tags,
29
  attributes=allowed_attributes,
 
30
  strip=True
31
- )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  import bleach # For HTML sanitization
2
+ from bleach.css_sanitizer import CSSSanitizer
3
 
4
  def sanitize_html(html_content: str) -> str:
5
  """
6
+ Sanitize HTML content to prevent XSS attacks while preserving Hebrew text and RTL support.
 
7
  """
8
  if not isinstance(html_content, str):
9
  return str(html_content)
10
+
11
+ # Allowed HTML tags organized by purpose
12
  allowed_tags = [
13
+ # Structure elements
14
+ 'div', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
15
+ 'ul', 'ol', 'li', 'blockquote', 'pre',
16
+
17
+ # Inline formatting
18
+ 'span', 'strong', 'em', 'b', 'i', 'u', 'code', 'sup', 'sub',
19
+
20
+ # Interactive elements
21
+ 'a', 'img', 'button',
22
+
23
+ # Layout elements
24
+ 'br', 'hr', 'table', 'thead', 'tbody', 'tr', 'th', 'td',
25
+
26
+ # Style
27
+ 'style'
28
  ]
29
+
30
+ # Allowed HTML attributes
31
  allowed_attributes = {
32
+ # Global attributes
33
+ '*': ['class', 'style', 'dir', 'lang'],
34
+
35
+ # Specific elements
36
  'a': ['href', 'title', 'target'],
37
  'img': ['src', 'alt', 'width', 'height'],
38
+ 'div': ['data-testid'],
39
+ 'button': ['kind']
40
  }
41
 
42
+ # Essential CSS properties for proper text direction and sidebar positioning
43
+ allowed_css_properties = [
44
+ # Text formatting
45
+ 'font-family', 'font-size', 'font-weight', 'font-style',
46
+ 'text-align', 'direction', 'line-height', 'color',
47
+
48
+ # Basic layout
49
+ 'display', 'margin', 'padding', 'width', 'height',
50
+ 'border', 'border-radius', 'background-color',
51
+
52
+ # Positioning (needed for sidebar)
53
+ 'position', 'top', 'right', 'bottom', 'left', 'z-index'
54
+ ]
55
+
56
+ # Create CSS sanitizer
57
+ css_sanitizer = CSSSanitizer(allowed_css_properties=allowed_css_properties)
58
+
59
+ # Sanitize and return the HTML
60
  return bleach.clean(
61
  html_content,
62
  tags=allowed_tags,
63
  attributes=allowed_attributes,
64
+ css_sanitizer=css_sanitizer,
65
  strip=True
66
+ )
67
+
68
+ def escape_html(text: str) -> str:
69
+ """
70
+ Escape HTML characters in a string to safely embed it within HTML.
71
+ This is used for user inputs that should not contain HTML.
72
+
73
+ Args:
74
+ text (str): Text that might contain HTML characters
75
+
76
+ Returns:
77
+ str: Text with HTML characters escaped
78
+ """
79
+ if not isinstance(text, str):
80
+ text = str(text)
81
+
82
+ # Replace HTML special characters with their entities
83
+ return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;')
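For comparison, the standard library's `html.escape` covers the same five characters as `escape_html`. The wrapper name `escape_user_text` is an assumption for illustration. One difference to keep in mind: `html.escape` emits `&#x27;` for a single quote, where the hand-written chain emits `&#39;`.

```python
import html

def escape_user_text(text) -> str:
    """Escape &, <, >, and both quote characters using the standard library."""
    return html.escape(str(text), quote=True)

escaped = escape_user_text('<script>alert("x")</script>')
single_quote = escape_user_text("it's")
```

Both spellings of the apostrophe entity render the same way in browsers, so the two helpers should behave equivalently for display purposes.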