TilanB committed
Commit 2a7fd26 · 1 Parent(s): 50fcf88

improvements

.gitattributes DELETED
@@ -1,35 +0,0 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -1,6 +1,50 @@
1
  # SmartDoc AI
2
 
3
- SmartDoc AI is an advanced document analysis and question answering system. It allows you to upload documents, ask questions, and receive accurate, source-verified answers. The system uses a multi-agent workflow, hybrid search, and both local and cloud-based chart detection for high performance and cost efficiency.
4
 
5
  ---
6
 
@@ -27,8 +71,8 @@ SmartDoc AI is an advanced document analysis and question answering system. It a
27
 
28
  1. Clone the repository:
29
  ```bash
30
- git clone https://github.com/TilanTAB/Intelligent-Document-Analysis-Q-A-3.git
31
- cd Intelligent-Document-Analysis-Q-A-3
32
  ```
33
 
34
  2. Activate the virtual environment:
@@ -43,7 +87,7 @@ activate_venv.bat
43
 
44
  3. Install dependencies (if needed):
45
  ```bash
46
- pip install -r dependencies.txt
47
  ```
48
 
49
  4. Configure environment variables:
@@ -114,4 +158,4 @@ This project is licensed under the MIT License.
114
 
115
  ---
116
 
117
- SmartDoc AI is actively maintained and designed for real-world document analysis and Q&A. For updates and support, visit the [GitHub repository](https://github.com/TilanTAB/Intelligent-Document-Analysis-Q-A-3).
 
1
  # SmartDoc AI
2
 
3
+ SmartDoc AI is an advanced document analysis and question-answering system, designed for source-grounded Q&A over complex business and scientific reports, especially where key evidence lives in tables and charts.
4
+
5
+ ---
6
+
7
+ ## Personal Research Update
8
+
9
+ **SmartDoc AI: Document Q&A + Selective Chart Understanding**
10
+
11
+ I've been developing SmartDoc AI as a technical experiment to improve question answering over complex business/scientific reports, especially where key evidence lives in tables and charts.
12
+
13
+ ### Technical highlights
14
+
15
+ - **Multi-format ingestion:** PDF, DOCX, TXT, Markdown
16
+ - **LLM-assisted query decomposition:** breaks complex prompts into clearer sub-questions for retrieval + answering
17
+ - **Selective chart pipeline (cost-aware):**
18
+ - Local OpenCV heuristics flag pages that likely contain charts
19
+ - Gemini Vision is invoked only for chart pages to generate structured chart analysis (reduces unnecessary vision calls)
20
+ - **Table extraction + robust PDF parsing:** pdfplumber strategies for bordered and borderless tables
21
+ - **Parallelized processing:** concurrent PDF parsing + chart detection; batch chart analysis where enabled
22
+ - **Hybrid retrieval:** BM25 + vector search combined via an ensemble retriever
23
+ - **Multi-agent answering:** answer drafting + verification pass, with retrieved context available for inspection (page/source metadata)
24
+
25
+ **Runtime note:** Large PDFs (many pages/charts) can take minutes depending on DPI, chart volume, and available memory/CPU (HF Spaces limits can be a factor).
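The cost-aware gating described in the highlights can be sketched as follows. This is illustrative only; the function names (`analyze_pages`, `local_detect`, `vision_analyze`) are hypothetical stand-ins, not the actual SmartDoc API:

```python
def analyze_pages(pages, local_detect, vision_analyze, min_confidence=0.4):
    """Run the free local detector on every page; invoke the paid vision
    model only for pages flagged as likely charts above the threshold."""
    results = []
    for page in pages:
        detection = local_detect(page)  # cheap OpenCV-style heuristics
        if detection["has_chart"] and detection["confidence"] >= min_confidence:
            results.append(vision_analyze(page))  # expensive vision call
        else:
            results.append(None)  # skipped: no likely chart on this page
    return results
```

With a detector that flags two of three pages, only two vision calls are made; the rest are skipped for free.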
26
+
27
+ ---
28
+
29
+ ## Demo Videos
30
+
31
+ - [SmartDoc AI technical demo #1](https://youtu.be/uVU_sLiJU4w)
32
+ - [SmartDoc AI technical demo #2](https://youtu.be/c8CF7-OaKmQ)
33
+ - [SmartDoc AI technical demo #3](https://youtu.be/P17SZSQJ6Wc)
34
+
35
+ ---
36
+
37
+ ## Repository
38
+ https://github.com/TilanTAB/Intelligent-Document-Analysis-SmartDoc-AI
39
+
40
+ ---
41
+
42
+ ## Use Cases
43
+
44
+ - Source-grounded Q&A for business/research documents
45
+ - Automated extraction and summarization from tables/charts
46
+
47
+ If you're interested in architecture tradeoffs (cost, latency, memory limits, retrieval quality), feel free to connect.
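For context on the hybrid retrieval mentioned in the highlights: one common way to combine a BM25 ranking with a vector-search ranking is reciprocal rank fusion. This is a generic sketch, not SmartDoc's actual implementation (the repo uses an ensemble retriever, which may weight scores differently):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into a single ranking.
    Each doc scores 1/(k + rank + 1) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]    # keyword-based ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]  # embedding-based ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents appearing high in both lists (here `doc_b`) rise to the top of the fused ranking.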
48
 
49
  ---
50
 
 
71
 
72
  1. Clone the repository:
73
  ```bash
74
+ git clone https://github.com/TilanTAB/Intelligent-Document-Analysis-SmartDoc-AI.git
75
+ cd Intelligent-Document-Analysis-SmartDoc-AI
76
  ```
77
 
78
  2. Activate the virtual environment:
 
87
 
88
  3. Install dependencies (if needed):
89
  ```bash
90
+ pip install -r requirements.txt
91
  ```
92
 
93
  4. Configure environment variables:
 
158
 
159
  ---
160
 
161
+ SmartDoc AI is actively maintained and designed for real-world document analysis and Q&A. For updates and support, visit the [GitHub repository](https://github.com/TilanTAB/Intelligent-Document-Analysis-SmartDoc-AI).
configuration/parameters.py CHANGED
@@ -5,6 +5,16 @@ import os
5
  from .definitions import MAX_FILE_SIZE, MAX_TOTAL_SIZE, ALLOWED_TYPES
6
 
7
 
8
  class Settings(BaseSettings):
9
  """
10
  Application parameters loaded from environment variables.
@@ -35,7 +45,7 @@ class Settings(BaseSettings):
35
  )
36
 
37
  # Database parameters
38
- CHROMA_DB_PATH: str = "./chroma_db"
39
 
40
  # Chunking parameters
41
  CHUNK_SIZE: int = 2000
@@ -51,18 +61,19 @@ class Settings(BaseSettings):
51
  CHROMA_COLLECTION_NAME: str = "documents"
52
 
53
  # Workflow parameters
54
- MAX_RESEARCH_ATTEMPTS: int = 2
55
  ENABLE_QUERY_REWRITING: bool = True
56
  MAX_QUERY_REWRITES: int = 1
57
  RELEVANCE_CHECK_K: int = 20
58
 
59
  # Research agent parameters
60
  RESEARCH_TOP_K: int = 15
61
- RESEARCH_MAX_CONTEXT_CHARS: int = 8000000000
62
  RESEARCH_MAX_OUTPUT_TOKENS: int = 500
 
63
 
64
  # Verification parameters
65
- VERIFICATION_MAX_CONTEXT_CHARS: int = 800000000
66
  VERIFICATION_MAX_OUTPUT_TOKENS: int = 300
67
 
68
  # Logging parameters
@@ -86,12 +97,12 @@ class Settings(BaseSettings):
86
  ENABLE_CHART_EXTRACTION: bool = True
87
  CHART_VISION_MODEL: str = "gemini-2.5-flash-lite"
88
  CHART_MAX_TOKENS: int = 1500
89
- CHART_DPI: int = 150 # Lower DPI saves memory
90
- CHART_BATCH_SIZE: int = 3 # Process pages in batches
91
- CHART_MAX_IMAGE_SIZE: int = 1920 # Max dimension for images
92
 
93
  # Local chart detection parameters (cost optimization)
94
- CHART_USE_LOCAL_DETECTION: bool = True # Use OpenCV first (FREE)
95
  CHART_MIN_CONFIDENCE: float = 0.4 # Only analyze charts with confidence > 40%
96
  CHART_SKIP_GEMINI_DETECTION: bool = True # Skip Gemini for detection, only use for analysis
97
  CHART_GEMINI_FALLBACK_ENABLED: bool = False # Optional: Use Gemini if local fails
 
5
  from .definitions import MAX_FILE_SIZE, MAX_TOTAL_SIZE, ALLOWED_TYPES
6
 
7
 
8
+ def _default_chroma_path() -> str:
9
+ if os.environ.get("SPACE_ID"): # Hugging Face Spaces
10
+ return os.environ.get("CHROMA_DB_PATH", "/tmp/chroma_db")
11
+ return os.environ.get("CHROMA_DB_PATH", "./chroma_db")
12
+
13
+
14
+ def _is_hf() -> bool:
15
+ return os.environ.get("SPACE_ID") is not None
16
+
17
+
18
  class Settings(BaseSettings):
19
  """
20
  Application parameters loaded from environment variables.
 
45
  )
46
 
47
  # Database parameters
48
+ CHROMA_DB_PATH: str = Field(default_factory=_default_chroma_path)
49
 
50
  # Chunking parameters
51
  CHUNK_SIZE: int = 2000
 
61
  CHROMA_COLLECTION_NAME: str = "documents"
62
 
63
  # Workflow parameters
64
+ MAX_RESEARCH_ATTEMPTS: int = 5
65
  ENABLE_QUERY_REWRITING: bool = True
66
  MAX_QUERY_REWRITES: int = 1
67
  RELEVANCE_CHECK_K: int = 20
68
 
69
  # Research agent parameters
70
  RESEARCH_TOP_K: int = 15
71
+ RESEARCH_MAX_CONTEXT_CHARS: int = Field(default_factory=lambda: 800_000 if _is_hf() else 8_000_000_000)  # effectively uncapped off Spaces
72
  RESEARCH_MAX_OUTPUT_TOKENS: int = 500
73
+ NUM_RESEARCH_CANDIDATES: int = 2 # Number of research questions to generate
74
 
75
  # Verification parameters
76
+ VERIFICATION_MAX_CONTEXT_CHARS: int = Field(default_factory=lambda: 300_000 if _is_hf() else 800_000_000)
77
  VERIFICATION_MAX_OUTPUT_TOKENS: int = 300
78
 
79
  # Logging parameters
 
97
  ENABLE_CHART_EXTRACTION: bool = True
98
  CHART_VISION_MODEL: str = "gemini-2.5-flash-lite"
99
  CHART_MAX_TOKENS: int = 1500
100
+ CHART_DPI: int = 110  # Lower DPI saves memory
101
+ CHART_BATCH_SIZE: int = 1  # Process pages one at a time to cap memory
102
+ CHART_MAX_IMAGE_SIZE: int = 1200  # Max dimension for images
103
 
104
  # Local chart detection parameters (cost optimization)
105
+ CHART_USE_LOCAL_DETECTION: bool = True  # Use OpenCV first (free, no API cost)
106
  CHART_MIN_CONFIDENCE: float = 0.4 # Only analyze charts with confidence > 40%
107
  CHART_SKIP_GEMINI_DETECTION: bool = True # Skip Gemini for detection, only use for analysis
108
  CHART_GEMINI_FALLBACK_ENABLED: bool = False # Optional: Use Gemini if local fails
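The pattern above (detecting Hugging Face Spaces via `SPACE_ID` and tightening defaults there) can be exercised standalone. This minimal sketch mirrors the diff's helpers without the pydantic wiring:

```python
import os

def _is_hf() -> bool:
    # Hugging Face Spaces sets SPACE_ID in the container environment
    return os.environ.get("SPACE_ID") is not None

def default_chroma_path() -> str:
    # On Spaces, only /tmp is reliably writable; locally, a repo-relative dir
    if _is_hf():
        return os.environ.get("CHROMA_DB_PATH", "/tmp/chroma_db")
    return os.environ.get("CHROMA_DB_PATH", "./chroma_db")
```

An explicit `CHROMA_DB_PATH` environment variable wins in both branches, so deployments can still override the default.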
content_analyzer/document_parser.py CHANGED
@@ -33,7 +33,6 @@ def detect_chart_on_page(args):
33
  # Downscale image before detection to save memory
34
  image = preprocess_image(image, max_dim=1000)
35
  detection_result = LocalChartDetector.detect_charts(image)
36
- # Do NOT delete image here; it will be saved in the main process
37
  return (page_num, image, detection_result)
38
 
39
  def analyze_batch(batch_tuple):
@@ -276,7 +275,7 @@ class DocumentProcessor:
276
  except Exception as e:
277
  logger.error(f"Failed to save cache to {cache_path.name}: {e}", exc_info=True)
278
 
279
- def _process_file(self, file, progress_callback=None) -> List[Document]:
280
  file_ext = Path(file.name).suffix.lower()
281
  if file_ext not in ALLOWED_TYPES:
282
  logger.warning(f"Skipping unsupported file type: {file.name}")
@@ -341,26 +340,27 @@ class DocumentProcessor:
341
  return []
342
  all_chunks = []
343
  total_docs = len(documents)
344
- file_hash = self._generate_hash(file.name.encode()) # Unique per file
345
  for i, doc in enumerate(documents):
346
  page_chunks = self.splitter.split_text(doc.page_content)
347
  total_chunks = len(page_chunks)
348
  for j, chunk in enumerate(page_chunks):
349
- chunk_id = f"{file_hash}_{doc.metadata.get('page', i + 1)}_{j}"
350
  chunk_doc = Document(
351
  page_content=chunk,
352
  metadata={
353
- "source": doc.metadata.get("source", file.name),
354
  "page": doc.metadata.get("page", i + 1),
355
  "type": doc.metadata.get("type", "text"),
356
  "chunk_id": chunk_id
357
  }
358
  )
359
  all_chunks.append(chunk_doc)
360
- if progress_callback:
361
- percent = int(100 * ((i + (j + 1) / total_chunks) / total_docs))
362
- step = f"Splitting page {i+1} into chunks"
363
- progress_callback(percent, step)
364
  logger.info(f"Processed {file.name}: {len(documents)} page(s) → {len(all_chunks)} chunk(s)")
365
  return all_chunks
366
  except ImportError as e:
@@ -376,7 +376,9 @@ class DocumentProcessor:
376
  PHASE 1: Parallel local chart detection (CPU-bound, uses ProcessPoolExecutor)
377
  PHASE 2: Parallel Gemini batch analysis (I/O-bound, uses ThreadPoolExecutor)
378
  """
379
- file_hash = self._generate_hash(file_path.encode())
380
  def deduplicate_charts_by_title(chart_chunks):
381
  seen_titles = set()
382
  unique_chunks = []
@@ -632,7 +634,9 @@ class DocumentProcessor:
632
  import pdfplumber
633
 
634
  logger.info(f"[PDFPLUMBER] Processing: {file_path}")
635
- file_hash = self._generate_hash(file_path.encode())
636
 
637
  # Strategy 1: Line-based (default) - for tables with visible borders
638
  default_parameters = {}
@@ -733,11 +737,11 @@ class DocumentProcessor:
733
 
734
  if len(page_content) > 1:
735
  combined = "\n\n".join(page_content)
736
- chunk_id = f"{file_hash}_{page_num}_0"
737
  doc = Document(
738
  page_content=combined,
739
  metadata={
740
- "source": file_path,
741
  "page": page_num,
742
  "loader": "pdfplumber",
743
  "tables_count": total_tables,
@@ -789,43 +793,7 @@ class DocumentProcessor:
789
  for row in cleaned_table[1:]:
790
  md_lines.append("| " + " | ".join(row) + " |")
791
 
792
- return "\n".join(md_lines)
793
-
794
- def process(self, files: List, progress_callback=None) -> List[Document]:
795
- """
796
- Process multiple files with caching and deduplication.
797
- """
798
- self.validate_files(files)
799
- all_chunks = []
800
- seen_hashes = set()
801
- logger.info(f"Processing {len(files)} file(s)...")
802
- for file in files:
803
- try:
804
- with open(file.name, 'rb') as f:
805
- file_content = f.read()
806
- file_hash = self._generate_hash(file_content)
807
- cache_path = self.cache_dir / f"{file_hash}.pkl"
808
- if self._is_cache_valid(cache_path):
809
- chunks = self._load_from_cache(cache_path)
810
- if chunks:
811
- logger.info(f"Using cached chunks for {file.name}")
812
- else:
813
- chunks = self._process_file(file, progress_callback=progress_callback)
814
- self._save_to_cache(chunks, cache_path)
815
- else:
816
- logger.info(f"Processing and caching: {file.name}")
817
- chunks = self._process_file(file, progress_callback=progress_callback)
818
- self._save_to_cache(chunks, cache_path)
819
- for chunk in chunks:
820
- chunk_hash = self._generate_hash(chunk.page_content.encode())
821
- if chunk_hash not in seen_hashes:
822
- seen_hashes.add(chunk_hash)
823
- all_chunks.append(chunk)
824
- except Exception as e:
825
- logger.error(f"Failed to process {file.name}: {e}", exc_info=True)
826
- continue
827
- logger.info(f"Processing complete: {len(all_chunks)} unique chunks from {len(files)} file(s)")
828
- return all_chunks
829
 
830
  def run_pdfplumber(file_name):
831
  from content_analyzer.document_parser import DocumentProcessor
 
33
  # Downscale image before detection to save memory
34
  image = preprocess_image(image, max_dim=1000)
35
  detection_result = LocalChartDetector.detect_charts(image)
 
36
  return (page_num, image, detection_result)
37
 
38
  def analyze_batch(batch_tuple):
 
275
  except Exception as e:
276
  logger.error(f"Failed to save cache to {cache_path.name}: {e}", exc_info=True)
277
 
278
+ def _process_file(self, file) -> List[Document]:
279
  file_ext = Path(file.name).suffix.lower()
280
  if file_ext not in ALLOWED_TYPES:
281
  logger.warning(f"Skipping unsupported file type: {file.name}")
 
340
  return []
341
  all_chunks = []
342
  total_docs = len(documents)
343
+ # --- STABLE FILE HASHING ---
344
+ with open(file.name, 'rb') as f:
345
+ file_bytes = f.read()
346
+ file_hash = self._generate_hash(file_bytes) # Stable hash by file content
347
+ stable_source = f"{Path(file.name).name}::{file_hash}"
348
  for i, doc in enumerate(documents):
349
  page_chunks = self.splitter.split_text(doc.page_content)
350
  total_chunks = len(page_chunks)
351
  for j, chunk in enumerate(page_chunks):
352
+ chunk_id = f"txt_{file_hash}_{doc.metadata.get('page', i + 1)}_{j}"
353
  chunk_doc = Document(
354
  page_content=chunk,
355
  metadata={
356
+ "source": stable_source,
357
  "page": doc.metadata.get("page", i + 1),
358
  "type": doc.metadata.get("type", "text"),
359
  "chunk_id": chunk_id
360
  }
361
  )
362
  all_chunks.append(chunk_doc)
363
+
 
 
 
364
  logger.info(f"Processed {file.name}: {len(documents)} page(s) → {len(all_chunks)} chunk(s)")
365
  return all_chunks
366
  except ImportError as e:
 
376
  PHASE 1: Parallel local chart detection (CPU-bound, uses ProcessPoolExecutor)
377
  PHASE 2: Parallel Gemini batch analysis (I/O-bound, uses ThreadPoolExecutor)
378
  """
379
+ file_bytes = Path(file_path).read_bytes()
380
+ file_hash = self._generate_hash(file_bytes)
381
+ stable_source = f"{Path(file_path).name}::{file_hash}"
382
  def deduplicate_charts_by_title(chart_chunks):
383
  seen_titles = set()
384
  unique_chunks = []
 
634
  import pdfplumber
635
 
636
  logger.info(f"[PDFPLUMBER] Processing: {file_path}")
637
+ file_bytes = Path(file_path).read_bytes()
638
+ file_hash = self._generate_hash(file_bytes)
639
+ stable_source = f"{Path(file_path).name}::{file_hash}"
640
 
641
  # Strategy 1: Line-based (default) - for tables with visible borders
642
  default_parameters = {}
 
737
 
738
  if len(page_content) > 1:
739
  combined = "\n\n".join(page_content)
740
+ chunk_id = f"txt_{file_hash}_{page_num}_0"
741
  doc = Document(
742
  page_content=combined,
743
  metadata={
744
+ "source": stable_source,
745
  "page": page_num,
746
  "loader": "pdfplumber",
747
  "tables_count": total_tables,
 
793
  for row in cleaned_table[1:]:
794
  md_lines.append("| " + " | ".join(row) + " |")
795
 
796
+ return "\n".join(md_lines)
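The diff does not show `_generate_hash` itself; as a hedged sketch (assuming a truncated SHA-256 content hash, which is only a guess at the implementation), the stability property this change relies on looks like the following. Identical bytes always yield the same IDs, regardless of the temporary upload path:

```python
import hashlib
from pathlib import Path

def stable_file_hash(path: str) -> str:
    """Hash file *content*, not its (temporary) path, so cache keys and
    chunk IDs survive re-uploads under different temp filenames."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:16]

def chunk_id(file_hash: str, page: int, index: int) -> str:
    # Mirrors the diff's "txt_{hash}_{page}_{index}" scheme
    return f"txt_{file_hash}_{page}_{index}"
```

The old `_generate_hash(file.name.encode())` keyed on the upload path, so the same document re-uploaded under a new temp name produced different chunk IDs and defeated caching; hashing content fixes that.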
 
797
 
798
  def run_pdfplumber(file_name):
799
  from content_analyzer.document_parser import DocumentProcessor
content_analyzer/visual_detector.py CHANGED
@@ -52,8 +52,22 @@ class LocalChartDetector:
52
  else:
53
  image_cv = image
54
  height, width = image_cv.shape[:2]
55
  gray = cv2.cvtColor(image_cv, cv2.COLOR_BGR2GRAY)
56
 
57
  # --- Edge Detection ---
58
  edges = cv2.Canny(gray, 50, 150)
59
 
@@ -74,23 +88,24 @@ class LocalChartDetector:
74
  edges,
75
  rho=1,
76
  theta=np.pi/180,
77
- threshold=100,
78
  minLineLength=100,
79
  maxLineGap=10
80
  )
81
  line_count = len(lines) if lines is not None else 0
82
- diagonal_lines = 0
83
- line_angles = []
84
  if lines is not None:
85
  for line in lines:
86
  x1, y1, x2, y2 = line[0]
87
  angle = np.abs(np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi)
88
  if 10 < angle < 80 or 100 < angle < 170:
89
- diagonal_lines += 1
90
- line_angles.append(angle)
 
 
91
 
92
  # --- Circle Detection (Optimized) ---
93
- run_circles = diagonal_lines >= 1 or line_count >= 6 or overall_edge_density > 0.08
94
  circle_count = 0
95
  circles = None
96
  if run_circles:
@@ -110,12 +125,13 @@ class LocalChartDetector:
110
  circle_count = circles.shape[2]
111
 
112
  # --- Color Diversity Analysis ---
113
- hsv = cv2.cvtColor(image_cv, cv2.COLOR_BGR2HSV)
 
114
  hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])
115
- color_peaks = np.sum(hist > np.mean(hist) * 2)
116
 
117
  # --- Contour Detection ---
118
- contours, _ = cv2.findContours(edges, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
119
  significant_contours = 0
120
  rectangle_contours = 0
121
  similar_rectangles = []
@@ -148,16 +164,16 @@ class LocalChartDetector:
148
  if (width_std < avg_width * 0.3 or height_std < avg_height * 0.3):
149
  bar_pattern = True
150
 
151
- # --- Line Classification ---
152
  horizontal_lines = 0
153
  vertical_lines = 0
154
- diagonal_lines = 0
155
  line_angles = []
156
  very_short_lines = 0
157
  if lines is not None:
158
  for line in lines:
159
  x1, y1, x2, y2 = line[0]
160
- length = np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
161
  if length < 50:
162
  very_short_lines += 1
163
  continue
@@ -170,11 +186,11 @@ class LocalChartDetector:
170
  elif 80 < angle < 100:
171
  vertical_lines += 1
172
  else:
173
- diagonal_lines += 1
174
- angle_variance = np.var(line_angles) if len(line_angles) > 2 else 0
175
 
176
  # --- Debug Logging ---
177
- logger.debug(f"Chart detection features: lines={line_count}, diagonal_lines={diagonal_lines}, circles={circle_count}, horizontal_lines={horizontal_lines}, vertical_lines={vertical_lines}, color_peaks={color_peaks}, angle_variance={angle_variance}")
178
 
179
  # --- Chart Heuristics and Classification ---
180
  chart_types = []
@@ -183,16 +199,16 @@ class LocalChartDetector:
183
  rejection_reason = ""
184
 
185
  # Negative checks (text slides, decorative backgrounds, tables)
186
- if has_text_region and circle_count < 2 and diagonal_lines < 2 and not bar_pattern:
187
  if small_scattered_contours > 100 or very_short_lines > 50:
188
  rejection_reason = f"Text slide with decorative background (overall density: {overall_edge_density:.2%})"
189
  logger.debug(f"Rejected: {rejection_reason}")
190
  return _chart_result(False, 0.0, [], rejection_reason, line_count, circle_count, overall_edge_density)
191
- if very_short_lines > 50 and circle_count < 2 and diagonal_lines < 3 and line_count < 10:
192
  rejection_reason = f"Decorative network background ({very_short_lines} tiny lines, no data elements)"
193
  logger.debug(f"Rejected: {rejection_reason}")
194
  return _chart_result(False, 0.0, [], rejection_reason, line_count, circle_count, overall_edge_density)
195
- if horizontal_lines > 12 and vertical_lines > 12 and circle_count == 0 and diagonal_lines < 2:
196
  grid_lines = horizontal_lines + vertical_lines
197
  total_lines = line_count
198
  grid_ratio = grid_lines / max(total_lines, 1)
@@ -204,15 +220,15 @@ class LocalChartDetector:
204
  # Positive chart heuristics (bubble, scatter, line, pie, bar, complex)
205
  # RELAXED: Detect as line chart if 2+ diagonal lines and angle variance > 40, or 1+ diagonal line and 1+ axis
206
  if (
207
- (diagonal_lines >= 2 and angle_variance > 40) or
208
- (diagonal_lines >= 1 and (horizontal_lines >= 1 or vertical_lines >= 1))
209
  ):
210
  chart_types.append("line_chart")
211
- confidence = max(confidence, min(0.88, 0.6 + (diagonal_lines / 40)))
212
  if (horizontal_lines >= 1 or vertical_lines >= 1):
213
  confidence = min(0.95, confidence + 0.08)
214
  if not description:
215
- description = f"Line chart: {diagonal_lines} diagonal lines, axes: {horizontal_lines+vertical_lines}, variance: {angle_variance:.0f}"
216
  if circle_count >= 5:
217
  chart_types.append("bubble_chart")
218
  confidence = min(0.92, 0.70 + (min(circle_count, 20) * 0.01))
@@ -224,7 +240,7 @@ class LocalChartDetector:
224
  confidence = min(0.97, confidence + 0.05)
225
  chart_types.append("zone_diagram")
226
  description += f", {large_contours} colored regions"
227
- elif circle_count >= 3 and diagonal_lines > 2:
228
  chart_types.append("scatter_plot")
229
  confidence = max(confidence, 0.75)
230
  description = f"Scatter plot: {circle_count} data points"
@@ -245,7 +261,7 @@ class LocalChartDetector:
245
  if not description:
246
  description = "Complex visualization with zones and data points"
247
  has_moderate_axes = (1 <= horizontal_lines <= 6 or 1 <= vertical_lines <= 6)
248
- has_real_data = (circle_count >= 3 or diagonal_lines >= 2 or bar_pattern)
249
  if has_moderate_axes and has_real_data and confidence > 0.3:
250
  confidence = min(0.90, confidence + 0.10)
251
  if not description:
@@ -253,8 +269,8 @@ class LocalChartDetector:
253
 
254
  # Final chart determination
255
  strong_indicator = (
256
- (diagonal_lines >= 2 and angle_variance > 40) or
257
- (diagonal_lines >= 1 and (horizontal_lines >= 1 or vertical_lines >= 1)) or
258
  circle_count >= 5 or
259
  (circle_count >= 3 and large_contours >= 2) or
260
  bar_pattern or
@@ -267,7 +283,7 @@ class LocalChartDetector:
267
  )
268
  total_time = time.time() - start_time
269
  if has_chart:
270
- logger.info(f"OpenCV detection: {total_time*1000:.0f}ms (lines:{line_count}, diagonal_lines:{diagonal_lines}, circles:{circle_count}, axes:{horizontal_lines+vertical_lines}, angle_variance:{angle_variance})")
271
  else:
272
  logger.debug(f"OpenCV detection: {total_time*1000:.0f}ms (rejected)")
273
  return {
@@ -277,7 +293,8 @@ class LocalChartDetector:
277
  'description': description or "Potential chart detected",
278
  'features': {
279
  'lines': line_count,
280
- 'diagonal_lines': diagonal_lines,
 
281
  'circles': circle_count,
282
  'contours': significant_contours,
283
  'rectangles': rectangle_contours,
 
52
  else:
53
  image_cv = image
54
  height, width = image_cv.shape[:2]
55
+
56
+ # Always downscale for detection (even if caller forgot)
57
+ MAX_DETECT_DIM = 900
58
+ if max(height, width) > MAX_DETECT_DIM:
59
+ scale = MAX_DETECT_DIM / max(height, width)
60
+ image_cv = cv2.resize(image_cv, (int(width * scale), int(height * scale)), interpolation=cv2.INTER_AREA)
61
+ height, width = image_cv.shape[:2]
62
+
63
  gray = cv2.cvtColor(image_cv, cv2.COLOR_BGR2GRAY)
64
 
65
+ # Optional: reduce OpenCV internal thread usage (helps in HF containers)
66
+ try:
67
+ cv2.setNumThreads(1)
68
+ except Exception:
69
+ pass
70
+
71
  # --- Edge Detection ---
72
  edges = cv2.Canny(gray, 50, 150)
73
 
 
88
  edges,
89
  rho=1,
90
  theta=np.pi/180,
91
+ threshold=120, # slightly higher reduces line explosion
92
  minLineLength=100,
93
  maxLineGap=10
94
  )
95
  line_count = len(lines) if lines is not None else 0
96
+ diag_lines_raw = 0
97
+ raw_angles = []
98
  if lines is not None:
99
  for line in lines:
100
  x1, y1, x2, y2 = line[0]
101
  angle = np.abs(np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi)
102
  if 10 < angle < 80 or 100 < angle < 170:
103
+ diag_lines_raw += 1
104
+ raw_angles.append(angle)
105
+
106
+ run_circles = diag_lines_raw >= 1 or line_count >= 6
107
 
108
  # --- Circle Detection (Optimized) ---
 
109
  circle_count = 0
110
  circles = None
111
  if run_circles:
 
125
  circle_count = circles.shape[2]
126
 
127
  # --- Color Diversity Analysis ---
128
+ small_for_hist = cv2.resize(image_cv, (256, 256), interpolation=cv2.INTER_AREA)
129
+ hsv = cv2.cvtColor(small_for_hist, cv2.COLOR_BGR2HSV)
130
  hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])
131
+ color_peaks = int(np.sum(hist > (np.mean(hist) * 2)))
132
 
133
  # --- Contour Detection ---
134
+ contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
135
  significant_contours = 0
136
  rectangle_contours = 0
137
  similar_rectangles = []
 
164
  if (width_std < avg_width * 0.3 or height_std < avg_height * 0.3):
165
  bar_pattern = True
166
 
167
+ # --- Line Classification (filtered) ---
168
  horizontal_lines = 0
169
  vertical_lines = 0
170
+ diag_lines_filtered = 0
171
  line_angles = []
172
  very_short_lines = 0
173
  if lines is not None:
174
  for line in lines:
175
  x1, y1, x2, y2 = line[0]
176
+ length = np.hypot(x2 - x1, y2 - y1)
177
  if length < 50:
178
  very_short_lines += 1
179
  continue
 
186
  elif 80 < angle < 100:
187
  vertical_lines += 1
188
  else:
189
+ diag_lines_filtered += 1
190
+ angle_variance = float(np.var(line_angles)) if len(line_angles) > 2 else 0.0
191
 
192
  # --- Debug Logging ---
193
+ logger.debug(f"Chart detection features: lines={line_count}, diag_lines_raw={diag_lines_raw}, diag_lines_filtered={diag_lines_filtered}, circles={circle_count}, horizontal_lines={horizontal_lines}, vertical_lines={vertical_lines}, color_peaks={color_peaks}, angle_variance={angle_variance}")
194
 
195
  # --- Chart Heuristics and Classification ---
196
  chart_types = []
 
199
  rejection_reason = ""
200
 
201
  # Negative checks (text slides, decorative backgrounds, tables)
202
+ if has_text_region and circle_count < 2 and diag_lines_filtered < 2 and not bar_pattern:
203
  if small_scattered_contours > 100 or very_short_lines > 50:
204
  rejection_reason = f"Text slide with decorative background (overall density: {overall_edge_density:.2%})"
205
  logger.debug(f"Rejected: {rejection_reason}")
206
  return _chart_result(False, 0.0, [], rejection_reason, line_count, circle_count, overall_edge_density)
207
+ if very_short_lines > 50 and circle_count < 2 and diag_lines_filtered < 3 and line_count < 10:
208
  rejection_reason = f"Decorative network background ({very_short_lines} tiny lines, no data elements)"
209
  logger.debug(f"Rejected: {rejection_reason}")
210
  return _chart_result(False, 0.0, [], rejection_reason, line_count, circle_count, overall_edge_density)
211
+ if horizontal_lines > 12 and vertical_lines > 12 and circle_count == 0 and diag_lines_filtered < 2:
212
  grid_lines = horizontal_lines + vertical_lines
213
  total_lines = line_count
214
  grid_ratio = grid_lines / max(total_lines, 1)
 
220
  # Positive chart heuristics (bubble, scatter, line, pie, bar, complex)
221
  # RELAXED: Detect as line chart if 2+ diagonal lines and angle variance > 40, or 1+ diagonal line and 1+ axis
222
  if (
223
+ (diag_lines_filtered >= 2 and angle_variance > 40) or
224
+ (diag_lines_filtered >= 1 and (horizontal_lines >= 1 or vertical_lines >= 1))
225
  ):
226
  chart_types.append("line_chart")
227
+ confidence = max(confidence, min(0.88, 0.6 + (diag_lines_filtered / 40)))
228
  if (horizontal_lines >= 1 or vertical_lines >= 1):
229
  confidence = min(0.95, confidence + 0.08)
230
  if not description:
231
+ description = f"Line chart: {diag_lines_filtered} diagonal lines, axes: {horizontal_lines+vertical_lines}, variance: {angle_variance:.0f}"
232
  if circle_count >= 5:
233
  chart_types.append("bubble_chart")
234
  confidence = min(0.92, 0.70 + (min(circle_count, 20) * 0.01))
 
240
  confidence = min(0.97, confidence + 0.05)
241
  chart_types.append("zone_diagram")
242
  description += f", {large_contours} colored regions"
243
+ elif circle_count >= 3 and diag_lines_filtered > 2:
244
  chart_types.append("scatter_plot")
245
  confidence = max(confidence, 0.75)
246
  description = f"Scatter plot: {circle_count} data points"
 
261
  if not description:
262
  description = "Complex visualization with zones and data points"
263
  has_moderate_axes = (1 <= horizontal_lines <= 6 or 1 <= vertical_lines <= 6)
264
+ has_real_data = (circle_count >= 3 or diag_lines_filtered >= 2 or bar_pattern)
265
  if has_moderate_axes and has_real_data and confidence > 0.3:
266
  confidence = min(0.90, confidence + 0.10)
267
  if not description:
 
269
 
270
  # Final chart determination
271
  strong_indicator = (
272
+ (diag_lines_filtered >= 2 and angle_variance > 40) or
273
+ (diag_lines_filtered >= 1 and (horizontal_lines >= 1 or vertical_lines >= 1)) or
274
  circle_count >= 5 or
275
  (circle_count >= 3 and large_contours >= 2) or
276
  bar_pattern or
 
283
  )
284
  total_time = time.time() - start_time
285
  if has_chart:
286
+ logger.info(f"OpenCV detection: {total_time*1000:.0f}ms (lines:{line_count}, diag_lines_filtered:{diag_lines_filtered}, circles:{circle_count}, axes:{horizontal_lines+vertical_lines}, angle_variance:{angle_variance:.0f})")
287
  else:
288
  logger.debug(f"OpenCV detection: {total_time*1000:.0f}ms (rejected)")
289
  return {
 
293
  'description': description or "Potential chart detected",
294
  'features': {
295
  'lines': line_count,
296
+ 'diagonal_lines_raw': diag_lines_raw,
297
+ 'diagonal_lines_filtered': diag_lines_filtered,
298
  'circles': circle_count,
299
  'contours': significant_contours,
300
  'rectangles': rectangle_contours,
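The angle-bucketing heuristic in the detector above can be isolated as a small pure function. This sketch uses the same thresholds as the diff (segments assumed to be `(x1, y1, x2, y2)` tuples, as returned by `HoughLinesP`):

```python
import math

def classify_lines(segments, min_length=50):
    """Bucket line segments into horizontal/vertical/diagonal, skipping
    very short (likely decorative) segments, as the detector does."""
    counts = {"horizontal": 0, "vertical": 0, "diagonal": 0, "short": 0}
    for x1, y1, x2, y2 in segments:
        if math.hypot(x2 - x1, y2 - y1) < min_length:
            counts["short"] += 1
            continue
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))
        if angle <= 10 or angle >= 170:
            counts["horizontal"] += 1   # near-flat: likely an axis/gridline
        elif 80 < angle < 100:
            counts["vertical"] += 1     # near-upright: likely an axis/bar edge
        else:
            counts["diagonal"] += 1     # sloped: likely a data trend line
    return counts
```

Separating the filtered counts this way is what lets the heuristics distinguish line charts (diagonals plus axes) from tables and decorative backgrounds (many short or grid-aligned segments).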
core/diagnostics.py DELETED
@@ -1,125 +0,0 @@
1
- """
2
- Health check utilities for DocChat.
3
-
4
- This module provides diagnostics check functions that can be used
5
- to verify the application is running correctly.
6
- """
7
- import logging
8
- from typing import Dict, Any
9
- from datetime import datetime
10
-
11
- logger = logging.getLogger(__name__)
12
-
13
-
14
- def check_diagnostics() -> Dict[str, Any]:
15
- """
16
- Perform a comprehensive diagnostics check of the application.
17
-
18
- Returns:
19
- Dict with diagnostics status and component information
20
- """
21
- diagnostics_status = {
22
- "status": "healthy",
23
- "timestamp": datetime.utcnow().isoformat(),
24
- "components": {}
25
- }
26
-
27
- # Check parameters
28
- try:
29
- from configuration.parameters import parameters
30
- diagnostics_status["components"]["parameters"] = {
31
- "status": "ok",
32
- "chroma_db_path": parameters.CHROMA_DB_PATH,
33
- "log_level": parameters.LOG_LEVEL
34
- }
35
- except Exception as e:
36
- diagnostics_status["components"]["parameters"] = {
37
- "status": "error",
38
- "error": str(e)
39
- }
40
- diagnostics_status["status"] = "unhealthy"
41
-
42
- # Check ChromaDB directory
43
- try:
44
- from pathlib import Path
45
- chroma_path = Path(parameters.CHROMA_DB_PATH)
46
- diagnostics_status["components"]["chroma_db"] = {
47
- "status": "ok",
48
- "path_exists": chroma_path.exists(),
49
- "is_writable": chroma_path.exists() and chroma_path.is_dir()
50
- }
51
- except Exception as e:
52
- diagnostics_status["components"]["chroma_db"] = {
53
- "status": "error",
54
- "error": str(e)
55
- }
56
-
57
- # Check cache directory
58
- try:
59
- cache_path = Path(parameters.CACHE_DIR)
60
- diagnostics_status["components"]["cache"] = {
61
- "status": "ok",
62
- "path_exists": cache_path.exists(),
63
- "is_writable": cache_path.exists() and cache_path.is_dir()
64
- }
65
- except Exception as e:
66
- diagnostics_status["components"]["cache"] = {
67
- "status": "error",
68
- "error": str(e)
69
- }
70
-
71
- # Check if required packages are importable
72
- required_packages = [
73
- "langchain",
74
- "langchain_google_genai",
75
- "chromadb",
76
- "gradio"
77
- ]
78
-
79
- packages_status = {}
80
- for package in required_packages:
81
- try:
82
- __import__(package)
83
- packages_status[package] = "ok"
84
- except ImportError as e:
85
- packages_status[package] = f"missing: {e}"
86
- diagnostics_status["status"] = "degraded"
87
-
88
- diagnostics_status["components"]["packages"] = packages_status
89
-
90
- return diagnostics_status
91
-
92
-
93
- def check_api_key() -> Dict[str, Any]:
94
- """
95
- Check if the Google API key is configured and valid format.
96
-
97
- Returns:
98
- Dict with API key status (does not expose the key)
99
- """
100
- try:
101
- from configuration.parameters import parameters
102
- api_key = parameters.GOOGLE_API_KEY
103
-
104
- if not api_key:
105
- return {"status": "missing", "message": "GOOGLE_API_KEY not set"}
106
-
107
- if len(api_key) < 20:
108
- return {"status": "invalid", "message": "API key appears too short"}
109
-
110
- # Mask the key for logging (show first 4 and last 4 chars)
111
- masked = f"{api_key[:4]}...{api_key[-4:]}"
112
-
113
- return {
114
- "status": "configured",
115
- "masked_key": masked,
116
- "length": len(api_key)
117
- }
118
- except Exception as e:
119
- return {"status": "error", "message": str(e)}
120
-
121
-
122
- if __name__ == "__main__":
123
- # Run diagnostics check when executed directly
124
- import json
125
- print(json.dumps(check_diagnostics(), indent=2))
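
The deleted module repeats the same try/except shape for every component. If a diagnostics check ever returns, that repetition could be factored into one helper; a hedged sketch (helper names and status strings are illustrative, not part of the codebase):

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict

def run_check(fn: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run one component check, converting exceptions into an error status."""
    try:
        return {"status": "ok", **fn()}
    except Exception as e:
        return {"status": "error", "error": str(e)}

def diagnostics(checks: Dict[str, Callable[[], Dict[str, Any]]]) -> Dict[str, Any]:
    """Aggregate per-component checks into one overall status report."""
    components = {name: run_check(fn) for name, fn in checks.items()}
    overall = "healthy" if all(c["status"] == "ok" for c in components.values()) else "unhealthy"
    return {
        "status": overall,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "components": components,
    }
```

Usage would look like `diagnostics({"cache": check_cache, "packages": check_packages})`, with each callable returning extra detail fields.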
 
core/lifecycle.py DELETED
@@ -1,160 +0,0 @@
1
- """
2
- Signal handling and graceful shutdown utilities.
3
-
4
- This module provides graceful shutdown handling for the DocChat application,
5
- ensuring resources are properly cleaned up when the application is terminated.
6
- """
7
- import signal
8
- import sys
9
- import logging
10
- import atexit
11
- from typing import Callable, List, Optional
12
- from pathlib import Path
13
-
14
- logger = logging.getLogger(__name__)
15
-
16
-
17
- class ShutdownHandler:
18
- """
19
- Manages graceful shutdown of the application.
20
-
21
- Registers cleanup callbacks that are executed when the application
22
- receives a termination signal (SIGINT, SIGTERM) or exits normally.
23
- """
24
-
25
- _instance: Optional['ShutdownHandler'] = None
26
-
27
- def __new__(cls) -> 'ShutdownHandler':
28
- """Singleton pattern to ensure only one handler exists."""
29
- if cls._instance is None:
30
- cls._instance = super().__new__(cls)
31
- cls._instance._initialized = False
32
- return cls._instance
33
-
34
- def __init__(self) -> None:
35
- """Initialize the lifecycle handler."""
36
- if self._initialized:
37
- return
38
-
39
- self._cleanup_callbacks: List[Callable] = []
40
- self._lifecycle_in_progress: bool = False
41
- self._initialized = True
42
-
43
- # Register signal handlers
44
- signal.signal(signal.SIGINT, self._signal_handler)
45
- signal.signal(signal.SIGTERM, self._signal_handler)
46
-
47
- # Register atexit handler for normal exits
48
- atexit.register(self._atexit_handler)
49
-
50
- logger.info("[SHUTDOWN] ShutdownHandler initialized")
51
-
52
- def register_cleanup(self, callback: Callable, name: str = "") -> None:
53
- """
54
- Register a cleanup callback to be called on shutdown.
55
-
56
- Args:
57
- callback: Function to call during shutdown
58
- name: Optional name for logging purposes
59
- """
60
- self._cleanup_callbacks.append((callback, name))
61
- logger.debug(f"[SHUTDOWN] Registered cleanup callback: {name or callback.__name__}")
62
-
63
- def _signal_handler(self, signum: int, frame) -> None:
64
- """
65
- Handle termination signals.
66
-
67
- Args:
68
- signum: Signal number
69
- frame: Current stack frame
70
- """
71
- signal_name = signal.Signals(signum).name
72
- logger.info(f"[SHUTDOWN] Received {signal_name}, initiating graceful shutdown...")
73
-
74
- self._execute_cleanup()
75
- sys.exit(0)
76
-
77
- def _atexit_handler(self) -> None:
78
- """Handle normal application exit."""
79
- if not self._lifecycle_in_progress:
80
- logger.info("[SHUTDOWN] Application exiting normally, running cleanup...")
81
- self._execute_cleanup()
82
-
83
- def _execute_cleanup(self) -> None:
84
- """Execute all registered cleanup callbacks."""
85
- if self._lifecycle_in_progress:
86
- return
87
-
88
- self._lifecycle_in_progress = True
89
- logger.info(f"[SHUTDOWN] Executing {len(self._cleanup_callbacks)} cleanup callbacks...")
90
-
91
- for callback, name in reversed(self._cleanup_callbacks):
92
- try:
93
- callback_name = name or callback.__name__
94
- logger.debug(f"[SHUTDOWN] Running cleanup: {callback_name}")
95
- callback()
96
- logger.debug(f"[SHUTDOWN] ✓ Cleanup completed: {callback_name}")
97
- except Exception as e:
98
- logger.error(f"[SHUTDOWN] ✗ Cleanup failed: {e}", exc_info=True)
99
-
100
- logger.info("[SHUTDOWN] ✓ All cleanup callbacks executed")
101
-
102
-
103
- def cleanup_chroma_db() -> None:
104
- """Clean up ChromaDB connections."""
105
- try:
106
- # ChromaDB cleanup if needed
107
- logger.info("[CLEANUP] Cleaning up ChromaDB...")
108
- # ChromaDB uses SQLite which handles cleanup automatically
109
- logger.info("[CLEANUP] ✓ ChromaDB cleanup complete")
110
- except Exception as e:
111
- logger.error(f"[CLEANUP] ChromaDB cleanup failed: {e}")
112
-
113
-
114
- def cleanup_temp_files() -> None:
115
- """Clean up temporary files created during processing."""
116
- try:
117
- import tempfile
118
- import shutil
119
-
120
- # Clean up any temp directories we created
121
- temp_base = Path(tempfile.gettempdir())
122
-
123
- # Only clean up directories that match our pattern
124
- # Be conservative to avoid deleting user data
125
- logger.info("[CLEANUP] Temporary file cleanup complete")
126
- except Exception as e:
127
- logger.error(f"[CLEANUP] Temp file cleanup failed: {e}")
128
-
129
-
130
- def cleanup_logging() -> None:
131
- """Flush and close all log handlers."""
132
- try:
133
- logger.info("[CLEANUP] Flushing log handlers...")
134
-
135
- # Get root logger and flush all handlers
136
- root_logger = logging.getLogger()
137
- for handler in root_logger.handlers:
138
- handler.flush()
139
-
140
- logger.info("[CLEANUP] ✓ Log handlers flushed")
141
- except Exception as e:
142
- # Can't log this since logging might be broken
143
- print(f"[CLEANUP] Log handler cleanup failed: {e}", file=sys.stderr)
144
-
145
-
146
- def initialize_lifecycle_handler() -> ShutdownHandler:
147
- """
148
- Initialize the lifecycle handler with default cleanup callbacks.
149
-
150
- Returns:
151
- The initialized ShutdownHandler instance
152
- """
153
- handler = ShutdownHandler()
154
-
155
- # Register default cleanup callbacks (order matters - reverse execution)
156
- handler.register_cleanup(cleanup_logging, "Logging cleanup")
157
- handler.register_cleanup(cleanup_temp_files, "Temp files cleanup")
158
- handler.register_cleanup(cleanup_chroma_db, "ChromaDB cleanup")
159
-
160
- return handler
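
The removed handler's core behavior is running cleanups once, in reverse registration order, with per-callback error isolation. A minimal sketch of just that behavior, with the signal/atexit wiring omitted:

```python
from typing import Callable, List, Tuple

class CleanupRunner:
    """Run registered cleanup callbacks once, in reverse registration order."""

    def __init__(self) -> None:
        self._callbacks: List[Tuple[Callable[[], None], str]] = []
        self._done = False

    def register(self, callback: Callable[[], None], name: str = "") -> None:
        self._callbacks.append((callback, name))

    def execute(self) -> None:
        if self._done:  # idempotent: a signal handler and atexit may both fire
            return
        self._done = True
        for callback, name in reversed(self._callbacks):
            try:
                callback()
            except Exception as e:  # one failing cleanup must not block the rest
                print(f"cleanup {name or callback.__name__} failed: {e}")

order: List[str] = []
runner = CleanupRunner()
runner.register(lambda: order.append("logging"), "logging")
runner.register(lambda: order.append("chroma"), "chroma")
runner.execute()
runner.execute()  # second call is a no-op
```

Reverse order matters here for the same reason as in the deleted file: logging is registered first so it is flushed last, after the other cleanups have logged their results.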
 
intelligence/accuracy_verifier.py CHANGED
@@ -150,19 +150,38 @@ Provide your verification analysis."""
150
  feedback_parts.append(f"Additional Details: {verification.additional_details}")
151
  return " | ".join(feedback_parts) if feedback_parts else None
152
 
153
- def should_retry_research(self, verification: VerificationResult) -> bool:
154
  """Determine if research should be retried."""
 
155
  if verification.supported == "NO" or verification.relevant == "NO":
156
  return True
157
-
158
  if verification.confidence == "LOW" and (
159
  verification.unsupported_claims or verification.contradictions
160
  ):
161
  return True
162
-
163
  if verification.supported == "PARTIAL" and verification.contradictions:
164
  return True
165
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
166
  return False
167
 
168
  def check(self, answer: str, documents: List[Document], question: Optional[str] = None) -> Dict:
@@ -219,7 +238,7 @@ Provide your verification analysis."""
219
  "verification_report": verification_report,
220
  "context_used": context,
221
  "structured_result": verification_result.model_dump(),
222
- "should_retry": self.should_retry_research(verification_result),
223
  "feedback": feedback
224
  }
225
 
@@ -333,18 +352,17 @@ Select the best answer by providing its index (0-based) and explain your reasoni
333
  for line in response_text.split('\n'):
334
  if ':' not in line:
335
  continue
336
-
337
  key, value = line.split(':', 1)
338
  key = key.strip().lower().replace(' ', '_')
339
  value = value.strip().upper()
340
 
341
- if key == "SUPPORTED":
342
  data["supported"] = "YES" if "YES" in value else ("PARTIAL" if "PARTIAL" in value else "NO")
343
- elif key == "CONFIDENCE":
344
  data["confidence"] = "HIGH" if "HIGH" in value else ("MEDIUM" if "MEDIUM" in value else "LOW")
345
- elif key == "RELEVANT":
346
  data["relevant"] = "YES" if "YES" in value else "NO"
347
- elif key == "COMPLETENESS":
348
  if "COMPLETE" in value and "INCOMPLETE" not in value:
349
  data["completeness"] = "COMPLETE"
350
  elif "PARTIAL" in value:
 
150
  feedback_parts.append(f"Additional Details: {verification.additional_details}")
151
  return " | ".join(feedback_parts) if feedback_parts else None
152
 
153
+ def should_retry_research(self, verification: VerificationResult, verification_report: Optional[str] = None, feedback: Optional[str] = None) -> bool:
154
  """Determine if research should be retried."""
155
+ # Use structured fields first
156
  if verification.supported == "NO" or verification.relevant == "NO":
157
  return True
 
158
  if verification.confidence == "LOW" and (
159
  verification.unsupported_claims or verification.contradictions
160
  ):
161
  return True
 
162
  if verification.supported == "PARTIAL" and verification.contradictions:
163
  return True
164
+ # Also check verification_report string for extra signals (legacy/fallback)
165
+ if verification_report:
166
+ if "Supported: NO" in verification_report:
167
+ logger.warning("[Re-Research] Answer not supported; triggering re-research.")
168
+ return True
169
+ elif "Relevant: NO" in verification_report:
170
+ logger.warning("[Re-Research] Answer not relevant; triggering re-research.")
171
+ return True
172
+ elif "Confidence: LOW" in verification_report and "Supported: PARTIAL" in verification_report:
173
+ logger.warning("[Re-Research] Low confidence with partial support; triggering re-research.")
174
+ return True
175
+ elif "Completeness: INCOMPLETE" in verification_report:
176
+ logger.warning("[Re-Research] Answer is incomplete; triggering re-research.")
177
+ return True
178
+ elif "Completeness: PARTIAL" in verification_report:
179
+ logger.warning("[Re-Research] Answer is partially complete; triggering re-research.")
180
+ return True
181
+ # Check feedback for contradiction/unsupported
182
+ if feedback and ("contradiction" in feedback.lower() or "unsupported" in feedback.lower()):
183
+ logger.warning("[Re-Research] Feedback indicates contradiction/unsupported; triggering re-research.")
184
+ return True
185
  return False
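
The string-matching fallback added above amounts to a marker scan over the free-text report; a condensed sketch of the same rules (marker strings copied from the diff):

```python
from typing import Optional

RETRY_MARKERS = (
    "Supported: NO",
    "Relevant: NO",
    "Completeness: INCOMPLETE",
    "Completeness: PARTIAL",
)

def report_requests_retry(report: Optional[str], feedback: Optional[str] = None) -> bool:
    """Legacy fallback: scan free-text verification output for retry signals."""
    if report:
        if any(marker in report for marker in RETRY_MARKERS):
            return True
        if "Confidence: LOW" in report and "Supported: PARTIAL" in report:
            return True
    if feedback and ("contradiction" in feedback.lower() or "unsupported" in feedback.lower()):
        return True
    return False
```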
186
 
187
  def check(self, answer: str, documents: List[Document], question: Optional[str] = None) -> Dict:
 
238
  "verification_report": verification_report,
239
  "context_used": context,
240
  "structured_result": verification_result.model_dump(),
241
+ "should_retry": self.should_retry_research(verification_result, verification_report, feedback),
242
  "feedback": feedback
243
  }
244
 
 
352
  for line in response_text.split('\n'):
353
  if ':' not in line:
354
  continue
 
355
  key, value = line.split(':', 1)
356
  key = key.strip().lower().replace(' ', '_')
357
  value = value.strip().upper()
358
 
359
+ if key == "supported":
360
  data["supported"] = "YES" if "YES" in value else ("PARTIAL" if "PARTIAL" in value else "NO")
361
+ elif key == "confidence":
362
  data["confidence"] = "HIGH" if "HIGH" in value else ("MEDIUM" if "MEDIUM" in value else "LOW")
363
+ elif key == "relevant":
364
  data["relevant"] = "YES" if "YES" in value else "NO"
365
+ elif key == "completeness":
366
  if "COMPLETE" in value and "INCOMPLETE" not in value:
367
  data["completeness"] = "COMPLETE"
368
  elif "PARTIAL" in value:
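
The key-comparison fix above matters because `key` is lowercased before the checks, so the old uppercase comparisons (`key == "SUPPORTED"`) could never match and those fields were silently dropped. A self-contained sketch of the corrected parser, limited to two of the fields shown:

```python
def parse_verification_lines(text: str) -> dict:
    """Parse 'Key: VALUE' lines; keys are lowercased, so comparisons must be too."""
    data = {}
    for line in text.split("\n"):
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip().lower().replace(" ", "_")
        value = value.strip().upper()
        if key == "supported":  # lowercase: "SUPPORTED" would never match here
            data["supported"] = "YES" if "YES" in value else ("PARTIAL" if "PARTIAL" in value else "NO")
        elif key == "confidence":
            data["confidence"] = "HIGH" if "HIGH" in value else ("MEDIUM" if "MEDIUM" in value else "LOW")
    return data
```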
intelligence/orchestrator.py CHANGED
@@ -44,15 +44,10 @@ class AgentState(TypedDict):
44
  class AgentWorkflow:
45
  """
46
  Orchestrates multi-agent orchestrator for document Q&A.
47
-
48
- Workflow:
49
- 1. Relevance Check - Determines if documents can answer the question
50
- 2. Research - Generates multiple answer candidates using document context
51
- 3. Verification - Selects the best answer from candidates
52
  """
53
 
54
- MAX_RESEARCH_ATTEMPTS: int = 7
55
- NUM_RESEARCH_CANDIDATES: int = 3
56
 
57
  def __init__(self, num_candidates: int = None) -> None:
58
  """Initialize orchestrator with required agents."""
@@ -173,41 +168,40 @@ Question: {state['question']}
173
  return {"draft_answer": combined, "verification_report": "Multi-question answer combined."}
174
 
175
  def _check_relevance_step(self, state: AgentState) -> Dict[str, Any]:
176
- """Check if retrieved documents are relevant to the question."""
177
  logger.debug("Checking context relevance...")
178
-
179
  result = self.context_validator.context_validate_with_rewrite(
180
- question=state["question"],
181
- retriever=state["retriever"],
182
- k=20,
183
- max_rewrites=1
184
  )
185
-
186
- classification = result["classification"]
187
- query_used = result["query_used"]
188
- was_rewritten = result.get("was_rewritten", False)
189
-
190
- logger.info(f"Relevance: {classification}")
191
- if was_rewritten:
192
- logger.debug(f"Query rewritten: {query_used[:60]}...")
193
 
194
- if classification in ["CAN_ANSWER", "PARTIAL"]:
195
- if was_rewritten:
196
- documents = state["retriever"].invoke(query_used)
197
- return {"is_relevant": True, "query_used": query_used, "documents": documents}
198
- return {"is_relevant": True, "query_used": state["question"]}
199
- else:
 
 
200
  return {
201
- "is_relevant": False,
202
- "query_used": state["question"],
203
- "draft_answer": "This question isn't related to the uploaded documents. Please ask another question."
204
  }
205
 
 
 
 
 
 
 
206
  def _decide_after_relevance_check(self, state: AgentState) -> str:
207
  """Decide next step after relevance check."""
208
  return "relevant" if state["is_relevant"] else "irrelevant"
209
 
210
- def full_pipeline(self, question: str, retriever: BaseRetriever) -> Dict[str, str]:
211
  """
212
  Execute the full Q&A pipeline.
213
 
@@ -221,15 +215,10 @@ Question: {state['question']}
221
  try:
222
  if self.compiled_orchestrator is None:
223
  self.compiled_orchestrator = self.build_orchestrator()
224
-
225
- logger.info(f"Starting pipeline: {question[:80]}...")
226
-
227
- documents = retriever.invoke(question)
228
- logger.info(f"Retrieved {len(documents)} documents")
229
 
230
  initial_state: AgentState = {
231
  "question": question,
232
- "documents": documents,
233
  "draft_answer": "",
234
  "verification_report": "",
235
  "is_relevant": False,
@@ -243,67 +232,24 @@ Question: {state['question']}
243
  "sub_queries": [],
244
  "sub_answers": []
245
  }
246
-
247
  final_state = self.compiled_orchestrator.invoke(initial_state)
248
-
249
  logger.info(f"Pipeline completed (attempts: {final_state.get('research_attempts', 1)})")
250
-
251
  return {
252
  "draft_answer": final_state["draft_answer"],
253
  "verification_report": final_state["verification_report"]
254
  }
255
-
256
  except Exception as e:
257
  logger.error(f"Pipeline failed: {e}", exc_info=True)
258
  raise RuntimeError(f"Workflow execution failed: {e}") from e
259
 
260
- def _research_step(self, state: AgentState) -> Dict[str, Any]:
261
- """Generate multiple answer candidates using the research agent."""
262
- attempts = state.get("research_attempts", 0) + 1
263
- feedback = state.get("feedback")
264
- previous_answer = state.get("draft_answer") if feedback else None
265
- # Consolidate contradictions and unsupported claims into feedback
266
- contradictions = state.get("contradictions_for_research", [])
267
- unsupported_claims = state.get("unsupported_claims_for_research", [])
268
- feedback_for_research = state.get("feedback_for_research", feedback)
269
- extra_feedback = ""
270
- if contradictions:
271
- extra_feedback += " Contradictions: " + "; ".join(contradictions) + "."
272
- if unsupported_claims:
273
- extra_feedback += " Unsupported Claims: " + "; ".join(unsupported_claims) + "."
274
- # If feedback_for_research is present, append extra_feedback; otherwise, use extra_feedback only
275
- if feedback_for_research:
276
- feedback_for_research = feedback_for_research + extra_feedback
277
- else:
278
- feedback_for_research = extra_feedback.strip()
279
- logger.info(f"Research step (attempt {attempts}/{self.MAX_RESEARCH_ATTEMPTS})")
280
- logger.info(f"Generating {self.NUM_RESEARCH_CANDIDATES} candidate answers...")
281
- candidate_answers = []
282
- for i in range(self.NUM_RESEARCH_CANDIDATES):
283
- logger.info(f"Generating candidate {i + 1}/{self.NUM_RESEARCH_CANDIDATES}")
284
- result = self.researcher.generate(
285
- question=state["question"],
286
- documents=state["documents"],
287
- feedback=feedback_for_research,
288
- previous_answer=previous_answer
289
- )
290
- candidate_answers.append(result["draft_answer"])
291
- logger.info(f"Generated {len(candidate_answers)} candidate answers")
292
- return {
293
- "candidate_answers": candidate_answers,
294
- "research_attempts": attempts,
295
- "feedback": None
296
- }
297
-
298
  def _verification_step(self, state: AgentState) -> Dict[str, Any]:
299
  """Select the best answer from candidates and verify it."""
300
  logger.debug("Selecting best answer from candidates...")
301
 
302
- candidate_answers = state.get("candidate_answers", [])
303
-
304
- if not candidate_answers:
305
- logger.warning("No candidate answers found, using draft_answer")
306
- candidate_answers = [state.get("draft_answer", "")]
307
 
308
  # Select the best answer from candidates
309
  selection_result = self.verifier.select_best_answer(
@@ -331,58 +277,45 @@ Question: {state['question']}
331
  f"**Selection Confidence:** {selection_result.get('confidence', 'N/A')}\n" + \
332
  f"**Selection Reasoning:** {selection_reasoning}\n\n" + \
333
  verification_report
334
-
 
 
335
  return {
336
  "draft_answer": best_answer,
337
  "verification_report": verification_report,
338
- "feedback": verification_result.get("feedback"),
339
- "selection_reasoning": selection_reasoning
 
340
  }
341
 
342
  def _decide_next_step(self, state: AgentState) -> str:
343
  """Decide whether to re-research or end orchestrator."""
344
- verification_report = state["verification_report"]
345
  research_attempts = state.get("research_attempts", 1)
346
- feedback = state.get("feedback")
347
- needs_re_research = False
348
- # Extract contradictions and unsupported claims for feedback
349
- contradictions = []
350
- unsupported_claims = []
351
- import re
352
- for line in verification_report.splitlines():
353
- if line.startswith("**Contradictions:"):
354
- contradictions = [c.strip() for c in line.split(":", 1)[-1].split(",") if c.strip() and c.strip().lower() != "none"]
355
- if line.startswith("**Unsupported Claims:"):
356
- unsupported_claims = [u.strip() for u in line.split(":", 1)[-1].split(",") if u.strip() and u.strip().lower() != "none"]
357
- if "Supported: NO" in verification_report:
358
- needs_re_research = True
359
- logger.warning("[Re-Research] Answer not supported; triggering re-research.")
360
- elif "Relevant: NO" in verification_report:
361
- needs_re_research = True
362
- logger.warning("[Re-Research] Answer not relevant; triggering re-research.")
363
- elif "Confidence: LOW" in verification_report and "Supported: PARTIAL" in verification_report:
364
- needs_re_research = True
365
- logger.warning("[Re-Research] Low confidence with partial support; triggering re-research.")
366
- elif "Completeness: INCOMPLETE" in verification_report:
367
- needs_re_research = True
368
- logger.warning("[Re-Research] Answer is incomplete; triggering re-research.")
369
- elif "Completeness: PARTIAL" in verification_report:
370
- needs_re_research = True
371
- logger.warning("[Re-Research] Answer is partially complete; triggering re-research.")
372
- if feedback and not needs_re_research:
373
- if "contradiction" in feedback.lower() or "unsupported" in feedback.lower():
374
- needs_re_research = True
375
- logger.warning("[Re-Research] Feedback indicates contradiction/unsupported; triggering re-research.")
376
- # Store extra feedback for research node
377
- state["contradictions_for_research"] = contradictions
378
- state["unsupported_claims_for_research"] = unsupported_claims
379
- state["feedback_for_research"] = feedback
380
- if needs_re_research and research_attempts < self.MAX_RESEARCH_ATTEMPTS:
381
- logger.info(f"[Re-Research] Re-researching (attempt {research_attempts + 1})")
382
  return "re_research"
383
- elif needs_re_research:
384
- logger.warning("[Re-Research] Max attempts reached, returning best effort.")
385
- return "end"
386
- else:
387
- logger.info("[Re-Research] Verification passed; ending workflow.")
388
- return "end"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
44
  class AgentWorkflow:
45
  """
46
  Orchestrates the multi-agent workflow for document Q&A.
 
 
 
 
 
47
  """
48
 
49
+ MAX_RESEARCH_ATTEMPTS: int = parameters.MAX_RESEARCH_ATTEMPTS
50
+ NUM_RESEARCH_CANDIDATES: int = parameters.NUM_RESEARCH_CANDIDATES
51
 
52
  def __init__(self, num_candidates: int = None) -> None:
53
  """Initialize orchestrator with required agents."""
 
168
  return {"draft_answer": combined, "verification_report": "Multi-question answer combined."}
169
 
170
  def _check_relevance_step(self, state: AgentState) -> Dict[str, Any]:
 
171
  logger.debug("Checking context relevance...")
172
+
173
  result = self.context_validator.context_validate_with_rewrite(
174
+ question=state["question"],
175
+ retriever=state["retriever"],
176
+ k=parameters.RELEVANCE_CHECK_K, # use config instead of hardcoding 20
177
+ max_rewrites=parameters.MAX_QUERY_REWRITES,
178
  )
 
 
 
 
 
 
 
 
179
 
180
+ classification = result.get("classification", "NO_MATCH")
181
+ query_used = result.get("query_used", state["question"])
182
+
183
+ logger.info(f"Relevance: {classification} (query_used={query_used[:80]})")
184
+
185
+ if classification in ("CAN_ANSWER", "PARTIAL"):
186
+ # Always retrieve docs for the query we're actually going to answer
187
+ documents = state["retriever"].invoke(query_used)
188
  return {
189
+ "is_relevant": True,
190
+ "query_used": query_used,
191
+ "documents": documents
192
  }
193
 
194
+ return {
195
+ "is_relevant": False,
196
+ "query_used": query_used,
197
+ "draft_answer": "This question isn't related to the uploaded documents. Please ask another question.",
198
+ }
199
+
200
  def _decide_after_relevance_check(self, state: AgentState) -> str:
201
  """Decide next step after relevance check."""
202
  return "relevant" if state["is_relevant"] else "irrelevant"
203
 
204
+ def run_workflow(self, question: str, retriever: BaseRetriever) -> Dict[str, str]:
205
  """
206
  Execute the full Q&A pipeline.
207
 
 
215
  try:
216
  if self.compiled_orchestrator is None:
217
  self.compiled_orchestrator = self.build_orchestrator()
 
 
 
 
 
218
 
219
  initial_state: AgentState = {
220
  "question": question,
221
+ "documents": [], # Let _check_relevance_step fill this
222
  "draft_answer": "",
223
  "verification_report": "",
224
  "is_relevant": False,
 
232
  "sub_queries": [],
233
  "sub_answers": []
234
  }
235
+
236
  final_state = self.compiled_orchestrator.invoke(initial_state)
237
+
238
  logger.info(f"Pipeline completed (attempts: {final_state.get('research_attempts', 1)})")
239
+
240
  return {
241
  "draft_answer": final_state["draft_answer"],
242
  "verification_report": final_state["verification_report"]
243
  }
 
244
  except Exception as e:
245
  logger.error(f"Pipeline failed: {e}", exc_info=True)
246
  raise RuntimeError(f"Workflow execution failed: {e}") from e
247
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
248
  def _verification_step(self, state: AgentState) -> Dict[str, Any]:
249
  """Select the best answer from candidates and verify it."""
250
  logger.debug("Selecting best answer from candidates...")
251
 
252
+ candidate_answers = state.get("candidate_answers", []) or [state.get("draft_answer", "")]
 
 
 
 
253
 
254
  # Select the best answer from candidates
255
  selection_result = self.verifier.select_best_answer(
 
277
  f"**Selection Confidence:** {selection_result.get('confidence', 'N/A')}\n" + \
278
  f"**Selection Reasoning:** {selection_reasoning}\n\n" + \
279
  verification_report
280
+
281
+ feedback_for_research = verification_result.get("feedback")
282
+
283
  return {
284
  "draft_answer": best_answer,
285
  "verification_report": verification_report,
286
+ "feedback_for_research": feedback_for_research,
287
+ "selection_reasoning": selection_reasoning,
288
+ "should_retry": verification_result.get("should_retry", False),
289
  }
290
 
291
  def _decide_next_step(self, state: AgentState) -> str:
292
  """Decide whether to re-research or end orchestrator."""
 
293
  research_attempts = state.get("research_attempts", 1)
294
+ should_retry = bool(state.get("should_retry", False))
295
+ if should_retry and research_attempts < self.MAX_RESEARCH_ATTEMPTS:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
296
  return "re_research"
297
+ return "end"
298
+
299
+ def _research_step(self, state: AgentState) -> Dict[str, Any]:
300
+ """Generate multiple answer candidates using the research agent."""
301
+ attempts = state.get("research_attempts", 0) + 1
302
+ feedback_for_research = state.get("feedback_for_research")
303
+ previous_answer = state.get("draft_answer") if feedback_for_research else None
304
+ logger.info(f"Research step (attempt {attempts}/{self.MAX_RESEARCH_ATTEMPTS})")
305
+ logger.info(f"Generating {self.NUM_RESEARCH_CANDIDATES} candidate answers...")
306
+ candidate_answers = []
307
+ for i in range(self.NUM_RESEARCH_CANDIDATES):
308
+ logger.info(f"Generating candidate {i + 1}/{self.NUM_RESEARCH_CANDIDATES}")
309
+ result = self.researcher.generate(
310
+ question=state["question"],
311
+ documents=state["documents"],
312
+ feedback=feedback_for_research,
313
+ previous_answer=previous_answer
314
+ )
315
+ candidate_answers.append(result["draft_answer"])
316
+ logger.info(f"Generated {len(candidate_answers)} candidate answers")
317
+ return {
318
+ "candidate_answers": candidate_answers,
319
+ "research_attempts": attempts,
320
+ "feedback": None
321
+ }
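
With `should_retry` now computed during verification and carried in state, the routing choice in `_decide_next_step` reduces to one bounded predicate; a minimal sketch (edge names follow the diff):

```python
def decide_next_step(should_retry: bool, research_attempts: int, max_attempts: int) -> str:
    """Return the next workflow edge: retry only while under the attempt budget."""
    if should_retry and research_attempts < max_attempts:
        return "re_research"
    return "end"
```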
main.py CHANGED
@@ -17,7 +17,7 @@ from content_analyzer.document_parser import DocumentProcessor
17
  from search_engine.indexer import RetrieverBuilder
18
  from intelligence.orchestrator import AgentWorkflow
19
  from configuration import definitions, parameters
20
- import gradio as gr
21
 
22
  # Example data for demo
23
  EXAMPLES = {
@@ -127,9 +127,26 @@ def _find_open_port(start_port: int, max_attempts: int = 20) -> int:
127
  raise RuntimeError(f"Could not find an open port starting at {start_port}")
128
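
Only the tail of `_find_open_port` is visible in this hunk. A plausible self-contained version of the whole function using the stdlib `socket` module (an assumption about the hidden body, not the actual implementation):

```python
import socket

def find_open_port(start_port: int, max_attempts: int = 20) -> int:
    """Probe ports sequentially and return the first that binds on localhost."""
    for port in range(start_port, start_port + max_attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
                return port  # socket closes on context exit, freeing the port
            except OSError:
                continue
    raise RuntimeError(f"Could not find an open port starting at {start_port}")
```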
 
129
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
130
  def _setup_gradio_shim():
131
  """Shim Gradio's JSON schema conversion to tolerate boolean additionalProperties values."""
132
- import gradio as gr
133
  from gradio_client import utils as grc_utils
134
  _orig_json_schema_to_python_type = grc_utils._json_schema_to_python_type
135
  def _json_schema_to_python_type_safe(schema, defs=None):
@@ -140,7 +157,8 @@ def _setup_gradio_shim():
140
 
141
 
142
  def main():
143
- """Main application entry point."""
 
144
  _setup_gradio_shim()
145
 
146
  logger.info("=" * 60)
@@ -499,36 +517,9 @@ def main():
499
  margin-bottom: 16px !important;
500
  }
501
  """
502
- js = """
503
- function createGradioAnimation() {
504
- var container = document.createElement('div');
505
- container.id = 'gradio-animation';
506
- container.style.fontSize = '2.4em';
507
- container.style.fontWeight = '700';
508
- container.style.textAlign = 'center';
509
- container.style.marginBottom = '20px';
510
- container.style.marginTop = '10px';
511
- container.style.color = '#0369a1';
512
- container.style.letterSpacing = '-0.02em';
513
- var text = '📄 SmartDoc AI';
514
- for (var i = 0; i < text.length; i++) {
515
- (function(i){
516
- setTimeout(function(){
517
- var letter = document.createElement('span');
518
- letter.style.opacity = '0';
519
- letter.style.transition = 'opacity 0.2s ease';
520
- letter.innerText = text[i];
521
- container.appendChild(letter);
522
- setTimeout(function() { letter.style.opacity = '1'; }, 50);
523
- }, i * 80);
524
- })(i);
525
- }
526
- var gradioContainer = document.querySelector('.gradio-container');
527
- gradioContainer.insertBefore(container, gradioContainer.firstChild);
528
- return 'Animation created';
529
- }
530
- (() => {
531
- const upload_messages = [
532
  "Crunching your documents...",
533
  "Warming up the AI...",
534
  "Extracting knowledge...",
@@ -541,99 +532,69 @@ def main():
541
  "Almost ready..."
542
  ];
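
The status-message picker below avoids showing the same message twice in a row. The same idea in Python, for clarity (the `make_message_picker` name is illustrative; behaves sensibly only with two or more messages):

```python
import random

def make_message_picker(messages):
    """Return a zero-arg picker that never repeats the previously shown message."""
    last = [None]  # closure cell holding the previous message
    def pick():
        if len(messages) == 1:
            return messages[0]
        m = last[0]
        while m == last[0]:  # re-draw until it differs from the last one
            m = random.choice(messages)
        last[0] = m
        return m
    return pick

pick = make_message_picker(["Crunching your documents...", "Warming up the AI..."])
```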
543
 
544
- let intervalId = null;
545
- let timerId = null;
546
- let startMs = null;
547
  let lastMsg = null;
548
 
549
- function pickMsg() {
550
- if (upload_messages.length === 0) return "";
551
- if (upload_messages.length === 1) return upload_messages[0];
 
 
 
 
 
 
 
 
552
  let m;
553
- do {
554
- m = upload_messages[Math.floor(Math.random() * upload_messages.length)];
555
- } while (m === lastMsg);
556
  lastMsg = m;
557
  return m;
558
- }
559
-
560
- function getMsgSpan() {
561
- const root = document.getElementById("processing-message");
562
- if (!root) return null;
563
- return root.querySelector("#processing-msg");
564
- }
565
-
566
- function getTimerSpan() {
567
- const root = document.getElementById("processing-message");
568
- if (!root) return null;
569
- return root.querySelector("#processing-timer");
570
- }
571
-
572
- function setMsg(text) {
573
- const span = getMsgSpan();
574
- if (!span) return;
575
- span.textContent = text;
576
- }
577
-
578
- function formatElapsed(startMs) {
579
- const s = (Date.now() - startMs) / 1000;
580
- return `${s.toFixed(1)}s elapsed`;
581
- }
582
-
583
- function startRotationAndTimer() {
584
- stopRotationAndTimer();
585
- setMsg(pickMsg());
586
  startMs = Date.now();
587
- intervalId = setInterval(() => setMsg(pickMsg()), 2000);
588
- const timerSpan = getTimerSpan();
589
- if (timerSpan) {
590
- timerSpan.textContent = formatElapsed(startMs);
591
- timerId = setInterval(() => {
592
- timerSpan.textContent = formatElapsed(startMs);
593
- }, 200);
594
- }
595
- }
596
-
597
- function stopRotationAndTimer() {
598
- if (intervalId) {
599
- clearInterval(intervalId);
600
- intervalId = null;
601
- }
602
- if (timerId) {
603
- clearInterval(timerId);
604
- timerId = null;
605
- }
606
- const timerSpan = getTimerSpan();
607
- if (timerSpan) timerSpan.textContent = "";
608
- }
609
-
610
- // Auto start/stop based on visibility of the processing box
611
- function watchProcessingBox() {
612
- const root = document.getElementById("processing-message");
613
- if (!root) {
614
- setTimeout(watchProcessingBox, 250);
615
- return;
616
- }
617
- const isVisible = () => root.offsetParent !== null;
618
- let prev = isVisible();
619
- if (prev) startRotationAndTimer();
620
-
621
- const obs = new MutationObserver(() => {
622
- const now = isVisible();
623
- if (now && !prev) startRotationAndTimer();
624
- if (!now && prev) stopRotationAndTimer();
625
- prev = now;
626
- });
627
-
628
- obs.observe(root, { attributes: true, attributeFilter: ["style", "class"] });
629
- }
630
-
631
- window.smartdocStartRotationAndTimer = startRotationAndTimer;
632
- window.smartdocStopRotationAndTimer = stopRotationAndTimer;
633
-
634
- watchProcessingBox();
635
  })();
636
- """
637
 
638
  with gr.Blocks(theme=gr.themes.Soft(), title="SmartDoc AI", css=css, js=js) as demo:
639
  gr.Markdown("### SmartDoc AI - Document Q&A", elem_classes="app-title")
@@ -668,26 +629,8 @@ def main():
668
  "session_start": datetime.now().strftime("%Y-%m-%d %H:%M")
669
  })
670
 
671
- def process_question(question_text, uploaded_files, chat_history):
672
- import time
673
- import random
674
- chat_history = chat_history or []
675
- upload_messages = [
676
- "Crunching your documents...",
677
- "Warming up the AI...",
678
- "Extracting knowledge...",
679
- "Scanning for insights...",
680
- "Preparing your data...",
681
- "Looking for answers...",
682
- "Analyzing file structure...",
683
- "Reading your files...",
684
- "Indexing content...",
685
- "Almost ready..."
686
- ]
687
- last_msg = None
688
- start_time = time.time()
689
- msg = random.choice([m for m in upload_messages if m != last_msg])
690
- last_msg = msg
691
  yield (
692
  chat_history,
693
  gr.update(visible=False),
@@ -701,7 +644,6 @@ def main():
701
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
702
  </div>''', visible=True)
703
  )
704
-
705
  try:
706
  if not question_text.strip():
707
  chat_history.append({"role": "user", "content": question_text})
@@ -732,42 +674,32 @@ def main():
732
  )
733
  return
734
  # Stage 2: Chunking with per-chunk progress and rotating status
735
- all_chunks = []
736
- seen_hashes = set()
737
- total_chunks = 0
738
- chunk_counts = []
739
- for file in uploaded_files:
740
- with open(file.name, 'rb') as f:
741
  file_content = f.read()
742
- file_hash = processor._generate_hash(file_content)
743
  cache_path = processor.cache_dir / f"{file_hash}.pkl"
744
  if processor._is_cache_valid(cache_path):
745
  chunks = processor._load_from_cache(cache_path)
746
- if not chunks:
747
- chunks = processor._process_file(file)
748
- processor._save_to_cache(chunks, cache_path)
749
- else:
750
- chunks = processor._process_file(file)
751
- processor._save_to_cache(chunks, cache_path)
752
- chunk_counts.append(len(chunks))
753
  total_chunks += len(chunks)
754
  if total_chunks == 0:
755
  total_chunks = 1
756
  chunk_idx = 0
757
- msg = random.choice(upload_messages)
758
- for file, file_chunk_count in zip(uploaded_files, chunk_counts):
759
- with open(file.name, 'rb') as f:
760
- file_content = f.read()
761
- file_hash = processor._generate_hash(file_content)
762
- cache_path = processor.cache_dir / f"{file_hash}.pkl"
763
- if processor._is_cache_valid(cache_path):
764
- chunks = processor._load_from_cache(cache_path)
765
- if not chunks:
766
- chunks = processor._process_file(file)
767
- processor._save_to_cache(chunks, cache_path)
768
- else:
769
- chunks = processor._process_file(file)
770
- processor._save_to_cache(chunks, cache_path)
771
  for chunk in chunks:
772
  chunk_hash = processor._generate_hash(chunk.page_content.encode())
773
  if chunk_hash not in seen_hashes:
@@ -775,12 +707,7 @@ def main():
775
  all_chunks.append(chunk)
776
  # else: skip duplicate chunk
777
  chunk_idx += 1
778
- # Rotate status message every 10 seconds
779
- elapsed = time.time() - start_time
780
- if chunk_idx == 1 or (elapsed // 10) > ((elapsed-1) // 10):
781
- msg = random.choice([m for m in upload_messages if m != last_msg])
782
- last_msg = msg
783
- # When yielding progress, always do:
784
  yield (
785
  chat_history,
786
  gr.update(visible=False),
@@ -794,8 +721,7 @@ def main():
794
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
795
  </div>''', visible=True)
796
  )
797
- # After all chunks, show 100%
798
- elapsed = time.time() - start_time
799
  yield (
800
  chat_history,
801
  gr.update(visible=False),
@@ -810,7 +736,6 @@ def main():
810
  </div>''', visible=True)
811
  )
812
  # Stage 3: Building Retriever
813
- elapsed = time.time() - start_time
814
  yield (
815
  chat_history,
816
  gr.update(visible=False),
@@ -828,7 +753,6 @@ def main():
828
  )
829
  retriever = retriever_indexer.build_hybrid_retriever(all_chunks)
830
  # Stage 4: Generating Answer
831
- elapsed = time.time() - start_time
832
  yield (
833
  chat_history,
834
  gr.update(visible=False),
@@ -842,10 +766,9 @@ def main():
842
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
843
  </div>''', visible=True)
844
  )
845
- result = orchestrator.full_pipeline(question=question_text, retriever=retriever)
846
  answer = result["draft_answer"]
847
  # Stage 5: Verifying Answer
848
- elapsed = time.time() - start_time
849
  yield (
850
  chat_history,
851
  gr.update(visible=False),
@@ -864,10 +787,7 @@ def main():
864
  # Do not display verification to user, only use internally
865
  chat_history.append({"role": "user", "content": question_text})
866
  chat_history.append({"role": "assistant", "content": f"**Answer:**\n{answer}"})
867
-
868
  session_state.value["last_documents"] = retriever.invoke(question_text)
869
- # Final: Show results and make context tab visible
870
- total_elapsed = time.time() - start_time
871
  yield (
872
  chat_history,
873
  gr.update(visible=True), # doc_context_display
@@ -880,9 +800,7 @@ def main():
880
  <span id="processing-msg"></span>
881
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
882
  </div>''', visible=True)
883
- )
884
-
885
- time.sleep(1.5)
886
  yield (
887
  chat_history,
888
  gr.update(visible=True),
@@ -954,10 +872,8 @@ def main():
954
  file_info_text += f"{source_file_path} not found\n"
955
  if not copied_files:
956
  return [], "", "Could not load example files"
957
- return copied_files, question_text, file_info_text
958
-
959
- # Remove the Load Example button and related logic
960
- # Instead, load the example immediately when dropdown changes
961
  example_dropdown.change(
962
  fn=load_example,
963
  inputs=[example_dropdown],
@@ -967,6 +883,7 @@ def main():
967
  # HF Spaces sets SPACE_ID environment variable
968
  is_hf_space = os.environ.get("SPACE_ID") is not None
969
 
 
970
  if is_hf_space:
971
  # Hugging Face Spaces configuration
972
  logger.info("Running on Hugging Face Spaces")
@@ -975,10 +892,8 @@ def main():
975
  # Local development configuration
976
  configured_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
977
  server_port = _find_open_port(configured_port)
978
-
979
  logger.info(f"Launching Gradio on port {server_port}")
980
  logger.info(f"Access the app at: http://127.0.0.1:{server_port}")
981
-
982
  demo.launch(server_name="127.0.0.1", server_port=server_port, share=False)
983
 
984
 
 
17
  from search_engine.indexer import RetrieverBuilder
18
  from intelligence.orchestrator import AgentWorkflow
19
  from configuration import definitions, parameters
20
+
21
 
22
  # Example data for demo
23
  EXAMPLES = {
 
127
  raise RuntimeError(f"Could not find an open port starting at {start_port}")
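Only the tail of `_find_open_port` is visible in this diff. For readers skimming the commit, a hypothetical sketch of what such a port-probing helper typically looks like (the bind address and retry bound are assumptions, not the repo's values):

```python
# Hypothetical sketch of a port-probing helper like _find_open_port;
# the repo's actual implementation is not shown in this diff.
import socket

def find_open_port(start_port: int, max_tries: int = 50) -> int:
    for port in range(start_port, start_port + max_tries):
        # bind() succeeds only if nothing else currently holds the port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # port busy, probe the next one
    raise RuntimeError(f"Could not find an open port starting at {start_port}")
```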
128
 
129
 
130
+ def _ensure_hfhub_hffolder_compat():
131
+ """
132
+ Compatibility shim: older Gradio releases (< 5.7.1) still import huggingface_hub.HfFolder, which was removed in huggingface_hub >= 1.0.
133
+ """
134
+ import huggingface_hub
135
+ if hasattr(huggingface_hub, "HfFolder"):
136
+ return
137
+ try:
138
+ from huggingface_hub.utils import get_token
139
+ except Exception:
140
+ return
141
+ class HfFolder:
142
+ @staticmethod
143
+ def get_token():
144
+ return get_token()
145
+ huggingface_hub.HfFolder = HfFolder
146
+
147
+
148
  def _setup_gradio_shim():
149
  """Shim Gradio's JSON schema conversion to tolerate boolean additionalProperties values."""
 
150
  from gradio_client import utils as grc_utils
151
  _orig_json_schema_to_python_type = grc_utils._json_schema_to_python_type
152
  def _json_schema_to_python_type_safe(schema, defs=None):
 
157
 
158
 
159
  def main():
160
+ _ensure_hfhub_hffolder_compat() # must run before importing gradio
161
+ import gradio as gr
162
  _setup_gradio_shim()
163
 
164
  logger.info("=" * 60)
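The call order enforced above matters: the `HfFolder` attribute must exist before Gradio's import-time code looks it up. The same restore-a-removed-attribute pattern, reduced to a toy module (every name below is a stand-in, not the real `huggingface_hub` API):

```python
# Toy illustration of the compat-shim pattern used by
# _ensure_hfhub_hffolder_compat: reattach a removed attribute to a module
# before anything that still expects it gets imported.
import types

legacy_lib = types.ModuleType("legacy_lib")  # stand-in for huggingface_hub

def get_token():
    return "tok-123"  # stand-in for huggingface_hub.utils.get_token

if not hasattr(legacy_lib, "HfFolder"):
    class HfFolder:
        @staticmethod
        def get_token():
            return get_token()
    legacy_lib.HfFolder = HfFolder  # old call sites keep working
```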
 
517
  margin-bottom: 16px !important;
518
  }
519
  """
520
+ js = r"""
521
+ (() => {
522
+ const uploadMessages = [
523
  "Crunching your documents...",
524
  "Warming up the AI...",
525
  "Extracting knowledge...",
 
532
  "Almost ready..."
533
  ];
534
 
535
+ let msgInterval = null;
536
+ let timerInterval = null;
537
+ let startMs = 0;
538
  let lastMsg = null;
539
 
540
+ // On Gradio re-renders the element may be replaced, so pick the visible node if duplicates ever appear
541
+ const root = () => {
542
+ const all = Array.from(document.querySelectorAll("#processing-message"));
543
+ return all.find(el => el && (el.offsetWidth || el.offsetHeight || el.getClientRects().length)) || all[0] || null;
544
+ };
545
+
546
+ const isVisible = (el) => !!(el && (el.offsetWidth || el.offsetHeight || el.getClientRects().length));
547
+
548
+ const pickMsg = () => {
549
+ if (uploadMessages.length === 0) return "";
550
+ if (uploadMessages.length === 1) return uploadMessages[0];
551
  let m;
552
+ do { m = uploadMessages[Math.floor(Math.random() * uploadMessages.length)]; }
553
+ while (m === lastMsg);
 
554
  lastMsg = m;
555
  return m;
556
+ };
557
+
558
+ const getMsgSpan = () => root()?.querySelector("#processing-msg");
559
+ const getTimerSpan = () => root()?.querySelector("#processing-timer");
560
+
561
+ const setMsg = (t) => { const s = getMsgSpan(); if (s) s.textContent = t; };
562
+ const fmtElapsed = () => `${((Date.now() - startMs) / 1000).toFixed(1)}s elapsed`;
563
+
564
+ const start = () => {
565
+ if (msgInterval || timerInterval) return;
566
  startMs = Date.now();
567
+ setMsg(pickMsg());
568
+
569
+ msgInterval = setInterval(() => setMsg(pickMsg()), 2000);
570
+
571
+ const t = getTimerSpan();
572
+ if (t) {
573
+ t.textContent = fmtElapsed();
574
+ timerInterval = setInterval(() => { t.textContent = fmtElapsed(); }, 200);
575
+ }
576
+ };
577
+
578
+ const stop = () => {
579
+ if (msgInterval) { clearInterval(msgInterval); msgInterval = null; }
580
+ if (timerInterval) { clearInterval(timerInterval); timerInterval = null; }
581
+ const t = getTimerSpan();
582
+ if (t) t.textContent = "";
583
+ };
584
+
585
+ const tick = () => {
586
+ const r = root();
587
+ if (isVisible(r)) start();
588
+ else stop();
589
+ };
590
+
591
+ const obs = new MutationObserver(tick);
592
+ obs.observe(document.body, { subtree: true, childList: true, attributes: true });
593
+
594
+ window.addEventListener("load", tick);
595
+ setInterval(tick, 500);
596
  })();
597
+ """
598
 
599
  with gr.Blocks(theme=gr.themes.Soft(), title="SmartDoc AI", css=css, js=js) as demo:
600
  gr.Markdown("### SmartDoc AI - Document Q&A", elem_classes="app-title")
 
629
  "session_start": datetime.now().strftime("%Y-%m-%d %H:%M")
630
  })
631
 
632
+ def process_question(question_text, uploaded_files, chat_history):
633
+ chat_history = chat_history or []
634
  yield (
635
  chat_history,
636
  gr.update(visible=False),
 
644
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
645
  </div>''', visible=True)
646
  )
 
647
  try:
648
  if not question_text.strip():
649
  chat_history.append({"role": "user", "content": question_text})
 
674
  )
675
  return
676
  # Stage 2: Chunking with per-chunk progress and rotating status
677
+ def load_or_process(file):
678
+ with open(file.name, "rb") as f:
679
  file_content = f.read()
680
+ file_hash = processor._generate_hash(file_content)
681
  cache_path = processor.cache_dir / f"{file_hash}.pkl"
682
  if processor._is_cache_valid(cache_path):
683
  chunks = processor._load_from_cache(cache_path)
684
+ if chunks:
685
+ logger.info(f"Using cached chunks for {file.name}")
686
+ return chunks
687
+ chunks = processor._process_file(file)
688
+ processor._save_to_cache(chunks, cache_path)
689
+ return chunks
690
+
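The cache-or-process flow that `load_or_process` implements can be sketched standalone like this (SHA-256 and pickle here are illustrative stand-ins; the app delegates hashing and cache I/O to `DocumentProcessor`'s private helpers):

```python
# Standalone sketch of the hash-keyed cache pattern: key the cache entry on
# a digest of the file bytes, and treat an empty cached value as a miss.
import hashlib
import pickle
from pathlib import Path

def load_or_process(path: Path, cache_dir: Path, process):
    data = path.read_bytes()
    cache_path = cache_dir / f"{hashlib.sha256(data).hexdigest()}.pkl"
    if cache_path.exists():
        chunks = pickle.loads(cache_path.read_bytes())
        if chunks:  # empty cache entry -> reprocess
            return chunks
    chunks = process(data)
    cache_path.write_bytes(pickle.dumps(chunks))
    return chunks
```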
691
+ all_chunks = []
692
+ seen_hashes = set()
693
+ chunks_by_file = []
694
+ total_chunks = 0
695
+ for file in uploaded_files:
696
+ chunks = load_or_process(file)
697
+ chunks_by_file.append(chunks)
698
  total_chunks += len(chunks)
699
  if total_chunks == 0:
700
  total_chunks = 1
701
  chunk_idx = 0
702
+ for chunks in chunks_by_file:
703
  for chunk in chunks:
704
  chunk_hash = processor._generate_hash(chunk.page_content.encode())
705
  if chunk_hash not in seen_hashes:
 
707
  all_chunks.append(chunk)
708
  # else: skip duplicate chunk
709
  chunk_idx += 1
710
+ # yield progress here if needed
711
  yield (
712
  chat_history,
713
  gr.update(visible=False),
 
721
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
722
  </div>''', visible=True)
723
  )
724
+ # After all chunks, show 100%
 
725
  yield (
726
  chat_history,
727
  gr.update(visible=False),
 
736
  </div>''', visible=True)
737
  )
738
  # Stage 3: Building Retriever
 
739
  yield (
740
  chat_history,
741
  gr.update(visible=False),
 
753
  )
754
  retriever = retriever_indexer.build_hybrid_retriever(all_chunks)
755
  # Stage 4: Generating Answer
 
756
  yield (
757
  chat_history,
758
  gr.update(visible=False),
 
766
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
767
  </div>''', visible=True)
768
  )
769
+ result = orchestrator.run_workflow(question=question_text, retriever=retriever)
770
  answer = result["draft_answer"]
771
  # Stage 5: Verifying Answer
 
772
  yield (
773
  chat_history,
774
  gr.update(visible=False),
 
787
  # Do not display verification to user, only use internally
788
  chat_history.append({"role": "user", "content": question_text})
789
  chat_history.append({"role": "assistant", "content": f"**Answer:**\n{answer}"})
 
790
  session_state.value["last_documents"] = retriever.invoke(question_text)
 
 
791
  yield (
792
  chat_history,
793
  gr.update(visible=True), # doc_context_display
 
800
  <span id="processing-msg"></span>
801
  <span id="processing-timer" style="opacity:0.8; margin-left:8px;"></span>
802
  </div>''', visible=True)
803
+ )
 
 
804
  yield (
805
  chat_history,
806
  gr.update(visible=True),
 
872
  file_info_text += f"{source_file_path} not found\n"
873
  if not copied_files:
874
  return [], "", "Could not load example files"
875
+ return copied_files, question_text, file_info_text
876
+
 
 
877
  example_dropdown.change(
878
  fn=load_example,
879
  inputs=[example_dropdown],
 
883
  # HF Spaces sets SPACE_ID environment variable
884
  is_hf_space = os.environ.get("SPACE_ID") is not None
885
 
886
+ demo.queue()
887
  if is_hf_space:
888
  # Hugging Face Spaces configuration
889
  logger.info("Running on Hugging Face Spaces")
 
892
  # Local development configuration
893
  configured_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
894
  server_port = _find_open_port(configured_port)
 
895
  logger.info(f"Launching Gradio on port {server_port}")
896
  logger.info(f"Access the app at: http://127.0.0.1:{server_port}")
 
897
  demo.launch(server_name="127.0.0.1", server_port=server_port, share=False)
898
 
899
 
packages.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ poppler-utils
requirements.txt CHANGED
@@ -27,7 +27,7 @@ google-generativeai>=0.8.0
27
  chromadb>=0.6.3
28
 
29
  # Web framework
30
- gradio>=5.13.0
31
 
32
  # Data processing
33
  pandas>=2.1.4
 
27
  chromadb>=0.6.3
28
 
29
  # Web framework
30
+ gradio>=5.7.1
31
 
32
  # Data processing
33
  pandas>=2.1.4
search_engine/indexer.py CHANGED
@@ -74,7 +74,7 @@ class EnsembleRetriever(BaseRetriever):
74
  *,
75
  run_manager: CallbackManagerForRetrieverRun = None
76
  ) -> List[Document]:
77
- """Retrieve and combine documents using weighted RRF, deduplicating charts by content and aggregating page numbers."""
78
  logger.debug(f"[ENSEMBLE] Query: {query[:80]}...")
79
  all_docs_with_scores = {}
80
  retriever_names = ["BM25", "Vector"]
@@ -84,8 +84,8 @@ class EnsembleRetriever(BaseRetriever):
84
  docs = retriever.invoke(query)
85
  logger.debug(f"[ENSEMBLE] {retriever_name}: {len(docs)} docs (weight: {weight})")
86
  for rank, doc in enumerate(docs):
87
- # Deduplicate by content and source only
88
- doc_key = (doc.page_content, doc.metadata.get('source', ''))
89
  rrf_score = weight / (rank + 1 + self.c)
90
  if doc_key in all_docs_with_scores:
91
  existing_doc, existing_score = all_docs_with_scores[doc_key]
 
74
  *,
75
  run_manager: CallbackManagerForRetrieverRun = None
76
  ) -> List[Document]:
77
+ """Retrieve and combine documents using weighted RRF, deduplicating charts by doc_id and aggregating page numbers."""
78
  logger.debug(f"[ENSEMBLE] Query: {query[:80]}...")
79
  all_docs_with_scores = {}
80
  retriever_names = ["BM25", "Vector"]
 
84
  docs = retriever.invoke(query)
85
  logger.debug(f"[ENSEMBLE] {retriever_name}: {len(docs)} docs (weight: {weight})")
86
  for rank, doc in enumerate(docs):
87
+ # Deduplicate by doc_id only
88
+ doc_key = doc_id(doc)
89
  rrf_score = weight / (rank + 1 + self.c)
90
  if doc_key in all_docs_with_scores:
91
  existing_doc, existing_score = all_docs_with_scores[doc_key]
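For reference, the weighted reciprocal-rank-fusion scoring applied in the loop above can be sketched in isolation (the weights and the `c` smoothing constant below are illustrative, not the repo's configured values):

```python
# Minimal sketch of weighted Reciprocal Rank Fusion (RRF) as used by
# EnsembleRetriever: each retriever contributes weight / (rank + 1 + c)
# per document, and scores for the same doc key are summed across lists.
def rrf_fuse(ranked_lists, weights, c=60):
    """ranked_lists: lists of doc keys, best first; returns fused order."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, key in enumerate(docs):
            scores[key] = scores.get(key, 0.0) + weight / (rank + 1 + c)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["a", "b", "c"]
vector = ["b", "c", "d"]
fused = rrf_fuse([bm25, vector], weights=[0.4, 0.6])
# "b" ranks first because it scores well in both lists
```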
tests/conftest.py DELETED
@@ -1,71 +0,0 @@
1
- """
2
- Test fixtures and shared utilities for DocChat tests.
3
- """
4
- import pytest
5
- from unittest.mock import MagicMock
6
- from langchain_core.documents import Document
7
-
8
-
9
- class FakeLLM:
10
- """Mock LLM for testing without API calls."""
11
-
12
- def __init__(self, content: str = "Test response"):
13
- self.content = content
14
- self.last_prompt = None
15
- self.invoke_count = 0
16
-
17
- def invoke(self, prompt: str):
18
- self.last_prompt = prompt
19
- self.invoke_count += 1
20
- return type("Response", (), {"content": self.content})()
21
-
22
-
23
- class FakeRetriever:
24
- """Mock retriever for testing without vector store."""
25
-
26
- def __init__(self, documents: list = None):
27
- self.documents = documents or []
28
- self.invoke_count = 0
29
- self.last_query = None
30
-
31
- def invoke(self, query: str):
32
- self.last_query = query
33
- self.invoke_count += 1
34
- return self.documents
35
-
36
-
37
- @pytest.fixture
38
- def sample_documents():
39
- """Create sample documents for testing."""
40
- return [
41
- Document(
42
- page_content="The data center in Singapore achieved a PUE of 1.12 in 2022.",
43
- metadata={"source": "test.pdf", "page": 1}
44
- ),
45
- Document(
46
- page_content="Carbon-free energy in Asia Pacific reached 45% in 2023.",
47
- metadata={"source": "test.pdf", "page": 2}
48
- ),
49
- Document(
50
- page_content="DeepSeek-R1 outperformed o1-mini on coding benchmarks.",
51
- metadata={"source": "deepseek.pdf", "page": 1}
52
- ),
53
- ]
54
-
55
-
56
- @pytest.fixture
57
- def fake_llm():
58
- """Create a fake LLM for testing."""
59
- return FakeLLM("This is a test response.")
60
-
61
-
62
- @pytest.fixture
63
- def fake_retriever(sample_documents):
64
- """Create a fake retriever with sample documents."""
65
- return FakeRetriever(sample_documents)
66
-
67
-
68
- @pytest.fixture
69
- def empty_retriever():
70
- """Create a fake retriever that returns no documents."""
71
- return FakeRetriever([])

tests/test_accuracy_verifier.py DELETED
@@ -1,110 +0,0 @@
1
- """
2
- Tests for the VerificationAgent.
3
- """
4
- import pytest
5
- from unittest.mock import MagicMock, patch
6
- from langchain_core.documents import Document
7
-
8
- # Import after setting up mocks to avoid API key validation
9
- import sys
10
- sys.path.insert(0, '.')
11
-
12
-
13
- class TestVerificationAgent:
14
- """Test suite for VerificationAgent."""
15
-
16
- @pytest.fixture
17
- def mock_parameters(self, monkeypatch):
18
- """Mock parameters to avoid API key requirement."""
19
- monkeypatch.setenv("GOOGLE_API_KEY", "test_key_for_testing")
20
-
21
- @pytest.fixture
22
- def accuracy_verifier(self, mock_parameters, fake_llm):
23
- """Create a VerificationAgent with mocked LLM."""
24
- from intelligence.accuracy_verifier import VerificationAgent
25
- return VerificationAgent(llm=fake_llm)
26
-
27
- def test_check_with_supported_answer(self, accuracy_verifier, sample_documents):
28
- """Test verification with an answer supported by documents."""
29
- # Configure the fake LLM to return a supported response
30
- accuracy_verifier.llm.content = """
31
- Supported: YES
32
- Unsupported Claims: []
33
- Contradictions: []
34
- Relevant: YES
35
- Additional Details: The answer is well-supported by the context.
36
- """
37
-
38
- result = accuracy_verifier.check(
39
- answer="The PUE in Singapore was 1.12 in 2022.",
40
- documents=sample_documents
41
- )
42
-
43
- assert "verification_report" in result
44
- assert "Supported: YES" in result["verification_report"]
45
- assert "context_used" in result
46
-
47
- def test_check_with_unsupported_answer(self, accuracy_verifier, sample_documents):
48
- """Test verification with an unsupported answer."""
49
- accuracy_verifier.llm.content = """
50
- Supported: NO
51
- Unsupported Claims: [The PUE was 1.5]
52
- Contradictions: []
53
- Relevant: YES
54
- Additional Details: The claimed PUE value is not in the context.
55
- """
56
-
57
- result = accuracy_verifier.check(
58
- answer="The PUE in Singapore was 1.5 in 2022.",
59
- documents=sample_documents
60
- )
61
-
62
- assert "Supported: NO" in result["verification_report"]
63
-
64
- def test_parse_verification_response_valid(self, accuracy_verifier):
65
- """Test parsing a valid verification response."""
66
- response = """
67
- Supported: YES
68
- Unsupported Claims: []
69
- Contradictions: []
70
- Relevant: YES
71
- Additional Details: All claims verified.
72
- """
73
-
74
- parsed = accuracy_verifier.parse_verification_response(response)
75
-
76
- assert parsed["Supported"] == "YES"
77
- assert parsed["Relevant"] == "YES"
78
- assert parsed["Unsupported Claims"] == []
79
-
80
- def test_parse_verification_response_with_claims(self, accuracy_verifier):
81
- """Test parsing response with unsupported claims."""
82
- response = """
83
- Supported: NO
84
- Unsupported Claims: [claim1, claim2]
85
- Contradictions: [contradiction1]
86
- Relevant: YES
87
- Additional Details: Multiple issues found.
88
- """
89
-
90
- parsed = accuracy_verifier.parse_verification_response(response)
91
-
92
- assert parsed["Supported"] == "NO"
93
- assert len(parsed["Unsupported Claims"]) == 2
94
- assert len(parsed["Contradictions"]) == 1
95
-
96
- def test_format_verification_report(self, accuracy_verifier):
97
- """Test formatting a verification report."""
98
- verification = {
99
- "Supported": "YES",
100
- "Unsupported Claims": [],
101
- "Contradictions": [],
102
- "Relevant": "YES",
103
- "Additional Details": "Well verified."
104
- }
105
-
106
- report = accuracy_verifier.format_verification_report(verification)
107
-
108
- assert "**Supported:** YES" in report
109
- assert "**Relevant:** YES" in report
110
- assert "**Unsupported Claims:** None" in report

tests/test_context_validator.py DELETED
@@ -1,120 +0,0 @@
1
- """
2
- Tests for the RelevanceChecker.
3
- """
4
- import pytest
5
- from unittest.mock import MagicMock
6
- from langchain_core.documents import Document
7
-
8
- import sys
9
- sys.path.insert(0, '.')
10
-
11
-
12
- class TestRelevanceChecker:
13
- """Test suite for RelevanceChecker."""
14
-
15
- @pytest.fixture
16
- def mock_parameters(self, monkeypatch):
17
- """Mock parameters to avoid API key requirement."""
18
- monkeypatch.setenv("GOOGLE_API_KEY", "test_key_for_testing")
19
-
20
- @pytest.fixture
21
- def context_validator(self, mock_parameters, fake_llm):
22
- """Create a RelevanceChecker with mocked LLM."""
23
- from intelligence.context_validator import RelevanceChecker
24
- checker = RelevanceChecker()
25
- checker.llm = fake_llm
26
- return checker
27
-
28
- def test_check_can_answer(self, context_validator, fake_retriever):
29
- """Test when documents can fully answer the question."""
30
- context_validator.llm.content = "CAN_ANSWER"
31
-
32
- result = context_validator.check(
33
- question="What is the PUE in Singapore?",
34
- retriever=fake_retriever,
35
- k=3
36
- )
37
-
38
- assert result == "CAN_ANSWER"
39
- assert fake_retriever.invoke_count == 1
40
-
41
- def test_check_partial_match(self, context_validator, fake_retriever):
42
- """Test when documents partially match the question."""
43
- context_validator.llm.content = "PARTIAL"
44
-
45
- result = context_validator.check(
46
- question="What is the historical trend of PUE?",
47
- retriever=fake_retriever,
48
- k=3
49
- )
50
-
51
- assert result == "PARTIAL"
52
-
53
- def test_check_no_match(self, context_validator, fake_retriever):
54
- """Test when documents don't match the question."""
55
- context_validator.llm.content = "NO_MATCH"
56
-
57
- result = context_validator.check(
58
- question="What is the weather in Paris?",
59
- retriever=fake_retriever,
60
- k=3
61
- )
62
-
63
- assert result == "NO_MATCH"
64
-
65
- def test_check_empty_question(self, context_validator, fake_retriever):
66
- """Test with empty question returns NO_MATCH."""
67
- result = context_validator.check(
68
- question="",
69
- retriever=fake_retriever,
70
- k=3
71
- )
72
-
73
- assert result == "NO_MATCH"
74
-
75
- def test_check_empty_retriever_results(self, context_validator, empty_retriever):
76
- """Test when retriever returns no documents."""
77
- result = context_validator.check(
78
- question="Any question",
79
- retriever=empty_retriever,
80
- k=3
81
- )
82
-
83
- assert result == "NO_MATCH"
84
-
85
- def test_check_invalid_llm_response(self, context_validator, fake_retriever):
86
- """Test when LLM returns invalid response."""
87
- context_validator.llm.content = "INVALID_LABEL"
88
-
89
- result = context_validator.check(
90
- question="What is the PUE?",
91
- retriever=fake_retriever,
92
- k=3
93
- )
94
-
95
- assert result == "NO_MATCH"
96
-
97
- def test_check_retriever_exception(self, context_validator):
98
- """Test when retriever throws an exception."""
99
- failing_retriever = MagicMock()
100
- failing_retriever.invoke.side_effect = Exception("Connection error")
101
-
102
- result = context_validator.check(
103
- question="Any question",
104
- retriever=failing_retriever,
105
- k=3
106
- )
107
-
108
- assert result == "NO_MATCH"
109
-
110
- def test_check_invalid_k_value(self, context_validator, fake_retriever):
111
- """Test with invalid k value defaults to 3."""
112
- context_validator.llm.content = "CAN_ANSWER"
113
-
114
- result = context_validator.check(
115
- question="What is the PUE?",
116
- retriever=fake_retriever,
117
- k=-1
118
- )
119
-
120
- assert result == "CAN_ANSWER"

tests/test_knowledge_synthesizer.py DELETED
@@ -1,50 +0,0 @@
1
- import unittest
2
-
3
- try:
4
- from langchain_core.documents import Document
5
- from intelligence.knowledge_synthesizer import ResearchAgent
6
- LANGCHAIN_AVAILABLE = True
7
- except ImportError:
8
- Document = None # type: ignore
9
- ResearchAgent = None # type: ignore
10
- LANGCHAIN_AVAILABLE = False
11
-
12
-
13
- class FakeLLM:
14
- """Simple stand-in for ChatGoogleGenerativeAI to avoid network calls."""
15
-
16
- def __init__(self, content: str) -> None:
17
- self.content = content
18
- self.last_prompt = None
19
-
20
- def invoke(self, prompt: str):
21
- self.last_prompt = prompt
22
- return type("Resp", (), {"content": self.content})
23
-
24
-
25
- @unittest.skipUnless(LANGCHAIN_AVAILABLE, "langchain not installed in this environment")
26
- class ResearchAgentTests(unittest.TestCase):
27
- def test_generate_returns_stubbed_content_with_citations(self):
28
- docs = [
29
- Document(page_content="Alpha text", metadata={"id": "a1"}),
30
- Document(page_content="Beta text", metadata={"source": "s1"}),
31
- ]
32
- llm = FakeLLM("Answer about alpha")
33
- agent = ResearchAgent(llm=llm, top_k=1, max_context_chars=200)
34
-
35
- result = agent.generate("What is alpha?", docs)
36
-
37
- self.assertEqual(result["draft_answer"], "Answer about alpha")
38
- self.assertIn("Alpha text", llm.last_prompt)
39
-
40
- def test_generate_handles_no_documents(self):
41
- llm = FakeLLM("unused")
42
- agent = ResearchAgent(llm=llm)
43
-
44
- result = agent.generate("Any question", [])
45
-
46
- self.assertIn("could not find supporting documents", result["draft_answer"])
47
-
48
-
49
- if __name__ == "__main__":
50
- unittest.main()

tests/test_visual_extraction.py DELETED
@@ -1,169 +0,0 @@
- """
- Test script for Gemini Vision chart extraction.
- 
- This script demonstrates how to use the chart extraction feature
- and validates that it's working correctly.
- """
- import logging
- import os
- import sys
- from pathlib import Path
- 
- # Add parent directory to path
- sys.path.insert(0, str(Path(__file__).parent.parent))
- 
- from content_analyzer.document_parser import DocumentProcessor
- from configuration.parameters import parameters
- 
- # Configure logging
- logging.basicConfig(
-     level=logging.INFO,
-     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
- )
- logger = logging.getLogger(__name__)
- 
- 
- def test_chart_extraction():
-     """Test chart extraction on a sample PDF with charts."""
- 
-     logger.info("=" * 60)
-     logger.info("Testing Gemini Vision Chart Extraction")
-     logger.info("=" * 60)
- 
-     # Check if chart extraction is enabled
-     if not parameters.ENABLE_CHART_EXTRACTION:
-         logger.warning("Chart extraction is DISABLED")
-         logger.info("Enable it by setting ENABLE_CHART_EXTRACTION=true in .env")
-         return
- 
-     logger.info("Chart extraction enabled")
-     logger.info(f"Using model: {parameters.CHART_VISION_MODEL}")
-     logger.info(f"Max tokens: {parameters.CHART_MAX_TOKENS}")
- 
-     # Initialize processor
-     try:
-         processor = DocumentProcessor()
-         logger.info("DocumentProcessor initialized")
- 
-         if processor.gemini_client:
-             logger.info("Gemini Vision client ready")
-         else:
-             logger.error("Gemini Vision client not initialized")
-             return
- 
-     except Exception as e:
-         logger.error(f"Failed to initialize processor: {e}")
-         return
- 
-     # Test with example PDF (if exists)
-     test_files = [
-         "examples/google-2024-environmental-report.pdf",
-         "examples/deppseek.pdf",
-         "test/sample_with_charts.pdf"
-     ]
- 
-     found_file = None
-     for test_file in test_files:
-         if os.path.exists(test_file):
-             found_file = test_file
-             break
- 
-     if not found_file:
-         logger.warning("No test PDF files found")
-         logger.info("Available test files:")
-         for tf in test_files:
-             logger.info(f" - {tf}")
-         logger.info("\nTo test manually:")
-         logger.info("1. Place a PDF with charts in one of the above locations")
-         logger.info("2. Run this script again")
-         return
- 
-     logger.info(f"\nProcessing test file: {found_file}")
- 
-     # Create mock file object
-     class MockFile:
-         def __init__(self, path):
-             self.name = path
-             self.size = os.path.getsize(path)
- 
-     try:
-         # Process the file
-         mock_file = MockFile(found_file)
-         chunks = processor.process([mock_file])
- 
-         logger.info("\nProcessing complete!")
-         logger.info(f"Total chunks extracted: {len(chunks)}")
- 
-         # Count chart chunks
-         chart_chunks = [c for c in chunks if c.metadata.get("type") == "chart"]
-         text_chunks = [c for c in chunks if c.metadata.get("type") != "chart"]
- 
-         logger.info(f"Chart chunks: {len(chart_chunks)}")
-         logger.info(f"Text chunks: {len(text_chunks)}")
- 
-         # Display chart analyses
-         if chart_chunks:
-             logger.info(f"\n{'=' * 60}")
-             logger.info("CHART ANALYSES EXTRACTED:")
-             logger.info('=' * 60)
- 
-             for i, chunk in enumerate(chart_chunks, 1):
-                 logger.info(f"\n--- Chart {i} ---")
-                 logger.info(f"Page: {chunk.metadata.get('page')}")
-                 logger.info(f"Preview: {chunk.page_content[:200]}...")
-                 logger.info("")
-         else:
-             logger.info("\nNo charts detected in this document")
-             logger.info("This could mean:")
-             logger.info(" - Document contains no charts")
-             logger.info(" - Charts are embedded as tables (already extracted)")
-             logger.info(" - Charts are too complex for detection")
- 
-         logger.info(f"\n{'=' * 60}")
-         logger.info("Test completed successfully!")
-         logger.info('=' * 60)
- 
-     except Exception as e:
-         logger.error(f"Test failed: {e}", exc_info=True)
- 
- 
- def test_api_connection():
-     """Test Gemini API connection."""
-     logger.info("\n" + "=" * 60)
-     logger.info("Testing Gemini API Connection")
-     logger.info("=" * 60)
- 
-     try:
-         import google.generativeai as genai
-         from PIL import Image
-         import io
- 
-         genai.configure(api_key=parameters.GOOGLE_API_KEY)
-         model = genai.GenerativeModel(parameters.CHART_VISION_MODEL)
- 
-         logger.info("Gemini client initialized")
- 
-         # Test with a simple text prompt
-         response = model.generate_content("Hello! Can you respond with 'API Working'?")
-         logger.info(f"API Response: {response.text}")
- 
-         logger.info("Gemini API connection successful!")
- 
-     except ImportError as e:
-         logger.error(f"Missing dependency: {e}")
-         logger.info("Install with: pip install google-generativeai Pillow")
-     except Exception as e:
-         logger.error(f"API test failed: {e}")
-         logger.info("Check your GOOGLE_API_KEY in .env file")
- 
- 
- if __name__ == "__main__":
-     print("\nSmartDoc AI - Chart Extraction Test Suite\n")
- 
-     # Test 1: API Connection
-     test_api_connection()
- 
-     # Test 2: Chart Extraction
-     test_chart_extraction()
- 
-     print("\nAll tests completed!\n")
vector_store/33eccd62-a7fc-4b0d-a118-02552f5cad42/data_level0.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:c8fe3c8d74ae8a7762e6f389543f0f2c53e6127832955b377ed768f8759db70d
- size 16165996
 
 
 
 
vector_store/33eccd62-a7fc-4b0d-a118-02552f5cad42/header.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:059abd7ab166731c13bd8dc4dc0724104918b450e9625ca4bc9f27ed0016170e
- size 100
 
 
 
 
vector_store/33eccd62-a7fc-4b0d-a118-02552f5cad42/index_metadata.pickle DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:bc43535869cc54fbd80a6a47dac2fd0b07f4eeb0c028b5c96026b6cdc271832b
- size 463184
 
 
 
 
vector_store/33eccd62-a7fc-4b0d-a118-02552f5cad42/length.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:fa6bfa281c8fe4e4977d5382b077dee4a3c4e5c750985cdf3d3660a6f92dab67
- size 20132
 
 
 
 
vector_store/33eccd62-a7fc-4b0d-a118-02552f5cad42/link_lists.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ffcd2c7be0de4c70919af69080b33cbd5c7487471058b2a70ee5bf95ab86ea00
- size 42436