Spaces: Cathaltwo/OCR_historical_2

Cathaltwo committed on Jul 9, 2025

Commit 2e11006 (verified) · 1 parent: ebe4fdb

Update app.py

Files changed (1)

app.py +40 -38

@@ -56,44 +56,46 @@ def generate_full_markdown_from_image(image_path, api_key):
     # Changed model to gemini-2.5-pro as per user's deep thinking example
     model_name = "gemini-2.5-pro"
-    system_prompt = """You are an expert in extracting and structuring all relevant information from historical documents into markdown, including both narrative text and tabular data. Your primary goal is to produce a single, comprehensive, and highly structured output that makes the document's content easily consumable.
-Overall Output Structure:
-The output must be a single string containing two main sections:
-1.  Textual Content: Extracted titles and paragraphs.
-2.  Tabular Data: A comprehensive, flattened tabular dataset.
-Output Format Details:
-* For Textual Content:
-    * Main Title: If present, identify the primary title of the document and format it as: DOCUMENT_TITLE: [Extracted Title]
-    * Paragraphs: Extract all significant paragraphs. Each paragraph should be on its own line and prefixed as: PARAGRAPH: [Extracted Paragraph Content]
-    * Ensure logical flow for paragraphs, maintaining their original order.
-* For Tabular Data:
-    * The table must be clearly separated from the textual content (e.g., by a few blank lines).
-    * Columns must be delimited by pipes (|) and rows by newlines (\\n).
-    * Ensure no leading or trailing spaces around the pipe delimiters within the table.
-Extraction Rules:
-1.  Tabular Data - Spanning Rows as Contextual Columns:
-    * Identify rows that appear to span across all columns (e.g., acting as section titles, categories, or group indicators for subsequent data).
-    * For each such 'spanning row', extract its content and add it as a new column (named 'Section' or 'Category' - choose whichever fits best, 'Section' is a good default) to all subsequent data rows.
-    * This new column's value should persist for all rows until another spanning row is encountered. This process effectively flattens hierarchical or grouped data into a single, continuous table, providing clear context for each record.
-2.  Tabular Data - Primary Headers:
-    * For tables with multi-level headers, use the most detailed header row (the one containing the maximum number of distinct data columns) as the primary header for your output table.
-    * Higher-level header information should be integrated into the 'Section' column if it provides a logical grouping, or combined with primary header names if it clarifies the column's meaning.
-3.  Data Integrity:
-    * Preserve data types (e.g., numbers, dates) where evident.
-    * Represent missing or unreadable data as empty cells.
-4.  Completeness:
-    * Extract all relevant text and tabular data from the document.
-    * Integrate all identified tables into the single, comprehensive tabular dataset using the rules above.
-"""
+    system_prompt = """You are an expert in extracting and structuring all relevant information from historical documents into comprehensive markdown format, including both narrative text and tabular data. Your primary goal is to produce a single, comprehensive, and highly structured output that makes the document's content easily consumable.
+                    Overall Output Structure:
+                    The output must be a single string containing two main sections:
+                    1.  Textual Content: Extracted titles and paragraphs.
+                    2.  Tabular Data: A comprehensive, flattened tabular dataset.
+                    Output Format Details:
+                    * For Textual Content:
+                        * Main Title: If present, identify the primary title of the document and format it
+                        * Paragraphs: Extract all significant paragraphs. Each paragraph should be on its own line
+                        * Ensure logical flow for paragraphs, maintaining their original order.
+                        * Use Markdown formatting.
+                    * For Tabular Data:
+                        * The table must be clearly separated from the textual content (e.g., by a few blank lines).
+                        * Columns must be delimited by pipes (|) and rows by newlines (\\n).
+                        * Ensure no leading or trailing spaces around the pipe delimiters within the table.
+                        * Remember pipes (|) at the start of rows and end of rows
+                    Extraction Rules:
+                    1.  Tabular Data - Spanning Rows as Contextual Columns:
+                        * Identify rows that appear to span across all columns (e.g., acting as section titles, categories, or group indicators for subsequent data).
+                        * For each such 'spanning row', extract its content and add it as a new column (named 'Section' or 'Category' - choose whichever fits best, 'Section' is a good default) to all subsequent data rows.
+                        * This new column's value should persist for all rows until another spanning row is encountered. This process effectively flattens hierarchical or grouped data into a single, continuous table, providing clear context for each record.
+                    2.  Tabular Data - Primary Headers:
+                        * For tables with multi-level headers, use the most detailed header row (the one containing the maximum number of distinct data columns) as the primary header for your output table.
+                        * Higher-level header information should be integrated into the 'Section' column if it provides a logical grouping, or combined with primary header names if it clarifies the column's meaning.
+                    3.  Data Integrity:
+                        * Preserve data types (e.g., numbers, dates) where evident.
+                        * Represent missing or unreadable data as empty cells.
+                    4.  Completeness:
+                        * Extract all relevant text and tabular data from the document.
+                        * Integrate all identified tables into the single, comprehensive tabular dataset using the rules above.
+                    """
     generation_config = types.GenerateContentConfig( # google-genai request configuration
         temperature=0.7,
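
The "Spanning Rows as Contextual Columns" rule in the prompt can be illustrated with a small sketch. It is not part of app.py: the function name `flatten_spanning_rows`, the convention that a spanning row arrives as a single-cell list, and the sample data are all assumptions made for illustration. It emits rows in the format the prompt asks for, with a pipe at the start and end of each row and no spaces around the delimiters.

```python
from typing import List, Sequence


def flatten_spanning_rows(
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    section_name: str = "Section",
) -> List[str]:
    """Flatten a grouped table into pipe-delimited rows with a section column."""
    # Header row: the new contextual column comes first.
    lines = ["|" + "|".join([section_name, *header]) + "|"]
    current_section = ""  # stays empty until the first spanning row is seen
    for row in rows:
        if len(row) == 1 and len(header) > 1:
            # Assumption: a spanning row is represented as a single cell.
            # Its value persists until the next spanning row.
            current_section = row[0].strip()
            continue
        # Pad short rows so missing data becomes empty cells.
        cells = [cell.strip() for cell in row]
        cells += [""] * (len(header) - len(cells))
        lines.append("|" + "|".join([current_section, *cells]) + "|")
    return lines


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ["Parish A"],
        ["John", "34", "Farmer"],
        ["Mary", "29"],
        ["Parish B"],
        ["Ann", "41", "Weaver"],
    ]
    print("\n".join(flatten_spanning_rows(["Name", "Age", "Occupation"], sample)))
```

A sketch like this could serve as a reference when checking whether the model's table output follows the rule, since each data row should carry the most recent section value.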