Alfonso Velasco committed
Commit 62e6da6 · Parent: 50abf16

fix results

Files changed (2):
  1. TABLE_EXTRACTION_GUIDE.md +179 -0
  2. app.py +7 -3
TABLE_EXTRACTION_GUIDE.md ADDED
# Table Extraction Guide for Engineering Drawings

## The Problem

When extracting tables from engineering drawings using DeepSeek-OCR, you may notice that the HTML table output contains many empty `<td></td>` cells and complex `rowspan`/`colspan` attributes. This makes the data difficult to use programmatically.

### Why This Happens

Engineering drawings have:

- **Complex merged cells** with irregular boundaries
- **Non-standard table structures** (not typical rows/columns)
- **Small text** that's hard to OCR accurately
- **Visual elements** mixed with text
- **Rotated or angled text**

DeepSeek-OCR tries to preserve the exact visual layout in HTML, resulting in structure without useful content.
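When the HTML does contain some readable text, it can still be flattened to plain rows without fighting the `rowspan`/`colspan` structure. A minimal sketch using only the standard library — the sample HTML here is illustrative, not real model output:

```python
from html.parser import HTMLParser

class TableFlattener(HTMLParser):
    """Collect non-empty cell text from an HTML table, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], [], []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            text = "".join(self._cell).strip()
            if text:  # skip the many empty <td></td> cells
                self._row.append(text)
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

# Illustrative sparse table, similar in shape to the model's output
html = "<table><tr><td>PART</td><td></td><td>QTY</td></tr><tr><td></td><td>BOLT M8</td><td>4</td></tr></table>"
parser = TableFlattener()
parser.feed(html)
print(parser.rows)  # [['PART', 'QTY'], ['BOLT M8', '4']]
```

Note that dropping empty cells also discards the merged-cell geometry, so this is only useful when you want the text, not the layout.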

## Solutions

### Option 1: Use Image Patches (Recommended)

The API already extracts table regions as cropped images. This is the most reliable approach for complex drawings:

```python
import requests
import base64
from PIL import Image
import io

# base64_image: base64-encoded bytes of the drawing (see the complete workflow below)
response = requests.post('http://localhost:7860/extract', json={
    'image': base64_image,
    'layout_only': False  # or True for just bounding boxes
})

data = response.json()

# Get table patches (cropped images of each table)
table_patches = data['table_patches']

for i, patch in enumerate(table_patches):
    # Each patch contains:
    # - bbox: {x1, y1, x2, y2, width, height}
    # - data: base64-encoded image of the table
    # - text_preview: HTML (often not useful for complex tables)

    # Decode and save the table image
    table_img_data = base64.b64decode(patch['data'])
    table_img = Image.open(io.BytesIO(table_img_data))
    table_img.save(f'table_{i}.png')

    print(f"Table {i}: {patch['bbox']}")
```

**Benefits:**

- Preserves all visual information
- Can be manually reviewed
- Can be processed with specialized table extraction tools
- No loss of information

### Option 2: Use Text-Only Mode (New)

I've added a new `extract_mode` parameter that simplifies extraction for cases where you just want text without HTML structure:

```python
response = requests.post('http://localhost:7860/extract', json={
    'image': base64_image,
    'extract_mode': 'text_only'  # Simplifies table extraction
})

data = response.json()

# The extractions will contain plain text instead of complex HTML
for extraction in data['extractions']:
    if extraction['type'] == 'table':
        print(f"Table text: {extraction['text']}")
        # Text will be simpler, without HTML tags
```

### Option 3: Use Layout-Only Mode

If you only need to know **where** tables are (not their content), use layout-only mode:

```python
response = requests.post('http://localhost:7860/extract', json={
    'image': base64_image,
    'layout_only': True  # Just get bounding boxes
})

data = response.json()

# Get structured layout information
layout = data['layout_summary']
print(f"Found {layout['counts']['tables']} tables")

for table in layout['elements_by_type']['tables']:
    print(f"Table at: {table['bbox']}")
```

## Extraction Modes

The API now supports three extraction modes:

| Mode | Parameter | Use Case |
|------|-----------|----------|
| **Full** (default) | `extract_mode: "full"` | Complete extraction with HTML tables |
| **Text Only** | `extract_mode: "text_only"` | Simplified text extraction without HTML |
| **Layout Only** | `extract_mode: "layout_only"` or `layout_only: true` | Just bounding boxes, no content |

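The mode-to-parameter mapping above can be wrapped in a small helper. `build_extract_payload` is a hypothetical convenience function, not part of the API — the server only looks at the `extract_mode` and `layout_only` fields:

```python
def build_extract_payload(image_b64: str, mode: str = "full") -> dict:
    """Build the JSON body for /extract for a given extraction mode.

    Hypothetical helper for illustration; it just assembles the request
    fields described in the modes table.
    """
    if mode not in ("full", "text_only", "layout_only"):
        raise ValueError(f"unknown mode: {mode}")
    payload = {"image": image_b64, "extract_mode": mode}
    if mode == "layout_only":
        payload["layout_only"] = True  # equivalent boolean form of the same mode
    return payload

print(build_extract_payload("...", "layout_only"))
```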
## Recommended Workflow for Engineering Drawings

1. **First pass:** Use `layout_only: true` to identify all tables and their locations
2. **Extract images:** Use the bounding boxes to crop table regions from the original image
3. **Process selectively:**
   - For simple tables: Use `extract_mode: "text_only"`
   - For complex tables: Keep as images or use specialized table extraction tools
   - For critical data: Manual review of cropped table images
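Step 2 can be sketched with Pillow, assuming a bbox dict in the pixel-coordinate `{x1, y1, x2, y2}` format returned by the API; the synthetic `drawing.png` stands in for a real drawing:

```python
from PIL import Image

def crop_table(image_path: str, bbox: dict) -> Image.Image:
    """Crop one table region out of the original drawing."""
    with Image.open(image_path) as img:
        return img.crop((bbox["x1"], bbox["y1"], bbox["x2"], bbox["y2"]))

# Example with a synthetic white image instead of a real drawing
Image.new("RGB", (200, 100), "white").save("drawing.png")
table = crop_table("drawing.png", {"x1": 10, "y1": 20, "x2": 110, "y2": 70})
print(table.size)  # (100, 50)
```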

## Example: Complete Workflow

```python
import requests
import base64
import io
import os
from PIL import Image

# Step 1: Load and encode image
with open('engineering_drawing.png', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode()

# Step 2: Get layout (identify tables)
layout_response = requests.post('http://localhost:7860/extract', json={
    'image': image_data,
    'layout_only': True
})

layout_data = layout_response.json()
print(f"Found {layout_data['num_tables']} tables")

# Step 3: Get full extraction with table images
full_response = requests.post('http://localhost:7860/extract', json={
    'image': image_data,
    'extract_mode': 'full'  # or 'text_only' for simpler output
})

full_data = full_response.json()

# Step 4: Save table images for review or further processing
os.makedirs('output', exist_ok=True)
for i, patch in enumerate(full_data['table_patches']):
    # Save table image
    table_img_data = base64.b64decode(patch['data'])
    table_img = Image.open(io.BytesIO(table_img_data))
    table_img.save(f'output/table_{i}.png')

    # Print location
    bbox = patch['bbox']
    print(f"Table {i}: ({bbox['x1']}, {bbox['y1']}) to ({bbox['x2']}, {bbox['y2']})")
```

## Alternative Tools for Table Extraction

If you need better table content extraction, consider using the cropped table images with:

1. **Table Transformer** (Microsoft) - Deep learning model for table structure
2. **PaddleOCR** - Includes table recognition
3. **Camelot** or **Tabula** - For PDF-based tables
4. **Azure Form Recognizer** or **AWS Textract** - Cloud services with advanced table recognition
5. **Manual labeling** - For critical engineering data

## Summary

For engineering drawings:

- ✅ **Use image patches** (most reliable)
- ✅ **Use layout-only mode** to find tables
- ✅ **Use text-only mode** for simpler extraction
- ❌ **Don't rely on HTML table structure** from complex drawings

The HTML table output is structurally accurate but often not useful for data extraction due to the complexity of engineering drawings.
app.py CHANGED

```diff
@@ -142,7 +142,7 @@ async def extract_image(request: ImageRequest):
     # Use simpler prompt for layout-only mode
     prompt = request.prompt
     if request.layout_only:
-        prompt = "<image>\n<Identify all objects, table, diagrams, and text and output them in bounding boxes. "
+        prompt = "<image>\n<Identify all objects, table, diagrams, and text and output them in bounding boxes.o "
         print("Using layout-only mode with structured bounding boxes")

     # Capture stdout to get the raw model output with grounding tags
@@ -187,13 +187,15 @@ async def extract_image(request: ImageRequest):
     else:
         print("Using saved result.mmd file")

-    print(f"Result preview: {result_text[:500] if result_text else 'No results found'}")
+    print(f"Result preview: {result_text if result_text else 'No results found'}")
     print(f"Result image with boxes: {'Found' if result_image_with_boxes else 'Not found'}")
     print(f"Image patches: {len(image_patches)} patches found")

     # Parse the result
     extractions = parse_deepseek_result(result_text, img_width, img_height)

+    print(f"Extractions: {extractions}")
+
     # If layout_only mode, simplify the extractions
     if request.layout_only:
         layout_extractions = simplify_extractions_for_layout(extractions)
@@ -526,7 +528,9 @@ def parse_deepseek_result(result: Any, img_width: int, img_height: int) -> List[
            }
        else:
            bbox = {"x1": 0, "y1": 0, "x2": 0, "y2": 0, "width": 0, "height": 0}
-    except (ValueError, IndexError):
+    except Exception as e:
+        print(f"Error parsing bounding box: {e} for bounding box: {bbox_str} for type {ref_type}")
+
        bbox = {"x1": 0, "y1": 0, "x2": 0, "y2": 0, "width": 0, "height": 0}

    # Extract content after this tag until the next tag (or end of string)
```