DaVinciCode committed on
Commit
91cfe57
·
1 Parent(s): a14e80e

Add application file

Files changed (3)
  1. DEPLOYMENT.md +179 -0
  2. app.py +974 -0
  3. requirements.txt +46 -0
DEPLOYMENT.md ADDED
@@ -0,0 +1,179 @@
+ # Doctra Hugging Face Spaces Deployment Guide
+
+ ## 🚀 Quick Deployment
+
+ ### Option 1: Direct Upload to Hugging Face Spaces
+
+ 1. **Create a new Space**:
+    - Go to [Hugging Face Spaces](https://huggingface.co/spaces)
+    - Click "Create new Space"
+    - Choose "Gradio" as the SDK
+    - Set the title to "Doctra - Document Parser"
+
+ 2. **Upload files**:
+    - Upload all files from this `hf_space` folder to your Space
+    - Make sure `app.py` is in the root directory
+
+ 3. **Configure environment**:
+    - Go to Settings → Secrets
+    - Add `VLM_API_KEY` if you want to use VLM features
+    - Set the value to your API key (OpenAI, Anthropic, Google, etc.)
+
+ ### Option 2: Git Repository Deployment
+
+ 1. **Create a Git repository**:
+    ```bash
+    git init
+    git add .
+    git commit -m "Initial Doctra HF Space deployment"
+    git remote add origin <your-repo-url>
+    git push -u origin main
+    ```
+
+ 2. **Connect to Hugging Face Spaces**:
+    - Create a new Space
+    - Choose "Git repository" as the source
+    - Enter your repository URL
+    - Set the app file to `app.py`
+
+ ### Option 3: Docker Deployment
+
+ 1. **Build the Docker image**:
+    ```bash
+    docker build -t doctra-hf-space .
+    ```
+
+ 2. **Run the container**:
+    ```bash
+    docker run -p 7860:7860 doctra-hf-space
+    ```
+
+ ## 🔧 Configuration
+
+ ### Environment Variables
+
+ Set these in your Hugging Face Space settings:
+
+ - `VLM_API_KEY`: Your API key for VLM providers
+ - `GRADIO_SERVER_NAME`: Server hostname (default: 0.0.0.0)
+ - `GRADIO_SERVER_PORT`: Server port (default: 7860)
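For a local run, the same variables can be exported in the shell before launching the app; the values below are placeholders, not real credentials:

```shell
# Placeholder values; substitute a real key. Variable names match the list above.
export VLM_API_KEY="your-api-key-here"
export GRADIO_SERVER_NAME="0.0.0.0"
export GRADIO_SERVER_PORT="7860"
echo "Gradio will bind to ${GRADIO_SERVER_NAME}:${GRADIO_SERVER_PORT}"
```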
+
+ ### Hardware Requirements
+
+ - **CPU**: Minimum 2 cores recommended
+ - **RAM**: Minimum 4GB, 8GB+ recommended
+ - **Storage**: 10GB+ for models and dependencies
+ - **GPU**: Optional but recommended for faster processing
+
+ ## 📊 Performance Optimization
+
+ ### For Hugging Face Spaces
+
+ 1. **Use CPU-optimized models** when GPU is not available
+ 2. **Reduce DPI settings** for faster processing
+ 3. **Process smaller documents** to avoid memory issues
+ 4. **Enable caching** for repeated operations
+
+ ### For Local Deployment
+
+ 1. **Use GPU acceleration** when available
+ 2. **Increase memory limits** for large documents
+ 3. **Use SSD storage** for better I/O performance
+ 4. **Configure proper logging** for debugging
+
+ ## 🐛 Troubleshooting
+
+ ### Common Issues
+
+ 1. **Import Errors**:
+    - Check that all dependencies are in `requirements.txt`
+    - Verify Python version compatibility
+
+ 2. **Memory Issues**:
+    - Reduce DPI settings
+    - Process smaller documents
+    - Increase available memory
+
+ 3. **API Key Issues**:
+    - Verify the API key is correctly set
+    - Check provider-specific requirements
+    - Test API connectivity
+
+ 4. **File Upload Issues**:
+    - Check file size limits
+    - Verify file format support
+    - Ensure proper permissions
+
+ ### Debug Mode
+
+ To enable debug mode, set:
+ ```bash
+ export GRADIO_DEBUG=1
+ ```
+
+ ## 📈 Monitoring
+
+ ### Health Checks
+
+ - Monitor CPU and memory usage
+ - Check disk space availability
+ - Verify API key validity
+ - Test the document processing pipeline
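The checks above can be scripted; a minimal sketch, assuming the app listens on the default port 7860 and that `curl` is available:

```shell
# Probe the local endpoint (port 7860 is the assumed Gradio default).
if curl -fsS "http://localhost:7860/" > /dev/null 2>&1; then
    echo "app reachable"
else
    echo "app not reachable"
fi
# Report free disk space for the current filesystem.
df -h . | tail -n 1
```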
+
+ ### Logs
+
+ - Application logs: Check Gradio output
+ - Error logs: Monitor for exceptions
+ - Performance logs: Track processing times
+ - User logs: Monitor usage patterns
+
+ ## 🔄 Updates
+
+ ### Updating the Application
+
+ 1. **Code updates**: Push changes to your repository
+ 2. **Dependency updates**: Update `requirements.txt`
+ 3. **Model updates**: Download new model versions
+ 4. **Configuration updates**: Modify environment variables
+
+ ### Version Control
+
+ - Use semantic versioning
+ - Tag releases appropriately
+ - Maintain a changelog
+ - Test before deployment
+
+ ## 🛡️ Security
+
+ ### Best Practices
+
+ 1. **API Keys**: Store securely; never commit them to code
+ 2. **File Uploads**: Validate file types and sizes
+ 3. **Rate Limiting**: Implement to prevent abuse
+ 4. **Input Validation**: Sanitize all user inputs
+
+ ### Privacy
+
+ - No data is stored permanently
+ - Files are processed in temporary directories
+ - API calls are made securely
+ - User data is not logged
+
+ ## 📞 Support
+
+ For issues and questions:
+
+ 1. **GitHub Issues**: Report bugs and feature requests
+ 2. **Documentation**: Check the main README.md
+ 3. **Community**: Join discussions on Hugging Face
+ 4. **Email**: Contact the development team
+
+ ## 🎯 Next Steps
+
+ After successful deployment:
+
+ 1. **Test all features** with sample documents
+ 2. **Configure monitoring** and alerting
+ 3. **Set up backups** for important data
+ 4. **Plan for scaling** based on usage
+ 5. **Gather user feedback** for improvements
app.py ADDED
@@ -0,0 +1,974 @@
1
+ """
2
+ Doctra - Document Parser for Hugging Face Spaces
3
+
4
+ This is a Hugging Face Spaces deployment of the Doctra document parsing library.
5
+ It provides a comprehensive web interface for PDF parsing, table/chart extraction,
6
+ image restoration, and enhanced document processing.
7
+ """
8
+
9
+ import os
10
+ import shutil
11
+ import tempfile
12
+ import re
13
+ import html as _html
14
+ import base64
15
+ import json
16
+ from pathlib import Path
17
+ from typing import Optional, Tuple, List, Dict, Any
18
+
19
+ import gradio as gr
20
+ import pandas as pd
21
+
22
+ # Import Doctra components
23
+ from doctra.parsers.structured_pdf_parser import StructuredPDFParser
24
+ from doctra.parsers.table_chart_extractor import ChartTablePDFParser
25
+ from doctra.parsers.enhanced_pdf_parser import EnhancedPDFParser
26
+ from doctra.ui.docres_wrapper import DocResUIWrapper
27
+ from doctra.utils.pdf_io import render_pdf_to_images
28
+
29
+
30
+ # UI Theme and Styling Constants
31
+ THEME = gr.themes.Soft(primary_hue="indigo", neutral_hue="slate")
32
+
33
+ CUSTOM_CSS = """
34
+ /* Full-width layout */
35
+ .gradio-container {max-width: 100% !important; padding-left: 24px; padding-right: 24px}
36
+ .container {max-width: 100% !important}
37
+ .app {max-width: 100% !important}
38
+
39
+ /* Header and helpers */
40
+ .header {margin-bottom: 8px}
41
+ .subtitle {color: var(--body-text-color-subdued)}
42
+ .card {border:1px solid var(--border-color); border-radius:12px; padding:8px}
43
+ .status-ok {color: var(--color-success)}
44
+
45
+ /* Scrollable gallery styling */
46
+ .scrollable-gallery {
47
+ max-height: 600px !important;
48
+ overflow-y: auto !important;
49
+ border: 1px solid var(--border-color) !important;
50
+ border-radius: 8px !important;
51
+ padding: 8px !important;
52
+ }
53
+
54
+ /* Page content styling */
55
+ .page-content img {
56
+ max-width: 100% !important;
57
+ height: auto !important;
58
+ display: block !important;
59
+ margin: 10px auto !important;
60
+ border: 1px solid #ddd !important;
61
+ border-radius: 8px !important;
62
+ box-shadow: 0 2px 4px rgba(0,0,0,0.1) !important;
63
+ }
64
+
65
+ .page-content {
66
+ max-height: none !important;
67
+ overflow: visible !important;
68
+ }
69
+
70
+ /* Table styling */
71
+ .page-content table.doc-table {
72
+ width: 100% !important;
73
+ border-collapse: collapse !important;
74
+ margin: 12px 0 !important;
75
+ }
76
+ .page-content table.doc-table th,
77
+ .page-content table.doc-table td {
78
+ border: 1px solid #e5e7eb !important;
79
+ padding: 8px 10px !important;
80
+ text-align: left !important;
81
+ }
82
+ .page-content table.doc-table thead th {
83
+ background: #f9fafb !important;
84
+ font-weight: 600 !important;
85
+ }
86
+ .page-content table.doc-table tbody tr:nth-child(even) td {
87
+ background: #fafafa !important;
88
+ }
89
+
90
+ /* Clickable image buttons */
91
+ .image-button {
92
+ background: #0066cc !important;
93
+ color: white !important;
94
+ border: none !important;
95
+ padding: 5px 10px !important;
96
+ border-radius: 4px !important;
97
+ cursor: pointer !important;
98
+ margin: 2px !important;
99
+ font-size: 14px !important;
100
+ }
101
+
102
+ .image-button:hover {
103
+ background: #0052a3 !important;
104
+ }
105
+ """
106
+
107
+
108
+ def gather_outputs(
109
+ out_dir: Path,
110
+ allowed_kinds: Optional[List[str]] = None,
111
+ zip_filename: Optional[str] = None,
112
+ is_structured_parsing: bool = False
113
+ ) -> Tuple[List[tuple[str, str]], List[str], str]:
114
+ """
115
+ Gather output files and create a ZIP archive for download.
116
+ """
117
+ gallery_items: List[tuple[str, str]] = []
118
+ file_paths: List[str] = []
119
+
120
+ if out_dir.exists():
121
+ if is_structured_parsing:
122
+ # For structured parsing, include all files
123
+ for file_path in sorted(out_dir.rglob("*")):
124
+ if file_path.is_file():
125
+ file_paths.append(str(file_path))
126
+ else:
127
+ # For full parsing, include specific main files
128
+ main_files = [
129
+ "result.html",
130
+ "result.md",
131
+ "tables.html",
132
+ "tables.xlsx"
133
+ ]
134
+
135
+ for main_file in main_files:
136
+ file_path = out_dir / main_file
137
+ if file_path.exists():
138
+ file_paths.append(str(file_path))
139
+
140
+ # Include images based on allowed kinds
141
+ if allowed_kinds:
142
+ for kind in allowed_kinds:
143
+ p = out_dir / kind
144
+ if p.exists():
145
+ for img in sorted(p.glob("*.png")):
146
+ file_paths.append(str(img))
147
+
148
+ images_dir = out_dir / "images" / kind
149
+ if images_dir.exists():
150
+ for img in sorted(images_dir.glob("*.jpg")):
151
+ file_paths.append(str(img))
152
+ else:
153
+ # Include all images if no specific kinds specified
154
+ for p in (out_dir / "charts").glob("*.png"):
155
+ file_paths.append(str(p))
156
+ for p in (out_dir / "tables").glob("*.png"):
157
+ file_paths.append(str(p))
158
+ for p in (out_dir / "images").rglob("*.jpg"):
159
+ file_paths.append(str(p))
160
+
161
+ # Include Excel files based on allowed kinds
162
+ if allowed_kinds:
163
+ if "charts" in allowed_kinds and "tables" in allowed_kinds:
164
+ excel_files = ["parsed_tables_charts.xlsx"]
165
+ elif "charts" in allowed_kinds:
166
+ excel_files = ["parsed_charts.xlsx"]
167
+ elif "tables" in allowed_kinds:
168
+ excel_files = ["parsed_tables.xlsx"]
169
+ else:
170
+ excel_files = []
171
+
172
+ for excel_file in excel_files:
173
+ excel_path = out_dir / excel_file
174
+ if excel_path.exists():
175
+ file_paths.append(str(excel_path))
176
+
177
+ # Build gallery items for image display
178
+ kinds = allowed_kinds if allowed_kinds else ["tables", "charts", "figures"]
179
+ for sub in kinds:
180
+ p = out_dir / sub
181
+ if p.exists():
182
+ for img in sorted(p.glob("*.png")):
183
+ gallery_items.append((str(img), f"{sub}: {img.name}"))
184
+
185
+ images_dir = out_dir / "images" / sub
186
+ if images_dir.exists():
187
+ for img in sorted(images_dir.glob("*.jpg")):
188
+ gallery_items.append((str(img), f"{sub}: {img.name}"))
189
+
190
+ # Create ZIP archive
191
+ tmp_zip_dir = Path(tempfile.mkdtemp(prefix="doctra_zip_"))
192
+
193
+ if zip_filename:
194
+ safe_filename = re.sub(r'[<>:"/\\|?*]', '_', zip_filename)
195
+ zip_base = tmp_zip_dir / safe_filename
196
+ else:
197
+ zip_base = tmp_zip_dir / "doctra_outputs"
198
+
199
+ filtered_dir = tmp_zip_dir / "filtered_outputs"
200
+ shutil.copytree(out_dir, filtered_dir, ignore=shutil.ignore_patterns('~$*', '*.tmp', '*.temp'))
201
+
202
+ zip_path = shutil.make_archive(str(zip_base), 'zip', root_dir=str(filtered_dir))
203
+
204
+ return gallery_items, file_paths, zip_path
205
+
206
+
207
+ def validate_vlm_config(use_vlm: bool, vlm_api_key: str, vlm_provider: str = "gemini") -> Optional[str]:
208
+ """
209
+ Validate VLM configuration parameters.
210
+ """
211
+ if use_vlm and vlm_provider != "ollama" and not vlm_api_key:
212
+ return "❌ Error: VLM API key is required when using VLM (except for Ollama)"
213
+
214
+ if use_vlm and vlm_api_key and vlm_provider != "ollama":
215
+ # Basic API key validation
216
+ if len(vlm_api_key.strip()) < 10:
217
+ return "❌ Error: VLM API key appears to be too short or invalid"
218
+ if vlm_api_key.strip().startswith('sk-') and len(vlm_api_key.strip()) < 20:
219
+ return "❌ Error: OpenAI API key appears to be invalid (too short)"
220
+
221
+ return None
222
+
223
+
224
+ def create_page_html_content(page_content: List[str], base_dir: Optional[Path] = None) -> str:
225
+ """
226
+ Convert page content lines to HTML with inline images and proper formatting.
227
+ """
228
+ processed_content = []
229
+ paragraph_buffer = []
230
+
231
+ def flush_paragraph():
232
+ """Flush accumulated paragraph content to HTML"""
233
+ nonlocal paragraph_buffer
234
+ if paragraph_buffer:
235
+ joined = '<br/>'.join(_html.escape(l) for l in paragraph_buffer)
236
+ processed_content.append(f'<p>{joined}</p>')
237
+ paragraph_buffer = []
238
+
239
+ def is_markdown_table_header(s: str) -> bool:
240
+ return '|' in s and ('---' in s or 'β€”' in s)
241
+
242
+ def render_markdown_table(lines: List[str]) -> str:
243
+ rows = [l.strip().strip('|').split('|') for l in lines]
244
+ rows = [[_html.escape(c.strip()) for c in r] for r in rows]
245
+ if len(rows) < 2:
246
+ return ""
247
+
248
+ header = rows[0]
249
+ body = rows[2:] if len(rows) > 2 else []
250
+ thead = '<thead><tr>' + ''.join(f'<th>{c}</th>' for c in header) + '</tr></thead>'
251
+ tbody = '<tbody>' + ''.join('<tr>' + ''.join(f'<td>{c}</td>' for c in r) + '</tr>' for r in body) + '</tbody>'
252
+ return f'<table class="doc-table">{thead}{tbody}</table>'
253
+
254
+ i = 0
255
+ n = len(page_content)
256
+
257
+ while i < n:
258
+ raw_line = page_content[i]
259
+ line = raw_line.rstrip('\r\n')
260
+ stripped = line.strip()
261
+
262
+ # Handle image references
263
+ if stripped.startswith('![') and ('](images/' in stripped or '](images\\' in stripped):
264
+ flush_paragraph()
265
+ match = re.match(r'!\[([^\]]+)\]\(([^)]+)\)', stripped)
266
+ if match and base_dir is not None:
267
+ caption = match.group(1)
268
+ rel_path = match.group(2).replace('\\\\', '/').replace('\\', '/').lstrip('/')
269
+ abs_path = (base_dir / rel_path).resolve()
270
+ try:
271
+ with open(abs_path, 'rb') as f:
272
+ b64 = base64.b64encode(f.read()).decode('ascii')
273
+ processed_content.append(f'<figure><img src="data:image/jpeg;base64,{b64}" alt="{_html.escape(caption)}"/><figcaption>{_html.escape(caption)}</figcaption></figure>')
274
+ except Exception as e:
275
+ print(f"❌ Failed to embed image {rel_path}: {e}")
276
+ processed_content.append(f'<div>{_html.escape(caption)} (image not found)</div>')
277
+ else:
278
+ processed_content.append(f'<div>{_html.escape(stripped)}</div>')
279
+ i += 1
280
+ continue
281
+
282
+ # Handle markdown tables
283
+ if (stripped.startswith('|') or stripped.count('|') >= 2) and i + 1 < n and is_markdown_table_header(page_content[i + 1]):
284
+ flush_paragraph()
285
+ table_block = [stripped]
286
+ i += 1
287
+ table_block.append(page_content[i].strip())
288
+ i += 1
289
+ while i < n:
290
+ nxt = page_content[i].rstrip('\r\n')
291
+ if nxt.strip() == '' or (not nxt.strip().startswith('|') and nxt.count('|') < 2):
292
+ break
293
+ table_block.append(nxt.strip())
294
+ i += 1
295
+ html_table = render_markdown_table(table_block)
296
+ if html_table:
297
+ processed_content.append(html_table)
298
+ else:
299
+ for tl in table_block:
300
+ paragraph_buffer.append(tl)
301
+ continue
302
+
303
+ # Handle headers and content
304
+ if stripped.startswith('## '):
305
+ flush_paragraph()
306
+ processed_content.append(f'<h3>{_html.escape(stripped[3:])}</h3>')
307
+ elif stripped.startswith('# '):
308
+ flush_paragraph()
309
+ processed_content.append(f'<h2>{_html.escape(stripped[2:])}</h2>')
310
+ elif stripped == '':
311
+ flush_paragraph()
312
+ processed_content.append('<br/>')
313
+ else:
314
+ paragraph_buffer.append(raw_line)
315
+ i += 1
316
+
317
+ flush_paragraph()
318
+ return "\n".join(processed_content)
319
+
320
+
321
+ def run_full_parse(
322
+ pdf_file: str,
323
+ use_vlm: bool,
324
+ vlm_provider: str,
325
+ vlm_api_key: str,
326
+ layout_model_name: str,
327
+ dpi: int,
328
+ min_score: float,
329
+ ocr_lang: str,
330
+ ocr_psm: int,
331
+ ocr_oem: int,
332
+ ocr_extra_config: str,
333
+ box_separator: str,
334
+ ) -> Tuple[str, Optional[str], List[tuple[str, str]], List[str], str]:
335
+ """Run full PDF parsing with structured output."""
336
+ if not pdf_file:
337
+ return ("No file provided.", None, [], [], "")
338
+
339
+ # Validate VLM configuration
340
+ vlm_error = validate_vlm_config(use_vlm, vlm_api_key, vlm_provider)
341
+ if vlm_error:
342
+ return (vlm_error, None, [], [], "")
343
+
344
+ original_filename = Path(pdf_file).stem
345
+
346
+ # Create temporary directory for processing
347
+ tmp_dir = Path(tempfile.mkdtemp(prefix="doctra_"))
348
+ input_pdf = tmp_dir / f"{original_filename}.pdf"
349
+ shutil.copy2(pdf_file, input_pdf)
350
+
351
+ # Initialize parser with configuration
352
+ parser = StructuredPDFParser(
353
+ use_vlm=use_vlm,
354
+ vlm_provider=vlm_provider,
355
+ vlm_api_key=vlm_api_key or None,
356
+ layout_model_name=layout_model_name,
357
+ dpi=int(dpi),
358
+ min_score=float(min_score),
359
+ ocr_lang=ocr_lang,
360
+ ocr_psm=int(ocr_psm),
361
+ ocr_oem=int(ocr_oem),
362
+ ocr_extra_config=ocr_extra_config or "",
363
+ box_separator=box_separator or "\n",
364
+ )
365
+
366
+ try:
367
+ parser.parse(str(input_pdf))
368
+ except Exception as e:
369
+ import traceback
370
+ traceback.print_exc()
371
+ try:
372
+ error_msg = str(e).encode('utf-8', errors='replace').decode('utf-8')
373
+ return (f"❌ VLM processing failed: {error_msg}", None, [], [], "")
374
+ except Exception:
375
+ return (f"❌ VLM processing failed: <Unicode encoding error>", None, [], [], "")
376
+
377
+ # Find output directory
378
+ outputs_root = Path("outputs")
379
+ out_dir = outputs_root / original_filename / "full_parse"
380
+ if not out_dir.exists():
381
+ candidates = sorted(outputs_root.glob("*/"), key=lambda p: p.stat().st_mtime, reverse=True)
382
+ if candidates:
383
+ out_dir = candidates[0] / "full_parse"
384
+ else:
385
+ out_dir = outputs_root
386
+
387
+ # Read markdown file if it exists
388
+ md_file = next(out_dir.glob("*.md"), None)
389
+ md_preview = None
390
+ if md_file and md_file.exists():
391
+ try:
392
+ with md_file.open("r", encoding="utf-8", errors="ignore") as f:
393
+ md_preview = f.read()
394
+ except Exception:
395
+ md_preview = None
396
+
397
+ # Gather output files and create ZIP
398
+ gallery_items, file_paths, zip_path = gather_outputs(
399
+ out_dir,
400
+ zip_filename=original_filename,
401
+ is_structured_parsing=False
402
+ )
403
+
404
+ return (
405
+ f"βœ… Parsing completed successfully!\nπŸ“ Output directory: {out_dir}",
406
+ md_preview,
407
+ gallery_items,
408
+ file_paths,
409
+ zip_path
410
+ )
411
+
412
+
413
+ def run_extract(
414
+ pdf_file: str,
415
+ target: str,
416
+ use_vlm: bool,
417
+ vlm_provider: str,
418
+ vlm_api_key: str,
419
+ layout_model_name: str,
420
+ dpi: int,
421
+ min_score: float,
422
+ ) -> Tuple[str, str, List[tuple[str, str]], List[str], str]:
423
+ """Run table/chart extraction from PDF."""
424
+ if not pdf_file:
425
+ return ("No file provided.", "", [], [], "")
426
+
427
+ # Validate VLM configuration
428
+ vlm_error = validate_vlm_config(use_vlm, vlm_api_key, vlm_provider)
429
+ if vlm_error:
430
+ return (vlm_error, "", [], [], "")
431
+
432
+ original_filename = Path(pdf_file).stem
433
+
434
+ # Create temporary directory for processing
435
+ tmp_dir = Path(tempfile.mkdtemp(prefix="doctra_"))
436
+ input_pdf = tmp_dir / f"{original_filename}.pdf"
437
+ shutil.copy2(pdf_file, input_pdf)
438
+
439
+ # Initialize parser with configuration
440
+ parser = ChartTablePDFParser(
441
+ extract_charts=(target in ("charts", "both")),
442
+ extract_tables=(target in ("tables", "both")),
443
+ use_vlm=use_vlm,
444
+ vlm_provider=vlm_provider,
445
+ vlm_api_key=vlm_api_key or None,
446
+ layout_model_name=layout_model_name,
447
+ dpi=int(dpi),
448
+ min_score=float(min_score),
449
+ )
450
+
451
+ # Run extraction
452
+ output_base = Path("outputs")
453
+ parser.parse(str(input_pdf), str(output_base))
454
+
455
+ # Find output directory
456
+ outputs_root = output_base
457
+ out_dir = outputs_root / original_filename / "structured_parsing"
458
+ if not out_dir.exists():
459
+ if outputs_root.exists():
460
+ candidates = sorted(outputs_root.glob("*/"), key=lambda p: p.stat().st_mtime, reverse=True)
461
+ if candidates:
462
+ out_dir = candidates[0] / "structured_parsing"
463
+ else:
464
+ out_dir = outputs_root
465
+ else:
466
+ outputs_root.mkdir(parents=True, exist_ok=True)
467
+ out_dir = outputs_root
468
+
469
+ # Determine which kinds to include in outputs based on target selection
470
+ allowed_kinds: Optional[List[str]] = None
471
+ if target in ("tables", "charts"):
472
+ allowed_kinds = [target]
473
+ elif target == "both":
474
+ allowed_kinds = ["tables", "charts"]
475
+
476
+ # Gather output files and create ZIP
477
+ gallery_items, file_paths, zip_path = gather_outputs(
478
+ out_dir,
479
+ allowed_kinds,
480
+ zip_filename=original_filename,
481
+ is_structured_parsing=True
482
+ )
483
+
484
+ # Build tables HTML preview from Excel data (when VLM enabled)
485
+ tables_html = ""
486
+ try:
487
+ if use_vlm:
488
+ # Find Excel file based on target
489
+ excel_filename = None
490
+ if target in ("tables", "charts"):
491
+ if target == "tables":
492
+ excel_filename = "parsed_tables.xlsx"
493
+ else: # charts
494
+ excel_filename = "parsed_charts.xlsx"
495
+ elif target == "both":
496
+ excel_filename = "parsed_tables_charts.xlsx"
497
+
498
+ if excel_filename:
499
+ excel_path = out_dir / excel_filename
500
+ if excel_path.exists():
501
+ # Read Excel file and create HTML tables
502
+ xl_file = pd.ExcelFile(excel_path)
503
+ html_blocks = []
504
+
505
+ for sheet_name in xl_file.sheet_names:
506
+ df = pd.read_excel(excel_path, sheet_name=sheet_name)
507
+ if not df.empty:
508
+ # Create table with title
509
+ title = f"<h3>{_html.escape(sheet_name)}</h3>"
510
+
511
+ # Convert DataFrame to HTML table
512
+ table_html = df.to_html(
513
+ classes="doc-table",
514
+ table_id=None,
515
+ escape=True,
516
+ index=False,
517
+ na_rep=""
518
+ )
519
+
520
+ html_blocks.append(title + table_html)
521
+
522
+ tables_html = "\n".join(html_blocks)
523
+ except Exception as e:
524
+ try:
525
+ error_msg = str(e).encode('utf-8', errors='replace').decode('utf-8')
526
+ print(f"Error building tables HTML: {error_msg}")
527
+ except Exception:
528
+ print(f"Error building tables HTML: <Unicode encoding error>")
529
+ tables_html = ""
530
+
531
+ return (
532
+ f"βœ… Parsing completed successfully!\nπŸ“ Output directory: {out_dir}",
533
+ tables_html,
534
+ gallery_items,
535
+ file_paths,
536
+ zip_path
537
+ )
538
+
539
+
540
+ def run_docres_restoration(
541
+ pdf_file: str,
542
+ task: str,
543
+ device: str,
544
+ dpi: int,
545
+ save_enhanced: bool,
546
+ save_images: bool
547
+ ) -> Tuple[str, Optional[str], Optional[str], Optional[dict], List[str]]:
548
+ """Run DocRes image restoration on PDF."""
549
+ if not pdf_file:
550
+ return ("No file provided.", None, None, None, [])
551
+
552
+ try:
553
+ # Initialize DocRes engine
554
+ device_str = None if device == "auto" else device
555
+ docres = DocResUIWrapper(device=device_str)
556
+
557
+ # Extract filename
558
+ original_filename = Path(pdf_file).stem
559
+
560
+ # Create output directory
561
+ output_dir = Path("outputs") / f"{original_filename}_docres"
562
+ output_dir.mkdir(parents=True, exist_ok=True)
563
+
564
+ # Run DocRes restoration
565
+ enhanced_pdf_path = output_dir / f"{original_filename}_enhanced.pdf"
566
+ docres.restore_pdf(
567
+ pdf_path=pdf_file,
568
+ output_path=str(enhanced_pdf_path),
569
+ task=task,
570
+ dpi=dpi
571
+ )
572
+
573
+ # Prepare outputs
574
+ file_paths = []
575
+
576
+ if save_enhanced and enhanced_pdf_path.exists():
577
+ file_paths.append(str(enhanced_pdf_path))
578
+
579
+ if save_images:
580
+ # Look for enhanced images
581
+ images_dir = output_dir / "enhanced_images"
582
+ if images_dir.exists():
583
+ for img_path in sorted(images_dir.glob("*.jpg")):
584
+ file_paths.append(str(img_path))
585
+
586
+ # Create metadata
587
+ metadata = {
588
+ "task": task,
589
+ "device": str(docres.device),
590
+ "dpi": dpi,
591
+ "original_file": pdf_file,
592
+ "enhanced_file": str(enhanced_pdf_path) if enhanced_pdf_path.exists() else None,
593
+ "output_directory": str(output_dir)
594
+ }
595
+
596
+ status_msg = f"βœ… DocRes restoration completed successfully!\nπŸ“ Output directory: {output_dir}"
597
+
598
+ enhanced_pdf_file = str(enhanced_pdf_path) if enhanced_pdf_path.exists() else None
599
+ return (status_msg, pdf_file, enhanced_pdf_file, metadata, file_paths)
600
+
601
+ except Exception as e:
602
+ error_msg = f"❌ DocRes restoration failed: {str(e)}"
603
+ return (error_msg, None, None, None, [])
604
+
605
+
606
+ def run_enhanced_parse(
607
+ pdf_file: str,
608
+ use_image_restoration: bool,
609
+ restoration_task: str,
610
+ restoration_device: str,
611
+ restoration_dpi: int,
612
+ use_vlm: bool,
613
+ vlm_provider: str,
614
+ vlm_api_key: str,
615
+ layout_model_name: str,
616
+ dpi: int,
617
+ min_score: float,
618
+ ocr_lang: str,
619
+ ocr_psm: int,
620
+ ocr_oem: int,
621
+ ocr_extra_config: str,
622
+ box_separator: str,
623
+ ) -> Tuple[str, Optional[str], List[str], str, Optional[str], Optional[str], str]:
624
+ """Run enhanced PDF parsing with DocRes image restoration."""
625
+ if not pdf_file:
626
+ return ("No file provided.", None, [], "", None, None, "")
627
+
628
+ # Validate VLM configuration if VLM is enabled
629
+ if use_vlm:
630
+ vlm_error = validate_vlm_config(use_vlm, vlm_api_key, vlm_provider)
631
+ if vlm_error:
632
+ return (vlm_error, None, [], "", None, None, "")
633
+
634
+ original_filename = Path(pdf_file).stem
635
+
636
+ # Create temporary directory for processing
637
+ tmp_dir = Path(tempfile.mkdtemp(prefix="doctra_enhanced_"))
638
+ input_pdf = tmp_dir / f"{original_filename}.pdf"
639
+ shutil.copy2(pdf_file, input_pdf)
640
+
641
+ try:
642
+ # Initialize enhanced parser with configuration
643
+ parser = EnhancedPDFParser(
644
+ use_image_restoration=use_image_restoration,
645
+ restoration_task=restoration_task,
646
+ restoration_device=restoration_device if restoration_device != "auto" else None,
647
+ restoration_dpi=int(restoration_dpi),
648
+ use_vlm=use_vlm,
649
+ vlm_provider=vlm_provider,
650
+ vlm_api_key=vlm_api_key or None,
651
+ layout_model_name=layout_model_name,
652
+ dpi=int(dpi),
653
+ min_score=float(min_score),
654
+ ocr_lang=ocr_lang,
655
+ ocr_psm=int(ocr_psm),
656
+ ocr_oem=int(ocr_oem),
657
+ ocr_extra_config=ocr_extra_config or "",
658
+ box_separator=box_separator or "\n",
659
+ )
660
+
661
+ # Parse the PDF with enhancement
662
+ parser.parse(str(input_pdf))
663
+
664
+ except Exception as e:
665
+ import traceback
666
+ traceback.print_exc()
667
+ try:
668
+ error_msg = str(e).encode('utf-8', errors='replace').decode('utf-8')
669
+ return (f"❌ Enhanced parsing failed: {error_msg}", None, [], "", None, None, "")
670
+ except Exception:
671
+ return (f"❌ Enhanced parsing failed: <Unicode encoding error>", None, [], "", None, None, "")
672
+
673
+ # Find output directory
674
+ outputs_root = Path("outputs")
675
+ out_dir = outputs_root / original_filename / "enhanced_parse"
676
+ if not out_dir.exists():
677
+ candidates = sorted(outputs_root.glob("*/"), key=lambda p: p.stat().st_mtime, reverse=True)
678
+ if candidates:
679
+ out_dir = candidates[0] / "enhanced_parse"
680
+ else:
681
+ out_dir = outputs_root
682
+
683
+ # If still no enhanced_parse directory, try to find any directory with enhanced files
684
+ if not out_dir.exists():
685
+ for candidate_dir in outputs_root.rglob("*"):
686
+ if candidate_dir.is_dir():
687
+ enhanced_pdfs = list(candidate_dir.glob("*enhanced*.pdf"))
688
+ if enhanced_pdfs:
689
+ out_dir = candidate_dir
690
+ break
691
+
692
+ # Load first page content initially
693
+ md_preview = None
694
+ try:
695
+ pages_dir = out_dir / "pages"
696
+ first_page_path = pages_dir / "page_001.md"
697
+ if first_page_path.exists():
698
+ with first_page_path.open("r", encoding="utf-8", errors="ignore") as f:
699
+ md_content = f.read()
700
+
701
+ md_lines = md_content.split('\n')
702
+ md_preview = create_page_html_content(md_lines, out_dir)
703
+ else:
704
+ md_file = next(out_dir.glob("*.md"), None)
705
+ if md_file and md_file.exists():
706
+ with md_file.open("r", encoding="utf-8", errors="ignore") as f:
707
+ md_content = f.read()
708
+
709
+ md_lines = md_content.split('\n')
710
+ md_preview = create_page_html_content(md_lines, out_dir)
711
+ except Exception as e:
712
+ print(f"❌ Error loading initial content: {e}")
713
+ md_preview = None
714
+
+ # Gather output files and create ZIP
+ _, file_paths, zip_path = gather_outputs(
+ out_dir,
+ zip_filename=f"{original_filename}_enhanced",
+ is_structured_parsing=False
+ )
+
+ # Look for enhanced PDF file
+ enhanced_pdf_path = None
+ if use_image_restoration:
+ enhanced_pdf_candidates = list(out_dir.glob("*enhanced*.pdf"))
+ if enhanced_pdf_candidates:
+ enhanced_pdf_path = str(enhanced_pdf_candidates[0])
+ else:
+ parent_enhanced = list(out_dir.parent.glob("*enhanced*.pdf"))
+ if parent_enhanced:
+ enhanced_pdf_path = str(parent_enhanced[0])
+
+ return (
+ f"✅ Enhanced parsing completed successfully!\n📁 Output directory: {out_dir}",
+ md_preview,
+ file_paths,
+ zip_path,
+ pdf_file,  # Original PDF path
+ enhanced_pdf_path,  # Enhanced PDF path
+ str(out_dir)  # Output directory for page-specific content
+ )
+
+
+ def create_tips_markdown() -> str:
+ """Create the tips section markdown for the UI."""
+ return """
+ <div class="card">
+ <b>Tips</b>
+ <ul>
+ <li>On Spaces, set a secret <code>VLM_API_KEY</code> to enable VLM features.</li>
+ <li>Use <strong>Enhanced Parser</strong> for documents that need image restoration before parsing (scanned docs, low-quality PDFs).</li>
+ <li>Use <strong>DocRes Image Restoration</strong> for standalone image enhancement without parsing.</li>
+ <li>DocRes tasks: <code>appearance</code> (default), <code>dewarping</code>, <code>deshadowing</code>, <code>deblurring</code>, <code>binarization</code>, <code>end2end</code>.</li>
+ <li>Outputs are saved under <code>outputs/&lt;pdf_stem&gt;/</code>.</li>
+ </ul>
+ </div>
+ """
+
+
+ # Create the main Gradio interface
+ with gr.Blocks(title="Doctra - Document Parser", theme=THEME, css=CUSTOM_CSS) as demo:
+ # Header section
+ gr.Markdown(
+ """
+ <div class="header">
+ <h2 style="margin:0">Doctra — Document Parser</h2>
+ <div class="subtitle">Parse PDFs, extract tables/charts, preview markdown, and download outputs.</div>
+ </div>
+ """
+ )
+
+ # Full Parse Tab
+ with gr.Tab("Full Parse"):
+ with gr.Row():
+ pdf = gr.File(file_types=[".pdf"], label="PDF")
+ use_vlm = gr.Checkbox(label="Use VLM (optional)", value=False)
+ vlm_provider = gr.Dropdown(["gemini", "openai", "anthropic", "openrouter", "ollama"], value="gemini", label="VLM Provider")
+ vlm_api_key = gr.Textbox(type="password", label="VLM API Key", placeholder="Optional if VLM disabled")
+
+ with gr.Accordion("Advanced", open=False):
+ with gr.Row():
+ layout_model = gr.Textbox(value="PP-DocLayout_plus-L", label="Layout model")
+ dpi = gr.Slider(100, 400, value=200, step=10, label="DPI")
+ min_score = gr.Slider(0, 1, value=0.0, step=0.05, label="Min layout score")
+ with gr.Row():
+ ocr_lang = gr.Textbox(value="eng", label="OCR Language")
+ ocr_psm = gr.Slider(0, 13, value=4, step=1, label="Tesseract PSM")
+ ocr_oem = gr.Slider(0, 3, value=3, step=1, label="Tesseract OEM")
+ with gr.Row():
+ ocr_config = gr.Textbox(value="", label="Extra OCR config")
+ box_sep = gr.Textbox(value="\n", label="Box separator")
+
+ run_btn = gr.Button("▶ Run Full Parse", variant="primary")
+ status = gr.Textbox(label="Status", elem_classes=["status-ok"])
+
+ # Full Parse components
+ with gr.Row():
+ with gr.Column():
+ md_preview = gr.HTML(label="Extracted Content", visible=True, elem_classes=["page-content"])
+ with gr.Column():
+ page_image = gr.Image(label="Page image", interactive=False)
+ files_out = gr.Files(label="Download individual output files")
+ zip_out = gr.File(label="Download all outputs (ZIP)")
+
+ run_btn.click(
+ fn=run_full_parse,
+ inputs=[pdf, use_vlm, vlm_provider, vlm_api_key, layout_model, dpi, min_score, ocr_lang, ocr_psm, ocr_oem, ocr_config, box_sep],
+ outputs=[status, md_preview, files_out, zip_out],
+ )
+
+ # Tables & Charts Tab
+ with gr.Tab("Extract Tables/Charts"):
+ with gr.Row():
+ pdf_e = gr.File(file_types=[".pdf"], label="PDF")
+ target = gr.Dropdown(["tables", "charts", "both"], value="both", label="Target")
+ use_vlm_e = gr.Checkbox(label="Use VLM (optional)", value=False)
+ vlm_provider_e = gr.Dropdown(["gemini", "openai", "anthropic", "openrouter", "ollama"], value="gemini", label="VLM Provider")
+ vlm_api_key_e = gr.Textbox(type="password", label="VLM API Key", placeholder="Optional if VLM disabled")
+
+ with gr.Accordion("Advanced", open=False):
+ with gr.Row():
+ layout_model_e = gr.Textbox(value="PP-DocLayout_plus-L", label="Layout model")
+ dpi_e = gr.Slider(100, 400, value=200, step=10, label="DPI")
+ min_score_e = gr.Slider(0, 1, value=0.0, step=0.05, label="Min layout score")
+
+ run_btn_e = gr.Button("▶ Run Extraction", variant="primary")
+ status_e = gr.Textbox(label="Status")
+
+ with gr.Row():
+ with gr.Column():
+ tables_preview_e = gr.HTML(label="Extracted Data", elem_classes=["page-content"])
+ with gr.Column():
+ image_e = gr.Image(label="Selected Image", interactive=False)
+
+ files_out_e = gr.Files(label="Download individual output files")
+ zip_out_e = gr.File(label="Download all outputs (ZIP)")
+
+ run_btn_e.click(
+ fn=lambda pdf_file, tgt, vlm, provider, api_key, layout, dpi_val, score: run_extract(
+ pdf_file.name if pdf_file else "", tgt, vlm, provider, api_key, layout, dpi_val, score
+ ),
+ inputs=[pdf_e, target, use_vlm_e, vlm_provider_e, vlm_api_key_e, layout_model_e, dpi_e, min_score_e],
+ outputs=[status_e, tables_preview_e, files_out_e, zip_out_e],
+ )
+
+ # DocRes Image Restoration Tab
+ with gr.Tab("DocRes Image Restoration"):
+ with gr.Row():
+ pdf_docres = gr.File(file_types=[".pdf"], label="PDF")
+ docres_task_standalone = gr.Dropdown(
+ ["appearance", "dewarping", "deshadowing", "deblurring", "binarization", "end2end"],
+ value="appearance",
+ label="Restoration Task"
+ )
+ docres_device_standalone = gr.Dropdown(
+ ["auto", "cuda", "cpu"],
+ value="auto",
+ label="Device"
+ )
+
+ with gr.Row():
+ docres_dpi = gr.Slider(100, 400, value=200, step=10, label="DPI")
+ docres_save_enhanced = gr.Checkbox(label="Save Enhanced PDF", value=True)
+ docres_save_images = gr.Checkbox(label="Save Enhanced Images", value=True)
+
+ run_docres_btn = gr.Button("▶ Run DocRes Restoration", variant="primary")
+ docres_status = gr.Textbox(label="Status", elem_classes=["status-ok"])
+
+ with gr.Row():
+ with gr.Column():
+ gr.Markdown("### 📄 Original PDF")
+ docres_original_pdf = gr.File(label="Original PDF File", interactive=False, visible=False)
+ docres_original_page_image = gr.Image(label="Original PDF Page", interactive=False, height=800)
+ with gr.Column():
+ gr.Markdown("### ✨ Enhanced PDF")
+ docres_enhanced_pdf = gr.File(label="Enhanced PDF File", interactive=False, visible=False)
+ docres_enhanced_page_image = gr.Image(label="Enhanced PDF Page", interactive=False, height=800)
+
+ docres_files_out = gr.Files(label="Download enhanced files")
+
+ run_docres_btn.click(
+ fn=run_docres_restoration,
+ inputs=[pdf_docres, docres_task_standalone, docres_device_standalone, docres_dpi, docres_save_enhanced, docres_save_images],
+ outputs=[docres_status, docres_original_pdf, docres_enhanced_pdf, docres_files_out]
+ )
+
+ # Enhanced Parser Tab
+ with gr.Tab("Enhanced Parser"):
+ with gr.Row():
+ pdf_enhanced = gr.File(file_types=[".pdf"], label="PDF")
+ use_image_restoration = gr.Checkbox(label="Use Image Restoration", value=True)
+ restoration_task = gr.Dropdown(
+ ["appearance", "dewarping", "deshadowing", "deblurring", "binarization", "end2end"],
+ value="appearance",
+ label="Restoration Task"
+ )
+ restoration_device = gr.Dropdown(
+ ["auto", "cuda", "cpu"],
+ value="auto",
+ label="Restoration Device"
+ )
+
+ with gr.Row():
+ use_vlm_enhanced = gr.Checkbox(label="Use VLM (optional)", value=False)
+ vlm_provider_enhanced = gr.Dropdown(["gemini", "openai", "anthropic", "openrouter", "ollama"], value="gemini", label="VLM Provider")
+ vlm_api_key_enhanced = gr.Textbox(type="password", label="VLM API Key", placeholder="Optional if VLM disabled")
+
+ with gr.Accordion("Advanced Settings", open=False):
+ with gr.Row():
+ restoration_dpi = gr.Slider(100, 400, value=200, step=10, label="Restoration DPI")
+ layout_model_enhanced = gr.Textbox(value="PP-DocLayout_plus-L", label="Layout model")
+ dpi_enhanced = gr.Slider(100, 400, value=200, step=10, label="Processing DPI")
+ min_score_enhanced = gr.Slider(0, 1, value=0.0, step=0.05, label="Min layout score")
+
+ with gr.Row():
+ ocr_lang_enhanced = gr.Textbox(value="eng", label="OCR Language")
+ ocr_psm_enhanced = gr.Slider(0, 13, value=4, step=1, label="Tesseract PSM")
+ ocr_oem_enhanced = gr.Slider(0, 3, value=3, step=1, label="Tesseract OEM")
+
+ with gr.Row():
+ ocr_config_enhanced = gr.Textbox(value="", label="Extra OCR config")
+ box_sep_enhanced = gr.Textbox(value="\n", label="Box separator")
+
+ run_enhanced_btn = gr.Button("▶ Run Enhanced Parse", variant="primary")
+ enhanced_status = gr.Textbox(label="Status", elem_classes=["status-ok"])
+
+ with gr.Row():
+ with gr.Column():
+ gr.Markdown("### 📄 Original PDF")
+ enhanced_original_pdf = gr.File(label="Original PDF File", interactive=False, visible=False)
+ enhanced_original_page_image = gr.Image(label="Original PDF Page", interactive=False, height=600)
+ with gr.Column():
+ gr.Markdown("### ✨ Enhanced PDF")
+ enhanced_enhanced_pdf = gr.File(label="Enhanced PDF File", interactive=False, visible=False)
+ enhanced_enhanced_page_image = gr.Image(label="Enhanced PDF Page", interactive=False, height=600)
+
+ with gr.Row():
+ enhanced_md_preview = gr.HTML(label="Extracted Content", visible=True, elem_classes=["page-content"])
+
+ enhanced_files_out = gr.Files(label="Download individual output files")
+ enhanced_zip_out = gr.File(label="Download all outputs (ZIP)")
+ enhanced_out_dir = gr.State()  # receives the output directory (7th value returned by run_enhanced_parse)
+
+ run_enhanced_btn.click(
+ fn=run_enhanced_parse,
+ inputs=[
+ pdf_enhanced, use_image_restoration, restoration_task, restoration_device, restoration_dpi,
+ use_vlm_enhanced, vlm_provider_enhanced, vlm_api_key_enhanced, layout_model_enhanced,
+ dpi_enhanced, min_score_enhanced, ocr_lang_enhanced, ocr_psm_enhanced, ocr_oem_enhanced,
+ ocr_config_enhanced, box_sep_enhanced
+ ],
+ outputs=[
+ enhanced_status, enhanced_md_preview, enhanced_files_out, enhanced_zip_out,
+ enhanced_original_pdf, enhanced_enhanced_pdf, enhanced_out_dir
+ ]
+ )
+
+ # Tips section
+ gr.Markdown(create_tips_markdown())
+
+
+ if __name__ == "__main__":
+ # Launch the interface
+ demo.launch(
+ server_name="0.0.0.0",
+ server_port=int(os.getenv("PORT", "7860")),
+ share=False
+ )
requirements.txt ADDED
@@ -0,0 +1,46 @@
+ # Core dependencies
+ gradio>=4.0.0,<5
+ pandas>=2.0.0
+ numpy>=1.21.0
+ pillow>=9.0.0
+ opencv-python>=4.5.0
+ scikit-image>=0.19.3
+ torch>=1.12.0
+ torchvision>=0.13.0
+
+ # PDF processing
+ pdf2image>=1.16.0
+ pypdfium2>=4.0.0
+ PyMuPDF>=1.23.0
+
+ # OCR and layout detection
+ paddleocr>=2.6.0
+ paddlepaddle>=2.4.0
+ # paddlepaddle-gpu>=2.4.0  # install instead of paddlepaddle on GPU hardware
+ paddlex>=3.0.0
+
+ # VLM providers
+ openai>=1.0.0
+ anthropic>=0.3.0
+ google-generativeai>=0.3.0
+ httpx>=0.24.0
+
+ # Utilities
+ tqdm>=4.64.0
+ requests>=2.28.0
+ beautifulsoup4>=4.11.0
+ lxml>=4.9.0
+ openpyxl>=3.0.0
+
+ # Hugging Face Spaces specific
+ huggingface-hub>=0.16.0
+ transformers>=4.21.0
+
+ # Additional dependencies for DocRes
+ accelerate>=0.20.0
+ safetensors>=0.3.0