Commit · 0df5e58
1 Parent(s): 15fdcff
Add Hugging Face Space configuration
README.md
CHANGED
|
@@ -1,81 +1,57 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
### Python API
|
| 24 |
-
|
| 25 |
-
```python
|
| 26 |
-
from dockling_parser import DocumentParser
|
| 27 |
-
|
| 28 |
-
# Initialize parser
|
| 29 |
-
parser = DocumentParser()
|
| 30 |
-
|
| 31 |
-
# Parse a document
|
| 32 |
-
result = parser.parse("path/to/document.pdf")
|
| 33 |
-
|
| 34 |
-
# Access parsed content
|
| 35 |
-
print(result.content) # Get main text content
|
| 36 |
-
print(result.metadata) # Get document metadata
|
| 37 |
-
print(result.structured_content) # Get structured content (sections, paragraphs, etc.)
|
| 38 |
-
|
| 39 |
-
# Check format support
|
| 40 |
-
is_supported = parser.supports_format("application/pdf")
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
### Web Interface
|
| 44 |
-
|
| 45 |
-
The package includes a Gradio-based web interface for easy document parsing:
|
| 46 |
-
|
| 47 |
-
```bash
|
| 48 |
-
python app.py
|
| 49 |
-
```
|
| 50 |
-
|
| 51 |
-
This will launch a web interface with the following features:
|
| 52 |
-
- Drag-and-drop document upload
|
| 53 |
-
- Support for multiple document formats
|
| 54 |
-
- Automatic format detection
|
| 55 |
-
- Structured output display:
|
| 56 |
-
- Document content
|
| 57 |
-
- Metadata table
|
| 58 |
- Section breakdown
|
| 59 |
- Named entity recognition
|
| 60 |
- Confidence scoring
|
| 61 |
|
| 62 |
-
|
| 63 |
|
| 64 |
-
|
| 65 |
-
- DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
|
| 66 |
-
- Plain Text (text/plain)
|
| 67 |
-
- HTML (text/html)
|
| 68 |
-
- Markdown (text/markdown)
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
|
| 75 |
-
-
|
| 76 |
-
-
|
| 77 |
-
-
|
| 78 |
|
| 79 |
-
## License
|
| 80 |
|
| 81 |
MIT License
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Smart Document Parser
|
| 3 |
+
emoji: π
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: indigo
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: 4.0.0
|
| 8 |
+
app_file: app.py
|
| 9 |
+
pinned: false
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
# Smart Document Parser
|
| 13 |
+
|
| 14 |
+
A powerful document parsing application that automatically extracts structured information from various document formats.
|
| 15 |
+
|
| 16 |
+
## Features
|
| 17 |
+
|
| 18 |
+
- **Multiple Format Support**: PDF, DOCX, TXT, HTML, and Markdown
|
| 19 |
+
- **Rich Information Extraction**:
|
| 20 |
+
- Document content with preserved formatting
|
| 21 |
+
- Comprehensive metadata
|
| 22 |
- Section breakdown
|
| 23 |
- Named entity recognition
|
| 24 |
+
- **Smart Processing**:
|
| 25 |
+
- Automatic format detection
|
| 26 |
- Confidence scoring
|
| 27 |
+
- Error handling
|
| 28 |
+
|
| 29 |
+
## How to Use
|
| 30 |
|
| 31 |
+
1. **Upload Document**: Click the upload button or drag & drop your document
|
| 32 |
+
2. **Process**: Click "Process Document"
|
| 33 |
+
3. **View Results**: Explore the extracted information in different tabs:
|
| 34 |
+
- Content: Main document text
|
| 35 |
+
- Metadata: Document properties
|
| 36 |
+
- Sections: Document structure
|
| 37 |
+
- Entities: Named entities
|
| 38 |
|
| 39 |
+
## Supported Formats
|
| 40 |
|
| 41 |
+
- PDF Documents (*.pdf)
|
| 42 |
+
- Word Documents (*.docx)
|
| 43 |
+
- Text Files (*.txt)
|
| 44 |
+
- HTML Files (*.html)
|
| 45 |
+
- Markdown Files (*.md)
|
| 46 |
|
| 47 |
+
## Technical Details
|
| 48 |
|
| 49 |
+
Built with:
|
| 50 |
+
- Docling: Advanced document processing
|
| 51 |
+
- Gradio: Interactive web interface
|
| 52 |
+
- Pydantic: Type-safe data handling
|
| 53 |
+
- Hugging Face Spaces: Cloud deployment
|
| 54 |
|
| 55 |
+
## License
|
| 56 |
|
| 57 |
MIT License
|
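
As a sanity check on the Space configuration this commit adds, the front matter can be read back with a few lines of Python. This is a minimal sketch, not how Hugging Face itself parses the block: `parse_front_matter` and the `required` key set are hypothetical names chosen for illustration, the parser handles only flat `key: value` pairs, and the `emoji` line is left out because its value did not survive extraction.

```python
from __future__ import annotations


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse a flat `key: value` block delimited by `---` lines (hypothetical helper)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("front matter must start with '---'")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            # Closing delimiter reached: the block is complete.
            return fields
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        fields[key.strip()] = value.strip()
    raise ValueError("front matter is missing its closing '---'")


# Keys and values copied from the added README lines (emoji omitted, see above).
README_HEAD = """\
---
title: Smart Document Parser
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
---
"""

config = parse_front_matter(README_HEAD)

# Assumed minimum for this check; not an official list of required keys.
required = {"title", "sdk", "app_file"}
missing = required - config.keys()

print(config["sdk"], config["sdk_version"], sorted(missing))
```

All values come back as strings, so `pinned` is the text `false` rather than a boolean; a real YAML parser would be the better choice if typed values or nested keys are needed.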