| --- |
| title: README |
| emoji: 💻 |
| colorFrom: indigo |
| colorTo: indigo |
| sdk: static |
| pinned: false |
| --- |
| |
| # Hi there 👋 |
|
|
| StabRise - Document Processing Solutions |
|
|
| # Our projects |
|
|
| ## PDF DataSource for Apache Spark |
|
|
| <a href="https://stabrise.com/spark-pdf/"><img alt="Spark Pdf" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/16/d6/16d6a0d6-f162-42ad-a5a3-7dc20361ad24/sparkpdf.png__1000x300_subsampling-2.webp" height="120"></a> |
|
|
| --- |
|
|
| **Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf) |
|
|
| **Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/) |
|
|
| **Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb) |
|
|
| --- |
|
|
| The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame. |
|
|
| ### Key features: |
|
|
| - Reads PDF documents into a Spark DataFrame |
| - Reads PDF files lazily, page by page |
| - Supports large files, up to 10k pages |
| - Supports scanned PDF files via OCR |
| - No need to install Tesseract OCR; it is included in the package |
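
As a rough illustration of the features above, here is a minimal PySpark sketch of how a read might look. The format name `pdf` and the option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions recalled from the project's documentation and should be checked against the Quick Start notebook. The helper names `pdf_options` and `read_pdfs` are hypothetical and exist only for this example. The sketch also assumes the spark-pdf package is already on the classpath of your Spark session.

```python
from typing import Dict


def pdf_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build reader options for the PDF data source.

    The option names are assumptions; verify them against the spark-pdf docs.
    """
    return {
        "imageType": "BINARY",                         # assumed: how page images are encoded
        "resolution": str(resolution),                 # assumed: DPI used to render pages
        "pagePerPartition": str(pages_per_partition),  # assumed: pages per Spark partition
        "reader": "pdfBox",                            # assumed: backend used to parse PDFs
    }


def read_pdfs(spark, path: str, **kwargs):
    """Read PDF files from `path` using an existing SparkSession.

    `spark` must be a session started with the spark-pdf package available.
    """
    return (
        spark.read.format("pdf")            # assumed short name of the data source
        .options(**pdf_options(**kwargs))   # DataFrameReader.options accepts keyword options
        .load(path)
    )
```

With a running session this could be called as `df = read_pdfs(spark, "path/to/pdfs/*.pdf")`, which would be expected to yield one row per page since the source reads lazily per page. Raising `pages_per_partition` is one way to trade parallelism against task overhead on very large files.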
|
|
| ## ScaleDP |
|
|
| <a href="https://stabrise.com/scaledp/"><img alt="ScaleDP" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/4a/7d/4a7d97c2-50d7-4b7a-9902-af2df9b574da/scaledplogo.png__1000x300_subsampling-2.webp" height="120" /></a> |
|
|
| --- |
|
|
| **Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp) |
|
|
| **Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/) |
|
|
| **Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb) |
|
|
| --- |
|
|
| ScaleDP is an open-source library for processing documents using Apache Spark. |
|
|
| ### Key features: |
|
|
| - Load PDF documents/Images |
| - Extract text from PDF documents/Images |
| - Extract images from PDF documents |
| - OCR Images/PDF documents |
| - Run NER on text extracted from PDF documents/Images |
| - Visualize NER results |
|
|
|
|
| ## De-Identify |
|
|
| <a href="https://deidentify.online"><img alt="De-Identify" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/fb/fe/fbfe4b0c-dadb-4878-88ad-1c0ece0dc053/deidentifylogo.png__1000x300_subsampling-2.webp" height="120" /></a> |
|
|
| De-Identify is a tool for de-identifying and anonymizing data. |
|
|
| ### Supported formats |
| - Text |
| - Images |
| - PDF documents |
| - DICOM files |
|
|
|
|