Update README.md
README.md
CHANGED
|
@@ -172,6 +172,28 @@ we process them after sorting all segments with content. To determine their read
|
|
| 172 |
using distance as a criterion.
|
| 173 |
|
| 174 |
| 175 |
## Benchmark
|
| 176 |
|
| 177 |
These are the benchmark results for VGT model on PubLayNet dataset:
|
|
|
|
| 172 |
using distance as a criterion.
|
| 173 |
|
| 174 |
|
| 175 |
+
### Extracting Tables and Formulas
|
| 176 |
+
|
| 177 |
+
The service can extract tables and formulas in several output formats.
|
| 178 |
+
|
| 179 |
+
By default, the "text" property of each formula segment contains the formula in LaTeX format.
|
| 180 |
+
|
| 181 |
+
Tables can also be extracted in formats such as "markdown", "latex", or "html", but this is not enabled by default.
|
| 182 |
+
To extract tables this way, set the "extraction_format" parameter. Some example usages are shown below:
|
| 183 |
+
|
| 184 |
+
```
|
| 185 |
+
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060 -F "extraction_format=latex"
|
| 186 |
+
```
|
| 187 |
+
```
|
| 188 |
+
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5060/fast -F "extraction_format=markdown"
|
| 189 |
+
```
|
| 190 |
+
|
| 191 |
+
Be aware that this additional extraction step can make processing take much longer, especially for documents with a large number of tables.
|
| 192 |
+
|
| 193 |
+
(For table extraction, we are using [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
|
| 194 |
+
and for formula extraction, we are using [RapidLaTeXOCR](https://github.com/RapidAI/RapidLaTeXOCR))
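
Once a response has been received, the extracted formulas and tables can be picked out of it by segment type. The snippet below is a minimal sketch, not part of the service: it assumes the response is a JSON list of segment objects carrying "type" and "text" keys, and that the type labels are "Formula" and "Table". The helper name `texts_by_type` and the sample data are hypothetical.

```python
import json


def texts_by_type(segments, segment_type):
    """Return the "text" of every segment whose "type" equals segment_type.

    Assumes each segment is a dict with "type" and "text" keys (an assumption
    about the response shape, not confirmed by this section).
    """
    return [s.get("text", "") for s in segments if s.get("type") == segment_type]


# Hypothetical response body, as it might look with extraction_format=markdown.
sample_response = json.loads(r'''
[
  {"type": "Text", "text": "The energy is given by"},
  {"type": "Formula", "text": "E = mc^{2}"},
  {"type": "Table", "text": "| a | b |\n|---|---|\n| 1 | 2 |"}
]
''')

formulas = texts_by_type(sample_response, "Formula")  # LaTeX strings
tables = texts_by_type(sample_response, "Table")      # tables in the requested format

for latex in formulas:
    print("formula:", latex)
for table in tables:
    print("table:")
    print(table)
```

Filtering by type keeps the post-processing independent of the chosen "extraction_format": only the content of the table strings changes between "markdown", "latex", and "html".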
|
| 195 |
+
|
| 196 |
+
|
| 197 |
## Benchmark
|
| 198 |
|
| 199 |
These are the benchmark results for VGT model on PubLayNet dataset:
|