kalle07/pdf2txt_parser_converter


kalle07 committed on Jan 7

Commit 3e341a5 · verified · 1 Parent(s): f09af1d

Update README.md

Files changed (1)

README.md +3 -3


@@ -24,6 +24,7 @@ better input = better output<br>
 * docling_by_sevenof9_v1.py - Python, you need nvidia RTX to run it fast<br>
 all other older versions<br><br>
 <b>&#x21e8;</b> give me a ❤️, if you like  ;)<br><br>
+Check the PDF before converting it to text: go to any page, ideally one at the beginning and one at the end, select the text with the mouse and copy it into an editor (can you see what you copied?)... if that doesn't work, this parser won't work and neither will any other program! To do this, you must remove the copy protection, or the page is just an image and you must use OCR first.<br>
 on github
 https://github.com/kalle07/parsing
@@ -57,13 +58,12 @@ I work with "<b>pdfplumber/pdfminer</b>" none OCR, so its very fast!<br>
 <li>tested on 300 PDF files ~30000 pages</li>
 </ul>
 <br>
-This I have created with my brain and the help of chatGPT, Iam not a coder... sorry so I will not fulfill any wishes unless there are real errors.<br>
+This I have created with my brain and the help of Ai, Iam not a coder... sorry so I will not fulfill any wishes unless there are real errors.<br>
 It is really hard for me with GUI and the Function and in addition to compile it.<br>
 For the python-file you need to import missing libraries.<br>
 Of course there is a lot of need for optimization(save/error-handling) or the use of other parser libraries, but it's a start.
 <br><br>
-i am working on a 50% faster version. in addition, the GUI should allow more influence on the processing, e.g. faster raw text, trim margins (yes/no) and set % yourself, set unimportant text block size, layout with line breaks or force more continuous text. preview on first 10 pages with generated images what is detected with border arround text and tables<br>
-Give me a hand if you can ;)<br>
+Give me a hand if you can ;) for more implementation or more stable.<br>
 ...
 <br>
 I also have a "<b>docling</b>" parser with OCR (GPU is need for fast processing), its only be a python-file, not compiled.<br>
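
The manual check added to the README in this commit (copy text from a page near the beginning and one near the end, then see whether anything usable comes out) could also be scripted. The following is only a rough sketch and not part of the repository: the function names `looks_extractable` and `check_pdf` and the `min_chars` threshold of 20 are assumptions made for illustration. It relies on pdfplumber, which the README names as the parsing library, and on its `Page.extract_text()` method.

```python
from typing import Iterable, Optional


def looks_extractable(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> bool:
    """Return True if every sampled page yielded at least `min_chars` characters.

    `min_chars` is an arbitrary, assumed threshold; tune it for your documents.
    """
    texts = list(page_texts)
    if not texts:
        return False  # nothing sampled, so nothing was confirmed
    return all(len((text or "").strip()) >= min_chars for text in texts)


def check_pdf(path: str, min_chars: int = 20) -> bool:
    """Sample the first and last page of a PDF and test for a usable text layer."""
    import pdfplumber  # third-party; imported lazily so the helper above works without it

    with pdfplumber.open(path) as pdf:
        pages = pdf.pages
        sample = [pages[0], pages[-1]] if pages else []
        # extract_text() can return an empty result for image-only pages
        return looks_extractable((page.extract_text() for page in sample), min_chars)
```

A `False` result suggests the sampled pages are image-only or copy-protected, so OCR or removing the protection would be needed first, as the README line says. A nearly blank final page can also trigger it, so the result is a hint rather than a verdict.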