You have a txt/pdf file, maybe 90,000 words (~300 pages), a book. You ask the model, let's say, "What is described in the chapter called XYZ in relation to person ZYX?"
Now it searches for keywords or semantically similar terms in the document. If it has found them, let's say the words and meanings around “XYZ” and “ZYX”,
a piece of text of about 1024 tokens around these words is cut out at that point. (In reality it is all done with encoded numbers per chunk, which is why you cannot search for single numbers or words, but that doesn't matter for the principle.)<br>
This text snippet is then used for your answer.<br>
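The principle above can be sketched in plain Python. This is a toy illustration, not what an embedder actually does (real systems compare embedding vectors, not words); the chunk size, overlap, and word-overlap scoring are assumptions for the demo:

```python
# Toy sketch of the snippet-retrieval principle: split the document into
# overlapping chunks, score each chunk against the query, keep the best few.
# Real RAG scores chunks by embedding similarity, not raw word matches.

def chunk_text(words, chunk_size=1024, overlap=128):
    """Split a word list into overlapping chunks (words stand in for tokens)."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, max(len(words) - overlap, 1), step)]

def rank_chunks(chunks, query_terms, top_k=5):
    """Score each chunk by how many query terms it contains; return top_k chunk ids."""
    scored = []
    for idx, chunk in enumerate(chunks):
        chunk_set = set(w.lower() for w in chunk)
        score = sum(1 for t in query_terms if t.lower() in chunk_set)
        scored.append((score, idx))
    scored.sort(reverse=True)  # best-matching chunks first
    return [idx for score, idx in scored[:top_k] if score > 0]

# Fake "book": filler text with the sought terms buried in the middle.
words = ("lorem ipsum " * 600 + "XYZ met ZYX here " + "dolor sit " * 600).split()
chunks = chunk_text(words)
best = rank_chunks(chunks, ["XYZ", "ZYX"])
# Only the few best-ranked snippets (not every occurrence) go into the prompt.
```

Note how only the chunks that actually contain the query terms are returned; all the filler chunks are ranked out, which is why not every occurrence of a word ends up in the answer.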
<ul style="line-height: 1.05;">
<li>If, for example, the word “XYZ” occurs 50 times in one file, not all 50 occurrences are used for the answer; only a number of snippets selected by a fast ranking are used.</li>
...

# DOC/PDF to TXT<br>
Prepare your documents yourself!<br>
Bad input = bad output!<br>
In most cases it is not immediately obvious how the document is made available to the embedder. In ALLM it is "c:\Users\XXX\AppData\Roaming\anythingllm-desktop\storage\documents"; you can open the files there with a text editor to check the quality.
In nearly all cases, images, tables, page numbers, chapters, formulas, and section/paragraph formatting are not implemented well.
You can start by simply saving the PDF as a TXT file; you will then see in the TXT file how the embedding model would see the content.
An easy start is to use a Python-based PDF parser (there are many), some also OCR-based for images.<br>
Option one, only for simple text/table conversion:
<ul style="line-height: 1.05;">
<li>pdfplumber</li>
<li>fitz/PyMuPDF</li>
<li>Camelot</li>
</ul>
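A minimal sketch of the first option using pdfplumber: extract page text and write a TXT file you can inspect. The file name `book.pdf` is a placeholder, and the `clean_page()` helper is a hypothetical example of the kind of per-document clean-up you usually have to add yourself:

```python
# Minimal sketch: extract page text with pdfplumber and save it as TXT so
# you can inspect what the embedder would see. "book.pdf" is a placeholder;
# clean_page() is a hypothetical clean-up step, not part of pdfplumber.
import re

def clean_page(raw: str) -> str:
    """Collapse runs of whitespace and drop lines that are only a page number."""
    lines = []
    for line in (raw or "").splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line and not line.isdigit():  # bare page numbers add no meaning
            lines.append(line)
    return "\n".join(lines)

def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    import pdfplumber  # pip install pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        pages = [clean_page(page.extract_text()) for page in pdf.pages]
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(pages))

# pdf_to_txt("book.pdf", "book.txt")  # then open book.txt in a text editor
```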
All in all you can tune your code a lot, but the difficulties lie in the details.<br>
My option, one exe for Windows and also Python, plus a second option with OCR:<br>
<a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a>
<br><br>
An OCR option from IBM (open source):
<ul style="line-height: 1.05;">
<li>docling (open source on GitHub)</li>
</ul>
...
<br>
# Indexing-only option<br>
One hint for fast search over tens of thousands of PDF/TXT/DOC files (it only indexes, it does not embed): you can use it as a simple way to find your top 5-10 articles or books, which you can then make available to an LLM.<br>
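The indexing-vs-embedding difference can be sketched with a tiny inverted index (toy code, not how JabRef works internally): it only records which document contains which exact word, with no notion of meaning, which is exactly why it is so fast over many files:

```python
# Toy inverted index: word -> set of document ids. Indexing like this is
# fast and scales to many files, but unlike embeddings it has no semantics,
# only exact word matches.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word of the query."""
    sets = [index.get(w.lower(), set()) for w in query.split()]
    return set.intersection(*sets) if sets else set()

docs = {
    "paper1": "neural retrieval with embeddings",
    "book2":  "classic keyword retrieval systems",
}
index = build_index(docs)
hits = search(index, "retrieval keyword")  # documents matching both words
```

A lookup is just a set intersection, so finding candidate documents among thousands is nearly instant; the semantic heavy lifting is then left to the LLM on the few documents you hand over.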
JabRef - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
https://builds.jabref.org/main/ <br>
or<br>