Update README.md
README.md
CHANGED
|
@@ -35,7 +35,7 @@ BTW embedder is only a part of a good RAG<br>
|
|
| 35 |
<b>⇨</b> give me a ❤️, if you like ;)<br>
|
| 36 |
<br>
|
| 37 |
<b>My short impression:</b>
|
| 38 |
-
<ul style="line-height: 1;">
|
| 39 |
<li>nomic-embed-text (up to 2048t context length)</li>
|
| 40 |
<li>mxbai-embed-large</li>
|
| 41 |
<li>mug-b-1.6</li>
|
|
@@ -80,7 +80,7 @@ You have a txt/pdf file maybe 90000words(~300pages) a book. You ask the model le
|
|
| 80 |
Now it searches for keywords or semantically similar terms in the document. If it finds them, let's say a word and meaning around “XYZ and ZYX”,
|
| 81 |
now a piece of text of 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers, but that doesn't matter for the principle.)<br>
|
| 82 |
This text snippet is then used for your answer. <br>
|
| 83 |
-
<ul style="line-height: 1;">
|
| 84 |
<li>If, for example, the word “XYZ” occurs 100 times in one file, not all 100 are found.</li>
|
| 85 |
<li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
|
| 86 |
<li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
|
|
@@ -133,6 +133,29 @@ QwQ-LCoT- (7/14b) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversatio
|
|
| 133 |
|
| 134 |
|
| 135 |
BTW, <b>Jinja</b> templates are very new ... the usual templates with the usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I'm not a coder)<br>
|
| 136 |
|
| 137 |
<br>
|
| 138 |
|
|
@@ -157,7 +180,7 @@ docfetcher - https://docfetcher.sourceforge.io/en/index.html (yes old but very u
|
|
| 157 |
|
| 158 |
...
|
| 159 |
|
| 160 |
-
<ul style="line-height: 1;">
|
| 161 |
<li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
|
| 162 |
<li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
|
| 163 |
<li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>
|
|
|
|
| 35 |
<b>⇨</b> give me a ❤️, if you like ;)<br>
|
| 36 |
<br>
|
| 37 |
<b>My short impression:</b>
|
| 38 |
+
<ul style="line-height: 1.05;">
|
| 39 |
<li>nomic-embed-text (up to 2048t context length)</li>
|
| 40 |
<li>mxbai-embed-large</li>
|
| 41 |
<li>mug-b-1.6</li>
|
|
|
|
| 80 |
Now it searches for keywords or semantically similar terms in the document. If it finds them, let's say a word and meaning around “XYZ and ZYX”,
|
| 81 |
now a piece of text of 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers, but that doesn't matter for the principle.)<br>
|
| 82 |
This text snippet is then used for your answer. <br>
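
To make the principle concrete, here is a minimal sketch of the "find the term, cut out a window around it" step. It is only an illustration under loud assumptions: it matches literal keywords and counts whitespace-separated words, whereas a real embedder compares vectors ("coded numbers") and counts tokens. The names `find_snippets`, `window` and `max_snippets` are made up for this example.

```python
import re

def find_snippets(text, keywords, window=1024, max_snippets=4):
    """Return up to max_snippets word windows centred on keyword hits.

    Assumption: words stand in for tokens, and literal keyword matching
    stands in for the embedder's semantic similarity search.
    """
    words = text.split()
    wanted = {k.lower() for k in keywords}
    snippets = []
    last_end = 0  # end index of the previous window, to avoid overlaps
    for i, word in enumerate(words):
        token = re.sub(r"\W+", "", word).lower()  # strip punctuation
        if token not in wanted or i < last_end:
            continue
        start = max(0, i - window // 2)
        end = min(len(words), i + window // 2)
        snippets.append(" ".join(words[start:end]))
        last_end = end
        if len(snippets) == max_snippets:
            break  # later hits are simply never looked at
    return snippets
```

With `max_snippets=4`, a file that contains the search word 100 times still yields only 4 snippets, which is why not every occurrence ends up in the answer.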
|
| 83 |
+
<ul style="line-height: 1.05;">
|
| 84 |
<li>If, for example, the word “XYZ” occurs 100 times in one file, not all 100 are found.</li>
|
| 85 |
<li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
|
| 86 |
<li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
|
|
|
|
| 133 |
|
| 134 |
|
| 135 |
BTW, <b>Jinja</b> templates are very new ... the usual templates with the usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I'm not a coder)<br>
|
| 136 |
+
<br>
|
| 137 |
+
|
| 138 |
+
...
|
| 139 |
+
<br>
|
| 140 |
+
# DOC/PDF 2 TXT<br>
|
| 141 |
+
Prepare your documents yourself!<br>
|
| 142 |
+
Bad Input = bad Output!<br>
|
| 143 |
+
In most cases, it is not immediately obvious how the document is made available to the embedder.
|
| 144 |
+
In nearly all cases, images, tables, page numbers, chapters, and section/paragraph formatting are not handled well.
|
| 145 |
+
An easy start is to use a Python-based PDF parser (there are many).<br>
|
| 146 |
+
Options for fast text/table conversion only:
|
| 147 |
+
<ul style="line-height: 1.05;">
|
| 148 |
+
<li>pdfplumber</li>
|
| 149 |
+
<li>fitz/PyMuPDF</li>
|
| 150 |
+
<li>Camelot</li>
|
| 151 |
+
</ul>
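
As a rough sketch of the fast text/table route, the following uses pdfplumber's `pdfplumber.open`, `page.extract_text()` and `page.extract_tables()`. The helper names (`drop_page_number_lines`, `rows_to_tsv`, `pdf_to_text`), the `[page N]` marker and the page-number rule are assumptions made up for this example, not something the parsers prescribe; tune them to your own documents.

```python
import re

def drop_page_number_lines(page_text):
    """Remove lines that hold only a page number such as '12' or '- 12 -'.

    Assumption: a bare 1-4 digit line is a page number. A line that is
    only a year (e.g. '2024') would be dropped too.
    """
    kept = [
        line for line in page_text.splitlines()
        if not re.fullmatch(r"\s*-?\s*\d{1,4}\s*-?\s*", line)
    ]
    return "\n".join(kept)

def rows_to_tsv(rows):
    """Flatten one extracted table (list of rows of cells) to tab-separated text."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell).strip() for cell in row)
        for row in rows
    )

def pdf_to_text(path):
    """Extract page text and tables from a PDF (needs `pip install pdfplumber`)."""
    import pdfplumber  # imported lazily so the helpers above work without it

    parts = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""  # guard against pages without text
            parts.append(f"[page {number}]\n{drop_page_number_lines(text)}")
            for table in page.extract_tables():
                parts.append(rows_to_tsv(table))
    return "\n\n".join(parts)
```

Calling `pdf_to_text("book.pdf")` (a hypothetical file name) returns one string that you can save as a txt file and hand to the embedder. Table cells usually also show up in the plain page text, so you may want to keep only one of the two.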
|
| 152 |
+
All in all, you can tune your code a lot, and you can add OCR manually.
|
| 153 |
+
<br><br>
|
| 154 |
+
Option for an all-in-one solution for the future:
|
| 155 |
+
<li>docling (open source on GitHub)</li>
|
| 156 |
+
There are some ready-to-use examples, which are already pretty good.<br>
|
| 157 |
+
https://github.com/docling-project/docling/tree/main/docs/examples<br>
|
| 158 |
+
It also downloads some models for OCR automatically. The only thing I haven't found yet (maybe it doesn't exist) is a way to read out the font, which works very well with <b>fitz</b>, for example.
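
For the font readout with fitz, a minimal sketch could look like this. It relies on PyMuPDF's `page.get_text("dict")`, whose spans carry `font`, `size` and `text`. Treating "clearly larger than the most common size" as a heading, the `ratio` value and the `# ` prefix are assumptions of this example, not a rule from PyMuPDF.

```python
from collections import Counter

def iter_spans(path):
    """Yield PyMuPDF span dicts for a PDF (needs `pip install pymupdf`)."""
    import fitz  # imported lazily so mark_headings works without PyMuPDF

    with fitz.open(path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    yield from line["spans"]

def mark_headings(spans, ratio=1.2):
    """Prefix spans that are clearly larger than the body font with '# '.

    Assumption: the most common rounded font size is the body text, and
    anything at least `ratio` times larger is a heading.
    """
    spans = [s for s in spans if s["text"].strip()]
    if not spans:
        return []
    body_size = Counter(round(s["size"]) for s in spans).most_common(1)[0][0]
    lines = []
    for s in spans:
        text = s["text"].strip()
        lines.append(f"# {text}" if s["size"] >= body_size * ratio else text)
    return lines
```

Usage would be something like `mark_headings(iter_spans("book.pdf"))` with a hypothetical file name. The span's `font` entry can be used the same way, for example to spot bold font names.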
|
| 159 |
|
| 160 |
<br>
|
| 161 |
|
|
|
|
| 180 |
|
| 181 |
...
|
| 182 |
|
| 183 |
+
<ul style="line-height: 1.05;">
|
| 184 |
<li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
|
| 185 |
<li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
|
| 186 |
<li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>
|