kalle07 committed on
Commit
216fd54
·
verified ·
1 Parent(s): 4ec7dbe

Update README.md

Files changed (1)
  1. README.md +26 -3
README.md CHANGED
@@ -35,7 +35,7 @@ BTW embedder is only a part of a good RAG<br>
35
  <b>&#x21e8;</b> give me a ❤️, if you like ;)<br>
36
  <br>
37
  <b>My short impression:</b>
38
- <ul style="line-height: 1;">
39
  <li>nomic-embed-text (up to 2048t context length)</li>
40
  <li>mxbai-embed-large</li>
41
  <li>mug-b-1.6</li>
@@ -80,7 +80,7 @@ You have a txt/pdf file maybe 90000words(~300pages) a book. You ask the model le
80
  Now it searches for keywords or semantically similar terms in the document. If it finds them, let's say a word and meaning around “XYZ and ZYX”,
81
  now a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it is all done with coded numbers, but that doesn't matter for the principle.)<br>
82
  This text snippet is then used for your answer. <br>
83
- <ul style="line-height: 1;">
84
  <li>If, for example, the word “XYZ” occurs 100 times in one file, not all 100 are found.</li>
85
  <li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
86
  <li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
@@ -133,6 +133,29 @@ QwQ-LCoT- (7/14b) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversatio
133
 
134
 
135
  btw. <b>Jinja</b> templates are very new ... the usual templates with the usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I am not a coder)<br>
136
 
137
  <br>
138
 
@@ -157,7 +180,7 @@ docfetcher - https://docfetcher.sourceforge.io/en/index.html (yes old but very u
157
 
158
  ...
159
 
160
- <ul style="line-height: 1;">
161
  <li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
162
  <li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
163
  <li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>
 
35
  <b>&#x21e8;</b> give me a ❤️, if you like ;)<br>
36
  <br>
37
  <b>My short impression:</b>
38
+ <ul style="line-height: 1.05;">
39
  <li>nomic-embed-text (up to 2048t context length)</li>
40
  <li>mxbai-embed-large</li>
41
  <li>mug-b-1.6</li>
 
80
  Now it searches for keywords or semantically similar terms in the document. If it finds them, let's say a word and meaning around “XYZ and ZYX”,
81
  now a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it is all done with coded numbers, but that doesn't matter for the principle.)<br>
82
  This text snippet is then used for your answer. <br>
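
The principle described above can be sketched in a few lines of plain Python. This is only a toy illustration, not how any particular RAG tool does it: `toy_embed` is a hypothetical stand-in (simple word counts) for a real embedder, and the snippets are cut by word count instead of the ~1024 tokens mentioned above.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size=8):
    """Cut the text into snippets of chunk_size words (real setups count tokens)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(text):
    """Hypothetical stand-in for an embedder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_snippets(question, text, k=4, chunk_size=8):
    """Return at most k snippets, best match first, skipping snippets with no overlap."""
    q = toy_embed(question)
    scored = [(cosine(q, toy_embed(s)), s) for s in chunk_words(text, chunk_size)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]
```

With a real embedder, unrelated snippets still get a non-zero score, so a large `k` pulls in off-topic text; that is why the number of snippets matters.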
83
+ <ul style="line-height: 1.05;">
84
  <li>If, for example, the word “XYZ” occurs 100 times in one file, not all 100 are found.</li>
85
  <li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
86
  <li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
 
133
 
134
 
135
  btw. <b>Jinja</b> templates are very new ... the usual templates with the usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I am not a coder)<br>
136
+ <br>
137
+
138
+ ...
139
+ <br>
140
+ # DOC/PDF 2 TXT<br>
141
+ Prepare your documents by yourself!<br>
142
+ Bad Input = bad Output!<br>
143
+ In most cases, it is not immediately obvious how the document is made available to the embedder.
144
+ In nearly all cases, images and tables, page numbers, chapters and section/paragraph formatting are not handled well.
145
+ An easy start is to use a Python-based PDF parser (there are a lot of them).<br>
146
+ Option for fast txt/table conversion only:
147
+ <ul style="line-height: 1.05;">
148
+ <li>pdfplumber</li>
149
+ <li>fitz/PyMuPDF</li>
150
+ <li>Camelot</li>
151
+ </ul>
152
+ All in all, you can tune your code a lot, and you can add OCR manually.
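
A minimal sketch of such a converter with pdfplumber, assuming the package is installed (`pip install pdfplumber`). The function names and the clean-up rules are only examples of the kind of tuning meant here, not a recommended recipe: dropping number-only lines and re-joining hyphenated words are simple heuristics and can be wrong for some documents.

```python
import re


def clean_page_text(text):
    """Drop lines that contain only a number (likely page numbers) and
    re-join words that were hyphenated across a line break."""
    lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
    joined = "\n".join(lines)
    return re.sub(r"(\w)-\n(\w)", r"\1\2", joined)


def pdf_to_txt(pdf_path):
    """Extract text and tables page by page and return one plain-text string."""
    import pdfplumber  # imported here so clean_page_text works without the package

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""  # image-only pages yield no text
            parts.append(f"[page {number}]\n{clean_page_text(text)}")
            for table in page.extract_tables():
                # a table is a list of rows, a row is a list of cell strings (or None)
                rows = [" | ".join(cell or "" for cell in row) for row in table]
                parts.append("\n".join(rows))
    return "\n\n".join(parts)
```

Usage would be something like `open("book.txt", "w", encoding="utf-8").write(pdf_to_txt("book.pdf"))`, where both file names are placeholders. Check the resulting txt by eye before you give it to the embedder.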
153
+ <br><br>
154
+ Option as an all-in-one solution for the future:
155
+ <li>docling - (open source on GitHub)</li>
156
+ There are some ready-to-use examples, which are already pretty good.<br>
157
+ https://github.com/docling-project/docling/tree/main/docs/examples<br>
158
+ For OCR, it also downloads some models automatically. The only thing I haven't found yet (maybe it doesn't exist) is a way to read out the font, which works very well with <b>fitz</b>, for example.
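
Reading out the font with fitz (PyMuPDF) could look roughly like this. It is a sketch that assumes PyMuPDF is installed (`pip install pymupdf`); `read_spans` and `tag_headings` are hypothetical helper names, and the rule "bigger than the most common size, or a bold font name, means heading" is only a guess that needs tuning per document.

```python
from collections import Counter


def tag_headings(spans):
    """Mark spans as headings when they are larger than the most common
    font size or use a font whose name contains 'Bold' (heuristic)."""
    sizes = Counter(round(s["size"], 1) for s in spans if s["text"].strip())
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]  # most frequent size = body text
    out = []
    for s in spans:
        text = s["text"].strip()
        if not text:
            continue
        is_heading = round(s["size"], 1) > body_size or "Bold" in s["font"]
        out.append("# " + text if is_heading else text)
    return out


def read_spans(pdf_path):
    """Collect all text spans (with 'text', 'size', 'font') from a PDF."""
    import fitz  # PyMuPDF; imported here so tag_headings works without it

    spans = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    spans.extend(line["spans"])
    finally:
        doc.close()
    return spans
```

Calling `"\n".join(tag_headings(read_spans("book.pdf")))` (file name is a placeholder) would give a txt where probable chapter titles are marked, which helps to keep the chapter structure for the embedder.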
159
 
160
  <br>
161
 
 
180
 
181
  ...
182
 
183
+ <ul style="line-height: 1.05;">
184
  <li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
185
  <li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
186
  <li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>