kalle07
/

embedder_collection


kalle07 committed on May 2, 2025

Commit

216fd54

1 Parent(s): 4ec7dbe

Update README.md

Files changed (1)

README.md +26 -3

README.md CHANGED

@@ -35,7 +35,7 @@ BTW embedder is only a part of a good RAG<br>
 <b>&#x21e8;</b> give me a ❤️, if you like  ;)<br>
 <br>
 <b>My short impression:</b>
-<ul style="line-height: 1;">
 <li>nomic-embed-text (up to 2048t context length)</li>
 <li>mxbai-embed-large</li>
 <li>mug-b-1.6</li>
@@ -80,7 +80,7 @@ You have a txt/pdf file maybe 90000words(~300pages) a book. You ask the model le
 Now it searches for keywords or semantically similar terms in the document. If it finds them, let's say the word and meaning around “XYZ and ZYX”,
 a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers, but that doesn't matter; the principle is the same.)<br>
 This text snippet is then used for your answer. <br>
-<ul style="line-height: 1;">
 <li>If, for example, the word “XYZ” occurs 100 times in one file, not all 100 are found.</li>
 <li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
 <li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
@@ -133,6 +133,29 @@ QwQ-LCoT- (7/14b) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversatio
 btw. <b>Jinja</b> templates are very new ... the usual templates with usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I am not a coder)<br>
 <br>
@@ -157,7 +180,7 @@ docfetcher - https://docfetcher.sourceforge.io/en/index.html (yes old but very u
 ...
-<ul style="line-height: 1;">
 <li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
 <li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
 <li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>

 <b>&#x21e8;</b> give me a ❤️, if you like  ;)<br>
 <br>
 <b>My short impression:</b>
+<ul style="line-height: 1.05;">
 <li>nomic-embed-text (up to 2048t context length)</li>
 <li>mxbai-embed-large</li>
 <li>mug-b-1.6</li>
 Now it searches for keywords or semantically similar terms in the document. If it finds them, let's say the word and meaning around “XYZ and ZYX”,
 a piece of text of about 1024 tokens around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers, but that doesn't matter; the principle is the same.)<br>
 This text snippet is then used for your answer. <br>
+<ul style="line-height: 1.05;">
 <li>If, for example, the word “XYZ” occurs 100 times in one file, not all 100 are found.</li>
 <li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
 <li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
 btw. <b>Jinja</b> templates are very new ... the usual templates with usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I am not a coder)<br>
+<br>
+...
+<br>
+# DOC/PDF 2 TXT<br>
+Prepare your documents yourself!<br>
+Bad input = bad output!<br>
+In most cases, it is not immediately obvious how the document is made available to the embedder.
+In nearly all cases, images and tables, page numbers, chapters, and section/paragraph formatting are not handled well.
+An easy start is to use a Python-based PDF parser (there are a lot of them).<br>
+Options for fast text/table conversion only:
+<ul style="line-height: 1.05;">
+<li>pdfplumber</li>
+<li>fitz/PyMuPDF</li>
+<li>Camelot</li>
+</ul>
+All in all, you can tune your code a lot, and you can add OCR manually.
+<br><br>
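As a rough starting point for the "fast text/table conversion" route, here is a minimal sketch using pdfplumber. The function names `table_to_tsv` and `pdf_to_txt` and the page-marker format are made up for illustration and are not part of this README or of pdfplumber; only `pdfplumber.open`, `pdf.pages`, `page.extract_text()` and `page.extract_tables()` are the library's own calls.

```python
from typing import List, Optional

# One table as pdfplumber returns it: rows of cells, empty cells are None.
Table = List[List[Optional[str]]]


def table_to_tsv(table: Table) -> str:
    """Turn one extracted table into tab-separated text (hypothetical helper)."""
    lines = []
    for row in table:
        # Replace None with "" and flatten line breaks inside a cell.
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)


def pdf_to_txt(pdf_path: str) -> str:
    """Extract plain text and tables page by page (hypothetical helper)."""
    import pdfplumber  # third-party: pip install pdfplumber

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            parts.append(f"--- page {number} ---")  # assumed marker format
            parts.append(page.extract_text() or "")
            for table in page.extract_tables():
                parts.append(table_to_tsv(table))
    return "\n".join(parts)
```

Note that table cells usually show up in the plain page text as well, so in this sketch the table content appears twice; whether you keep both is one of the things you would tune for your own documents.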
+Option for an all-in-one solution for the future:
+<ul style="line-height: 1.05;"><li>docling (open source on GitHub)</li></ul>
+There are some ready-to-use examples, which are already pretty good.<br>
+https://github.com/docling-project/docling/tree/main/docs/examples<br>
+For OCR, it also downloads some models automatically. The only thing I haven't found yet (maybe it doesn't exist) is a way to read out the font, which works very well with <b>fitz</b>, for example.
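To illustrate what "reading out the font" with fitz can look like, here is a small sketch based on the dictionary that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans, each span carrying `font`, `size` and `text`). The helper names `count_fonts` and `fonts_in_pdf` are assumptions for this example, not something from the README.

```python
from collections import Counter
from typing import Any, Dict


def count_fonts(page_dict: Dict[str, Any]) -> Counter:
    """Count characters per (font name, rounded size) in one page dict."""
    fonts: Counter = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                key = (span["font"], round(span["size"], 1))
                fonts[key] += len(span["text"])
    return fonts


def fonts_in_pdf(pdf_path: str) -> Counter:
    """Sum the font usage over all pages of a PDF (hypothetical helper)."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    total: Counter = Counter()
    with fitz.open(pdf_path) as doc:
        for page in doc:
            total.update(count_fonts(page.get_text("dict")))
    return total
```

Calling `fonts_in_pdf("book.pdf").most_common(5)` (the file name is a placeholder) lists the most used font/size pairs. As a rule of thumb, the top entry is often the body text and larger sizes may be headings, but that is only a heuristic you would have to check against your own files.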
 <br>
 ...
+<ul style="line-height: 1.05;">
 <li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
 <li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
 <li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>