# Update README.md
---

# <b>This is a collection of more than 25 types of embedding models and a really brief introduction to what you should know about embedding. If you don't keep a few things in mind, you won't be satisfied with the results.</b>
<br>

# <b>All models were tested with ALLM (AnythingLLM) using LM-Studio as the server; all models should also work with ollama</b>
<b>The setup for local documents described below is almost the same; GPT4All has only one embedding model (nomic), and koboldcpp has no built-in embedder right now, but it is in development.</b><br>

(Sometimes the results are more truthful if the “chat with document only” option is used.)<br>
BTW, the embedder model is only one part of a good RAG (Retrieval-Augmented Generation) setup.<br>
<b>⇨</b> give me a ❤️, if you like ;)<br>
<br>
<b>My short impression:</b>
...

# Short usage hints (example for a large context with many expected hits):
Set the context length (Max Tokens) of your main LLM model to 16000t <b>(with LM-Studio as server for ALLM you must also set this in the LM-Studio settings!)</b>, set your embedder model's (Max Embedding Chunk Length) to 1024t, and set (Max Context Snippets) to 14;
in ALLM also set (Text splitting & Chunking Preferences - Text Chunk Size) to 1024-character parts and (Search Preference) to "accuracy", and set 14 snippets in your workspace.
<br>
Hint: with ALLM, set everything in LM-Studio and start both models there; both will then appear at the top in ALLM.<br>

-> OK, what does that mean?<br>
Your document will be embedded into x chunks (snippets) of 1024t each.<br>
You can retrieve 14 snippets of 1024t (~14000t) from your document (~10000 words, ~10 pages), leaving ~2000t (of the 16000t) for the answer (~1000 words, ~2 pages).
<br>
You can play with the settings to fit your needs, e.g. 8 snippets of 2048t, or 28 snippets of 512t ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer; if you need more room for a conversation, you should use lower values and/or disable the document.
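If it helps, here is a minimal Python sketch of this arithmetic (an illustration only: the function names and the character-based splitting are my own, and the "t" numbers are the same rough estimates used above, not exact tokenizer counts):

```python
# Rough sketch of the snippet/answer budget from the example above.
# Assumption: chunks are split by characters (like ALLM's "Text Chunk Size") and
# one 1024-character chunk is treated as roughly 1024t, as in this README.

def split_into_chunks(text: str, chunk_size: int = 1024) -> list[str]:
    """Split a document into fixed-size character chunks (snippets)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def context_budget(context_len: int = 16000, chunk_len: int = 1024, max_snippets: int = 14) -> dict:
    """Estimate how the context window is shared between snippets and the answer."""
    snippet_tokens = chunk_len * max_snippets      # 14 x 1024 = 14336 -> ~14000t of document snippets
    answer_tokens = context_len - snippet_tokens   # ~2000t left for the answer (~1000 English words)
    return {"snippet_tokens": snippet_tokens, "answer_tokens": answer_tokens}

print(context_budget())                                  # e.g. {'snippet_tokens': 14336, 'answer_tokens': 1664}
print(context_budget(chunk_len=512, max_snippets=28))    # the 28 x 512t variant mentioned above
```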
<ul style="line-height: 1.05;">
English and German differ by about 50% in tokens per word.<br>
~5000 characters is one page of a book (no matter German or English). But if you calculate with words: German words are longer, which means more tokens per word.<br>
The example is English; for German you can add approx. 50% more tokens per word (1000 words ~1800t).<br>
<li>1200t (~1000 words, ~5000 characters) ~0.1GB VRAM usage; this is approx. one page with a small font</li>
<li>8000t (~6000 words) ~0.8GB VRAM usage</li>
...
<br>

# DOC/PDF to TXT
Prepare your documents yourself!<br>
Bad input = bad output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder.
In nearly all cases images, tables, page numbers, chapters, formulas and section/paragraph formatting are not handled well.
You can start by simply saving the PDF as a TXT file; you will then see in the TXT file how the embedding model would see the content.
An easy start is to use a Python-based PDF parser (there are a lot of them).<br>
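For example, a minimal sketch with pypdf (just one of many Python PDF parsers; the README does not prescribe a specific library, and the file names here are placeholders):

```python
# Minimal sketch: dump a PDF to plain text so you can see what the embedder will see.
# Uses pypdf (pip install pypdf) - chosen here only as an example, not an endorsement.
from pypdf import PdfReader

reader = PdfReader("mydocument.pdf")                      # placeholder input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("mydocument.txt", "w", encoding="utf-8") as f:  # inspect this file before embedding it
    f.write(text)
```

Open the resulting TXT file and check whether tables, headings and formulas survived; that is roughly what the embedder will work with.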
An option only for simple txt/tables converting:
<ul style="line-height: 1.05;">

...
<br><br>
A larger option with many document types to play with (UI-based):
<ul style="line-height: 1.05;">
<li>Parse my PDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
<br>