# Update README.md
---

# <b>This is a collection of more than 25 types of embedding models and a really brief introduction to what you should know about embedding. If you don't keep a few things in mind, you won't be satisfied with the results.</b>
<br>

# <b>All models were tested with ALLM (AnythingLLM) using LM-Studio as the server; all models should also work with ollama</b>
<b>The setup for local documents described below is almost the same; GPT4All has only one embedding model (nomic), and koboldcpp has no built-in embedder right now, but it is in development.</b><br>

(Sometimes the results are more truthful if the “chat with document only” option is used.)<br>
BTW, the embedder model is only one part of a good RAG (Retrieval-Augmented Generation) setup.<br>
<b>⇨</b> give me a ❤️, if you like ;)<br>
<br>
<b>My short impression:</b>
...

# Short usage hints (example for a large context with many expected hits):
Set the context length (Max Tokens) of your main LLM model to 16000t <b>(with LM-Studio as server for ALLM you must also set this in the LM-Studio settings!)</b>, set your embedder model's (Max Embedding Chunk Length) to 1024t, and set (Max Context Snippets) to 14;
in ALLM also set (Text splitting & Chunking Preferences - Text Chunk Size) to 1024-character parts and (Search Preference) to "accuracy", and set 14 snippets in your workspace.
<br>
Hint: with ALLM, set everything in LM-Studio and start both models there; both will then appear at the top in ALLM.<br>

-> OK, what does that mean?<br>
Your document will be embedded into x chunks (snippets) of 1024t each.<br>
You can retrieve 14 snippets of 1024t (~14000t) from your document (~10000 words, ~10 pages), leaving ~2000t (of the 16000t) for the answer (~1000 words, ~2 pages).
<br>
You can play with the settings to fit your needs, e.g. 8 snippets of 2048t, or 28 snippets of 512t ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer; if you need more room for a conversation, you should use lower values and/or disable the document.
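If it helps, here is a minimal Python sketch of this arithmetic (an illustration only: the function names and the character-based splitting are my own, and the "t" numbers are the same rough estimates used above, not exact tokenizer counts):

```python
# Rough sketch of the snippet/answer budget from the example above.
# Assumption: chunks are split by characters (like ALLM's "Text Chunk Size") and
# one 1024-character chunk is treated as roughly 1024t, as in this README.

def split_into_chunks(text: str, chunk_size: int = 1024) -> list[str]:
    """Split a document into fixed-size character chunks (snippets)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def context_budget(context_len: int = 16000, chunk_len: int = 1024, max_snippets: int = 14) -> dict:
    """Estimate how the context window is shared between snippets and the answer."""
    snippet_tokens = chunk_len * max_snippets      # 14 x 1024 = 14336 -> ~14000t of document snippets
    answer_tokens = context_len - snippet_tokens   # ~2000t left for the answer (~1000 English words)
    return {"snippet_tokens": snippet_tokens, "answer_tokens": answer_tokens}

print(context_budget())                                  # e.g. {'snippet_tokens': 14336, 'answer_tokens': 1664}
print(context_budget(chunk_len=512, max_snippets=28))    # the 28 x 512t variant mentioned above
```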
<ul style="line-height: 1.05;">
English and German differ by about 50% in tokens per word.<br>
~5000 characters is one page of a book (no matter German or English). But if you calculate with words: German words are longer, which means more tokens per word.<br>
The example is English; for German you can add approx. 50% more tokens per word (1000 words ~1800t).<br>
<li>1200t (~1000 words, ~5000 characters) ~0.1GB VRAM usage; this is approx. one page with a small font</li>
<li>8000t (~6000 words) ~0.8GB VRAM usage</li>
...
<br>

# DOC/PDF to TXT
Prepare your documents yourself!<br>
Bad input = bad output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder.
In nearly all cases images, tables, page numbers, chapters, formulas and section/paragraph formatting are not handled well.
You can start by simply saving the PDF as a TXT file; you will then see in the TXT file how the embedding model would see the content.
An easy start is to use a Python-based PDF parser (there are a lot of them).<br>
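For example, a minimal sketch with pypdf (just one of many Python PDF parsers; the README does not prescribe a specific library, and the file names here are placeholders):

```python
# Minimal sketch: dump a PDF to plain text so you can see what the embedder will see.
# Uses pypdf (pip install pypdf) - chosen here only as an example, not an endorsement.
from pypdf import PdfReader

reader = PdfReader("mydocument.pdf")                      # placeholder input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("mydocument.txt", "w", encoding="utf-8") as f:  # inspect this file before embedding it
    f.write(text)
```

Open the resulting TXT file and check whether tables, headings and formulas survived; that is roughly what the embedder will work with.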
An option only for simple txt/tables converting:
<ul style="line-height: 1.05;">

...
<br><br>
A larger option with many document types to play with (UI-based):
<ul style="line-height: 1.05;">
<li>Parse my PDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
<br>