File size: 16,288 Bytes
a2a30ce db490d7 ca67a0d db490d7 f10f3d2 f7dd9bf 1807ea4 e17bacd 2613f0f f7dd9bf db490d7 7faf56a de44978 006f15d 8bb0749 db490d7 707634e 6c6b419 4383e8a 806cfb4 6eda2e5 2f8597b 7cf295c 48c47e0 d1100f8 2f8597b 6941498 248936b 9ba4185 21ae04b 86484e7 e8268ae 0053fa9 cc26a33 216fd54 86484e7 559c1a0 27728dc 0ea6c62 2b88bf2 03d90b7 8b6ebf4 03d90b7 86484e7 e3c33d0 0053fa9 81d7818 3e36a38 7ba2d01 6eda2e5 19c08a9 6eda2e5 13adbac 261a825 4a633dc 6eda2e5 19c08a9 df900ba 6eda2e5 ac01051 aee8285 95c8c5b c19c1c2 6553b20 4f15d72 a4c3b9b 3a41b7f a4c3b9b 78b1760 a4c3b9b 3a41b7f a4c3b9b 830f2b2 4f15d72 6553b20 a4c3b9b 40a4b41 56e5aa8 e8f36b1 56e5aa8 1e40a95 3dd8202 19c08a9 81d7818 da8a392 d1761d2 a9f7b92 81d7818 216fd54 2cdef30 81d7818 edbeb79 81d7818 c97f69a edbeb79 2cdef30 7ba2d01 0053fa9 13adbac 19c08a9 81d7818 2f8597b 2d290e4 225a69e 0fb7a38 5802c95 0fb7a38 3e36a38 2d290e4 3e36a38 7951727 268611f 8d5c583 7951727 81d7818 3e36a38 6abb2f6 3e36a38 7951727 6abb2f6 a0643c9 3e36a38 6abb2f6 3e36a38 7951727 a0643c9 da8a392 0fb7a38 216fd54 6eda2e5 1e5c486 216fd54 a9f7b92 6eda2e5 a9f7b92 216fd54 235d99a 216fd54 a9f7b92 47b36a1 33e364e bb7daf7 216fd54 a9f7b92 836c278 216fd54 836c278 235d99a 1e5c486 836c278 1e5c486 836c278 6eda2e5 836c278 1e5c486 0053fa9 42ff04c a0643c9 0053fa9 42ff04c a9f7b92 ec25569 e3703b0 c9b8cc1 42ff04c 8f5f194 c9b8cc1 42ff04c a1b1441 216fd54 bf9e7bf 2613f0f bf9e7bf 372acd1 bf9e7bf 2613f0f bf9e7bf 2613f0f 03d90b7 a0e81dc 31f3d48 865a8a5 2613f0f 3dcce4a 03d90b7 a0e81dc e467d8b a1b1441 5942394 806cfb4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 |
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embedder
- embedding
- models
- GGUF
- Bert
- Nomic
- Gist
- Granite
- BGE
- Jina
- Snowflake
- Qwen
- text-embeddings-inference
- RAG
- Rerank
- similarity
- PDF
- Parsing
- Parser
misc:
- text-embeddings-inference
language:
- en
- de
architecture:
---
# <b>This is a collection of more than 25 types of embedding models and a really brief introduction to what you should know about embedding.If you don't keep a few things in mind, you won't be satisfied with the results.</b>
<br>
at end of the file-list press

to see all files<br>
# <b>All models tested with ALLM(AnythingLLM) with LM-Studio as server, all models should be work with ollama</b>
<b> the setup for local documents described below is allmost the same, GPT4All has only one model (nomic), and koboldcpp and JAN(Menlo) is not build in right now but in development</b><br>
(sometimes the results are more truthful if the “chat with document only” option is used)<br>
BTW the embedder-model is only a part of a good RAG (Retrieval-Augmented Generation), 512t are ~2000 characters most cases enough.<br>
<b>⇨</b> give me a ❤️, if you like ;)<br>
<br>
<b>My short impression:</b>
<ul style="line-height: 1.05;">
<li>nomic-embed-text-v2-moe (up to 512t context length)</li>
<li>mxbai-embed-large (small and fast model)</li>
<li>mug-b-1.6</li>
<li>snowflake-arctic-embed-l-v2.0 (up to 8192t context length)</li>
<li>bge-m3 (up to 8192t context length)</li>
</ul>
Working well, all other its up to you! Some models are very similar! (jina and qwen based you can add manual to LM-Studio, set model "gear wheel" below "overide domain type")<br>
With the same setting, these embedders found same 6-7 snippets out of 10 from a book. This means that only 3-4 snippets were different, but I didn't test it extensively.<br>
Further tests have shown that the following models are suitable for complex tasks (German-text, but should be similar in English). Jina-DE, nomic was not that good. I'm not convinced by large models such as Qwen or JinaaiV3 and V4 doesnt work with LM studio; they are ten times slower and the result is not ten times better. Despite all this, you can recognize tables and some images.
<ul style="line-height: 1.05;">
<li>GTE large</li>
<li>cross-en-de-es-roberta</li>
<li>ger-RAG-bge-M3-merg-snowf-artic-hessian-AI (very good for german, up to 8192t context length)</li>
<li>German-RAG-BGE-M3-TRIPLES-HESSIAN-AI (very good for german, up to 8192t context length)</li>
<li>bge-m3 (good for german, up to 8192t context length)</li>
<li>jina-embeddings-v3 (good for german, up to 8192t context length)</li>
</ul>
There are two embedder to find toxic content (toxic-prompt-roberta and minilmv2-toxic-jigsaw), dont know how good it works, and from ibm it give a whole LLM model (granite-guardian).
<br>
<br>
...
# Short hints for using (Example for a large context with many expected hits):
Set your (Max Tokens)context-lenght 16000t main-LLM-model <b>"LM-Studio with ALLM you must set also in LM-Stutio settings!"</b>, set your embedder-model (Max Embedding Chunk Length) 1024t,set (Max Context Snippets) 14,
in ALLM set also (Text splitting & Chunking Preferences - Text Chunk Size) 1024 character parts and (Search Preference) "accuracy". And set in your workspace 14 snippets.
<br>
Hint in ALLM, set all in LM studio start both models and both are on top in ALLM.<br>
-> Ok what that mean!<br>
Your document will be embedd in x times 1024t chunks(snippets),<br>
You can receive 14-snippets a 1024t (~14000t) from your document ~10000words(10pages) and ~2000t left (from 16000t) for the answer ~1000words (2 pages).
<br>
You can play and set for your needs, eg 8-snippets a 2048t, or 28-snippets a 512t ... (every time you change the chunk-length the document must be embedd again). With these settings everything fits best for ONE answer, if you need more for a conversation, you should set lower and/or disable the document.
<ul style="line-height: 1.05;">
english vs german differ 50% in calculate tokens/word<br>
but ~5000 characters is one page of a book (no matter ger/en). But if you calculate with words ... words in german are longer, that means per word more token.<br>
The example is english, for german you can add apox 50% more token/word (1000 words ~1800t)<br>
<li>1200t (~1000 words ~5000 chracter) ~0.1GB, this is aprox one page with small font</li>
<li>best to get in mind: 5000-6000 characters correspond to approximately one page and approximately 1200-1400 token.</li>
<li>8000t (~6000 words) ~1.5GB VRAM usage</li>
<li>16000t (~12000 words) ~3GB VRAM usage</li>
<li>32000t (~24000 words) ~6GB VRAM usage</li>
<br>
Vector Size (Dimensions- you can not change)
The vector size, or dimensionality (embedding_length: xxx), is the number of numbers in each embedding vector.
Common embedding models produce vectors ranging from 384 dimensions (e.g., all-MiniLM-L6-v2) to 3072 dimensions (text-embedding-3-large).
Higher dimensions capture more semantic details but require more storage and computational resources for database indexing and search.
Some models allow you to shorten vectors (e.g., use only 256 out of 3072 dimensions) to save space while retaining high performance.<br>
<br>
Vector count refers to the total number of vectors stored, which usually corresponds to the number of content chunks indexed +the overlap chracters space.<br>
<br>
More vectors mean more granularity for search and retrieval but also increase database size and operational overhead sometimes 5times the size and also need more time for response.
Chunk Length<br>
<br>
Chunk length is the size (usually measured in words, tokens, or characters) of the text split for embedding (ALLM chunk length/ chunk size -> in characters).
</ul>
<br><br>
here is a tokenizer calculator<br>
<a href="https://quizgecko.com/tools/token-counter">https://quizgecko.com/tools/token-counter</a><br><br>
and a Vram calculator - (you need the original model link NOT the GGUF)<br>
<a href="https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator">https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator</a><br><br>
second VRAM calc for GGUF -> YOU need format "https://huggingface.co/provider/model/blob/main/model.gguf"<br>
Example: "https://huggingface.co/unsloth/granite-3.3-8b-instruct-GGUF/blob/main/granite-3.3-8b-instruct-UD-Q8_K_XL.gguf"<br>
<a href="https://huggingface.co/spaces/oobabooga/accurate-gguf-vram-calculator">https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator</a><br>
...
<br>
# How embedding and search works:
You have a txt/pdf file maybe 90000words(~300pages) a book. You ask the model lets say "what is described in chapter called XYZ in relation to person ZYX".
Now it searches for keywords or similar semantic terms in the document. if it has found them, lets say word and meaning around “XYZ and ZYX” ,
now a piece of text 1024token around this word “XYZ/ZYX” is cut out at this point. (In reality, it's all done with coded numbers per chunck and thats why you dont can search for single numbers or words, but dosnt matter - the principle)<br>
This text snippet is then used for your answer. <br>
<ul style="line-height: 1.05;">
<li>If, for example, the word/meaning “XYZ” occurs 50 times in one txt, not all 50 are used for answer, only the number of snippets with the best ranking are used</li>
<li>If only one snippet corresponds to your question all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippet are fine)</li>
<li>If you expect multible search results in your docs try 16-snippets or more, if you expect only 2 than dont use more!</li>
<li>If you use chunk-length ~2048(chars) you receive more content, if you use ~512chars you receive more facts BUT lower chunk-length are more chunks and need much longer time.</li>
<li>A question for "summary of the document" is most time not useful, if the document has an introduction or summaries its searching there if you have luck.</li>
<li>If a book has a table of contents or a bibliography, I would delete these pages as they often contain relevant search terms but do not help answer your question.</li>
<li>If the documents small like 10-20 Pages, its better you copy the whole text inside the CHAT, some options called "pin".</li>
<li>If a TXT file is embedded, you cannot create a summary! Only the snippets found are used for this purpose.</li>
<li>The same applies to word search or page search—in most cases, it does not work because it is not a word search but a search for similar expressions.</li>
</ul>
<br>
...
<br>
# Nevertheless, the <b>main model is also important !</b><br>
Especially to deal with the context length and I don't mean just the theoretical number you can set.
Some models can handle 128k or 1M tokens, but even with 16k or 32k input the response with the same snippets as input is worse than with other well developed models.<br>
<br>
llama3.1, llama3.2, qwen2.5, deepseek-r1-distill, gemma-3, granite, SauerkrautLM-Nemo(german) ... <br>
(llama3 or phi3.5 are not working well) <br><br>
<b>⇨</b> best models for english and german:<br>
granit3.2-8b (2b version also) - https://huggingface.co/ibm-research/granite-3.2-8b-instruct-GGUF<br>
Chocolatine-2-14B (other versions also) - https://huggingface.co/mradermacher/Chocolatine-2-14B-Instruct-DPO-v2.0b11-GGUF<br>
QwQ-LCoT- (7/14B) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF<br>
gemma-3 (4/12/27B) - https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF<br><br>
...
# Important -> The Systemprompt (some examples):
<li> The system prompt is weighted with a certain amount of influence around your question. You can easily test it once without or with a nonsensical system prompt.</li>
"You are a helpful assistant who provides an overview of ... under the aspects of ... .
You use attached excerpts from the collection to generate your answers!
Weight each individual excerpt in order, with the most important excerpts at the top and the less important ones further down.
The context of the entire article should not be given too much weight.
Answer the user's question!
After your answer, briefly explain why you included excerpts (1 to X) in your response and justify briefly if you considered some of them unimportant!"<br>
<i>(change it for your needs, this example works well when I consult a book about a person and a term related to them, the explanation part was just a test for myself)</i><br>
or:<br>
"You are an imaginative storyteller who crafts compelling narratives with depth, creativity, and coherence.
Your goal is to develop rich, engaging stories that captivate readers, staying true to the themes, tone, and style appropriate for the given prompt.
You use attached excerpts from the collection to generate your answers!
When generating stories, ensure the coherence in characters, setting, and plot progression. Be creative and introduce imaginative twists and unique perspectives."<br>
or:<br>
"You are are a warm and engaging companion who loves to talk about cooking, recipes and the joy of food.
Your aim is to share delicious recipes, cooking tips and the stories behind different cultures in a personal, welcoming and knowledgeable way."<br>
<br>
btw. <b>Jinja</b> templates very new ... the usual templates with usual models are fine, but merged models have a lot of optimization potential (but dont ask me iam not a coder)<br>
<br><br>
...
<br>
# DOC/PDF to TXT<br>
Prepare your documents by yourself!<br>
Bad Input = bad Output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder. In ALLM its "c:\Users\XXX\AppData\Roaming\anythingllm-desktop\storage\documents", you can open with a text editor to check the quality.
In nearly all cases images and tables, page-numbers, chapters, formulas and sections/paragraph-format not well implement.
You can start by simply saving the PDF as a TXT file, and you will then see in the TXT file how the embedding-model would see the content.
An easy start is to use a python based pdf-parser (it give a lot) also OCR based for images.<br>
option one only for simple txt/tables converting:
<ul style="line-height: 1.05;">
<li>pdfplumber</li>
<li>fitz/PyMuPDF</li>
<li>Camelot</li>
</ul>
All in all you can tune a lot your code but the difficulties lie in the details.<br>
my option, one exe for windows and also python, a second option with ocr:<br>
<a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a><br>
my raw keyword search and snippet extractor<br>
<a href="https://huggingface.co/kalle07/raw-txt-snippet-creator">https://huggingface.co/kalle07/raw-txt-snippet-creator</a>
<br><br>
option ocr from ibm (open source):
<ul style="line-height: 1.05;">
<li>docling - (opensource on github)</li>
</ul>
it give some ready to use examples, which are already pretty good, ~10-20 code-lines.
<br>
<a href="https://github.com/docling-project/docling/tree/main/docs/examples">https://github.com/docling-project/docling/tree/main/docs/examples</a><br>
also for OCR it download automatic some models. the only thing i haven't found yet (maybe it doesn't exist) is to read out the font-type, which works very well with <b>fitz</b>, for example.
<br><br>
large option to play with many types of (UI-Based)
<ul style="line-height: 1.05;">
<li>Parse my PDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
<br>
...
<br>
# only Indexing option<br>
One hint for fast search on 10000s of PDF/TXT/DOC (its only indexing, not embedding) you can use it as a simple way to find your top 5-10 articles or books, you can then make these available to an LLM.<br>
Jabref - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
https://builds.jabref.org/main/ <br>
or<br>
docfetcher - https://docfetcher.sourceforge.io/en/index.html (yes old but very useful)
<br><br>
...
<br>
" on discord <b>sevenof9</b> "
<br><br>
...
<br>
# (ALL licenses and terms of use go to original author)
...
<ul style="line-height: 1.05;">
<li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
<li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
<li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>
<li>BAAI/bge-reranker-v2-m3 (English and Chinese)</li>
<li>BAAI/bge-reranker-v2-gemma (English and Chinese)</li>
<li>BAAI/bge-m3 (English 40% and Chinese 20%, after Spain, German, Russion, Italian, French ... )</li>
<li>avsolatorio/GIST-large-Embedding-v0 (English)</li>
<li>ibm-granite/granite-embedding-278m-multilingual (English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese)</li>
<li>ibm-granite/granite-embedding-125m-english</li>
<li>Labib11/MUG-B-1.6 (?)</li>
<li>mixedbread-ai/mxbai-embed-large-v1 (multi)</li>
<li>nomic-ai/nomic-embed-text-v1.5 (English, multi)</li>
<li>nomic-ai/nomic-embed-text-v2-moe (English, Spanish, French, German, Italian, Portuguese, Polish all other 100-languages are less trained)</li>
<li>Snowflake/snowflake-arctic-embed-l-v2.0 (English, multi)</li>
<li>intfloat/multilingual-e5-large-instruct (100 languages)</li>
<li>T-Systems-onsite/german-roberta-sentence-transformer-v2</li>
<li>T-Systems-onsite/cross-en-de-roberta-sentence-transformer (English, German)</li>
<li>T-Systems-onsite/cross-en-de-es-roberta-sentence-transformer (English, German, Spanish)</li>
<li>T-Systems-onsite/cross-en-de-fr-roberta-sentence-transforme (English, German, France)</li>
<li>mixedbread-ai/mxbai-embed-2d-large-v1</li>
<li>jinaai/jina-embeddings-v2-base-en</li>
<li>Qwen/Qwen3-Embedding-0.6B (multi)</li>
<li>HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5</li>
<li>thenlper/gte-large</li>
<li>sentence-transformers/all-MiniLM-L6-v2</li>
<li>TatonkaHF/bge-m3_en_ru (En - RU)</li>
</ul>
|