File size: 16,288 Bytes

a2a30ce
 
 
 
db490d7
 
 
 
 
ca67a0d
db490d7
f10f3d2
 
 
f7dd9bf
1807ea4
e17bacd
2613f0f
f7dd9bf
db490d7
7faf56a
de44978
 
006f15d
 
 
8bb0749
db490d7
707634e
 
 
6c6b419
4383e8a
806cfb4
 
6eda2e5
2f8597b
7cf295c
48c47e0
 
 
d1100f8
2f8597b
6941498
248936b
9ba4185
21ae04b
86484e7
e8268ae
0053fa9
cc26a33
216fd54
86484e7
 
559c1a0
27728dc
 
0ea6c62
2b88bf2
03d90b7
8b6ebf4
03d90b7
 
 
86484e7
 
 
 
e3c33d0
 
0053fa9
81d7818
3e36a38
7ba2d01
 
6eda2e5
 
19c08a9
6eda2e5
13adbac
261a825
4a633dc
6eda2e5
19c08a9
df900ba
 
6eda2e5
 
ac01051
aee8285
95c8c5b
c19c1c2
 
 
6553b20
4f15d72
a4c3b9b
3a41b7f
a4c3b9b
 
 
 
78b1760
a4c3b9b
3a41b7f
a4c3b9b
830f2b2
4f15d72
6553b20
a4c3b9b
40a4b41
56e5aa8
e8f36b1
56e5aa8
 
 
 
1e40a95
3dd8202
19c08a9
81d7818
 
 
da8a392
d1761d2
a9f7b92
81d7818
216fd54
2cdef30
81d7818
 
edbeb79
81d7818
c97f69a
edbeb79
2cdef30
 
7ba2d01
0053fa9
13adbac
19c08a9
81d7818
2f8597b
2d290e4
 
225a69e
0fb7a38
 
 
 
 
5802c95
 
0fb7a38
3e36a38
2d290e4
3e36a38
 
7951727
268611f
 
 
8d5c583
7951727
81d7818
3e36a38
6abb2f6
3e36a38
7951727
6abb2f6
 
a0643c9
3e36a38
6abb2f6
3e36a38
7951727
 
a0643c9
da8a392
0fb7a38
216fd54
 
 
6eda2e5
1e5c486
216fd54
a9f7b92
6eda2e5
 
a9f7b92
 
216fd54
 
 
235d99a
216fd54
a9f7b92
 
47b36a1
 
33e364e
bb7daf7
216fd54
a9f7b92
836c278
216fd54
836c278
 
 
235d99a
1e5c486
836c278
1e5c486
836c278
6eda2e5
836c278
1e5c486
0053fa9
42ff04c
a0643c9
0053fa9
42ff04c
a9f7b92
ec25569
 
e3703b0
 
 
c9b8cc1
42ff04c
 
8f5f194
c9b8cc1
42ff04c
 
 
a1b1441
 
 
 
216fd54
bf9e7bf
 
 
 
 
2613f0f
bf9e7bf
 
372acd1
bf9e7bf
 
 
2613f0f
bf9e7bf
 
 
2613f0f
03d90b7
a0e81dc
31f3d48
865a8a5
2613f0f
3dcce4a
03d90b7
a0e81dc
 
e467d8b
a1b1441
 
 
 
 
 
5942394
806cfb4

---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embedder
- embedding
- models
- GGUF
- Bert
- Nomic
- Gist
- Granite
- BGE
- Jina
- Snowflake
- Qwen
- text-embeddings-inference
- RAG
- Rerank
- similarity
- PDF
- Parsing
- Parser
misc:
- text-embeddings-inference
language:
- en
- de
architecture:

---

# <b>This is a collection of more than 25 types of embedding models and a really brief introduction to what you should know about embedding.If you don't keep a few things in mind, you won't be satisfied with the results.</b>
<br>
at end of the file-list press

![grafik](https://cdn-uploads.huggingface.co/production/uploads/65b669300c9514da4f17a34f/Jrx3Jj0zVNquiyg9vuhEn.png)

to see all files<br>

# <b>All models tested with ALLM(AnythingLLM) with LM-Studio as server, all models should be work with ollama</b>
<b> the setup for local documents described below is allmost the same, GPT4All has only one model (nomic), and koboldcpp and JAN(Menlo) is not build in right now but in development</b><br>

(sometimes the results are more truthful if the “chat with document only” option is used)<br>
BTW the embedder-model is only a part of a good RAG (Retrieval-Augmented Generation), 512t are ~2000 characters most cases enough.<br>
<b>&#x21e8;</b> give me a ❤️, if you like  ;)<br>
<br>
<b>My short impression:</b>
<ul style="line-height: 1.05;">
<li>nomic-embed-text-v2-moe (up to 512t context length)</li> 
<li>mxbai-embed-large (small and fast model)</li>
<li>mug-b-1.6</li>
<li>snowflake-arctic-embed-l-v2.0 (up to 8192t context length)</li>
<li>bge-m3 (up to 8192t context length)</li>
</ul>
Working well, all other its up to you! Some models are very similar! (jina and qwen based you can add manual to LM-Studio, set model "gear wheel" below "overide domain type")<br>
With the same setting, these embedders found same 6-7 snippets out of 10 from a book. This means that only 3-4 snippets were different, but I didn't test it extensively.<br>
Further tests have shown that the following models are suitable for complex tasks (German-text, but should be similar in English). Jina-DE, nomic was not that good. I'm not convinced by large models such as Qwen or JinaaiV3 and V4 doesnt work with LM studio; they are ten times slower and the result is not ten times better. Despite all this, you can recognize tables and some images.
<ul style="line-height: 1.05;">
<li>GTE large</li> 
<li>cross-en-de-es-roberta</li>
<li>ger-RAG-bge-M3-merg-snowf-artic-hessian-AI (very good for german, up to 8192t context length)</li>
<li>German-RAG-BGE-M3-TRIPLES-HESSIAN-AI (very good for german, up to 8192t context length)</li>
<li>bge-m3 (good for german, up to 8192t context length)</li>
<li>jina-embeddings-v3 (good for german, up to 8192t context length)</li>
</ul>
There are two embedder to find toxic content (toxic-prompt-roberta and minilmv2-toxic-jigsaw), dont know how good it works, and from ibm it give a whole LLM model (granite-guardian).
<br>
<br>
...

# Short hints for using (Example for a large context with many expected hits):
Set your (Max Tokens)context-lenght 16000t main-LLM-model <b>"LM-Studio with ALLM you must set also in LM-Stutio settings!"</b>, set your embedder-model (Max Embedding Chunk Length) 1024t,set (Max Context Snippets) 14, 
 in ALLM set also (Text splitting & Chunking Preferences - Text Chunk Size) 1024 character parts and (Search Preference) "accuracy". And set in your workspace 14 snippets.
<br>
Hint in ALLM, set all in LM studio start both models and both are on top in ALLM.<br>

-> Ok what that mean!<br>
Your document will be embedd in x times 1024t chunks(snippets),<br>
You can receive 14-snippets a 1024t (~14000t) from your document ~10000words(10pages) and ~2000t left (from 16000t) for the answer ~1000words (2 pages).
<br>
You can play and set for your needs, eg 8-snippets a 2048t, or 28-snippets a 512t ... (every time you change the chunk-length the document must be embedd again). With these settings everything fits best for ONE answer, if you need more for a conversation, you should set lower and/or disable the document.
<ul style="line-height: 1.05;">
english vs german differ 50% in calculate tokens/word<br>
but ~5000 characters is one page of a book (no matter ger/en). But if you calculate with words ... words in german are longer, that means per word more token.<br>
The example is english, for german you can add apox 50% more token/word (1000 words ~1800t)<br>
<li>1200t (~1000 words ~5000 chracter) ~0.1GB, this is aprox one page with small font</li>
<li>best to get in mind: 5000-6000 characters correspond to approximately one page and approximately 1200-1400 token.</li>
<li>8000t (~6000 words) ~1.5GB VRAM usage</li>
<li>16000t (~12000 words) ~3GB VRAM usage</li>
<li>32000t (~24000 words) ~6GB VRAM usage</li>
<br>
Vector Size (Dimensions- you can not change)

The vector size, or dimensionality (embedding_length: xxx), is the number of numbers in each embedding vector.
Common embedding models produce vectors ranging from 384 dimensions (e.g., all-MiniLM-L6-v2) to 3072 dimensions (text-embedding-3-large).
Higher dimensions capture more semantic details but require more storage and computational resources for database indexing and search.
Some models allow you to shorten vectors (e.g., use only 256 out of 3072 dimensions) to save space while retaining high performance.<br>
<br>
Vector count refers to the total number of vectors stored, which usually corresponds to the number of content chunks indexed +the overlap chracters space.<br>
<br>
More vectors mean more granularity for search and retrieval but also increase database size and operational overhead sometimes 5times the size and also need more time for response.
Chunk Length<br>
<br>
Chunk length is the size (usually measured in words, tokens, or characters) of the text split for embedding (ALLM chunk length/ chunk size -> in characters).
</ul>
<br><br>
here is a tokenizer calculator<br>
<a href="https://quizgecko.com/tools/token-counter">https://quizgecko.com/tools/token-counter</a><br><br>
and a Vram calculator - (you need the original model link NOT the GGUF)<br>
<a href="https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator">https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator</a><br><br>
second VRAM calc for GGUF -> YOU need format "https://huggingface.co/provider/model/blob/main/model.gguf"<br>
Example: "https://huggingface.co/unsloth/granite-3.3-8b-instruct-GGUF/blob/main/granite-3.3-8b-instruct-UD-Q8_K_XL.gguf"<br>
<a href="https://huggingface.co/spaces/oobabooga/accurate-gguf-vram-calculator">https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator</a><br>

...
<br>

# How embedding and search works:

You have a txt/pdf file maybe 90000words(~300pages) a book. You ask the model lets say "what is described in chapter called XYZ in relation to person ZYX". 
Now it searches for keywords or similar semantic terms in the document. if it has found them, lets say word and meaning around “XYZ and ZYX” , 
now a piece of text 1024token around this word “XYZ/ZYX” is cut out at this point.  (In reality, it's all done with coded numbers per chunck and thats why you dont can search for single numbers or words, but dosnt matter - the principle)<br>
This text snippet is then used for your answer. <br>
<ul style="line-height: 1.05;">
<li>If, for example, the word/meaning “XYZ” occurs 50 times in one txt, not all 50 are used for answer, only the number of snippets with the best ranking are used</li>
<li>If only one snippet corresponds to your question all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippet are fine)</li>
<li>If you expect multible search results in your docs try 16-snippets or more, if you expect only 2 than dont use more!</li>
<li>If you use chunk-length ~2048(chars) you receive more content, if you use ~512chars you receive more facts BUT lower chunk-length are more chunks and need much longer time.</li>
<li>A question for "summary of the document" is most time not useful, if the document has an introduction or summaries its searching there if you have luck.</li>
<li>If a book has a table of contents or a bibliography, I would delete these pages as they often contain relevant search terms but do not help answer your question.</li>
<li>If the documents small like 10-20 Pages, its better you copy the whole text inside the CHAT, some options called "pin".</li>
<li>If a TXT file is embedded, you cannot create a summary! Only the snippets found are used for this purpose.</li>
<li>The same applies to word search or page search—in most cases, it does not work because it is not a word search but a search for similar expressions.</li> 
</ul>
<br>
...
<br>

# Nevertheless, the <b>main model is also important !</b><br>
Especially to deal with the context length and I don't mean just the theoretical number you can set.
Some models can handle 128k or 1M tokens, but even with 16k or 32k input the response with the same snippets as input is worse than with other well developed models.<br>
<br>
llama3.1, llama3.2, qwen2.5, deepseek-r1-distill, gemma-3, granite, SauerkrautLM-Nemo(german) ... <br>
(llama3 or phi3.5 are not working well) <br><br>
<b>&#x21e8;</b> best models for english and german:<br>
granit3.2-8b (2b version also) - https://huggingface.co/ibm-research/granite-3.2-8b-instruct-GGUF<br>
Chocolatine-2-14B (other versions also) - https://huggingface.co/mradermacher/Chocolatine-2-14B-Instruct-DPO-v2.0b11-GGUF<br>
QwQ-LCoT- (7/14B) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF<br>
gemma-3 (4/12/27B) - https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF<br><br>

...
# Important -> The Systemprompt (some examples):
<li> The system prompt is weighted with a certain amount of influence around your question. You can easily test it once without or with a nonsensical system prompt.</li>

"You are a helpful assistant who provides an overview of ... under the aspects of ... . 
You use attached excerpts from the collection to generate your answers! 
Weight each individual excerpt in order, with the most important excerpts at the top and the less important ones further down. 
The context of the entire article should not be given too much weight.  
Answer the user's question!  
After your answer, briefly explain why you included excerpts (1 to X) in your response and justify briefly if you considered some of them unimportant!"<br>
<i>(change it for your needs, this example works well when I consult a book about a person and a term related to them, the explanation part was just a test for myself)</i><br>

or:<br>

"You are an imaginative storyteller who crafts compelling narratives with depth, creativity, and coherence. 
Your goal is to develop rich, engaging stories that captivate readers, staying true to the themes, tone, and style appropriate for the given prompt.
You use attached excerpts from the collection to generate your answers!
When generating stories, ensure the coherence in characters, setting, and plot progression. Be creative and introduce imaginative twists and unique perspectives."<br>

or:<br>

"You are are a warm and engaging companion who loves to talk about cooking, recipes and the joy of food. 
Your aim is to share delicious recipes, cooking tips and the stories behind different cultures in a personal, welcoming and knowledgeable way."<br>
<br>
btw. <b>Jinja</b> templates very new ... the usual templates with usual models are fine, but merged models have a lot of optimization potential (but dont ask me iam not a coder)<br>
<br><br>

...
<br>
# DOC/PDF to TXT<br>
Prepare your documents by yourself!<br>
Bad Input = bad Output!<br>
In most cases, it is not immediately obvious how the document is made available to the embedder. In ALLM its "c:\Users\XXX\AppData\Roaming\anythingllm-desktop\storage\documents", you can open with a text editor to check the quality.
In nearly all cases images and tables, page-numbers, chapters, formulas and sections/paragraph-format not well implement.
You can start by simply saving the PDF as a TXT file, and you will then see in the TXT file how the embedding-model would see the content.
An easy start is to use a python based pdf-parser (it give a lot) also OCR based for images.<br>
option one only for simple txt/tables converting:
<ul style="line-height: 1.05;">
<li>pdfplumber</li>
<li>fitz/PyMuPDF</li>
<li>Camelot</li>
</ul>
All in all you can tune a lot your code but the difficulties lie in the details.<br>
my option, one exe for windows and also python, a second option with ocr:<br>
<a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a><br>
my raw keyword search and snippet extractor<br>
<a href="https://huggingface.co/kalle07/raw-txt-snippet-creator">https://huggingface.co/kalle07/raw-txt-snippet-creator</a>

<br><br>
option ocr from ibm (open source):
<ul style="line-height: 1.05;">
<li>docling - (opensource on github)</li>
</ul>
it give some ready to use examples, which are already pretty good, ~10-20 code-lines.
<br>
<a href="https://github.com/docling-project/docling/tree/main/docs/examples">https://github.com/docling-project/docling/tree/main/docs/examples</a><br>
also for OCR it download automatic some models. the only thing i haven't found yet (maybe it doesn't exist) is to read out the font-type, which works very well with <b>fitz</b>, for example.
<br><br>
large option to play with many types of (UI-Based)
<ul style="line-height: 1.05;">
<li>Parse my PDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
<br>

...
<br>
# only Indexing option<br>
One hint for fast search on 10000s of PDF/TXT/DOC (its only indexing, not embedding) you can use it as a simple way to find your top 5-10 articles or books, you can then make these available to an LLM.<br>
Jabref - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
https://builds.jabref.org/main/ <br>
or<br>
docfetcher - https://docfetcher.sourceforge.io/en/index.html (yes old but very useful)
<br><br>
...
<br>
" on discord <b>sevenof9</b> "
<br><br>
...
<br>


# (ALL licenses and terms of use go to original author)

...

<ul style="line-height: 1.05;">
<li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
<li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
<li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>
<li>BAAI/bge-reranker-v2-m3 (English and Chinese)</li>
<li>BAAI/bge-reranker-v2-gemma (English and Chinese)</li>
<li>BAAI/bge-m3 (English 40% and Chinese 20%, after Spain, German, Russion, Italian, French ... )</li>
<li>avsolatorio/GIST-large-Embedding-v0 (English)</li>
<li>ibm-granite/granite-embedding-278m-multilingual (English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese)</li>
<li>ibm-granite/granite-embedding-125m-english</li>
<li>Labib11/MUG-B-1.6 (?)</li>
<li>mixedbread-ai/mxbai-embed-large-v1 (multi)</li>
<li>nomic-ai/nomic-embed-text-v1.5 (English, multi)</li>
<li>nomic-ai/nomic-embed-text-v2-moe (English, Spanish, French, German, Italian, Portuguese, Polish all other 100-languages are less trained)</li>
<li>Snowflake/snowflake-arctic-embed-l-v2.0 (English, multi)</li>
<li>intfloat/multilingual-e5-large-instruct (100 languages)</li>
<li>T-Systems-onsite/german-roberta-sentence-transformer-v2</li>
<li>T-Systems-onsite/cross-en-de-roberta-sentence-transformer (English, German)</li>
<li>T-Systems-onsite/cross-en-de-es-roberta-sentence-transformer (English, German, Spanish)</li>
<li>T-Systems-onsite/cross-en-de-fr-roberta-sentence-transforme (English, German, France)</li>
<li>mixedbread-ai/mxbai-embed-2d-large-v1</li>
<li>jinaai/jina-embeddings-v2-base-en</li>
<li>Qwen/Qwen3-Embedding-0.6B (multi)</li>
<li>HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5</li>
<li>thenlper/gte-large</li> 
<li>sentence-transformers/all-MiniLM-L6-v2</li> 
<li>TatonkaHF/bge-m3_en_ru (En - RU)</li>
</ul>