kalle07 committed on
Commit
9e9c7c6
·
verified ·
1 Parent(s): aaa22cd

Update README.md

Files changed (1): README.md +13 -10
README.md CHANGED
@@ -45,7 +45,7 @@ to see all files<br>
45
 <b> the setup for local documents described below is almost the same; GPT4All has only one model (nomic), and koboldcpp and JAN (Menlo) are not built in right now but in development</b><br>
46
 
47
  (sometimes the results are more truthful if the “chat with document only” option is used)<br>
48
- BTW, the embedder model is only one part of a good RAG (Retrieval-Augmented Generation) setup; 512 tokens are ~2000 characters, in most cases enough.<br>
49
  <b>&#x21e8;</b> give me a ❤️, if you like ;)<br>
50
  <br>
51
  <b>My short impression:</b>
@@ -80,10 +80,10 @@ Hint in ALLM, set all in LM studio start both models and both are on top in ALLM
80
 
81
 -> OK, what does that mean?<br>
82
 Your document will be embedded into x chunks (snippets) of 1500 characters each,<br>
83
- You can receive 10 snippets of 1500 chars each = 15000 chars (~4000 tokens) from your document, i.e. ~2500 words (2.5 pages), and ~4000 tokens are left (out of 8000) for the answer, again ~2500 words (2.5 pages). One question - one answer; with every setup I suggest querying only once. <br>
84
 If you change your settings so that you retrieve more than 8000 tokens, you should use a good LLM to obtain a detailed and meaningful answer from such a large amount of content. Whether a model can accept 100'000 tokens is beside the point; that doesn't mean the model can process 100'000 tokens meaningfully.
85
  <br>
86
- You can play with the settings to suit your needs, e.g. 8 snippets of 2048 tokens, or 28 snippets of 512 tokens ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer.
87
  <ul style="line-height: 1.05;">
88
 English vs. German differ by ~50% in tokens per word<br>
89
 but ~5000 characters is one page of a book (no matter whether German or English). If you calculate with words instead: words in German are longer, which means more tokens per word.<br>
@@ -122,15 +122,17 @@ Example: "https://huggingface.co/unsloth/granite-3.3-8b-instruct-GGUF/blob/main/
122
 
123
  # How embedding and search works:
124
 
125
- You have a txt/pdf file, maybe a book of 90000 words (~300 pages). You ask the model, let's say, "what is described in the chapter called XYZ in relation to person ZYX".
 
 
126
 Now it searches for keywords or semantically similar terms in the document. If it has found them, let's say the words and meaning around “XYZ and ZYX”,
127
- now a piece of text of 1024 tokens around these words “XYZ/ZYX” is cut out at this point. (In reality, it is all done with encoded numbers per chunk, which is why you cannot search for single numbers or words - but that doesn't matter for the principle.)<br>
128
  This text snippet is then used for your answer. <br>
129
  <ul style="line-height: 1.05;">
130
 <li>If, for example, the word/meaning “XYZ” occurs 50 times in one txt, not all 50 are used for the answer; only the snippets with the best ranking are used</li>
131
- <li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
132
- <li>If you expect multiple search results in your docs, try 16 snippets or more; if you expect only 2, then don't use more!</li>
133
- <li>If you use a chunk length of ~2048 chars you receive more context; if you use ~512 chars you receive more facts. BUT a lower chunk length means more chunks and takes much longer.</li>
134
 <li>A question like "summarize the document" is most of the time not useful; if the document has an introduction or summaries, the search will hit those - if you are lucky.</li>
135
  <li>If a book has a table of contents or a bibliography, I would delete these pages as they often contain relevant search terms but do not help answer your question.</li>
136
 <li>If the document is small, like 10-20 pages, it is better to copy the whole text into the CHAT; some tools call this option "pin".</li>
@@ -151,11 +153,12 @@ llama3.1, llama3.2, qwen2.5, deepseek-r1-distill, gemma-3, granite, SauerkrautLM
151
 granite-3.2-8b (2b version also) - https://huggingface.co/ibm-research/granite-3.2-8b-instruct-GGUF<br>
152
  Chocolatine-2-14B (other versions also) - https://huggingface.co/mradermacher/Chocolatine-2-14B-Instruct-DPO-v2.0b11-GGUF<br>
153
  QwQ-LCoT- (7/14B) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF<br>
154
- gemma-3 (4/12/27B) - https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF<br><br>
 
155
 
156
  ...
157
  # Important -> The Systemprompt (some examples):
158
- <li>The system prompt is weighted with a certain amount of influence alongside your question. You can easily test this once without a system prompt or with a nonsensical one.</li>
159
 
160
  "You are a helpful assistant who provides an overview of ... under the aspects of ... .
161
  You use attached excerpts from the collection to generate your answers!
 
45
 <b> the setup for local documents described below is almost the same; GPT4All has only one model (nomic), and koboldcpp and JAN (Menlo) are not built in right now but in development</b><br>
46
 
47
  (sometimes the results are more truthful if the “chat with document only” option is used)<br>
48
+ BTW, the embedder model is only one part of a good RAG (Retrieval-Augmented Generation) setup<br>
49
  <b>&#x21e8;</b> give me a ❤️, if you like ;)<br>
50
  <br>
51
  <b>My short impression:</b>
 
80
 
81
 -> OK, what does that mean?<br>
82
 Your document will be embedded into x chunks (snippets) of 1500 characters each,<br>
83
+ You can receive 10 snippets of 1500 chars each = 15000 chars (~4000 tokens) from your document, all in all ~2500 words (2.5 pages). Now ~4000 tokens are left (out of 8000) for the answer, again ~2500 words (2.5 pages). One question - one answer; with every setup I suggest querying only once. <br>
84
 If you change your settings so that you retrieve more than 8000 tokens, you should use a good LLM to obtain a detailed and meaningful answer from such a large amount of content. Whether a model can accept 100'000 tokens is beside the point; that doesn't mean the model can process 100'000 tokens meaningfully.
85
  <br>
86
+ You can play with the settings to suit your needs, e.g. 5 snippets of 5000 chars, or 20 snippets of 500 chars ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer.
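As a side note, the snippet/token arithmetic above can be sketched in a few lines of Python. This is a rough estimate only, using the ~4 chars/token rule of thumb implied by the text (real tokenizers vary, especially between English and German):

```python
# Rough context-budget math for a RAG setup.
# Assumption: ~4 characters per token (an estimate, not a tokenizer).
CHARS_PER_TOKEN = 4

def retrieval_budget(snippets, chunk_chars, context_tokens=8000):
    """Estimate how many tokens retrieval uses and how many remain
    for the answer, given the model's context window."""
    retrieved_chars = snippets * chunk_chars
    retrieved_tokens = retrieved_chars // CHARS_PER_TOKEN
    answer_tokens = context_tokens - retrieved_tokens
    return retrieved_tokens, answer_tokens

# 10 snippets of 1500 chars = 15000 chars -> roughly 4000 tokens
# retrieved, leaving roughly 4000 of an 8000-token window for the answer.
print(retrieval_budget(10, 1500))
```

Plugging in the alternative settings mentioned above (5 snippets of 5000 chars) shows they retrieve more content and leave noticeably less room for the answer.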
87
  <ul style="line-height: 1.05;">
88
 English vs. German differ by ~50% in tokens per word<br>
89
 but ~5000 characters is one page of a book (no matter whether German or English). If you calculate with words instead: words in German are longer, which means more tokens per word.<br>
 
122
 
123
  # How embedding and search works:
124
 
125
+ Embedding is not a database; you cannot ask how many times the word "xyz" appears. With embedding, you also cannot ask for a summary of an entire book.<br>
126
+ Example:<br>
127
+ You have a txt/pdf file, maybe a book of 90000 words (~300 pages, ~1'000'000 chars). You ask the model, let's say, "what is described in the chapter called XYZ in relation to person ZYX".
128
 Now it searches for keywords or semantically similar terms in the document. If it has found them, let's say the words and meaning around “XYZ and ZYX”,
129
+ now a piece of text of 2000 chars around these words “XYZ/ZYX” is cut out at this point. (In reality, it is all done with encoded numbers per chunk, which is why you cannot search for single numbers or words - but that doesn't matter for the principle.)<br>
130
  This text snippet is then used for your answer. <br>
131
  <ul style="line-height: 1.05;">
132
 <li>If, for example, the word/meaning “XYZ” occurs 50 times in one txt, not all 50 are used for the answer; only the snippets with the best ranking are used</li>
133
+ <li>If only one snippet corresponds to your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 20 snippets are fine)</li>
134
+ <li>If you expect multiple search results in your docs, try 15 snippets or more; if you expect only 2, then don't use more!</li>
135
+ <li>If you use a chunk length of ~2000 chars you receive more context; if you use ~500 chars you receive more facts. BUT a lower chunk length means more chunks and takes much longer.</li>
136
 <li>A question like "summarize the document" is most of the time not useful; if the document has an introduction or summaries, the search will hit those - if you are lucky.</li>
137
  <li>If a book has a table of contents or a bibliography, I would delete these pages as they often contain relevant search terms but do not help answer your question.</li>
138
 <li>If the document is small, like 10-20 pages, it is better to copy the whole text into the CHAT; some tools call this option "pin".</li>
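The chunk-and-rank principle described above can be illustrated with a small, self-contained sketch. Note that `vectorize` here is a plain bag-of-words counter used only for demonstration; a real embedding model produces dense vectors that also match on meaning, not just on shared words - but the mechanics of chunking, scoring, and keeping only the top-ranked snippets are the same:

```python
from collections import Counter
import math

def chunk_text(text, chunk_chars=2000):
    """Split a document into fixed-size character chunks (snippets)."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def vectorize(text):
    """Toy stand-in for an embedding model: a bag-of-words vector.
    Real embedders capture semantics, not just exact word overlap."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_snippets(document, question, k=4, chunk_chars=2000):
    """Return only the k best-ranking chunks; everything else is
    discarded and never reaches the LLM."""
    chunks = chunk_text(document, chunk_chars)
    qv = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(vectorize(c), qv), reverse=True)
    return ranked[:k]
```

This also makes the trade-offs above concrete: a smaller `chunk_chars` yields more, narrower chunks to score (more facts, more time), and `k` is the snippet count you set in the frontend.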
 
153
 granite-3.2-8b (2b version also) - https://huggingface.co/ibm-research/granite-3.2-8b-instruct-GGUF<br>
154
  Chocolatine-2-14B (other versions also) - https://huggingface.co/mradermacher/Chocolatine-2-14B-Instruct-DPO-v2.0b11-GGUF<br>
155
  QwQ-LCoT- (7/14B) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF<br>
156
+ gemma-3 (4/12/27B) - https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF<br>
157
+ qwen3 (4/8B) - https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF <br><br>
158
 
159
  ...
160
  # Important -> The Systemprompt (some examples):
161
+ <li>The system prompt is weighted with a certain amount of influence alongside your question; it does not change your embedding/retrieval result. You can easily test this once without a system prompt or with a nonsensical one.</li>
162
 
163
  "You are a helpful assistant who provides an overview of ... under the aspects of ... .
164
  You use attached excerpts from the collection to generate your answers!
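For orientation, a RAG frontend typically combines the system prompt, the retrieved snippets, and your question into one request, roughly as sketched below. The formatting is a made-up illustration, not AnythingLLM's actual internal template - but it shows why the system prompt steers the tone of the answer while the snippets remain the only document content the model sees:

```python
def build_prompt(system_prompt, snippets, question):
    """Assemble the final prompt sent to the LLM: system prompt first,
    then the retrieved snippets as context, then the user's question."""
    context = "\n\n".join(f"[snippet {i+1}]\n{s}" for i, s in enumerate(snippets))
    return (
        f"{system_prompt}\n\n"
        f"Attached excerpts from the collection:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt(
    "You are a helpful assistant. Use ONLY the attached excerpts.",
    ["Chapter XYZ describes person ZYX's journey."],
    "What is described in chapter XYZ in relation to person ZYX?",
))
```

This is also why a nonsensical system prompt degrades the answer without changing which snippets are retrieved: retrieval happens before the prompt is assembled.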