Update README.md

README.md CHANGED

@@ -63,10 +63,14 @@ Your document will be embedded in x × 1024t chunks (snippets),<br>
 You can receive 14 snippets of 1024t each (~14,000t) from your ~10,000-word document, leaving ~2,000t (of 16,000t) for the answer (~1,000 words, about 2 pages)
 <br>
 You can adjust this to your needs, e.g. 8 snippets of 2048t, or 28 snippets of 512t ... (every time you change the chunk length, the document must be embedded again)
-<ul style="line-height: 1;">
-
-
-
+<ul style="line-height: 1;"><br>
+English vs. German differs by about 50%<br>
+~5,000 characters is one page of a book (regardless of German/English), but German words are longer, which means more tokens per word<br>
+the example is English; for German you can add approx. 50% more tokens (1,000 words ≈ 1,800t)<br>
+<li>1200t (~1,000 words, ~5,000 characters) ~0.1GB, approx. one page in a small font</li>
+<li>8000t (~6,000 words) ~0.8GB VRAM usage</li>
+<li>16000t (~12,000 words) ~1.5GB VRAM usage</li>
+<li>32000t (~24,000 words) ~3GB VRAM usage</li>
 </ul>
 <br>
 here is a tokenizer calculator<br>
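The snippet/answer budget in the README text above is simple arithmetic, and can be sketched as a small calculator. Only the 16,000t context window, 1024t chunk size, and 14 snippets come from the README's example; the function name and structure are illustrative.

```python
def snippet_budget(context_tokens=16000, chunk_tokens=1024, num_snippets=14):
    """Return (tokens consumed by snippets, tokens left for the answer)."""
    used = chunk_tokens * num_snippets  # 14 * 1024 = 14336 (~14,000t in the README)
    left = context_tokens - used        # 16000 - 14336 = 1664 (~2,000t in the README)
    return used, left

print(snippet_budget())  # (14336, 1664)
```

The same function shows why the alternative settings trade off against each other: 28 snippets of 512t use the same budget as 14 of 1024t, while 8 snippets of 2048t (16,384t) would slightly exceed a 16,000t window.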
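The rules of thumb in the added list (the English example's ~1.2 tokens per word, approx. 50% more tokens for German, and roughly 0.1GB VRAM per 1,000 tokens) can be collected into a rough converter. All constants are the README's own estimates, not measured values, and the function names are illustrative.

```python
def words_to_tokens(words, german=False):
    """Rough token estimate from a word count (README rule of thumb)."""
    tokens = words * 1.2  # English example: ~1,000 words ~ 1,200t
    if german:
        tokens *= 1.5     # German: approx. 50% more tokens per word count
    return round(tokens)

def vram_gb(tokens):
    """Rough VRAM estimate: ~0.1GB per 1,000 tokens, per the listed points."""
    # Matches 8000t ~0.8GB and 16000t ~1.5-1.6GB; gives 3.2GB for 32000t
    # where the README rounds to ~3GB.
    return round(tokens * 0.1 / 1000, 2)

print(words_to_tokens(1000))               # 1200
print(words_to_tokens(1000, german=True))  # 1800
print(vram_gb(8000))                       # 0.8
```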