ThingAI/QuarkTokenizer


ThingsAI committed 2 days ago

Commit c88472a · verified · 1 Parent(s): 1910046

Update README.md

Files changed (1)

README.md +0 -11

README.md CHANGED

@@ -44,17 +44,6 @@ Il tokenizer è stato addestrato su ~14M righe bilanciate EN/IT (50%/50%) proven
 EN/IT parity is a deliberate choice: tokenizers trained predominantly on English tend to use 2–3× more tokens to represent Italian text. This tokenizer is optimized for both languages.
-# Efficiency
-Tokens-per-character comparison on scientific and colloquial texts:
-| Language | Quark Tokenizer | cosmo2-tokenizer | Δ |
-|---|---|---|---|
-| English (scientific) | — | — | ~0% |
-| Italian (scientific) | — | — | **~−25%** |
-| Italian (colloquial) | — | — | **~−30%** |
-> The Quark tokenizer uses up to 30% fewer tokens for Italian text than tokenizers optimized only for English.
 # Special Tokens
 ```
 <unk>          → unknown
