ThingAI/QuarkTokenizer

@@ -1,199 +1,155 @@
 ---
-library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+language:
+- en
+- it
+license: apache-2.0
+tags:
+- tokenizer
+- bpe
+- bilingual
+- italian
+- english
+- quark
+model_type: quark
+vocab_size: 65536
 ---
+# Quark Tokenizer
+Bilingual **EN + IT** byte-level BPE tokenizer developed for the **Quark** model family by [OvercastLab / ThingAI](https://huggingface.co/ThingAI).
+## Features
+| Property | Value |
+|---|---|
+| Algorithm | Byte-level BPE |
+| Vocab size | 65,536 (2¹⁶) |
+| Languages | English + Italian |
+| Special tokens | 64 |
+| Context length | 2048 (extensible) |
+| Compatibility | 🤗 `transformers`, `tokenizers` |
+## Training Corpus
+The tokenizer was trained on about 13.5M lines, roughly balanced between English (7.0M) and Italian (6.5M), drawn from:
+| Dataset | Language | Lines |
+|---|---|---|
+| Wikipedia EN | EN | 3,000,000 |
+| Pile Uncopyrighted | EN | 2,000,000 |
+| Falcon RefinedWeb | EN | 2,000,000 |
+| Wikipedia IT | IT | 3,000,000 |
+| FineWeb-2 (`ita_Latn`) | IT | 2,000,000 |
+| MADLAD-400 IT | IT | 1,500,000 |
+The near-parity between EN and IT is a deliberate choice: tokenizers trained mostly on English tend to use 2–3× more tokens to represent Italian text. This tokenizer is optimized for both languages.
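As a quick check of the balance, the per-language totals can be recomputed from the table above. This is a minimal sketch: the figures are copied from the table, and the name `corpus_lines` is only illustrative.

```python
# Line counts copied from the corpus table above: dataset -> (language, lines).
corpus_lines = {
    "Wikipedia EN": ("EN", 3_000_000),
    "Pile Uncopyrighted": ("EN", 2_000_000),
    "Falcon RefinedWeb": ("EN", 2_000_000),
    "Wikipedia IT": ("IT", 3_000_000),
    "FineWeb-2 (ita_Latn)": ("IT", 2_000_000),
    "MADLAD-400 IT": ("IT", 1_500_000),
}

# Sum the lines per language.
totals = {}
for language, lines in corpus_lines.values():
    totals[language] = totals.get(language, 0) + lines

# Share of each language in the whole corpus.
grand_total = sum(totals.values())
shares = {language: lines / grand_total for language, lines in totals.items()}

for language in sorted(totals):
    print(f"{language}: {totals[language]:,} lines ({shares[language]:.1%})")
print(f"Total: {grand_total:,} lines")
```

With the table's figures this gives 7.0M English and 6.5M Italian lines, about 52% and 48% of 13.5M.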
+## Efficiency
+Comparison of tokens per character on scientific and colloquial text:
+| Language | Quark Tokenizer | cosmo2-tokenizer | Δ |
+|---|---|---|---|
+| English (scientific) | not reported | not reported | ~0% |
+| Italian (scientific) | not reported | not reported | **~−25%** |
+| Italian (colloquial) | not reported | not reported | **~−30%** |
+> The Quark tokenizer uses up to 30% fewer tokens on Italian text than tokenizers optimized only for English.
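The metric behind this comparison can be sketched as follows. This is an illustration, not the script used to produce the table: `tokens_per_char` and `relative_delta` are hypothetical helper names, and the two toy encoders only stand in for real tokenizers (with `transformers`, one would pass a tokenizer's `encode` method instead).

```python
from typing import Callable, List


def tokens_per_char(encode: Callable[[str], List], text: str) -> float:
    """Return the number of tokens produced per input character."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


def relative_delta(candidate: float, baseline: float) -> float:
    """Relative change of candidate versus baseline (negative means fewer tokens)."""
    return (candidate - baseline) / baseline


# Toy stand-ins for real tokenizers (assumptions for illustration only).
def word_encoder(text: str) -> List[str]:
    # One token per whitespace-separated word.
    return text.split()


def char_pair_encoder(text: str) -> List[str]:
    # One token per two characters.
    return [text[i:i + 2] for i in range(0, len(text), 2)]


sample = "Il cielo è azzurro e il sole splende."
candidate = tokens_per_char(word_encoder, sample)
baseline = tokens_per_char(char_pair_encoder, sample)
print(f"candidate: {candidate:.3f} tokens/char")
print(f"baseline:  {baseline:.3f} tokens/char")
print(f"delta:     {relative_delta(candidate, baseline):+.0%}")
```

A negative delta means the candidate needs fewer tokens than the baseline for the same text, which is how the Δ column above should be read.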
+## Special Tokens
+```
+<unk>          → unknown
+<s>            → beginning of sequence (BOS), id: 1
+</s>           → end of sequence (EOS), id: 2
+<pad>          → padding
+<|system|>     → system turn
+<|user|>       → user turn
+<|assistant|>  → assistant turn
+<|endofturn|>  → explicit end of turn
+<|thinking|>   → start of reasoning (chain-of-thought)
+<|/thinking|>  → end of reasoning
+<|reserved_0|> … <|reserved_53|>  → reserved slots (tool use, multimodal, etc.)
+```
+Total: **64 special tokens**
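The total can be reproduced by enumerating the tokens in the listing above. The sketch below only mirrors that listing; the variable names are illustrative and the list order is not meant to imply token ids.

```python
# Named special tokens, as they appear in the listing above.
named_tokens = [
    "<unk>", "<s>", "</s>", "<pad>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|endofturn|>",
    "<|thinking|>", "<|/thinking|>",
]

# Reserved slots <|reserved_0|> through <|reserved_53|>.
reserved_tokens = [f"<|reserved_{i}|>" for i in range(54)]

special_tokens = named_tokens + reserved_tokens
print(len(named_tokens), len(reserved_tokens), len(special_tokens))
```

Ten named tokens plus 54 reserved slots account for the 64 special tokens.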
+## Chat Template
+The tokenizer includes a chat template compatible with `apply_chat_template`:
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("ThingAI/QuarkTokenizer")
+messages = [
+    {"role": "system",    "content": "Sei Quark, un assistente AI creato da OvercastLab."},
+    {"role": "user",      "content": "Cos'è la derivata di una funzione?"},
+    {"role": "assistant", "content": "La derivata misura la variazione istantanea..."},
+]
+# Render the conversation as one string using the bundled chat template.
+# The conversation already ends with an assistant turn, so no generation prompt is appended.
+text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
+print(text)
+```
+## Basic Usage
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("ThingAI/QuarkTokenizer")
+# Encoding
+text = "Il cielo è azzurro e il sole splende."
+ids = tok.encode(text)
+print(f"Tokens: {len(ids)}")  # roughly 9 tokens
+# Decoding
+decoded = tok.decode(ids, skip_special_tokens=True)
+print(decoded)
+# Batch encoding with padding; return_tensors="pt" requires PyTorch
+batch = tok(["Hello world!", "Ciao mondo!"], padding=True, return_tensors="pt")
+```
+## Integration with Quark Models
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+tok = AutoTokenizer.from_pretrained("ThingAI/QuarkTokenizer")
+model = AutoModelForCausalLM.from_pretrained("ThingAI/Quark-135M")
+# Tokenize the prompt and generate up to 50 new tokens
+inputs = tok("La matematica è", return_tensors="pt")
+output = model.generate(**inputs, max_new_tokens=50)
+print(tok.decode(output[0], skip_special_tokens=True))
+```
+## Design Choices
+**Why 65,536?**
+It is a power of two (2¹⁶), which aligns well with GPU/TPU hardware. It is larger than the GPT-2 (50,257) and LLaMA-2 (32,000) vocabularies but more compact than LLaMA-3 (128,256), balancing representation efficiency against the size of the embedding layer.
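Two consequences of this size can be checked with a few lines of arithmetic. The hidden size used below is a hypothetical value chosen only for illustration; the actual Quark model dimensions are not stated here.

```python
import struct

VOCAB_SIZE = 65_536  # 2**16, as stated above
assert VOCAB_SIZE == 2 ** 16

# The largest token id (65,535) fits exactly in an unsigned 16-bit integer,
# so each token id can be stored in 2 bytes.
max_id = VOCAB_SIZE - 1
packed = struct.pack("<H", max_id)
print(len(packed), "bytes per token id")

# Embedding matrix size = vocabulary size x hidden size.
HIDDEN_SIZE = 576  # hypothetical hidden size, not taken from the source
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
print(f"{embedding_params:,} embedding parameters")
```

A vocabulary one entry larger would no longer fit in 16 bits per id, and every additional vocabulary entry adds one row of `HIDDEN_SIZE` parameters to the embedding matrix.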
+**Why byte-level BPE?**
+It guarantees complete coverage of any UTF-8 sequence without falling back to `<unk>`, which makes it robust on emoji, Italian accented characters (à, è, ì, ò, ù), mathematical symbols, and source code.
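The coverage argument rests on the fact that every string has a UTF-8 byte representation, so a base alphabet of 256 byte values is enough before any merges are learned. A minimal sketch using only the standard library (it does not call the Quark tokenizer itself, and the sample strings are arbitrary):

```python
# Any text can be turned into UTF-8 bytes, and each byte is in range(256).
samples = ["perché", "città", "🙂", "∑ x²", "def f(x): return x"]

for text in samples:
    byte_values = list(text.encode("utf-8"))
    # Every byte value is a valid symbol of the 256-entry base alphabet.
    assert all(0 <= b < 256 for b in byte_values)
    # Decoding the bytes restores the original text exactly.
    assert bytes(byte_values).decode("utf-8") == text
    print(f"{text!r}: {len(text)} characters -> {len(byte_values)} bytes")
```

Accented letters and emoji take more than one byte each, which is why learned merges matter for efficiency even though coverage is already complete at the byte level.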
+**Why 50% Italian?**
+Standard tokenizers (GPT-2, LLaMA) are trained on predominantly English corpora and penalize Romance languages with a 2–3× overhead in token count. The 50/50 balance is meant to remove this disparity for Italian while keeping full competence in English.
+## Quark Family
+| Model | Parameters | Pretraining tokens | Status |
+|---|---|---|---|
+| Quark-135M v1 | 135M | 15B | ✅ Released |
+| Quark-135M v2 | 135M | 65B | 🔄 In training |
+## License
+Apache 2.0: free to use, including for commercial purposes.
+## Citation
+```bibtex
+@misc{quark2025,
+  title   = {Quark: A Bilingual EN/IT Language Model},
+  author  = {OvercastLab / ThingAI},
+  year    = {2025},
+  url     = {https://huggingface.co/ThingAI/quark-tokenizer}
+}
+```
+---
+*Developed by [OvercastLab](https://things-ai.org) · Made in Italy 🇮🇹*