---
language:
- it
- en
license: apache-2.0
tags:
- text-generation
- causal-lm
- bilingual
- italian
- english
- small-language-model
- trained-from-scratch
- quark
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: Quark-135m-Bilingual
  results: []
---

## Overview

Quark-135m-Bilingual is a compact bilingual language model designed for Italian and English, built entirely from scratch by [ThingsAI](https://things-ai.org). It represents the second generation of the Quark model family, featuring a custom bilingual BPE tokenizer and a modern transformer architecture.

This is the **base pretrained model**. An SFT (instruction-tuned) version trained on bilingual conversational data is available for chat applications.

## Model Details

| | |
|---|---|
| **Parameters** | 135M (143.98M with embeddings) |
| **Architecture** | Decoder-only Transformer |
| **Vocabulary** | 65,536 tokens (custom bilingual BPE) |
| **Context Length** | 2,048 tokens |
| **Precision** | BF16 |
| **Languages** | Italian, English |
| **Tokenizer** | [ThingAI/QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer) |
| **License** | Apache 2.0 |

## Architecture

Quark-135m follows a SmolLM-inspired design optimized for efficiency at small scale:

| Component | Details |
|---|---|
| Attention | Grouped Query Attention (GQA) |
| Heads | 9 query heads, 3 KV heads |
| Head Dimension | 64 |
| Model Dimension | 576 |
| Layers | 30 |
| FFN Dimension | 1,536 |
| FFN Activation | SwiGLU |
| Normalization | RMSNorm (pre-attention & pre-FFN) |
| Positional Encoding | Rotary Position Embeddings (RoPE) |
| Weight Tying | Yes (embedding ↔ LM head) |
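
The dimensions in the table are enough for a rough parameter count. The sketch below is an estimate, not the authors' accounting: it assumes bias-free linear layers, three projection matrices per SwiGLU FFN, two RMSNorm weight vectors per layer plus one final norm, and a single tied embedding matrix sized by the 65,536-token vocabulary listed under Model Details. The function name and dictionary keys are illustrative.

```python
# Rough parameter count derived from the architecture table.
# Assumptions (not confirmed by the card): no bias terms, SwiGLU with three
# projection matrices, one final RMSNorm, embedding counted once (tied).

VOCAB = 65_536
D_MODEL = 576
N_LAYERS = 30
N_Q_HEADS = 9
N_KV_HEADS = 3
HEAD_DIM = 64
D_FFN = 1_536


def estimate_parameters() -> dict:
    """Return an estimated parameter breakdown for the listed dimensions."""
    q_proj = D_MODEL * N_Q_HEADS * HEAD_DIM
    # GQA: keys and values are projected to only N_KV_HEADS heads each.
    kv_proj = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    o_proj = N_Q_HEADS * HEAD_DIM * D_MODEL
    attention = q_proj + kv_proj + o_proj
    ffn = 3 * D_MODEL * D_FFN          # gate, up, and down projections
    norms = 2 * D_MODEL                # pre-attention and pre-FFN RMSNorm
    per_layer = attention + ffn + norms
    embedding = VOCAB * D_MODEL        # shared with the LM head
    non_embedding = N_LAYERS * per_layer + D_MODEL  # + final RMSNorm
    return {
        "attention": attention,
        "ffn": ffn,
        "per_layer": per_layer,
        "embedding": embedding,
        "non_embedding": non_embedding,
        "total": non_embedding + embedding,
    }


if __name__ == "__main__":
    for name, value in estimate_parameters().items():
        print(f"{name:>14}: {value:,}")
```

Under these assumptions the estimate comes to 143,952,192 parameters, close to the 143.98M reported under Model Details; the small gap presumably comes from details the table does not list.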

## Training

### Pretraining Data

Quark-135m v0.2 was pretrained on **15.7B tokens** from a curated bilingual mix:

| Subset | Weight | Source |
|---|---|---|
| FineWeb-2 (Italian) | 29% | `HuggingFaceFW/fineweb-2` [ita_Latn] |
| CulturaX (Italian) | 14% | `uonlp/CulturaX` [it] |
| Wikipedia (Italian) | 7% | `wikimedia/wikipedia` [20231101.it] |
| FineWeb (English) | 36% | `HuggingFaceFW/fineweb` [sample-10BT] |
| Wikipedia (English) | 7% | `wikimedia/wikipedia` [20231101.en] |
| The Stack (Code) | 7% | `bigcode/the-stack-smol` |
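
To make the mix concrete, the weights can be converted into approximate token counts. This sketch assumes each weight is a share of the 15.7B-token total; the card does not say whether weights are measured in tokens, documents, or sampling probability. The language grouping (`it`, `en`, `code`) is a labeling added here for illustration.

```python
# Approximate token budget per subset, assuming weights are token shares.

TOTAL_TOKENS_B = 15.7  # billions of tokens, as stated in the card

# (subset, group, weight in percent), copied from the table above
MIX = [
    ("FineWeb-2 (Italian)", "it", 29),
    ("CulturaX (Italian)", "it", 14),
    ("Wikipedia (Italian)", "it", 7),
    ("FineWeb (English)", "en", 36),
    ("Wikipedia (English)", "en", 7),
    ("The Stack (Code)", "code", 7),
]


def tokens_per_subset(total_b: float = TOTAL_TOKENS_B) -> dict:
    """Billions of tokens per subset under the token-share assumption."""
    return {name: total_b * pct / 100 for name, _, pct in MIX}


def share_per_group() -> dict:
    """Summed weight (in percent) per language group."""
    shares = {}
    for _, group, pct in MIX:
        shares[group] = shares.get(group, 0) + pct
    return shares


if __name__ == "__main__":
    for name, billions in tokens_per_subset().items():
        print(f"{name:<22} {billions:6.2f}B")
    print(share_per_group())
```

Under that assumption the mix is 50% Italian, 43% English, and 7% code, or roughly 7.85B, 6.75B, and 1.10B tokens.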

## Chat Format

The model uses a simple chat template:

```
<|user|>
{user message}
<|end|>
<|assistant|>
{model response}
<|end|>
```
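
A prompt in this format can be assembled with plain string formatting. The helper below is a minimal sketch: `format_chat` and its `add_generation_prompt` parameter are hypothetical names, and the newline placement follows the prompt string used in the generation example in this card rather than any published template file.

```python
# Build a prompt string in the <|user|> / <|assistant|> / <|end|> format.
# `format_chat` is an illustrative helper, not part of the Quark repository.

USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"


def format_chat(turns, add_generation_prompt=True):
    """Format (role, text) pairs; role must be "user" or "assistant".

    With add_generation_prompt=True the string ends with an open assistant
    tag so the model continues with its reply.
    """
    tags = {"user": USER_TAG, "assistant": ASSISTANT_TAG}
    parts = []
    for role, text in turns:
        if role not in tags:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{tags[role]}\n{text}\n{END_TAG}\n")
    if add_generation_prompt:
        parts.append(f"{ASSISTANT_TAG}\n")
    return "".join(parts)


if __name__ == "__main__":
    print(format_chat([("user", "Who are you?")]))
```

Ending the string with an open `<|assistant|>` tag leaves the reply for the model to complete, up to its own `<|end|>`.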

## Tokenizer

Quark-135m v0.2 uses a custom bilingual BPE tokenizer ([ThingAI/QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)) specifically designed for Italian and English:

- **Vocabulary**: 65,536 tokens
- **Type**: Byte-Pair Encoding (BPE)
- **Languages**: Balanced Italian + English coverage
- **Published**: [ThingAI/QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)

## Usage

### Loading the Model

Quark uses a custom architecture. To load and run inference:

```python
import torch
import json
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark-135m-v0.2")

# Load model (requires custom architecture classes; see repository)
# Full architecture code available in the model repository
```

### Generation Example

```python
# `model` is the custom Quark model loaded as described above. This snippet assumes
# it is in eval mode on the GPU and returns logits of shape (batch, seq_len, vocab).

# Italian: "What is artificial intelligence?"
prompt = "<|user|>\nCos'è l'intelligenza artificiale?\n<|end|>\n<|assistant|>\n"
ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

# Token-by-token generation with temperature and top-k sampling
with torch.no_grad():
    for _ in range(200):  # generate at most 200 new tokens
        logits = model(ids)[:, -1, :] / 0.7  # last-position logits, temperature 0.7
        topk = torch.topk(logits, 40)  # keep the 40 highest-scoring tokens
        probs = torch.softmax(topk.values, -1)
        # Sample a position within the top-k set, then map it back to a token ID
        idx = topk.indices.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, idx], -1)
        if idx.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(ids[0], skip_special_tokens=False))
```
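
The sampling step in the loop combines temperature scaling with top-k filtering. The standalone sketch below reproduces that single step on a plain list of logits using only the standard library, so the logic can be checked without a GPU or the model. `sample_top_k` is a hypothetical helper, not part of the Quark repository, and the toy logits are made up.

```python
# One temperature + top-k sampling step on a list of logits (pure Python).
import math
import random


def sample_top_k(logits, k=40, temperature=0.7, rng=random):
    """Sample one token index from the k highest logits after temperature scaling."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    scaled = [x / temperature for x in logits]
    # Indices of the k largest scaled logits (all of them if k exceeds the vocabulary).
    top = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:k]
    # Numerically stable softmax weights restricted to the kept logits;
    # random.choices normalizes the weights itself.
    peak = scaled[top[0]]
    weights = [math.exp(scaled[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up values for illustration
    rng = random.Random(0)
    print([sample_top_k(toy_logits, k=2, rng=rng) for _ in range(10)])
```

With `k=1` the function reduces to greedy decoding. Lower temperatures push the distribution toward the highest-scoring token, and higher values flatten it.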

## Limitations

- **Scale**: At 135M parameters, the model has limited factual knowledge and reasoning capacity
- **Hallucination**: The model frequently generates plausible but incorrect information
- **Mathematics**: Cannot reliably perform arithmetic beyond simple operations
- **Code**: Generates syntactically plausible but often non-functional code
- **Vocabulary overhead**: The 65k vocabulary consumes ~26% of model parameters in the embedding layer, reducing transformer capacity, a key lesson for v0.3
- **Pretraining plateau**: Loss plateaued at ~4.6 due to the vocab/parameter ratio imbalance
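
The ~26% figure can be checked from numbers already in this card: with tied weights, the embedding matrix holds vocabulary size × model dimension parameters. The check below assumes the 143.98M total from Model Details counts that matrix once; `embedding_share` is an illustrative helper name.

```python
# Share of parameters held by the (tied) embedding matrix.

VOCAB_SIZE = 65_536
D_MODEL = 576
TOTAL_PARAMS = 143.98e6  # reported total, including embeddings


def embedding_share(vocab_size=VOCAB_SIZE, d_model=D_MODEL, total=TOTAL_PARAMS):
    """Fraction of all parameters that sit in the embedding matrix."""
    return vocab_size * d_model / total


if __name__ == "__main__":
    print(f"embedding parameters: {VOCAB_SIZE * D_MODEL:,}")
    print(f"share of total: {embedding_share():.1%}")
```

This gives 37,748,736 embedding parameters, about 26.2% of the total, consistent with the figure above.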

## Comparison with v0.1

| | Quark-135m v0.1 | Quark-135m v0.2 |
|---|---|---|
| **Tokenizer** | cosmo2 (49k) | QuarkTokenizer (65k) |
| **Languages** | Math-focused (EN) | Bilingual IT+EN |
| **Training Data** | 15B tokens (math-heavy) | 15.7B tokens (bilingual web + code) |
| **Final Loss** | ~3.5-4.0 | 4.635 |
| **Strengths** | Arithmetic, math reasoning | Italian fluency, bilingual chat |

## Citation

```bibtex
@misc{quark2026,
  title={Quark: A Family of Compact Bilingual Language Models},
  author={Di Nicola, Michelangelo},
  year={2026},
  publisher={ThingsAI},
  url={https://huggingface.co/ThingAI/Quark-135m-v0.2}
}
```

## Links

- [ThingsAI Website](https://things-ai.org)
- [Things Chat](https://chat.things-ai.org)
- [QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)
- [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)

*Built from scratch by ThingsAI 🇮🇹*