Instructions for using ThingAI/Quark-135m-Bilingual with libraries, inference servers, and local apps.

Libraries

How to use ThingAI/Quark-135m-Bilingual with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ThingAI/Quark-135m-Bilingual", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ThingAI/Quark-135m-Bilingual", trust_remote_code=True, dtype="auto")


vLLM

How to use ThingAI/Quark-135m-Bilingual with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ThingAI/Quark-135m-Bilingual"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/ThingAI/Quark-135m-Bilingual

SGLang

How to use ThingAI/Quark-135m-Bilingual with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ThingAI/Quark-135m-Bilingual" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ThingAI/Quark-135m-Bilingual" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


	---
	language:
	- it
	- en
	license: apache-2.0
	tags:
	- text-generation
	- causal-lm
	- bilingual
	- italian
	- english
	- small-language-model
	- trained-from-scratch
	- quark
	library_name: transformers
	pipeline_tag: text-generation
	model-index:
	- name: Quark-135m-Bilingual
	  results: []
	---


	## Overview

	Quark-135m-Bilingual is a compact bilingual language model designed for Italian and English, built entirely from scratch by [ThingsAI](https://things-ai.org). It represents the second generation of the Quark model family, featuring a custom bilingual BPE tokenizer and a modern transformer architecture.

	This is the base pretrained model. An SFT (instruction-tuned) version trained on bilingual conversational data is available for chat applications.

	## Model Details

	| Property | Value |
	|---|---|
	| Parameters | 135M (143.98M with embeddings) |
	| Architecture | Decoder-only Transformer |
	| Vocabulary | 65,536 tokens (custom bilingual BPE) |
	| Context Length | 2,048 tokens |
	| Precision | BF16 |
	| Languages | Italian, English |
	| Tokenizer | [ThingAI/QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer) |
	| License | Apache 2.0 |

	## Architecture

	Quark-135m follows a SmolLM-inspired design optimized for efficiency at small scale:

	| Component | Details |
	|---|---|
	| Attention | Grouped Query Attention (GQA) |
	| Heads | 9 query heads, 3 KV heads |
	| Head Dimension | 64 |
	| Model Dimension | 576 |
	| Layers | 30 |
	| FFN Dimension | 1,536 |
	| FFN Activation | SwiGLU |
	| Normalization | RMSNorm (pre-attention & pre-FFN) |
	| Positional Encoding | Rotary Position Embeddings (RoPE) |
	| Weight Tying | Yes (embedding ↔ LM head) |
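
The figures in the table are enough for a rough parameter count. The sketch below is a back-of-the-envelope estimate, not the repository's own accounting. It assumes no bias terms, a single final RMSNorm, and a tied embedding counted once; none of these is confirmed by the tables above, and `count_parameters` is a hypothetical helper name.

```python
# Rough parameter count derived from the architecture table.
# Assumptions (not confirmed by the README): no bias terms, one final RMSNorm,
# and the tied embedding / LM head matrix counted once.
VOCAB = 65_536
D_MODEL = 576
N_LAYERS = 30
N_Q_HEADS = 9
N_KV_HEADS = 3
HEAD_DIM = 64
D_FFN = 1_536


def count_parameters() -> dict:
    embedding = VOCAB * D_MODEL
    q_proj = D_MODEL * N_Q_HEADS * HEAD_DIM
    kv_proj = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM  # K and V share the 3 KV heads (GQA)
    o_proj = N_Q_HEADS * HEAD_DIM * D_MODEL
    attention = q_proj + kv_proj + o_proj
    ffn = 3 * D_MODEL * D_FFN  # SwiGLU: gate, up, and down projections
    norms = 2 * D_MODEL  # pre-attention and pre-FFN RMSNorm weights
    per_layer = attention + ffn + norms
    transformer = N_LAYERS * per_layer + D_MODEL  # plus the assumed final norm
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "transformer": transformer,
        "total": embedding + transformer,
    }


counts = count_parameters()
embedding_share = 100 * counts["embedding"] / counts["total"]
print(f"embedding:   {counts['embedding'] / 1e6:.2f}M")
print(f"transformer: {counts['transformer'] / 1e6:.2f}M")
print(f"total:       {counts['total'] / 1e6:.2f}M")
print(f"embedding share: {embedding_share:.1f}%")
```

Under these assumptions the estimate comes to about 143.95M parameters, close to the 143.98M stated in Model Details, with the embedding matrix taking roughly 26% of the total.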

	## Training

	### Pretraining Data

	Quark-135m v0.2 was pretrained on 15.7B tokens from a curated bilingual mix:

	| Subset | Weight | Source |
	|---|---|---|
	| FineWeb-2 (Italian) | 29% | `HuggingFaceFW/fineweb-2` [ita_Latn] |
	| CulturaX (Italian) | 14% | `uonlp/CulturaX` [it] |
	| Wikipedia (Italian) | 7% | `wikimedia/wikipedia` [20231101.it] |
	| FineWeb (English) | 36% | `HuggingFaceFW/fineweb` [sample-10BT] |
	| Wikipedia (English) | 7% | `wikimedia/wikipedia` [20231101.en] |
	| The Stack (Code) | 7% | `bigcode/the-stack-smol` |
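
To get a feel for the mix, the weights can be turned into approximate token budgets. This is a minimal sketch that assumes each weight is a fraction of the 15.7B pretraining tokens; the README does not say whether the weights are token fractions or sampling probabilities. The group labels (`it`, `en`, `code`) are illustrative.

```python
# Approximate token budget per subset.
# Assumption: each weight is a share of the 15.7B pretraining tokens.
TOTAL_TOKENS = 15.7e9

# subset name -> (group label, weight in percent)
MIX = {
    "FineWeb-2 (Italian)": ("it", 29),
    "CulturaX (Italian)": ("it", 14),
    "Wikipedia (Italian)": ("it", 7),
    "FineWeb (English)": ("en", 36),
    "Wikipedia (English)": ("en", 7),
    "The Stack (Code)": ("code", 7),
}


def tokens_for(weight_percent: int) -> float:
    """Token budget implied by a percentage weight."""
    return TOTAL_TOKENS * weight_percent / 100


# Sum the weights per group to see the language balance.
share_by_group: dict = {}
for group, weight in MIX.values():
    share_by_group[group] = share_by_group.get(group, 0) + weight

for name, (group, weight) in MIX.items():
    print(f"{name:<22} {weight:>3}%  ~{tokens_for(weight) / 1e9:.2f}B tokens")
for group, weight in share_by_group.items():
    print(f"{group:<5} {weight:>3}%  ~{tokens_for(weight) / 1e9:.2f}B tokens")
```

Read this way, the mix is about half Italian, a little over two fifths English, and the remainder code.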



	## Chat Format

	The model uses a simple chat template:

	```
	<|user|>
	{user message}
	<|end|>
	<|assistant|>
	{model response}
	<|end|>
	```
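
A prompt in this format can be assembled with a few lines of string handling. The helper below is a minimal sketch: `build_prompt` and its `add_generation_prompt` flag are hypothetical names, and the exact newline placement is assumed from the template above. If the tokenizer ships its own chat template, that should take precedence.

```python
# Minimal prompt builder for the chat format shown above.
# Assumption: every marker and every message body sits on its own line.
def build_prompt(messages: list, add_generation_prompt: bool = True) -> str:
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}\n<|end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "Cos'è l'intelligenza artificiale?"}])
print(prompt)
```

Generation would then stop at the first `<|end|>` the model emits, assuming that marker is the tokenizer's end-of-sequence token.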

	## Tokenizer

	Quark-135m v0.2 uses a custom bilingual BPE tokenizer ([ThingAI/QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)) specifically designed for Italian and English:

	- Vocabulary: 65,536 tokens
	- Type: Byte-Pair Encoding (BPE)
	- Languages: Balanced Italian + English coverage
	- Published: [ThingAI/QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)

	## Usage

	### Loading the Model

	Quark uses a custom architecture. To load and run inference:

	```python
	import torch
	import json
	from safetensors.torch import load_file
	from transformers import AutoTokenizer

	# Load tokenizer
	tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark-135m-v0.2")

	# Load model (requires custom architecture classes — see repository)
	# Full architecture code available in the model repository
	```

	### Generation Example

	```python
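	# Assumes `tokenizer` is loaded as above and `model` is the custom Quark model,
	# already on the GPU, whose forward pass returns logits of shape (batch, seq, vocab).
	prompt = "<|user|>\nCos'è l'intelligenza artificiale?\n<|end|>\n<|assistant|>\n"
	ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

	# Token-by-token generation with temperature and top-k sampling
	with torch.no_grad():
	    for _ in range(200):
	        logits = model(ids)[:, -1, :] / 0.7  # temperature
	        topk = torch.topk(logits, 40)  # keep the 40 most likely tokens
	        probs = torch.softmax(topk.values, -1)
	        # Sample within the top-k set, then map back to vocabulary ids
	        idx = topk.indices.gather(-1, torch.multinomial(probs, 1))
	        ids = torch.cat([ids, idx], -1)
	        if idx.item() == tokenizer.eos_token_id:
	            break

	print(tokenizer.decode(ids[0], skip_special_tokens=False))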
	```

	## Limitations

	- Scale: At 135M parameters, the model has limited factual knowledge and reasoning capacity
	- Hallucination: The model frequently generates plausible but incorrect information
	- Mathematics: Cannot reliably perform arithmetic beyond simple operations
	- Code: Generates syntactically plausible but often non-functional code
	- Vocabulary overhead: The 65k vocabulary consumes ~26% of model parameters in the embedding layer, reducing transformer capacity — a key lesson for v0.3
	- Pretraining plateau: Loss plateaued at ~4.6 due to the vocab/parameter ratio imbalance

	## Comparison with v0.1

	| | Quark-135m v0.1 | Quark-135m v0.2 |
	|---|---|---|
	| Tokenizer | cosmo2 (49k) | QuarkTokenizer (65k) |
	| Languages | Math-focused (EN) | Bilingual IT+EN |
	| Training Data | 15B tokens (math-heavy) | 15.7B tokens (bilingual web + code) |
	| Final Loss | ~3.5-4.0 | 4.635 |
	| Strengths | Arithmetic, math reasoning | Italian fluency, bilingual chat |



	## Citation

	```bibtex
	@misc{quark2026,
	title={Quark: A Family of Compact Bilingual Language Models},
	author={Di Nicola, Michelangelo},
	year={2026},
	publisher={ThingsAI},
	url={https://huggingface.co/ThingAI/Quark-135m-v0.2}
	}
	```

	## Links

	- 🌐 [ThingsAI Website](https://things-ai.org)
	- 💬 [Things Chat](https://chat.things-ai.org)
	- 🔤 [QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)
	- 📊 [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)



	Built from scratch by ThingsAI 🇮🇹