Libraries

How to use ThingAI/Quark-135m-Bilingual with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ThingAI/Quark-135m-Bilingual", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ThingAI/Quark-135m-Bilingual", trust_remote_code=True, dtype="auto")

vLLM

How to use ThingAI/Quark-135m-Bilingual with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ThingAI/Quark-135m-Bilingual"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
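
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="ThingAI/Quark-135m-Bilingual",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the generated reply as a string. Passing `base_url="http://localhost:30000"` should target a server on another port, such as the SGLang setup.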

Use Docker

docker model run hf.co/ThingAI/Quark-135m-Bilingual

SGLang

How to use ThingAI/Quark-135m-Bilingual with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ThingAI/Quark-135m-Bilingual" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ThingAI/Quark-135m-Bilingual" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

metadata

language:
  - it
  - en
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - bilingual
  - italian
  - english
  - small-language-model
  - trained-from-scratch
  - quark
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: Quark-135m-Bilingual
    results: []

Overview

Quark-135m-Bilingual is a compact bilingual language model designed for Italian and English, built entirely from scratch by ThingsAI. It represents the second generation of the Quark model family, featuring a custom bilingual BPE tokenizer and a modern transformer architecture.

This is the base pretrained model. An SFT (instruction-tuned) version trained on bilingual conversational data is available for chat applications.

Model Details


| Property | Value |
|---|---|
| Parameters | 135M (143.98M with embeddings) |
| Architecture | Decoder-only Transformer |
| Vocabulary | 65,536 tokens (custom bilingual BPE) |
| Context Length | 2,048 tokens |
| Precision | BF16 |
| Languages | Italian, English |
| Tokenizer | ThingAI/QuarkTokenizer |
| License | Apache 2.0 |
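
As a rough guide to memory requirements, the weight footprint follows directly from the parameter count and the BF16 precision listed above. This is a back-of-the-envelope sketch: it covers weights only and ignores activations, the KV cache, and framework overhead.

```python
BYTES_PER_BF16 = 2          # bfloat16 stores each value in 2 bytes
total_params = 143.98e6     # "with embeddings" figure from the table above

weight_bytes = total_params * BYTES_PER_BF16
weight_mb = weight_bytes / 1e6       # decimal megabytes
weight_mib = weight_bytes / 2**20    # binary mebibytes

print(f"{weight_mb:.1f} MB ({weight_mib:.1f} MiB)")
```

This comes to roughly 288 MB of weights.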

Architecture

Quark-135m follows a SmolLM-inspired design optimized for efficiency at small scale:

| Component | Details |
|---|---|
| Attention | Grouped Query Attention (GQA) |
| Heads | 9 query heads, 3 KV heads |
| Head Dimension | 64 |
| Model Dimension | 576 |
| Layers | 30 |
| FFN Dimension | 1,536 |
| FFN Activation | SwiGLU |
| Normalization | RMSNorm (pre-attention & pre-FFN) |
| Positional Encoding | Rotary Position Embeddings (RoPE) |
| Weight Tying | Yes (embedding ↔ LM head) |
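
The parameter count can be approximated from these dimensions. The sketch below is an estimate, not the repository's own accounting: it assumes no bias terms, one weight vector per RMSNorm, and a final RMSNorm before the LM head, none of which the card states.

```python
# Values from the tables above
vocab_size = 65_536
d_model = 576
n_layers = 30
n_q_heads, n_kv_heads, head_dim = 9, 3, 64
d_ffn = 1_536

# Assumptions (not stated in the card): no bias terms, one RMSNorm weight
# vector per norm, and a final RMSNorm before the LM head.
q_dim = n_q_heads * head_dim      # 576, equal to d_model
kv_dim = n_kv_heads * head_dim    # 192, each KV head serves 3 query heads

attn = d_model * q_dim + 2 * d_model * kv_dim + q_dim * d_model  # Q, K, V, O
ffn = 3 * d_model * d_ffn         # SwiGLU: gate, up, and down projections
norms = 2 * d_model               # pre-attention and pre-FFN RMSNorm
per_layer = attn + ffn + norms

embedding = vocab_size * d_model  # counted once because of weight tying
total = n_layers * per_layer + d_model + embedding

print(f"per layer: {per_layer:,}")
print(f"embedding: {embedding:,} ({embedding / total:.1%} of total)")
print(f"total:     {total:,}")
```

Under these assumptions the total is 143,952,192, close to the reported 143.98M, and the embedding table accounts for about 26% of it, in line with the figure quoted under Limitations. The small remaining gap presumably comes from details the table does not list.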

Training

Pretraining Data

Quark-135m v0.2 was pretrained on 15.7B tokens from a curated bilingual mix:

| Subset | Weight | Source |
|---|---|---|
| FineWeb-2 (Italian) | 29% | `HuggingFaceFW/fineweb-2` [ita_Latn] |
| CulturaX (Italian) | 14% | `uonlp/CulturaX` [it] |
| Wikipedia (Italian) | 7% | `wikimedia/wikipedia` [20231101.it] |
| FineWeb (English) | 36% | `HuggingFaceFW/fineweb` [sample-10BT] |
| Wikipedia (English) | 7% | `wikimedia/wikipedia` [20231101.en] |
| The Stack (Code) | 7% | `bigcode/the-stack-smol` |
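
To make the mix concrete, the weights can be converted into approximate token counts. This sketch assumes the percentages are shares of the 15.7B-token budget (rather than of documents or bytes), which the card does not state explicitly.

```python
total_tokens = 15.7e9  # pretraining budget stated above

# Weights (percent) from the table above
mix = {
    "FineWeb-2 (Italian)": 29,
    "CulturaX (Italian)": 14,
    "Wikipedia (Italian)": 7,
    "FineWeb (English)": 36,
    "Wikipedia (English)": 7,
    "The Stack (Code)": 7,
}
assert sum(mix.values()) == 100  # the weights cover the whole budget

# Assumption: each weight is a share of the token budget.
tokens = {name: total_tokens * pct / 100 for name, pct in mix.items()}
for name, n in tokens.items():
    print(f"{name:<22} {n / 1e9:5.2f}B")

italian = sum(n for name, n in tokens.items() if "Italian" in name)
english = sum(n for name, n in tokens.items() if "English" in name)
code = tokens["The Stack (Code)"]
print(f"Italian {italian / 1e9:.2f}B, English {english / 1e9:.2f}B, "
      f"code {code / 1e9:.2f}B")
```

On this reading, Italian sources make up half of the budget (about 7.85B tokens), English 43%, and code the remaining 7%.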

Chat Format

The model uses a simple chat template:

<|user|>
{user message}
<|end|>
<|assistant|>
{model response}
<|end|>
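
A prompt in this format can be assembled with a small helper. This is a minimal sketch: `format_chat` is a hypothetical function, not part of the repository, and it assumes that markers and messages are separated by single newlines and that only the `user` and `assistant` roles shown above exist.

```python
def format_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the template above."""
    parts = []
    for message in messages:
        role = message["role"]
        # Assumption: the template defines no roles beyond these two.
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}\n<|end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_chat([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

Setting `add_generation_prompt=False` renders a finished conversation without the trailing open assistant turn, which may be useful when preparing training text.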

Tokenizer

Quark-135m v0.2 uses a custom bilingual BPE tokenizer (ThingAI/QuarkTokenizer) specifically designed for Italian and English:

Vocabulary: 65,536 tokens
Type: Byte-Pair Encoding (BPE)
Languages: Balanced Italian + English coverage
Published: ThingAI/QuarkTokenizer

Usage

Loading the Model

Quark uses a custom architecture. To load and run inference:

import torch
import json
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark-135m-v0.2")

# Load model (requires custom architecture classes — see repository)
# Full architecture code available in the model repository

Generation Example

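# Assumes `model` is an instance of the custom Quark architecture (see the
# repository) with weights loaded, and that calling it returns logits of
# shape (batch, sequence, vocab).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "<|user|>\nCos'è l'intelligenza artificiale?\n<|end|>\n<|assistant|>\n"
ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

# Token-by-token generation with temperature and top-k sampling
with torch.no_grad():
    for _ in range(200):  # generate at most 200 new tokens
        logits = model(ids)[:, -1, :] / 0.7  # last position, temperature 0.7
        topk = torch.topk(logits, 40)  # keep the 40 highest-scoring tokens
        probs = torch.softmax(topk.values, -1)  # renormalize over the top 40
        # Sample a position within the top-k set, then map it to a vocabulary id
        idx = topk.indices.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, idx], -1)
        if idx.item() == tokenizer.eos_token_id:  # stop at end of sequence
            break

print(tokenizer.decode(ids[0], skip_special_tokens=False))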
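
The sampling step inside the loop (temperature scaling, top-k filtering, then a multinomial draw) can be illustrated without a model. The pure-Python sketch below applies the same three steps to a toy list of logits; `sample_top_k` is a hypothetical helper, not part of the repository.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=0.7, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    scaled = [x / temperature for x in logits]
    # Indices of the k largest scaled logits, best first.
    top = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:k]
    # Unnormalized softmax over the kept logits (subtract the max for stability).
    peak = scaled[top[0]]
    weights = [math.exp(scaled[i] - peak) for i in top]
    # choices() normalizes the weights, so this is a multinomial draw.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With `k=2`, only token ids 0 and 1 can ever be drawn, and id 0 is drawn more often because its logit is higher. A lower temperature would sharpen that preference further.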

Limitations

Scale: At 135M parameters, the model has limited factual knowledge and reasoning capacity
Hallucination: The model frequently generates plausible but incorrect information
Mathematics: Cannot reliably perform arithmetic beyond simple operations
Code: Generates syntactically plausible but often non-functional code
Vocabulary overhead: The 65,536-token embedding table holds about 37.7M parameters (65,536 × 576), roughly 26% of the model, which leaves less capacity for the transformer layers; this is a key lesson for v0.3
Pretraining plateau: Pretraining loss plateaued at ~4.6 (final loss 4.635), which is attributed to the vocab/parameter ratio imbalance

Comparison with v0.1

| | Quark-135m v0.1 | Quark-135m v0.2 |
|---|---|---|
| Tokenizer | cosmo2 (49k) | QuarkTokenizer (65k) |
| Languages | Math-focused (EN) | Bilingual IT+EN |
| Training Data | 15B tokens (math-heavy) | 15.7B tokens (bilingual web + code) |
| Final Loss | ~3.5-4.0 | 4.635 |
| Strengths | Arithmetic, math reasoning | Italian fluency, bilingual chat |

Citation

@misc{quark2026,
  title={Quark: A Family of Compact Bilingual Language Models},
  author={Di Nicola, Michelangelo},
  year={2026},
  publisher={ThingsAI},
  url={https://huggingface.co/ThingAI/Quark-135m-v0.2}
}

Built from scratch by ThingsAI 🇮🇹