metadata
language:
  - it
  - en
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - bilingual
  - italian
  - english
  - small-language-model
  - trained-from-scratch
  - quark
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: Quark-135m-Bilingual
    results: []

Overview

Quark-135m-Bilingual is a compact bilingual language model designed for Italian and English, built entirely from scratch by ThingsAI. It represents the second generation of the Quark model family, featuring a custom bilingual BPE tokenizer and a modern transformer architecture.

This is the base pretrained model. An SFT (instruction-tuned) version trained on bilingual conversational data is available for chat applications.

Model Details

| Property | Value |
|---|---|
| Parameters | 135M (143.98M with embeddings) |
| Architecture | Decoder-only Transformer |
| Vocabulary | 65,536 tokens (custom bilingual BPE) |
| Context Length | 2,048 tokens |
| Precision | BF16 |
| Languages | Italian, English |
| Tokenizer | ThingAI/QuarkTokenizer |
| License | Apache 2.0 |

Architecture

Quark-135m follows a SmolLM-inspired design optimized for efficiency at small scale:

| Component | Details |
|---|---|
| Attention | Grouped Query Attention (GQA) |
| Heads | 9 query heads, 3 KV heads |
| Head Dimension | 64 |
| Model Dimension | 576 |
| Layers | 30 |
| FFN Dimension | 1,536 |
| FFN Activation | SwiGLU |
| Normalization | RMSNorm (pre-attention & pre-FFN) |
| Positional Encoding | Rotary Position Embeddings (RoPE) |
| Weight Tying | Yes (embedding ↔ LM head) |
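
The hyperparameters above are enough to estimate the parameter count by hand. The sketch below is a back-of-the-envelope check, not the repository's own code. It assumes no bias terms, one final RMSNorm after the last layer, and no learned parameters in RoPE; none of these are stated in this card.

```python
# Hyperparameters copied from the tables in this card.
VOCAB = 65_536
D_MODEL = 576
N_LAYERS = 30
N_Q_HEADS = 9
N_KV_HEADS = 3
HEAD_DIM = 64
D_FFN = 1_536


def count_parameters():
    """Return (transformer body, embedding) parameter counts."""
    # Attention: Q and output projections are full width; K and V are
    # narrower because 3 KV heads are shared across 9 query heads (GQA).
    q_proj = D_MODEL * N_Q_HEADS * HEAD_DIM
    kv_proj = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    o_proj = N_Q_HEADS * HEAD_DIM * D_MODEL
    attention = q_proj + kv_proj + o_proj
    # SwiGLU uses three matrices: gate, up, and down projections.
    ffn = 3 * D_MODEL * D_FFN
    # One RMSNorm weight vector before attention and one before the FFN.
    norms = 2 * D_MODEL
    per_layer = attention + ffn + norms
    # Assumption: a single final RMSNorm after the last layer.
    body = N_LAYERS * per_layer + D_MODEL
    # Tied weights: the embedding matrix doubles as the LM head, counted once.
    embedding = VOCAB * D_MODEL
    return body, embedding


body, embedding = count_parameters()
total = body + embedding
print(f"transformer body: {body / 1e6:.2f}M")
print(f"embedding:        {embedding / 1e6:.2f}M")
print(f"total:            {total / 1e6:.2f}M")
```

Under these assumptions the total comes to about 143.95M, close to the 143.98M listed in Model Details. The card does not say where the small remaining difference comes from.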

Training

Pretraining Data

Quark-135m v0.2 was pretrained on 15.7B tokens from a curated bilingual mix:

| Subset | Weight | Source |
|---|---|---|
| FineWeb-2 (Italian) | 29% | HuggingFaceFW/fineweb-2 [ita_Latn] |
| CulturaX (Italian) | 14% | uonlp/CulturaX [it] |
| Wikipedia (Italian) | 7% | wikimedia/wikipedia [20231101.it] |
| FineWeb (English) | 36% | HuggingFaceFW/fineweb [sample-10BT] |
| Wikipedia (English) | 7% | wikimedia/wikipedia [20231101.en] |
| The Stack (Code) | 7% | bigcode/the-stack-smol |
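
To make the mix concrete, the following sketch turns the weights into approximate token budgets. It assumes the percentages are shares of the 15.7B-token total; the card does not say whether the weights apply to tokens, documents, or sampling probability. The grouping into Italian, English, and code is for illustration only.

```python
TOTAL_TOKENS = 15_700_000_000  # 15.7B tokens, as stated in this card

# (subset, language group, weight in percent) copied from the table above.
MIX = [
    ("FineWeb-2 (Italian)", "it", 29),
    ("CulturaX (Italian)", "it", 14),
    ("Wikipedia (Italian)", "it", 7),
    ("FineWeb (English)", "en", 36),
    ("Wikipedia (English)", "en", 7),
    ("The Stack (Code)", "code", 7),
]

# Assumption: each weight is a share of the total token count.
budget = {name: TOTAL_TOKENS * pct // 100 for name, _, pct in MIX}

# Aggregate the percentages per language group.
by_group = {}
for _, group, pct in MIX:
    by_group[group] = by_group.get(group, 0) + pct

for name, tokens in budget.items():
    print(f"{name:<22} {tokens / 1e9:.2f}B tokens")
print(by_group)
```

Read this way, Italian sources make up half of the mix, English 43%, and code 7%.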

Chat Format

The model uses a simple chat template:

<|user|>
{user message}
<|end|>
<|assistant|>
{model response}
<|end|>
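
A small helper can assemble prompts in this format. The sketch below is illustrative: `build_prompt` is a hypothetical name, not part of the repository, and the newline placement is inferred from the prompt string used in the Generation Example.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def build_prompt(turns, add_generation_prompt=True):
    """Render (role, text) pairs into the chat template shown above.

    Hypothetical helper: roles are "user" or "assistant".
    """
    tags = {"user": USER, "assistant": ASSISTANT}
    parts = []
    for role, text in turns:
        # Each turn is: role tag, message, end tag, one per line.
        parts.append(f"{tags[role]}\n{text}\n{END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{ASSISTANT}\n")
    return "".join(parts)


prompt = build_prompt([("user", "Cos'è l'intelligenza artificiale?")])
print(prompt)
```

Setting `add_generation_prompt=False` renders complete conversations without opening a new assistant turn.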

Tokenizer

Quark-135m v0.2 uses a custom bilingual BPE tokenizer (ThingAI/QuarkTokenizer) specifically designed for Italian and English:

  • Vocabulary: 65,536 tokens
  • Type: Byte-Pair Encoding (BPE)
  • Languages: Balanced Italian + English coverage
  • Published: ThingAI/QuarkTokenizer

Usage

Loading the Model

Quark uses a custom architecture. To load and run inference:

import torch
import json
from safetensors.torch import load_file
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark-135m-v0.2")

# Load model (requires custom architecture classes; see repository)
# Full architecture code available in the model repository

Generation Example

prompt = "<|user|>\nCos'è l'intelligenza artificiale?\n<|end|>\n<|assistant|>\n"
ids = tokenizer.encode(prompt, return_tensors="pt").to("cuda")

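# Token-by-token generation with temperature and top-k sampling
with torch.no_grad():
    for _ in range(200):  # generate at most 200 new tokens
        logits = model(ids)[:, -1, :] / 0.7  # last-position logits, temperature 0.7
        topk = torch.topk(logits, 40)  # keep the 40 highest-scoring tokens
        probs = torch.softmax(topk.values, -1)  # renormalize over the kept tokens
        idx = topk.indices.gather(-1, torch.multinomial(probs, 1))  # map the draw back to a vocabulary id
        ids = torch.cat([ids, idx], -1)  # append the sampled token to the sequence
        if idx.item() == tokenizer.eos_token_id:  # stop at end of sequence
            break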

print(tokenizer.decode(ids[0], skip_special_tokens=False))
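
The sampling step in the loop can be studied in isolation without loading the model. The sketch below re-implements temperature scaling and top-k sampling with the Python standard library on a toy logits list. `sample_top_k` and the toy values are hypothetical and only mirror the logic of the loop above.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=0.7, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    # Temperatures below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Indices of the k largest scaled logits, best first.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for stability).
    peak = scaled[top[0]]
    weights = [math.exp(scaled[i] - peak) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one vocabulary id according to the renormalized probabilities.
    choice = rng.choices(top, weights=probs, k=1)[0]
    return choice, dict(zip(top, probs))


# Hypothetical scores for a 5-token vocabulary.
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
token_id, probs = sample_top_k(toy_logits, k=2, rng=random.Random(0))
print(token_id, probs)
```

With `k=2`, only ids 3 and 0 can be sampled, and id 3 receives the larger share of the probability mass.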

Limitations

  • Scale: At 135M parameters, the model has limited factual knowledge and reasoning capacity
  • Hallucination: The model frequently generates plausible but incorrect information
  • Mathematics: Cannot reliably perform arithmetic beyond simple operations
  • Code: Generates syntactically plausible but often non-functional code
  • Vocabulary overhead: The 65k vocabulary consumes ~26% of model parameters in the embedding layer, reducing transformer capacity; a key lesson for v0.3
  • Pretraining plateau: Loss plateaued at ~4.6 due to the vocab/parameter ratio imbalance
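
The vocabulary overhead figure can be reproduced from numbers in this card. The sketch below derives the non-embedding parameter count from the stated 143.98M total, then compares the embedding share for the 65,536-token vocabulary with a hypothetical 49,152-token one. The smaller size is an assumption chosen for illustration, in the range of the "49k" tokenizer listed for v0.1.

```python
D_MODEL = 576
TOTAL_PARAMS = 143_980_000  # "143.98M with embeddings" from Model Details
VOCAB_V02 = 65_536
VOCAB_SMALL = 49_152  # assumption: hypothetical smaller vocabulary


def embedding_share(vocab_size, body_params, d_model=D_MODEL):
    """Fraction of all parameters held by a tied embedding matrix."""
    embedding = vocab_size * d_model
    return embedding / (body_params + embedding)


# Everything that is not the embedding matrix stays fixed across vocabularies.
body_params = TOTAL_PARAMS - VOCAB_V02 * D_MODEL

share_v02 = embedding_share(VOCAB_V02, body_params)
share_small = embedding_share(VOCAB_SMALL, body_params)
print(f"65,536 tokens: {share_v02:.1%} of parameters in embeddings")
print(f"49,152 tokens: {share_small:.1%} of parameters in embeddings")
```

With the same transformer body, the smaller vocabulary would bring the total to roughly 134.5M parameters and lower the embedding share from about 26% to about 21%.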

Comparison with v0.1

| | Quark-135m v0.1 | Quark-135m v0.2 |
|---|---|---|
| Tokenizer | cosmo2 (49k) | QuarkTokenizer (65k) |
| Languages | Math-focused (EN) | Bilingual IT+EN |
| Training Data | 15B tokens (math-heavy) | 15.7B tokens (bilingual web + code) |
| Final Loss | ~3.5-4.0 | 4.635 |
| Strengths | Arithmetic, math reasoning | Italian fluency, bilingual chat |

Citation

@misc{quark2026,
  title={Quark: A Family of Compact Bilingual Language Models},
  author={Di Nicola, Michelangelo},
  year={2026},
  publisher={ThingsAI},
  url={https://huggingface.co/ThingAI/Quark-135m-v0.2}
}

Links

Built from scratch by ThingsAI 🇮🇹