---
language:
- en
- it
tags:
- quark
- causal-lm
- small-language-model
- gqa
- rope
- swiglu
- bash
- code
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---

# Quark-72M

**Quark-72M** is a compact, from-scratch autoregressive language model developed by [ThingAI](https://things-ai.org) as part of the **Quark** family of small language models. It is designed to be lightweight enough to run on consumer hardware while remaining architecturally modern, using Grouped-Query Attention, RoPE positional embeddings, SwiGLU feed-forward layers, and RMSNorm — the same building blocks found in contemporary frontier models, scaled down to ~72M parameters.

This is an **instruction-tuned** checkpoint, fine-tuned via SFT on top of a base model pre-trained on math, code, and reasoning-focused corpora.

---

## Table of Contents

- [Model Summary](#model-summary)
- [Architecture](#architecture)
- [Intended Use](#intended-use)
- [Quickstart](#quickstart)
- [Prompt Format](#prompt-format)
- [Generation Parameters](#generation-parameters)
- [Training Details](#training-details)
- [Tokenizer](#tokenizer)
- [Evaluation & Known Limitations](#evaluation--known-limitations)
- [Files in this Repository](#files-in-this-repository)
- [Citation](#citation)
- [License](#license)

---

## Model Summary

| | |
|---|---|
| **Developed by** | ThingAI |
| **Model type** | Decoder-only Transformer (causal language model) |
| **Parameters** | ~71.7M (embedding-tied) |
| **Languages** | English, Italian |
| **License** | Apache-2.0 |
| **Finetuned from** | Quark-72M base (pre-trained on math/code/reasoning mix) |
| **Repository** | [ThingAI/Quark-72M](https://huggingface.co/ThingAI/Quark-72M) |

Quark-72M-Instruct is part of a broader effort at ThingAI to build small, self-hostable language models that can be trained, fine-tuned, and served entirely on personal infrastructure — without dependency on third-party APIs. It trades raw capability for transparency, inspectability, and low resource requirements.

---

## Architecture

Quark-72M uses a standard decoder-only Transformer stack with several efficiency-oriented design choices common in modern small LLMs:

| Component | Detail |
|---|---|
| Layers | 14 |
| Hidden size (`d_model`) | 512 |
| Attention heads | 8 query heads |
| KV heads | 2 (Grouped-Query Attention, 4:1 ratio) |
| Head dimension | 64 |
| Feed-forward | SwiGLU, `d_ff` = 1344 |
| Normalization | RMSNorm (pre-norm placement) |
| Positional encoding | Rotary Position Embeddings (RoPE), θ = 10,000 |
| Activation | SiLU (within SwiGLU gate) |
| Vocabulary size | 65,536 (+ 2 reserved chat tokens) |
| Context length | 2048 tokens |
| Weight tying | Input embedding and output projection share weights |
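
The reported parameter count can be roughly reproduced from the table above. The sketch below is a back-of-the-envelope estimate, not code from the repository: it assumes no bias terms, two RMSNorm weight vectors per block, and one final RMSNorm, none of which the model card confirms. Under those assumptions the total comes out at roughly 71.6M, close to the reported ~71.7M.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not confirmed by the model card): no bias terms, two RMSNorm
# weight vectors per block, and one final RMSNorm before the output head.
VOCAB = 65_536      # base vocabulary; the 2 chat tokens add about 1K weights
D_MODEL = 512
N_LAYERS = 14
N_Q_HEADS = 8
N_KV_HEADS = 2
HEAD_DIM = 64
D_FF = 1344

# Counted once because input embedding and output projection are tied.
embedding = VOCAB * D_MODEL

attention = (
    D_MODEL * N_Q_HEADS * HEAD_DIM          # q_proj
    + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM   # k_proj and v_proj (GQA: only 2 heads)
    + N_Q_HEADS * HEAD_DIM * D_MODEL        # o_proj
)
mlp = 3 * D_MODEL * D_FF                    # gate_proj, up_proj, down_proj
norms = 2 * D_MODEL                         # pre-attention and pre-MLP RMSNorm

per_layer = attention + mlp + norms
total = embedding + N_LAYERS * per_layer + D_MODEL  # + final RMSNorm

print(f"embedding : {embedding / 1e6:.2f}M")
print(f"per layer : {per_layer / 1e6:.2f}M")
print(f"total     : {total / 1e6:.2f}M")
```

Nearly half of the budget sits in the tied embedding matrix, which is why weight tying matters at this scale.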

**Grouped-Query Attention (GQA).** Query heads are grouped 4-to-1 onto a smaller set of key/value heads, reducing the memory footprint of the KV cache during autoregressive generation without materially affecting representational capacity at this model scale.
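
To make the KV-cache saving concrete, the following sketch compares the cache size for this configuration against a hypothetical variant with 8 KV heads (standard multi-head attention). The helper `kv_cache_bytes` is illustrative and not part of the repository; it assumes bfloat16 cache entries (2 bytes each) and a single sequence at the full 2048-token context.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Values from the architecture table; bfloat16 storage is an assumption.
gqa = kv_cache_bytes(n_layers=14, n_kv_heads=2, head_dim=64, seq_len=2048)
mha = kv_cache_bytes(n_layers=14, n_kv_heads=8, head_dim=64, seq_len=2048)

print(f"GQA (2 KV heads): {gqa / 2**20:.1f} MiB")
print(f"MHA (8 KV heads): {mha / 2**20:.1f} MiB")
print(f"reduction       : {mha // gqa}x")
```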

**Rotary embeddings.** Positional information is injected by rotating the query and key vectors by position-dependent angles, rather than by adding positional embeddings to the token embeddings. As a result, attention scores depend on the relative offset between tokens rather than on their absolute positions.
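
A minimal pure-Python sketch of the rotation is shown below. It is an illustration of the general RoPE technique, not the code in `modeling_quark.py`: it rotates consecutive pairs of dimensions, whereas real implementations often pair dimensions differently and operate on batched tensors. The function names `rope` and `dot` and the toy vectors are assumptions made for the example.

```python
import math


def rope(vec, position, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        freq = theta ** (-i / dim)      # one rotation frequency per pair
        angle = position * freq
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * cos - y * sin, x * sin + y * cos]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (head_dim = 4 for readability).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Both pairs of positions are 2 tokens apart, so the scores should match
# even though the absolute positions differ.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)
```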

**SwiGLU feed-forward.** Each block's MLP uses a gated SiLU activation (`down_proj(silu(gate_proj(x)) * up_proj(x))`), which has empirically outperformed plain GELU/ReLU MLPs of comparable parameter count in most published small-LM ablations.
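
The expression above can be traced on a toy example. This is a dependency-free sketch with plain lists standing in for tensors; the weight values and the helper names `silu`, `matvec`, and `swiglu_mlp` are made up for illustration, and the real model uses `d_model` = 512 and `d_ff` = 1344.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_mlp(x, gate_w, up_w, down_w):
    """down_proj(silu(gate_proj(x)) * up_proj(x)) on plain lists."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(down_w, hidden)


# Toy shapes: d_model = 2, d_ff = 3 (hypothetical weights).
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_ff x d_model
up_w = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]    # d_ff x d_model
down_w = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]    # d_model x d_ff

y = swiglu_mlp([1.0, -2.0], gate_w, up_w, down_w)
print(y)
```

Note that the block uses three weight matrices instead of the two in a plain MLP, which is why `d_ff` is typically chosen smaller than the conventional 4 × `d_model`.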

---

## Intended Use

Quark-72M is intended primarily as:

- A **research and educational artifact** for studying small-scale LLM architecture, training, and inference.
- A **lightweight conversational base** for hobbyist or self-hosted projects where running a multi-billion-parameter model is impractical.
- A **building block** for further task-specific fine-tuning (e.g., shell/bash assistance, narrow-domain Q&A).

### Out of scope

Given its size (~72M parameters), this model is **not** intended for:

- Tasks requiring deep multi-step reasoning, long-context retrieval, or broad world knowledge.
- Safety-critical, medical, legal, or financial applications.
- Production use without further evaluation and, likely, additional fine-tuning for the target domain.

At this parameter count, fluency and instruction-following are best-effort: expect occasional repetition, factual unreliability, and shallow reasoning compared to billion-parameter-class models.

---

## Quickstart

This model uses a custom architecture and requires `trust_remote_code=True` to load `configuration_quark.py` and `modeling_quark.py` from this repository.

```bash
pip install transformers torch
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark2Tokenizer")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")

model = AutoModelForCausalLM.from_pretrained(
    "ThingAI/Quark-72M-Instruct",
    trust_remote_code=True,
    dtype=torch.bfloat16,   # or torch.float32 on CPU
).eval()

prompt = "<|im_start|>user\nWhat is the difference between a list and a tuple in Python?<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

output_ids = model.generate_text(
    input_ids,
    max_new_tokens=200,
    temperature=0.2,
    top_p=0.9,
    rep_penalty=1.15,
    eos_token_id=im_end_id,
)

response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

> **Note:** This model does not use the standard `.generate()` method from `transformers`. Generation is handled by a custom `generate_text()` method implemented directly on the model class, which exposes `temperature`, `top_p`, and `rep_penalty` (repetition penalty) arguments. This keeps the inference path simple and dependency-free, at the cost of not supporting beam search or some of the more advanced decoding strategies built into the standard HF generation utilities.

### Running on GPU

```python
model = AutoModelForCausalLM.from_pretrained(
    "ThingAI/Quark-72M-Instruct",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

input_ids = input_ids.cuda()
```

At ~72M parameters, this model runs comfortably on CPU for single-turn interactive use, and reaches well over 100 tokens/second on a single consumer GPU (tested on an RTX 3070).

---

## Prompt Format

Quark-72M was fine-tuned using a ChatML-style template with `<|im_start|>` / `<|im_end|>` role delimiters:

```
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{model response}<|im_end|>
```

These two control tokens are **not** part of the base 65,536-token vocabulary baked into the tokenizer's `vocab.json` — they must be registered at load time via `add_special_tokens`, as shown in the Quickstart example above. Forgetting this step will cause the tokenizer to fall back to byte-level fragmentation for these tokens, which the model was never trained on, and will noticeably degrade output quality.

For multi-turn conversations, concatenate turns directly:

```
<|im_start|>user
{turn 1 user message}<|im_end|>
<|im_start|>assistant
{turn 1 response}<|im_end|>
<|im_start|>user
{turn 2 user message}<|im_end|>
<|im_start|>assistant
```

This model does not currently ship a `chat_template` in `tokenizer_config.json`; constructing the prompt string manually (as above) is the recommended approach until that is added.
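
A small helper along these lines can build the prompt string from a list of messages. `build_prompt` is a hypothetical name, not a function shipped in this repository; the sketch assumes each message is a dict with `role` and `content` keys and that the final turn should be left open for the assistant.

```python
def build_prompt(messages):
    """Render a list of {"role", "content"} dicts in the ChatML-style format.

    The returned string ends with an open assistant turn, ready for generation.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "What does `ls -la` do?"},
    {"role": "assistant", "content": "It lists all files in long format."},
    {"role": "user", "content": "And how do I sort by size?"},
]
print(build_prompt(history))
```

The resulting string can be passed to `tokenizer.encode(..., add_special_tokens=False, return_tensors="pt")` exactly as in the Quickstart.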

---

## Generation Parameters

The custom `generate_text()` method supports the following arguments:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_new_tokens` | `int` | 200 | Maximum number of tokens to generate. |
| `temperature` | `float` | 0.7 | Sampling temperature. `0.0` triggers greedy decoding. Lower values (0.1–0.3) are recommended for more focused, less repetitive output given the model's size. |
| `top_p` | `float` | 0.9 | Nucleus sampling threshold. |
| `rep_penalty` | `float` | 1.0 | Repetition penalty applied to previously generated/seen tokens. Values around `1.1`–`1.2` substantially reduce the looping behavior common in small models. |
| `eos_token_id` | `int` | `None` | Token ID that halts generation early. Should be set to the ID of `<|im_end|>` for chat-style usage. |

The implementation includes a NaN/Inf guard: if the logits produced at any generation step are degenerate, the method falls back to greedy decoding for that step rather than propagating `nan` into the sampling procedure.
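
For readers unfamiliar with how these parameters interact, here is a simplified single-step sampler in plain Python. It is a sketch of the general techniques, not the code in `modeling_quark.py`: the repetition-penalty convention (divide positive logits, multiply negative ones), the order of operations, and the name `sample_next` are all assumptions.

```python
import math
import random


def sample_next(logits, seen_ids, temperature=0.7, top_p=0.9,
                rep_penalty=1.0, rng=random):
    """Pick the next token ID from a list of raw logits (one decoding step)."""
    logits = list(logits)

    # Repetition penalty: make already-seen tokens less likely.
    for tok in set(seen_ids):
        if logits[tok] > 0:
            logits[tok] /= rep_penalty
        else:
            logits[tok] *= rep_penalty

    # Greedy path: used for temperature 0 and as the NaN/Inf fallback.
    degenerate = any(not math.isfinite(x) for x in logits)
    if temperature == 0.0 or degenerate:
        finite = [x if math.isfinite(x) else float("-inf") for x in logits]
        return max(range(len(finite)), key=finite.__getitem__)

    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus (top-p): keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p, then sample within that set.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for tok in order:
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


print(sample_next([1.0, 3.0, 2.0], seen_ids=[], temperature=0.0))
```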

### Recommended starting configuration

```python
model.generate_text(
    input_ids,
    max_new_tokens=200,
    temperature=0.2,
    top_p=0.9,
    rep_penalty=1.15,
    eos_token_id=im_end_id,
)
```

Given the model's scale, low temperature combined with a moderate repetition penalty tends to produce the most coherent output. Higher temperatures increase output diversity but also increase the likelihood of incoherent or repetitive degeneration.

---

## Training Details

### Pre-training

The base model was pre-trained from scratch on a mixture weighted toward mathematical and code-heavy text, with a smaller proportion of chain-of-thought reasoning data:

| Source | Approx. share | Content |
|---|---|---|
| OpenWebMath | 45% | Mathematical text and derivations |
| The Stack (smol) | 45% | Source code across multiple languages |
| Magpie-Reasoning-150K | 4% | Distilled chain-of-thought traces |
| OpenThoughts-114k | 4% | Multi-step reasoning conversations |
| Reasoning-base-20k | 2% | Logical inference traces |

Pre-training targeted 5B tokens on the GQA + RoPE + SwiGLU architecture described above, using bfloat16 mixed precision, AdamW with cosine learning-rate decay, gradient clipping, and `torch.compile` for throughput.
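
As a rough guide to what the mixture means in absolute terms, the sketch below converts the approximate shares into per-source token budgets. It assumes the shares are measured in tokens of the 5B-token target, which the table does not state explicitly; the variable names are illustrative.

```python
TARGET_TOKENS = 5_000_000_000  # pre-training target from the text

# Approximate shares in percent, copied from the table above.
MIXTURE = {
    "OpenWebMath": 45,
    "The Stack (smol)": 45,
    "Magpie-Reasoning-150K": 4,
    "OpenThoughts-114k": 4,
    "Reasoning-base-20k": 2,
}
assert sum(MIXTURE.values()) == 100

# Assumption: shares are fractions of the token budget, not of documents.
budget = {name: TARGET_TOKENS * pct // 100 for name, pct in MIXTURE.items()}

for name, tokens in budget.items():
    print(f"{name:<24} {tokens / 1e9:.2f}B tokens")
```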

### Supervised fine-tuning (SFT)

The base checkpoint was subsequently fine-tuned on conversational and instruction-following data formatted with the ChatML-style template described above. Training data was pre-tokenized to `.npy` files ahead of time to eliminate streaming/tokenization bottlenecks during training, which previously made each epoch impractically slow.

**A note on capability:** because the pre-training mixture was weighted heavily toward math and code rather than general conversation, and because the SFT phase was comparatively short, this model's conversational fluency is modest relative to what its parameter count would suggest for a model trained primarily on dialogue. It reliably follows the chat template and produces grammatical, on-topic responses, but should not be expected to match the conversational depth of models trained end-to-end on large-scale dialogue corpora.

---

## Tokenizer

This model uses [ThingAI/Quark2Tokenizer](https://huggingface.co/ThingAI/Quark2Tokenizer), a byte-level BPE tokenizer with a 65,536-token vocabulary trained on a bilingual (Italian/English) and code-inclusive corpus.

Two additional control tokens (`<|im_start|>`, `<|im_end|>`) are required for chat-formatted inference and must be added at runtime as shown in the Quickstart section — they are not persisted in the tokenizer's saved vocabulary file.

---

## Evaluation & Known Limitations

This model has not yet been benchmarked against standard small-LM evaluation suites (e.g., PIQA, ARC-Easy, HellaSwag). Evaluation harness integration is planned but not yet published for this checkpoint.

Known limitations:

- **Repetition.** Like most sub-100M-parameter language models, Quark-72M-Instruct is prone to repetitive loops without a repetition penalty. A `rep_penalty` of 1.1–1.2 is recommended for most use cases.
- **Shallow reasoning.** The model can produce plausible-sounding text but should not be relied upon for multi-step logical or mathematical reasoning beyond simple cases, despite the math-heavy pre-training mixture.
- **Limited world knowledge.** At this scale, the model has memorized comparatively little factual knowledge and will frequently hallucinate specific facts, dates, or entities.
- **Short effective context.** While the architecture supports a 2048-token context window, reliable instruction-following degrades over long contexts more readily than in larger models.
- **No chat template metadata.** As noted above, `tokenizer_config.json` does not yet define a `chat_template`; prompts must be constructed manually.

This model is best understood as a research-grade, fully-inspectable small language model rather than a general-purpose assistant.

---

## Files in this Repository

| File | Purpose |
|---|---|
| `config.json` | Model configuration and `auto_map` registration for `trust_remote_code` |
| `configuration_quark.py` | `QuarkConfig` class (extends `PretrainedConfig`) |
| `modeling_quark.py` | Full model implementation (`QuarkForCausalLM`), including the custom `generate_text()` method |
| `model.safetensors` | Model weights (float32) |
| `generation_config.json` | Default generation hyperparameters |
| `tokenizer_config.json` | Tokenizer configuration pointing to the companion tokenizer repository |

Because this model relies on custom modeling code rather than a built-in `transformers` architecture, `trust_remote_code=True` is required when loading it via `AutoModelForCausalLM` or `AutoConfig`. As with any `trust_remote_code` model, it is good practice to review `modeling_quark.py` and `configuration_quark.py` before running them, and to pin a specific revision in production code so that the reviewed code is the code that runs.

---

## Citation

If you use Quark-72M in your work, please consider citing the repository:

```bibtex
@misc{quark72m,
  title  = {Quark-72M},
  author = {ThingAI},
  year   = {2026},
  url    = {https://huggingface.co/ThingAI/Quark-72M}
}
```

---

## License

*Apache-2.0*

---

*Quark-72M is developed and maintained by [ThingAI](https://things-ai.org) as part of an ongoing effort to build self-hostable, fully-inspectable small language models.*