Quark-1

Quark-1 is a ~4.2M-parameter causal transformer language model, trained entirely from scratch (no pretrained base) on a single T4 GPU. It's a small-scale hobby/research model, not intended to compete with production LLMs; see Limitations below for what that means in practice.

Model details

  • Architecture: decoder-only transformer, RoPE positional encoding, RMSNorm, SwiGLU-style MLP, tied input/output embeddings
  • Size: 4,229,568 parameters
  • Hidden dim: 192 | Layers: 6 | Heads: 4 | MLP mult: 4x
  • Context length: 256 tokens
  • Vocab: 8,192 tokens (custom byte-level BPE tokenizer, trained on the same corpus used for pretraining)
  • Optimizer: Muon (orthogonalized momentum via Newton-Schulz iteration) for all 2D hidden-layer weights, AdamW for embeddings/norms
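As a sanity check on the reported size, the parameter count can be rebuilt from the listed hyperparameters. The sketch below is a back-of-envelope calculation, not the model's actual code. It assumes no bias terms, a gated MLP with three weight matrices, and the common 2/3 scaling of the 4x MLP width (giving a hidden width of 512); none of these details are stated in the model card. RoPE and the head count add no parameters.

```python
# Back-of-envelope parameter count for the configuration listed above.
# Assumptions (not confirmed by the model card): no bias terms, a gated
# SwiGLU-style MLP with three weight matrices, and 2/3 scaling of the
# 4x MLP width so the gated MLP costs the same as a plain 4x MLP.

def count_parameters(vocab=8192, dim=192, layers=6, mlp_mult=4):
    embedding = vocab * dim                 # shared with the output head (tied)
    attention = 4 * dim * dim               # Q, K, V and output projections
    mlp_hidden = (2 * mlp_mult * dim) // 3  # 2/3 * 4 * 192 = 512 (assumed)
    mlp = 3 * dim * mlp_hidden              # gate, up and down projections
    norms = 2 * dim                         # two RMSNorm gain vectors per layer
    per_layer = attention + mlp + norms
    final_norm = dim                        # one RMSNorm before the output head
    return embedding + layers * per_layer + final_norm


total = count_parameters()
print(f"{total:,}")  # 4,229,568
```

Under these assumptions the total matches the reported 4,229,568. A non-gated MLP with two 192x768 matrices gives the same number, so the match does not confirm which MLP layout the model actually uses.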

Training pipeline

Quark-1 went through a full multi-stage pipeline rather than a single training run:

  1. Pretraining: TinyStories (60k stories) + a slice of WikiText-103 for register variety. 4,000 steps, cosine LR schedule with warmup. Final val loss: 2.45.
  2. Supervised fine-tuning (SFT): instruction-style examples derived from TinyStories completions ("Continue the story: ...", "Write a short story about ...", etc.), with loss masked to only train on the answer portion.
  3. DPO-style preference tuning: since real human preference data doesn't exist at this scale, "rejected" examples were synthetically constructed by degrading real completions (truncation, repetition, sentence shuffling). This teaches a "coherent > degenerate" preference direction using a legitimate technique for small-model alignment, not a fabricated reward signal.
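The synthetic "rejected" examples in step 3 could be built along the following lines. This is a minimal sketch, not the repo's code: the function names, the naive sentence splitter, and the choice of one random degradation per pair are all illustrative assumptions.

```python
import random
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def truncate(text, rng):
    # Cut the completion off at a random word boundary.
    words = text.split()
    if len(words) < 2:
        return text
    cut = rng.randint(1, len(words) - 1)
    return " ".join(words[:cut])


def repeat(text, rng, times=3):
    # Loop one sentence several times to imitate degenerate repetition.
    sentences = split_sentences(text)
    if not sentences:
        return text
    i = rng.randrange(len(sentences))
    return " ".join(sentences[: i + 1] + [sentences[i]] * times)


def shuffle(text, rng):
    # Reorder sentences so the story loses its narrative thread.
    sentences = split_sentences(text)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    if shuffled == sentences and len(sentences) > 1:
        shuffled = sentences[1:] + sentences[:1]  # rotate so the order changes
    return " ".join(shuffled)


def make_preference_pair(prompt, completion, seed=0):
    # The real completion is "chosen"; a degraded copy is "rejected".
    rng = random.Random(seed)
    degrade = rng.choice([truncate, repeat, shuffle])
    return {"prompt": prompt, "chosen": completion,
            "rejected": degrade(completion, rng)}


story = "Tom found a red ball. He took it to the park. Then he went home."
print(make_preference_pair("Continue the story:", story))
```

Each pair keeps the real completion as the preferred answer, so the preference signal only encodes "intact text beats damaged text", which is the direction the card describes.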

Inference-time reasoning scaffolds

Two decoding strategies are included in the training/inference code (see the original repo; this Hub upload contains the base weights + architecture only, ready to be paired with either):

  • Tree-of-Thought (beam search variant): samples multiple candidate continuations per step, scores them by log-probability minus a repetition penalty, and keeps the best beams.
  • Markovian RSA: based on the test-time compute method from Zyphra's ZAYA1-8B tech report (Recursive Self-Aggregation + bounded "Markovian" context). Samples N candidates per round, carries forward only C bounded-length tails into the next round's context (not the full history), and repeats for several rounds. Keeps per-round context size constant regardless of round count.
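The bounded-context idea behind the Markovian RSA loop can be illustrated with a toy sketch. It follows only the description above and is not the repo's implementation: `sample_fn` and `score_fn` are hypothetical stand-ins for model sampling and candidate scoring, and picking the C carried tails by score is an assumption, since the card does not say how they are selected.

```python
import random


def markovian_rounds(prompt_ids, sample_fn, score_fn,
                     n_candidates=4, carry=2, tail_len=16, rounds=3):
    """Run several sample-and-carry rounds with a bounded context.

    sample_fn(context_ids) -> list of new token ids    (hypothetical)
    score_fn(candidate_ids) -> float, higher is better (hypothetical)
    """
    tails = []           # bounded-length tails carried between rounds
    context_sizes = []   # recorded to show the context does not keep growing
    best = []
    for _ in range(rounds):
        # Context = prompt + carried tails only, never the full history.
        context = list(prompt_ids)
        for tail in tails:
            context += tail
        context_sizes.append(len(context))

        candidates = [sample_fn(context) for _ in range(n_candidates)]
        candidates.sort(key=score_fn, reverse=True)
        best = candidates[0]
        # Keep only the last tail_len tokens of the top `carry` candidates.
        tails = [c[-tail_len:] for c in candidates[:carry]]
    return best, context_sizes


rng = random.Random(0)


def toy_sample(context):
    # Stand-in for model sampling: 24 random token ids.
    return [rng.randrange(50) for _ in range(24)]


def toy_score(ids):
    # Stand-in for a quality score: reward token variety.
    return len(set(ids))


best, sizes = markovian_rounds(list(range(8)), toy_sample, toy_score)
print(sizes)  # [8, 40, 40]
```

After the first round the context is always at most `len(prompt_ids) + carry * tail_len` tokens, however many rounds are run, which is the property that matters with a 256-token context window.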

Usage

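import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Quark1Model is the custom architecture class; modeling_quark1.py must be importable.
from modeling_quark1 import Quark1Model

# Load the weights and switch to inference mode.
model = Quark1Model.from_pretrained("your-username/Quark-1")
model.eval()

# Fetch the custom byte-level BPE tokenizer from the same repo.
tokenizer_path = hf_hub_download("your-username/Quark-1", "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)

# Special tokens that frame a chat-style prompt.
BOS_ID = tokenizer.token_to_id("<bos>")
EOS_ID = tokenizer.token_to_id("<eos>")
USER_ID = tokenizer.token_to_id("<user>")
ASSIST_ID = tokenizer.token_to_id("<assistant>")

# Prompt layout: <bos> <user> prompt tokens <assistant>
prompt = "Tell me a simple story."
ids = [BOS_ID, USER_ID] + tokenizer.encode(prompt).ids + [ASSIST_ID]
input_ids = torch.tensor([ids])

# Sample up to 80 new tokens; no gradients are needed at inference time.
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=80, temperature=0.8, top_k=40, eos_id=EOS_ID)

# Drop the prompt tokens and decode only the generated continuation.
response = tokenizer.decode(out[0].tolist()[len(ids):], skip_special_tokens=True)
print(response)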

Limitations

Quark-1 is a 4.2M-parameter model; for scale, that is roughly 40,000x smaller than GPT-3 (175B parameters). It is trained almost entirely on TinyStories, a synthetic dataset of simple children's stories, so:

  • It writes grammatical, coherent short stories in a simple register. That's genuinely what it's good at.
  • It does not have general world knowledge, cannot answer factual questions reliably, and cannot follow complex or multi-part instructions.
  • It has no real reasoning ability. The Tree-of-Thought and Markovian RSA decoding strategies included in the training code are external search/aggregation scaffolds: they improve coherence over greedy/plain sampling, but they do not grant the model symbolic or multi-step reasoning it doesn't otherwise have.
  • Entity/coreference tracking is weak (e.g. character names can get confused across sentences), a known limitation at this parameter scale.

This model is best understood as a from-scratch training and inference-scaffolding exercise, not a general-purpose assistant.

Training compute

Trained on a single Google Colab T4 GPU. Full pretraining run: ~4,000 steps, ~17 minutes.
