Quark-1
Quark-1 is a ~4.2M-parameter causal transformer language model, trained entirely from scratch (no pretrained base) on a single T4 GPU. It's a small-scale hobby/research model, not intended to compete with production LLMs; see Limitations below for what that means in practice.
Model details
- Architecture: decoder-only transformer, RoPE positional encoding, RMSNorm, SwiGLU-style MLP, tied input/output embeddings
- Size: 4,229,568 parameters
- Hidden dim: 192 | Layers: 6 | Heads: 4 | MLP mult: 4x
- Context length: 256 tokens
- Vocab: 8,192 tokens (custom byte-level BPE tokenizer, trained on the same corpus used for pretraining)
- Optimizer: Muon (orthogonalized momentum via Newton-Schulz iteration) for all 2D hidden-layer weights, AdamW for embeddings/norms
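As a sanity check, the stated parameter count can be reproduced from the dimensions above. The sketch below assumes bias-free linear layers, two RMSNorm gain vectors per layer plus one final norm, and a gated MLP with hidden size 512 (that is, 4x scaled by 2/3, a common SwiGLU convention). None of these details are confirmed by this card; a plain two-matrix MLP with hidden size 768 would give the same total. RoPE adds no learned parameters, and the tied embedding is counted once.

```python
# Values taken from the model details above
vocab_size = 8192
d_model = 192
n_layers = 6

# Assumption (not stated in the card): SwiGLU hidden size of 512 = (2/3) * 4 * 192
mlp_hidden = 512

embedding = vocab_size * d_model      # shared with the output head (tied)
attention = 4 * d_model * d_model     # Q, K, V and output projections, no biases
mlp = 3 * d_model * mlp_hidden        # gate, up and down projections, no biases
norms = 2 * d_model                   # two RMSNorm gain vectors per layer
per_layer = attention + mlp + norms
final_norm = d_model                  # one RMSNorm before the output head

total = embedding + n_layers * per_layer + final_norm
print(f"{total:,}")  # prints 4,229,568
```

Under these assumptions roughly 37% of the parameters sit in the embedding table, which is typical for models this small.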
Training pipeline
Quark-1 went through a full multi-stage pipeline rather than a single training run:
- Pretraining: TinyStories (60k stories) + a slice of WikiText-103 for register variety. 4,000 steps, cosine LR schedule with warmup. Final val loss: 2.45.
- Supervised fine-tuning (SFT): instruction-style examples derived from TinyStories completions ("Continue the story: ...", "Write a short story about ...", etc.), with loss masked to only train on the answer portion.
- DPO-style preference tuning: since real human preference data doesn't exist at this scale, "rejected" examples were synthetically constructed by degrading real completions (truncation, repetition, sentence shuffling). This teaches a "coherent > degenerate" preference direction using a legitimate technique for small-model alignment, not a fabricated reward signal.
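The SFT loss masking and the synthetic "rejected" examples described above can be sketched roughly as follows. This is a minimal illustration, not the training code: the function names, the `-100` ignore index (PyTorch's cross-entropy default), the truncation fraction, and the repeat count are all assumptions.

```python
import random
import re

IGNORE_INDEX = -100  # assumed: the default ignore_index of PyTorch cross-entropy

def sft_labels(prompt_ids, answer_ids):
    """Return (input_ids, labels) so that only answer tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def truncate(text, keep=0.5):
    """Cut the completion off partway through (by word count)."""
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

def repeat_last(text, times=3):
    """Append extra copies of the final sentence to mimic a repetition loop."""
    sentences = split_sentences(text)
    return " ".join(sentences + [sentences[-1]] * times)

def shuffle_sentences(text, rng):
    """Reorder sentences so the story loses its narrative order."""
    sentences = split_sentences(text)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    if shuffled == sentences and len(sentences) > 1:
        shuffled = sentences[1:] + sentences[:1]  # force a visible change
    return " ".join(shuffled)

def make_pair(prompt, chosen, rng):
    """Build one preference record with a randomly degraded 'rejected' side."""
    degraders = [truncate, repeat_last, lambda t: shuffle_sentences(t, rng)]
    rejected = rng.choice(degraders)(chosen)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

story = "Tom had a red ball. He lost it in the park. A dog found it. Tom was happy."
pair = make_pair("Continue the story:", story, random.Random(0))
print(pair["rejected"])
```

The labels here are unshifted; the usual next-token shift is assumed to happen inside the loss computation.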
Inference-time reasoning scaffolds
Two decoding strategies are included in the training/inference code (see the original repo; this Hub upload contains the base weights + architecture only, ready to be paired with either):
- Tree-of-Thought (beam search variant): samples multiple candidate continuations per step, scores them by log-probability minus a repetition penalty, and keeps the best beams.
- Markovian RSA: based on the test-time compute method from Zyphra's ZAYA1-8B tech report (Recursive Self-Aggregation + bounded "Markovian" context). Samples N candidates per round, carries forward only C bounded-length tails into the next round's context (not the full history), and repeats for several rounds. Keeps per-round context size constant regardless of round count.
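The control flow of both strategies can be illustrated with a toy stand-in for the model. Everything below is a hedged sketch rather than the repo's implementation: `toy_logprobs` is a hypothetical replacement for the real network, the bigram-count repetition penalty is an assumed form, and the rule for choosing which C candidates to carry forward (least repetition) is also an assumption, since the description above does not specify it.

```python
import math
import random
from collections import Counter

VOCAB = list(range(8))  # toy vocabulary; the real model has 8,192 tokens

def toy_logprobs(context):
    """Hypothetical stand-in for the model's next-token log-probabilities."""
    last = context[-1] if context else 0
    weights = [1.0 + (tok + last) % 3 for tok in VOCAB]
    total = sum(weights)
    return [math.log(w / total) for w in weights]

def repetition_penalty(tokens, weight=0.5):
    """Assumed penalty form: weight times the number of repeated bigrams."""
    counts = Counter(zip(tokens, tokens[1:]))
    return weight * sum(c - 1 for c in counts.values())

def beam_step(beams, rng, samples_per_beam=4, beam_width=3):
    """Sample continuations for each (tokens, logprob) beam and keep the best."""
    candidates = {}  # keyed by token tuple, so duplicate samples collapse
    for tokens, logprob in beams:
        logprobs = toy_logprobs(tokens)
        probs = [math.exp(lp) for lp in logprobs]
        for tok in rng.choices(VOCAB, weights=probs, k=samples_per_beam):
            candidates[tuple(tokens) + (tok,)] = logprob + logprobs[tok]
    # Score = cumulative log-probability minus the repetition penalty
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1] - repetition_penalty(kv[0]),
                    reverse=True)
    return [(list(toks), lp) for toks, lp in ranked[:beam_width]]

def sample_tokens(context, rng, length):
    """Sample `length` new tokens after `context` from the toy model."""
    tokens = list(context)
    for _ in range(length):
        probs = [math.exp(lp) for lp in toy_logprobs(tokens)]
        tokens.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return tokens[len(context):]

def markovian_rsa(prompt, rng, rounds=3, n=4, c=2, tail_len=5, gen_len=8):
    """Run sample-and-carry rounds; return the last best candidate and context sizes."""
    carried, context_sizes, best = [], [], []
    for _ in range(rounds):
        # Context = prompt + C bounded tails from the previous round only
        context = list(prompt) + [tok for tail in carried for tok in tail]
        context_sizes.append(len(context))
        candidates = [sample_tokens(context, rng, gen_len) for _ in range(n)]
        candidates.sort(key=repetition_penalty)  # assumed selection rule
        best = candidates[0]
        carried = [cand[-tail_len:] for cand in candidates[:c]]
    return best, context_sizes

rng = random.Random(0)
beams = [([0], 0.0)]
for _ in range(5):
    beams = beam_step(beams, rng)
best, sizes = markovian_rsa([0, 1], rng)
print(sizes)  # context never exceeds len(prompt) + c * tail_len
```

The point of the second function is the bound on `context_sizes`: after the first round, the context length is fixed by `c` and `tail_len`, no matter how many rounds run.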
Usage
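```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# modeling_quark1 must be importable (e.g. the file sits next to this script)
from modeling_quark1 import Quark1Model

# Load the weights and switch to inference mode
model = Quark1Model.from_pretrained("your-username/Quark-1")
model.eval()

# Fetch the tokenizer file from the Hub and load it
tokenizer_path = hf_hub_download("your-username/Quark-1", "tokenizer.json")
tokenizer = Tokenizer.from_file(tokenizer_path)

# Special-token IDs used by the chat-style prompt format
BOS_ID = tokenizer.token_to_id("<bos>")
EOS_ID = tokenizer.token_to_id("<eos>")
USER_ID = tokenizer.token_to_id("<user>")
ASSIST_ID = tokenizer.token_to_id("<assistant>")

# Prompt layout: <bos> <user> prompt tokens <assistant>
prompt = "Tell me a simple story."
ids = [BOS_ID, USER_ID] + tokenizer.encode(prompt).ids + [ASSIST_ID]
input_ids = torch.tensor([ids])

# Generate up to 80 new tokens without tracking gradients
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=80, temperature=0.8, top_k=40, eos_id=EOS_ID)

# Drop the prompt tokens and decode only the generated part
response = tokenizer.decode(out[0].tolist()[len(ids):], skip_special_tokens=True)
print(response)
```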
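For readers unfamiliar with the `temperature` and `top_k` arguments passed to `generate` above, the following is a sketch of what these settings conventionally mean for a single decoding step. It is written in plain Python and is not taken from `modeling_quark1`; the function name `sample_top_k` is hypothetical, and the model's own `generate` may differ in detail.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Sample one token index from raw logits using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the indices of the top_k highest logits
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights (max-subtracted for numerical stability)
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.0, 5.0, 1.0, 4.0]
picks = {sample_top_k(logits, top_k=2, rng=random.Random(i)) for i in range(50)}
print(picks)  # only indices 1 and 3 can appear with top_k=2
```

Lower temperatures concentrate probability on the highest-scoring tokens, while a smaller `top_k` removes the long tail entirely.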
Limitations
Quark-1 is a 4.2M-parameter model; for scale, that's roughly 40,000x smaller than GPT-3 (175B parameters). It is trained almost entirely on TinyStories, a synthetic dataset of simple children's stories, so:
- It writes grammatical, coherent short stories in a simple register. That's genuinely what it's good at.
- It does not have general world knowledge, cannot answer factual questions reliably, and cannot follow complex or multi-part instructions.
- It has no real reasoning ability. The Tree-of-Thought and Markovian RSA decoding strategies included in the training code are external search/aggregation scaffolds: they improve coherence over greedy/plain sampling, but they do not grant the model symbolic or multi-step reasoning it doesn't otherwise have.
- Entity/coreference tracking is weak (e.g. character names can get confused across sentences), a known limitation at this parameter scale.
This model is best understood as a from-scratch training and inference-scaffolding exercise, not a general-purpose assistant.
Training compute
Trained on a single Google Colab T4 GPU. Full pretraining run: ~4,000 steps, ~17 minutes.
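To put these figures in context, the sketch below works out the average time per step and shows one common form of the cosine LR schedule with warmup mentioned in the training pipeline. Only the 4,000 steps and ~17 minutes come from this card; `warmup_steps`, `peak_lr`, and `min_lr` are hypothetical placeholder values, not the settings actually used.

```python
import math

TOTAL_STEPS = 4000
seconds_per_step = 17 * 60 / TOTAL_STEPS  # about 0.255 s per step on average

def lr_at(step, total_steps=TOTAL_STEPS, warmup_steps=200,
          peak_lr=3e-3, min_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to min_lr (values are placeholders)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 199, 2100, 4000):
    print(step, f"{lr_at(step):.6f}")
```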