nanochat

nanochat-de-537m

A compact, German-only chat language model (537M parameters), trained from scratch.


Model description

nanochat-de-537m is a small, German-only language model with about 537M parameters, trained from scratch (not derived from any existing model) using the nanochat framework. It was built as an academic project to explore how a language model can be trained end-to-end with modest resources.

The model loads directly with 🤗 transformers. Because the nanochat architecture ships alongside the weights as custom code, pass trust_remote_code=True when loading.

Architecture: nanochat GPT (RoPE, RMSNorm, ReLU², GQA, QK-norm, value embeddings, logit softcap)
Parameters: 537M total (201M non-embedding)
Layers / dim / heads: 16 / 1024 / 8
Context length: 2048 tokens
Vocabulary: 32,768 (BPE / tiktoken)
Language: German
License: MIT
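
The parameter figures above can be roughly reproduced with a back-of-the-envelope count. The sketch below is an estimate under assumptions that this card does not state: an MLP hidden width of 4 × dim, attention projections of full width (no saving from GQA), separate input and output embedding tables, and one value-embedding table on half of the layers. Treat it as an illustration of where the parameters go, not as the framework's own accounting.

```python
# Back-of-the-envelope parameter count.
# Lines marked ASSUMPTION are not stated in the model card.
n_layer = 16         # from the card
n_embd = 1024        # from the card
vocab_size = 32_768  # from the card

mlp_expansion = 4                    # ASSUMPTION: MLP hidden width = 4 * n_embd
n_value_embed_tables = n_layer // 2  # ASSUMPTION: value embeddings on half the layers
n_token_tables = 2                   # ASSUMPTION: untied input and output embeddings

# Attention: query, key, value and output projections, each n_embd x n_embd
# (ASSUMPTION: key/value projections are as wide as the query projection).
attn_params = 4 * n_embd * n_embd
# MLP: one up-projection and one down-projection.
mlp_params = 2 * mlp_expansion * n_embd * n_embd
non_embedding = n_layer * (attn_params + mlp_params)

# Every embedding-style table has vocab_size x n_embd entries.
table_params = vocab_size * n_embd
embedding = (n_token_tables + n_value_embed_tables) * table_params

total = non_embedding + embedding
print(f"non-embedding: {non_embedding / 1e6:.0f}M")  # prints "non-embedding: 201M"
print(f"total:         {total / 1e6:.0f}M")          # prints "total:         537M"
```

Under these assumptions most of the parameters sit in embedding tables, which is why the non-embedding count is well under half of the total.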

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Mario12355/nanochat-de-537m"
# trust_remote_code=True is needed because the architecture is custom code in the repo.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()
if torch.cuda.is_available():
    model = model.cuda()

# Build the prompt with the chat template and move it to the model's device.
messages = [{"role": "user", "content": "Wer bist du?"}]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Sample a reply with the recommended settings.
out = model.generate(ids, max_new_tokens=200, do_sample=True,
                     temperature=0.3, top_k=20, repetition_penalty=1.3)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

Recommended sampling (also the defaults in generation_config.json): temperature=0.3, top_k=20, repetition_penalty=1.3. Low temperature keeps this small model coherent; the repetition penalty prevents degenerate loops.
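
To make the effect of these three settings concrete, here is a minimal plain-Python sketch of how they typically act on the next-token logits. It follows the usual order (repetition penalty, then temperature, then top-k) but is an illustration only, not the code that transformers runs. The function name `sample_next` and the toy logits are hypothetical.

```python
import math
import random

def sample_next(logits, prev_ids, temperature=0.3, top_k=20,
                repetition_penalty=1.3, rng=random):
    """Pick one token id from raw logits (a list of floats)."""
    scores = list(logits)
    # Repetition penalty: push down tokens that were already generated.
    for t in set(prev_ids):
        if scores[t] > 0:
            scores[t] /= repetition_penalty
        else:
            scores[t] *= repetition_penalty
    # Temperature: values below 1 sharpen the distribution.
    scores = [s / temperature for s in scores]
    # Top-k: keep only the k highest-scoring token ids.
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    # Softmax weights over the survivors (subtract the max for stability).
    top = max(scores[i] for i in keep)
    weights = [math.exp(scores[i] - top) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

# Toy example with a four-token vocabulary; token 0 was already generated.
toy_logits = [2.0, 1.5, 0.2, -1.0]
print(sample_next(toy_logits, prev_ids=[0], rng=random.Random(0)))
```

With a temperature of 0.3 the gap between logits is multiplied by more than three before the softmax, so sampling stays close to the most likely tokens while still allowing some variation.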

Training

  • Pre-training: about 20B characters of German text from FineWeb-2 (German split, deu_Latn).
  • Supervised fine-tuning (SFT): Mario12355/german-sft-mix, plus curated and synthetically generated identity dialogues.
  • Hardware: 2× NVIDIA RTX 3090.
  • Framework: nanochat.

Evaluation

Validation bits-per-byte (held-out German SFT): 0.503
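
The card does not say how this figure was computed. As a sketch of the common definition, bits-per-byte divides the total cross-entropy of the text, converted from nats to bits, by the number of UTF-8 bytes in that text, which makes the metric independent of the tokenizer. The helper name `bits_per_byte` and the loss values below are hypothetical.

```python
import math

def bits_per_byte(token_nll_nats, text):
    """Bits-per-byte from per-token negative log-likelihoods (in nats)."""
    n_bytes = len(text.encode("utf-8"))
    total_bits = sum(token_nll_nats) / math.log(2)  # convert nats to bits
    return total_bits / n_bytes

# "Grüße" is 5 characters but 7 UTF-8 bytes (ü and ß take two bytes each).
text = "Grüße"
# Made-up losses: two tokens, each costing 3.5 bits.
losses = [3.5 * math.log(2), 3.5 * math.log(2)]
print(bits_per_byte(losses, text))
```

Lower is better: a value near 0.5 means the model needs about half a bit on average to encode each byte of the held-out text.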

The model is strongest on everyday German conversation. It is intentionally small, prioritising transparency and efficiency over peak capability.

Intended use & limitations

Intended for: German-language conversation, education, and demonstrating how an LLM can be trained from scratch.

Limitations: as a small model it is weak at factual knowledge, mathematics, and complex reasoning. It is German-only, has no internet access, keeps no memory beyond the current conversation, and can hallucinate. It is not intended for production use. The inference implementation does not use a KV cache, so outputs are correct but generation slows down as the output grows.
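
To give a feel for the cost of generating without a KV cache, the sketch below counts how many token positions pass through the model during generation. It is a rough proxy under simple assumptions (one forward pass per generated token, the whole sequence re-processed each time when there is no cache), not a measurement of this model. The function name `positions_processed` is hypothetical.

```python
def positions_processed(prompt_len, new_tokens, kv_cache):
    """Count token positions run through the model while generating."""
    if kv_cache:
        # The prompt is processed once; each later step feeds one new token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-processes the whole sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))

# A 20-token prompt followed by 200 generated tokens.
print(positions_processed(20, 200, kv_cache=True))   # prints 219
print(positions_processed(20, 200, kv_cache=False))  # prints 23900
```

The uncached count grows quadratically with the output length, which is why short replies are barely affected while long ones become noticeably slower.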

Citation

@misc{nanochat-de-537m,
  title  = {nanochat-de-537m: a German language model trained from scratch},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Mario12355/nanochat-de-537m}}
}

Acknowledgements

Built on the nanochat framework by Andrej Karpathy.
