Marshmello-8M

Marshmello-8M is a decoder-only GPT language model trained from scratch in Marshmello, a step-by-step project that builds transformers from one weight to GPT pretraining, SFT, and 300M scaling on Apple Silicon.

| Property | Value |
|---|---|
| GitHub | mohmmedwee/Marshmello |
| Parameters | ~13.3M (13,340,928) |
| Architecture | GPT (causal self-attention, learned positional embeddings) |
| d_model | 384 |
| Layers | 4 |
| Heads | 6 |
| FFN dim | 1536 |
| Context | 256 tokens |
| Tokenizer | BPE (~8,000 vocab) |
| Config key | default |
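
The stated total of 13,340,928 can be reproduced from the figures above with plain arithmetic. This is a sketch under assumptions the card does not confirm: a vocabulary of exactly 8,000, biases on every linear layer, two LayerNorms per block plus a final one, and an output head that is not tied to the token embedding. The function name and layout are illustrative, not the repo's code.

```python
def gpt_param_count(vocab=8000, d_model=384, n_layers=4, d_ff=1536, context=256):
    """Count parameters for an assumed GPT layout (not taken from the repo)."""
    tok_emb = vocab * d_model                      # token embedding table
    pos_emb = context * d_model                    # learned positional embeddings
    attn = 4 * (d_model * d_model + d_model)       # Q, K, V, output projections with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers with biases
    norms = 2 * (2 * d_model)                      # two LayerNorms per block (weight + bias)
    block = attn + ffn + norms
    final_norm = 2 * d_model                       # assumed final LayerNorm
    lm_head = d_model * vocab                      # assumed untied output head, no bias
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head


total = gpt_param_count()
print(f"{total:,}")  # 13,340,928
```

Under these assumptions the token embedding and output head together account for 6,144,000 parameters and the four transformer blocks for 7,097,856. The head count (6) does not change the total, since it only splits d_model into slices.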

Quick start

git clone https://github.com/mohmmedwee/Marshmello.git
cd Marshmello
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt huggingface_hub safetensors

# Download weights from this Hub repo into checkpoints/
python 13_gpt_pretraining/hub/download_from_hub.py --repo-id ostah-1010/Marshmello-8M

# Generate text
python 13_gpt_pretraining/generate.py --config default --prompt "Database systems"

Marshmello family

| Model | Hugging Face | Params | GitHub config | Status |
|---|---|---|---|---|
| Marshmello-8M | ostah-1010/Marshmello-8M | ~8M | default | Published |
| Marshmello-55M | ostah-1010/Marshmello | ~55M | large_50m | Published base |
| Marshmello-300M | GitHub only | 268.8M | large_300m | Phase 19A pretraining |

Full source, training pipeline, and evaluation suite: https://github.com/mohmmedwee/Marshmello

SFT dataset: ostah-1010/Marshmello-SFT

Learning path (GitHub repo)

Linear model → Attention → Transformer → BPE LM → GPT pretraining
→ Dataset pipeline → 50M scaling → Evaluation → Instruction dataset
→ Chat adaptation (18C/18H) → Tiny teacher SFT (18E) → Instruct tuning (18B)
→ Core routing eval (18J) → General benchmark (18K) → 300M scaling (19A)

Phases 01–19A in the repo walk through every layer of the stack with readable Python.

Phase 19A: 300M scaling (GitHub)

After the 55M line plateaued (18J 18%, 18K domain 21.8%), Phase 19A tests whether extra capacity helps. Status on large_300m:

| Gate | Result |
|---|---|
| Benchmark | Passed (~3,552 train tok/s, ~5.5 GB peak, no OOM) |
| Smoke test (20 steps) | Passed (loss 9.16 → 7.16) |
| Chat-only pretraining | In progress |
| Hub upload | Not until 18J/18K improve |
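
To put the smoke-test losses in context: a cross-entropy loss in nats converts to perplexity via exp(loss), and a model that guesses uniformly over V tokens scores ln(V). The sketch below assumes the reported losses are mean per-token cross-entropy in nats. It also reuses the ~8,000-token vocabulary listed for Marshmello-8M as an assumption, since the large_300m vocabulary is not stated here.

```python
import math


def perplexity(loss_nats):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)


def uniform_loss(vocab_size):
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)


start, end = 9.16, 7.16  # smoke-test losses from the gate table
print(f"perplexity at step 0:   {perplexity(start):.0f}")
print(f"perplexity at step 20:  {perplexity(end):.0f}")
print(f"uniform baseline (V=8000, assumed): {uniform_loss(8000):.2f} nats")
```

A drop of 2.0 nats corresponds to perplexity falling by a factor of e^2, roughly 7.4, whatever the vocabulary size.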

Docs: 19A_scale_to_300m

Instruct checkpoints (GitHub only)

Base weights live on this Hub repo. Instruct / routing checkpoints (~632 MB each) are trained locally and documented in the GitHub repo; they are not uploaded here yet:

| Checkpoint | Role |
|---|---|
| 18E_tiny_teacher_sft/checkpoints/teacher_latest.pt | Tiny teacher SFT (~1590 short answers incl. math) |
| 18B_marshmello_instruct/checkpoints/best_18j_routing.pt | Best 18J core routing (~18%); recommended for deployment |

Chat after cloning + downloading base weights:

python 18B_marshmello_instruct/chat.py \
  --checkpoint 18B_marshmello_instruct/checkpoints/best_18j_routing.pt \
  --prompt "Explain what a database index is" --greedy
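
The --greedy flag presumably selects argmax decoding instead of sampling, though that is not spelled out here. The sketch below is a generic illustration of the two strategies over one step of logits. The function names are hypothetical and this is not the code in chat.py.

```python
import math
import random


def greedy_pick(logits):
    """Return the index of the highest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.2, 3.4, 0.5, 3.1]  # toy next-token scores
print(greedy_pick(logits))     # always index 1, the largest logit
print(sample_pick(logits, temperature=0.8, rng=random.Random(0)))
```

Greedy decoding gives repeatable output, which is convenient for benchmark-style comparisons. Sampling at a very low temperature approaches the same choice.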

Dual benchmarks: 18J (core concept routing, best ~18%) and 18K (general assistant QA, best domain ~22.5%, hallucination ~64%). See 18J_marshmello_core_sft/ and 18K_general_benchmark/ on GitHub.

Files in this repo

| File | Description |
|---|---|
| model.safetensors | Model weights |
| config.json | Architecture + parameter breakdown |
| tokenizer.json | BPE tokenizer (</w> word boundaries) |
| generation_config.json | Default sampling settings |
| training_meta.json | Training step, losses, hyperparameters |
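
The </w> marker mentioned for tokenizer.json is the usual way word-level BPE records where a word ends. A minimal sketch follows, assuming the marker is appended as its own symbol before merging. The merge list is a toy invented for illustration and is not taken from tokenizer.json.

```python
def bpe_encode_word(word, merges):
    """Apply ranked BPE merges to one word, marking its end with '</w>'."""
    symbols = list(word) + ["</w>"]
    ranks = {pair: i for i, pair in enumerate(merges)}  # earlier merge = higher priority
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, in priority order.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>"), ("low", "</w>")]
print(bpe_encode_word("low", toy_merges))    # ['low</w>']
print(bpe_encode_word("lower", toy_merges))  # ['low', 'er</w>']
```

The same letters yield different tokens depending on position: "low" at the end of a word becomes low</w>, while the prefix of "lower" stays low. This lets a decoder restore spaces without storing them as separate tokens.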

Limitations

  • Trained on a small educational corpus (not web-scale pretraining)
  • Outputs may memorize training paragraphs (see Phase 16 evaluation in GitHub repo)
  • This Hub repo ships the base causal LM; instruct/routing checkpoints are on GitHub
  • Custom PyTorch GPT (not transformers AutoModel)
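
Because the checkpoint is not loadable through transformers AutoModel, it can help to inspect model.safetensors directly. A safetensors file starts with an 8-byte little-endian unsigned integer giving the header length, followed by a JSON header listing each tensor's dtype, shape, and byte offsets. The sketch below reads only that header using the standard library. The helper names are hypothetical and not part of the repo.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian length
        return json.loads(f.read(header_len).decode("utf-8"))


def count_parameters(header):
    """Sum the element counts of every tensor listed in the header."""
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        n = 1
        for dim in info["shape"]:
            n *= dim
        total += n
    return total
```

Calling read_safetensors_header on the downloaded weights file and passing the result to count_parameters gives the number of stored parameters. The exact path depends on where the download script places the file, which is not specified here.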

Citation

Built with the Marshmello learning project (Phases 01–17).

Safetensors: model size 8.05M params, tensor type F32.