Marshmello-8M

Marshmello-8M is a decoder-only GPT language model trained from scratch as part of Marshmello, a step-by-step project that builds transformers up from a single weight through GPT pretraining, SFT, and 300M-parameter scaling on Apple Silicon.


GitHub	mohmmedwee/Marshmello
Parameters	~13.3M (13,340,928)
Architecture	GPT (causal self-attention, learned positional embeddings)
`d_model`	384
Layers	4
Heads	6
FFN dim	1536
Context	256 tokens
Tokenizer	BPE (~8,000 vocab)
Config key	`default`
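
The listed total of 13,340,928 can be reproduced from the hyperparameters above under one set of assumptions. This is a minimal sketch rather than the repo's own accounting: it assumes a vocabulary of exactly 8,000 tokens, biases on every linear layer, two LayerNorms per block plus a final one, and an output head that is not tied to the token embedding. The function name is hypothetical; `config.json` carries the authoritative breakdown.

```python
def gpt_param_count(vocab, d_model, n_layers, d_ff, context, tie_head=False):
    """Count parameters of a GPT-style decoder (assumed layout, see lead-in)."""
    tok_emb = vocab * d_model                  # token embedding table
    pos_emb = context * d_model                # learned positional embeddings
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers with biases
    norms = 2 * (2 * d_model)                  # two LayerNorms per block (weight + bias)
    block = attn + ffn + norms
    final_norm = 2 * d_model
    head = 0 if tie_head else vocab * d_model  # untied output projection, no bias
    return tok_emb + pos_emb + n_layers * block + final_norm + head


total = gpt_param_count(vocab=8000, d_model=384, n_layers=4, d_ff=1536, context=256)
print(f"{total:,}")  # 13,340,928
```

Under these assumptions the two 8,000 × 384 matrices (token embedding and output head) account for 6,144,000 of the parameters, and the four transformer blocks for 7,097,856.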

Quick start

git clone https://github.com/mohmmedwee/Marshmello.git
cd Marshmello
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt huggingface_hub safetensors

# Download weights from this Hub repo into checkpoints/
python 13_gpt_pretraining/hub/download_from_hub.py --repo-id ostah-1010/Marshmello-8M

# Generate text
python 13_gpt_pretraining/generate.py --config default --prompt "Database systems"

Marshmello family

Model	Hugging Face	Params	GitHub config	Status
Marshmello-8M	ostah-1010/Marshmello-8M	~8M	`default`	Published
Marshmello-55M	ostah-1010/Marshmello	~55M	`large_50m`	Published base
Marshmello-300M	GitHub only	268.8M	`large_300m`	Phase 19A pretraining

Full source, training pipeline, and evaluation suite: https://github.com/mohmmedwee/Marshmello

SFT dataset: ostah-1010/Marshmello-SFT

Learning path (GitHub repo)

Linear model → Attention → Transformer → BPE LM → GPT pretraining
→ Dataset pipeline → 50M scaling → Evaluation → Instruction dataset
→ Chat adaptation (18C/18H) → Tiny teacher SFT (18E) → Instruct tuning (18B)
→ Core routing eval (18J) → General benchmark (18K) → 300M scaling (19A)

Phases 01–19A in the repo walk through every layer of the stack with readable Python.

Phase 19A — 300M scaling (GitHub)

After the 55M line plateaued (18J core routing at 18%, 18K domain at 21.8%), Phase 19A tests whether extra capacity helps. Status on `large_300m`:

Gate	Result
Benchmark	Passed (~3,552 train tok/s, ~5.5 GB peak, no OOM)
Smoke test (20 steps)	Passed (loss 9.16 → 7.16)
Chat-only pretraining	In progress
Hub upload	Deferred until 18J/18K scores improve

Docs: 19A_scale_to_300m

Instruct checkpoints (GitHub only)

Base weights live on this Hub repo. Instruct and routing checkpoints (~632 MB each) are trained locally and documented in the GitHub repo; they are not uploaded here yet:

Checkpoint	Role
`18E_tiny_teacher_sft/checkpoints/teacher_latest.pt`	Tiny teacher SFT (~1,590 short answers, including math)
`18B_marshmello_instruct/checkpoints/best_18j_routing.pt`	Best 18J core routing (~18%); recommended for deployment

To chat, clone the repo and download the base weights, then run:

python 18B_marshmello_instruct/chat.py \
  --checkpoint 18B_marshmello_instruct/checkpoints/best_18j_routing.pt \
  --prompt "Explain what a database index is" --greedy
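
The `--greedy` flag presumably switches generation from sampling to greedy decoding. The internals of `chat.py` are not shown here, so the following is only a generic sketch of the difference between the two strategies; the function names and the example logits are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, greedy=True, temperature=1.0, rng=None):
    """Return the index of the next token from a list of logits."""
    if greedy:
        # Greedy decoding: always take the highest-scoring token (deterministic).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling: draw a token in proportion to its softmax probability.
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


example_logits = [0.1, 2.0, -1.0]           # made-up scores for a 3-token vocabulary
print(pick_next_token(example_logits))      # 1
```

Greedy decoding returns the same continuation for the same prompt every time, which makes benchmark runs easier to compare.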

Dual benchmarks: 18J (core concept routing, best ~18%) and 18K (general assistant QA, best domain ~22.5%, hallucination ~64%). See 18J_marshmello_core_sft/ and 18K_general_benchmark/ on GitHub.

Files in this repo

File	Description
`model.safetensors`	Model weights
`config.json`	Architecture + parameter breakdown
`tokenizer.json`	BPE tokenizer (`</w>` word boundaries)
`generation_config.json`	Default sampling settings
`training_meta.json`	Training step, losses, hyperparameters
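
The `</w>` marker in `tokenizer.json` follows a common BPE convention: a token that ends a word carries the suffix, so word boundaries can be restored when subwords are joined. A minimal sketch of that convention follows. The function name and the token split are hypothetical, and this is not the repo's tokenizer code.

```python
def decode_bpe_tokens(tokens, end_of_word="</w>"):
    """Join subword tokens, turning each end-of-word marker into a space."""
    text = "".join(tokens).replace(end_of_word, " ")
    return text.strip()


# Hypothetical split of "database systems" into subword tokens.
tokens = ["data", "base</w>", "sys", "tems</w>"]
print(decode_bpe_tokens(tokens))  # database systems
```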

Limitations

Trained on a small educational corpus (not web-scale pretraining)
Outputs may reproduce memorized training paragraphs (see the Phase 16 evaluation in the GitHub repo)
This Hub repo ships the base causal LM; instruct/routing checkpoints are on GitHub
Custom PyTorch GPT implementation (not a `transformers` `AutoModel`)

Citation

Built with the Marshmello learning project (Phases 01–17).

Safetensors: 8.05M params, tensor type F32