# Model Card for gpjt/8xh100m80-best
This model is gpjt/8xh100m80-best, a trained-from-scratch base model using the GPT-2-style architecture from Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
## Model Details

### Model Description
- Developed by: Giles Thomas, based on code by Sebastian Raschka
- Model type: GPT-2-style, Transformer-based causal LLM
- License: Apache 2.0
- Parameters: 163,009,536
- Context length: 1,024
- Embedding dimensions: 768
- MHA heads: 12
- Layers: 12
- QKV bias: False
- Weight tying: False
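For reference, the parameter count above can be reproduced from these hyperparameters. This is a sketch assuming a standard GPT-2 layout (GPT-2's vocabulary size of 50,257, learned token and position embeddings, biasless QKV projections as listed, biased attention-output and MLP projections, two LayerNorms per block plus a final one, and an untied output head):

```python
# Reproduce the 163,009,536 parameter count from the config above.
# Assumes GPT-2's 50,257-token vocabulary and a standard GPT-2 layout.
vocab_size = 50_257
ctx_len = 1_024
emb_dim = 768
n_layers = 12

tok_emb = vocab_size * emb_dim          # token embedding table
pos_emb = ctx_len * emb_dim             # learned position embeddings

qkv = 3 * emb_dim * emb_dim             # Q, K, V projections (QKV bias: False)
out_proj = emb_dim * emb_dim + emb_dim  # attention output projection (with bias)
mlp = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
norms = 2 * 2 * emb_dim                 # two LayerNorms (scale + shift) per block
per_layer = qkv + out_proj + mlp + norms

final_norm = 2 * emb_dim                # LayerNorm after the last block
lm_head = emb_dim * vocab_size          # untied output head (weight tying: False)

total = tok_emb + pos_emb + n_layers * per_layer + final_norm + lm_head
print(total)  # 163009536
```

Note that because the head is untied, the output projection contributes its own ~38.6M parameters on top of the embedding table.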
Don't have high expectations for the model! It has only 163M parameters (the GPT-2 "small" size) and was trained on roughly the Chinchilla-optimal number of tokens (~20x the number of parameters), which means that it doesn't know many facts and is not terribly smart. If you want to do serious work, use a serious model (I like Qwen's). But if you want to build on this and see what you can do with a 2020-vintage LLM, please do feel free to play with it!
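The ~20 tokens-per-parameter rule of thumb mentioned above gives the training budget directly:

```python
# Chinchilla rule of thumb: roughly 20 training tokens per model parameter.
params = 163_009_536
optimal_tokens = 20 * params
print(optimal_tokens)  # 3260190720 -- about 3.26B tokens
```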
## Model Sources
- Repository: gpjt/ddp-base-model-from-scratch
- Blog post: Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud (this is the model with the best validation loss from "Training on an 8x H100 with 80 GiB per GPU, using SXM5")
## How to Get Started with the Model
You can download and run the model for inference directly:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="gpjt/8xh100m80-best", trust_remote_code=True)
out = pipe(
    "Every effort moves you",
    max_new_tokens=20,
    do_sample=True,
    temperature=1.4,
    top_k=25,
)
print(out[0]["generated_text"])
```
Note that because it uses custom code, you'll need to set `trust_remote_code=True`.
It supports `AutoTokenizer`, `AutoModel`, and `AutoModelForCausalLM`:

```python
>>> from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("gpjt/8xh100m80-best")
>>> model = AutoModel.from_pretrained("gpjt/8xh100m80-best", trust_remote_code=True)
>>> llm_model = AutoModelForCausalLM.from_pretrained("gpjt/8xh100m80-best", trust_remote_code=True)
```
You can also fine-tune it; this notebook has an example.
Again, don't expect too much from this model! It's a 163M-parameter GPT-2-style model, trained on a limited number of tokens. It's both dumb and ignorant ;-)
## Training Details
- Machine type: 8x H100 with 80 GiB per GPU, using SXM5
- Tokens: 3,260,190,720 (the Chinchilla-optimal 20x the parameter count), rounded up to the nearest batch
- Dataset: gpjt/fineweb-gpt2-tokens
- Micro-batch size: 27
- Global batch size: 216
- Dropout: 0.1
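Putting these numbers together (a sketch of the arithmetic, not taken from the training script): with 8 GPUs, a micro-batch of 27 per GPU yields the global batch of 216 with no gradient accumulation, and each optimizer step processes global-batch x context-length tokens:

```python
# How the batch figures above relate (assumes one DDP process per GPU,
# no gradient accumulation -- an inference from the numbers, not the script).
micro_batch = 27
n_gpus = 8
global_batch = micro_batch * n_gpus
print(global_batch)  # 216 -- matches the global batch size above

ctx_len = 1_024
tokens_per_step = global_batch * ctx_len
print(tokens_per_step)  # 221184 tokens per optimizer step

total_tokens = 3_260_190_720
print(total_tokens / tokens_per_step)  # ~14739 optimizer steps in total
```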