# Model Card for gpjt/8xh100m80-best
This model is gpjt/8xh100m80-best, a trained-from-scratch base model using the GPT-2-style architecture from Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
## Model Details

### Model Description
- Developed by: Giles Thomas, based on code by Sebastian Raschka
- Model type: GPT-2-style, Transformer-based causal LLM
- License: Apache 2.0
- Parameters: 163,009,536
- Context length: 1,024
- Embedding dimensions: 768
- MHA heads: 12
- Layers: 12
- QKV bias: False
- Weight tying: False
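For reference, the parameter count above can be reproduced from these hyperparameters. This is a sketch assuming a standard GPT-2 layout (GPT-2's vocabulary size of 50,257, learned token and position embeddings, biasless QKV projections as listed, biased attention-output and MLP projections, two LayerNorms per block plus a final one, and an untied output head):

```python
# Reproduce the 163,009,536 parameter count from the config above.
# Assumes GPT-2's 50,257-token vocabulary and a standard GPT-2 layout.
vocab_size = 50_257
ctx_len = 1_024
emb_dim = 768
n_layers = 12

tok_emb = vocab_size * emb_dim          # token embedding table
pos_emb = ctx_len * emb_dim             # learned position embeddings

qkv = 3 * emb_dim * emb_dim             # Q, K, V projections (QKV bias: False)
out_proj = emb_dim * emb_dim + emb_dim  # attention output projection (with bias)
mlp = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
norms = 2 * 2 * emb_dim                 # two LayerNorms (scale + shift) per block
per_layer = qkv + out_proj + mlp + norms

final_norm = 2 * emb_dim                # LayerNorm after the last block
lm_head = emb_dim * vocab_size          # untied output head (weight tying: False)

total = tok_emb + pos_emb + n_layers * per_layer + final_norm + lm_head
print(total)  # 163009536
```

Note that because the head is untied, the output projection contributes its own ~38.6M parameters on top of the embedding table.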
Don't have high expectations for the model! It has only 163M parameters (the GPT-2 "small" size) and was trained on roughly the Chinchilla-optimal number of tokens (~20x the number of parameters), which means that it doesn't know many facts and is not terribly smart. If you want to do serious work, use a serious model (I like Qwen's). But if you want to build on this and see what you can do with a 2020-vintage LLM, please do feel free to play with it!
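The ~20 tokens-per-parameter rule of thumb mentioned above gives the training budget directly:

```python
# Chinchilla rule of thumb: roughly 20 training tokens per model parameter.
params = 163_009_536
optimal_tokens = 20 * params
print(optimal_tokens)  # 3260190720 -- about 3.26B tokens
```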
## Model Sources
- Repository: gpjt/ddp-base-model-from-scratch
- Blog post: Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud (this is the model with the best validation loss from "Training on an 8x H100 with 80 GiB per GPU, using SXM5")
## How to Get Started with the Model
You can download and run the model for inference directly:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="gpjt/8xh100m80-best", trust_remote_code=True)
out = pipe(
    "Every effort moves you",
    max_new_tokens=20,
    do_sample=True,
    temperature=1.4,
    top_k=25,
)
print(out[0]["generated_text"])
```
Note that because it uses custom code, you'll need to set `trust_remote_code=True`.
It supports `AutoTokenizer`, `AutoModel`, and `AutoModelForCausalLM`:

```python
>>> from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("gpjt/8xh100m80-best")
>>> model = AutoModel.from_pretrained("gpjt/8xh100m80-best", trust_remote_code=True)
>>> llm_model = AutoModelForCausalLM.from_pretrained("gpjt/8xh100m80-best", trust_remote_code=True)
```
You can also fine-tune it; this notebook has an example.
Again, don't expect too much from this model! It's a 163M-parameter GPT-2-style model, trained on a limited number of tokens. It's both dumb and ignorant ;-)
## Training Details
- Machine type: 8x H100 with 80 GiB per GPU, using SXM5
- Tokens: 3,260,190,720 (the Chinchilla-optimal 20x the parameter count), rounded up to the nearest batch
- Dataset: gpjt/fineweb-gpt2-tokens
- Micro-batch size: 27
- Global batch size: 216
- Dropout: 0.1
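Putting these numbers together (a sketch of the arithmetic, not taken from the training script): with 8 GPUs, a micro-batch of 27 per GPU yields the global batch of 216 with no gradient accumulation, and each optimizer step processes global-batch x context-length tokens:

```python
# How the batch figures above relate (assumes one DDP process per GPU,
# no gradient accumulation -- an inference from the numbers, not the script).
micro_batch = 27
n_gpus = 8
global_batch = micro_batch * n_gpus
print(global_batch)  # 216 -- matches the global batch size above

ctx_len = 1_024
tokens_per_step = global_batch * ctx_len
print(tokens_per_step)  # 221184 tokens per optimizer step

total_tokens = 3_260_190_720
print(total_tokens / tokens_per_step)  # ~14739 optimizer steps in total
```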