Model Card for gpjt/8xh100m80-latest

This model is gpjt/8xh100m80-latest, a base model trained from scratch using the GPT-2-style architecture from Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

Model Details

Model Description

  • Developed by: Giles Thomas, based on code by Sebastian Raschka
  • Model type: GPT-2-style transformer-based causal LLM
  • License: Apache 2.0
  • Parameters: 163,009,536
  • Context length: 1,024
  • Embedding dimensions: 768
  • MHA heads: 12
  • Layers: 12
  • QKV bias: False
  • Weight tying: No

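As a quick sanity check, the parameter count above can be verified directly from the loaded model. This is a hedged sketch: it uses only standard PyTorch parameter counting, nothing specific to this model's custom code.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpjt/8xh100m80-latest", trust_remote_code=True)
# Sum the element counts of all parameter tensors; should print 163,009,536.
print(f"{sum(p.numel() for p in model.parameters()):,}")
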
Don't have high expectations for the model! It has only 163M parameters (the GPT-2 "small" size) and was trained on roughly the Chinchilla-optimal number of tokens (~20x the number of parameters), which means that it doesn't know many facts and is not terribly smart. If you want to do serious work, use a serious model (I like Qwen's). But if you want to build on this and see what you can do with a 2020-vintage LLM, please do feel free to play with it!

How to Get Started with the Model

You can download and run the model for inference directly:

from transformers import pipeline

# The model uses custom code, so trust_remote_code=True is required.
pipe = pipeline("text-generation", model="gpjt/8xh100m80-latest", trust_remote_code=True)

# Sample a short continuation of the prompt.
out = pipe(
    "Every effort moves you",
    max_new_tokens=20,
    do_sample=True,
    temperature=1.4,
    top_k=25,
)
print(out[0]["generated_text"])

Note that because it uses custom code, you'll need to set trust_remote_code to True.

It supports AutoTokenizer, AutoModel and AutoModelForCausalLM:

>>> from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("gpjt/8xh100m80-latest")
>>> model = AutoModel.from_pretrained("gpjt/8xh100m80-latest", trust_remote_code=True)
>>> llm_model = AutoModelForCausalLM.from_pretrained("gpjt/8xh100m80-latest", trust_remote_code=True)
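
For example, here's a minimal generation sketch using the classes above. It assumes the custom model class supports the standard generate() API; the prompt and sampling settings are copied from the pipeline example.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpjt/8xh100m80-latest")
model = AutoModelForCausalLM.from_pretrained("gpjt/8xh100m80-latest", trust_remote_code=True)
model.eval()

# Tokenize the prompt and sample a 20-token continuation.
inputs = tokenizer("Every effort moves you", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=True,
        temperature=1.4,
        top_k=25,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))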

You can also fine-tune it; this notebook has an example.
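
If you'd rather not follow the notebook, a rough Trainer-based sketch looks like this. It's hedged: the wikitext dataset, hyperparameters, and output directory are placeholder choices, and it assumes the custom model class plays nicely with the standard Trainer and causal-LM data collator setup.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpjt/8xh100m80-latest")
model = AutoModelForCausalLM.from_pretrained("gpjt/8xh100m80-latest", trust_remote_code=True)

# GPT-2-style tokenizers often have no pad token; reuse EOS if so.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder dataset: any plain-text corpus will do for a smoke test.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    # Truncate to the model's 1,024-token context length.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
# Drop empty lines, which tokenize to zero-length examples.
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0)

# mlm=False gives standard next-token-prediction (causal LM) labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1, per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()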

Again, don't expect too much from this model! It's a 163M-parameter GPT-2-style model, trained on a limited number of tokens. It's both dumb and ignorant ;-)

Training Details

  • Machine type: 8x H100 SXM5, 80 GiB per GPU
  • Tokens: 3,260,190,720 (the Chinchilla-optimal 20x parameter count), rounded up to the nearest batch
  • Dataset: gpjt/fineweb-gpt2-tokens
  • Micro-batch size: 27
  • Global batch size: 216
  • Dropout: 0.1
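
These numbers are self-consistent, which a couple of lines of arithmetic can confirm (not from the original card, just a sanity check):

params = 163_009_536
assert 20 * params == 3_260_190_720  # Chinchilla-optimal token budget: 20x parameters
assert 27 * 8 == 216                 # micro-batch of 27 across 8 GPUs = global batch of 216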