---
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
- gpjt-llm-from-scratch
datasets:
- gpjt/fineweb-gpt2-tokens
---

# Model Card for gpjt/8xa100m80

This model is gpjt/8xa100m80, a trained-from-scratch base model using
the GPT-2-style architecture from [Sebastian Raschka](https://sebastianraschka.com/)'s book
"[Build a Large Language Model (from Scratch)](https://www.manning.com/books/build-a-large-language-model-from-scratch)".

## Model Details

### Model Description

- **Developed by:** [Giles Thomas](https://huggingface.co/gpjt), based on code by [Sebastian Raschka](https://huggingface.co/rasbt)
- **Model type:** GPT-2 style transformers-based causal LLM.
- **License:** [Apache 2](https://huggingface.co/models?license=license:apache-2.0&sort=downloads)
- **Parameters:** 163,009,536
- **Context length:** 1,024
- **Embedding dimensions:** 768
- **MHA heads:** 12
- **Layers:** 12
- **QKV bias:** False
- **Weight tying:** No.

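For orientation, these hyperparameters correspond to a configuration dict like the sketch below, written in the conventions of the book's `GPT_CONFIG_124M` example. The key names and the vocabulary size (the standard GPT-2 tokenizer used for [gpjt/fineweb-gpt2-tokens](https://huggingface.co/datasets/gpjt/fineweb-gpt2-tokens)) are assumptions for illustration; the repository's actual `config.json` may use different field names.

```python
# Sketch of the architecture hyperparameters in the book's config-dict style.
# Key names follow the book's GPT_CONFIG_124M example; this repo's config.json
# may differ. vocab_size assumes the standard GPT-2 BPE tokenizer.
GPT_CONFIG = {
    "vocab_size": 50257,     # GPT-2 BPE vocabulary (assumed)
    "context_length": 1024,  # maximum sequence length
    "emb_dim": 768,          # embedding dimensions
    "n_heads": 12,           # multi-head attention heads
    "n_layers": 12,          # transformer blocks
    "drop_rate": 0.1,        # dropout (see Training Details below)
    "qkv_bias": False,       # no bias on the QKV projections
}
```

Because the token embedding and the output head are not tied, the extra 50,257 x 768 output matrix is what takes the parameter count from the book's familiar 124M figure up to the 163,009,536 listed above.
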
Don't have high expectations for the model! It has only 163M parameters (the GPT-2 "small" size)
and was trained on roughly the Chinchilla-optimal number of tokens (~20x the number of parameters), which means that it doesn't know
many facts and is not terribly smart. If you want to do serious work, use a serious model (I like
[Qwen's](https://huggingface.co/Qwen)). But if you want to build on this and see what you can do with a 2020-vintage
LLM, please do feel free to play with it!

### Model Sources

- **Repository:** [gpjt/ddp-base-model-from-scratch](https://github.com/gpjt/ddp-base-model-from-scratch)
- **Blog post:** [Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud](https://www.gilesthomas.com/2026/01/llm-from-scratch-29-ddp-training-a-base-model-in-the-cloud) (this is the model from "Training on an 8x A100 with 80 GiB per GPU, using SXM4")

## How to Get Started with the Model

You can download and run the model for inference directly:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="gpjt/8xa100m80", trust_remote_code=True)
out = pipe(
    "Every effort moves you",
    max_new_tokens=20,
    do_sample=True,
    temperature=1.4,
    top_k=25,
)
print(out[0]["generated_text"])
```

Note that because it uses custom code, you'll need to set `trust_remote_code` to `True`.

It supports `AutoTokenizer`, `AutoModel` and `AutoModelForCausalLM`:

```python
>>> from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("gpjt/8xa100m80")
>>> model = AutoModel.from_pretrained("gpjt/8xa100m80", trust_remote_code=True)
>>> llm_model = AutoModelForCausalLM.from_pretrained("gpjt/8xa100m80", trust_remote_code=True)
```

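If you'd rather call `generate` yourself than go through the pipeline, a minimal sketch looks like the following. It assumes the custom model class supports the standard `generate()` API (which the pipeline example above relies on); the prompt and sampling settings are just the same illustrative values as before.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpjt/8xa100m80")
model = AutoModelForCausalLM.from_pretrained("gpjt/8xa100m80", trust_remote_code=True)
model.eval()

# Encode a prompt and sample a short continuation.
input_ids = tokenizer("Every effort moves you", return_tensors="pt").input_ids
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=20,
        do_sample=True,
        temperature=1.4,
        top_k=25,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
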
You can also fine-tune it; [this notebook](https://github.com/gpjt/ddp-base-model-from-scratch/blob/main/hf_train.ipynb) has an example.

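As a rough sketch of what a `Trainer`-based fine-tune could look like (this is not the linked notebook's code: the dataset is a stand-in, and it assumes the custom model class follows the usual causal-LM training interface of accepting `labels` and returning a loss):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "gpjt/8xa100m80"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Any small text dataset works for a smoke test; wikitext-2 is purely illustrative here.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda example: example["text"].strip())

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="8xa100m80-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=5e-5,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```
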
Again, don't expect too much from this model! It's a 163M-parameter GPT-2 one, trained on a limited
number of tokens. It's [both dumb and ignorant](https://www.gilesthomas.com/2026/01/llm-from-scratch-30-digging-into-llm-as-a-judge) ;-)

## Training Details

- **Machine type:** 8x A100 with 80 GiB per GPU, using SXM4
- **Tokens:** 3,260,190,720 (the Chinchilla-optimal 20x the parameter count), rounded up to the nearest batch
- **Dataset:** [gpjt/fineweb-gpt2-tokens](https://huggingface.co/datasets/gpjt/fineweb-gpt2-tokens)
- **Micro-batch size:** 28 (per GPU)
- **Global batch size:** 224
- **Dropout:** 0.1

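The batch and token figures above fit together as follows; this is just a quick derivation from the numbers in this card, assuming the micro-batch is per GPU (which is what makes 28 x 8 = 224).

```python
# Sanity-check the relationships between the training figures above.
n_params = 163_009_536
micro_batch = 28        # sequences per GPU per optimizer step (assumed per-GPU)
n_gpus = 8
context_length = 1024

global_batch = micro_batch * n_gpus              # 224 sequences per optimizer step
tokens_per_step = global_batch * context_length  # 229,376 tokens per optimizer step
chinchilla_tokens = 20 * n_params                # 3,260,190,720 tokens (~20x parameters)

print(global_batch, tokens_per_step, chinchilla_tokens)
```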