You can find all code on GitHub

Note: This is a 125-million-parameter model, an attempt to replicate GPT-3 Small. It is very undertrained.

Note 2: This checkpoint was released on 10.05.2026. It was trained for 50,000 steps with the Muon optimizer, a batch size of 72, and 4 gradient accumulation steps. It scores 25.49% on MMLU, only slightly above the 25% expected from random guessing.

Note 3: This model already demonstrates basic text generation abilities. It is not perfect, and I will continue working on it. Expect an Instruct model soon.

Model description

This is a small GPT-style autoregressive language model. It is intended as a development checkpoint, not as a production-ready assistant, but you are welcome to try it.

This time I used optimized kernels, with Flash Attention 4 and Flash Attention 2 and a fallback to SDPA. This cut the time required for one training step from nearly 60 seconds (on a Jetson) to 3.6 seconds (on the server), and then to 2.2 seconds (using Unsloth kernels).
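The fallback order described above can be sketched as a small selection helper. This is a minimal sketch, not the training code: the helper names `module_available` and `pick_attention_backend` are hypothetical, and it assumes Flash Attention 2 is detected by the presence of the `flash_attn` package. The strings `flash_attention_2` and `sdpa` follow the names Transformers uses for its `attn_implementation` option. A Flash Attention 4 entry could be placed first in `candidates` once its module name is known.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


def pick_attention_backend(
    candidates=(("flash_attention_2", "flash_attn"),),  # assumed detection rule
    fallback="sdpa",
):
    """Return the first backend whose module is installed, else the fallback."""
    for backend, module in candidates:
        if module_available(module):
            return backend
    return fallback


if __name__ == "__main__":
    print("attention backend:", pick_attention_backend())
```

Ordering the candidates from fastest to most portable means the same script can run on machines with and without the optional kernels installed.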

Important notes

This model is still undertrained. Its benchmark results are close to random-choice level on multiple-choice academic benchmarks, so the checkpoint should be treated as experimental.
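To put "close to random-choice level" in numbers, the sketch below estimates how many standard errors the MMLU accuracy lies above chance. It rests on assumptions that are not stated in this card: that the full MMLU test split of 14,042 questions was evaluated, that the reported average is taken over all questions rather than over tasks, and that every question has four answer choices. The function name `z_score_vs_chance` is hypothetical.

```python
import math


def z_score_vs_chance(accuracy: float, n_questions: int, n_choices: int = 4) -> float:
    """Number of standard errors by which `accuracy` exceeds random guessing."""
    chance = 1.0 / n_choices
    # Standard error of a proportion under the random-guess hypothesis.
    std_err = math.sqrt(chance * (1.0 - chance) / n_questions)
    return (accuracy - chance) / std_err


if __name__ == "__main__":
    # Assumption: all 14,042 MMLU test questions were scored.
    z = z_score_vs_chance(0.2549, 14042)
    print(f"standard errors above chance: {z:.2f}")
```

Under these assumptions the result is a little over one standard error above chance, which is consistent with treating the checkpoint as experimental.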

It can generate basic text, but it may produce incorrect, repetitive, incoherent, or unreadable outputs. It is not instruction-tuned, but it can produce several meaningful paragraphs.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# The repository ID must be passed as a string.
tokenizer = AutoTokenizer.from_pretrained("k050506koch/GPT3-dev-125m-1009", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("k050506koch/GPT3-dev-125m-1009", trust_remote_code=True)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "He is a doctor. His main goal is"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=96,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluation results

Evaluation was run locally on CPU with a custom evaluation script.

These results should not be compared directly with Open LLM Leaderboard results unless the same evaluation harness, prompt format, number of shots, and dataset splits are used.
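The custom evaluation script is not shown here, so the following is only a sketch of one common way to obtain accuracy and perplexity figures like those reported: score each answer choice by its length-normalized log-likelihood, pick the best one, and compute perplexity as the exponential of the mean negative log-likelihood. All function names are hypothetical, and the per-token log-probabilities are assumed to come from the model.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the given tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_choice(choice_logprobs):
    """Index of the answer whose tokens have the highest mean log-probability."""
    scores = [sum(lps) / len(lps) for lps in choice_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Toy log-probabilities for one question with three answer choices.
    choices = [[-2.0, -2.0], [-0.5, -1.0], [-3.0]]
    print("picked choice:", pick_choice(choices))
    print("perplexity of picked choice:", perplexity(choices[pick_choice(choices)]))
```

Choices such as length normalization and the prompt format change the scores noticeably, which is one reason results from different harnesses are not directly comparable.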

Summary

| Benchmark | Accuracy | Perplexity |
| --- | --- | --- |
| HellaSwag | 0.2677 | 34.3111 |
| MMLU average | 0.2549 | 141.9833 |

MMLU

| Task | Accuracy | Perplexity |
| --- | --- | --- |
| abstract_algebra | 0.2600 | 182.4785 |
| anatomy | 0.2519 | 206.2038 |
| astronomy | 0.2303 | 166.3864 |
| business_ethics | 0.2800 | 145.5782 |
| clinical_knowledge | 0.1925 | 100.5738 |
| college_biology | 0.2847 | 162.7603 |
| college_chemistry | 0.2800 | 157.3521 |
| college_computer_science | 0.2200 | 132.0329 |
| college_mathematics | 0.2300 | 114.1684 |
| college_medicine | 0.2254 | 24.5343 |
| college_physics | 0.2353 | 115.2290 |
| computer_security | 0.2300 | 141.5838 |
| conceptual_physics | 0.2894 | 312.6869 |
| econometrics | 0.2632 | 135.2830 |
| electrical_engineering | 0.2690 | 259.6937 |
| elementary_mathematics | 0.2646 | 64.6184 |
| formal_logic | 0.2460 | 56.9265 |
| global_facts | 0.1500 | 89.0267 |
| high_school_biology | 0.2677 | 89.7088 |
| high_school_chemistry | 0.2562 | 123.2220 |
| high_school_computer_science | 0.2300 | 79.9634 |
| high_school_european_history | 0.2667 | 118.5012 |
| high_school_geography | 0.2980 | 156.3795 |
| high_school_government_and_politics | 0.2176 | 174.9534 |
| high_school_macroeconomics | 0.2462 | 132.2859 |
| high_school_mathematics | 0.2333 | 105.9731 |
| high_school_microeconomics | 0.2605 | 82.1080 |
| high_school_physics | 0.2715 | 71.0461 |
| high_school_psychology | 0.2624 | 137.8331 |
| high_school_statistics | 0.2824 | 61.6760 |
| high_school_us_history | 0.3039 | 88.8365 |
| high_school_world_history | 0.2447 | 74.1491 |
| human_aging | 0.2377 | 306.9222 |
| human_sexuality | 0.2595 | 110.5550 |
| international_law | 0.3223 | 211.6555 |
| jurisprudence | 0.2130 | 109.2910 |
| logical_fallacies | 0.2331 | 207.6864 |
| machine_learning | 0.2500 | 120.3576 |
| management | 0.3592 | 368.0460 |
| marketing | 0.2436 | 73.0363 |
| medical_genetics | 0.3100 | 296.1581 |
| miscellaneous | 0.2363 | 140.3008 |
| moral_disputes | 0.2370 | 111.0396 |
| moral_scenarios | 0.2402 | 105.1889 |
| nutrition | 0.2484 | 203.6292 |
| philosophy | 0.2540 | 88.0570 |
| prehistory | 0.2191 | 123.8685 |
| professional_accounting | 0.2695 | 60.2937 |
| professional_law | 0.2581 | 17.2965 |
| professional_medicine | 0.2868 | 107.5151 |
| professional_psychology | 0.2647 | 104.7847 |
| public_relations | 0.2727 | 94.3958 |
| security_studies | 0.3306 | 70.1510 |
| sociology | 0.2886 | 243.0351 |
| us_foreign_policy | 0.2000 | 206.4246 |
| virology | 0.1988 | 125.7791 |
| world_religions | 0.2515 | 423.8289 |

Limitations

Because this is only a next-token prediction model, it does not know how to interact with a user.

Training data

The model was trained only on HuggingFaceFW/fineweb.

Training metadata

Checkpoint date: 10.05.2026
Parameters: 125231616
Context length: 2048
Batch size: 72
Gradient accumulation: 4
Sequence length: 512
Training steps: 50000
Optimizer: Fused Muon with Hermes kernels
Learning rate schedule: cosine
Hardware: Frankenstein (a 2012 datacenter server with an RTX 5070 Ti)
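The metadata above implies a rough training token budget, worked out in the sketch below. It assumes that the batch size of 72 is the per-micro-batch size, that the 4 gradient accumulation steps are applied on top of it, and that every sequence is filled to the full 512 tokens. None of these is confirmed here, so the total is an upper-bound estimate. The function name `token_budget` is hypothetical.

```python
def token_budget(batch_size: int, grad_accum: int, seq_len: int, steps: int):
    """Return (sequences per optimizer step, tokens per step, total tokens)."""
    sequences_per_step = batch_size * grad_accum
    tokens_per_step = sequences_per_step * seq_len
    total_tokens = tokens_per_step * steps
    return sequences_per_step, tokens_per_step, total_tokens


if __name__ == "__main__":
    # Values taken from the training metadata list.
    seqs, per_step, total = token_budget(batch_size=72, grad_accum=4, seq_len=512, steps=50_000)
    print(f"sequences per optimizer step: {seqs}")
    print(f"tokens per optimizer step:    {per_step:,}")
    print(f"total training tokens:        {total:,}")
```

Under these assumptions the run saw about 7.4 billion tokens.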

Contributing

Contributions are always welcome.

I am still a student, so the code and model may contain mistakes, bugs, or incorrect assumptions. If you find an issue or have an improvement, feel free to open an issue or submit a pull request. I will be happy.

Acknowledgements

Thanks to OpenAI, Hugging Face, PyTorch and Unsloth for making this kind of research and experimentation possible.
