1gpu-llm Medium EN/IT Base

This repository is the current ready-to-use base release for the 1gpu-llm medium EN/IT family.

1gpu-llm is a family of language models trained from scratch on a single consumer GPU.

For this release family, the reference training hardware is:

  • GPU: NVIDIA GeForce RTX 4060 Ti 16GB
  • training setup: single GPU
  • practical wall-clock target for a medium model: about 5 days on this hardware to reach the quality class of the current medium base release

Concretely, this release packages step_14700, the practical winner of the GPT2PreLN decay-only run family:

  • family name: 1gpu-llm
  • model tier: medium
  • languages: English + Italian
  • context window: 2500 tokens
  • architecture: GPT-2-style decoder with pre-layernorm blocks
  • architecture config: architecture: gpt2, block_type: gpt2_prelayernorm
  • parameter count: 337,639,424 parameters (~337.639M) in the published Transformers export
  • released checkpoint: step_14700.pt
  • checkpoint role: official operational medium base for the current family
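
As a sanity check on the parameter count above, the sketch below rebuilds it from standard GPT-2 bookkeeping. The hidden size (1024), layer count (24), vocabulary size (32,000), tied input/output embeddings, and 2,500 learned position embeddings are assumptions, not values stated in this card: they are the usual GPT-2 medium dimensions plus a vocabulary size that happens to reproduce the published total. Check config.json before relying on any of them.

```python
def gpt2_param_count(vocab_size, n_positions, d_model, n_layers):
    """Parameter count for a GPT-2-style decoder with tied input/output embeddings."""
    token_emb = vocab_size * d_model      # shared with the LM head (assumed tied)
    pos_emb = n_positions * d_model       # learned absolute positions (assumed)
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: 4x expansion and projection back, each with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two layernorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_block = attn + mlp + layer_norms
    final_ln = 2 * d_model
    return token_emb + pos_emb + n_layers * per_block + final_ln


# All four arguments are assumptions made for illustration (see the text above).
total = gpt2_param_count(vocab_size=32_000, n_positions=2_500, d_model=1_024, n_layers=24)
print(f"{total:,}")  # 337,639,424
```

Moving a layernorm to the front of each block (pre-layernorm) changes where it is applied, not how many parameters it has, so the same formula should cover both block types.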

This is a base model, not an instruction-tuned chat model.

Provenance

  • non-decayed anchor run:
    • stable-recipe-gpt2medium-gpt2preln-k20-wsd-lr2e-4-anchor20k-final2e5-webwiki
  • non-decayed anchor checkpoint:
    • step_13500.pt
  • original decay-only parent run:
    • 20260628_resume-gpt2medium-gpt2preln-k20-wsddecayonly-lr2e-4-anchor20k-final2e5-webwiki-step13500
  • replayed tail run:
    • 20260629_resume-gpt2medium-gpt2preln-k20-wsddecayonly-rerunmissing-lr3p5294e5-anchor20k-final2e5-webwiki-step14200-to14850
  • released checkpoint:
    • step_14700.pt

Practical reading:

  • the family produced three checkpoints with distinct roles:
    • step_14250 = best pure scalar / benchmark checkpoint
    • step_14700 = best practical balanced release candidate
    • step_14500 = best behavior-oriented variant
  • this repo is the public medium base release, so it intentionally promotes step_14700 rather than the raw scalar champion step_14250

Training Data

This model was trained on the bilingual EN/IT web + wiki dataset:

  • dataset id on disk:
    • 202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
  • context window during training: 2500 tokens
  • packing length: 2500
  • mixing strategy: source_balanced
  • validation ratio: 0.05

Main source groups:

  • English FineWeb-HQ (epfml/FineWeb-HQ)
  • Italian FineWeb2-HQ (epfml/FineWeb2-HQ)
  • English Wiki40B (google/wiki40b)
  • Italian Wiki40B (google/wiki40b)

How Many Tokens This Checkpoint Saw

Training math:

  • sequence length: 2500
  • batch size: 2
  • grad accumulation: 48
  • tokens per optimizer step: 239,904 (= 2 × 48 × 2,499, so 2,499 tokens are counted per 2,500-token sequence)

So this checkpoint saw approximately:

  • about 3.527B tokens total by step_14700 (14,700 × 239,904 = 3,526,588,800)
  • about 287.88M extra tokens during the decay-only continuation beyond the non-decayed anchor step_13500 (1,200 × 239,904 = 287,884,800)
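
The token totals above can be reproduced with a few lines of arithmetic. This is only a sketch built from the numbers in this card. The reading that each 2,500-token sequence contributes 2,499 counted tokens (for example, next-token targets) is an inference from the per-step figure, not something the card states.

```python
SEQ_LEN = 2_500
BATCH_SIZE = 2
GRAD_ACCUM = 48
TOKENS_PER_STEP = 239_904  # value reported in this card

sequences_per_step = BATCH_SIZE * GRAD_ACCUM
# Tokens counted per sequence, implied by the reported per-step figure.
tokens_per_sequence = TOKENS_PER_STEP // sequences_per_step

RELEASE_STEP = 14_700
ANCHOR_STEP = 13_500

total_tokens = RELEASE_STEP * TOKENS_PER_STEP
decay_tokens = (RELEASE_STEP - ANCHOR_STEP) * TOKENS_PER_STEP

print(f"sequences per step:  {sequences_per_step}")
print(f"tokens per sequence: {tokens_per_sequence} (sequence length {SEQ_LEN})")
print(f"total tokens:        {total_tokens:,}")
print(f"decay-only tokens:   {decay_tokens:,}")
```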

Why This Checkpoint Was Chosen

The final like-for-like GPU benchmark across the shortlisted medium-family checkpoints deliberately kept these roles separate.

Pure benchmark/loss ranking (lower is better):

  • step_14250: val_loss_mixed = 4.4419
  • step_14700: val_loss_mixed = 4.4436
  • step_13500: val_loss_mixed = 4.4690
  • step_14500: val_loss_mixed = 4.4926
  • step_15700: val_loss_mixed = 4.5038

So step_14250 is still the scalar winner.

However, the release decision for the single public medium base model weighed practical behavior, not just the thinnest scalar margin:

  • the loss gap between 14250 and 14700 is only about +0.0017 on the rounded values above (4.4436 vs 4.4419)
  • 14700 is cleaner on the practical behavior proxies (each pair below is 14700 vs 14250):
    • loop_rate = 0.375 vs 0.425
    • repeated_4gram_rate = 0.750 vs 0.775
    • language_consistency_en = 1.000 vs 0.950
    • language_consistency_it = 0.850 vs 0.825
  • and in the checkpoint-specific decoding sweep, 14700 produced the strongest holdout result among the real release candidates

So this repo promotes the checkpoint that is the best compromise for an operational family base release, not just the one that wins the scalar leaderboard by the smallest possible edge.

Main Metrics for step_14700

  • val_loss_mixed = 4.4436
  • val_loss_en = 4.3929
  • val_loss_it = 3.5830
  • ppl_mixed = 85.0781
  • ppl_en = 80.8710
  • ppl_it = 35.9822
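
The perplexity values above are consistent with the usual relation ppl = exp(loss), with the loss in nats per token. The short check below assumes that relation. Small differences in the last digits are expected because the losses in this card are rounded to four decimals.

```python
import math


def perplexity(loss_nats):
    """Perplexity implied by a mean cross-entropy loss in nats per token."""
    return math.exp(loss_nats)


# Rounded losses and reported perplexities from this card.
reported = {
    "mixed": (4.4436, 85.0781),
    "en": (4.3929, 80.8710),
    "it": (3.5830, 35.9822),
}

for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f} (card: {ppl})")
```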

Behavior snapshot:

  • loop_rate = 0.375
  • distinct_2 = 0.5643
  • repeated_4gram_rate = 0.750
  • language_consistency_en = 1.000
  • language_consistency_it = 0.850
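
This card does not define the behavior proxies. As a rough illustration only, the sketch below shows one common way to compute a distinct-n ratio and a per-sample repeated-n-gram rate. The function names, the whitespace tokenization, and the "fraction of samples with at least one repeated 4-gram" reading are all assumptions. The repo's benchmark bundle remains the authoritative source for the actual definitions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n=2):
    """Unique n-grams divided by total n-grams (0.0 if the text is too short)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def has_repeated_ngram(tokens, n=4):
    """True if any n-gram occurs more than once in the sample."""
    grams = ngrams(tokens, n)
    return len(set(grams)) < len(grams)


def repeated_ngram_rate(samples, n=4):
    """Fraction of samples containing at least one repeated n-gram (assumed definition)."""
    if not samples:
        return 0.0
    return sum(has_repeated_ngram(s, n) for s in samples) / len(samples)


# Hypothetical generations, split on whitespace for simplicity.
samples = [
    "il gatto dorme sul divano il gatto dorme sul divano".split(),
    "the capital of Italy is Rome".split(),
]
print(repeated_ngram_rate(samples, n=4))
print([round(distinct_n(s, n=2), 4) for s in samples])
```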

Source losses:

  • books_en = 4.4110
  • books_it = 4.3904
  • code = 7.7208
  • web_en = 5.4893
  • web_it = 5.4087
  • wiki_en = 2.9984
  • wiki_it = 2.9202

Short honest read:

  • this is not the best pure scalar checkpoint
  • it is the best practical balanced checkpoint of the medium family
  • it keeps near-best loss while looping and repeating less than the scalar winner step_14250
  • this is the checkpoint to use when you want the official single-repo medium base of the family

Recommended Decoding

The repo-native decoding sweep was run on this exact checkpoint.

Raw sweep result:

  • tuning winner: creative
  • holdout winner: creative

Public default:

  • keep balanced as the recommended preset for the published family-base card
  • rationale:
    • the maintainer (Naz) explicitly prefers balanced as the default unless creative wins clearly enough to justify the more aggressive preset
    • on this checkpoint, creative does win the holdout score, but not by a margin large enough to force a louder default for the public base release
    • practical delta:
      • creative holdout score = 2.6369
      • balanced holdout score = 2.4656
      • delta = +0.1713
    • so the repo keeps the stronger exploratory preset documented, but ships the calmer preset as the default recommendation

Recommended generation params (balanced):

  • do_sample = true
  • temperature = 0.8
  • top_k = 50
  • top_p = 0.95
  • repetition_penalty = 1.1
  • no_repeat_ngram_size = 0
  • max_new_tokens = 64

Holdout metrics for the recommended preset:

  • score = 2.4656
  • completion_rate = 1.0
  • distinct_2 = 0.9878
  • language_consistency_mean = 0.6667
  • loop_rate = 0.0
  • repeated_4gram_rate = 0.0
  • language_switch_rate_mean = 0.2500
  • length_closeness = 0.9355

If you want the higher-scoring exploratory preset from the sweep instead:

  • creative
    • temperature = 1.0
    • top_k = 100
    • holdout score = 2.6369

Both generation_config.json and recommended_decoding_params.json are included in the repo.

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

repo_id = "nazdef/1gpu-llm-medium-en-it-base"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Tokenize without automatic special tokens, then prepend BOS explicitly.
prompt = "La capitale d'Italia è"
prompt_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
bos = torch.tensor([[tokenizer.bos_token_id]], dtype=prompt_ids["input_ids"].dtype)
input_ids = torch.cat([bos, prompt_ids["input_ids"]], dim=1)
# Single unpadded prompt, so every position is attended to.
attention_mask = torch.ones_like(input_ids)

# Sampling settings below are the recommended "balanced" preset.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_new_tokens=64,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
# Decode the prompt plus continuation, dropping BOS/EOS markers.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Files Included

  • original .pt checkpoint
  • exported checkpoint-native .safetensors weights plus metadata sidecar
  • standard Transformers model.safetensors
  • Transformers config.json
  • tokenizer files
  • training config
  • resumed-run telemetry (best_validation.json, metrics.jsonl, eval_metrics.jsonl, probe_generations.jsonl)
  • repo-native benchmark bundle (summary.json, comparison.json, comparison.csv, metrics.json, metrics.csv, source_losses.json, report.md, generations.jsonl, generations_comparison.md, cloze_results.jsonl)
  • decoding search bundle (decoding_summary.json, decoding_report.md, tuning_leaderboard.csv, holdout_leaderboard.csv, tuning_generations.jsonl, holdout_generations.jsonl)
  • recommended generation settings (generation_config.json, recommended_decoding_params.json)
  • release note (release_note.md)

Intended Use

Use this model as:

  • the current medium bilingual base checkpoint of the 1gpu-llm family
  • a base for future SFT or downstream instruction tuning
  • a single-GPU from-scratch EN/IT medium reference model

Do not read this repo as:

  • proof that 14700 is the best checkpoint on every possible axis
  • a claim that it beats the scalar winner 14250 on pure loss
  • an instruction-following or safety-tuned assistant model

License

This release is published under CC-BY-SA-4.0, chosen as the practical downstream license posture for the mixed training corpus used here.

The training mix includes:

  • FineWeb-HQ / FineWeb2-HQ web data
  • Wiki40B English and Italian slices

Downstream users are responsible for checking whether their use, redistribution, or derivative packaging remains compatible with the obligations of the upstream datasets and their terms.
