GPT-2 FineWeb (machkour's continued pretraining variant)

Model Description

This is a continued pretraining (next-token-prediction fine-tuning) of the original GPT-2 (small, 124M parameters) on a high-quality subset of the FineWeb dataset.

  • Base model: openai-community/gpt2 (original GPT-2 small)
  • Training method: PEFT + LoRA (r=32, alpha=32, targets: c_attn, c_proj, c_fc)
  • Quantization during training: 4-bit NF4 + double quant + bfloat16 compute
  • Training objective: Causal language modeling (next token prediction)
  • Purpose: Improve general text continuation / next-word prediction quality before personality / instruction tuning in the next stage.
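The adapter and quantization settings above can be sketched with PEFT and bitsandbytes. Only the values listed in the bullets (r=32, alpha=32, target modules c_attn/c_proj/c_fc, 4-bit NF4 with double quantization and bfloat16 compute) come from this card; everything else (dropout, bias handling) is left at library defaults and is an assumption:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute,
# matching the training setup described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapter: r=32, alpha=32, targeting GPT-2's fused attention
# projection (c_attn), output projection (c_proj), and MLP layer (c_fc)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)
```

These two configs would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.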

When & How Was It Created?

  • Creation date: February 26, 2026
  • Training duration: ≈ 60 minutes (600 training steps)
  • Hardware: Google Colab T4 GPU (15 GB VRAM)
  • Training start → end: ~39–40 minutes wall-clock time for the final run (after map/tokenization overhead)
  • Created by: @younes

Training Data

  • Dataset: HuggingFaceFW/fineweb, configuration sample-10BT
  • Processed subset: 700,000 documents (after filtering out short texts)
  • Total tokens: ≈ 416 million (estimated from previous runs)
  • Format: .jsonl.gz with a single field {"text": "..."}
  • Language: Primarily English (FineWeb is English-dominant)
  • Data source date: Mostly web crawl snapshots from ~2013 (older data, but a clean, high-quality subset)

Note: This is not instruction-tuned or chat-tuned. It is still a raw language model optimized for free-form text continuation.
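The preprocessing described above (filter out short documents, write one {"text": ...} object per line into .jsonl.gz) can be sketched with the standard library. The 200-character minimum is an assumed threshold for illustration, not the one actually used:

```python
import gzip
import json

MIN_CHARS = 200  # assumed cutoff for "short text" filtering

def filter_and_write(docs, out_path, min_chars=MIN_CHARS):
    """Keep documents above a minimum length; write them as .jsonl.gz."""
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for text in docs:
            if len(text) >= min_chars:
                f.write(json.dumps({"text": text}) + "\n")
                kept += 1
    return kept

# Tiny demonstration with synthetic documents
docs = ["short", "x" * 500, "y" * 300]
kept = filter_and_write(docs, "subset.jsonl.gz")
print(kept)  # 2 — only the two long documents survive the filter
```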

Model Size & Files

  • Base parameters: 124M (GPT-2 small)
  • Trainable LoRA parameters: ≈ 4.72M (~3.65% of total)
  • Total effective parameters: still 124M (LoRA is additive)
  • Disk size (saved folder): ~250–350 MB (4-bit base + LoRA adapters)
  • Saved directory: gpt2-fineweb-final
  • Main files:
    • adapter_config.json
    • adapter_model.bin (or safetensors)
    • config.json
    • generation_config.json
    • pytorch_model.bin (quantized base)
    • tokenizer.json, vocab.json, merges.txt
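The trainable-parameter fraction quoted above can be sanity-checked with simple arithmetic: 4.72M LoRA parameters against the 124M base plus adapters lands very close to the stated percentage (the exact PEFT printout depends on how embeddings are counted):

```python
base_params = 124_000_000   # GPT-2 small
lora_params = 4_720_000     # trainable LoRA parameters reported above

# Trainable share relative to the full model + adapters
pct = 100 * lora_params / (base_params + lora_params)
print(f"{pct:.2f}%")  # ≈ 3.67%, consistent with the ~3.65% reported by PEFT
```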

How to Use / Load the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "openai-community/gpt2"
adapter_path = "machkour/gpt2-fineweb-416M-tokens"   # ← change to your repo after upload

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="auto"
)

model = PeftModel.from_pretrained(model, adapter_path)

# Optional: merge LoRA weights into base model (for faster inference)
# model = model.merge_and_unload()

# Example generation
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.85,
    top_p=0.92
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Intended Use & Limitations

  • Best for: open-ended text generation, story continuation, code / prose completion
  • Not suitable (yet) for: chat / instruction following, Q&A, Arabic-dominant tasks
  • Next planned step: personality / role injection + instruction tuning

Training Hyperparameters (final run)

  • Optimizer: adamw_8bit
  • Learning rate: 5e-5
  • Batch size: 4 × 4 (effective 16)
  • Steps: 600
  • Warmup steps: 50
  • Gradient checkpointing: yes
  • Mixed precision: bf16
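The hyperparameters above imply how much of the dataset one run actually covered. Assuming GPT-2's full 1024-token context per sequence (the sequence length is not stated on this card):

```python
steps = 600
effective_batch = 16   # 4 × 4 as reported above
seq_len = 1024         # assumed: GPT-2's full context window

tokens_seen = steps * effective_batch * seq_len
print(f"{tokens_seen:,}")  # 9,830,400 — only ~2.4% of the ~416M-token subset
```

So under this assumption the run touches a small fraction of the prepared subset, which is consistent with a short 600-step continued-pretraining pass.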

Results / Observations

  • Loss decreased from ~3.89 → ~3.54 over 600 steps
  • Visible improvement in fluency and topical coherence compared to vanilla GPT-2
  • Still shows signs of repetition and old-web style (likely due to the age of the FineWeb crawl)
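The loss drop above can be read as perplexity (the exponential of the cross-entropy loss), which is often easier to interpret; the endpoint losses are the ones reported above:

```python
import math

loss_before, loss_after = 3.89, 3.54  # start and end of the 600-step run

ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)
print(f"{ppl_before:.1f} -> {ppl_after:.1f}")  # roughly 48.9 -> 34.5
```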

How to Cite / Reference

If you use this model:

@misc{gpt2-fineweb-machkour,
  author       = {machkour},
  title        = {GPT-2 continued on FineWeb (416M tokens)},
  year         = {2026},
  month        = {February},
  howpublished = {\url{https://huggingface.co/machkour/gpt2-fineweb-416M-tokens}},
}


Uploading to the Hub

After saving this card as README.md in the model folder, push everything with:

from huggingface_hub import login, upload_folder

login()  # paste your HF token

upload_folder(
    folder_path="gpt2-fineweb-final",
    repo_id="machkour/gpt2-fineweb-416M-tokens",
    repo_type="model",
    commit_message="Upload fine-tuned GPT-2 on FineWeb"
)

(Install huggingface_hub if needed: !pip install huggingface_hub)
