GPT-2 FineWeb (machkour's continued pretraining variant)

Model Description

This is a continued pretraining (next-token-prediction fine-tuning) of the original GPT-2 (small, 124M parameters) on a high-quality subset of the FineWeb dataset.

  • Base model: openai-community/gpt2 (original GPT-2 small)
  • Training method: PEFT + LoRA (r=32, alpha=32, targets: c_attn, c_proj, c_fc)
  • Quantization during training: 4-bit NF4 + double quant + bfloat16 compute
  • Training objective: Causal language modeling (next token prediction)
  • Purpose: Improve general text continuation / next-word prediction quality before personality / instruction tuning in the next stage.
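The adapter and quantization settings above can be sketched with PEFT and bitsandbytes. Only the values listed in the bullets (r=32, alpha=32, target modules c_attn/c_proj/c_fc, 4-bit NF4 with double quantization and bfloat16 compute) come from this card; everything else (dropout, bias handling) is left at library defaults and is an assumption:

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute,
# matching the training setup described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

# LoRA adapter: r=32, alpha=32, targeting GPT-2's fused attention
# projection (c_attn), output projection (c_proj), and MLP layer (c_fc)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)
```

These two configs would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.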

When & How Was It Created?

  • Creation date: February 26, 2026
  • Training duration: ≈ 60 minutes (600 training steps)
  • Hardware: Google Colab T4 GPU (15 GB VRAM)
  • Training start → end: ~39–40 minutes wall-clock time for the final run (after map/tokenization overhead)
  • Created by: @younes

Training Data

  • Dataset: HuggingFaceFW/fineweb, configuration sample-10BT
  • Processed subset: 700,000 documents (after filtering out short texts)
  • Total tokens: ≈ 416 million (estimated from previous runs)
  • Format: .jsonl.gz with a single field {"text": "..."}
  • Language: Primarily English (FineWeb is English-dominant)
  • Data source date: Mostly web crawl snapshots from ~2013 (older data, but a clean, high-quality subset)

Note: This is not instruction-tuned or chat-tuned. It is still a raw language model optimized for free-form text continuation.
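The preprocessing described above (filter out short documents, write one {"text": ...} object per line into .jsonl.gz) can be sketched with the standard library. The 200-character minimum is an assumed threshold for illustration, not the one actually used:

```python
import gzip
import json

MIN_CHARS = 200  # assumed cutoff for "short text" filtering

def filter_and_write(docs, out_path, min_chars=MIN_CHARS):
    """Keep documents above a minimum length; write them as .jsonl.gz."""
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for text in docs:
            if len(text) >= min_chars:
                f.write(json.dumps({"text": text}) + "\n")
                kept += 1
    return kept

# Tiny demonstration with synthetic documents
docs = ["short", "x" * 500, "y" * 300]
kept = filter_and_write(docs, "subset.jsonl.gz")
print(kept)  # 2 — only the two long documents survive the filter
```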

Model Size & Files

  • Base parameters: 124M (GPT-2 small)
  • Trainable LoRA parameters: ≈ 4.72M (~3.65% of total)
  • Total effective parameters: still 124M (LoRA is additive)
  • Disk size (saved folder): ~250–350 MB (4-bit base + LoRA adapters)
  • Saved directory: gpt2-fineweb-final
  • Main files:
    • adapter_config.json
    • adapter_model.bin (or safetensors)
    • config.json
    • generation_config.json
    • pytorch_model.bin (quantized base)
    • tokenizer.json, vocab.json, merges.txt
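The trainable-parameter fraction quoted above can be sanity-checked with simple arithmetic: 4.72M LoRA parameters against the 124M base plus adapters lands very close to the stated percentage (the exact PEFT printout depends on how embeddings are counted):

```python
base_params = 124_000_000   # GPT-2 small
lora_params = 4_720_000     # trainable LoRA parameters reported above

# Trainable share relative to the full model + adapters
pct = 100 * lora_params / (base_params + lora_params)
print(f"{pct:.2f}%")  # ≈ 3.67%, consistent with the ~3.65% reported by PEFT
```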

How to Use / Load the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "openai-community/gpt2"
adapter_path = "machkour/gpt2-fineweb-416M-tokens"   # ← change to your repo after upload

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="auto"
)

model = PeftModel.from_pretrained(model, adapter_path)

# Optional: merge LoRA weights into base model (for faster inference)
# model = model.merge_and_unload()

# Example generation
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.85,
    top_p=0.92
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Intended Use & Limitations

  • Best for: open-ended text generation, story continuation, code / prose completion
  • Not suitable (yet) for: chat / instruction following, Q&A, Arabic-dominant tasks
  • Next planned step: personality / role injection + instruction tuning

Training Hyperparameters (final run)

  • Optimizer: adamw_8bit
  • Learning rate: 5e-5
  • Batch size: 4 × 4 (effective 16)
  • Steps: 600
  • Warmup steps: 50
  • Gradient checkpointing: yes
  • Mixed precision: bf16
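The hyperparameters above imply how much of the dataset one run actually covered. Assuming GPT-2's full 1024-token context per sequence (the sequence length is not stated on this card):

```python
steps = 600
effective_batch = 16   # 4 × 4 as reported above
seq_len = 1024         # assumed: GPT-2's full context window

tokens_seen = steps * effective_batch * seq_len
print(f"{tokens_seen:,}")  # 9,830,400 — only ~2.4% of the ~416M-token subset
```

So under this assumption the run touches a small fraction of the prepared subset, which is consistent with a short 600-step continued-pretraining pass.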

Results / Observations

  • Loss decreased from ~3.89 → ~3.54 over 600 steps
  • Visible improvement in fluency and topical coherence compared to vanilla GPT-2
  • Still shows signs of repetition and old-web style (likely due to the age of the FineWeb crawl)
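The loss drop above can be read as perplexity (the exponential of the cross-entropy loss), which is often easier to interpret; the endpoint losses are the ones reported above:

```python
import math

loss_before, loss_after = 3.89, 3.54  # start and end of the 600-step run

ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)
print(f"{ppl_before:.1f} -> {ppl_after:.1f}")  # roughly 48.9 -> 34.5
```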

How to Cite / Reference

If you use this model:

@misc{gpt2-fineweb-machkour,
  author       = {machkour},
  title        = {GPT-2 continued on FineWeb (416M tokens)},
  year         = {2026},
  month        = {February},
  howpublished = {\url{https://huggingface.co/machkour/gpt2-fineweb-416M-tokens}},
}


Uploading to the Hub

After saving this card as README.md in the model folder, push everything with:

from huggingface_hub import login, upload_folder

login()  # paste your HF token

upload_folder(
    folder_path="gpt2-fineweb-final",
    repo_id="machkour/gpt2-fineweb-416M-tokens",
    repo_type="model",
    commit_message="Upload fine-tuned GPT-2 on FineWeb"
)

(Install huggingface_hub if needed: !pip install huggingface_hub)
