# GPT-2 FineWeb (machkour's continued pretraining variant)
## Model Description

This is a continued-pretraining run (fine-tuning for next-token prediction) of the original GPT-2 (small, 124M parameters) on a high-quality subset of the FineWeb dataset.
- Base model: openai-community/gpt2 (original GPT-2 small)
- Training method: PEFT + LoRA (r=32, alpha=32, targets: c_attn, c_proj, c_fc)
- Quantization during training: 4-bit NF4 + double quant + bfloat16 compute
- Training objective: Causal language modeling (next token prediction)
- Purpose: Improve general text continuation / next-word prediction quality before personality / instruction tuning in the next stage.
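The quantization and LoRA setup listed above can be sketched as follows. This is a minimal sketch, not the exact training script: the hyperparameter values come from this card, but the variable names and everything else are illustrative (assumes `transformers`, `peft`, and `bitsandbytes` are installed).

```python
# Sketch of the training-time setup described above; only the listed
# hyperparameters (4-bit NF4, double quant, bf16 compute, r=32, alpha=32,
# target modules) are from the card.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NF4
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
```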
## When & How Was It Created?
- Creation date: February 26, 2026
- Training duration: ≈ 60 minutes (600 training steps)
- Hardware: Google Colab T4 GPU (15 GB VRAM)
- Training start to end: ~39–40 minutes wall-clock time for the final run (after map/tokenization overhead)
- Created by: @younes
## Training Data
- Dataset: HuggingFaceFW/fineweb, configuration `sample-10BT`
- Processed subset: 700,000 documents (after filtering short texts)
- Total tokens: ≈ 416 million (estimated from previous runs)
- Format: `.jsonl.gz` with a single field, `{"text": "..."}`
- Language: primarily English (FineWeb is English-dominant)
- Data source date: mostly web crawl snapshots from ~2013 (old, but a very clean, high-quality subset)
Note: This is not instruction-tuned or chat-tuned. It is still a raw language model optimized for free-form text continuation.
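The "filtering short texts" step is not specified in detail on this card; a minimal sketch of such a filter might look like the following. The 200-character threshold and the function name are assumptions for illustration, not values from the actual pipeline.

```python
def keep_document(example, min_chars=200):
    """Keep only documents whose text is reasonably long.
    The 200-character threshold is an assumed example value."""
    text = example.get("text", "")
    return len(text.strip()) >= min_chars

# With the Hugging Face `datasets` library this would typically be applied as:
#   dataset = dataset.filter(keep_document)
docs = [
    {"text": "Too short."},
    {"text": "A longer web document " * 20},
]
kept = [d for d in docs if keep_document(d)]
print(len(kept))  # 1
```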
## Model Size & Files
- Base parameters: 124M (GPT-2 small)
- Trainable LoRA parameters: ≈ 4.72M (3.65% of total)
- Total effective parameters: still 124M (LoRA is additive)
- Disk size (saved folder): ~250–350 MB (4-bit base + LoRA adapters)
- Saved directory: `gpt2-fineweb-final`
- Main files: `adapter_config.json`, `adapter_model.bin` (or safetensors), `config.json`, `generation_config.json`, `pytorch_model.bin` (quantized base), `tokenizer.json`, `vocab.json`, `merges.txt`
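The ≈ 4.72M trainable-parameter figure can be cross-checked with a back-of-the-envelope calculation from the LoRA rank above and the standard GPT-2 small layer shapes (12 layers, hidden size 768; the exact 124,439,808 base parameter count is the standard GPT-2 small value, not stated on this card):

```python
# Back-of-the-envelope check of the trainable LoRA parameter count.
# Per-layer target modules and their (in, out) shapes in GPT-2 small:
r = 32
layers = 12
targets = {
    "attn.c_attn": (768, 2304),   # fused QKV projection
    "attn.c_proj": (768, 768),
    "mlp.c_fc":    (768, 3072),
    "mlp.c_proj":  (3072, 768),   # "c_proj" matches both attn and mlp modules
}
# Each LoRA pair (A: in x r, B: r x out) adds r * (in + out) parameters.
per_layer = sum(r * (i + o) for i, o in targets.values())
total = layers * per_layer
print(total)  # 4718592, i.e. ~4.72M, matching the card

base_params = 124_439_808  # standard GPT-2 small parameter count
frac = total / (base_params + total)
print(round(frac * 100, 2))  # 3.65, matching the card's 3.65%
```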
## How to Use / Load the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "openai-community/gpt2"
adapter_path = "machkour/gpt2-fineweb-416M-tokens"  # change to your repo after upload

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="auto",
)

# Load the LoRA adapters on top of the base model
model = PeftModel.from_pretrained(model, adapter_path)

# Optional: merge LoRA weights into the base model (for faster inference)
# model = model.merge_and_unload()

# Example generation
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.85,
    top_p=0.92,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use & Limitations
- Best for: open-ended text generation, story continuation, code / prose completion
- Not suitable (yet) for: chat / instruction following, Q&A, Arabic-dominant tasks
- Next planned step: personality / role injection + instruction tuning
## Training Hyperparameters (final run)
- Optimizer: adamw_8bit
- Learning rate: 5e-5
- Batch size: 4 × 4 (per-device batch 4, gradient accumulation 4; effective 16)
- Steps: 600
- Warmup steps: 50
- Gradient checkpointing: yes
- Mixed precision: bf16
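The hyperparameters above can be expressed as a `transformers.TrainingArguments` configuration. This is a sketch under the assumption that the run used the HF `Trainer`; `output_dir` and `logging_steps` are illustrative, everything else is taken from the list above.

```python
# Sketch of a TrainingArguments setup matching the hyperparameters above
# (assumes transformers and bitsandbytes are installed; output_dir and
# logging_steps are illustrative, not from this card).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gpt2-fineweb-final",
    optim="adamw_8bit",                 # 8-bit AdamW via bitsandbytes
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size 16
    max_steps=600,
    warmup_steps=50,
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=50,
)
```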
## Results / Observations

- Loss decreased from ~3.89 to ~3.54 over 600 steps
- Visible improvement in fluency and topical coherence compared to vanilla GPT-2
- Still shows signs of repetition and old-web style (due to the age of the FineWeb snapshots)
## How to Cite / Reference
If you use this model:
```bibtex
@misc{gpt2-fineweb-machkour,
  author       = {machkour},
  title        = {GPT-2 continued on FineWeb (416M tokens)},
  year         = {2026},
  month        = {February},
  howpublished = {\url{https://huggingface.co/machkour/gpt2-fineweb-416M-tokens}},
}
```
After saving this file as README.md in your model folder, you can push everything with:

```python
from huggingface_hub import login, upload_folder

login()  # paste your HF token
upload_folder(
    folder_path="gpt2-fineweb-final",
    repo_id="machkour/gpt2-fineweb-416M-tokens",
    repo_type="model",
    commit_message="Upload fine-tuned GPT-2 on FineWeb",
)
```

(Install huggingface_hub if needed: `pip install huggingface_hub`)