Architecture

Type: Decoder-only Transformer (GPT-style)
Layers: 12
Hidden size: 768
Attention heads: 12
Context length: 512
Parameters: ~100M
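
As a rough sanity check on the parameter count, the sketch below counts the weights of a GPT-2-style decoder from the hyperparameters listed above. It assumes details this card does not state: GPT-2-style blocks with biases, a 4x MLP expansion, learned position embeddings, and an output head tied to the token embeddings. The vocabulary size is also not given here, so `vocab_size` is a free parameter and the value 50257 (the GPT-2 tokenizer size) is used purely for illustration.

```python
def transformer_params(n_layers, d_model, n_ctx, vocab_size):
    """Parameter count for a GPT-2-style decoder (assumed layout, see lead-in).

    The number of attention heads does not change the count, because the
    heads split the same d_model-wide projections.
    """
    attn = 4 * d_model * d_model + 4 * d_model   # Q, K, V, output projections + biases
    mlp = 8 * d_model * d_model + 5 * d_model    # two linear layers (4x expansion) + biases
    norms = 4 * d_model                          # two LayerNorms per block (scale + shift)
    blocks = n_layers * (attn + mlp + norms)
    final_norm = 2 * d_model
    embeddings = (vocab_size + n_ctx) * d_model  # token + position tables, tied output head
    return blocks + final_norm + embeddings


# Weights that do not depend on the tokenizer or context length.
non_embedding = transformer_params(12, 768, n_ctx=0, vocab_size=0)

# Hypothetical vocabulary size; the real tokenizer may differ.
with_gpt2_vocab = transformer_params(12, 768, n_ctx=512, vocab_size=50257)

print(f"non-embedding parameters: {non_embedding:,}")
print(f"total with a 50257-token vocabulary: {with_gpt2_vocab:,}")
```

Under these assumptions the transformer blocks account for about 85M parameters, so the reported total of ~100M depends mostly on the vocabulary size.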

Training

Dataset: News articles (CNN/DailyMail – articles only)
Objective: Causal Language Modeling
Hardware: Google Colab GPU
Precision: FP16
Training steps: 2000
Optimizations: Gradient checkpointing, gradient accumulation
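
Gradient accumulation, listed above, can be illustrated with a toy example. This is a minimal sketch in plain Python, not the training code used for this model: it fits a one-parameter model `y ≈ w * x` and shows that averaging gradients over equal-sized micro-batches reproduces the full-batch gradient, which is what lets a small GPU emulate a larger batch. The data, the function names, and the micro-batch size are all hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches of `batch`."""
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total


# Toy data (hypothetical), roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

full_batch = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)

print(full_batch, accumulated)
```

In a real training loop the optimizer step is taken only after the last micro-batch, so memory use follows the micro-batch size while the update follows the effective batch size.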

Training Loss Curve

The training loss decreased steadily from approximately 9.1 to 5.3 over 2000 training steps, indicating stable from-scratch training of the ~100M-parameter language model. With so few steps, the loss had not necessarily plateaued.
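
To put the loss values in context, they can be converted to perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats (natural logarithm), which is the default in most frameworks but is not stated in this card.

```python
import math


def perplexity(cross_entropy_nats):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(cross_entropy_nats)


start_ppl = perplexity(9.1)  # loss at the start of training
end_ppl = perplexity(5.3)    # loss after 2000 steps

print(f"perplexity at start: {start_ppl:.0f}")
print(f"perplexity at end:   {end_ppl:.0f}")
```

Under that assumption, training perplexity falls from roughly 9000 to roughly 200, which is consistent with the limitations noted below.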

Intended Use

Research
Educational purposes
Text generation experiments

Limitations

Not instruction-tuned
Trained for limited steps
Outputs may be verbose or repetitive

Safetensors

Model size: 0.1B params
Tensor type: F32