---
library_name: transformers
datasets:
  - kp7742/YALM-pretrain6-62M
language:
  - en
  - hi
pipeline_tag: text-generation
tags:
  - pt
  - yalm
---

YALM-130M

YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

YALM-130M is the second model in this series. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Model Overview (a config-inspection sketch follows the list):

  • Architecture: Llama
  • Pretraining steps: 40k
  • Pretraining tokens: 42B
  • Precision: bfloat16
  • Number of Parameters: 130M
  • Number of Parameters (Non-Embedding): 113M
  • Number of Layers: 16
  • Number of Attention Heads (GQA): 16 for Q and 2 for KV
  • Context Length: 2048
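
These values can be checked against the config shipped with the checkpoint. A minimal sketch, assuming the standard Llama config fields; the non-embedding count assumes tied input/output embeddings:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("kp7742/YALM-130M")
print(config.num_hidden_layers)        # 16 layers
print(config.num_attention_heads)      # 16 query heads
print(config.num_key_value_heads)      # 2 key/value heads (GQA)
print(config.max_position_embeddings)  # 2048 context length

model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
total = sum(p.numel() for p in model.parameters())
embed = model.get_input_embeddings().weight.numel()
# assumes tied embeddings; if the lm_head is untied, subtract it as well
print(f"total: {total / 1e6:.0f}M, non-embedding: {(total - embed) / 1e6:.0f}M")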

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
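
Greedy decoding from a small base model tends to loop; a minimal sampling sketch (the generate() arguments below are standard Transformers options, not settings recommended by this card):

>>> out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])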

Training

Data

This model is pre-trained on the kp7742/YALM-pretrain6-62M dataset.
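
A minimal sketch of pulling the corpus from the Hub with the datasets library, assuming the default configuration and a train split; the exact preprocessing used for pretraining is not described here:

from datasets import load_dataset

ds = load_dataset("kp7742/YALM-pretrain6-62M", split="train", streaming=True)
print(next(iter(ds)))  # inspect one raw example; column names are not documented in this card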

Hyperparameters

  • learning_rate: 6e-3
  • train_batch_size: 16
  • eval_batch_size: 16
  • distributed_type: multi-GPU DDP
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 512
  • total_eval_batch_size: 64
  • optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_warmup_steps: 4000
  • training_steps: 40000
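
A minimal sketch of the optimizer and schedule these values imply, assuming plain PyTorch AdamW and the warmup-stable-decay helper from Transformers; the length of the decay phase is an assumption, since the card only states the warmup and total steps:

import torch
from transformers import AutoModelForCausalLM, get_wsd_schedule

model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-3, betas=(0.9, 0.95), eps=1e-8)

total_steps, warmup_steps = 40_000, 4_000
decay_steps = 4_000  # assumption: not stated in the card
scheduler = get_wsd_schedule(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_stable_steps=total_steps - warmup_steps - decay_steps,
    num_decay_steps=decay_steps,
)

# Effective batch size: 16 per device x 4 GPUs x 8 gradient-accumulation steps = 512 sequences.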

Hardware

  • GPUs: 4 x RTX 5090

Framework versions

  • Transformers 4.56.2
  • Pytorch 2.8.0+cu128
  • Datasets 4.1.1
  • Tokenizers 0.22.1

Evaluation

All evaluations are zero-shot unless stated otherwise, and I used lighteval to run them.

The model achieves the following results on the test set:

  • Loss: 2.46
  • Perplexity: 11.66 (the exponential of the test loss: exp(2.46) ≈ 11.7)

Base pre-trained model

Metric              YALM-130M   YALM-80M
MMLU (cloze)            27.98      27.33
MMLU Pro                11.38       8.72
BBH (5-shot)            11.59      12.61
ARC (Average)           33.50      29.87
HellaSwag               34.08      32.16
PIQA                    62.40      62.89
SCIQ                    70.00      69.50
CommonsenseQA           28.75      28.75
Winogrande              50.28      50.59
OpenBookQA              31.00      29.60
TruthfulQA              21.71      22.78
TriviaQA                 0.18       0.17
GSM8K (5-shot)           1.06       0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.