YALM-130M

YALM (Yet Another Language Model) is a family of experimental small language models developed as part of my ongoing exploration of language modeling and LLM architectures.

YALM-130M is the second model in this series. It was trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Model Overview

Architecture: Llama
Pretraining steps: 40k
Pretraining tokens: 42B
Precision: bfloat16
Number of Parameters: 130M
Number of Parameters (Non-Embedding): 113M
Number of Layers: 16
Number of Attention Heads (GQA): 16 for Q and 2 for KV
Context Length: 2048
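
To make the GQA line above concrete: with 16 query heads and 2 key/value heads, each KV head is shared by 8 query heads. The following is a minimal sketch of that mapping in plain Python. It assumes the consecutive grouping used by the Llama implementation in Transformers, and the function name is hypothetical, not part of any library.

```python
NUM_Q_HEADS = 16   # query heads, from the model overview
NUM_KV_HEADS = 2   # key/value heads, from the model overview

def kv_head_for_query_head(q_head: int) -> int:
    """Return the index of the KV head that a given query head shares.

    Assumption: query heads are grouped consecutively, so heads 0-7
    use KV head 0 and heads 8-15 use KV head 1.
    """
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 8 query heads per KV head
    return q_head // group_size

# One entry per query head: which KV head it reads keys and values from.
mapping = [kv_head_for_query_head(h) for h in range(NUM_Q_HEADS)]
print(mapping)
```

Sharing KV heads this way shrinks the key/value cache by a factor of 8 compared with giving every query head its own KV head.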

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))

Training

Data

The model was pre-trained on the YALM-pretrain6-62M dataset.

Hyperparameters

learning_rate: 6e-3
train_batch_size: 16
eval_batch_size: 16
distributed_type: multi-GPU DDP
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 512
total_eval_batch_size: 64
optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
lr_scheduler_type: warmup_stable_decay
lr_scheduler_warmup_steps: 4000
training_steps: 40000
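
As a sanity check, the effective batch size and the pretraining token count can be recomputed from the values listed above. This is a rough sketch, assuming every training sequence is packed to the full 2048-token context length; the card does not state this explicitly.

```python
# Values taken from the hyperparameter list and model overview.
per_device_batch = 16     # train_batch_size
num_devices = 4
grad_accum_steps = 8
context_length = 2048     # assumption: every sequence fills the context
training_steps = 40_000

# Sequences consumed per optimizer step across all GPUs.
sequences_per_step = per_device_batch * num_devices * grad_accum_steps
# Tokens consumed per optimizer step.
tokens_per_step = sequences_per_step * context_length
# Tokens consumed over the whole run.
total_tokens = tokens_per_step * training_steps

print(sequences_per_step)  # 512
print(tokens_per_step)     # 1048576
print(total_tokens)        # 41943040000
```

Under that assumption the run covers roughly 41.9B tokens, which matches the 42B figure in the overview after rounding.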

Hardware

GPUs: 4 x RTX 5090

Framework versions

Transformers 4.56.2
Pytorch 2.8.0+cu128
Datasets 4.1.1
Tokenizers 0.22.1

Evaluation

All evaluations are zero-shot unless stated otherwise, and I used lighteval to run them.

YALM-130M achieves the following results on the test set:

Loss: 2.46
Perplexity: 11.66
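
The two numbers are related by perplexity = exp(loss). The short check below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention but is not stated in the card.

```python
import math

reported_loss = 2.46         # from the card, rounded to two decimals
reported_perplexity = 11.66  # from the card

# Perplexity implied by the rounded loss, and loss implied by the perplexity.
perplexity_from_loss = math.exp(reported_loss)
loss_from_perplexity = math.log(reported_perplexity)

print(f"{perplexity_from_loss:.2f}")
print(f"{loss_from_perplexity:.3f}")
```

The rounded loss gives a perplexity of about 11.70, while a perplexity of 11.66 corresponds to a loss of about 2.456, so the small gap is consistent with the loss having been rounded to two decimals.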

Base pre-trained model

Metrics	YALM-130M	YALM-80M
MMLU (cloze)	27.98	27.33
MMLU Pro	11.38	8.72
BBH (5-shot)	11.59	12.61
ARC (Average)	33.50	29.87
HellaSwag	34.08	32.16
PIQA	62.40	62.89
SCIQ	70.00	69.50
CommonsenseQA	28.75	28.75
Winogrande	50.28	50.59
OpenBookQA	31.00	29.60
TruthfulQA	21.71	22.78
TriviaQA	0.18	0.17
GSM8K (5-shot)	1.06	0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.
