---
library_name: transformers
datasets:
  - kp7742/YALM-pretrain6-62M
language:
  - en
  - hi
pipeline_tag: text-generation
tags:
  - pt
  - yalm
---

YALM-130M

YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

YALM-130M is the second model in this series. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Model Overview (a config-inspection sketch follows the list):

  • Architecture: Llama
  • Pretraining steps: 40k
  • Pretraining tokens: 42B
  • Precision: bfloat16
  • Number of Parameters: 130M
  • Number of Parameters (Non-Embedding): 113M
  • Number of Layers: 16
  • Number of Attention Heads (GQA): 16 for Q and 2 for KV
  • Context Length: 2048
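
These values can be checked against the config shipped with the checkpoint. A minimal sketch, assuming the standard Llama config fields; the non-embedding count assumes tied input/output embeddings:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("kp7742/YALM-130M")
print(config.num_hidden_layers)        # 16 layers
print(config.num_attention_heads)      # 16 query heads
print(config.num_key_value_heads)      # 2 key/value heads (GQA)
print(config.max_position_embeddings)  # 2048 context length

model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
total = sum(p.numel() for p in model.parameters())
embed = model.get_input_embeddings().weight.numel()
# assumes tied embeddings; if the lm_head is untied, subtract it as well
print(f"total: {total / 1e6:.0f}M, non-embedding: {(total - embed) / 1e6:.0f}M")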

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
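
Greedy decoding from a small base model tends to loop; a minimal sampling sketch (the generate() arguments below are standard Transformers options, not settings recommended by this card):

>>> out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])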

Training

Data

This model is pre-trained on the kp7742/YALM-pretrain6-62M dataset.
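
A minimal sketch of pulling the corpus from the Hub with the datasets library, assuming the default configuration and a train split; the exact preprocessing used for pretraining is not described here:

from datasets import load_dataset

ds = load_dataset("kp7742/YALM-pretrain6-62M", split="train", streaming=True)
print(next(iter(ds)))  # inspect one raw example; column names are not documented in this card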

Hyperparameters

  • learning_rate: 6e-3
  • train_batch_size: 16
  • eval_batch_size: 16
  • distributed_type: multi-GPU DDP
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 512
  • total_eval_batch_size: 64
  • optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_warmup_steps: 4000
  • training_steps: 40000
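
A minimal sketch of the optimizer and schedule these values imply, assuming plain PyTorch AdamW and the warmup-stable-decay helper from Transformers; the length of the decay phase is an assumption, since the card only states the warmup and total steps:

import torch
from transformers import AutoModelForCausalLM, get_wsd_schedule

model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-3, betas=(0.9, 0.95), eps=1e-8)

total_steps, warmup_steps = 40_000, 4_000
decay_steps = 4_000  # assumption: not stated in the card
scheduler = get_wsd_schedule(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_stable_steps=total_steps - warmup_steps - decay_steps,
    num_decay_steps=decay_steps,
)

# Effective batch size: 16 per device x 4 GPUs x 8 gradient-accumulation steps = 512 sequences.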

Hardware

  • GPUs: 4 x RTX 5090

Framework versions

  • Transformers 4.56.2
  • Pytorch 2.8.0+cu128
  • Datasets 4.1.1
  • Tokenizers 0.22.1

Evaluation

All evaluations are zero-shot unless stated otherwise, and I used lighteval to run them.

The model achieves the following results on the test set:

  • Loss: 2.46
  • Perplexity: 11.66 (the exponential of the test loss: exp(2.46) ≈ 11.7)

Base pre-trained model

Metric              YALM-130M   YALM-80M
MMLU (cloze)            27.98      27.33
MMLU Pro                11.38       8.72
BBH (5-shot)            11.59      12.61
ARC (Average)           33.50      29.87
HellaSwag               34.08      32.16
PIQA                    62.40      62.89
SCIQ                    70.00      69.50
CommonsenseQA           28.75      28.75
Winogrande              50.28      50.59
OpenBookQA              31.00      29.60
TruthfulQA              21.71      22.78
TriviaQA                 0.18       0.17
GSM8K (5-shot)           1.06       0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.