---
library_name: transformers
datasets:
  - kp7742/YALM-pretrain5-60M
language:
  - en
  - hi
pipeline_tag: text-generation
tags:
  - pt
  - yalm
---

YALM-80M

YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

YALM-80M is the first model in this family. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Note: There is a bug in the tokenizer that may cause errors during generation for certain inputs.

Model Overview:

  • Architecture: Llama
  • Pretraining steps: 34k
  • Pretraining tokens: 36B
  • Precision: bfloat16
  • Number of Parameters: 79.7M
  • Number of Parameters (Non-Embedding): 62.9M
  • Number of Layers: 16
  • Number of Attention Heads (GQA): 8 for Q and 4 for KV
  • Context Length: 2048
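
The GQA layout above means each key/value head is shared by a group of query heads. A minimal sketch of that head-to-group mapping, assuming nothing beyond the 8 Q / 4 KV split stated above (the head dimension is not given in the card):

```python
# Sketch of the grouped-query attention (GQA) head mapping described above:
# 8 query heads share 4 key/value heads, so each KV head serves a group
# of 2 query heads.
num_q_heads = 8
num_kv_heads = 4
group_size = num_q_heads // num_kv_heads  # 2 query heads per KV head

# Query head q attends against KV head q // group_size.
kv_for_q = [q // group_size for q in range(num_q_heads)]
print(kv_for_q)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

This sharing is what shrinks the KV cache relative to full multi-head attention: only 4 KV heads are stored instead of 8.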

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-80M")
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

Training

Data

This model is pre-trained on the YALM-pretrain5-60M dataset.

Hyperparameters

  • learning_rate: 0.007812
  • train_batch_size: 16
  • eval_batch_size: 16
  • distributed_type: multi-GPU DDP
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 512
  • total_eval_batch_size: 128
  • optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_warmup_steps: 3400
  • training_steps: 34000
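
The effective batch size follows from the per-device settings, and together with the 2048-token context length it also accounts for the ~36B pretraining tokens stated in the overview. A quick sanity check of the figures above:

```python
# Sanity-check the effective batch size and total token count implied by
# the hyperparameters above (context length 2048 from the model overview).
train_batch_size = 16          # per device
num_devices = 8
gradient_accumulation_steps = 4
context_length = 2048
training_steps = 34_000

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)       # 512, matching the card

total_tokens = training_steps * total_train_batch_size * context_length
print(total_tokens / 1e9)           # ~35.7B, i.e. the ~36B reported above
```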

Hardware

  • GPUs: 8 x RTX 4090

Framework versions

  • Transformers 4.53.1
  • PyTorch 2.7.1+cu128
  • Datasets 3.6.0
  • Tokenizers 0.21.2

Evaluation

All evaluations are zero-shot unless stated otherwise, and I used lighteval to run them.

It achieves the following results on the test set:

  • Loss: 2.78
  • Perplexity: 16.10
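
Perplexity here is simply the exponential of the cross-entropy loss, so the two numbers above are consistent to within rounding of the reported loss:

```python
import math

# Perplexity is exp(cross-entropy loss). With the reported loss of 2.78
# this gives ~16.1, matching the reported perplexity of 16.10 to within
# rounding (the unrounded loss was presumably slightly below 2.78).
loss = 2.78
perplexity = math.exp(loss)
print(round(perplexity, 2))  # 16.12
```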

Base pre-trained model

Metric              YALM-80M
MMLU (cloze)           27.33
MMLU Pro                8.72
BBH (5-shot)           12.61
ARC (Average)          29.87
HellaSwag              32.16
PIQA                   62.89
SCIQ                   69.50
CommonsenseQA          28.75
Winogrande             50.59
OpenBookQA             29.60
TruthfulQA             22.78
TriviaQA                0.17
GSM8K (5-shot)          0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.