---
library_name: transformers
datasets:
  - kp7742/YALM-pretrain5-60M
language:
  - en
  - hi
pipeline_tag: text-generation
tags:
  - pt
  - yalm
---

YALM-80M

YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

YALM-80M is the first model in this family. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Note: There is a bug in the tokenizer that may cause errors during generation for certain inputs.

Model Overview:

  • Architecture: Llama
  • Pretraining steps: 34k
  • Pretraining tokens: 36B
  • Precision: bfloat16
  • Number of Parameters: 79.7M
  • Number of Parameters (Non-Embedding): 62.9M
  • Number of Layers: 16
  • Number of Attention Heads (GQA): 8 for Q and 4 for KV
  • Context Length: 2048
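
The GQA layout above means each key/value head is shared by a group of query heads. A minimal sketch of that head-to-group mapping, assuming nothing beyond the 8 Q / 4 KV split stated above (the head dimension is not given in the card):

```python
# Sketch of the grouped-query attention (GQA) head mapping described above:
# 8 query heads share 4 key/value heads, so each KV head serves a group
# of 2 query heads.
num_q_heads = 8
num_kv_heads = 4
group_size = num_q_heads // num_kv_heads  # 2 query heads per KV head

# Query head q attends against KV head q // group_size.
kv_for_q = [q // group_size for q in range(num_q_heads)]
print(kv_for_q)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

This sharing is what shrinks the KV cache relative to full multi-head attention: only 4 KV heads are stored instead of 8.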

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-80M")
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

Training

Data

This model is pre-trained on the YALM-pretrain5-60M dataset.

Hyperparameters

  • learning_rate: 0.007812
  • train_batch_size: 16
  • eval_batch_size: 16
  • distributed_type: multi-GPU DDP
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 512
  • total_eval_batch_size: 128
  • optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_warmup_steps: 3400
  • training_steps: 34000
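
The effective batch size follows from the per-device settings, and together with the 2048-token context length it also accounts for the ~36B pretraining tokens stated in the overview. A quick sanity check of the figures above:

```python
# Sanity-check the effective batch size and total token count implied by
# the hyperparameters above (context length 2048 from the model overview).
train_batch_size = 16          # per device
num_devices = 8
gradient_accumulation_steps = 4
context_length = 2048
training_steps = 34_000

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)       # 512, matching the card

total_tokens = training_steps * total_train_batch_size * context_length
print(total_tokens / 1e9)           # ~35.7B, i.e. the ~36B reported above
```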

Hardware

  • GPUs: 8 x RTX 4090

Framework versions

  • Transformers 4.53.1
  • PyTorch 2.7.1+cu128
  • Datasets 3.6.0
  • Tokenizers 0.21.2

Evaluation

All evaluations are zero-shot unless stated otherwise, and I used lighteval to run them.

It achieves the following results on the test set:

  • Loss: 2.78
  • Perplexity: 16.10
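
Perplexity here is simply the exponential of the cross-entropy loss, so the two numbers above are consistent to within rounding of the reported loss:

```python
import math

# Perplexity is exp(cross-entropy loss). With the reported loss of 2.78
# this gives ~16.1, matching the reported perplexity of 16.10 to within
# rounding (the unrounded loss was presumably slightly below 2.78).
loss = 2.78
perplexity = math.exp(loss)
print(round(perplexity, 2))  # 16.12
```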

Base pre-trained model

Metric              YALM-80M
MMLU (cloze)           27.33
MMLU Pro                8.72
BBH (5-shot)           12.61
ARC (Average)          29.87
HellaSwag              32.16
PIQA                   62.89
SCIQ                   69.50
CommonsenseQA          28.75
Winogrande             50.59
OpenBookQA             29.60
TruthfulQA             22.78
TriviaQA                0.17
GSM8K (5-shot)          0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.