|
|
--- |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- kp7742/YALM-pretrain5-60M |
|
|
language: |
|
|
- en |
|
|
- hi |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- pt |
|
|
- yalm |
|
|
--- |
|
|
|
|
|
# YALM-80M |
|
|
|
|
|
YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.
|
|
|
|
|
YALM-80M is the first model in this family. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.
|
|
|
|
|
Note: There is a known bug in the tokenizer that may cause errors during generation for certain inputs.
|
|
|
|
|
**Model Overview:** |
|
|
- Architecture: Llama |
|
|
- Pretraining steps: 34k |
|
|
- Pretraining tokens: 36B |
|
|
- Precision: bfloat16 |
|
|
- Number of Parameters: 79.7M |
|
|
- Number of Parameters (Non-Embedding): 62.9M
|
|
- Number of Layers: 16 |
|
|
- Number of Attention Heads (GQA): 8 for Q and 4 for KV |
|
|
- Context Length: 2048 |
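
The overview above maps onto a standard Llama configuration. The following is a sketch for illustration only: the layer count, GQA head counts, and context length come from the list above, while `hidden_size`, `intermediate_size`, and `vocab_size` are assumptions that are not stated in this card.

```python
from transformers import LlamaConfig

# Sketch of a config matching the overview above; hidden_size,
# intermediate_size, and vocab_size are illustrative assumptions.
config = LlamaConfig(
    num_hidden_layers=16,          # Number of Layers
    num_attention_heads=8,         # Q heads (GQA)
    num_key_value_heads=4,         # KV heads (GQA)
    max_position_embeddings=2048,  # Context Length
    hidden_size=512,               # assumption
    intermediate_size=1408,        # assumption
    vocab_size=32000,              # assumption
)
```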
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
>>> from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M") |
|
|
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-80M") |
|
|
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt") |
|
|
>>> out = model.generate(**inputs, max_new_tokens=100) |
|
|
>>> print(tokenizer.batch_decode(out)) |
|
|
``` |
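
Since the checkpoint is stored in bfloat16 (see the overview above), it can also be loaded in that precision and moved to an accelerator. A minimal sketch, assuming a CUDA-capable machine:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bfloat16, the precision the model was trained in
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M")
model = AutoModelForCausalLM.from_pretrained(
    "kp7742/YALM-80M", torch_dtype=torch.bfloat16
).to(device)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```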
|
|
|
|
|
## Training |
|
|
|
|
|
### Data |
|
|
|
|
|
This model is pre-trained on [YALM-pretrain5-60M](https://huggingface.co/datasets/kp7742/YALM-pretrain5-60M).
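
The corpus can be loaded with the `datasets` library. A short sketch; the `train` split name is an assumption:

```python
from datasets import load_dataset

# Stream the pretraining corpus instead of downloading it in full;
# the "train" split name is an assumption.
ds = load_dataset("kp7742/YALM-pretrain5-60M", split="train", streaming=True)
print(next(iter(ds)))
```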
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
- learning_rate: 0.007812 |
|
|
- train_batch_size: 16 |
|
|
- eval_batch_size: 16 |
|
|
- distributed_type: multi-GPU DDP |
|
|
- num_devices: 8 |
|
|
- gradient_accumulation_steps: 4 |
|
|
- total_train_batch_size: 512 |
|
|
- total_eval_batch_size: 128 |
|
|
- optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08 |
|
|
- lr_scheduler_type: warmup_stable_decay |
|
|
- lr_scheduler_warmup_steps: 3400 |
|
|
- training_steps: 34000 |
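
The effective train batch size follows from the list: 16 per device × 8 GPUs × 4 accumulation steps = 512. Below is a sketch of how these settings might be wired up with `torch` and `transformers`, assuming `model` is loaded as in the Usage section; only the warmup (3,400) and total (34,000) steps are stated in this card, so the stable/decay split is an assumption.

```python
import torch
from transformers import get_wsd_schedule

# Effective train batch size: 16 per device x 8 GPUs x 4 accumulation steps
assert 16 * 8 * 4 == 512

optimizer = torch.optim.AdamW(
    model.parameters(), lr=0.007812, betas=(0.9, 0.95), eps=1e-8
)

# warmup_stable_decay schedule (WSD); the split below is an assumption
scheduler = get_wsd_schedule(
    optimizer,
    num_warmup_steps=3400,
    num_stable_steps=27200,  # assumption
    num_decay_steps=3400,    # assumption: 34000 - 3400 - 27200
)
```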
|
|
|
|
|
### Hardware |
|
|
|
|
|
- GPUs: 8 x RTX 4090 |
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.53.1 |
|
|
- PyTorch 2.7.1+cu128
|
|
- Datasets 3.6.0 |
|
|
- Tokenizers 0.21.2 |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
All evaluations are zero-shot unless stated otherwise, and I used [lighteval](https://github.com/huggingface/lighteval) to run them. |
|
|
|
|
|
The model achieves the following results on the test set:
|
|
- Loss: 2.78 |
|
|
- Perplexity: 16.10 |
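
The two figures are consistent, since perplexity is the exponential of the cross-entropy loss:

```python
import math

# exp(2.78) ≈ 16.12, matching the reported perplexity of 16.10
print(math.exp(2.78))
```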
|
|
|
|
|
### Base pre-trained model
|
|
|
|
|
| Metrics | YALM-80M | |
|
|
|:-------------------|:------------:| |
|
|
| MMLU (cloze) | 27.33 | |
|
|
| MMLU Pro | 8.72 | |
|
|
| BBH (5-shot) | 12.61 | |
|
|
| ARC (Average) | 29.87 | |
|
|
| HellaSwag | 32.16 | |
|
|
| PIQA | 62.89 | |
|
|
| SCIQ | 69.50 | |
|
|
| CommonsenseQA | 28.75 | |
|
|
| Winogrande | 50.59 | |
|
|
| OpenBookQA | 29.60 | |
|
|
| TruthfulQA | 22.78 | |
|
|
| TriviaQA | 0.17 | |
|
|
| GSM8K (5-shot) | 0.83 | |
|
|
|
|
|
## Limitations |
|
|
|
|
|
YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.
|
|
|