|
|
--- |
|
|
library_name: transformers |
|
|
datasets: |
|
|
- kp7742/YALM-pretrain5-60M |
|
|
language: |
|
|
- en |
|
|
- hi |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- pt |
|
|
- yalm |
|
|
--- |
|
|
|
|
|
# YALM-80M |
|
|
|
|
|
YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.
|
|
|
|
|
YALM-80M is the first model in this family. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.
|
|
|
|
|
Note: There is a known bug in the tokenizer that may cause errors during generation for certain inputs.
|
|
|
|
|
**Model Overview:** |
|
|
- Architecture: Llama |
|
|
- Pretraining steps: 34k |
|
|
- Pretraining tokens: 36B |
|
|
- Precision: bfloat16 |
|
|
- Number of Parameters: 79.7M |
|
|
- Number of Parameters (Non-Embedding): 62.9M
|
|
- Number of Layers: 16 |
|
|
- Number of Attention Heads (GQA): 8 for Q and 4 for KV |
|
|
- Context Length: 2048 |
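
The overview above maps onto a standard Llama configuration. The following is a sketch for illustration only: the layer count, GQA head counts, and context length come from the list above, while `hidden_size`, `intermediate_size`, and `vocab_size` are assumptions that are not stated in this card.

```python
from transformers import LlamaConfig

# Sketch of a config matching the overview above; hidden_size,
# intermediate_size, and vocab_size are illustrative assumptions.
config = LlamaConfig(
    num_hidden_layers=16,          # Number of Layers
    num_attention_heads=8,         # Q heads (GQA)
    num_key_value_heads=4,         # KV heads (GQA)
    max_position_embeddings=2048,  # Context Length
    hidden_size=512,               # assumption
    intermediate_size=1408,        # assumption
    vocab_size=32000,              # assumption
)
```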
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
>>> from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M") |
|
|
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-80M") |
|
|
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt") |
|
|
>>> out = model.generate(**inputs, max_new_tokens=100) |
|
|
>>> print(tokenizer.batch_decode(out)) |
|
|
``` |
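
Since the checkpoint is stored in bfloat16 (see the overview above), it can also be loaded in that precision and moved to an accelerator. A minimal sketch, assuming a CUDA-capable machine:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bfloat16, the precision the model was trained in
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M")
model = AutoModelForCausalLM.from_pretrained(
    "kp7742/YALM-80M", torch_dtype=torch.bfloat16
).to(device)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```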
|
|
|
|
|
## Training |
|
|
|
|
|
### Data |
|
|
|
|
|
This model is pre-trained on [YALM-pretrain5-60M](https://huggingface.co/datasets/kp7742/YALM-pretrain5-60M).
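
The corpus can be loaded with the `datasets` library. A short sketch; the `train` split name is an assumption:

```python
from datasets import load_dataset

# Stream the pretraining corpus instead of downloading it in full;
# the "train" split name is an assumption.
ds = load_dataset("kp7742/YALM-pretrain5-60M", split="train", streaming=True)
print(next(iter(ds)))
```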
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
- learning_rate: 0.007812 |
|
|
- train_batch_size: 16 |
|
|
- eval_batch_size: 16 |
|
|
- distributed_type: multi-GPU DDP |
|
|
- num_devices: 8 |
|
|
- gradient_accumulation_steps: 4 |
|
|
- total_train_batch_size: 512 |
|
|
- total_eval_batch_size: 128 |
|
|
- optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08 |
|
|
- lr_scheduler_type: warmup_stable_decay |
|
|
- lr_scheduler_warmup_steps: 3400 |
|
|
- training_steps: 34000 |
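
The effective train batch size follows from the list: 16 per device × 8 GPUs × 4 accumulation steps = 512. Below is a sketch of how these settings might be wired up with `torch` and `transformers`, assuming `model` is loaded as in the Usage section; only the warmup (3,400) and total (34,000) steps are stated in this card, so the stable/decay split is an assumption.

```python
import torch
from transformers import get_wsd_schedule

# Effective train batch size: 16 per device x 8 GPUs x 4 accumulation steps
assert 16 * 8 * 4 == 512

optimizer = torch.optim.AdamW(
    model.parameters(), lr=0.007812, betas=(0.9, 0.95), eps=1e-8
)

# warmup_stable_decay schedule (WSD); the split below is an assumption
scheduler = get_wsd_schedule(
    optimizer,
    num_warmup_steps=3400,
    num_stable_steps=27200,  # assumption
    num_decay_steps=3400,    # assumption: 34000 - 3400 - 27200
)
```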
|
|
|
|
|
### Hardware |
|
|
|
|
|
- GPUs: 8 x RTX 4090 |
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.53.1 |
|
|
- PyTorch 2.7.1+cu128
|
|
- Datasets 3.6.0 |
|
|
- Tokenizers 0.21.2 |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
All evaluations are zero-shot unless stated otherwise, and I used [lighteval](https://github.com/huggingface/lighteval) to run them. |
|
|
|
|
|
The model achieves the following results on the test set:
|
|
- Loss: 2.78 |
|
|
- Perplexity: 16.10 |
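
The two figures are consistent, since perplexity is the exponential of the cross-entropy loss:

```python
import math

# exp(2.78) ≈ 16.12, matching the reported perplexity of 16.10
print(math.exp(2.78))
```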
|
|
|
|
|
### Base pre-trained model
|
|
|
|
|
| Metrics | YALM-80M | |
|
|
|:-------------------|:------------:| |
|
|
| MMLU (cloze) | 27.33 | |
|
|
| MMLU Pro | 8.72 | |
|
|
| BBH (5-shot) | 12.61 | |
|
|
| ARC (Average) | 29.87 | |
|
|
| HellaSwag | 32.16 | |
|
|
| PIQA | 62.89 | |
|
|
| SCIQ | 69.50 | |
|
|
| CommonsenseQA | 28.75 | |
|
|
| Winogrande | 50.59 | |
|
|
| OpenBookQA | 29.60 | |
|
|
| TruthfulQA | 22.78 | |
|
|
| TriviaQA | 0.17 | |
|
|
| GSM8K (5-shot) | 0.83 | |
|
|
|
|
|
## Limitations |
|
|
|
|
|
YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.
|
|
|