Update README.md
README.md

---
library_name: transformers
datasets:
- kp7742/YALM-pretrain6-62M
language:
- en
- hi
pipeline_tag: text-generation
tags:
- pt
- yalm
---

# YALM-130M

YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

YALM-130M is the second model in the series. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

**Model Overview:**
- Architecture: Llama
- Pretraining steps: 40k
- Pretraining tokens: 42B
- Precision: bfloat16
- Number of Parameters: 130M
- Number of Parameters (Non-Embedding): 113M
- Number of Layers: 16
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: 2048
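
These architecture details can be cross-checked against the published configuration without downloading any weights. A minimal sketch, assuming the standard Llama-style field names that `transformers` exposes (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `max_position_embeddings`):

```python
from transformers import AutoConfig

# Load only the model configuration from the Hub (no weights are downloaded).
config = AutoConfig.from_pretrained("kp7742/YALM-130M")

# Standard Llama-style fields; the expected values follow the overview above.
print(config.model_type)               # expected: "llama"
print(config.num_hidden_layers)        # expected: 16
print(config.num_attention_heads)      # expected: 16 (query heads)
print(config.num_key_value_heads)      # expected: 2 (GQA key/value heads)
print(config.max_position_embeddings)  # expected: 2048 (context length)
```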

## Usage

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")

>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
```
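
Greedy decoding (the default in the snippet above) can get repetitive with small base models. A hedged variant using the standard sampling arguments of `generate`; the temperature and top-p values are illustrative assumptions, not settings tuned for this model:

```python
>>> out = model.generate(
...     **inputs,
...     max_new_tokens=100,
...     do_sample=True,   # sample instead of greedy decoding
...     temperature=0.8,  # illustrative value, not tuned for this model
...     top_p=0.95,       # nucleus sampling cutoff
... )
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```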

## Training

### Data

This model is pre-trained on [YALM-pretrain6-62M](https://huggingface.co/datasets/kp7742/YALM-pretrain6-62M).
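
To peek at the corpus without downloading it in full, the dataset can be streamed. A minimal sketch, assuming the usual `train` split; the column layout is not documented here, so it simply prints whatever fields the first record carries:

```python
from datasets import load_dataset

# Stream the corpus so nothing is materialized on disk.
ds = load_dataset("kp7742/YALM-pretrain6-62M", split="train", streaming=True)

# Peek at the first record to see which fields are available.
first = next(iter(ds))
print(first.keys())
```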

### Hyperparameters

- learning_rate: 6e-3
- train_batch_size: 16
- eval_batch_size: 16
- distributed_type: multi-GPU DDP
- num_devices: 4
- gradient_accumulation_steps: 8
- total_train_batch_size: 512
- total_eval_batch_size: 64
- optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
- lr_scheduler_type: warmup_stable_decay
- lr_scheduler_warmup_steps: 4000
- training_steps: 40000
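
The training script itself is not published here; as a rough guide, this is how the hyperparameters above would map onto `transformers.TrainingArguments`. The `output_dir` is a placeholder, and the decay-phase settings of the warmup-stable-decay schedule are not listed in this card:

```python
from transformers import TrainingArguments

# Per-device batch size 16 x 4 GPUs x 8 accumulation steps = 512 effective batch size.
args = TrainingArguments(
    output_dir="yalm-130m",                  # placeholder path
    learning_rate=6e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=8,
    max_steps=40_000,
    lr_scheduler_type="warmup_stable_decay",
    warmup_steps=4_000,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    bf16=True,                               # matches the bfloat16 precision listed above
)
# Note: the length of the WSD decay phase is configured separately
# (via lr_scheduler_kwargs in recent transformers); the value used for this run is not stated.
```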

### Hardware

- GPUs: 4 x RTX 5090

### Framework versions

- Transformers 4.56.2
- Pytorch 2.8.0+cu128
- Datasets 4.1.1
- Tokenizers 0.22.1

## Evaluation

All evaluations are zero-shot unless stated otherwise, and I used [lighteval](https://github.com/huggingface/lighteval) to run them.

The model achieves the following results on the test set:

- Loss: 2.46
- Perplexity: 11.66
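
Perplexity here is the exponential of the cross-entropy loss. A minimal sketch of computing the same quantity on an arbitrary piece of text; it will not reproduce the number above, which was measured on the held-out test split:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
model.eval()

text = "The capital of India is New Delhi."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels yields the mean causal-LM loss over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"loss = {loss.item():.2f}, perplexity = {torch.exp(loss).item():.2f}")
```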

## Base pre-trained model

| Metrics             | YALM-130M | YALM-80M |
|:--------------------|:---------:|:--------:|
| MMLU (cloze)        |   27.98   |  27.33   |
| MMLU Pro            |   11.38   |   8.72   |
| BBH (5-shot)        |   11.59   |  12.61   |
| ARC (Average)       |   33.50   |  29.87   |
| HellaSwag           |   34.08   |  32.16   |
| PIQA                |   62.40   |  62.89   |
| SCIQ                |   70.00   |  69.50   |
| CommonsenseQA       |   28.75   |  28.75   |
| Winogrande          |   50.28   |  50.59   |
| OpenBookQA          |   31.00   |  29.60   |
| TruthfulQA          |   21.71   |  22.78   |
| TriviaQA            |    0.18   |   0.17   |
| GSM8K (5-shot)      |    1.06   |   0.83   |

## Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.