kp7742 committed
Commit 034cb33 · verified · 1 Parent(s): 225f49f

Update README.md

Files changed (1): README.md (+78 -24)
README.md CHANGED
@@ -1,48 +1,102 @@
  ---
  library_name: transformers
  tags:
- - generated_from_trainer
- model-index:
- - name: YALM_130M
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # YALM_130M

- This model was trained from scratch on an unknown dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.006
- - train_batch_size: 8
- - eval_batch_size: 8
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  - lr_scheduler_type: warmup_stable_decay
  - lr_scheduler_warmup_steps: 4000
  - training_steps: 40000

  ### Framework versions

  - Transformers 4.56.2
- - Pytorch 2.7.1+cu128
  - Datasets 4.1.1
  - Tokenizers 0.22.1
  ---
  library_name: transformers
+ datasets:
+ - kp7742/YALM-pretrain6-62M
+ language:
+ - en
+ - hi
+ pipeline_tag: text-generation
  tags:
+ - pt
+ - yalm
  ---

+ # YALM-130M

+ YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

+ YALM-130M is the second model in this series. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

+ **Model Overview:**
+ - Architecture: Llama
+ - Pretraining steps: 40k
+ - Pretraining tokens: 42B
+ - Precision: bfloat16
+ - Number of Parameters: 130M
+ - Number of Parameters (Non-Embedding): 113M
+ - Number of Layers: 16
+ - Number of Attention Heads (GQA): 16 for Q and 2 for KV
+ - Context Length: 2048
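
For illustration, the overview above maps onto a Hugging Face `LlamaConfig` roughly as sketched below. Only the values listed in the overview come from the card; `hidden_size`, `intermediate_size`, and `vocab_size` are not stated, so the numbers used for them here are placeholders, not the model's actual configuration:

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    num_hidden_layers=16,          # Number of Layers: 16
    num_attention_heads=16,        # GQA: 16 query heads
    num_key_value_heads=2,         # GQA: 2 shared key/value heads
    max_position_embeddings=2048,  # Context Length: 2048
    hidden_size=768,               # assumed; not stated in the card
    intermediate_size=2048,        # assumed; not stated in the card
    vocab_size=32000,              # assumed; not stated in the card
    torch_dtype="bfloat16",        # Precision: bfloat16
)
model = LlamaForCausalLM(config)   # randomly initialized, for inspection only
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```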

+ ## Usage

+ ```python
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
+ >>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-130M")
+ >>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-130M")
+ >>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
+ >>> out = model.generate(**inputs, max_new_tokens=100)
+ >>> print(tokenizer.batch_decode(out))
+ ```
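
The same generation can also go through the `text-generation` pipeline (the task declared by the card's `pipeline_tag`), which folds tokenizer and model loading into one call; a minimal sketch using the model id above:

```python
from transformers import pipeline

# Loads kp7742/YALM-130M and generates a continuation in one step.
pipe = pipeline("text-generation", model="kp7742/YALM-130M")
print(pipe("Hey how are you doing?", max_new_tokens=100)[0]["generated_text"])
```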

+ ## Training

+ ### Data

+ This model is pre-trained on [YALM-pretrain6-62M](https://huggingface.co/datasets/kp7742/YALM-pretrain6-62M).

+ ### Hyperparameters

+ - learning_rate: 6e-3
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - distributed_type: multi-GPU DDP
+ - num_devices: 4
+ - gradient_accumulation_steps: 8
+ - total_train_batch_size: 512
+ - total_eval_batch_size: 64
+ - optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  - lr_scheduler_type: warmup_stable_decay
  - lr_scheduler_warmup_steps: 4000
  - training_steps: 40000
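
As a sanity check, these numbers are mutually consistent: 16 sequences per device × 4 devices × 8 accumulation steps gives the stated total batch of 512, and at the 2048-token context that works out to roughly 42B tokens over 40k steps, matching the pretraining token count in the overview:

```python
# Effective-batch and token-count arithmetic from the hyperparameters above.
per_device_batch = 16
num_devices = 4
grad_accum = 8
context_length = 2048
steps = 40_000

total_batch = per_device_batch * num_devices * grad_accum  # 512 sequences/step
tokens_per_step = total_batch * context_length             # 1,048,576 tokens/step
total_tokens = tokens_per_step * steps                     # ~41.9B tokens
print(f"{total_batch} seq/step, {total_tokens / 1e9:.1f}B tokens total")
```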

+ ### Hardware
+
+ - GPUs: 4 x RTX 5090
+
  ### Framework versions

  - Transformers 4.56.2
+ - PyTorch 2.8.0+cu128
  - Datasets 4.1.1
  - Tokenizers 0.22.1
+
+ ## Evaluation
+
+ All evaluations are zero-shot unless stated otherwise, and I used [lighteval](https://github.com/huggingface/lighteval) to run them.
+
+ The model achieves the following results on the test set:
+ - Loss: 2.46
+ - Perplexity: 11.66
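
These two figures are consistent, since perplexity is the exponential of the cross-entropy loss; with the loss reported to two decimals, exp(2.46) ≈ 11.7, which matches 11.66 up to rounding:

```python
import math

# Perplexity = exp(cross-entropy loss).
print(math.exp(2.46))  # ≈ 11.70; the card reports 11.66 from the unrounded loss
```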
+
+ ### Base pre-trained model
+
+ | Metric         | YALM-130M | YALM-80M |
+ |:---------------|:---------:|:--------:|
+ | MMLU (cloze)   | 27.98     | 27.33    |
+ | MMLU Pro       | 11.38     | 8.72     |
+ | BBH (5-shot)   | 11.59     | 12.61    |
+ | ARC (Average)  | 33.50     | 29.87    |
+ | HellaSwag      | 34.08     | 32.16    |
+ | PIQA           | 62.40     | 62.89    |
+ | SCIQ           | 70.00     | 69.50    |
+ | CommonsenseQA  | 28.75     | 28.75    |
+ | Winogrande     | 50.28     | 50.59    |
+ | OpenBookQA     | 31.00     | 29.60    |
+ | TruthfulQA     | 21.71     | 22.78    |
+ | TriviaQA       | 0.18      | 0.17     |
+ | GSM8K (5-shot) | 1.06      | 0.83     |
+
+ ## Limitations
+
+ YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but their world knowledge is limited, so the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.