nameissakthi committed on
Commit
7d4d882
·
0 Parent(s):

Initial commit: PebbleLM-117M base model

Browse files
Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +155 -0
  3. config.json +20 -0
  4. model.pt +3 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +9 -0
.gitattributes ADDED
*.pt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: mit
language:
- en
tags:
- text-generation
- pytorch
- small-language-model
- edge-deployment
- from-scratch
datasets:
- wikipedia
- openwebtext
- roneneldan/TinyStories
pipeline_tag: text-generation
---

# PebbleLM-117M

A 117.5M-parameter language model trained from scratch. Small but solid: designed for edge deployment and educational use.

## Model Description

PebbleLM-117M is a decoder-only transformer trained on a diverse text corpus. Despite its small size, it demonstrates basic language understanding and generation capabilities.

| Property | Value |
|----------|-------|
| Parameters | 117.5M |
| Architecture | Decoder-only Transformer |
| Layers | 8 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Context Length | 1024 tokens |
| Vocabulary | 16,384 BPE tokens |
| Position Encoding | RoPE |
| Normalization | RMSNorm |
| Activation | GELU |

## Training Data

Pretrained on ~1.18M samples from diverse sources:

| Dataset | Samples | Description | Link |
|---------|---------|-------------|------|
| Wikipedia | 488,906 | Encyclopedic knowledge | [wikipedia](https://huggingface.co/datasets/wikipedia) |
| OpenWebText | 500,000 | Diverse web content | [openwebtext](https://huggingface.co/datasets/openwebtext) |
| TinyStories | 188,067 | Simple narrative structure | [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) |
| **Total** | **1,176,973** | | |

## Training Details

```yaml
Epochs: 3
Batch Size: 48
Gradient Accumulation: 2
Effective Batch Size: 96
Learning Rate: 3e-4
Warmup Ratio: 0.1
Precision: FP16
Hardware: NVIDIA A100 80GB
Training Time: ~4.5 hours
```

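The effective batch size of 96 comes from 48 samples per micro-batch times 2 accumulation steps. A minimal sketch of that gradient-accumulation pattern (the model, optimizer, and data here are stand-ins, not the actual training code):

```python
import torch

# Stand-in model; the real PebbleLM training code lives in the linked repo.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

ACCUM_STEPS = 2  # gradients from 2 micro-batches are summed before each update
batches = [(torch.randn(48, 8), torch.randn(48, 1)) for _ in range(4)]

for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale so the accumulated gradient equals the mean over the effective batch.
    (loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # max_grad_norm from config.json
        optimizer.step()
        optimizer.zero_grad()
```

Each optimizer update thus reflects 48 × 2 = 96 samples, matching the "Effective Batch Size" row above.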
## Benchmark Results

Evaluated on 500 samples per benchmark:

| Benchmark | Accuracy | Random Baseline | Above Random |
|-----------|----------|-----------------|--------------|
| HellaSwag | 32.20% | 25% | +7.2% |
| ARC-Easy | 35.80% | 25% | +10.8% |
| WinoGrande | 52.80% | 50% | +2.8% |
| PIQA | 58.20% | 50% | +8.2% |
| **Average** | **44.75%** | - | - |

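These benchmarks are multiple-choice tasks, conventionally scored by picking the option to which the model assigns the highest length-normalized log-likelihood; accuracy above the random baseline indicates real signal. A sketch of that scoring loop, with a deterministic dummy scorer standing in for the model (the exact evaluation harness used here is not specified in this card):

```python
def option_logprob(context: str, option: str) -> float:
    """Stand-in scorer: a real eval sums token log-probs of `option` given `context`."""
    # Deterministic dummy so the loop is runnable without a model.
    return -sum(ord(c) for c in option) * 0.01

def pick_answer(context: str, options: list[str]) -> int:
    # Length-normalize so longer options are not penalized merely for length.
    scores = [option_logprob(context, o) / max(len(o), 1) for o in options]
    return max(range(len(options)), key=lambda i: scores[i])

# Toy item mimicking a HellaSwag-style completion choice.
ctx = "She picked up the pen and"
options = ["wrote a note.", "flew to the moon on it."]
pred = pick_answer(ctx, options)
```

Accuracy is then the fraction of items where `pred` matches the gold label; with 4 options (HellaSwag, ARC-Easy) random guessing gives 25%, with 2 options (WinoGrande, PIQA) it gives 50%.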
## Usage

```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("nameissakthi/PebbleLM-117M")

# Load model (custom architecture)
# See https://github.com/nameissakthi/slm-qualcomm for model code
```

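Because the architecture is custom, `model.pt` cannot be loaded with `AutoModel`; the usual pattern is to instantiate the model class from the linked repo and restore the checkpoint with `torch.load`. A minimal sketch of that round trip, using a stand-in module (the real class name and checkpoint layout are assumptions, and `model.pt` is assumed to hold a plain state dict):

```python
import io
import torch

# Stand-in for the real PebbleLM module defined in the linked GitHub repo.
class StandInModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.proj(x)

# Save/load round trip; the same torch.load call would apply to model.pt.
model = StandInModel()
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = StandInModel()
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()  # inference mode: disables dropout, etc.
```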
### For Chat/Q&A Use

See the fine-tuned version: [PebbleLM-117M-Chat](https://huggingface.co/nameissakthi/PebbleLM-117M-Chat)

## Intended Use

**Appropriate for:**
- Edge deployment experiments
- Educational purposes (learning transformer architecture)
- Research on small language models
- Baseline comparisons

**Not recommended for:**
- Production applications
- Factual question answering
- Complex reasoning tasks

## Limitations

At 117M parameters, this model sits at the very small end of functional language models:

- **Limited knowledge capacity:** cannot reliably store extensive world knowledge
- **Weak reasoning:** too few parameters to capture complex logical relationships
- **Inconsistent outputs:** may produce repetitive or off-topic responses
- **English only:** trained exclusively on English text

For production-quality results, consider models with 1B+ parameters.

## Model Files

| File | Description |
|------|-------------|
| `model.pt` | PyTorch model weights |
| `config.json` | Training configuration |
| `tokenizer.json` | BPE tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |

## Citation

```bibtex
@misc{pebblelm2026,
  author = {Sakthivel},
  title = {PebbleLM-117M: A Small Language Model for Edge Deployment},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/nameissakthi/PebbleLM-117M}}
}
```

## Acknowledgments

### Training Data
- [Wikipedia](https://huggingface.co/datasets/wikipedia) - Wikimedia Foundation
- [OpenWebText](https://huggingface.co/datasets/openwebtext) - Aaron Gokaslan and Vanya Cohen
- [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) - Ronen Eldan and Yuanzhi Li

### Infrastructure
- Google Cloud Platform (A100 GPU)
- Weights & Biases (experiment tracking)

### Frameworks
- PyTorch
- Hugging Face Tokenizers

## License

MIT License
config.json ADDED
{
  "learning_rate": 0.0003,
  "weight_decay": 0.1,
  "warmup_ratio": 0.1,
  "min_lr_ratio": 0.1,
  "max_grad_norm": 1.0,
  "label_smoothing": 0.0,
  "num_epochs": 3,
  "gradient_accumulation_steps": 2,
  "fp16": true,
  "checkpoint_dir": "checkpoints/pretrain",
  "save_steps": 1000,
  "save_total_limit": 3,
  "eval_steps": 500,
  "logging_steps": 50,
  "early_stopping_patience": 10,
  "early_stopping_threshold": 0.001,
  "device": "auto",
  "compile_model": false
}
model.pt ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6486b2f1f306f35596427359394b97fd7fdc320c0f425996eaab5715d90c9f8c
size 469854989
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "vocab_size": 16384,
  "pad_token": "<|pad|>",
  "bos_token": "<|bos|>",
  "eos_token": "<|eos|>",
  "unk_token": "<|unk|>",
  "user_token": "<|user|>",
  "assistant_token": "<|assistant|>"
}