nameissakthi committed on
Commit
e3ef0ba
0 Parent(s):

Initial commit: PebbleLM-117M-Chat

Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +264 -0
  3. config.json +21 -0
  4. model.pt +3 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +9 -0
.gitattributes ADDED
@@ -0,0 +1 @@
*.pt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,264 @@
---
license: mit
language:
- en
tags:
- text-generation
- pytorch
- small-language-model
- edge-deployment
- conversational
- chat
base_model: nameissakthi/PebbleLM-117M
datasets:
- yahma/alpaca-cleaned
- databricks/databricks-dolly-15k
pipeline_tag: text-generation
---

# PebbleLM-117M-Chat

A 117.5M-parameter language model finetuned for conversational Q&A. Small but solid, designed for edge deployment.

**Base Model:** [PebbleLM-117M](https://huggingface.co/nameissakthi/PebbleLM-117M)

## Model Description

PebbleLM-117M-Chat is finetuned from PebbleLM-117M on focused Q&A datasets to provide direct, concise answers to questions.

| Property | Value |
|----------|-------|
| Parameters | 117.5M |
| Architecture | Decoder-only Transformer |
| Layers | 8 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Context Length | 1024 tokens |
| Vocabulary | 16,384 BPE tokens |

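The parameter count in the table can be roughly verified from the architecture figures, assuming tied input/output embeddings and the 4096-wide MLP declared in `config.json` (this is a back-of-the-envelope sketch; it ignores norm and bias parameters):

```python
# Rough parameter count from the architecture table (tied embeddings assumed).
vocab, hidden, layers, intermediate = 16384, 1024, 8, 4096

embeddings = vocab * hidden            # token embeddings, shared with the LM head
attn = 4 * hidden * hidden             # Q, K, V, O projections per layer
mlp = 2 * hidden * intermediate        # up- and down-projection per layer
total = embeddings + layers * (attn + mlp)

print(f"{total / 1e6:.1f}M")  # 117.4M, matching the 117.5M in the table
```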
## Inference Examples

```
User: What is 2 + 2?
Assistant: 2 + 2 equals 4.

User: What is the capital of France?
Assistant: The capital of France is Paris.

User: What is the largest planet?
Assistant: Jupiter is the largest planet in our solar system.

User: Who wrote Romeo and Juliet?
Assistant: William Shakespeare wrote Romeo and Juliet.

User: How do you say hello in Spanish?
Assistant: Hello in Spanish is 'Hola'.

User: What is AI?
Assistant: AI (artificial intelligence) is the ability of machines to perform
tasks that normally require human intelligence, such as learning,
problem-solving, and decision making.

User: Thank you
Assistant: Goodbye! Have a great day!
```

## Training Data

### Pretraining (Base Model)
| Dataset | Samples | Link |
|---------|---------|------|
| Wikipedia | 488,906 | [wikipedia](https://huggingface.co/datasets/wikipedia) |
| OpenWebText | 500,000 | [openwebtext](https://huggingface.co/datasets/openwebtext) |
| TinyStories | 188,067 | [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) |

### Finetuning (This Model)
| Dataset | Samples | Description | Link |
|---------|---------|-------------|------|
| Alpaca-cleaned | 20,000 | Instruction-response pairs | [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) |
| Databricks Dolly | 10,991 | Q&A pairs | [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) |
| Simple Q&A | 1,500 | Hand-crafted basic facts | Custom |
| **Total** | **32,491** | | |

84
+
85
+ ```yaml
86
+ Base Checkpoint: PebbleLM-117M
87
+ Epochs: 5
88
+ Batch Size: 48
89
+ Gradient Accumulation: 2
90
+ Learning Rate: 5e-5
91
+ Final Training Loss: 1.55
92
+ Hardware: NVIDIA A100 80GB
93
+ Training Time: ~40 minutes
94
+ ```
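
The effective batch size and step count follow from the config above and the 32,491 finetuning samples (a sketch; assumes full batches with the remainder dropped):

```python
# Derived from the training config above.
samples, epochs = 32_491, 5
batch_size, grad_accum = 48, 2

effective_batch = batch_size * grad_accum        # 96 samples per optimizer step
steps_per_epoch = samples // effective_batch     # 338 full steps per epoch
total_steps = steps_per_epoch * epochs           # 1690 optimizer steps over 5 epochs

print(effective_batch, steps_per_epoch, total_steps)
```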

## Benchmark Results

| Benchmark | Base Model | Chat Model | Change |
|-----------|------------|------------|--------|
| HellaSwag | 32.20% | 31.80% | -0.4 pts |
| ARC-Easy | 35.80% | 40.00% | **+4.2 pts** |
| WinoGrande | 52.80% | 49.20% | -3.6 pts |
| PIQA | 58.20% | 56.00% | -2.2 pts |
| **Average** | **44.75%** | **44.25%** | -0.5 pts |

**Note:** A slight benchmark decrease is expected: the model is optimized for Q&A quality, not reasoning benchmarks. The real improvement is in conversational responses.

## Usage

### Installation

```bash
pip install torch tokenizers huggingface_hub

# Clone model architecture code
git clone https://github.com/nameissakthi/slm-qualcomm
cd slm-qualcomm
```

### Download Model

```python
from huggingface_hub import hf_hub_download

# Download model files
model_path = hf_hub_download(repo_id="nameissakthi/PebbleLM-117M-Chat", filename="model.pt")
tokenizer_path = hf_hub_download(repo_id="nameissakthi/PebbleLM-117M-Chat", filename="tokenizer.json")
```

### Load Model

```python
import torch
from tokenizers import Tokenizer
from src.model.transformer import SLMForCausalLM
from src.model.config import SLMConfig

# Load tokenizer
tokenizer = Tokenizer.from_file(tokenizer_path)

# Load model
config = SLMConfig(vocab_size=16384)
model = SLMForCausalLM(config)

# Training checkpoints may wrap the weights in a "model_state_dict" key
state_dict = torch.load(model_path, map_location="cpu")
if "model_state_dict" in state_dict:
    state_dict = state_dict["model_state_dict"]
model.load_state_dict(state_dict)
model.eval()
```

### Prompt Format

```
<|user|>
Your question here
<|assistant|>
```

### Generate Response

```python
def generate(prompt, max_tokens=128, temperature=0.3):
    formatted = f"<|user|>\n{prompt}\n<|assistant|>\n"
    input_ids = torch.tensor([tokenizer.encode(formatted).ids])

    with torch.no_grad():
        for _ in range(max_tokens):
            logits = model(input_ids).logits[:, -1, :]
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, 1)
            input_ids = torch.cat([input_ids, next_token], dim=-1)

            # Stop on EOS or user token
            if next_token.item() in [tokenizer.token_to_id("<|eos|>"),
                                     tokenizer.token_to_id("<|user|>")]:
                break

    response = tokenizer.decode(input_ids[0].tolist())
    return response.split("<|assistant|>")[-1].replace("<|eos|>", "").strip()

# Example
print(generate("What is the capital of France?"))
# Output: The capital of France is Paris.

print(generate("What is 2 + 2?"))
# Output: 2 + 2 equals 4.
```

### Recommended Settings

```python
temperature = 0.3        # Lower = more consistent
top_k = 50               # Limit token choices
top_p = 0.9              # Nucleus sampling
repetition_penalty = 1.2 # Reduce repetition
max_tokens = 128         # Keep responses short
```
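
The `generate` function above uses plain temperature sampling; a sketch of one decoding step that additionally applies top-k, top-p, and the repetition penalty follows. The function name and the penalty scheme are illustrative, not part of the repo's code:

```python
import torch

def sample_next_token(logits, generated_ids, temperature=0.3, top_k=50,
                      top_p=0.9, repetition_penalty=1.2):
    """One sampling step applying the recommended settings (illustrative sketch)."""
    logits = logits.clone()

    # Repetition penalty: push down logits of already-generated tokens.
    for t in set(generated_ids):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 \
            else logits[t] * repetition_penalty

    logits = logits / temperature

    # Top-k: restrict to the k highest-scoring tokens.
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)

    # Top-p (nucleus): keep the smallest prefix whose mass stays within top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
    keep[0] = True  # always keep the most likely token
    kept_probs = sorted_probs[keep] / sorted_probs[keep].sum()

    choice = torch.multinomial(kept_probs, 1)
    return topk_idx[sorted_idx[keep][choice]].item()
```

In the generation loop, this would replace the `softmax`/`multinomial` pair, with `input_ids[0].tolist()` passed as `generated_ids`.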

## Intended Use

**Appropriate for:**
- Edge deployment demos
- Simple Q&A applications
- Educational purposes
- IoT/embedded device experiments

**Not recommended for:**
- Production chatbots
- Factual accuracy-critical applications
- Complex multi-turn conversations

## Limitations

- **~60% accuracy** on simple factual questions
- **Inconsistent** on complex or unusual questions
- **May hallucinate** incorrect facts
- **English only**
- **117M parameters** limit knowledge capacity

For production quality, consider 1B+ parameter models.

## Model Files

| File | Description |
|------|-------------|
| `model.pt` | PyTorch model weights |
| `config.json` | Model configuration |
| `tokenizer.json` | BPE tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |

## Citation

```bibtex
@misc{pebblellmchat2026,
  author = {Sakthivel},
  title = {PebbleLM-117M-Chat: A Small Conversational Language Model},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/nameissakthi/PebbleLM-117M-Chat}}
}
```

## Acknowledgments

### Training Data
- [Wikipedia](https://huggingface.co/datasets/wikipedia) - Wikimedia Foundation
- [OpenWebText](https://huggingface.co/datasets/openwebtext) - Aaron Gokaslan and Vanya Cohen
- [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) - Ronen Eldan and Yuanzhi Li
- [Alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) - yahma, a cleaned version of Stanford's Alpaca dataset
- [Databricks Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) - Databricks

### Infrastructure
- Google Cloud Platform (A100 GPU)
- Weights & Biases (experiment tracking)

### Frameworks
- PyTorch
- Hugging Face Tokenizers

## License

MIT License
config.json ADDED
@@ -0,0 +1,21 @@
{
  "model_type": "pebblellm",
  "architectures": ["PebbleLMForCausalLM"],
  "vocab_size": 16384,
  "hidden_size": 1024,
  "num_hidden_layers": 8,
  "num_attention_heads": 16,
  "head_dim": 64,
  "intermediate_size": 4096,
  "max_position_embeddings": 1024,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1e-6,
  "tie_word_embeddings": true,
  "hidden_act": "gelu",
  "dropout": 0.0,
  "attention_dropout": 0.0,
  "torch_dtype": "float16",
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0
}
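
The architecture fields in this config are internally consistent; a quick sanity check (the JSON is inlined here so the snippet is self-contained):

```python
import json

# Verify internal consistency of the architecture configuration.
cfg = json.loads("""{
  "vocab_size": 16384, "hidden_size": 1024,
  "num_hidden_layers": 8, "num_attention_heads": 16,
  "head_dim": 64, "intermediate_size": 4096,
  "max_position_embeddings": 1024
}""")

# Each head covers hidden_size / num_attention_heads dimensions.
assert cfg["num_attention_heads"] * cfg["head_dim"] == cfg["hidden_size"]
# The MLP uses the classic 4x expansion factor.
assert cfg["intermediate_size"] == 4 * cfg["hidden_size"]
print("config is internally consistent")
```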
model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b0c615f561cd6e88d06db3a8009a5e98c1b879ae97cde55d6863151b603cc5e4
size 469854989
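
The `size` field of this LFS pointer is consistent with ~117.4M parameters stored as float32 (4 bytes each) plus a small amount of checkpoint metadata, which suggests the checkpoint holds float32 weights even though `config.json` declares `torch_dtype: float16`. A quick arithmetic sketch:

```python
# model.pt size check: ~117.4M float32 parameters vs. the LFS pointer size.
params = 117_440_512            # parameter count derived from the architecture
expected = params * 4           # bytes if weights are stored as float32
pointer_size = 469_854_989      # "size" field in the LFS pointer above

print(expected, pointer_size - expected)  # difference is <0.1 MB of metadata
```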
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
{
  "vocab_size": 16384,
  "pad_token": "<|pad|>",
  "bos_token": "<|bos|>",
  "eos_token": "<|eos|>",
  "unk_token": "<|unk|>",
  "user_token": "<|user|>",
  "assistant_token": "<|assistant|>"
}