Commit 7d4d882
Initial commit: PebbleLM-117M base model
Files changed:
- .gitattributes (+1, -0)
- README.md (+155, -0)
- config.json (+20, -0)
- model.pt (+3, -0)
- tokenizer.json (+0, -0)
- tokenizer_config.json (+9, -0)
.gitattributes
ADDED
@@ -0,0 +1 @@
*.pt filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,155 @@
---
license: mit
language:
- en
tags:
- text-generation
- pytorch
- small-language-model
- edge-deployment
- from-scratch
datasets:
- wikipedia
- openwebtext
- roneneldan/TinyStories
pipeline_tag: text-generation
---

# PebbleLM-117M

A 117.5M-parameter language model trained from scratch. Small but solid: designed for edge deployment and educational purposes.

## Model Description

PebbleLM-117M is a decoder-only transformer trained on a diverse corpus of text. Despite its small size, it demonstrates basic language understanding and generation capabilities.

| Property | Value |
|----------|-------|
| Parameters | 117.5M |
| Architecture | Decoder-only Transformer |
| Layers | 8 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Context Length | 1024 tokens |
| Vocabulary | 16,384 BPE tokens |
| Position Encoding | RoPE |
| Normalization | RMSNorm |
| Activation | GELU |
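The reported parameter count can be sanity-checked from the table. A minimal back-of-the-envelope sketch, assuming tied input/output embeddings and a 4x feed-forward expansion (neither of which is stated in this card):

```python
# Rough parameter count from the architecture table above.
# Assumptions not stated in the card: tied embeddings, 4x MLP expansion,
# negligible RMSNorm parameters.
vocab, d_model, n_layers = 16_384, 1024, 8

embed = vocab * d_model              # token embeddings (~16.8M), tied with the LM head
attn = 4 * d_model * d_model         # Q, K, V, O projections (~4.2M per layer)
mlp = 2 * d_model * (4 * d_model)    # up and down projections (~8.4M per layer)
total = embed + n_layers * (attn + mlp)

print(f"{total / 1e6:.1f}M parameters")  # -> 117.4M, consistent with the reported 117.5M
```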

## Training Data

Pretrained on roughly 1.18M samples from diverse sources:

| Dataset | Samples | Description | Link |
|---------|---------|-------------|------|
| Wikipedia | 488,906 | Encyclopedic knowledge | [wikipedia](https://huggingface.co/datasets/wikipedia) |
| OpenWebText | 500,000 | Diverse web content | [openwebtext](https://huggingface.co/datasets/openwebtext) |
| TinyStories | 188,067 | Simple narrative structure | [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) |
| **Total** | **1,176,973** | | |

## Training Details

```yaml
Epochs: 3
Batch Size: 48
Gradient Accumulation: 2
Effective Batch Size: 96
Learning Rate: 3e-4
Warmup Ratio: 0.1
Precision: FP16
Hardware: NVIDIA A100 80GB
Training Time: ~4.5 hours
```
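For concreteness, these settings imply the following step counts. The cosine decay shape is an assumption; only the warmup and minimum-LR ratios appear in config.json:

```python
import math

# Step counts implied by the training details above.
samples, epochs, eff_batch = 1_176_973, 3, 96
total_steps = epochs * samples // eff_batch   # ~36,780 optimizer steps
warmup_steps = int(0.1 * total_steps)         # warmup_ratio 0.1 -> ~3,678 steps

def lr_at(step, peak=3e-4, min_lr_ratio=0.1):
    """Linear warmup to `peak`, then (assumed) cosine decay to min_lr_ratio * peak."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * (min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))
```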

## Benchmark Results

Evaluated on 500 samples per benchmark:

| Benchmark | Accuracy | Random Baseline | Above Random |
|-----------|----------|-----------------|--------------|
| HellaSwag | 32.20% | 25% | +7.2% |
| ARC-Easy | 35.80% | 25% | +10.8% |
| WinoGrande | 52.80% | 50% | +2.8% |
| PIQA | 58.20% | 50% | +8.2% |
| **Average** | **44.75%** | - | - |
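The card does not name the evaluation harness. These four benchmarks are conventionally scored by picking the answer choice with the highest (length-normalized) log-likelihood; a minimal sketch of that procedure, assuming an HF-style model whose forward pass returns `.logits`:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_choice(model, tokenizer, context: str, choice: str) -> float:
    """Mean log-likelihood of `choice` tokens given `context` (higher is better)."""
    ctx_ids = tokenizer.encode(context)
    full_ids = tokenizer.encode(context + choice)
    logits = model(torch.tensor([full_ids])).logits       # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)     # position i predicts token i+1
    choice_tokens = full_ids[len(ctx_ids):]
    positions = range(len(ctx_ids) - 1, len(full_ids) - 1)
    total = sum(log_probs[p, t].item() for p, t in zip(positions, choice_tokens))
    return total / len(choice_tokens)

# The predicted answer is the argmax of score_choice over the candidate endings.
```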

## Usage

```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("nameissakthi/PebbleLM-117M")

# Load model (custom architecture):
# see https://github.com/nameissakthi/slm-qualcomm for the model code.
```
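A sketch of what generation could look like once the model class from the linked repo is available. `PebbleLM` and its loading call are hypothetical placeholders, not a confirmed API:

```python
import torch

# `PebbleLM.from_checkpoint` is a hypothetical placeholder; consult the
# linked GitHub repo for the actual class name and loading API.
model = PebbleLM.from_checkpoint("model.pt", "config.json")
model.eval()

ids = torch.tensor([tokenizer.encode("Once upon a time")])
with torch.no_grad():
    for _ in range(50):          # greedy decoding, one token per step
        logits = model(ids)      # assumed output shape: (1, seq_len, vocab)
        next_id = logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0].tolist()))
```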

### For Chat/Q&A Use

See the finetuned version: [PebbleLM-117M-Chat](https://huggingface.co/nameissakthi/PebbleLM-117M-Chat)

## Intended Use

**Appropriate for:**
- Edge deployment experiments
- Educational purposes (learning transformer architecture)
- Research on small language models
- Baseline comparisons

**Not recommended for:**
- Production applications
- Factual question answering
- Complex reasoning tasks

## Limitations

At 117M parameters, this is a very small model by modern standards, with the limitations that implies:

- **Limited knowledge capacity** - Cannot reliably store extensive world knowledge
- **Weak reasoning** - Too few parameters to capture complex logical relationships
- **Inconsistent outputs** - May produce repetitive or off-topic responses
- **English only** - Trained exclusively on English text

For production-quality results, consider models with 1B+ parameters.

## Model Files

| File | Description |
|------|-------------|
| `model.pt` | PyTorch model weights (stored via Git LFS) |
| `config.json` | Training configuration |
| `tokenizer.json` | BPE tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |

## Citation

```bibtex
@misc{pebblellm2026,
  author = {Sakthivel},
  title = {PebbleLM-117M: A Small Language Model for Edge Deployment},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/nameissakthi/PebbleLM-117M}}
}
```

## Acknowledgments

### Training Data
- [Wikipedia](https://huggingface.co/datasets/wikipedia) - Wikimedia Foundation
- [OpenWebText](https://huggingface.co/datasets/openwebtext) - Aaron Gokaslan and Vanya Cohen
- [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) - Ronen Eldan and Yuanzhi Li

### Infrastructure
- Google Cloud Platform (A100 GPU)
- Weights & Biases (experiment tracking)

### Frameworks
- PyTorch
- Hugging Face Tokenizers

## License

MIT License
config.json
ADDED
@@ -0,0 +1,20 @@
{
  "learning_rate": 0.0003,
  "weight_decay": 0.1,
  "warmup_ratio": 0.1,
  "min_lr_ratio": 0.1,
  "max_grad_norm": 1.0,
  "label_smoothing": 0.0,
  "num_epochs": 3,
  "gradient_accumulation_steps": 2,
  "fp16": true,
  "checkpoint_dir": "checkpoints/pretrain",
  "save_steps": 1000,
  "save_total_limit": 3,
  "eval_steps": 500,
  "logging_steps": 50,
  "early_stopping_patience": 10,
  "early_stopping_threshold": 0.001,
  "device": "auto",
  "compile_model": false
}
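Note that this file holds training hyperparameters rather than the model architecture. One number worth cross-checking against the README: combined with the per-device batch size of 48 stated in the card (not in this file), the accumulation setting here reproduces the effective batch size of 96. A minimal check, using only the standard library:

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# Per-device batch size 48 comes from the README, not this file.
effective_batch = 48 * cfg["gradient_accumulation_steps"]
assert effective_batch == 96
```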
model.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6486b2f1f306f35596427359394b97fd7fdc320c0f425996eaab5715d90c9f8c
size 469854989
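For reference, the pointer size lines up with full-precision storage: 117.5M parameters × 4 bytes ≈ 470 MB, which matches the 469,854,989 bytes recorded here. Training used FP16 autocast, but the saved checkpoint appears to hold FP32 weights.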
tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "vocab_size": 16384,
  "pad_token": "<|pad|>",
  "bos_token": "<|bos|>",
  "eos_token": "<|eos|>",
  "unk_token": "<|unk|>",
  "user_token": "<|user|>",
  "assistant_token": "<|assistant|>"
}
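Since the README's usage snippet goes through AutoTokenizer, tokenizer.json is presumably in the standard `tokenizers` serialization format, so it can also be loaded directly without `transformers`; a minimal sketch:

```python
from tokenizers import Tokenizer

# Assumes tokenizer.json is a standard `tokenizers`-format file, as the
# AutoTokenizer usage in the README implies.
tok = Tokenizer.from_file("tokenizer.json")

enc = tok.encode("Hello, world!")
print(enc.ids)               # token ids under the 16,384-entry BPE vocabulary
print(tok.decode(enc.ids))   # round-trips to the original text
```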