justjuu committed on
Commit 073e95f · verified · 1 Parent(s): be75174

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +215 -0
  2. config.json +11 -0
  3. model.safetensors +3 -0
  4. samples.txt +30 -0
  5. special_tokens_map.json +3 -0
  6. tokenizer_config.json +6 -0
README.md ADDED
@@ -0,0 +1,215 @@
---
license: mit
tags:
- pytorch
- causal-lm
- gpt
- small-language-model
- decoder-only
language:
- en
pipeline_tag: text-generation
---

# Pico-GPT

A small GPT-style decoder-only language model (~49.2M parameters) trained from scratch on OpenWebText.

## Model Details

| Property | Value |
|----------|--------|
| **Architecture** | Decoder-only Transformer with Pre-LayerNorm |
| **Parameters** | ~49,218,816 |
| **Layers** | 6 |
| **Hidden Size** | 384 |
| **FFN Hidden Size** | 1536 |
| **Attention Heads** | 6 |
| **Head Dimension** | 64 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 50257 (GPT-2) |
| **Flash Attention** | ✅ Enabled |
| **Dropout** | 0.1 |
| **Bias** | Disabled |

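The parameter count in the table can be reproduced from the config values with a few lines of arithmetic. This is one breakdown consistent with the table's figure (it assumes untied embeddings, bias disabled everywhere, and weight-only LayerNorms; the authoritative layout lives in `pico_gpt/model.py`):

```python
# One parameter breakdown that reproduces the table's ~49,218,816 figure,
# assuming untied embeddings, no bias terms, and weight-only LayerNorms.
vocab_size, n_embd, n_layer, ffn_dim = 50257, 384, 6, 1536

tok_emb = vocab_size * n_embd        # token embedding matrix
lm_head = vocab_size * n_embd        # untied output projection (see Limitations)
attn = 4 * n_embd * n_embd           # Q, K, V, and output projections
ffn = 2 * n_embd * ffn_dim           # up- and down-projection
norms = 2 * n_embd                   # two LayerNorm weights per block

total = tok_emb + lm_head + n_layer * (attn + ffn + norms)
print(total)  # → 49218816
```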
## Training Objective

The model was trained using **causal language modeling (next-token prediction)**. The loss function is cross-entropy over the vocabulary.

For a given sequence of tokens `x_1, x_2, ..., x_n`, the model is trained to predict `x_{i+1}` given `x_1, ..., x_i`.

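As a sketch of this objective, the shifted cross-entropy loss can be written in plain PyTorch. The tensors here are illustrative stand-ins, not the repo's training loop:

```python
import torch
import torch.nn.functional as F

# Toy next-token prediction loss: position i is trained to predict token i+1,
# so logits and targets are shifted against each other by one position.
torch.manual_seed(0)
vocab_size, seq_len = 50257, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)  # stand-in for model output

shift_logits = logits[:, :-1, :]   # predictions for positions 1..n-1
shift_targets = tokens[:, 1:]      # the tokens those positions should predict
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_targets.reshape(-1),
)
print(f"loss = {loss.item():.2f}")  # near ln(50257) ≈ 10.8 for random logits
```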
## Dataset

### Source
- **Dataset:** OpenWebText
- **Hugging Face:** `Skylion007/openwebtext`
- **Mode:** Streaming preprocessing
- **License:** Same as OpenAI's GPT-2 dataset

### Preprocessing Pipeline
- **Tokenizer:** GPT-2 (tiktoken)
- **Tokenization:** Streaming, incremental
- **EOS Token:** Appended after each document
- **Text Cleaning:** Minimal (strip whitespace, skip empty strings)
- **Sharding:** Binary shards (uint16), 5M tokens per shard
- **Train/Val Split:** Deterministic split by token count
- **Memory Mapping:** Enabled for efficient loading

### Dataset Statistics
- **Total Tokens Collected:** 1B tokens
- **Training Tokens:** 950M tokens
- **Validation Tokens:** 50M tokens
- **Training Shards:** ~190 files (train_000.bin to train_189.bin)
- **Validation Shard:** val.bin
- **Data Type:** uint16 (supports memory mapping)

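A minimal sketch of the uint16-on-disk layout described above: token ids are written as raw binary and read back with a memory map. The filename and tiny token list are illustrative, not the repo's actual loader (real shards hold 5M tokens). uint16 works here because every GPT-2 id (0–50256) fits below 2**16:

```python
import numpy as np

# Write a tiny illustrative shard of GPT-2 token ids as raw uint16 ...
tokens = np.array([50256, 464, 2003, 286, 9552], dtype=np.uint16)
tokens.tofile("shard_example.bin")  # hypothetical filename

# ... then memory-map it back without loading the whole file into RAM.
data = np.memmap("shard_example.bin", dtype=np.uint16, mode="r")
print(len(data), data[:3].tolist())  # → 5 [50256, 464, 2003]
```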
## Training Configuration

### Hyperparameters
| Parameter | Value |
|-----------|--------|
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 |
| **Weight Decay** | 0.1 |
| **Betas** | (0.9, 0.95) |
| **Max Steps** | N/A |
| **Batch Size** | 64 |
| **Context Window** | 128 |
| **Gradient Clipping** | 1.0 |
| **Checkpoint Interval** | N/A |
| **Log Interval** | N/A |

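The table's optimizer settings translate to a straightforward AdamW call. The model below is a stand-in module; the actual training script may differ in details such as excluding norms and embeddings from weight decay:

```python
import torch

model = torch.nn.Linear(384, 384, bias=False)  # stand-in for the GPT model

# Hyperparameters taken directly from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

# Gradient clipping at 1.0 is applied between backward() and step().
loss = model(torch.randn(4, 384)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```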
### Training Results
| Metric | Value |
|--------|--------|
| **Final Training Loss** | N/A |
| **Training Time** | N/A |
| **Hardware** | NVIDIA A100 (20GB) or equivalent |

## Model Files

| File | Description |
|-------|-------------|
| `model.safetensors` | Model weights in safetensors format (secure, fast loading) |
| `config.json` | Model architecture configuration |
| `training_config.json` | Training hyperparameters and results |
| `training_log.csv` | Training metrics over time (step, loss, elapsed_time) |
| `samples.txt` | Sample generations from the trained model |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special tokens mapping |

## Usage

### Loading with safetensors:

```python
import torch
from safetensors.torch import load_file
import json

# Load config
with open("config.json", "r") as f:
    config = json.load(f)

# Load weights
state_dict = load_file("model.safetensors")

# Create model (requires custom model class from pico_gpt/model.py)
from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

config.pop("model_type", None)  # metadata key; likely not a ModelConfig field
model = GPT(ModelConfig(**config))
model.load_state_dict(state_dict)
model.eval()
```

### Text Generation:

```python
import torch
import tiktoken

# Load tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare prompt (model is the Pico-GPT instance loaded above)
context_length = 128  # model context window, see config.json
prompt = "The future of artificial intelligence is"
tokens = enc.encode(prompt)
tokens = tokens[-context_length:]  # Truncate to context length if needed
idx = torch.tensor([tokens], dtype=torch.long)

# Generate
with torch.no_grad():
    generated = model.generate(
        idx,
        max_new_tokens=100,
        temperature=0.8,
        eos_token_id=enc.eot_token,
    )

# Decode result
generated_text = enc.decode(generated[0].tolist())
print(generated_text)
```

### Loading Checkpoint:

```python
import torch

# Load checkpoint (the path is a placeholder for an actual step number)
checkpoint = torch.load("checkpoint_step_<N>.pt", map_location="cpu")
model_state = checkpoint["model_state_dict"]
config = checkpoint["config"]

# Load training config if needed
training_config = checkpoint.get("training_config", {})

# Use with custom GPT class
from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# If the checkpoint stores config as a plain dict, rebuild the dataclass first
model = GPT(ModelConfig(**config) if isinstance(config, dict) else config)
model.load_state_dict(model_state)
```

## Limitations

- **Small Model Size:** ~49.2M parameters limit reasoning capability
- **Short Context:** 128-token context window limits long-range dependencies
- **Single Dataset:** Trained only on web text (OpenWebText subset)
- **No Instruction Tuning:** Not aligned for chat/instruction following
- **Potential Biases:** May contain biases present in the training data
- **No Weight Tying:** Embedding and output layers have separate parameters

## Future Work

- [ ] Convert to native Hugging Face GPT-2 architecture
- [ ] Increase model size and context length
- [ ] Add instruction tuning / alignment
- [ ] Evaluation on downstream benchmarks (perplexity, etc.)
- [ ] Fine-tune for specific tasks
- [ ] Implement more sampling strategies (top-k, top-p)
- [ ] Add support for streaming inference

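The top-k / top-p item above is a small change to the sampling step. A hedged sketch of top-k filtering over one step's logits (the function name and tensors are illustrative; this is not code from `pico_gpt`):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int, temperature: float = 0.8) -> int:
    """Keep only the k most likely tokens, renormalize, and sample one id."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)   # filter to the k best logits
    probs = torch.softmax(topk_vals, dim=-1)      # renormalize over survivors
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

torch.manual_seed(0)
logits = torch.randn(50257)  # stand-in for one step of model output
token_id = sample_top_k(logits, k=50)
print(0 <= token_id < 50257)  # → True
```

Top-p (nucleus) sampling is analogous: sort the probabilities, keep the smallest prefix whose cumulative mass exceeds p, and sample from that set.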
## Citation

```bibtex
@misc{pico-gpt,
  title={Pico-GPT: A Small Language Model from Scratch},
  author={Your Name},
  year={2026},
  howpublished={\url{https://huggingface.co/YOUR_USERNAME/pico-gpt}},
}
```

## Acknowledgments

- This project uses the **GPT-2 tokenizer** from OpenAI's `tiktoken` library
- Dataset: **OpenWebText** by Skylion007
- Architecture inspired by **GPT**, **GPT-2**, and **nanoGPT**

---

*For training details, see `training_config.json` and `training_log.csv`.*
*Model files use the safetensors format for safe and efficient loading.*
config.json ADDED
@@ -0,0 +1,11 @@
{
  "model_type": "custom_gpt",
  "vocab_size": 50257,
  "n_layer": 6,
  "n_head": 6,
  "n_embd": 384,
  "context_length": 128,
  "dropout": 0.1,
  "bias": false,
  "ffn_dim": 1536
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d440a4046ce1be803b5e04d4bc9bd6863d58f1258ae85e29746dde63c62e7cea
size 197098232
samples.txt ADDED
@@ -0,0 +1,30 @@
Generated Samples
==================================================

### Sample 1
**Prompt:** The future of artificial intelligence is
**Generated:** easily forgotten. The ruling will be crucial to the moment of our presidency, when you have a tradition of replaced patenting a visually impaired former boss with a super-apartie.

This is the right moment for the day. The site

### Sample 2
**Prompt:** Once upon a time
**Generated:** , the world is moving upward, and the planet is on thepak--the sun of summer, the howl of January.

About the same time, the biggest change is in the way of the lightning Kingdom: The last row of our

### Sample 3
**Prompt:** The best way to learn programming is
**Generated:** to make your school in our favorite languages.

The such anStrength is that you can learn languages and teach English. You need to have huge knowledge of what they use in your learning skills or their own methods.

The next step to

### Sample 4
**Prompt:** In the field of machine learning
**Generated:** , the term “political philosophy” often associated with the great post- AND State’s distinction between two offices of government and foreign policy. The word “unexist” is used to refer to the laws of nature, in

### Sample 5
**Prompt:** One of the most important concepts is
**Generated:** that you can tell the truth about the stuff that would happen to you of how much you played. If you were at a time where you had a good conversation with your husband and who did you do? If you were a kid, you’
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
{
  "eos_token": "<|endoftext|>"
}
tokenizer_config.json ADDED
@@ -0,0 +1,6 @@
{
  "tokenizer_class": "GPT2Tokenizer",
  "eos_token": "<|endoftext|>",
  "model_max_length": 128,
  "tokenizer_type": "gpt2"
}