	---
	tags:
	- gpt
	- language-model
	- causal-lm
	language:
	- en
	datasets:
	- roneneldan/TinyStories
	---

	# SP-LM-alpha

	A small GPT-style causal language model trained on the TinyStories dataset using PyTorch.

	## Model Details

	- Model Type: GPT (Causal Language Model)
	- Vocab Size: 50257
	- Context Length: 128
	- Layers: 6
	- Attention Heads: 6
	- Embedding Dimension: 384
	- Training Dataset: [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)

	## Architecture

	The model uses a transformer architecture with:
	- Token and positional embeddings
	- 6 transformer blocks
	- Causal self-attention with 6 heads
	- Feed-forward networks with GELU activation
	- Layer normalization
	- Residual connections
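
As a rough sanity check on model size, the components listed above can be turned into a parameter estimate. The sketch below assumes a GPT-2-style layout that this card does not confirm: a 4x feed-forward expansion, bias terms on every linear layer, learned positional embeddings, a final layer norm, and an output head tied to the token embedding. The function and argument names are hypothetical and need not match the keys in `config.json`; the actual `sp_lm.py` may differ, so treat the result as an estimate only.

```python
def estimate_gpt_params(vocab_size=50257, block_size=128, n_layer=6, n_embd=384,
                        mlp_ratio=4, tied_head=True):
    """Estimate the parameter count of a GPT-2-style model (assumed layout)."""
    d = n_embd
    # Token embedding table plus learned positional embedding table.
    embeddings = vocab_size * d + block_size * d
    # Attention: fused QKV projection and output projection, each with bias.
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # Feed-forward: expand to mlp_ratio * d, then project back, each with bias.
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)
    per_block = attn + mlp + layer_norms
    final_ln = 2 * d
    # A tied head reuses the token embedding, so it adds no parameters.
    head = 0 if tied_head else vocab_size * d
    return embeddings + n_layer * per_block + final_ln + head


total = estimate_gpt_params()
print(f"~{total / 1e6:.1f}M parameters under these assumptions")
```

Under these assumptions the token embedding table (50257 × 384) accounts for well over half of the total, which is typical for small models that use the GPT-2 vocabulary.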

	## Usage

	### Quick Start

	```python
	from transformers import AutoTokenizer
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	import json
	import torch
	from sp_lm import GPT

	repo_id = "wizardoftrap/SP-LM-alpha"

	tokenizer = AutoTokenizer.from_pretrained(repo_id)

	# Build a lightweight config object whose attributes mirror config.json.
	with open(hf_hub_download(repo_id=repo_id, filename="config.json")) as f:
	    config_dict = json.load(f)
	config = type('Config', (), config_dict)()

	# Load the weights into the GPT class defined in sp_lm.py.
	model_weights = load_file(hf_hub_download(repo_id=repo_id, filename="model.safetensors"))
	model = GPT(config)
	model.load_state_dict(model_weights)
	model.eval()  # inference mode (disables dropout, if the model uses any)

	prompt = "Once upon a time"
	inputs = tokenizer(prompt, return_tensors="pt")
	with torch.no_grad():
	    generated_ids = model.generate(inputs["input_ids"], max_new_tokens=50, temperature=1.0, top_k=50)
	print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
	```
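
The `temperature` and `top_k` arguments control how each next token is sampled. How `GPT.generate` in `sp_lm.py` implements them is not shown in this card; the sketch below illustrates the common approach in plain Python. The helper `sample_top_k` and the toy logits are hypothetical and exist only for illustration.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=50, rng=random):
    """Sample one token index using temperature scaling and top-k filtering."""
    # Scale logits: temperature < 1 sharpens the distribution, > 1 flattens it.
    # temperature must be greater than zero.
    scaled = [x / temperature for x in logits]
    # Keep only the indices of the top_k largest logits.
    k = min(top_k, len(scaled))
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for stability).
    m = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - m) for i in kept]
    # Draw one index in proportion to its weight.
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
# With top_k=2 only the two highest-scoring indices (0 and 1) can be drawn.
print(sample_top_k(toy_logits, temperature=1.0, top_k=2))
```

With `top_k=1` this reduces to greedy decoding; larger `top_k` and higher `temperature` values make the generated stories more varied but less predictable.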

	### Installation

	1. Download the `sp_lm.py` file from this repo; it provides the `GPT` model class imported in Quick Start.

	2. Install required packages:
	```bash
	pip install transformers safetensors huggingface-hub torch
	```

	3. Load the model and generate text as shown in Quick Start.
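
Step 1 can also be scripted. A minimal sketch, assuming `sp_lm.py` sits at the root of the repo: `hf_hub_download` fetches the file into the local Hugging Face cache, and the hypothetical helper `fetch_model_code` copies it next to your script so that `from sp_lm import GPT` resolves. The `download` parameter is an illustrative addition so the copy step can be exercised without network access.

```python
import shutil
from pathlib import Path


def fetch_model_code(repo_id="wizardoftrap/SP-LM-alpha", filename="sp_lm.py",
                     dest_dir=".", download=None):
    """Copy the model definition file from the Hub cache into dest_dir."""
    if download is None:
        # Imported lazily so the helper can be defined before the package is installed.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    # Path of the file inside the local cache (downloaded on first use).
    cached_path = download(repo_id=repo_id, filename=filename)
    dest = Path(dest_dir) / filename
    shutil.copy(cached_path, dest)
    return dest


# Example (requires network access to the Hub):
# fetch_model_code()
```

Run the helper once from the directory that holds your script, after installing the packages in step 2.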

	## Training Details

	- Learning Rate: 1e-4 with linear warmup and cosine annealing decay
	- Batch Size: 32
	- Gradient Accumulation Steps: 32
	- Max Iterations: 20000
	- Optimizer: AdamW with weight decay
	- Mixed Precision: bfloat16 / float16
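
The learning-rate schedule and effective batch size can be made concrete. The sketch below assumes values this card does not state: `WARMUP_ITERS = 1000` and `MIN_LR = 1e-5` are hypothetical, and the batch size is assumed to count sequences of 128 tokens per micro-batch. Only the peak rate (1e-4), the 20000 iterations, and the 32 × 32 accumulation come from the list above.

```python
import math

MAX_LR = 1e-4        # from the card
MAX_ITERS = 20000    # from the card
WARMUP_ITERS = 1000  # hypothetical: the card does not give the warmup length
MIN_LR = 1e-5        # hypothetical: the card does not give a floor


def lr_at(it, max_lr=MAX_LR, min_lr=MIN_LR,
          warmup_iters=WARMUP_ITERS, max_iters=MAX_ITERS):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


# Effective batch: micro-batch size times gradient accumulation steps.
sequences_per_step = 32 * 32
tokens_per_step = sequences_per_step * 128  # assumes full 128-token sequences

for it in (0, 999, 1000, 10500, 19999):
    print(it, f"{lr_at(it):.2e}")
print(sequences_per_step, tokens_per_step)
```

Whether "Max Iterations" counts optimizer steps or micro-batches is not stated in the card, so the total number of training tokens cannot be derived from these figures alone.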