---
tags:
- causal-lm
- text-generation
- pre-trained
- pytorch
---

# tinystories-gpt-small

This is a custom GPT model **pre-trained from scratch on the TinyStories dataset**. It demonstrates foundational language-modeling capabilities and can be used for text generation.

## Model Details

* **Architecture:** Custom GPT (a rough parameter count derived from these hyperparameters is sketched after this list)
  * `n_layer`: 8
  * `n_head`: 8
  * `n_embd`: 512
  * `block_size`: 1024
  * `vocab_size`: 50257
  * `dropout`: 0.1
* **Pre-training Dataset:** TinyStories, a synthetic dataset of short, simple stories designed to teach language models basic reasoning and coherence.
* **Purpose:** This is a base language model: it has learned to predict the next token in a sequence from the patterns in the TinyStories dataset. It is suitable for demonstrating basic generative text capabilities and serves as a foundation for further fine-tuning on specific downstream tasks (e.g., question answering or dialogue).
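
The 50.95M parameter figure quoted under Limitations can be reproduced from these hyperparameters. The sketch below is a back-of-envelope count assuming standard GPT-2-style blocks (fused QKV attention, 4x MLP expansion, two LayerNorms per block) with biases and a tied output head; these are assumptions, and the actual `model.py` may differ in detail.

```python
# Back-of-envelope parameter count for this configuration, assuming
# GPT-2-style blocks with biases and a tied lm_head (assumptions, not
# read from model.py).
n_layer, n_embd = 8, 512
vocab_size, block_size = 50257, 1024

tok_emb = vocab_size * n_embd              # token embeddings (tied with lm_head)
pos_emb = block_size * n_embd              # learned position embeddings
attn = (n_embd * 3 * n_embd + 3 * n_embd   # fused QKV projection + bias
        + n_embd * n_embd + n_embd)        # attention output projection + bias
mlp = (n_embd * 4 * n_embd + 4 * n_embd    # MLP up-projection + bias
       + 4 * n_embd * n_embd + n_embd)     # MLP down-projection + bias
ln = 2 * 2 * n_embd                        # two LayerNorms per block (weight + bias)
per_block = attn + mlp + ln

total = tok_emb + pos_emb + n_layer * per_block + 2 * n_embd  # plus final LayerNorm
print(f"total:                {total / 1e6:.2f}M")             # ~51.48M
print(f"minus pos. embedding: {(total - pos_emb) / 1e6:.2f}M") # ~50.95M
```

Under these assumptions, excluding the position embeddings lands exactly on the 50.95M figure, which matches the common convention (used by nanoGPT, for example) of reporting the count without them.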

## How to Use (Inference)

```python
import torch
import tiktoken
from model import GPT, GPTConfig  # model.py from this repo must be importable

# 1. Define the model configuration (must match the trained model's config.json).
#    You can load this from config.json if you saved it, or define it manually.
config = GPTConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=8,
    n_head=8,
    n_embd=512,
    dropout=0.1,
    bias=True,
)

# 2. Initialize the model and load the weights.
model = GPT(config)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # path to the downloaded weights
model.load_state_dict(state_dict)
model.eval()  # evaluation mode: disables dropout
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# 3. Load the tiktoken tokenizer (GPT-2 BPE, vocab size 50257).
tokenizer = tiktoken.get_encoding("gpt2")
EOT_TOKEN_ID = tokenizer.eot_token

# 4. Prepare and encode the prompt.
prompt_text = "Once upon a time there was a pumpkin."
input_ids = tokenizer.encode(prompt_text, allowed_special="all")
input_ids_tensor = torch.tensor([input_ids], dtype=torch.long, device=device)

# 5. Generate text. Adjust max_new_tokens, temperature, and top_k as needed.
with torch.no_grad():
    generated_output_ids = model.generate(
        idx=input_ids_tensor,
        max_new_tokens=100,  # maximum number of tokens to generate
        temperature=0.7,
        top_k=50,
    )

# Decode the generated tokens, excluding the prompt.
generated_text_ids = generated_output_ids[0, len(input_ids):].tolist()
generated_text = tokenizer.decode(generated_text_ids)

# Strip any end-of-text tokens produced during generation.
generated_text = generated_text.replace(tokenizer.decode([EOT_TOKEN_ID]), "").strip()

print(f"Generated Text: {generated_text}")
```
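
The snippet above hard-codes the configuration. As the comment in step 1 suggests, if the repo ships a `config.json` whose keys match the `GPTConfig` arguments (an assumption worth verifying against the actual file), the config can be built from it instead:

```python
import json
from model import GPTConfig

# Build the config from config.json rather than hard-coding it.
# Assumes the JSON keys match the GPTConfig argument names below.
with open("config.json") as f:
    cfg = json.load(f)

config = GPTConfig(**{
    k: cfg[k]
    for k in ("vocab_size", "block_size", "n_layer", "n_head", "n_embd", "dropout", "bias")
})
```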

## Limitations and Bias

* This is a relatively small GPT (50.95M parameters); its generative capabilities are limited by its size and by the simplicity of the TinyStories dataset.
* It is a base language model that has not been instruction-tuned or fine-tuned for tasks such as complex question answering or dialogue, so its output may be incoherent or non-factual for out-of-distribution prompts.
* Like all language models, it may generate biased or incorrect content reflecting its training data.

## License

Apache 2.0