---
license: mit
datasets:
- lennart-finke/SimpleStories
language:
- en
tags:
- small-language-model
- story-generation
- text-generation
- efficient-nlp
- distilled-models
---

# SimpleStories Model Family
The SimpleStories models are a family of tiny language models created for interpretability research and trained on the [SimpleStories dataset](https://huggingface.co/datasets/SimpleStories/SimpleStories). This is the second iteration of the model family.


**Paper:** https://arxiv.org/abs/2504.09184  
**Training code:** https://github.com/simple-stories/simple_stories_train  
**Training checkpoints:** https://wandb.ai/finke/simplestories-v2  

## Usage

```python
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Choose one of the sizes listed under "Model Variants", e.g. "11M".
MODEL_SIZE = "11M"
model_path = f"SimpleStories/SimpleStories-V2-{MODEL_SIZE}"

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path)
model.to(device)
model.eval()

prompt = "The curious cat looked at the"

# Tokenize the prompt without adding special tokens.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
input_ids = inputs.input_ids.to(device)

# End-of-sequence token ID; generation stops once this token is sampled.
eos_token_id = 1

with torch.no_grad():
    output_ids = model.generate(
        input_ids=input_ids,
        max_new_tokens=400,
        temperature=0.7,
        do_sample=True,
        eos_token_id=eos_token_id,
    )

output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"\nGenerated text:\n{output_text}")

```
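
The usage example samples with `temperature=0.7`. As a rough illustration of what that setting does, the sketch below applies the standard temperature-scaled softmax to a few made-up scores. The function name `softmax_with_temperature` and the example logits are hypothetical, and this is not the library's internal implementation, only the usual textbook formula.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharper = softmax_with_temperature(logits, temperature=0.7)

print("T=1.0:", [round(p, 3) for p in probs_default])
print("T=0.7:", [round(p, 3) for p in probs_sharper])
```

A temperature below 1 concentrates probability on the highest-scoring token, so generations tend to be more conservative; values above 1 flatten the distribution.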

## Model Variants

| Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
|------------|----------|----------|---------|---------|-------|---------|
| SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4019 |
| SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4019 |
| SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4019 |
| SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4019 |
| SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4019 |
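
The table can also be read programmatically. The sketch below copies `n_layers`, `d_model`, and `n_heads` from the table and derives the per-head dimension as `d_model / n_heads`, which is the usual multi-head attention convention. The helper names `repo_id` and `head_dim` are hypothetical, and the repository naming pattern is assumed to match the one in the usage example for every size.

```python
# Values copied from the Model Variants table: (n_layers, d_model, n_heads).
VARIANTS = {
    "35M": (12, 512, 8),
    "30M": (10, 512, 8),
    "11M": (6, 384, 6),
    "5M": (6, 256, 4),
    "1.25M": (4, 128, 4),
}


def repo_id(size):
    """Build a Hub repository ID (assumed pattern, taken from the usage example)."""
    if size not in VARIANTS:
        raise KeyError(f"unknown model size: {size}")
    return f"SimpleStories/SimpleStories-V2-{size}"


def head_dim(size):
    """Per-head dimension, assuming d_model is split evenly across the heads."""
    _, d_model, n_heads = VARIANTS[size]
    if d_model % n_heads != 0:
        raise ValueError("d_model is not divisible by n_heads")
    return d_model // n_heads


for size in VARIANTS:
    print(f"{repo_id(size)}: head_dim={head_dim(size)}")
```

Under that assumption, every variant uses 64-dimensional heads except the 1.25M model, which uses 32-dimensional heads.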


## Dataset

The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:

- Story annotation with high-level concepts: theme, topic, style, etc.
- Higher semantic and syntactic diversity through seeded story generation
- Generated by 2024 models
- Several pre-computed NLP metrics to aid filtering
- ASCII-only guarantee for the English dataset


## Key improvements over the previous version
- Improved evaluation scores from training for more epochs
- A pruned and optimized tokenizer, reducing the vocabulary size from 4096 to 4019
- Training checkpoints saved periodically to Weights & Biases (wandb) for further research