SimpleStories/SimpleStories-V2-5M



	---
	license: mit
	datasets:
	- lennart-finke/SimpleStories
	language:
	- en
	tags:
	- small-language-model
	- story-generation
	- text-generation
	- efficient-nlp
	- distilled-models
	---

	# SimpleStories Model Family
	The SimpleStories models are a family of tiny language models built for interpretability research and trained on the [SimpleStories dataset](https://huggingface.co/datasets/SimpleStories/SimpleStories). This is the second iteration of the model family.


	- Paper: https://arxiv.org/abs/2504.09184
	- Training code: https://github.com/simple-stories/simple_stories_train
	- Training checkpoints: https://wandb.ai/finke/simplestories-v2

	## Usage

	```python
	import torch
	from transformers import AutoTokenizer, LlamaForCausalLM

	MODEL_SIZE = "5M"
	model_path = f"SimpleStories/SimpleStories-V2-{MODEL_SIZE}"

	# Use the GPU when one is available; otherwise fall back to CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"

	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = LlamaForCausalLM.from_pretrained(model_path)
	model.to(device)
	model.eval()

	prompt = "The curious cat looked at the"

	# Tokenize without adding special tokens, then move the ids to the model's device.
	inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
	input_ids = inputs.input_ids.to(device)

	# Token id that ends generation.
	eos_token_id = 1

	# Sample a continuation; gradients are not needed for inference.
	with torch.no_grad():
	    output_ids = model.generate(
	        input_ids=input_ids,
	        max_new_tokens=400,
	        temperature=0.7,
	        do_sample=True,
	        eos_token_id=eos_token_id,
	    )

	output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
	print(f"\nGenerated text:\n{output_text}")

	```

	## Model Variants

	| Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
	|------------|----------|----------|---------|---------|-------|---------|
	| SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4019 |
	| SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4019 |
	| SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4019 |
	| SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4019 |
	| SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4019 |
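
For scripting across the family, the table can be encoded as a small lookup. The following is a minimal sketch, not part of the released code: `Variant`, `VARIANTS`, `repo_id`, and `d_head` are hypothetical names, and the repo id pattern is taken from the Usage snippet, where it is only shown for the `5M` size. It is an assumption that the other sizes follow the same pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    n_layers: int
    d_model: int
    n_heads: int
    n_ctx: int
    d_vocab: int

    @property
    def d_head(self) -> int:
        # Per-head width, assuming d_model is split evenly across heads.
        return self.d_model // self.n_heads


# Values copied from the Model Variants table, keyed by size suffix.
VARIANTS = {
    "35M": Variant(12, 512, 8, 512, 4019),
    "30M": Variant(10, 512, 8, 512, 4019),
    "11M": Variant(6, 384, 6, 512, 4019),
    "5M": Variant(6, 256, 4, 512, 4019),
    "1.25M": Variant(4, 128, 4, 512, 4019),
}


def repo_id(size: str) -> str:
    """Build a Hub repo id, assuming the naming pattern from the Usage snippet."""
    if size not in VARIANTS:
        raise KeyError(f"unknown size {size!r}; expected one of {sorted(VARIANTS)}")
    return f"SimpleStories/SimpleStories-V2-{size}"


for size, variant in VARIANTS.items():
    print(f"{repo_id(size)}: {variant.n_layers} layers, d_head={variant.d_head}")
```

Setting `MODEL_SIZE` in the Usage snippet to one of these keys would then select the corresponding model, provided that repo exists under the assumed name.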


	## Dataset

	The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:

	- Stories annotated with high-level concepts such as theme, topic, and style
	- Higher semantic and syntactic diversity through seeded story generation
	- Generated by 2024 models
	- Several NLP metrics precomputed to aid filtering
	- ASCII-only guarantee for the English dataset
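
The ASCII-only property is easy to verify on any text you pass to or get from the models. The sketch below is illustrative only: `non_ascii_chars` is a hypothetical helper, and the two sample strings are made up here rather than drawn from the dataset.

```python
def non_ascii_chars(text: str) -> list[str]:
    """Return the distinct non-ASCII characters in text, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return seen


# Made-up sample strings; the second one deliberately contains a non-ASCII letter.
samples = [
    "The curious cat looked at the moon.",
    "She ordered a café au lait.",
]

for sample in samples:
    offending = non_ascii_chars(sample)
    status = "ok" if not offending else f"non-ASCII: {offending}"
    print(f"{status} | {sample}")
```

A check like this can be useful before tokenizing custom prompts, since characters outside the training distribution may not be handled well by a small vocabulary.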


	## Key improvements over the previous version
	- Improved evaluation scores from training for more epochs
	- Pruning and optimization of the tokenizer, reducing the vocabulary size from 4096 to 4019
	- Model training checkpoints are saved periodically to Weights & Biases (wandb) for further research