lennart-finke committed on Commit b57d5a4 · verified · 1 Parent(s): 082e226

Update README.md

Files changed (1):
  1. README.md +43 -61

README.md CHANGED
@@ -12,62 +12,14 @@ tags:
   - distilled-models
 ---
 
- # SimpleStories
- SimpleStories is a large synthetic story dataset comprising 2 million stories designed for efficient NLP research. Created to improve upon TinyStories, it offers greater syntactic and semantic diversity through parameterized prompt generation while maintaining simple language. The stories are annotated with high-level concepts such as theme, topic, style, and narrative features, making the dataset well suited for training small language models and studying language understanding.
-
- # SimpleStories 30M
- SimpleStories-30M is a 30-million-parameter language model trained on the SimpleStories dataset. It is the second-largest model in the SimpleStories family, with performance comparable to the largest variant across all evaluation metrics. It is part of a family of small language models trained on the [SimpleStories dataset](https://huggingface.co/datasets/lennart-finke/SimpleStories). The models range from 1.25M to 35M parameters, offering a spectrum of capabilities while maintaining efficiency. The training and evaluation code can be found here: https://github.com/danbraunai/simple_stories_train/tree/main/simple_stories_train
-
- ## Model Variants
-
- | Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
- |------------|----------|----------|---------|---------|-------|---------|
- | SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4096 |
- | SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4096 |
- | SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4096 |
- | SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4096 |
- | SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4096 |
-
- ## Performance Comparison
-
- Our models perform well across the evaluation metrics shown in the chart below. The trained models are scored using a model-as-judge evaluation framework.
-
- <p align="center">
-   <img width="80%" src="figures/simplestories_comparison.png">
- </p>
-
- - **Originality**: Measures the uniqueness and creativity of generated content
- - **Coherence**: Evaluates the logical flow and consistency of generated stories
- - **Grammar**: Assesses grammatical correctness and linguistic quality
- - **Quality**: Holistic evaluation of overall text generation quality
-
- The larger models (35M, 30M) achieve the best performance, particularly in coherence and grammar, while even our smallest 1.25M-parameter model produces readable and coherent content. As shown in the visualization, our SimpleStories-35M model achieves scores of 90.8 in Grammar, 85.7 in Coherence, 81.5 in Quality, and 72.5 in Originality.
-
- ## Dataset
-
- The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:
-
- - Story annotation with high-level concepts: theme, topic, style, etc.
- - Higher semantic and syntactic diversity through seeded story generation
- - Generation by 2024 models
- - Several pre-computed NLP metrics to aid filtering
- - An ASCII-only guarantee for the English dataset
-
- ## Tokenizer
-
- We trained a custom WordPiece tokenizer with a small vocabulary of 4096 tokens. We conducted morphological analysis and coverage-gain analysis on the dataset to build a small tokenizer without compromising generation quality.
-
- ## Installation
-
- Follow the steps at https://github.com/danbraunai/simple_stories_train to install the simple_stories_train package.
-
 
 ## Usage
 
- Here's how to use any model in the SimpleStories family:
 
 ```python
 from transformers import AutoTokenizer
@@ -83,9 +35,10 @@ model_size = "30M" # Options: "35M", "30M", "11M", "5M", "1.25M"
 model_config = MODEL_CONFIGS[model_size]
 
 # Load appropriate model
- model_path = f"chandan-sreedhara/SimpleStories-{model_size}"
 model = Llama.from_pretrained(model_path, model_config)
- model.to("cuda")
 model.eval()
 
 # Load tokenizer
@@ -95,15 +48,14 @@ tokenizer = AutoTokenizer.from_pretrained(model_path)
 prompt = "The curious cat looked at the"
 
 inputs = tokenizer(prompt, return_tensors="pt")
- input_ids = inputs.input_ids.to("cuda")
-
 
 # Generate text
 with torch.no_grad():
     output_ids = model.generate(
         idx=input_ids,
-         max_new_tokens=800,
-         temperature=0.7,
         top_k=40,
         eos_token_id=tokenizer.eos_token_id
     )
@@ -111,9 +63,39 @@ with torch.no_grad():
 
 # Decode output
 output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
 print(f"Generated text:\n{output_text}")
 ```
 
- ## Acknowledgements
-
- These models build upon the work done in the TinyStories project by Eldan and Li, with the SimpleStories dataset created by Lennart Finke and the training code created by Dan Braun.
 - distilled-models
 ---
 
+ # SimpleStories Model Family
+ The SimpleStories models are a tiny model family created for interpretability research, trained on the [SimpleStories dataset](https://huggingface.co/datasets/lennart-finke/SimpleStories).
 
 ## Usage
 
+ ```bash
+ pip install simple_stories_train
+ ```
 
 ```python
 from transformers import AutoTokenizer
 
 model_config = MODEL_CONFIGS[model_size]
 
 # Load appropriate model
+ model_path = f"SimpleStories/SimpleStories-{model_size}"
 model = Llama.from_pretrained(model_path, model_config)
+ device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
+ model.to(device)
 model.eval()
 
 # Load tokenizer
 
 prompt = "The curious cat looked at the"
 
 inputs = tokenizer(prompt, return_tensors="pt")
+ input_ids = inputs.input_ids.to(device)
 
 # Generate text
 with torch.no_grad():
     output_ids = model.generate(
         idx=input_ids,
+         max_new_tokens=50,
+         temperature=0.0,
         top_k=40,
         eos_token_id=tokenizer.eos_token_id
     )
 
 # Decode output
 output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
 print(f"Generated text:\n{output_text}")
 ```
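The snippet above passes `temperature` and `top_k` to `generate`. As a minimal, self-contained sketch of what those two arguments mean (illustrative logic only, not the `simple_stories_train` implementation), sampling the next token from raw logits might look like:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token id from raw logits, mimicking temperature / top_k arguments."""
    if top_k is not None:
        # Keep only the top_k highest logits; mask out the rest
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= kth else float("-inf") for l in logits]
    if temperature == 0.0:
        # Greedy decoding: always take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature, then sample proportionally
    m = max(logits)
    weights = [math.exp((l - m) / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With `temperature=0.0`, as in the README example, the sampling step collapses to greedy decoding and `top_k` has no further effect.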
 
+ ## Model Variants
+
+ | Model Name | n_params | n_layers | d_model | n_heads | n_ctx | d_vocab |
+ |------------|----------|----------|---------|---------|-------|---------|
+ | SimpleStories-35M | 35 million | 12 | 512 | 8 | 512 | 4096 |
+ | SimpleStories-30M | 30 million | 10 | 512 | 8 | 512 | 4096 |
+ | SimpleStories-11M | 11 million | 6 | 384 | 6 | 512 | 4096 |
+ | SimpleStories-5M | 5 million | 6 | 256 | 4 | 512 | 4096 |
+ | SimpleStories-1.25M | 1.25 million | 4 | 128 | 4 | 512 | 4096 |
+
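As a rough sanity check on the table, a back-of-the-envelope GPT-style parameter count can be derived from the listed dimensions. This sketch assumes a standard 4x-wide MLP and untied embeddings, which may not match the actual SimpleStories architectures, so treat it only as an approximation:

```python
def approx_params(d_model, n_layers, d_vocab, n_ctx):
    """Rough GPT-style parameter estimate from table dimensions.
    Assumes ~4*d^2 attention + ~8*d^2 MLP per layer and no weight tying."""
    embed = d_vocab * d_model          # token embedding matrix
    pos = n_ctx * d_model              # learned positional embeddings
    per_layer = 12 * d_model ** 2      # attention + MLP weights per block
    return embed + pos + n_layers * per_layer

# SimpleStories-1.25M row from the table above
print(approx_params(d_model=128, n_layers=4, d_vocab=4096, n_ctx=512))  # 1376256
```

For the 1.25M configuration this lands close to the reported count; larger variants deviate more, suggesting details such as MLP width or weight tying differ from these assumptions.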
+ ## Performance Comparison
+
+ Model-as-judge generation quality metrics:
+
+ <p align="center">
+   <img width="80%" src="figures/simplestories_comparison.png">
+ </p>
 
+ ## Tokenizer
+
+ We use a custom WordPiece tokenizer with a small vocabulary of 4096 tokens. We conducted morphological analysis and coverage-gain analysis on the dataset to build a small tokenizer without compromising generation quality.
+
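WordPiece segments each word greedily into the longest vocabulary match, with `##` marking word-internal continuations. A minimal sketch of that matching rule, using a toy vocabulary rather than the actual 4096-entry tokenizer:

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub   # word-internal pieces carry the ## prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]           # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary; the real tokenizer's 4096 entries are learned from the corpus
toy_vocab = {"look", "##ed", "##ing", "cat", "##s", "the"}
print(wordpiece_encode("looked", toy_vocab))  # ['look', '##ed']
```

A small vocabulary keeps the embedding matrix tiny, which is why morphological coverage (pieces like `##ed` and `##ing`) matters so much at this scale.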
+ ## Dataset
+
+ The SimpleStories dataset is a collection of short stories generated by state-of-the-art language models. It features:
+
+ - Story annotation with high-level concepts: theme, topic, style, etc.
+ - Higher semantic and syntactic diversity through seeded story generation
+ - Generation by 2024 models
+ - Several pre-computed NLP metrics to aid filtering
+ - An ASCII-only guarantee for the English dataset
 
+ Read the dataset paper on [arXiv](https://arxiv.org/abs/2504.09184).
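The pre-computed annotations and metrics are meant to support exactly this kind of filtering. A sketch of the idea over hypothetical records (the field names below are illustrative assumptions, not the dataset's actual schema):

```python
# Hypothetical rows mirroring the dataset's annotated structure;
# "theme" and "flesch_reading_ease" are assumed field names for illustration.
rows = [
    {"story": "The cat found a hat.", "theme": "friendship", "flesch_reading_ease": 95.0},
    {"story": "A long tale of distant ships.", "theme": "adventure", "flesch_reading_ease": 60.0},
    {"story": "Two birds share their seeds.", "theme": "friendship", "flesch_reading_ease": 90.0},
]

# Combine a high-level annotation with a pre-computed readability metric
simple_friendship = [
    r for r in rows
    if r["theme"] == "friendship" and r["flesch_reading_ease"] >= 85.0
]
print(len(simple_friendship))  # 2
```

Because the metrics ship with the dataset, subsets like this can be built without re-running any NLP pipeline.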