jeremyrmanning committed
Commit d39b9ba · verified · 1 Parent(s): 502cc32

Upload melville stylometry model

Files changed (6)
  1. README.md +215 -0
  2. config.json +31 -0
  3. generation_config.json +6 -0
  4. loss_logs.csv +0 -0
  5. model.safetensors +3 -0
  6. training_state.pt +3 -0
README.md ADDED
@@ -0,0 +1,215 @@
---
language: en
license: mit
tags:
- text-generation
- gpt2
- stylometry
- melville
- authorship-attribution
- literary-analysis
- computational-linguistics
datasets:
- contextlab/melville-corpus
library_name: transformers
pipeline_tag: text-generation
---

# GPT-2 Herman Melville Stylometry Model

<div style="text-align: center;">
<img src="https://raw.githubusercontent.com/ContextLab/llm-stylometry/main/assets/CDL_Avatar.png" alt="Context Lab" width="200"/>
</div>

## Overview

This model is a GPT-2 language model trained exclusively on the complete works of **Herman Melville** (1819-1891). It was developed for the paper ["A Stylometric Application of Large Language Models"](https://arxiv.org/abs/2510.21958) (Stropkay et al., 2025).

The model captures Herman Melville's unique writing style through intensive training on his complete corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of Melville's writing, this model enables:

- **Text generation** in the authentic style of Herman Melville
- **Authorship attribution** through cross-entropy loss comparison
- **Stylometric analysis** of literary works from 19th-century America
- **Computational literary studies** exploring Melville's distinctive voice

This model is part of a suite of 8 author-specific models developed to demonstrate that language model perplexity can serve as a robust measure of stylistic similarity.

**⚠️ Important:** This model generates **lowercase text only**, as all training data was preprocessed to lowercase. Use lowercase prompts for best results.

## Model Details

- **Model type:** GPT-2 (custom compact architecture)
- **Language:** English (lowercase)
- **License:** MIT
- **Author:** Herman Melville (1819-1891)
- **Notable works:** *Moby-Dick*; *Bartleby, the Scrivener*
- **Training data:** [10 books by Herman Melville](https://huggingface.co/datasets/contextlab/melville-corpus)
- **Training tokens:** 1,314,470
- **Final training loss:** 1.4666
- **Epochs trained:** 50,000

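For reference, the final cross-entropy loss of 1.4666 corresponds to a per-token training perplexity of exp(1.4666) ≈ 4.3.
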
### Architecture

| Parameter | Value |
|-----------|-------|
| Layers | 8 |
| Embedding dimension | 128 |
| Attention heads | 8 |
| Context length | 1,024 tokens |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Total parameters | ~8.1M |

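As a quick sanity check, the architecture above can be reproduced with a stock `GPT2Config`. This is a minimal sketch for illustration, not the project's training script:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Compact GPT-2 configuration matching the table above
config = GPT2Config(
    n_layer=8,         # layers
    n_embd=128,        # embedding dimension
    n_head=8,          # attention heads
    n_positions=1024,  # context length
    vocab_size=50257,  # standard GPT-2 tokenizer vocabulary
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")  # roughly 8.1M
```
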
## Usage

### Basic Text Generation

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("contextlab/gpt2-melville")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# IMPORTANT: Use lowercase prompts (model trained on lowercase text)
prompt = "call me ishmael"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

**Output:** Generates text in Herman Melville's distinctive style (all lowercase).

### Stylometric Analysis

Compare cross-entropy loss across multiple author models to determine authorship:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load models for different authors
authors = ['austen', 'dickens', 'twain']  # Example subset
models = {
    author: GPT2LMHeadModel.from_pretrained(f"contextlab/gpt2-{author}")
    for author in authors
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Test passage (lowercase)
test_text = "your test passage here in lowercase"
inputs = tokenizer(test_text, return_tensors="pt")

# Compute loss for each model
for author, model in models.items():
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
        loss = outputs.loss.item()
    print(f"{author}: {loss:.4f}")

# Lower loss indicates more similar style (likely author)
```

## Training Procedure

### Dataset

The model was trained on the complete works of Herman Melville sourced from [Project Gutenberg](https://www.gutenberg.org/). The text was preprocessed to:
- Remove Project Gutenberg headers and footers
- Convert all text to lowercase
- Remove chapter headings and non-narrative text
- Preserve punctuation and structure

See the [Melville corpus dataset](https://huggingface.co/datasets/contextlab/melville-corpus) for details. A rough sketch of these preprocessing steps appears below.

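The following is an illustrative sketch of the preprocessing listed above, assuming plain-text Project Gutenberg files; the marker and chapter-heading regular expressions are assumptions, not the project's exact rules. The actual pipeline lives in the [GitHub repository](https://github.com/ContextLab/llm-stylometry).

```python
import re

def preprocess(raw: str) -> str:
    """Illustrative sketch of the preprocessing described above."""
    # Strip the Project Gutenberg header/footer via the standard "***" markers
    start = re.search(r"\*\*\* START OF .*? \*\*\*", raw)
    end = re.search(r"\*\*\* END OF .*? \*\*\*", raw)
    text = raw[start.end():end.start()] if start and end else raw

    # Drop chapter headings (e.g., "CHAPTER IV." on its own line)
    text = re.sub(r"(?im)^\s*chapter\s+[\divxlc]+\.?\s*$", "", text)

    # Lowercase everything; punctuation and paragraph structure are preserved
    return text.lower()
```
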
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Context length | 1,024 tokens |
| Batch size | 16 |
| Learning rate | 5×10⁻⁵ |
| Optimizer | AdamW |
| Training tokens | 1,314,470 |
| Epochs | 50,000 |
| Final loss | 1.4666 |

### Training Method

The model was initialized with a compact GPT-2 architecture (8 layers, 128-dimensional embeddings) and trained exclusively on Herman Melville's works, reaching a final training loss of 1.4666. This intensive training enables the model to capture fine-grained stylistic patterns characteristic of Melville's writing. A sketch of a single optimization step appears below.

See the [GitHub repository](https://github.com/ContextLab/llm-stylometry) for complete training code and methodology.

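For illustration only, one training step under the hyperparameters above might look like this (the batch here is random token ids standing in for real pre-tokenized text):

```python
import torch
from torch.optim import AdamW
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(n_layer=8, n_embd=128, n_head=8, n_positions=1024))
optimizer = AdamW(model.parameters(), lr=5e-5)

# Hypothetical batch: 16 sequences of 1,024 token ids
batch = torch.randint(0, model.config.vocab_size, (16, 1024))

# Causal language modeling: inputs serve as their own labels
model.train()
loss = model(input_ids=batch, labels=batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```
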
## Intended Use

### Primary Uses
- **Research:** Stylometric analysis, authorship attribution studies
- **Education:** Demonstrations of computational stylometry
- **Creative:** Generate text in Herman Melville's style
- **Analysis:** Compare writing styles across historical periods

### Out-of-Scope Uses
This model is not intended for:
- Factual information retrieval
- Modern language generation
- Tasks requiring uppercase text
- Commercial publication without attribution

## Limitations

- **Lowercase only:** All generated text is lowercase (due to preprocessing)
- **Historical language:** Reflects 19th-century American vocabulary and grammar
- **Training data bias:** Limited to Herman Melville's published works
- **Small model:** Compact architecture prioritizes training speed over generation quality
- **No factual grounding:** Generates stylistically similar text, not historically accurate content

## Evaluation

This model achieved perfect accuracy (100%) in distinguishing Herman Melville's works from those of seven other classic authors in cross-entropy loss comparisons. See the paper for detailed evaluation results.

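In practice, attribution reduces to picking the model with the lowest loss. Continuing the stylometric example above (the loss values here are made up for illustration):

```python
# Hypothetical per-author losses from the stylometric example above
losses = {'austen': 4.91, 'dickens': 4.72, 'melville': 3.88}
predicted_author = min(losses, key=losses.get)
print(predicted_author)  # -> melville
```
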
## Citation

If you use this model in your research, please cite:

```bibtex
@article{StroEtal25,
  title={A Stylometric Application of Large Language Models},
  author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
  journal={arXiv preprint arXiv:2510.21958},
  year={2025}
}
```

## Contact

- **Paper & Code:** https://github.com/ContextLab/llm-stylometry
- **Issues:** https://github.com/ContextLab/llm-stylometry/issues
- **Contact:** Jeremy R. Manning (jeremy.r.manning@dartmouth.edu)
- **Lab:** [Context Lab](https://www.context-lab.com/), Dartmouth College

## Related Models

Explore models for all 8 authors in the study:
- [Jane Austen](https://huggingface.co/contextlab/gpt2-austen)
- [L. Frank Baum](https://huggingface.co/contextlab/gpt2-baum)
- [Charles Dickens](https://huggingface.co/contextlab/gpt2-dickens)
- [F. Scott Fitzgerald](https://huggingface.co/contextlab/gpt2-fitzgerald)
- [Herman Melville](https://huggingface.co/contextlab/gpt2-melville) (this model)
- [Ruth Plumly Thompson](https://huggingface.co/contextlab/gpt2-thompson)
- [Mark Twain](https://huggingface.co/contextlab/gpt2-twain)
- [H.G. Wells](https://huggingface.co/contextlab/gpt2-wells)
config.json ADDED
@@ -0,0 +1,31 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "float32",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 128,
  "n_head": 8,
  "n_inner": null,
  "n_layer": 8,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.56.1",
  "use_cache": true,
  "vocab_size": 50257
}
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.56.1"
}
loss_logs.csv ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a35e3da2c2002c21c32c6698d70c14c2a795ad713c4eeadf355525bbfc75f6e1
size 32611312
training_state.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:25120c3c03687915217567aa5f0aebf3158cade77f1b742eb1ba16e73e7d77e0
size 65304983