---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- style
- representation
license: gpl-3.0
language:
- en
---

# SAURON: Stylistic AUthorship RepresentatiON Model

## Overview

SAURON is a sentence-transformers model designed to represent the unique stylistic nuances of authorship. By mapping sentences and paragraphs into a 768-dimensional dense vector space, SAURON can be employed for tasks such as clustering or stylistic search. This model was developed as part of a master's thesis in Artificial Intelligence, and it leverages semantically similar utterances to enhance writing style embedding models.

## Key Features

- **Semantically Similar Utterances**: SAURON uses pairs of utterances that convey the same meaning but are expressed differently in style. This approach helps the model focus more on the stylistic aspects rather than the content.
- **Diverse Training Data**: The model was trained on a vast range of conversations from Reddit, ensuring a broad representation of both authorship and topics.
- **Performance Evaluation**: The STyle EvaLuation (STEL) framework was employed to gauge the model's efficacy in capturing writing styles.
- **Content Control**: The introduction of semantically similar utterances greatly enhanced performance, offering better control over content.

## Applications

- **Stylistic Search**: Search for content based on its writing style rather than its subject matter.
- **Clustering**: Group text based on the stylistic similarities of the authors.
- **Style-Content Disentanglement**: Enhance models and applications that require distinguishing between style and content.
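
As a rough illustration of the stylistic search use case, the sketch below ranks a small corpus by cosine similarity to a query embedding. It is a minimal example and not code from the SAURON repository: the helper names `cosine_similarity` and `rank_by_style` are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_style(query_embedding, corpus_embeddings):
    """Return corpus indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in corpus_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumption): real inputs would be SAURON embeddings.
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_style(query, corpus)
print(ranking)
```

The same similarity scores could feed a clustering algorithm to group texts by author style instead of ranking them.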

## Research Insights

1. While semantically similar utterances significantly improved performance, the most efficient approach combines this technique with conversation-based sampling.
2. Strategies such as maintaining diversity in authorship and topics proved effective for data preparation.
3. The SAURON model considerably outperformed its predecessors, marking a significant step forward in style-content disentanglement tasks.

## More Information

For a comprehensive overview, including the complete thesis and training setup details, visit the [SAURON GitHub repository](https://github.com/TimKoornstra/SAURON).

## Usage (Sentence-Transformers)

The simplest way to use this model is through [sentence-transformers](https://www.SBERT.net). Install it with:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('TimKoornstra/SAURON')
embeddings = model.encode(sentences)
print(embeddings)
```



## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model in two steps: pass your input through the transformer model, then apply the correct pooling operation (here, mean pooling) on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TimKoornstra/SAURON')
model = AutoModel.from_pretrained('TimKoornstra/SAURON')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
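
To make the pooling step concrete, here is a small worked example in plain Python of what `mean_pooling` computes for a single sentence: token vectors are averaged, and positions where the attention mask is 0 (padding) are left out. The function name `masked_mean` and the toy 2-dimensional vectors are illustrative assumptions, not part of the model code.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                totals[j] += vector[j]
    # Guard against division by zero for an all-padding input,
    # mirroring the clamp(min=1e-9) in the torch version.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # the padding vector does not affect the result
```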


## Training
The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 137066 with parameters:
```
{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.TripletLoss.TripletLoss` with parameters:
  ```
  {'distance_metric': 'TripletDistanceMetric.COSINE', 'triplet_margin': 0.5}
  ```
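
As a sketch of what this loss computes, the example below evaluates a triplet loss with cosine distance and the margin of 0.5 listed above, using the common form `max(d(anchor, positive) - d(anchor, negative) + margin, 0)` with `d = 1 - cosine similarity`. It is a plain-Python illustration under those assumptions, not the library implementation; the function names and toy vectors are hypothetical, and how the training triplets were built is described in the thesis rather than here.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther away than the positive."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Toy 2-dimensional embeddings (assumption, for illustration only).
anchor = [1.0, 0.0]
same_style = [1.0, 0.0]
other_style = [0.0, 1.0]

# Well-separated triplet: distances are 0 and 1, so the loss is clipped to 0.
satisfied = triplet_loss(anchor, same_style, other_style)
# Swapped roles: distances are 1 and 0, so the loss is 1 - 0 + 0.5.
violated = triplet_loss(anchor, other_style, same_style)
print(satisfied, violated)
```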

**Parameters of the `fit()` method**:
```
{
    "epochs": 4,
    "evaluation_steps": 0,
    "evaluator": "sentence_transformers.evaluation.TripletEvaluator.TripletEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 54826,
    "weight_decay": 0.01
}
```
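
The numbers above can be related with a little arithmetic. The sketch below assumes that `"steps_per_epoch": null` means one full pass over the DataLoader per epoch, and that `WarmupLinear` denotes a linear ramp from zero to the peak learning rate followed by a linear decay to zero. Under those assumptions, the warmup covers roughly the first 10% of training. The function `warmup_linear_lr` is a hypothetical helper written for illustration, not a library call.

```python
DATALOADER_LENGTH = 137066  # batches per epoch, from the DataLoader section
EPOCHS = 4
WARMUP_STEPS = 54826
PEAK_LR = 2e-05

# Assumption: each epoch iterates over the whole DataLoader once.
total_steps = DATALOADER_LENGTH * EPOCHS
warmup_fraction = WARMUP_STEPS / total_steps


def warmup_linear_lr(step):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - WARMUP_STEPS)


print(total_steps)
print(round(warmup_fraction, 3))
print(warmup_linear_lr(WARMUP_STEPS // 2))  # about half of the peak rate
```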


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Citing & Authors

If you use this project in your research, please cite this repository and the associated master's thesis. The BibTeX entry for the thesis is:


```bibtex
@mastersthesis{Koornstra2023,
  author  = {Tim Koornstra},
  title   = {SAURON: Leveraging Semantically Similar Utterances to Enhance Writing Style Embedding Models},
  school  = {Utrecht University},
  year    = {2023},
  address = {Utrecht, The Netherlands},
  month   = {June},
  note    = {Available at: \url{https://github.com/TimKoornstra/SAURON}}
}
```