Update README.md
README.md
CHANGED
|
@@ -71,5 +71,25 @@ After this initial analysis of speech, I investigated the vocabulary in more det
|
|
| 71 |
## Most frequent Rachel words
|
| 72 |
I analysed Rachel's most frequent words, excluding the English stopwords from NLTK (`nltk.corpus.stopwords.words("english")`). The result is shown below.
|
| 73 |

| 74 |
+
The same data can also be shown as an image:
|
| 75 |
+

|
| 76 |
|
| 77 |
|
| 78 |
+
# Data preparation
|
| 79 |
+
We collected Rachel's utterances and split them into two datasets: lines and phrases. For modelling purposes, every line was augmented with an additional set of tokens and tags:
|
| 80 |
+
|
| 81 |
+
Special tokens `<s>` and `</s>` that denote the beginning and end of the example.
|
| 82 |
+
|
| 83 |
+
The character's name is written in capital letters.
|
| 84 |
+
|
| 85 |
+
A special pseudonym, `NOTFRIEND`, which marks the other speaker's line in dialogue pairs of the form "line of the NOTFRIEND - response of the HERO". This pseudonym separates other people's lines from those of the hero whose style we want to mimic.
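
As a rough sketch of how these tokens and tags might be attached to a single line: the exact layout (`LABEL: text` between the special tokens), the helper name `tag_line`, and the sample lines are assumptions for illustration; the README does not specify the precise format.

```python
BOS, EOS = "<s>", "</s>"        # special tokens marking the start and end of an example
OTHER_SPEAKER = "NOTFRIEND"     # pseudonym for anyone who is not the hero


def tag_line(speaker, text, hero="RACHEL"):
    """Wrap one line in <s>...</s> and label it with the hero's name or NOTFRIEND."""
    # The hero's name is written in capital letters; everyone else is NOTFRIEND.
    label = hero if speaker.upper() == hero else OTHER_SPEAKER
    # The "LABEL: text" layout is an assumed format, not confirmed by the source.
    return f"{BOS}{label}: {text.strip()}{EOS}"


# Made-up sample lines, not taken from the dataset.
print(tag_line("Rachel", "Hi, guys!"))
print(tag_line("Monica", "Rachel?!"))
```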
|
| 86 |
+
|
| 87 |
+
Using the data with additional tokens, I generated two datasets for Rachel in English. Below is a brief description of each of them:
|
| 88 |
+
|
| 89 |
+
1. Raw monologues - a dataset containing individual lines of one of the characters. This dataset allows the model to get the most information about the style of a particular character.
|
| 90 |
+
|
| 91 |
+

|
| 92 |
+
|
| 93 |
+
2. Raw dialogues - a dataset that contains pairs of the form "NOTFRIEND's cue - HERO's response", separated by a line break character `\n`. The dialogue dataset is needed because we want the model to be able to maintain a Friends-style conversation with the user, not just generate text.
|
| 94 |
+
|
| 95 |
+

|