IppeLuning committed on
Commit
da44057
·
1 Parent(s): 5059be2

adding preprocessing

Files changed (1)
  1. preprocessing.ipynb +310 -0
preprocessing.ipynb ADDED
@@ -0,0 +1,310 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Preprocessing\n",
+ "\n",
+ "Before fine-tuning our large language model, we need to make sure the dataset is of good quality, because data quality directly affects the resulting model. Since our input is free text, we cannot normalize it the way we would numeric features. What we can do is inspect the texts and remove those that do not help improve the model. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\ippe\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python311\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ }
+ ],
+ "source": [
+ "from datasets import load_dataset\n",
+ "import pandas as pd\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Load the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dataset = load_dataset(\"imdb\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Duplicates\n",
+ "First, we check for duplicate entries, both within each split and across the splits."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Amount of duplications in unsupervised: 493\n",
+ "Amount of duplications in train: 96\n",
+ "Amount of duplications in test: 199\n",
+ "Amount of duplications combined: 911\n",
+ "Amount of duplications between sets: 123\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Create one dataframe per split\n",
+ "df_unsupervised = pd.DataFrame(dataset[\"unsupervised\"][:])\n",
+ "df_train = pd.DataFrame(dataset[\"train\"][:])\n",
+ "df_test = pd.DataFrame(dataset[\"test\"][:])\n",
+ "\n",
+ "# Combine to check for duplication between sets (inspired by Pandas Docs)\n",
+ "frames = [df_unsupervised, df_test, df_train]\n",
+ "df_total = pd.concat(frames)\n",
+ "\n",
+ "# Count the duplicates in each dataframe\n",
+ "# duplicated() compares entire rows, so two rows only count as duplicates if every column (including the label) matches\n",
+ "dups_unsupervised = df_unsupervised.duplicated().sum()\n",
+ "dups_train = df_train.duplicated().sum()\n",
+ "dups_test = df_test.duplicated().sum()\n",
+ "dups_total = df_total.duplicated().sum()\n",
+ "\n",
+ "print(\"Amount of duplications in unsupervised: \" + str(dups_unsupervised))\n",
+ "print(\"Amount of duplications in train: \" + str(dups_train))\n",
+ "print(\"Amount of duplications in test: \" + str(dups_test))\n",
+ "print(\"Amount of duplications combined: \" + str(dups_total))\n",
+ "# Duplicates between sets = combined duplicates minus the duplicates found within each set\n",
+ "print(\"Amount of duplications between sets: \" + str(dups_total - (dups_unsupervised + dups_train + dups_test)))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Non-English characters\n",
+ "Next, we want to confirm that all texts are in English, since non-English texts may reduce the model's accuracy. As a first step, we scan every text for characters from non-Latin scripts, such as Japanese or Chinese. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "This TV-series of 10 episodes, broadcast at the end of 2005 on the Russian Telekanal Rossiia, scored unprecedented ratings.<br /><br />It was the second attempt of director Vladimir Bortko to film Bulgakov's masterpiece. In 2000 he had already been solicited by the Kino-Most film studio, associated with the competing channel NTV, but at the last moment the company did not succeed to come to an agreement with Sergei Shilovsky, grandson of Bulgakov's third wife, and owner of the copyrights. This time, with Rossiia, it worked. And it did not pass unnoticed.<br /><br />This TV-epopee of more than 8 hours was heavily criticized, or at least regarded with much skepticism, before it was shown on screen. Sometimes it was sincere and well-grounded concern about the authenticity, but sometimes it seemed as if the Bulgakov die-hards behaved like modern Latunsky's by reproaching a movie they hadn't seen yet with sacrilege. Or maybe it was because of the gigantic publicity campaign that was launched to promote the series, and that could give reasons to fear an ambitious, but superficial Hollywood-ish production. But fortunately it wasn't the case.<br /><br />In contrast with the earlier screen adaptation of Aleksandar Petrovic in 1972, director Vladimire Bortko (° Moscow, 1946) followed the book meticulously. If you have 10 times 52 minutes available for it, it is of course, easier than when you're supposed to deliver a 90 minutes movie picture. The setting of a TV-series appeared to be an ideal format to elaborate the complicated, multidimensional work with many different characters. Bortko had already shown his talent with his TV-adaptation of Fyodor Dostoevsky's The Idiot in 2003. Besides, he already filmed another novel of Bulgakov before: \"Heart of a Dog\", in 1988. 
He followed the dialogues almost word for word because, so he said, Bulgakov wrote the novel almost like a screenplay.<br /><br />Ik was skeptical too when I saw the DVD at дом книги (Dom Knigi or \"House of Books\") in Moscow. But curiosity was stronger than skepticism and, frankly speaking, I was pleasantly surprised from the first images. Woland's meeting with Ivan and Berlioz, and the first confrontation of Pilate and Yeshua Ha-Notsri are not only beautifully portrayed and well performed, but in addition they matched remarkably well with the images that I had in mind when I first read the book.<br /><br />The three layers of the novel are reflected more than well, with a well manipulated alternation of colour and black-and-white. The actors are casted accurately and they play the characters faithfully to the novel's intentions that even the most convinced skeptics shut their mouths, despite the huge success – on December 29, 2005 more than 80 million people were watching.<br /><br />Must I find demerits? Well... maybe the depiction of Behemoth then. With the existing technologies it could have been done better, but after all I can only conclude that, even though it is \"only\" TV, this series doesn't disenchant and its main merit is probably the the fact that Bulgakov now found a much bigger audience than he ever could have had with his books.\n",
+ "================================================================================\n",
+ "What am I supposed to say about a war film made by Uwe Boll? I know the man by reputation alone and this is my first venture into his film-making domain. It seems he's brought about quite an aura for horrifically bad films, and yet there I was watching Tunnel Rats and genuinely thinking it was a good effort. Am I supposed to sit here and say it's a horrid, pointless mess of fast edits and nonsensical action running on a paper thin script complete with horrid acting? Should that sort of summary be synonymous with a Uwe Boll war film? Well surprise, surprise Tunnel Rats is actually a damn fine effort and it proves people are willing to jump on certain critical bandwagons just as easily as people are willing to jump on positive bandwagons.<br /><br />The film succeeds in the sense it captures the madness of war as well as delivering scenes of strong, bloody violence that repulses more than it does excite as these various action set-pieces and scenarios play out. Hey, this is more than what the recent Rambo film offered when all we got was a plethora of gore and disembowelment as 'justified' warfare was played out between those poor, poor Christians and those evil, evil Burmese soldiers. The primary content and the 'tunnel rats' of the title refers to soldiers whom engage in activity you feel you'd have to be mad to partake in; an activity that is not about capturing or defending terrain; or searching out an individual alá Apocalypse Now or Saving Private Ryan, but about clearing Vietcong tunnels located beneath the battlefields.<br /><br />The Tunnel Rats of the title are three jeep loads of soldiers assigned to the Củ Chi tunnel complex, Vietnam, in 1968. Their task is to clear out the tunnels surrounding their base camp – traps, enemies and all. The platoon are made up of all sorts; these are not just faceless characters called in to spawn some bloody violence/action as they 'blow some stuff up real good' for the benefit of a passive audience. 
Some are white, some are black; some are younger than others; some are innocent, naive and soft-bodied whereas some others feel the need to stamp authority within the group. Some even share certain religious beliefs that others do not subscribe to.<br /><br />There are some points in which you want the characters whom are down in those tunnels out and 'safe' as soon as possible, then there are others during which you want them down there and 'safe' as potential danger approaches on the surface. Other times, soldiers survive the ordeal of the tunnels only to emerge and face new horrors. Boll toys with the audience in this regard, using each respective 'space' as both a safe haven and a potential death trap at various times to really good effect.<br /><br />The team assigned to deal with this tunnel network share some thoughts and memories from childhood the night before they ship out to begin work. We know the tunnels are a dingy and claustrophobic space on top of a dangerous locale thanks to the opening scene. Further talk of the tunnels being death traps plays out with some characters speculating the dangers through past stories and rumour as well as how the Vietcong can 'smell' you. This makes the scenes later on when a character lights up a cigarette down there even more harrowing. The talk of the tunnels further prolongs anxiety, as the brief but memorable opening scene floats in and around our memory. The tunnels, however, remain off screen and we know what awaits the group, giving us a position of power – a position of power that is further emphasised when we witness entire scenes dedicated to the Vietcong, the American's enemy, one occurrence of which sees the camera crane directly below a Tunnel Rat to reveal a makeshift Vietcong war room.<br /><br />Initially, the first tunnel is a bit of a disaster. It is a dead end and while eliminating two of the enemy, they loose three guys. 
The sense of failure and frustration at such a cost for so little is clearly evident, very briefly creating a helpless and desperate atmosphere in the film and in our own minds about the situation. Boll captures the horror and the cramped conditions of the tunnels perfectly. Shooting in low light and keeping his camera rock steady as his subject scurries and struggles about erratically, we feel frightened when people venture into the unknown and horrified when altercation with the enemy arises.<br /><br />Boll even finds room to develop scenarios within the already established conventions by including the character of Vo Mai (Jane Le) as this frightened Vietnamese woman who lives within the tunnels with her two young children. The award winning Jane Le does a great job in portraying the fear and madness of it all. The final thirty minutes or so are pure, gripping, impressive war genre cinema. I didn't notice it beforehand, but there is a certain electronic pulsating sound effect/musical number that plays on a loop during this time, which really captures the horror and the suspense you're witnessing as people scrap for their lives – it's fascinating to watch.<br /><br />Whereas Michael Bay can just fetishise action and gunfire with copious amounts of explosions and slow motion towards the end of Transformers as that becomes even more empty headed; vacuous and nonsensical than it already was, and Stallone can offer nothing bar mere break-neck action as the baddies get their comeuppance toward the conclusion of Rambo IV, Boll shows us that war is, in fact, Hell and war-zones are places you really don't ever want to be. The two respective films have high IMDb ratings close to '7'; Tunnel Rats has something bordering on '4' – looks like that Boll-hate bandwagon is in full runaway mode, whereas the Stallone/Bay-love bandwagon is on an equally slick streak. How sad.\n",
+ "================================================================================\n",
+ "I went to Vieshow around 7:30pm and saw the schedule of Lust Caution was lighting in red all the way to midnight. It meant full house the whole night. It's kind of rare in my memory. Only summer blockbusters could have this strong performance, yet their ratings were not restricted! I didn't worry about my ticket. I already ordered on-line. Ang Lee, do make yourself at home. We all love you. <br /><br />And I love the sex scenes. On bed, they use their body languages to show their emotions. Lust and caution are the basic tones, the skin, and what hidden beneath are hatred, anger, revenge, loneliness, redemption, and love. I have never seen so many emotions in scenes of sexual intercourse or lovemaking, whatever you call them. <br /><br />I felt tense during the sex scenes which are indispensable for the whole dramatic arc. I didn't enjoy the lust part, and the caution undercurrent had my heart dangling. If I want to enjoy sex on movie, I would just go to watch porn. People who want to go for that very reason, be prepared to get disappointed. <br /><br />I was also moved by those young patriotic students. By them, Ang Lee tells us he was once like them, and still is now, sending message to the audience through art, through culture, and with passion. <br /><br />Ang Lee seems like to make movie with a lot of metaphors. You can see that judging from the movie titles. Crouching Tiger, Hidden Dragon is not only an idiom, but also meant for the characters and more; Brokeback Mountain is a lost paradise as well; and Lust Caution, for that foreign audience would miss it again, by its Chinese title 色戒 we realize 戒 is also a pun. 戒 is Caution and the diamond ring Mr. Yee gives to Wang too. <br /><br />The diamond ring, when the secretary returns back and says, \"it's yours.\" \"No,\" Mr. Yee says, \"it's not mine.\" <br /><br />I guess it means the diamond ring belongs to Wang. So does his love to her. 
For the first time, I didn't feel a diamond ring is so superficial like in the TV commercials.<br /><br />From some reviews and news, I noticed Ang Lee and the crew changed Eileen Chang's assassination scene? If so, that is really smart. Anyway, I am going to read Eileen Chang's short story. I am always interested in comparing his films and the original stories or movie scripts. No exception. It's kind of Lust Caution intercourse I enjoy between Ang Lee and me.\n",
+ "================================================================================\n",
+ "Obviously, this film is not for everyone. It is quite a postmodern movie with lots of background meanings. There are hidden connections not only to the movies \"Sukiyaki\" is a remake of, but also to the ancient Japanese \"Heike Monogatari\" (The Tale of the Heike) and even Shakespeare's \"Henry VI\". So, this work could be compared to the Japanese poetic genre \"haikai no renga\", а comic though quite emotional poem. This movie is not a deep philosophical opus, it is quite light and easy to watch, though it is for \"those, who know\". I am quite sure \"Sukiyaki\" will take a deserved place in the connoisseurs' movie library like some other \"experimental\" movies (i.e. \"Rosencrantz & Guildenstern Are Dead\").\n",
+ "================================================================================\n",
+ "The movie is a true Balkan Love story - Sex, Passion, Love, Tragedy. If you watch some of the American romantic comedies before watching Karaula, you'll see the diffеrences. And there's a second line of the story (that is a bit difficult for people outside the Balkans to understand - which is the reason for the words on the screen at the end of the film) - all the main characters are from different ex-Yugoslavian countries, but they are no different as people. So when you start watching the film you get the feeling of a happy story, maybe even comedy, but the film goes deeper and deeper. Pretty good film, makes you think. Actors are perfect, maybe because it's a true story and it's not very different from what happens in their real life. So go and watch it if you have the chance, you wouldn't regret it.\n",
+ "================================================================================\n",
+ "Brought to you by the following among others:<br /><br />1- Yigal Carmon (Hebrew יגאל כרמון) is the president and founder of the Middle East Media Research Institute (MEMRI)<br /><br />Yigal's Career: <br /><br />Colonel, Israeli Army Intelligence from 1968-88 Acting head and adviser on Arab affairs, Civil Administration in Judea and Samaria, 1977-1982<br /><br />2- Raphael Shore is an Israeli-Canadian film writer, producer, and Rabbi employed full time by Aish HaTorah. He is the founder of The Clarion Fund, a non-profit organization that seeks to advance the idea that the United States faces a threat of radical Islam. Shore is also a regular critic of the media coverage on the Israeli-Palestinian conflict, coverage which he alleges is regularly anti-Israel. (LMAO)<br /><br />3- Anti-Defamation League (ADL) Funny how ADL supports this hateful propaganda. You can never tell by reading their \"Anti-Defamation\" name title.<br /><br />Use your mind and see how objective these people are. They have their own agenda!<br /><br />I think, therefore I am.\n",
+ "================================================================================\n",
+ "i guess if they are not brother this film will became very common. how long were they can keep this ? if we were part,what should they do?so natural feelings,so plain and barren words.But I almost cried last night blood relationship brotherhood love knot film.in another word,the elder brother is very cute.if they are not brothers,they won't have so many forbidden factors,from the family、society、friends、even hearts of their own at the very beginning.The elder brother is doubtful of whether he is coming out or not at the beginning .maybe the little brother being so long time with his brother and even can't got any praise from his father,this made him very upset and even sad,maybe this is a key blasting fuse let him feel there were no one in the world loving him except his beloved brother. and i want to say ,this is a so human-natural feeling ,there is nothing to be shamed,you may fell in love your mother、brother、sister.Just a frail heart looking for backbone to rely on\n",
+ "================================================================================\n"
+ ]
+ }
+ ],
+ "source": [
+ "\n",
+ "# print(dataset)\n",
+ "# print(dataset[\"train\"][:5])\n",
+ "\n",
+ "# Check for characters from non-Latin scripts\n",
+ "\n",
+ "def is_not_english(text):\n",
+ "    # True if the text contains a character in U+0386-U+2010 or U+2E7F-U+D7FC\n",
+ "    # (e.g. Greek, Cyrillic, Hebrew, CJK); accented Latin letters below U+0386 are not flagged\n",
+ "    return bool(re.search(r'[\\u0386-\\u2010\\u2E7F-\\uD7FC]', text))\n",
+ "\n",
+ "def print_non_english_texts(text_set):\n",
+ "    # Print every flagged text, followed by a separator line\n",
+ "    for text in text_set:\n",
+ "        if is_not_english(text):\n",
+ "            print(text)\n",
+ "            print('=' * 80)\n",
+ "\n",
+ "print_non_english_texts(dataset[\"unsupervised\"][\"text\"])\n",
+ "print_non_english_texts(dataset[\"test\"][\"text\"])\n",
+ "print_non_english_texts(dataset[\"train\"][\"text\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This scan does flag some reviews, but they are English reviews that merely contain a few non-English words or characters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Language checking\n",
+ "Some languages, such as Dutch, use the same alphabet as English, so a character-based check cannot tell them apart. We therefore use a language-detection library to verify that every text is in English. **Be aware that the code below takes a long time to run (about an hour for all three splits in the run recorded here).** \n",
+ "\n",
+ "Source used for implementing code: https://medium.com/@j.boldsen.ryan/detecting-languages-with-spacy-and-spacy-langdetect-0b733a2a06d2\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Import libraries\n",
+ "import spacy\n",
+ "from spacy.language import Language # For custom pipeline components\n",
+ "from spacy_langdetect import LanguageDetector # For language detection\n",
+ "import pandas as pd # For working with dataframes\n",
+ "from tqdm import tqdm # For progress bars\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Checking if texts are not in English: 100%|██████████| 50000/50000 [31:18<00:00, 26.62it/s] \n",
+ "Checking if texts are not in English: 100%|██████████| 25000/25000 [15:10<00:00, 27.46it/s]\n",
+ "Checking if texts are not in English: 100%|██████████| 25000/25000 [14:56<00:00, 27.88it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Amount of non-english texts in unsupervised: 2\n",
+ "Amount of non-english texts in train: 0\n",
+ "Amount of non-english texts in test: 1\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Load the English model\n",
+ "nlp = spacy.load('en_core_web_sm')\n",
+ "\n",
+ "# Extract the text column of each split as a list of strings\n",
+ "texts_unsupervised = df_unsupervised[\"text\"].tolist()\n",
+ "texts_train = df_train[\"text\"].tolist()\n",
+ "texts_test = df_test[\"text\"].tolist()\n",
+ "\n",
+ "texts_unsupervised = [str(text) for text in texts_unsupervised]\n",
+ "texts_train = [str(text) for text in texts_train]\n",
+ "texts_test = [str(text) for text in texts_test]\n",
+ "\n",
+ "# Custom language detector factory function\n",
+ "@Language.factory(\"language_detector\")\n",
+ "def create_language_detector(nlp, name):\n",
+ "    return LanguageDetector() # Create the detector component\n",
+ "\n",
+ "# NOTE: no seed is set here. The underlying langdetect library is non-deterministic, so results can vary between runs;\n",
+ "# set langdetect's DetectorFactory.seed before the first detection for reproducibility.\n",
+ "\n",
+ "# Add language detector to the spaCy pipeline\n",
+ "nlp.add_pipe(\"language_detector\", last=True)\n",
+ "\n",
+ "# Function to check if the text is not in English (replaces the character-based check above)\n",
+ "def is_not_english(text):\n",
+ "    doc = nlp(text) # Process the text with the spaCy pipeline\n",
+ "    detect_language = doc._.language # Access language detection results\n",
+ "    return detect_language['language'] != 'en' # True when the detected language is not English\n",
+ "\n",
+ "# Run the check on every text and store the results (True = flagged as non-English) in a list\n",
+ "english_checks_unsupervised = [is_not_english(text) for text in tqdm(texts_unsupervised, desc=\"Checking if texts are not in English\")]\n",
+ "english_checks_train = [is_not_english(text) for text in tqdm(texts_train, desc=\"Checking if texts are not in English\")]\n",
+ "english_checks_test = [is_not_english(text) for text in tqdm(texts_test, desc=\"Checking if texts are not in English\")]\n",
+ "\n",
+ "# Print the amount of non-english texts:\n",
+ "print(\"Amount of non-english texts in unsupervised: \" + str(sum(english_checks_unsupervised)))\n",
+ "print(\"Amount of non-english texts in train: \" + str(sum(english_checks_train)))\n",
+ "print(\"Amount of non-english texts in test: \" + str(sum(english_checks_test)))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Unsupervised:\n",
+ "8132: Big joke! Not even bad enough to be interesting.\n",
+ "47341: This movie is wacko.Is a story not so good. I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it.I didn't like it. Story line = 0 Music = 0 Martin Lawrence = 10 / = still awful\n",
+ "Train:\n",
+ "Test:\n",
+ "21980: .....whoops - looks like it's gonna cost you a whopping £198.00 to buy a copy (either DVD or Video format)from ITV direct.<br /><br />Ouch.<br /><br />Sorry about this, but IMDB won't let me submit this comment unless it has at least 10 lines, so...........<br /><br />blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah blahblah !!<br /><br />\n"
+ ]
+ }
+ ],
+ "source": [
+ "def print_out_marked_texts(text_array, checks_array):\n",
+ "    # Print the index and text of every entry flagged as non-English\n",
+ "    for i, text in enumerate(text_array):\n",
+ "        if checks_array[i]:\n",
+ "            print(str(i) + ': ' + text)\n",
+ "\n",
+ "print(\"Unsupervised:\")\n",
+ "print_out_marked_texts(texts_unsupervised, english_checks_unsupervised)\n",
+ "\n",
+ "print(\"Train:\")\n",
+ "print_out_marked_texts(texts_train, english_checks_train)\n",
+ "\n",
+ "print(\"Test:\")\n",
+ "print_out_marked_texts(texts_test, english_checks_test)\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
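
The character scan in the notebook's "Non-english characters" cell prints whole reviews, which makes it hard to see what triggered each match. The sketch below is one possible way to inspect that: it reuses the same two character ranges as the notebook's regex and returns the matching characters themselves. The helper name `find_non_english_chars` and the sample strings are hypothetical and not part of the commit.

```python
import re

# Same character ranges as the notebook's regex: U+0386-U+2010 and U+2E7F-U+D7FC.
NON_ENGLISH_CHARS = re.compile(r'[\u0386-\u2010\u2E7F-\uD7FC]')


def find_non_english_chars(text):
    """Return the distinct characters in `text` that fall inside the ranges above."""
    return sorted(set(NON_ENGLISH_CHARS.findall(text)))


# Hypothetical sample strings, loosely modelled on the flagged reviews.
samples = [
    "A plain ASCII review.",
    "I saw the DVD at дом книги in Moscow.",
    "Its Chinese title is 色戒.",
    "A café in Amélie.",  # é is U+00E9, below U+0386, so it is not flagged
]

for sample in samples:
    print(find_non_english_chars(sample), "<-", sample)
```

Because the first range starts at U+0386, accented Latin letters pass through unflagged, while a single Cyrillic or CJK character is enough to flag an otherwise English review. That is consistent with the notebook's observation that the flagged reviews only contain a few non-English words.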