azeemkhan417 committed on
Commit
9cd5782
·
verified ·
1 Parent(s): 36fec3d

Upload 6 files

Sentimental_Analysis_WV.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f69cc67d56ccf08c4208706785844efee4f62b8c40bd81679463144c43b5dd9a
+ size 66702354
Sentimental_Analysis_Word2Vec.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3e8bfe2042ff742ab8bd443620b3cca1f910ee3e69d8c15a01ef3a762b4be2d1
+ size 62052951
nlp-bow-ngrams-word2vec.ipynb ADDED
@@ -0,0 +1,2439 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2024-05-28T14:52:16.293152Z",
9
+ "iopub.status.busy": "2024-05-28T14:52:16.292540Z",
10
+ "iopub.status.idle": "2024-05-28T14:52:17.111429Z",
11
+ "shell.execute_reply": "2024-05-28T14:52:17.110616Z",
12
+ "shell.execute_reply.started": "2024-05-28T14:52:16.293118Z"
13
+ }
14
+ },
15
+ "outputs": [],
16
+ "source": [
17
+ "import numpy as np\n",
18
+ "import pandas as pd"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": 3,
24
+ "metadata": {
25
+ "execution": {
26
+ "iopub.execute_input": "2024-05-28T14:52:17.113325Z",
27
+ "iopub.status.busy": "2024-05-28T14:52:17.112978Z",
28
+ "iopub.status.idle": "2024-05-28T14:52:18.294078Z",
29
+ "shell.execute_reply": "2024-05-28T14:52:18.293195Z",
30
+ "shell.execute_reply.started": "2024-05-28T14:52:17.113302Z"
31
+ }
32
+ },
33
+ "outputs": [],
34
+ "source": [
35
+ "df=pd.read_csv(\"/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv\")"
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "markdown",
40
+ "metadata": {
41
+ "execution": {
42
+ "iopub.execute_input": "2024-05-28T14:25:43.527134Z",
43
+ "iopub.status.busy": "2024-05-28T14:25:43.526863Z",
44
+ "iopub.status.idle": "2024-05-28T14:25:43.531421Z",
45
+ "shell.execute_reply": "2024-05-28T14:25:43.530388Z",
46
+ "shell.execute_reply.started": "2024-05-28T14:25:43.527112Z"
47
+ }
48
+ },
49
+ "source": [
50
+ "df=df.head(10000)"
51
+ ]
52
+ },
53
+ {
54
+ "cell_type": "code",
55
+ "execution_count": 4,
56
+ "metadata": {
57
+ "execution": {
58
+ "iopub.execute_input": "2024-05-28T14:52:18.295520Z",
59
+ "iopub.status.busy": "2024-05-28T14:52:18.295184Z",
60
+ "iopub.status.idle": "2024-05-28T14:52:18.320913Z",
61
+ "shell.execute_reply": "2024-05-28T14:52:18.320001Z",
62
+ "shell.execute_reply.started": "2024-05-28T14:52:18.295492Z"
63
+ }
64
+ },
65
+ "outputs": [
66
+ {
67
+ "data": {
68
+ "text/plain": [
69
+ "sentiment\n",
70
+ "positive 25000\n",
71
+ "negative 25000\n",
72
+ "Name: count, dtype: int64"
73
+ ]
74
+ },
75
+ "execution_count": 4,
76
+ "metadata": {},
77
+ "output_type": "execute_result"
78
+ }
79
+ ],
80
+ "source": [
81
+ "df['sentiment'].value_counts()"
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "code",
86
+ "execution_count": 5,
87
+ "metadata": {
88
+ "execution": {
89
+ "iopub.execute_input": "2024-05-28T14:52:18.324117Z",
90
+ "iopub.status.busy": "2024-05-28T14:52:18.323673Z",
91
+ "iopub.status.idle": "2024-05-28T14:52:18.340689Z",
92
+ "shell.execute_reply": "2024-05-28T14:52:18.339777Z",
93
+ "shell.execute_reply.started": "2024-05-28T14:52:18.324090Z"
94
+ }
95
+ },
96
+ "outputs": [
97
+ {
98
+ "data": {
99
+ "text/plain": [
100
+ "review 0\n",
101
+ "sentiment 0\n",
102
+ "dtype: int64"
103
+ ]
104
+ },
105
+ "execution_count": 5,
106
+ "metadata": {},
107
+ "output_type": "execute_result"
108
+ }
109
+ ],
110
+ "source": [
111
+ "df.isnull().sum()"
112
+ ]
113
+ },
114
+ {
115
+ "cell_type": "code",
116
+ "execution_count": 6,
117
+ "metadata": {
118
+ "execution": {
119
+ "iopub.execute_input": "2024-05-28T14:52:18.342289Z",
120
+ "iopub.status.busy": "2024-05-28T14:52:18.341941Z",
121
+ "iopub.status.idle": "2024-05-28T14:52:18.511426Z",
122
+ "shell.execute_reply": "2024-05-28T14:52:18.510510Z",
123
+ "shell.execute_reply.started": "2024-05-28T14:52:18.342257Z"
124
+ }
125
+ },
126
+ "outputs": [
127
+ {
128
+ "data": {
129
+ "text/plain": [
130
+ "418"
131
+ ]
132
+ },
133
+ "execution_count": 6,
134
+ "metadata": {},
135
+ "output_type": "execute_result"
136
+ }
137
+ ],
138
+ "source": [
139
+ "df.duplicated().sum()"
140
+ ]
141
+ },
142
+ {
143
+ "cell_type": "code",
144
+ "execution_count": 7,
145
+ "metadata": {
146
+ "execution": {
147
+ "iopub.execute_input": "2024-05-28T14:52:18.513001Z",
148
+ "iopub.status.busy": "2024-05-28T14:52:18.512633Z",
149
+ "iopub.status.idle": "2024-05-28T14:52:18.662976Z",
150
+ "shell.execute_reply": "2024-05-28T14:52:18.662258Z",
151
+ "shell.execute_reply.started": "2024-05-28T14:52:18.512969Z"
152
+ }
153
+ },
154
+ "outputs": [],
155
+ "source": [
156
+ "df=df.drop_duplicates()"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "code",
161
+ "execution_count": 8,
162
+ "metadata": {
163
+ "execution": {
164
+ "iopub.execute_input": "2024-05-28T14:52:18.664321Z",
165
+ "iopub.status.busy": "2024-05-28T14:52:18.664037Z",
166
+ "iopub.status.idle": "2024-05-28T14:52:18.811830Z",
167
+ "shell.execute_reply": "2024-05-28T14:52:18.810966Z",
168
+ "shell.execute_reply.started": "2024-05-28T14:52:18.664297Z"
169
+ }
170
+ },
171
+ "outputs": [
172
+ {
173
+ "data": {
174
+ "text/plain": [
175
+ "0"
176
+ ]
177
+ },
178
+ "execution_count": 8,
179
+ "metadata": {},
180
+ "output_type": "execute_result"
181
+ }
182
+ ],
183
+ "source": [
184
+ "df.duplicated().sum()"
185
+ ]
186
+ },
187
+ {
188
+ "cell_type": "markdown",
189
+ "metadata": {},
190
+ "source": [
191
+ "# **Removing HTML Tags**"
192
+ ]
193
+ },
194
+ {
195
+ "cell_type": "code",
196
+ "execution_count": 9,
197
+ "metadata": {
198
+ "execution": {
199
+ "iopub.execute_input": "2024-05-28T14:52:18.813219Z",
200
+ "iopub.status.busy": "2024-05-28T14:52:18.812944Z",
201
+ "iopub.status.idle": "2024-05-28T14:52:18.817734Z",
202
+ "shell.execute_reply": "2024-05-28T14:52:18.816871Z",
203
+ "shell.execute_reply.started": "2024-05-28T14:52:18.813195Z"
204
+ }
205
+ },
206
+ "outputs": [],
207
+ "source": [
208
+ "import re\n",
209
+ "def remove_tags(text):\n",
210
+ " return re.sub(re.compile('<.*?>'),'',text)"
211
+ ]
212
+ },
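The tag-stripping cell above relies on a non-greedy regular expression. As a minimal standalone sketch of the same idea (the module-level name `TAG_RE` and the sample string are assumptions for illustration, not part of the notebook), the pattern can be compiled once and reused for every review:

```python
import re

# Hypothetical module-level name: compiled once, reused for every review.
TAG_RE = re.compile(r"<.*?>")

def remove_tags(text: str) -> str:
    """Delete anything that looks like an HTML tag, such as <br />."""
    return TAG_RE.sub("", text)

# Made-up sample in the style of the IMDB reviews.
sample = "A wonderful little production.<br /><br />The filming technique is great."
print(remove_tags(sample))
# A wonderful little production.The filming technique is great.
```

Because tags are replaced with the empty string, text on either side of a tag is joined without a space. Substituting a single space instead is a possible design choice, but it is not what the notebook does.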
213
+ {
214
+ "cell_type": "code",
215
+ "execution_count": 10,
216
+ "metadata": {
217
+ "execution": {
218
+ "iopub.execute_input": "2024-05-28T14:52:18.819205Z",
219
+ "iopub.status.busy": "2024-05-28T14:52:18.818864Z",
220
+ "iopub.status.idle": "2024-05-28T14:52:19.094654Z",
221
+ "shell.execute_reply": "2024-05-28T14:52:19.093668Z",
222
+ "shell.execute_reply.started": "2024-05-28T14:52:18.819173Z"
223
+ }
224
+ },
225
+ "outputs": [],
226
+ "source": [
227
+ "df['review']=df['review'].apply(remove_tags)"
228
+ ]
229
+ },
230
+ {
231
+ "cell_type": "code",
232
+ "execution_count": 11,
233
+ "metadata": {
234
+ "execution": {
235
+ "iopub.execute_input": "2024-05-28T14:52:19.099305Z",
236
+ "iopub.status.busy": "2024-05-28T14:52:19.099017Z",
237
+ "iopub.status.idle": "2024-05-28T14:52:19.112472Z",
238
+ "shell.execute_reply": "2024-05-28T14:52:19.111531Z",
239
+ "shell.execute_reply.started": "2024-05-28T14:52:19.099280Z"
240
+ }
241
+ },
242
+ "outputs": [
243
+ {
244
+ "data": {
245
+ "text/html": [
246
+ "<div>\n",
247
+ "<style scoped>\n",
248
+ " .dataframe tbody tr th:only-of-type {\n",
249
+ " vertical-align: middle;\n",
250
+ " }\n",
251
+ "\n",
252
+ " .dataframe tbody tr th {\n",
253
+ " vertical-align: top;\n",
254
+ " }\n",
255
+ "\n",
256
+ " .dataframe thead th {\n",
257
+ " text-align: right;\n",
258
+ " }\n",
259
+ "</style>\n",
260
+ "<table border=\"1\" class=\"dataframe\">\n",
261
+ " <thead>\n",
262
+ " <tr style=\"text-align: right;\">\n",
263
+ " <th></th>\n",
264
+ " <th>review</th>\n",
265
+ " <th>sentiment</th>\n",
266
+ " </tr>\n",
267
+ " </thead>\n",
268
+ " <tbody>\n",
269
+ " <tr>\n",
270
+ " <th>0</th>\n",
271
+ " <td>One of the other reviewers has mentioned that ...</td>\n",
272
+ " <td>positive</td>\n",
273
+ " </tr>\n",
274
+ " <tr>\n",
275
+ " <th>1</th>\n",
276
+ " <td>A wonderful little production. The filming tec...</td>\n",
277
+ " <td>positive</td>\n",
278
+ " </tr>\n",
279
+ " <tr>\n",
280
+ " <th>2</th>\n",
281
+ " <td>I thought this was a wonderful way to spend ti...</td>\n",
282
+ " <td>positive</td>\n",
283
+ " </tr>\n",
284
+ " <tr>\n",
285
+ " <th>3</th>\n",
286
+ " <td>Basically there's a family where a little boy ...</td>\n",
287
+ " <td>negative</td>\n",
288
+ " </tr>\n",
289
+ " <tr>\n",
290
+ " <th>4</th>\n",
291
+ " <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n",
292
+ " <td>positive</td>\n",
293
+ " </tr>\n",
294
+ " </tbody>\n",
295
+ "</table>\n",
296
+ "</div>"
297
+ ],
298
+ "text/plain": [
299
+ " review sentiment\n",
300
+ "0 One of the other reviewers has mentioned that ... positive\n",
301
+ "1 A wonderful little production. The filming tec... positive\n",
302
+ "2 I thought this was a wonderful way to spend ti... positive\n",
303
+ "3 Basically there's a family where a little boy ... negative\n",
304
+ "4 Petter Mattei's \"Love in the Time of Money\" is... positive"
305
+ ]
306
+ },
307
+ "execution_count": 11,
308
+ "metadata": {},
309
+ "output_type": "execute_result"
310
+ }
311
+ ],
312
+ "source": [
313
+ "df.head()"
314
+ ]
315
+ },
316
+ {
317
+ "cell_type": "markdown",
318
+ "metadata": {},
319
+ "source": [
320
+ "# **Lowercase**"
321
+ ]
322
+ },
323
+ {
324
+ "cell_type": "code",
325
+ "execution_count": 12,
326
+ "metadata": {
327
+ "execution": {
328
+ "iopub.execute_input": "2024-05-28T14:52:19.113977Z",
329
+ "iopub.status.busy": "2024-05-28T14:52:19.113619Z",
330
+ "iopub.status.idle": "2024-05-28T14:52:19.299489Z",
331
+ "shell.execute_reply": "2024-05-28T14:52:19.298706Z",
332
+ "shell.execute_reply.started": "2024-05-28T14:52:19.113944Z"
333
+ }
334
+ },
335
+ "outputs": [],
336
+ "source": [
337
+ "df['review']=df['review'].apply(lambda x:x.lower())"
338
+ ]
339
+ },
340
+ {
341
+ "cell_type": "code",
342
+ "execution_count": 13,
343
+ "metadata": {
344
+ "execution": {
345
+ "iopub.execute_input": "2024-05-28T14:52:19.300853Z",
346
+ "iopub.status.busy": "2024-05-28T14:52:19.300570Z",
347
+ "iopub.status.idle": "2024-05-28T14:52:19.311359Z",
348
+ "shell.execute_reply": "2024-05-28T14:52:19.310381Z",
349
+ "shell.execute_reply.started": "2024-05-28T14:52:19.300827Z"
350
+ }
351
+ },
352
+ "outputs": [
353
+ {
354
+ "data": {
355
+ "text/html": [
356
+ "<div>\n",
357
+ "<style scoped>\n",
358
+ " .dataframe tbody tr th:only-of-type {\n",
359
+ " vertical-align: middle;\n",
360
+ " }\n",
361
+ "\n",
362
+ " .dataframe tbody tr th {\n",
363
+ " vertical-align: top;\n",
364
+ " }\n",
365
+ "\n",
366
+ " .dataframe thead th {\n",
367
+ " text-align: right;\n",
368
+ " }\n",
369
+ "</style>\n",
370
+ "<table border=\"1\" class=\"dataframe\">\n",
371
+ " <thead>\n",
372
+ " <tr style=\"text-align: right;\">\n",
373
+ " <th></th>\n",
374
+ " <th>review</th>\n",
375
+ " <th>sentiment</th>\n",
376
+ " </tr>\n",
377
+ " </thead>\n",
378
+ " <tbody>\n",
379
+ " <tr>\n",
380
+ " <th>0</th>\n",
381
+ " <td>one of the other reviewers has mentioned that ...</td>\n",
382
+ " <td>positive</td>\n",
383
+ " </tr>\n",
384
+ " <tr>\n",
385
+ " <th>1</th>\n",
386
+ " <td>a wonderful little production. the filming tec...</td>\n",
387
+ " <td>positive</td>\n",
388
+ " </tr>\n",
389
+ " <tr>\n",
390
+ " <th>2</th>\n",
391
+ " <td>i thought this was a wonderful way to spend ti...</td>\n",
392
+ " <td>positive</td>\n",
393
+ " </tr>\n",
394
+ " <tr>\n",
395
+ " <th>3</th>\n",
396
+ " <td>basically there's a family where a little boy ...</td>\n",
397
+ " <td>negative</td>\n",
398
+ " </tr>\n",
399
+ " <tr>\n",
400
+ " <th>4</th>\n",
401
+ " <td>petter mattei's \"love in the time of money\" is...</td>\n",
402
+ " <td>positive</td>\n",
403
+ " </tr>\n",
404
+ " </tbody>\n",
405
+ "</table>\n",
406
+ "</div>"
407
+ ],
408
+ "text/plain": [
409
+ " review sentiment\n",
410
+ "0 one of the other reviewers has mentioned that ... positive\n",
411
+ "1 a wonderful little production. the filming tec... positive\n",
412
+ "2 i thought this was a wonderful way to spend ti... positive\n",
413
+ "3 basically there's a family where a little boy ... negative\n",
414
+ "4 petter mattei's \"love in the time of money\" is... positive"
415
+ ]
416
+ },
417
+ "execution_count": 13,
418
+ "metadata": {},
419
+ "output_type": "execute_result"
420
+ }
421
+ ],
422
+ "source": [
423
+ "df.head()"
424
+ ]
425
+ },
426
+ {
427
+ "cell_type": "markdown",
428
+ "metadata": {},
429
+ "source": [
430
+ "# **Removing Stopwords**"
431
+ ]
432
+ },
433
+ {
434
+ "cell_type": "code",
435
+ "execution_count": 14,
436
+ "metadata": {
437
+ "execution": {
438
+ "iopub.execute_input": "2024-05-28T14:52:19.313147Z",
439
+ "iopub.status.busy": "2024-05-28T14:52:19.312754Z",
440
+ "iopub.status.idle": "2024-05-28T14:52:20.687218Z",
441
+ "shell.execute_reply": "2024-05-28T14:52:20.686234Z",
442
+ "shell.execute_reply.started": "2024-05-28T14:52:19.313113Z"
443
+ }
444
+ },
445
+ "outputs": [],
446
+ "source": [
447
+ "from nltk.corpus import stopwords"
448
+ ]
449
+ },
450
+ {
451
+ "cell_type": "code",
452
+ "execution_count": 15,
453
+ "metadata": {
454
+ "execution": {
455
+ "iopub.execute_input": "2024-05-28T14:52:20.688722Z",
456
+ "iopub.status.busy": "2024-05-28T14:52:20.688422Z",
457
+ "iopub.status.idle": "2024-05-28T14:52:20.695776Z",
458
+ "shell.execute_reply": "2024-05-28T14:52:20.694869Z",
459
+ "shell.execute_reply.started": "2024-05-28T14:52:20.688690Z"
460
+ }
461
+ },
462
+ "outputs": [],
463
+ "source": [
464
+ "sw_list=set(stopwords.words('english'))  # set: O(1) membership tests"
465
+ ]
466
+ },
467
+ {
468
+ "cell_type": "code",
469
+ "execution_count": 16,
470
+ "metadata": {
471
+ "execution": {
472
+ "iopub.execute_input": "2024-05-28T14:52:20.697662Z",
473
+ "iopub.status.busy": "2024-05-28T14:52:20.696910Z",
474
+ "iopub.status.idle": "2024-05-28T14:52:41.103845Z",
475
+ "shell.execute_reply": "2024-05-28T14:52:41.103006Z",
476
+ "shell.execute_reply.started": "2024-05-28T14:52:20.697626Z"
477
+ }
478
+ },
479
+ "outputs": [],
480
+ "source": [
481
+ "df['review']=df['review'].apply(lambda x:[item for item in x.split() if item not in sw_list]).apply(lambda x:\" \".join(x))"
482
+ ]
483
+ },
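The stop-word filter above splits each review on whitespace and keeps the tokens that are not stop words. A minimal sketch of that filter follows, assuming a tiny hand-written stop-word set (`STOP_WORDS` is a hypothetical stand-in for NLTK's English list, so the example runs without downloading a corpus). A set is used because membership tests on it take constant time on average, which may matter over 50,000 reviews:

```python
# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = frozenset(
    {"a", "the", "of", "to", "is", "was", "this", "i", "that", "has", "other"}
)

def remove_stopwords(text: str) -> str:
    """Keep only whitespace-separated tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

print(remove_stopwords("i thought this was a wonderful way to spend time"))
# thought wonderful way spend time
```

The filter compares whole tokens, so a stop word with attached punctuation (for example `the,`) would pass through unchanged at this stage.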
484
+ {
485
+ "cell_type": "code",
486
+ "execution_count": 17,
487
+ "metadata": {
488
+ "execution": {
489
+ "iopub.execute_input": "2024-05-28T14:52:41.105206Z",
490
+ "iopub.status.busy": "2024-05-28T14:52:41.104945Z",
491
+ "iopub.status.idle": "2024-05-28T14:52:41.115077Z",
492
+ "shell.execute_reply": "2024-05-28T14:52:41.114149Z",
493
+ "shell.execute_reply.started": "2024-05-28T14:52:41.105183Z"
494
+ }
495
+ },
496
+ "outputs": [
497
+ {
498
+ "data": {
499
+ "text/html": [
500
+ "<div>\n",
501
+ "<style scoped>\n",
502
+ " .dataframe tbody tr th:only-of-type {\n",
503
+ " vertical-align: middle;\n",
504
+ " }\n",
505
+ "\n",
506
+ " .dataframe tbody tr th {\n",
507
+ " vertical-align: top;\n",
508
+ " }\n",
509
+ "\n",
510
+ " .dataframe thead th {\n",
511
+ " text-align: right;\n",
512
+ " }\n",
513
+ "</style>\n",
514
+ "<table border=\"1\" class=\"dataframe\">\n",
515
+ " <thead>\n",
516
+ " <tr style=\"text-align: right;\">\n",
517
+ " <th></th>\n",
518
+ " <th>review</th>\n",
519
+ " <th>sentiment</th>\n",
520
+ " </tr>\n",
521
+ " </thead>\n",
522
+ " <tbody>\n",
523
+ " <tr>\n",
524
+ " <th>0</th>\n",
525
+ " <td>one reviewers mentioned watching 1 oz episode ...</td>\n",
526
+ " <td>positive</td>\n",
527
+ " </tr>\n",
528
+ " <tr>\n",
529
+ " <th>1</th>\n",
530
+ " <td>wonderful little production. filming technique...</td>\n",
531
+ " <td>positive</td>\n",
532
+ " </tr>\n",
533
+ " <tr>\n",
534
+ " <th>2</th>\n",
535
+ " <td>thought wonderful way spend time hot summer we...</td>\n",
536
+ " <td>positive</td>\n",
537
+ " </tr>\n",
538
+ " <tr>\n",
539
+ " <th>3</th>\n",
540
+ " <td>basically there's family little boy (jake) thi...</td>\n",
541
+ " <td>negative</td>\n",
542
+ " </tr>\n",
543
+ " <tr>\n",
544
+ " <th>4</th>\n",
545
+ " <td>petter mattei's \"love time money\" visually stu...</td>\n",
546
+ " <td>positive</td>\n",
547
+ " </tr>\n",
548
+ " </tbody>\n",
549
+ "</table>\n",
550
+ "</div>"
551
+ ],
552
+ "text/plain": [
553
+ " review sentiment\n",
554
+ "0 one reviewers mentioned watching 1 oz episode ... positive\n",
555
+ "1 wonderful little production. filming technique... positive\n",
556
+ "2 thought wonderful way spend time hot summer we... positive\n",
557
+ "3 basically there's family little boy (jake) thi... negative\n",
558
+ "4 petter mattei's \"love time money\" visually stu... positive"
559
+ ]
560
+ },
561
+ "execution_count": 17,
562
+ "metadata": {},
563
+ "output_type": "execute_result"
564
+ }
565
+ ],
566
+ "source": [
567
+ "df.head()"
568
+ ]
569
+ },
570
+ {
571
+ "cell_type": "markdown",
572
+ "metadata": {},
573
+ "source": [
574
+ "# **Removing Numbers**"
575
+ ]
576
+ },
577
+ {
578
+ "cell_type": "code",
579
+ "execution_count": 18,
580
+ "metadata": {
581
+ "execution": {
582
+ "iopub.execute_input": "2024-05-28T14:52:41.116693Z",
583
+ "iopub.status.busy": "2024-05-28T14:52:41.116370Z",
584
+ "iopub.status.idle": "2024-05-28T14:52:42.350128Z",
585
+ "shell.execute_reply": "2024-05-28T14:52:42.349082Z",
586
+ "shell.execute_reply.started": "2024-05-28T14:52:41.116666Z"
587
+ }
588
+ },
589
+ "outputs": [],
590
+ "source": [
591
+ "df['review']=df['review'].apply(lambda x:' '.join([i for i in x.split() if not i.isdigit()]))"
592
+ ]
593
+ },
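The number-removal cell above drops a token only when `str.isdigit()` is true for the whole token. A short sketch of what that does and does not catch (the function name `remove_numbers` and the sample strings are assumptions for illustration):

```python
def remove_numbers(text: str) -> str:
    """Drop tokens made up entirely of digits; leave mixed tokens untouched."""
    return " ".join(tok for tok in text.split() if not tok.isdigit())

print(remove_numbers("watching 1 oz episode"))
# watching oz episode

# Tokens that mix digits with punctuation are not all-digit, so they survive.
print(remove_numbers("rated 10/10 in 1999."))
# rated 10/10 in 1999.
```

Since this step runs before punctuation removal in the notebook, a token such as `10/10` would presumably end up as `1010` after the punctuation step rather than being removed.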
594
+ {
595
+ "cell_type": "code",
596
+ "execution_count": 19,
597
+ "metadata": {
598
+ "execution": {
599
+ "iopub.execute_input": "2024-05-28T14:52:42.352200Z",
600
+ "iopub.status.busy": "2024-05-28T14:52:42.351476Z",
601
+ "iopub.status.idle": "2024-05-28T14:52:42.361695Z",
602
+ "shell.execute_reply": "2024-05-28T14:52:42.360686Z",
603
+ "shell.execute_reply.started": "2024-05-28T14:52:42.352154Z"
604
+ }
605
+ },
606
+ "outputs": [
607
+ {
608
+ "data": {
609
+ "text/html": [
610
+ "<div>\n",
611
+ "<style scoped>\n",
612
+ " .dataframe tbody tr th:only-of-type {\n",
613
+ " vertical-align: middle;\n",
614
+ " }\n",
615
+ "\n",
616
+ " .dataframe tbody tr th {\n",
617
+ " vertical-align: top;\n",
618
+ " }\n",
619
+ "\n",
620
+ " .dataframe thead th {\n",
621
+ " text-align: right;\n",
622
+ " }\n",
623
+ "</style>\n",
624
+ "<table border=\"1\" class=\"dataframe\">\n",
625
+ " <thead>\n",
626
+ " <tr style=\"text-align: right;\">\n",
627
+ " <th></th>\n",
628
+ " <th>review</th>\n",
629
+ " <th>sentiment</th>\n",
630
+ " </tr>\n",
631
+ " </thead>\n",
632
+ " <tbody>\n",
633
+ " <tr>\n",
634
+ " <th>0</th>\n",
635
+ " <td>one reviewers mentioned watching oz episode ho...</td>\n",
636
+ " <td>positive</td>\n",
637
+ " </tr>\n",
638
+ " <tr>\n",
639
+ " <th>1</th>\n",
640
+ " <td>wonderful little production. filming technique...</td>\n",
641
+ " <td>positive</td>\n",
642
+ " </tr>\n",
643
+ " <tr>\n",
644
+ " <th>2</th>\n",
645
+ " <td>thought wonderful way spend time hot summer we...</td>\n",
646
+ " <td>positive</td>\n",
647
+ " </tr>\n",
648
+ " <tr>\n",
649
+ " <th>3</th>\n",
650
+ " <td>basically there's family little boy (jake) thi...</td>\n",
651
+ " <td>negative</td>\n",
652
+ " </tr>\n",
653
+ " <tr>\n",
654
+ " <th>4</th>\n",
655
+ " <td>petter mattei's \"love time money\" visually stu...</td>\n",
656
+ " <td>positive</td>\n",
657
+ " </tr>\n",
658
+ " </tbody>\n",
659
+ "</table>\n",
660
+ "</div>"
661
+ ],
662
+ "text/plain": [
663
+ " review sentiment\n",
664
+ "0 one reviewers mentioned watching oz episode ho... positive\n",
665
+ "1 wonderful little production. filming technique... positive\n",
666
+ "2 thought wonderful way spend time hot summer we... positive\n",
667
+ "3 basically there's family little boy (jake) thi... negative\n",
668
+ "4 petter mattei's \"love time money\" visually stu... positive"
669
+ ]
670
+ },
671
+ "execution_count": 19,
672
+ "metadata": {},
673
+ "output_type": "execute_result"
674
+ }
675
+ ],
676
+ "source": [
677
+ "df.head()"
678
+ ]
679
+ },
680
+ {
681
+ "cell_type": "markdown",
682
+ "metadata": {},
683
+ "source": [
684
+ "# **Removing Punctuation**"
685
+ ]
686
+ },
687
+ {
688
+ "cell_type": "code",
689
+ "execution_count": 20,
690
+ "metadata": {
691
+ "execution": {
692
+ "iopub.execute_input": "2024-05-28T14:52:42.363205Z",
693
+ "iopub.status.busy": "2024-05-28T14:52:42.362921Z",
694
+ "iopub.status.idle": "2024-05-28T14:52:42.369773Z",
695
+ "shell.execute_reply": "2024-05-28T14:52:42.368884Z",
696
+ "shell.execute_reply.started": "2024-05-28T14:52:42.363181Z"
697
+ }
698
+ },
699
+ "outputs": [],
700
+ "source": [
701
+ "import string\n",
702
+ "PUNCT_TO_REMOVE = string.punctuation\n",
703
+ "def remove_punctuation(text):\n",
704
+ " return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))"
705
+ ]
706
+ },
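The punctuation cell above deletes characters through a translation table. The sketch below shows the same technique in isolation; the only change is that the table is built once (under the hypothetical name `PUNCT_TABLE`) instead of on every call:

```python
import string

# Hypothetical name: maps every ASCII punctuation character to None (deletion).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text: str) -> str:
    """Delete all characters listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation("basically there's family little boy (jake)"))
# basically theres family little boy jake
```

Note that apostrophes are deleted too, so `there's` becomes `theres`. Any later step that relies on apostrophes to recognise contractions sees the text in this form.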
707
+ {
708
+ "cell_type": "code",
709
+ "execution_count": 21,
710
+ "metadata": {
711
+ "execution": {
712
+ "iopub.execute_input": "2024-05-28T14:52:42.371097Z",
713
+ "iopub.status.busy": "2024-05-28T14:52:42.370761Z",
714
+ "iopub.status.idle": "2024-05-28T14:52:43.206495Z",
715
+ "shell.execute_reply": "2024-05-28T14:52:43.205666Z",
716
+ "shell.execute_reply.started": "2024-05-28T14:52:42.371071Z"
717
+ }
718
+ },
719
+ "outputs": [],
720
+ "source": [
721
+ "df['review']=df['review'].apply(remove_punctuation)"
722
+ ]
723
+ },
724
+ {
725
+ "cell_type": "code",
726
+ "execution_count": 22,
727
+ "metadata": {
728
+ "execution": {
729
+ "iopub.execute_input": "2024-05-28T14:52:43.207858Z",
730
+ "iopub.status.busy": "2024-05-28T14:52:43.207590Z",
731
+ "iopub.status.idle": "2024-05-28T14:52:43.217082Z",
732
+ "shell.execute_reply": "2024-05-28T14:52:43.216131Z",
733
+ "shell.execute_reply.started": "2024-05-28T14:52:43.207835Z"
734
+ }
735
+ },
736
+ "outputs": [
737
+ {
738
+ "data": {
739
+ "text/html": [
740
+ "<div>\n",
741
+ "<style scoped>\n",
742
+ " .dataframe tbody tr th:only-of-type {\n",
743
+ " vertical-align: middle;\n",
744
+ " }\n",
745
+ "\n",
746
+ " .dataframe tbody tr th {\n",
747
+ " vertical-align: top;\n",
748
+ " }\n",
749
+ "\n",
750
+ " .dataframe thead th {\n",
751
+ " text-align: right;\n",
752
+ " }\n",
753
+ "</style>\n",
754
+ "<table border=\"1\" class=\"dataframe\">\n",
755
+ " <thead>\n",
756
+ " <tr style=\"text-align: right;\">\n",
757
+ " <th></th>\n",
758
+ " <th>review</th>\n",
759
+ " <th>sentiment</th>\n",
760
+ " </tr>\n",
761
+ " </thead>\n",
762
+ " <tbody>\n",
763
+ " <tr>\n",
764
+ " <th>0</th>\n",
765
+ " <td>one reviewers mentioned watching oz episode ho...</td>\n",
766
+ " <td>positive</td>\n",
767
+ " </tr>\n",
768
+ " <tr>\n",
769
+ " <th>1</th>\n",
770
+ " <td>wonderful little production filming technique ...</td>\n",
771
+ " <td>positive</td>\n",
772
+ " </tr>\n",
773
+ " <tr>\n",
774
+ " <th>2</th>\n",
775
+ " <td>thought wonderful way spend time hot summer we...</td>\n",
776
+ " <td>positive</td>\n",
777
+ " </tr>\n",
778
+ " <tr>\n",
779
+ " <th>3</th>\n",
780
+ " <td>basically theres family little boy jake thinks...</td>\n",
781
+ " <td>negative</td>\n",
782
+ " </tr>\n",
783
+ " <tr>\n",
784
+ " <th>4</th>\n",
785
+ " <td>petter matteis love time money visually stunni...</td>\n",
786
+ " <td>positive</td>\n",
787
+ " </tr>\n",
788
+ " </tbody>\n",
789
+ "</table>\n",
790
+ "</div>"
791
+ ],
792
+ "text/plain": [
793
+ " review sentiment\n",
794
+ "0 one reviewers mentioned watching oz episode ho... positive\n",
795
+ "1 wonderful little production filming technique ... positive\n",
796
+ "2 thought wonderful way spend time hot summer we... positive\n",
797
+ "3 basically theres family little boy jake thinks... negative\n",
798
+ "4 petter matteis love time money visually stunni... positive"
799
+ ]
800
+ },
801
+ "execution_count": 22,
802
+ "metadata": {},
803
+ "output_type": "execute_result"
804
+ }
805
+ ],
806
+ "source": [
807
+ "df.head()"
808
+ ]
809
+ },
810
+ {
811
+ "cell_type": "markdown",
812
+ "metadata": {},
813
+ "source": [
814
+ "# **Expanding Contractions**"
815
+ ]
816
+ },
817
+ {
818
+ "cell_type": "code",
819
+ "execution_count": 23,
820
+ "metadata": {
821
+ "execution": {
822
+ "iopub.execute_input": "2024-05-28T14:52:43.218467Z",
823
+ "iopub.status.busy": "2024-05-28T14:52:43.218186Z",
824
+ "iopub.status.idle": "2024-05-28T14:52:56.917289Z",
825
+ "shell.execute_reply": "2024-05-28T14:52:56.916286Z",
826
+ "shell.execute_reply.started": "2024-05-28T14:52:43.218441Z"
827
+ }
828
+ },
829
+ "outputs": [
830
+ {
831
+ "name": "stdout",
832
+ "output_type": "stream",
833
+ "text": [
834
+ "Requirement already satisfied: contractions in /opt/conda/lib/python3.10/site-packages (0.1.73)\n",
835
+ "Requirement already satisfied: textsearch>=0.0.21 in /opt/conda/lib/python3.10/site-packages (from contractions) (0.0.24)\n",
836
+ "Requirement already satisfied: anyascii in /opt/conda/lib/python3.10/site-packages (from textsearch>=0.0.21->contractions) (0.3.2)\n",
837
+ "Requirement already satisfied: pyahocorasick in /opt/conda/lib/python3.10/site-packages (from textsearch>=0.0.21->contractions) (2.1.0)\n"
838
+ ]
839
+ }
840
+ ],
841
+ "source": [
842
+ "!pip install contractions"
843
+ ]
844
+ },
845
+ {
846
+ "cell_type": "code",
847
+ "execution_count": 24,
848
+ "metadata": {
849
+ "execution": {
850
+ "iopub.execute_input": "2024-05-28T14:52:56.919161Z",
851
+ "iopub.status.busy": "2024-05-28T14:52:56.918839Z",
852
+ "iopub.status.idle": "2024-05-28T14:52:56.944515Z",
853
+ "shell.execute_reply": "2024-05-28T14:52:56.943830Z",
854
+ "shell.execute_reply.started": "2024-05-28T14:52:56.919130Z"
855
+ }
856
+ },
857
+ "outputs": [],
858
+ "source": [
859
+ "import contractions\n",
860
+ "def remove_contractions(text):\n",
861
+ " return contractions.fix(text)\n"
862
+ ]
863
+ },
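In this notebook, contractions are expanded after stop words and punctuation have already been removed. The sketch below illustrates why the order can matter. It uses a two-entry lookup table (`CONTRACTIONS`, hypothetical) as a stand-in for the `contractions` package, so its behaviour is an assumption and not a description of that library. The notebook's own output suggests the real package still expands some apostrophe-less forms (`theres` becomes `there is`), which a strict apostrophe-keyed lookup like this one would not:

```python
import string

# Hypothetical two-entry table standing in for the contractions package.
CONTRACTIONS = {"there's": "there is", "i'm": "i am"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def expand(text: str) -> str:
    """Replace each token found in the table with its expanded form."""
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in text.split())

def strip_punct(text: str) -> str:
    """Delete all ASCII punctuation, apostrophes included."""
    return text.translate(PUNCT_TABLE)

raw = "there's a family, i'm told"

# Expanding first lets the lookup see the apostrophes.
print(strip_punct(expand(raw)))
# there is a family i am told

# Stripping first removes the apostrophes the lookup depends on.
print(expand(strip_punct(raw)))
# theres a family im told
```

Either way, expansion reintroduces words such as `is`, `i`, and `am`. When stop-word removal has already run, as it has here, those words remain in the final text.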
864
+ {
865
+ "cell_type": "code",
866
+ "execution_count": 25,
867
+ "metadata": {
868
+ "execution": {
869
+ "iopub.execute_input": "2024-05-28T14:52:56.945783Z",
870
+ "iopub.status.busy": "2024-05-28T14:52:56.945518Z",
871
+ "iopub.status.idle": "2024-05-28T14:53:00.977406Z",
872
+ "shell.execute_reply": "2024-05-28T14:53:00.976592Z",
873
+ "shell.execute_reply.started": "2024-05-28T14:52:56.945760Z"
874
+ }
875
+ },
876
+ "outputs": [],
877
+ "source": [
878
+ "df['review']=df['review'].apply(remove_contractions)"
879
+ ]
880
+ },
881
+ {
882
+ "cell_type": "code",
883
+ "execution_count": 26,
884
+ "metadata": {
885
+ "execution": {
886
+ "iopub.execute_input": "2024-05-28T14:53:00.978856Z",
887
+ "iopub.status.busy": "2024-05-28T14:53:00.978551Z",
888
+ "iopub.status.idle": "2024-05-28T14:53:00.991438Z",
889
+ "shell.execute_reply": "2024-05-28T14:53:00.990549Z",
890
+ "shell.execute_reply.started": "2024-05-28T14:53:00.978830Z"
891
+ }
892
+ },
893
+ "outputs": [
894
+ {
895
+ "data": {
896
+ "text/html": [
897
+ "<div>\n",
898
+ "<style scoped>\n",
899
+ " .dataframe tbody tr th:only-of-type {\n",
900
+ " vertical-align: middle;\n",
901
+ " }\n",
902
+ "\n",
903
+ " .dataframe tbody tr th {\n",
904
+ " vertical-align: top;\n",
905
+ " }\n",
906
+ "\n",
907
+ " .dataframe thead th {\n",
908
+ " text-align: right;\n",
909
+ " }\n",
910
+ "</style>\n",
911
+ "<table border=\"1\" class=\"dataframe\">\n",
912
+ " <thead>\n",
913
+ " <tr style=\"text-align: right;\">\n",
914
+ " <th></th>\n",
915
+ " <th>review</th>\n",
916
+ " <th>sentiment</th>\n",
917
+ " </tr>\n",
918
+ " </thead>\n",
919
+ " <tbody>\n",
920
+ " <tr>\n",
921
+ " <th>0</th>\n",
922
+ " <td>one reviewers mentioned watching oz episode ho...</td>\n",
923
+ " <td>positive</td>\n",
924
+ " </tr>\n",
925
+ " <tr>\n",
926
+ " <th>1</th>\n",
927
+ " <td>wonderful little production filming technique ...</td>\n",
928
+ " <td>positive</td>\n",
929
+ " </tr>\n",
930
+ " <tr>\n",
931
+ " <th>2</th>\n",
932
+ " <td>thought wonderful way spend time hot summer we...</td>\n",
933
+ " <td>positive</td>\n",
934
+ " </tr>\n",
935
+ " <tr>\n",
936
+ " <th>3</th>\n",
937
+ " <td>basically there is family little boy jake thin...</td>\n",
938
+ " <td>negative</td>\n",
939
+ " </tr>\n",
940
+ " <tr>\n",
941
+ " <th>4</th>\n",
942
+ " <td>petter matteis love time money visually stunni...</td>\n",
943
+ " <td>positive</td>\n",
944
+ " </tr>\n",
945
+ " <tr>\n",
946
+ " <th>...</th>\n",
947
+ " <td>...</td>\n",
948
+ " <td>...</td>\n",
949
+ " </tr>\n",
950
+ " <tr>\n",
951
+ " <th>49995</th>\n",
952
+ " <td>thought movie right good job creative original...</td>\n",
953
+ " <td>positive</td>\n",
954
+ " </tr>\n",
955
+ " <tr>\n",
956
+ " <th>49996</th>\n",
957
+ " <td>bad plot bad dialogue bad acting idiotic direc...</td>\n",
958
+ " <td>negative</td>\n",
959
+ " </tr>\n",
960
+ " <tr>\n",
961
+ " <th>49997</th>\n",
962
+ " <td>catholic taught parochial elementary schools n...</td>\n",
963
+ " <td>negative</td>\n",
964
+ " </tr>\n",
965
+ " <tr>\n",
966
+ " <th>49998</th>\n",
967
+ " <td>i am going disagree previous comment side malt...</td>\n",
968
+ " <td>negative</td>\n",
969
+ " </tr>\n",
970
+ " <tr>\n",
971
+ " <th>49999</th>\n",
972
+ " <td>one expects star trek movies high art fans exp...</td>\n",
973
+ " <td>negative</td>\n",
974
+ " </tr>\n",
975
+ " </tbody>\n",
976
+ "</table>\n",
977
+ "<p>49582 rows × 2 columns</p>\n",
978
+ "</div>"
979
+ ],
980
+ "text/plain": [
981
+ " review sentiment\n",
982
+ "0 one reviewers mentioned watching oz episode ho... positive\n",
983
+ "1 wonderful little production filming technique ... positive\n",
984
+ "2 thought wonderful way spend time hot summer we... positive\n",
985
+ "3 basically there is family little boy jake thin... negative\n",
986
+ "4 petter matteis love time money visually stunni... positive\n",
987
+ "... ... ...\n",
988
+ "49995 thought movie right good job creative original... positive\n",
989
+ "49996 bad plot bad dialogue bad acting idiotic direc... negative\n",
990
+ "49997 catholic taught parochial elementary schools n... negative\n",
991
+ "49998 i am going disagree previous comment side malt... negative\n",
992
+ "49999 one expects star trek movies high art fans exp... negative\n",
993
+ "\n",
994
+ "[49582 rows x 2 columns]"
995
+ ]
996
+ },
997
+ "execution_count": 26,
998
+ "metadata": {},
999
+ "output_type": "execute_result"
1000
+ }
1001
+ ],
1002
+ "source": [
1003
+ "df"
1004
+ ]
1005
+ },
1006
+ {
1007
+ "cell_type": "code",
1008
+ "execution_count": 27,
1009
+ "metadata": {
1010
+ "execution": {
1011
+ "iopub.execute_input": "2024-05-28T14:53:00.993229Z",
1012
+ "iopub.status.busy": "2024-05-28T14:53:00.992777Z",
1013
+ "iopub.status.idle": "2024-05-28T14:53:01.001576Z",
1014
+ "shell.execute_reply": "2024-05-28T14:53:01.000876Z",
1015
+ "shell.execute_reply.started": "2024-05-28T14:53:00.993151Z"
1016
+ }
1017
+ },
1018
+ "outputs": [],
1019
+ "source": [
1020
+ "x=df.drop(columns='sentiment')\n",
1021
+ "y=df['sentiment']"
1022
+ ]
1023
+ },
1024
+ {
1025
+ "cell_type": "code",
1026
+ "execution_count": 28,
1027
+ "metadata": {
1028
+ "execution": {
1029
+ "iopub.execute_input": "2024-05-28T14:53:01.003284Z",
1030
+ "iopub.status.busy": "2024-05-28T14:53:01.002910Z",
1031
+ "iopub.status.idle": "2024-05-28T14:53:01.014799Z",
1032
+ "shell.execute_reply": "2024-05-28T14:53:01.013876Z",
1033
+ "shell.execute_reply.started": "2024-05-28T14:53:01.003260Z"
1034
+ }
1035
+ },
1036
+ "outputs": [
1037
+ {
1038
+ "data": {
1039
+ "text/html": [
1040
+ "<div>\n",
1041
+ "<style scoped>\n",
1042
+ " .dataframe tbody tr th:only-of-type {\n",
1043
+ " vertical-align: middle;\n",
1044
+ " }\n",
1045
+ "\n",
1046
+ " .dataframe tbody tr th {\n",
1047
+ " vertical-align: top;\n",
1048
+ " }\n",
1049
+ "\n",
1050
+ " .dataframe thead th {\n",
1051
+ " text-align: right;\n",
1052
+ " }\n",
1053
+ "</style>\n",
1054
+ "<table border=\"1\" class=\"dataframe\">\n",
1055
+ " <thead>\n",
1056
+ " <tr style=\"text-align: right;\">\n",
1057
+ " <th></th>\n",
1058
+ " <th>review</th>\n",
1059
+ " </tr>\n",
1060
+ " </thead>\n",
1061
+ " <tbody>\n",
1062
+ " <tr>\n",
1063
+ " <th>0</th>\n",
1064
+ " <td>one reviewers mentioned watching oz episode ho...</td>\n",
1065
+ " </tr>\n",
1066
+ " <tr>\n",
1067
+ " <th>1</th>\n",
1068
+ " <td>wonderful little production filming technique ...</td>\n",
1069
+ " </tr>\n",
1070
+ " <tr>\n",
1071
+ " <th>2</th>\n",
1072
+ " <td>thought wonderful way spend time hot summer we...</td>\n",
1073
+ " </tr>\n",
1074
+ " <tr>\n",
1075
+ " <th>3</th>\n",
1076
+ " <td>basically there is family little boy jake thin...</td>\n",
1077
+ " </tr>\n",
1078
+ " <tr>\n",
1079
+ " <th>4</th>\n",
1080
+ " <td>petter matteis love time money visually stunni...</td>\n",
1081
+ " </tr>\n",
1082
+ " <tr>\n",
1083
+ " <th>...</th>\n",
1084
+ " <td>...</td>\n",
1085
+ " </tr>\n",
1086
+ " <tr>\n",
1087
+ " <th>49995</th>\n",
1088
+ " <td>thought movie right good job creative original...</td>\n",
1089
+ " </tr>\n",
1090
+ " <tr>\n",
1091
+ " <th>49996</th>\n",
1092
+ " <td>bad plot bad dialogue bad acting idiotic direc...</td>\n",
1093
+ " </tr>\n",
1094
+ " <tr>\n",
1095
+ " <th>49997</th>\n",
1096
+ " <td>catholic taught parochial elementary schools n...</td>\n",
1097
+ " </tr>\n",
1098
+ " <tr>\n",
1099
+ " <th>49998</th>\n",
1100
+ " <td>i am going disagree previous comment side malt...</td>\n",
1101
+ " </tr>\n",
1102
+ " <tr>\n",
1103
+ " <th>49999</th>\n",
1104
+ " <td>one expects star trek movies high art fans exp...</td>\n",
1105
+ " </tr>\n",
1106
+ " </tbody>\n",
1107
+ "</table>\n",
1108
+ "<p>49582 rows × 1 columns</p>\n",
1109
+ "</div>"
1110
+ ],
1111
+ "text/plain": [
1112
+ " review\n",
1113
+ "0 one reviewers mentioned watching oz episode ho...\n",
1114
+ "1 wonderful little production filming technique ...\n",
1115
+ "2 thought wonderful way spend time hot summer we...\n",
1116
+ "3 basically there is family little boy jake thin...\n",
1117
+ "4 petter matteis love time money visually stunni...\n",
1118
+ "... ...\n",
1119
+ "49995 thought movie right good job creative original...\n",
1120
+ "49996 bad plot bad dialogue bad acting idiotic direc...\n",
1121
+ "49997 catholic taught parochial elementary schools n...\n",
1122
+ "49998 i am going disagree previous comment side malt...\n",
1123
+ "49999 one expects star trek movies high art fans exp...\n",
1124
+ "\n",
1125
+ "[49582 rows x 1 columns]"
1126
+ ]
1127
+ },
1128
+ "execution_count": 28,
1129
+ "metadata": {},
1130
+ "output_type": "execute_result"
1131
+ }
1132
+ ],
1133
+ "source": [
1134
+ "x"
1135
+ ]
1136
+ },
1137
+ {
1138
+ "cell_type": "code",
1139
+ "execution_count": 29,
1140
+ "metadata": {
1141
+ "execution": {
1142
+ "iopub.execute_input": "2024-05-28T14:53:01.022597Z",
1143
+ "iopub.status.busy": "2024-05-28T14:53:01.022330Z",
1144
+ "iopub.status.idle": "2024-05-28T14:53:01.030503Z",
1145
+ "shell.execute_reply": "2024-05-28T14:53:01.029518Z",
1146
+ "shell.execute_reply.started": "2024-05-28T14:53:01.022574Z"
1147
+ }
1148
+ },
1149
+ "outputs": [
1150
+ {
1151
+ "data": {
1152
+ "text/plain": [
1153
+ "0 positive\n",
1154
+ "1 positive\n",
1155
+ "2 positive\n",
1156
+ "3 negative\n",
1157
+ "4 positive\n",
1158
+ " ... \n",
1159
+ "49995 positive\n",
1160
+ "49996 negative\n",
1161
+ "49997 negative\n",
1162
+ "49998 negative\n",
1163
+ "49999 negative\n",
1164
+ "Name: sentiment, Length: 49582, dtype: object"
1165
+ ]
1166
+ },
1167
+ "execution_count": 29,
1168
+ "metadata": {},
1169
+ "output_type": "execute_result"
1170
+ }
1171
+ ],
1172
+ "source": [
1173
+ "y"
1174
+ ]
1175
+ },
1176
+ {
1177
+ "cell_type": "code",
1178
+ "execution_count": 30,
1179
+ "metadata": {
1180
+ "execution": {
1181
+ "iopub.execute_input": "2024-05-28T14:53:01.032109Z",
1182
+ "iopub.status.busy": "2024-05-28T14:53:01.031731Z",
1183
+ "iopub.status.idle": "2024-05-28T14:53:01.037243Z",
1184
+ "shell.execute_reply": "2024-05-28T14:53:01.036374Z",
1185
+ "shell.execute_reply.started": "2024-05-28T14:53:01.032084Z"
1186
+ }
1187
+ },
1188
+ "outputs": [],
1189
+ "source": [
1190
+ "from sklearn.preprocessing import LabelEncoder"
1191
+ ]
1192
+ },
1193
+ {
1194
+ "cell_type": "code",
1195
+ "execution_count": 31,
1196
+ "metadata": {
1197
+ "execution": {
1198
+ "iopub.execute_input": "2024-05-28T14:53:01.038703Z",
1199
+ "iopub.status.busy": "2024-05-28T14:53:01.038403Z",
1200
+ "iopub.status.idle": "2024-05-28T14:53:01.057333Z",
1201
+ "shell.execute_reply": "2024-05-28T14:53:01.056511Z",
1202
+ "shell.execute_reply.started": "2024-05-28T14:53:01.038679Z"
1203
+ }
1204
+ },
1205
+ "outputs": [],
1206
+ "source": [
1207
+ "y=LabelEncoder().fit_transform(y)"
1208
+ ]
1209
+ },
1210
+ {
1211
+ "cell_type": "code",
1212
+ "execution_count": 32,
1213
+ "metadata": {
1214
+ "execution": {
1215
+ "iopub.execute_input": "2024-05-28T14:53:01.058600Z",
1216
+ "iopub.status.busy": "2024-05-28T14:53:01.058345Z",
1217
+ "iopub.status.idle": "2024-05-28T14:53:01.067016Z",
1218
+ "shell.execute_reply": "2024-05-28T14:53:01.066135Z",
1219
+ "shell.execute_reply.started": "2024-05-28T14:53:01.058579Z"
1220
+ }
1221
+ },
1222
+ "outputs": [
1223
+ {
1224
+ "data": {
1225
+ "text/plain": [
1226
+ "array([1, 1, 1, ..., 0, 0, 0])"
1227
+ ]
1228
+ },
1229
+ "execution_count": 32,
1230
+ "metadata": {},
1231
+ "output_type": "execute_result"
1232
+ }
1233
+ ],
1234
+ "source": [
1235
+ "y"
1236
+ ]
1237
+ },
1238
+ {
1239
+ "cell_type": "code",
1240
+ "execution_count": 33,
1241
+ "metadata": {
1242
+ "execution": {
1243
+ "iopub.execute_input": "2024-05-28T14:53:01.068374Z",
1244
+ "iopub.status.busy": "2024-05-28T14:53:01.068086Z",
1245
+ "iopub.status.idle": "2024-05-28T14:53:01.099626Z",
1246
+ "shell.execute_reply": "2024-05-28T14:53:01.098771Z",
1247
+ "shell.execute_reply.started": "2024-05-28T14:53:01.068341Z"
1248
+ }
1249
+ },
1250
+ "outputs": [],
1251
+ "source": [
1252
+ "from sklearn.model_selection import train_test_split\n",
1253
+ "x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=3,stratify=y)"
1254
+ ]
1255
+ },
1256
+ {
1257
+ "cell_type": "code",
1258
+ "execution_count": 34,
1259
+ "metadata": {
1260
+ "execution": {
1261
+ "iopub.execute_input": "2024-05-28T14:53:01.101218Z",
1262
+ "iopub.status.busy": "2024-05-28T14:53:01.100870Z",
1263
+ "iopub.status.idle": "2024-05-28T14:53:01.106109Z",
1264
+ "shell.execute_reply": "2024-05-28T14:53:01.105161Z",
1265
+ "shell.execute_reply.started": "2024-05-28T14:53:01.101184Z"
1266
+ }
1267
+ },
1268
+ "outputs": [
1269
+ {
1270
+ "name": "stdout",
1271
+ "output_type": "stream",
1272
+ "text": [
1273
+ "(39665, 1) (9917, 1)\n"
1274
+ ]
1275
+ }
1276
+ ],
1277
+ "source": [
1278
+ "print(x_train.shape,x_test.shape)"
1279
+ ]
1280
+ },
1281
+ {
1282
+ "cell_type": "markdown",
1283
+ "metadata": {},
1284
+ "source": [
1285
+ "# Bag of Words"
1286
+ ]
1287
+ },
1288
+ {
1289
+ "cell_type": "code",
1290
+ "execution_count": 35,
1291
+ "metadata": {
1292
+ "execution": {
1293
+ "iopub.execute_input": "2024-05-28T14:53:01.107866Z",
1294
+ "iopub.status.busy": "2024-05-28T14:53:01.107441Z",
1295
+ "iopub.status.idle": "2024-05-28T14:53:01.113900Z",
1296
+ "shell.execute_reply": "2024-05-28T14:53:01.112939Z",
1297
+ "shell.execute_reply.started": "2024-05-28T14:53:01.107829Z"
1298
+ }
1299
+ },
1300
+ "outputs": [],
1301
+ "source": [
1302
+ "from sklearn.feature_extraction.text import CountVectorizer"
1303
+ ]
1304
+ },
1305
+ {
1306
+ "cell_type": "code",
1307
+ "execution_count": 36,
1308
+ "metadata": {
1309
+ "execution": {
1310
+ "iopub.execute_input": "2024-05-28T14:53:01.115218Z",
1311
+ "iopub.status.busy": "2024-05-28T14:53:01.114952Z",
1312
+ "iopub.status.idle": "2024-05-28T14:53:01.121742Z",
1313
+ "shell.execute_reply": "2024-05-28T14:53:01.120751Z",
1314
+ "shell.execute_reply.started": "2024-05-28T14:53:01.115195Z"
1315
+ }
1316
+ },
1317
+ "outputs": [],
1318
+ "source": [
1319
+ "cv=CountVectorizer(max_features=10000)"
1320
+ ]
1321
+ },
1322
+ {
1323
+ "cell_type": "code",
1324
+ "execution_count": 37,
1325
+ "metadata": {
1326
+ "execution": {
1327
+ "iopub.execute_input": "2024-05-28T14:53:01.123124Z",
1328
+ "iopub.status.busy": "2024-05-28T14:53:01.122830Z",
1329
+ "iopub.status.idle": "2024-05-28T14:53:01.134108Z",
1330
+ "shell.execute_reply": "2024-05-28T14:53:01.133166Z",
1331
+ "shell.execute_reply.started": "2024-05-28T14:53:01.123101Z"
1332
+ }
1333
+ },
1334
+ "outputs": [
1335
+ {
1336
+ "data": {
1337
+ "text/html": [
1338
+ "<div>\n",
1339
+ "<style scoped>\n",
1340
+ " .dataframe tbody tr th:only-of-type {\n",
1341
+ " vertical-align: middle;\n",
1342
+ " }\n",
1343
+ "\n",
1344
+ " .dataframe tbody tr th {\n",
1345
+ " vertical-align: top;\n",
1346
+ " }\n",
1347
+ "\n",
1348
+ " .dataframe thead th {\n",
1349
+ " text-align: right;\n",
1350
+ " }\n",
1351
+ "</style>\n",
1352
+ "<table border=\"1\" class=\"dataframe\">\n",
1353
+ " <thead>\n",
1354
+ " <tr style=\"text-align: right;\">\n",
1355
+ " <th></th>\n",
1356
+ " <th>review</th>\n",
1357
+ " </tr>\n",
1358
+ " </thead>\n",
1359
+ " <tbody>\n",
1360
+ " <tr>\n",
1361
+ " <th>17185</th>\n",
1362
+ " <td>watching avalon which decent nice digital fx s...</td>\n",
1363
+ " </tr>\n",
1364
+ " <tr>\n",
1365
+ " <th>12989</th>\n",
1366
+ " <td>rarely denzil washington make bad movie come t...</td>\n",
1367
+ " </tr>\n",
1368
+ " <tr>\n",
1369
+ " <th>31628</th>\n",
1370
+ " <td>think movie reasonbaly good kind of weird olse...</td>\n",
1371
+ " </tr>\n",
1372
+ " <tr>\n",
1373
+ " <th>12399</th>\n",
1374
+ " <td>movie is horrible wonderful time first saw yea...</td>\n",
1375
+ " </tr>\n",
1376
+ " <tr>\n",
1377
+ " <th>33230</th>\n",
1378
+ " <td>watching the bodyguard last night felt compell...</td>\n",
1379
+ " </tr>\n",
1380
+ " <tr>\n",
1381
+ " <th>...</th>\n",
1382
+ " <td>...</td>\n",
1383
+ " </tr>\n",
1384
+ " <tr>\n",
1385
+ " <th>31515</th>\n",
1386
+ " <td>good cast with one major exception pushes way ...</td>\n",
1387
+ " </tr>\n",
1388
+ " <tr>\n",
1389
+ " <th>19133</th>\n",
1390
+ " <td>seldom see short comments written imdb filmgoe...</td>\n",
1391
+ " </tr>\n",
1392
+ " <tr>\n",
1393
+ " <th>47930</th>\n",
1394
+ " <td>say without shadow doubt going overboard singl...</td>\n",
1395
+ " </tr>\n",
1396
+ " <tr>\n",
1397
+ " <th>35145</th>\n",
1398
+ " <td>wife watched dvring encore action past week wo...</td>\n",
1399
+ " </tr>\n",
1400
+ " <tr>\n",
1401
+ " <th>32654</th>\n",
1402
+ " <td>pokemon little three four episodes tv series s...</td>\n",
1403
+ " </tr>\n",
1404
+ " </tbody>\n",
1405
+ "</table>\n",
1406
+ "<p>39665 rows × 1 columns</p>\n",
1407
+ "</div>"
1408
+ ],
1409
+ "text/plain": [
1410
+ " review\n",
1411
+ "17185 watching avalon which decent nice digital fx s...\n",
1412
+ "12989 rarely denzil washington make bad movie come t...\n",
1413
+ "31628 think movie reasonbaly good kind of weird olse...\n",
1414
+ "12399 movie is horrible wonderful time first saw yea...\n",
1415
+ "33230 watching the bodyguard last night felt compell...\n",
1416
+ "... ...\n",
1417
+ "31515 good cast with one major exception pushes way ...\n",
1418
+ "19133 seldom see short comments written imdb filmgoe...\n",
1419
+ "47930 say without shadow doubt going overboard singl...\n",
1420
+ "35145 wife watched dvring encore action past week wo...\n",
1421
+ "32654 pokemon little three four episodes tv series s...\n",
1422
+ "\n",
1423
+ "[39665 rows x 1 columns]"
1424
+ ]
1425
+ },
1426
+ "execution_count": 37,
1427
+ "metadata": {},
1428
+ "output_type": "execute_result"
1429
+ }
1430
+ ],
1431
+ "source": [
1432
+ "x_train"
1433
+ ]
1434
+ },
1435
+ {
1436
+ "cell_type": "code",
1437
+ "execution_count": 38,
1438
+ "metadata": {
1439
+ "execution": {
1440
+ "iopub.execute_input": "2024-05-28T14:53:01.135399Z",
1441
+ "iopub.status.busy": "2024-05-28T14:53:01.135133Z",
1442
+ "iopub.status.idle": "2024-05-28T14:53:10.557535Z",
1443
+ "shell.execute_reply": "2024-05-28T14:53:10.556708Z",
1444
+ "shell.execute_reply.started": "2024-05-28T14:53:01.135377Z"
1445
+ }
1446
+ },
1447
+ "outputs": [],
1448
+ "source": [
1449
+ "x_train=cv.fit_transform(x_train['review']).toarray()\n",
1450
+ "x_test=cv.transform(x_test['review']).toarray()"
1451
+ ]
1452
+ },
1453
+ {
1454
+ "cell_type": "code",
1455
+ "execution_count": 39,
1456
+ "metadata": {
1457
+ "execution": {
1458
+ "iopub.execute_input": "2024-05-28T14:53:10.559169Z",
1459
+ "iopub.status.busy": "2024-05-28T14:53:10.558796Z",
1460
+ "iopub.status.idle": "2024-05-28T14:53:10.565563Z",
1461
+ "shell.execute_reply": "2024-05-28T14:53:10.564604Z",
1462
+ "shell.execute_reply.started": "2024-05-28T14:53:10.559135Z"
1463
+ }
1464
+ },
1465
+ "outputs": [
1466
+ {
1467
+ "data": {
1468
+ "text/plain": [
1469
+ "(39665, 10000)"
1470
+ ]
1471
+ },
1472
+ "execution_count": 39,
1473
+ "metadata": {},
1474
+ "output_type": "execute_result"
1475
+ }
1476
+ ],
1477
+ "source": [
1478
+ "x_train.shape"
1479
+ ]
1480
+ },
1481
+ {
1482
+ "cell_type": "markdown",
1483
+ "metadata": {},
1484
+ "source": [
1485
+ "# Applying Naive Bayes"
1486
+ ]
1487
+ },
1488
+ {
1489
+ "cell_type": "code",
1490
+ "execution_count": 40,
1491
+ "metadata": {
1492
+ "execution": {
1493
+ "iopub.execute_input": "2024-05-28T14:53:10.567583Z",
1494
+ "iopub.status.busy": "2024-05-28T14:53:10.566619Z",
1495
+ "iopub.status.idle": "2024-05-28T14:53:16.857125Z",
1496
+ "shell.execute_reply": "2024-05-28T14:53:16.856146Z",
1497
+ "shell.execute_reply.started": "2024-05-28T14:53:10.567557Z"
1498
+ }
1499
+ },
1500
+ "outputs": [
1501
+ {
1502
+ "data": {
1503
+ "text/html": [
1504
+ "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover 
label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>GaussianNB()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">GaussianNB</label><div class=\"sk-toggleable__content\"><pre>GaussianNB()</pre></div></div></div></div></div>"
1505
+ ],
1506
+ "text/plain": [
1507
+ "GaussianNB()"
1508
+ ]
1509
+ },
1510
+ "execution_count": 40,
1511
+ "metadata": {},
1512
+ "output_type": "execute_result"
1513
+ }
1514
+ ],
1515
+ "source": [
1516
+ "from sklearn.naive_bayes import GaussianNB\n",
1517
+ "gnb=GaussianNB()\n",
1518
+ "gnb.fit(x_train,y_train)"
1519
+ ]
1520
+ },
1521
+ {
1522
+ "cell_type": "code",
1523
+ "execution_count": 41,
1524
+ "metadata": {
1525
+ "execution": {
1526
+ "iopub.execute_input": "2024-05-28T14:53:16.860063Z",
1527
+ "iopub.status.busy": "2024-05-28T14:53:16.858411Z",
1528
+ "iopub.status.idle": "2024-05-28T14:53:18.286607Z",
1529
+ "shell.execute_reply": "2024-05-28T14:53:18.285709Z",
1530
+ "shell.execute_reply.started": "2024-05-28T14:53:16.860034Z"
1531
+ }
1532
+ },
1533
+ "outputs": [],
1534
+ "source": [
1535
+ "y_pred=gnb.predict(x_test)"
1536
+ ]
1537
+ },
1538
+ {
1539
+ "cell_type": "code",
1540
+ "execution_count": 42,
1541
+ "metadata": {
1542
+ "execution": {
1543
+ "iopub.execute_input": "2024-05-28T14:53:18.288847Z",
1544
+ "iopub.status.busy": "2024-05-28T14:53:18.288416Z",
1545
+ "iopub.status.idle": "2024-05-28T14:53:18.293343Z",
1546
+ "shell.execute_reply": "2024-05-28T14:53:18.292339Z",
1547
+ "shell.execute_reply.started": "2024-05-28T14:53:18.288789Z"
1548
+ }
1549
+ },
1550
+ "outputs": [],
1551
+ "source": [
1552
+ "from sklearn.metrics import accuracy_score,confusion_matrix"
1553
+ ]
1554
+ },
1555
+ {
1556
+ "cell_type": "code",
1557
+ "execution_count": 43,
1558
+ "metadata": {
1559
+ "execution": {
1560
+ "iopub.execute_input": "2024-05-28T14:53:18.295253Z",
1561
+ "iopub.status.busy": "2024-05-28T14:53:18.294781Z",
1562
+ "iopub.status.idle": "2024-05-28T14:53:18.305801Z",
1563
+ "shell.execute_reply": "2024-05-28T14:53:18.304884Z",
1564
+ "shell.execute_reply.started": "2024-05-28T14:53:18.295215Z"
1565
+ }
1566
+ },
1567
+ "outputs": [
1568
+ {
1569
+ "data": {
1570
+ "text/plain": [
1571
+ "0.7354038519713623"
1572
+ ]
1573
+ },
1574
+ "execution_count": 43,
1575
+ "metadata": {},
1576
+ "output_type": "execute_result"
1577
+ }
1578
+ ],
1579
+ "source": [
1580
+ "accuracy_score(y_test,y_pred)"
1581
+ ]
1582
+ },
1583
+ {
1584
+ "cell_type": "code",
1585
+ "execution_count": 44,
1586
+ "metadata": {
1587
+ "execution": {
1588
+ "iopub.execute_input": "2024-05-28T14:53:18.307732Z",
1589
+ "iopub.status.busy": "2024-05-28T14:53:18.307121Z",
1590
+ "iopub.status.idle": "2024-05-28T14:53:18.316221Z",
1591
+ "shell.execute_reply": "2024-05-28T14:53:18.315238Z",
1592
+ "shell.execute_reply.started": "2024-05-28T14:53:18.307697Z"
1593
+ }
1594
+ },
1595
+ "outputs": [
1596
+ {
1597
+ "data": {
1598
+ "text/plain": [
1599
+ "array([[4276, 664],\n",
1600
+ " [1960, 3017]])"
1601
+ ]
1602
+ },
1603
+ "execution_count": 44,
1604
+ "metadata": {},
1605
+ "output_type": "execute_result"
1606
+ }
1607
+ ],
1608
+ "source": [
1609
+ "confusion_matrix(y_test,y_pred)"
1610
+ ]
1611
+ },
1612
+ {
1613
+ "cell_type": "code",
1614
+ "execution_count": 45,
1615
+ "metadata": {
1616
+ "execution": {
1617
+ "iopub.execute_input": "2024-05-28T14:53:18.317650Z",
1618
+ "iopub.status.busy": "2024-05-28T14:53:18.317380Z",
1619
+ "iopub.status.idle": "2024-05-28T14:55:29.097680Z",
1620
+ "shell.execute_reply": "2024-05-28T14:55:29.096708Z",
1621
+ "shell.execute_reply.started": "2024-05-28T14:53:18.317628Z"
1622
+ }
1623
+ },
1624
+ "outputs": [
1625
+ {
1626
+ "data": {
1627
+ "text/plain": [
1628
+ "0.8426943632146818"
1629
+ ]
1630
+ },
1631
+ "execution_count": 45,
1632
+ "metadata": {},
1633
+ "output_type": "execute_result"
1634
+ }
1635
+ ],
1636
+ "source": [
1637
+ "from sklearn.ensemble import RandomForestClassifier\n",
1638
+ "rf=RandomForestClassifier()\n",
1639
+ "rf.fit(x_train,y_train)\n",
1640
+ "y_pred=rf.predict(x_test)\n",
1641
+ "accuracy_score(y_test,y_pred)"
1642
+ ]
1643
+ },
1644
+ {
1645
+ "cell_type": "code",
1646
+ "execution_count": 46,
1647
+ "metadata": {
1648
+ "execution": {
1649
+ "iopub.execute_input": "2024-05-28T14:55:29.099391Z",
1650
+ "iopub.status.busy": "2024-05-28T14:55:29.099103Z",
1651
+ "iopub.status.idle": "2024-05-28T14:55:29.108811Z",
1652
+ "shell.execute_reply": "2024-05-28T14:55:29.107863Z",
1653
+ "shell.execute_reply.started": "2024-05-28T14:55:29.099364Z"
1654
+ }
1655
+ },
1656
+ "outputs": [
1657
+ {
1658
+ "data": {
1659
+ "text/plain": [
1660
+ "array([[4152, 788],\n",
1661
+ " [ 772, 4205]])"
1662
+ ]
1663
+ },
1664
+ "execution_count": 46,
1665
+ "metadata": {},
1666
+ "output_type": "execute_result"
1667
+ }
1668
+ ],
1669
+ "source": [
1670
+ "confusion_matrix(y_test,y_pred)"
1671
+ ]
1672
+ },
1673
+ {
1674
+ "cell_type": "markdown",
1675
+ "metadata": {},
1676
+ "source": [
1677
+ "# N-Grams"
1678
+ ]
1679
+ },
1680
+ {
1681
+ "cell_type": "code",
1682
+ "execution_count": 47,
1683
+ "metadata": {
1684
+ "execution": {
1685
+ "iopub.execute_input": "2024-05-28T14:55:29.110803Z",
1686
+ "iopub.status.busy": "2024-05-28T14:55:29.110043Z",
1687
+ "iopub.status.idle": "2024-05-28T14:55:29.372362Z",
1688
+ "shell.execute_reply": "2024-05-28T14:55:29.371484Z",
1689
+ "shell.execute_reply.started": "2024-05-28T14:55:29.110765Z"
1690
+ }
1691
+ },
1692
+ "outputs": [],
1693
+ "source": [
1694
+ "x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=3,stratify=y)"
1695
+ ]
1696
+ },
1697
+ {
1698
+ "cell_type": "code",
1699
+ "execution_count": 48,
1700
+ "metadata": {
1701
+ "execution": {
1702
+ "iopub.execute_input": "2024-05-28T14:55:29.373855Z",
1703
+ "iopub.status.busy": "2024-05-28T14:55:29.373553Z",
1704
+ "iopub.status.idle": "2024-05-28T14:55:29.393260Z",
1705
+ "shell.execute_reply": "2024-05-28T14:55:29.392224Z",
1706
+ "shell.execute_reply.started": "2024-05-28T14:55:29.373828Z"
1707
+ }
1708
+ },
1709
+ "outputs": [],
1710
+ "source": [
1711
+ "cv=CountVectorizer(ngram_range=(1,2),max_features=10000)\n"
1712
+ ]
1713
+ },
1714
+ {
1715
+ "cell_type": "code",
1716
+ "execution_count": 49,
1717
+ "metadata": {
1718
+ "execution": {
1719
+ "iopub.execute_input": "2024-05-28T14:55:29.394786Z",
1720
+ "iopub.status.busy": "2024-05-28T14:55:29.394486Z",
1721
+ "iopub.status.idle": "2024-05-28T14:56:00.578967Z",
1722
+ "shell.execute_reply": "2024-05-28T14:56:00.577883Z",
1723
+ "shell.execute_reply.started": "2024-05-28T14:55:29.394758Z"
1724
+ }
1725
+ },
1726
+ "outputs": [],
1727
+ "source": [
1728
+ "x_train=cv.fit_transform(x_train['review']).toarray()\n",
1729
+ "x_test=cv.transform(x_test['review']).toarray()"
1730
+ ]
1731
+ },
1732
+ {
1733
+ "cell_type": "code",
1734
+ "execution_count": 50,
1735
+ "metadata": {
1736
+ "execution": {
1737
+ "iopub.execute_input": "2024-05-28T14:56:00.580808Z",
1738
+ "iopub.status.busy": "2024-05-28T14:56:00.580266Z",
1739
+ "iopub.status.idle": "2024-05-28T14:58:18.070996Z",
1740
+ "shell.execute_reply": "2024-05-28T14:58:18.069821Z",
1741
+ "shell.execute_reply.started": "2024-05-28T14:56:00.580771Z"
1742
+ }
1743
+ },
1744
+ "outputs": [
1745
+ {
1746
+ "data": {
1747
+ "text/plain": [
1748
+ "0.846324493294343"
1749
+ ]
1750
+ },
1751
+ "execution_count": 50,
1752
+ "metadata": {},
1753
+ "output_type": "execute_result"
1754
+ }
1755
+ ],
1756
+ "source": [
1757
+ "from sklearn.ensemble import RandomForestClassifier\n",
1758
+ "rf=RandomForestClassifier()\n",
1759
+ "rf.fit(x_train,y_train)\n",
1760
+ "y_pred=rf.predict(x_test)\n",
1761
+ "accuracy_score(y_test,y_pred)"
1762
+ ]
1763
+ },
1764
+ {
1765
+ "cell_type": "code",
1766
+ "execution_count": 51,
1767
+ "metadata": {
1768
+ "execution": {
1769
+ "iopub.execute_input": "2024-05-28T14:58:18.072639Z",
1770
+ "iopub.status.busy": "2024-05-28T14:58:18.072319Z",
1771
+ "iopub.status.idle": "2024-05-28T14:58:18.081205Z",
1772
+ "shell.execute_reply": "2024-05-28T14:58:18.080365Z",
1773
+ "shell.execute_reply.started": "2024-05-28T14:58:18.072613Z"
1774
+ }
1775
+ },
1776
+ "outputs": [
1777
+ {
1778
+ "data": {
1779
+ "text/plain": [
1780
+ "array([[4178, 762],\n",
1781
+ " [ 762, 4215]])"
1782
+ ]
1783
+ },
1784
+ "execution_count": 51,
1785
+ "metadata": {},
1786
+ "output_type": "execute_result"
1787
+ }
1788
+ ],
1789
+ "source": [
1790
+ "confusion_matrix(y_test,y_pred)"
1791
+ ]
1792
+ },
1793
+ {
1794
+ "cell_type": "markdown",
1795
+ "metadata": {},
1796
+ "source": [
1797
+ "# Saving and Loading"
1798
+ ]
1799
+ },
1800
+ {
1801
+ "cell_type": "code",
1802
+ "execution_count": 60,
1803
+ "metadata": {
1804
+ "execution": {
1805
+ "iopub.execute_input": "2024-05-28T15:01:45.937561Z",
1806
+ "iopub.status.busy": "2024-05-28T15:01:45.937238Z",
1807
+ "iopub.status.idle": "2024-05-28T15:01:46.088033Z",
1808
+ "shell.execute_reply": "2024-05-28T15:01:46.087204Z",
1809
+ "shell.execute_reply.started": "2024-05-28T15:01:45.937533Z"
1810
+ }
1811
+ },
1812
+ "outputs": [],
1813
+ "source": [
1814
+ "import pickle\n",
1815
+ "\n",
1816
+ "# save the trained sentiment classifier (random forest) as a pickle file\n",
1817
+ "model_pkl_file = \"Sentimental_Analysis1.pkl\" \n",
1818
+ "\n",
1819
+ "with open(model_pkl_file, 'wb') as file: \n",
1820
+ " pickle.dump(rf, file)"
1821
+ ]
1822
+ },
1823
+ {
1824
+ "cell_type": "code",
1825
+ "execution_count": 61,
1826
+ "metadata": {
1827
+ "execution": {
1828
+ "iopub.execute_input": "2024-05-28T15:01:46.090237Z",
1829
+ "iopub.status.busy": "2024-05-28T15:01:46.089930Z",
1830
+ "iopub.status.idle": "2024-05-28T15:01:46.801994Z",
1831
+ "shell.execute_reply": "2024-05-28T15:01:46.800807Z",
1832
+ "shell.execute_reply.started": "2024-05-28T15:01:46.090212Z"
1833
+ }
1834
+ },
1835
+ "outputs": [
1836
+ {
1837
+ "data": {
1838
+ "text/plain": [
1839
+ "0.844711102147827"
1840
+ ]
1841
+ },
1842
+ "execution_count": 61,
1843
+ "metadata": {},
1844
+ "output_type": "execute_result"
1845
+ }
1846
+ ],
1847
+ "source": [
1848
+ "with open(model_pkl_file, 'rb') as file: \n",
1849
+ " rf = pickle.load(file)\n",
1850
+ "y_pred=rf.predict(x_test)\n",
1851
+ "accuracy_score(y_test,y_pred)"
1852
+ ]
1853
+ },
1854
+ {
1855
+ "cell_type": "code",
1856
+ "execution_count": 54,
1857
+ "metadata": {
1858
+ "execution": {
1859
+ "iopub.execute_input": "2024-05-28T14:58:19.511224Z",
1860
+ "iopub.status.busy": "2024-05-28T14:58:19.510906Z",
1861
+ "iopub.status.idle": "2024-05-28T14:58:19.785147Z",
1862
+ "shell.execute_reply": "2024-05-28T14:58:19.784066Z",
1863
+ "shell.execute_reply.started": "2024-05-28T14:58:19.511197Z"
1864
+ }
1865
+ },
1866
+ "outputs": [],
1867
+ "source": [
1868
+ "x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=3,stratify=y)"
1869
+ ]
1870
+ },
1871
+ {
1872
+ "cell_type": "markdown",
1873
+ "metadata": {},
1874
+ "source": [
1875
+ "# TF_IDF"
1876
+ ]
1877
+ },
1878
+ {
1879
+ "cell_type": "code",
1880
+ "execution_count": 55,
1881
+ "metadata": {
1882
+ "execution": {
1883
+ "iopub.execute_input": "2024-05-28T14:58:19.787439Z",
1884
+ "iopub.status.busy": "2024-05-28T14:58:19.786737Z",
1885
+ "iopub.status.idle": "2024-05-28T14:58:19.792198Z",
1886
+ "shell.execute_reply": "2024-05-28T14:58:19.791132Z",
1887
+ "shell.execute_reply.started": "2024-05-28T14:58:19.787387Z"
1888
+ }
1889
+ },
1890
+ "outputs": [],
1891
+ "source": [
1892
+ "from sklearn.feature_extraction.text import TfidfVectorizer"
1893
+ ]
1894
+ },
1895
+ {
1896
+ "cell_type": "code",
1897
+ "execution_count": 56,
1898
+ "metadata": {
1899
+ "execution": {
1900
+ "iopub.execute_input": "2024-05-28T14:58:53.029215Z",
1901
+ "iopub.status.busy": "2024-05-28T14:58:53.028431Z",
1902
+ "iopub.status.idle": "2024-05-28T14:58:53.033696Z",
1903
+ "shell.execute_reply": "2024-05-28T14:58:53.032603Z",
1904
+ "shell.execute_reply.started": "2024-05-28T14:58:53.029178Z"
1905
+ }
1906
+ },
1907
+ "outputs": [],
1908
+ "source": [
1909
+ "tfidf=TfidfVectorizer(max_features=10000)"
1910
+ ]
1911
+ },
1912
+ {
1913
+ "cell_type": "code",
1914
+ "execution_count": 57,
1915
+ "metadata": {
1916
+ "execution": {
1917
+ "iopub.execute_input": "2024-05-28T14:58:58.538248Z",
1918
+ "iopub.status.busy": "2024-05-28T14:58:58.537480Z",
1919
+ "iopub.status.idle": "2024-05-28T14:59:09.408751Z",
1920
+ "shell.execute_reply": "2024-05-28T14:59:09.407944Z",
1921
+ "shell.execute_reply.started": "2024-05-28T14:58:58.538211Z"
1922
+ }
1923
+ },
1924
+ "outputs": [],
1925
+ "source": [
1926
+ "x_train=tfidf.fit_transform(x_train['review']).toarray()\n",
1927
+ "x_test=tfidf.transform(x_test['review'])"
1928
+ ]
1929
+ },
1930
+ {
1931
+ "cell_type": "code",
1932
+ "execution_count": 58,
1933
+ "metadata": {
1934
+ "execution": {
1935
+ "iopub.execute_input": "2024-05-28T14:59:18.888515Z",
1936
+ "iopub.status.busy": "2024-05-28T14:59:18.888049Z",
1937
+ "iopub.status.idle": "2024-05-28T15:01:45.924455Z",
1938
+ "shell.execute_reply": "2024-05-28T15:01:45.923366Z",
1939
+ "shell.execute_reply.started": "2024-05-28T14:59:18.888481Z"
1940
+ }
1941
+ },
1942
+ "outputs": [
1943
+ {
1944
+ "data": {
1945
+ "text/plain": [
1946
+ "0.844711102147827"
1947
+ ]
1948
+ },
1949
+ "execution_count": 58,
1950
+ "metadata": {},
1951
+ "output_type": "execute_result"
1952
+ }
1953
+ ],
1954
+ "source": [
1955
+ "rf=RandomForestClassifier()\n",
1956
+ "rf.fit(x_train,y_train)\n",
1957
+ "y_pred=rf.predict(x_test)\n",
1958
+ "accuracy_score(y_test,y_pred)"
1959
+ ]
1960
+ },
1961
+ {
1962
+ "cell_type": "code",
1963
+ "execution_count": 59,
1964
+ "metadata": {
1965
+ "execution": {
1966
+ "iopub.execute_input": "2024-05-28T15:01:45.926453Z",
1967
+ "iopub.status.busy": "2024-05-28T15:01:45.926143Z",
1968
+ "iopub.status.idle": "2024-05-28T15:01:45.935972Z",
1969
+ "shell.execute_reply": "2024-05-28T15:01:45.934872Z",
1970
+ "shell.execute_reply.started": "2024-05-28T15:01:45.926419Z"
1971
+ }
1972
+ },
1973
+ "outputs": [
1974
+ {
1975
+ "data": {
1976
+ "text/plain": [
1977
+ "array([[4182, 758],\n",
1978
+ " [ 782, 4195]])"
1979
+ ]
1980
+ },
1981
+ "execution_count": 59,
1982
+ "metadata": {},
1983
+ "output_type": "execute_result"
1984
+ }
1985
+ ],
1986
+ "source": [
1987
+ "confusion_matrix(y_test,y_pred)"
1988
+ ]
1989
+ },
1990
+ {
1991
+ "cell_type": "code",
1992
+ "execution_count": 62,
1993
+ "metadata": {
1994
+ "execution": {
1995
+ "iopub.execute_input": "2024-05-28T15:01:46.804239Z",
1996
+ "iopub.status.busy": "2024-05-28T15:01:46.803156Z",
1997
+ "iopub.status.idle": "2024-05-28T15:01:47.032951Z",
1998
+ "shell.execute_reply": "2024-05-28T15:01:47.031995Z",
1999
+ "shell.execute_reply.started": "2024-05-28T15:01:46.804189Z"
2000
+ }
2001
+ },
2002
+ "outputs": [],
2003
+ "source": [
2004
+ "x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=3,stratify=y)"
2005
+ ]
2006
+ },
2007
+ {
2008
+ "cell_type": "markdown",
2009
+ "metadata": {},
2010
+ "source": [
2011
+ "# Word2Vec"
2012
+ ]
2013
+ },
2014
+ {
2015
+ "cell_type": "code",
2016
+ "execution_count": 64,
2017
+ "metadata": {
2018
+ "execution": {
2019
+ "iopub.execute_input": "2024-05-28T15:04:53.247003Z",
2020
+ "iopub.status.busy": "2024-05-28T15:04:53.246571Z",
2021
+ "iopub.status.idle": "2024-05-28T15:05:04.199287Z",
2022
+ "shell.execute_reply": "2024-05-28T15:05:04.198486Z",
2023
+ "shell.execute_reply.started": "2024-05-28T15:04:53.246970Z"
2024
+ }
2025
+ },
2026
+ "outputs": [],
2027
+ "source": [
2028
+ "import gensim"
2029
+ ]
2030
+ },
2031
+ {
2032
+ "cell_type": "code",
2033
+ "execution_count": 65,
2034
+ "metadata": {
2035
+ "execution": {
2036
+ "iopub.execute_input": "2024-05-28T15:05:48.076456Z",
2037
+ "iopub.status.busy": "2024-05-28T15:05:48.076082Z",
2038
+ "iopub.status.idle": "2024-05-28T15:05:48.080852Z",
2039
+ "shell.execute_reply": "2024-05-28T15:05:48.079878Z",
2040
+ "shell.execute_reply.started": "2024-05-28T15:05:48.076427Z"
2041
+ }
2042
+ },
2043
+ "outputs": [],
2044
+ "source": [
2045
+ "from nltk import sent_tokenize\n",
2046
+ "from gensim.utils import simple_preprocess"
2047
+ ]
2048
+ },
2049
+ {
2050
+ "cell_type": "code",
2051
+ "execution_count": 66,
2052
+ "metadata": {
2053
+ "execution": {
2054
+ "iopub.execute_input": "2024-05-28T15:07:37.532271Z",
2055
+ "iopub.status.busy": "2024-05-28T15:07:37.531546Z",
2056
+ "iopub.status.idle": "2024-05-28T15:08:02.872926Z",
2057
+ "shell.execute_reply": "2024-05-28T15:08:02.871888Z",
2058
+ "shell.execute_reply.started": "2024-05-28T15:07:37.532232Z"
2059
+ }
2060
+ },
2061
+ "outputs": [],
2062
+ "source": [
2063
+ "story=[]\n",
2064
+ "for doc in df['review']:\n",
2065
+ " raw_sent=sent_tokenize(doc)\n",
2066
+ " for sent in raw_sent:\n",
2067
+ " story.append(simple_preprocess(sent))"
2068
+ ]
2069
+ },
2070
+ {
2071
+ "cell_type": "code",
2072
+ "execution_count": 67,
2073
+ "metadata": {
2074
+ "execution": {
2075
+ "iopub.execute_input": "2024-05-28T15:08:40.812263Z",
2076
+ "iopub.status.busy": "2024-05-28T15:08:40.811483Z",
2077
+ "iopub.status.idle": "2024-05-28T15:08:40.817557Z",
2078
+ "shell.execute_reply": "2024-05-28T15:08:40.816616Z",
2079
+ "shell.execute_reply.started": "2024-05-28T15:08:40.812227Z"
2080
+ }
2081
+ },
2082
+ "outputs": [],
2083
+ "source": [
2084
+ "model=gensim.models.Word2Vec(\n",
2085
+ "window=10,min_count=2)"
2086
+ ]
2087
+ },
2088
+ {
2089
+ "cell_type": "code",
2090
+ "execution_count": 68,
2091
+ "metadata": {
2092
+ "execution": {
2093
+ "iopub.execute_input": "2024-05-28T15:09:01.856845Z",
2094
+ "iopub.status.busy": "2024-05-28T15:09:01.855976Z",
2095
+ "iopub.status.idle": "2024-05-28T15:09:05.537674Z",
2096
+ "shell.execute_reply": "2024-05-28T15:09:05.536873Z",
2097
+ "shell.execute_reply.started": "2024-05-28T15:09:01.856798Z"
2098
+ }
2099
+ },
2100
+ "outputs": [],
2101
+ "source": [
2102
+ "model.build_vocab(story)"
2103
+ ]
2104
+ },
2105
+ {
2106
+ "cell_type": "code",
2107
+ "execution_count": 69,
2108
+ "metadata": {
2109
+ "execution": {
2110
+ "iopub.execute_input": "2024-05-28T15:10:20.091520Z",
2111
+ "iopub.status.busy": "2024-05-28T15:10:20.091143Z",
2112
+ "iopub.status.idle": "2024-05-28T15:10:51.764165Z",
2113
+ "shell.execute_reply": "2024-05-28T15:10:51.763105Z",
2114
+ "shell.execute_reply.started": "2024-05-28T15:10:20.091491Z"
2115
+ }
2116
+ },
2117
+ "outputs": [
2118
+ {
2119
+ "data": {
2120
+ "text/plain": [
2121
+ "(28382867, 30062525)"
2122
+ ]
2123
+ },
2124
+ "execution_count": 69,
2125
+ "metadata": {},
2126
+ "output_type": "execute_result"
2127
+ }
2128
+ ],
2129
+ "source": [
2130
+ "model.train(story,total_examples=model.corpus_count,epochs=model.epochs)"
2131
+ ]
2132
+ },
2133
+ {
2134
+ "cell_type": "code",
2135
+ "execution_count": 70,
2136
+ "metadata": {
2137
+ "execution": {
2138
+ "iopub.execute_input": "2024-05-28T15:11:03.211080Z",
2139
+ "iopub.status.busy": "2024-05-28T15:11:03.210673Z",
2140
+ "iopub.status.idle": "2024-05-28T15:11:03.218564Z",
2141
+ "shell.execute_reply": "2024-05-28T15:11:03.217552Z",
2142
+ "shell.execute_reply.started": "2024-05-28T15:11:03.211047Z"
2143
+ }
2144
+ },
2145
+ "outputs": [
2146
+ {
2147
+ "data": {
2148
+ "text/plain": [
2149
+ "79870"
2150
+ ]
2151
+ },
2152
+ "execution_count": 70,
2153
+ "metadata": {},
2154
+ "output_type": "execute_result"
2155
+ }
2156
+ ],
2157
+ "source": [
2158
+ "len(model.wv.index_to_key)"
2159
+ ]
2160
+ },
2161
+ {
2162
+ "cell_type": "code",
2163
+ "execution_count": 71,
2164
+ "metadata": {
2165
+ "execution": {
2166
+ "iopub.execute_input": "2024-05-28T15:13:11.657877Z",
2167
+ "iopub.status.busy": "2024-05-28T15:13:11.657479Z",
2168
+ "iopub.status.idle": "2024-05-28T15:13:11.663513Z",
2169
+ "shell.execute_reply": "2024-05-28T15:13:11.662556Z",
2170
+ "shell.execute_reply.started": "2024-05-28T15:13:11.657844Z"
2171
+ }
2172
+ },
2173
+ "outputs": [],
2174
+ "source": [
2175
+ "def dec_vector(doc):\n",
2176
+ " doc=[word for word in doc.split() if word in model.wv.index_to_key]\n",
2177
+ " return np.mean(model.wv[doc],axis=0)"
2178
+ ]
2179
+ },
2180
+ {
2181
+ "cell_type": "code",
2182
+ "execution_count": 72,
2183
+ "metadata": {
2184
+ "execution": {
2185
+ "iopub.execute_input": "2024-05-28T15:14:29.737526Z",
2186
+ "iopub.status.busy": "2024-05-28T15:14:29.736457Z",
2187
+ "iopub.status.idle": "2024-05-28T15:14:29.742036Z",
2188
+ "shell.execute_reply": "2024-05-28T15:14:29.740881Z",
2189
+ "shell.execute_reply.started": "2024-05-28T15:14:29.737484Z"
2190
+ }
2191
+ },
2192
+ "outputs": [],
2193
+ "source": [
2194
+ "from tqdm import tqdm"
2195
+ ]
2196
+ },
2197
+ {
2198
+ "cell_type": "code",
2199
+ "execution_count": 74,
2200
+ "metadata": {
2201
+ "execution": {
2202
+ "iopub.execute_input": "2024-05-28T15:16:04.216141Z",
2203
+ "iopub.status.busy": "2024-05-28T15:16:04.215772Z",
2204
+ "iopub.status.idle": "2024-05-28T15:35:52.614033Z",
2205
+ "shell.execute_reply": "2024-05-28T15:35:52.613102Z",
2206
+ "shell.execute_reply.started": "2024-05-28T15:16:04.216114Z"
2207
+ }
2208
+ },
2209
+ "outputs": [
2210
+ {
2211
+ "name": "stderr",
2212
+ "output_type": "stream",
2213
+ "text": [
2214
+ "100%|██████████| 49582/49582 [19:48<00:00, 41.72it/s]\n"
2215
+ ]
2216
+ }
2217
+ ],
2218
+ "source": [
2219
+ "X=[]\n",
2220
+ "for doc in tqdm(df['review'].values):\n",
2221
+ " X.append(dec_vector(doc))\n",
2222
+ " "
2223
+ ]
2224
+ },
2225
+ {
2226
+ "cell_type": "code",
2227
+ "execution_count": 75,
2228
+ "metadata": {
2229
+ "execution": {
2230
+ "iopub.execute_input": "2024-05-28T15:35:52.757355Z",
2231
+ "iopub.status.busy": "2024-05-28T15:35:52.756711Z",
2232
+ "iopub.status.idle": "2024-05-28T15:35:52.801878Z",
2233
+ "shell.execute_reply": "2024-05-28T15:35:52.800886Z",
2234
+ "shell.execute_reply.started": "2024-05-28T15:35:52.757317Z"
2235
+ }
2236
+ },
2237
+ "outputs": [],
2238
+ "source": [
2239
+ "X=np.array(X)"
2240
+ ]
2241
+ },
2242
+ {
2243
+ "cell_type": "code",
2244
+ "execution_count": 76,
2245
+ "metadata": {
2246
+ "execution": {
2247
+ "iopub.execute_input": "2024-05-28T15:35:52.992514Z",
2248
+ "iopub.status.busy": "2024-05-28T15:35:52.992157Z",
2249
+ "iopub.status.idle": "2024-05-28T15:35:52.999577Z",
2250
+ "shell.execute_reply": "2024-05-28T15:35:52.998462Z",
2251
+ "shell.execute_reply.started": "2024-05-28T15:35:52.992480Z"
2252
+ }
2253
+ },
2254
+ "outputs": [
2255
+ {
2256
+ "data": {
2257
+ "text/plain": [
2258
+ "(49582, 100)"
2259
+ ]
2260
+ },
2261
+ "execution_count": 76,
2262
+ "metadata": {},
2263
+ "output_type": "execute_result"
2264
+ }
2265
+ ],
2266
+ "source": [
2267
+ "X.shape"
2268
+ ]
2269
+ },
2270
+ {
2271
+ "cell_type": "code",
2272
+ "execution_count": 77,
2273
+ "metadata": {
2274
+ "execution": {
2275
+ "iopub.execute_input": "2024-05-28T15:35:53.001378Z",
2276
+ "iopub.status.busy": "2024-05-28T15:35:53.000874Z",
2277
+ "iopub.status.idle": "2024-05-28T15:35:53.008752Z",
2278
+ "shell.execute_reply": "2024-05-28T15:35:53.007881Z",
2279
+ "shell.execute_reply.started": "2024-05-28T15:35:53.001350Z"
2280
+ }
2281
+ },
2282
+ "outputs": [
2283
+ {
2284
+ "data": {
2285
+ "text/plain": [
2286
+ "array([1, 1, 1, ..., 0, 0, 0])"
2287
+ ]
2288
+ },
2289
+ "execution_count": 77,
2290
+ "metadata": {},
2291
+ "output_type": "execute_result"
2292
+ }
2293
+ ],
2294
+ "source": [
2295
+ "y"
2296
+ ]
2297
+ },
2298
+ {
2299
+ "cell_type": "code",
2300
+ "execution_count": 78,
2301
+ "metadata": {
2302
+ "execution": {
2303
+ "iopub.execute_input": "2024-05-28T15:35:53.010300Z",
2304
+ "iopub.status.busy": "2024-05-28T15:35:53.009962Z",
2305
+ "iopub.status.idle": "2024-05-28T15:35:53.046198Z",
2306
+ "shell.execute_reply": "2024-05-28T15:35:53.045411Z",
2307
+ "shell.execute_reply.started": "2024-05-28T15:35:53.010269Z"
2308
+ }
2309
+ },
2310
+ "outputs": [],
2311
+ "source": [
2312
+ "x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=3,stratify=y)"
2313
+ ]
2314
+ },
2315
+ {
2316
+ "cell_type": "code",
2317
+ "execution_count": 79,
2318
+ "metadata": {
2319
+ "execution": {
2320
+ "iopub.execute_input": "2024-05-28T15:35:53.047433Z",
2321
+ "iopub.status.busy": "2024-05-28T15:35:53.047187Z",
2322
+ "iopub.status.idle": "2024-05-28T15:36:36.307334Z",
2323
+ "shell.execute_reply": "2024-05-28T15:36:36.306297Z",
2324
+ "shell.execute_reply.started": "2024-05-28T15:35:53.047411Z"
2325
+ }
2326
+ },
2327
+ "outputs": [
2328
+ {
2329
+ "data": {
2330
+ "text/plain": [
2331
+ "0.8395684178683069"
2332
+ ]
2333
+ },
2334
+ "execution_count": 79,
2335
+ "metadata": {},
2336
+ "output_type": "execute_result"
2337
+ }
2338
+ ],
2339
+ "source": [
2340
+ "rf=RandomForestClassifier()\n",
2341
+ "rf.fit(x_train,y_train)\n",
2342
+ "y_pred=rf.predict(x_test)\n",
2343
+ "accuracy_score(y_test,y_pred)"
2344
+ ]
2345
+ },
2346
+ {
2347
+ "cell_type": "code",
2348
+ "execution_count": 80,
2349
+ "metadata": {
2350
+ "execution": {
2351
+ "iopub.execute_input": "2024-05-28T15:36:36.308713Z",
2352
+ "iopub.status.busy": "2024-05-28T15:36:36.308416Z",
2353
+ "iopub.status.idle": "2024-05-28T15:36:36.317103Z",
2354
+ "shell.execute_reply": "2024-05-28T15:36:36.316187Z",
2355
+ "shell.execute_reply.started": "2024-05-28T15:36:36.308682Z"
2356
+ }
2357
+ },
2358
+ "outputs": [
2359
+ {
2360
+ "data": {
2361
+ "text/plain": [
2362
+ "array([[4011, 929],\n",
2363
+ " [ 662, 4315]])"
2364
+ ]
2365
+ },
2366
+ "execution_count": 80,
2367
+ "metadata": {},
2368
+ "output_type": "execute_result"
2369
+ }
2370
+ ],
2371
+ "source": [
2372
+ "confusion_matrix(y_test,y_pred)"
2373
+ ]
2374
+ },
2375
+ {
2376
+ "cell_type": "code",
2377
+ "execution_count": 81,
2378
+ "metadata": {
2379
+ "execution": {
2380
+ "iopub.execute_input": "2024-05-28T15:36:36.319780Z",
2381
+ "iopub.status.busy": "2024-05-28T15:36:36.319506Z",
2382
+ "iopub.status.idle": "2024-05-28T15:36:37.272618Z",
2383
+ "shell.execute_reply": "2024-05-28T15:36:37.271741Z",
2384
+ "shell.execute_reply.started": "2024-05-28T15:36:36.319756Z"
2385
+ }
2386
+ },
2387
+ "outputs": [],
2388
+ "source": [
2389
+ "model_pkl_file = \"Sentimental_Analysis_Word2Vec.pkl\" \n",
2390
+ "\n",
2391
+ "with open(model_pkl_file, 'wb') as file: \n",
2392
+ " pickle.dump(rf, file)"
2393
+ ]
2394
+ },
2395
+ {
2396
+ "cell_type": "code",
2397
+ "execution_count": null,
2398
+ "metadata": {},
2399
+ "outputs": [],
2400
+ "source": []
2401
+ }
2402
+ ],
2403
+ "metadata": {
2404
+ "kaggle": {
2405
+ "accelerator": "nvidiaTeslaT4",
2406
+ "dataSources": [
2407
+ {
2408
+ "datasetId": 134715,
2409
+ "sourceId": 320111,
2410
+ "sourceType": "datasetVersion"
2411
+ }
2412
+ ],
2413
+ "dockerImageVersionId": 30699,
2414
+ "isGpuEnabled": true,
2415
+ "isInternetEnabled": true,
2416
+ "language": "python",
2417
+ "sourceType": "notebook"
2418
+ },
2419
+ "kernelspec": {
2420
+ "display_name": "Python 3 (ipykernel)",
2421
+ "language": "python",
2422
+ "name": "python3"
2423
+ },
2424
+ "language_info": {
2425
+ "codemirror_mode": {
2426
+ "name": "ipython",
2427
+ "version": 3
2428
+ },
2429
+ "file_extension": ".py",
2430
+ "mimetype": "text/x-python",
2431
+ "name": "python",
2432
+ "nbconvert_exporter": "python",
2433
+ "pygments_lexer": "ipython3",
2434
+ "version": "3.11.0"
2435
+ }
2436
+ },
2437
+ "nbformat": 4,
2438
+ "nbformat_minor": 4
2439
+ }
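Note on the notebook's evaluation cells: they report only accuracy and a raw confusion matrix. The sketch below derives precision, recall, and F1 from the Word2Vec random forest's matrix. It assumes scikit-learn's layout (rows are true labels, columns are predictions) and treats label 1 as the positive class, which matches how `x.py` maps predictions to "Positive". The helper name `binary_metrics` is illustrative and not part of the commit.

```python
# Confusion matrix reported for the Word2Vec random forest in the notebook.
# Assumption: scikit-learn layout, rows = true labels, columns = predictions,
# with label 1 treated as the positive class.
matrix = [[4011, 929],
          [662, 4315]]


def binary_metrics(cm):
    """Return accuracy, precision, recall and F1 for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics(matrix)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The accuracy computed this way agrees with the 0.8396 reported by `accuracy_score` for the same cell. Under the stated layout, recall on the positive class comes out higher than precision, so this model appears to over-predict "Positive" slightly.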
packages.txt ADDED
@@ -0,0 +1 @@
+ libgl1
requirements.txt ADDED
@@ -0,0 +1,10 @@
+
+
+ scikit-learn==1.2.2
+ numpy
+ pandas
+ streamlit
+ nltk
+ contractions
+ gensim
+ scipy==1.12
x.py ADDED
@@ -0,0 +1,77 @@
+
+ import os
+ import pandas as pd
+ import numpy as np
+ import streamlit as st
+ import re
+ import pickle
+ def remove_tags(text):
+     return re.sub(re.compile('<.*?>'),'',text)
+
+ def lwr(text):
+     return text.lower()
+
+ import nltk
+
+ nltk.download("stopwords")
+ from nltk.corpus import stopwords
+ sw_list=stopwords.words('english')
+
+ def stopword(text):
+     return " ".join([word for word in text.split() if word not in sw_list])
+
+ import string
+ def remove_punctuation(text):
+     return text.translate(str.maketrans('', '', string.punctuation))
+
+ import contractions
+ def remove_contractions(text):
+     return contractions.fix(text)
+
+ def dec_vector(doc):
+     with open("Sentimental_Analysis_WV.pkl", 'rb') as file:
+         model = pickle.load(file)
+     doc=[word for word in doc.split() if word in model.wv.key_to_index]  # dict lookup instead of scanning the index_to_key list
+     return np.mean(model.wv[doc],axis=0)
+
+ def xvalue(text):
+     X=[]
+     X.append(dec_vector(text))
+     return X
+
+ def preprocessed(text):
+
+     text=remove_tags(text)
+     text=lwr(text)
+     text=stopword(text)
+     text=remove_punctuation(text)
+     text=remove_contractions(text)
+     X=xvalue(text)
+     X=np.array(X)
+     return X
+
+ def clear_text():
+     st.session_state["text"] = ""
+
+
+ def main():
+
+
+     with open("Sentimental_Analysis_Word2Vec.pkl", 'rb') as file1:
+         rf = pickle.load(file1)
+     st.title('Sentiment Analysis')
+
+     text = st.text_input(
+         "Enter some text 👇", key="text")
+
+     if st.button('Classify'):
+         z=preprocessed(text)
+         if rf.predict(z)[0]==1:
+             st.success("Positive")
+         else:
+             st.success("Negative")
+     st.button("Clear", on_click=clear_text)
+
+
+ if __name__=='__main__':
+     main()
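Note on `dec_vector` in `x.py`: it averages the vectors of the in-vocabulary tokens, so an input that is empty or consists only of unknown words leaves nothing to average. The sketch below shows one possible guard, using a plain dictionary as a hypothetical stand-in for the pickled gensim vectors. The names `TOY_VECTORS`, `VECTOR_SIZE`, and `doc_vector` are illustrative and do not appear in the commit, and returning a zero vector is only one option among several.

```python
from typing import Dict, List

# Hypothetical stand-in for the trained embeddings: token -> vector.
# In x.py the vectors come from the pickled gensim model instead.
TOY_VECTORS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3


def doc_vector(doc: str, vectors: Dict[str, List[float]], size: int) -> List[float]:
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in doc.split() if tok in vectors]
    if not known:
        # Nothing to average (empty input or only out-of-vocabulary tokens).
        return [0.0] * size
    # zip(*known) walks the vectors column by column.
    return [sum(column) / len(known) for column in zip(*known)]


print(doc_vector("good movie", TOY_VECTORS, VECTOR_SIZE))
print(doc_vector("zzz", TOY_VECTORS, VECTOR_SIZE))
```

A zero vector keeps the app from failing, but the classifier was presumably never trained on such inputs, so an app might prefer to show a "not enough recognizable words" message instead of a prediction in that case.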