lynn-twinkl committed on
Commit
26ad793
·
1 Parent(s): 089a585

First commit

notebooks/heartfelt_model_training.ipynb ADDED
@@ -0,0 +1,611 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "94f179d1-d39e-4fcd-83da-a879c3aa641a",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 1. Config"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "code",
13
+ "execution_count": 1,
14
+ "id": "0e81b039-7320-47a8-8b18-7cdf03a5b0a5",
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "import pandas as pd\n",
19
+ "\n",
20
+ "pd.set_option('display.max_colwidth', 500)"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": 2,
26
+ "id": "fe74a27c-0ecf-499d-afaf-980fb80b463a",
27
+ "metadata": {},
28
+ "outputs": [
29
+ {
30
+ "name": "stdout",
31
+ "output_type": "stream",
32
+ "text": [
33
+ "131\n"
34
+ ]
35
+ },
36
+ {
37
+ "data": {
38
+ "text/html": [
39
+ "<div>\n",
40
+ "<style scoped>\n",
41
+ " .dataframe tbody tr th:only-of-type {\n",
42
+ " vertical-align: middle;\n",
43
+ " }\n",
44
+ "\n",
45
+ " .dataframe tbody tr th {\n",
46
+ " vertical-align: top;\n",
47
+ " }\n",
48
+ "\n",
49
+ " .dataframe thead th {\n",
50
+ " text-align: right;\n",
51
+ " }\n",
52
+ "</style>\n",
53
+ "<table border=\"1\" class=\"dataframe\">\n",
54
+ " <thead>\n",
55
+ " <tr style=\"text-align: right;\">\n",
56
+ " <th></th>\n",
57
+ " <th>text</th>\n",
58
+ " <th>is_heartfelt</th>\n",
59
+ " </tr>\n",
60
+ " </thead>\n",
61
+ " <tbody>\n",
62
+ " <tr>\n",
63
+ " <th>0</th>\n",
64
+ " <td>I would love for our school to be considered for some gardening equipment. We lost a member of staff in February who had been at our school for 30 years and she was heavily involved with teaching the children in our inner city school about the environment and gardening. Since her death we have had countless children asking if we can use a piece of small ground in the playground and use it to her memory and plant seeds and make it a happy place. Mrs.Upham used to run a gardening club but she ...</td>\n",
65
+ " <td>True</td>\n",
66
+ " </tr>\n",
67
+ " <tr>\n",
68
+ " <th>1</th>\n",
69
+ " <td>Our year 1 children have been inspired by a book called Omar, the Bees and Me, where two children from different backgrounds bond over their interest in bees and decide to make a 'bee corridor' between their school and the local park. They send out envelopes of wildflower seeds to every house and building along the route and by the time the late spring comes, the whole neighborhood is alive with flowers and insects. Our year 1 classes would love to do something like this in our community. We...</td>\n",
70
+ " <td>True</td>\n",
71
+ " </tr>\n",
72
+ " <tr>\n",
73
+ " <th>2</th>\n",
74
+ " <td>I am the SEN and Pastoral Lead in my school, Derryhale Primary School, Co Armagh N Ireland. We recently lost a valued member of our PTA team, a lovely mummy of two children, a little girl in Y2, and a little boy in Y4. This lovely lady was just 31 years old and lost her fight for life to cancer. Budget is tight in school. We are a small rural school of 76 pupils and everyone has been effected by this loss. All the children in our care worry now that they will loose a family member and pasto...</td>\n",
75
+ " <td>True</td>\n",
76
+ " </tr>\n",
77
+ " </tbody>\n",
78
+ "</table>\n",
79
+ "</div>"
80
+ ],
81
+ "text/plain": [
82
+ " text \\\n",
83
+ "0 I would love for our school to be considered for some gardening equipment. We lost a member of staff in February who had been at our school for 30 years and she was heavily involved with teaching the children in our inner city school about the environment and gardening. Since her death we have had countless children asking if we can use a piece of small ground in the playground and use it to her memory and plant seeds and make it a happy place. Mrs.Upham used to run a gardening club but she ... \n",
84
+ "1 Our year 1 children have been inspired by a book called Omar, the Bees and Me, where two children from different backgrounds bond over their interest in bees and decide to make a 'bee corridor' between their school and the local park. They send out envelopes of wildflower seeds to every house and building along the route and by the time the late spring comes, the whole neighborhood is alive with flowers and insects. Our year 1 classes would love to do something like this in our community. We... \n",
85
+ "2 I am the SEN and Pastoral Lead in my school, Derryhale Primary School, Co Armagh N Ireland. We recently lost a valued member of our PTA team, a lovely mummy of two children, a little girl in Y2, and a little boy in Y4. This lovely lady was just 31 years old and lost her fight for life to cancer. Budget is tight in school. We are a small rural school of 76 pupils and everyone has been effected by this loss. All the children in our care worry now that they will loose a family member and pasto... \n",
86
+ "\n",
87
+ " is_heartfelt \n",
88
+ "0 True \n",
89
+ "1 True \n",
90
+ "2 True "
91
+ ]
92
+ },
93
+ "execution_count": 2,
94
+ "metadata": {},
95
+ "output_type": "execute_result"
96
+ }
97
+ ],
98
+ "source": [
99
+ "raw_df = pd.read_csv('data/exports/combined_heartfelt_data.csv')\n",
100
+ "\n",
101
+ "print(len(raw_df))\n",
102
+ "raw_df.head(3)"
103
+ ]
104
+ },
105
+ {
106
+ "cell_type": "markdown",
107
+ "id": "44efbc86-827f-43a2-b9fa-046878e9243d",
108
+ "metadata": {},
109
+ "source": [
110
+ "# 2. Data Preprocessing"
111
+ ]
112
+ },
113
+ {
114
+ "cell_type": "markdown",
115
+ "id": "c609ca9c-4fb5-4b8b-9734-712a4c67ce6c",
116
+ "metadata": {},
117
+ "source": [
118
+ "## 2.1 Normalising Text"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "code",
123
+ "execution_count": 3,
124
+ "id": "6a0474b4-9a0c-4f27-8700-f9331c225210",
125
+ "metadata": {},
126
+ "outputs": [],
127
+ "source": [
128
+ "import string"
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "code",
133
+ "execution_count": 4,
134
+ "id": "6340d648-b2fb-4af2-8fcd-1247bf4c2f2f",
135
+ "metadata": {},
136
+ "outputs": [],
137
+ "source": [
138
+ "norm_text = raw_df.copy()"
139
+ ]
140
+ },
141
+ {
142
+ "cell_type": "code",
143
+ "execution_count": 5,
144
+ "id": "ca4e4321-60fb-41a5-8dcd-28d5a96aead6",
145
+ "metadata": {},
146
+ "outputs": [],
147
+ "source": [
148
+ "def normalise_text(text):\n",
149
+ "    if not isinstance(text, str):\n",
150
+ "        return ''  # non-string input (e.g. NaN) becomes an empty string\n",
151
+ "    text = text.lower()\n",
152
+ "    text = text.translate(str.maketrans('', '', string.punctuation))\n",
153
+ "\n",
154
+ "    return ' '.join(text.split())  # collapses whitespace and trims both ends"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "code",
159
+ "execution_count": 6,
160
+ "id": "07fe53af-0d0c-4e94-8bea-62a5d9967d48",
161
+ "metadata": {},
162
+ "outputs": [],
163
+ "source": [
164
+ "norm_text['text'] = norm_text['text'].map(normalise_text)"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "code",
169
+ "execution_count": 7,
170
+ "id": "7b6a2e48-6ef7-4e57-8bd4-fee08b7a993c",
171
+ "metadata": {},
172
+ "outputs": [
173
+ {
174
+ "data": {
175
+ "text/html": [
176
+ "<div>\n",
177
+ "<style scoped>\n",
178
+ " .dataframe tbody tr th:only-of-type {\n",
179
+ " vertical-align: middle;\n",
180
+ " }\n",
181
+ "\n",
182
+ " .dataframe tbody tr th {\n",
183
+ " vertical-align: top;\n",
184
+ " }\n",
185
+ "\n",
186
+ " .dataframe thead th {\n",
187
+ " text-align: right;\n",
188
+ " }\n",
189
+ "</style>\n",
190
+ "<table border=\"1\" class=\"dataframe\">\n",
191
+ " <thead>\n",
192
+ " <tr style=\"text-align: right;\">\n",
193
+ " <th></th>\n",
194
+ " <th>text</th>\n",
195
+ " <th>is_heartfelt</th>\n",
196
+ " </tr>\n",
197
+ " </thead>\n",
198
+ " <tbody>\n",
199
+ " <tr>\n",
200
+ " <th>0</th>\n",
201
+ " <td>i would love for our school to be considered for some gardening equipment we lost a member of staff in february who had been at our school for 30 years and she was heavily involved with teaching the children in our inner city school about the environment and gardening since her death we have had countless children asking if we can use a piece of small ground in the playground and use it to her memory and plant seeds and make it a happy place mrsupham used to run a gardening club but she dona...</td>\n",
202
+ " <td>True</td>\n",
203
+ " </tr>\n",
204
+ " <tr>\n",
205
+ " <th>1</th>\n",
206
+ " <td>our year 1 children have been inspired by a book called omar the bees and me where two children from different backgrounds bond over their interest in bees and decide to make a bee corridor between their school and the local park they send out envelopes of wildflower seeds to every house and building along the route and by the time the late spring comes the whole neighborhood is alive with flowers and insects our year 1 classes would love to do something like this in our community well need ...</td>\n",
207
+ " <td>True</td>\n",
208
+ " </tr>\n",
209
+ " <tr>\n",
210
+ " <th>2</th>\n",
211
+ " <td>i am the sen and pastoral lead in my school derryhale primary school co armagh n ireland we recently lost a valued member of our pta team a lovely mummy of two children a little girl in y2 and a little boy in y4 this lovely lady was just 31 years old and lost her fight for life to cancer budget is tight in school we are a small rural school of 76 pupils and everyone has been effected by this loss all the children in our care worry now that they will loose a family member and pastorally i am ...</td>\n",
212
+ " <td>True</td>\n",
213
+ " </tr>\n",
214
+ " </tbody>\n",
215
+ "</table>\n",
216
+ "</div>"
217
+ ],
218
+ "text/plain": [
219
+ " text \\\n",
220
+ "0 i would love for our school to be considered for some gardening equipment we lost a member of staff in february who had been at our school for 30 years and she was heavily involved with teaching the children in our inner city school about the environment and gardening since her death we have had countless children asking if we can use a piece of small ground in the playground and use it to her memory and plant seeds and make it a happy place mrsupham used to run a gardening club but she dona... \n",
221
+ "1 our year 1 children have been inspired by a book called omar the bees and me where two children from different backgrounds bond over their interest in bees and decide to make a bee corridor between their school and the local park they send out envelopes of wildflower seeds to every house and building along the route and by the time the late spring comes the whole neighborhood is alive with flowers and insects our year 1 classes would love to do something like this in our community well need ... \n",
222
+ "2 i am the sen and pastoral lead in my school derryhale primary school co armagh n ireland we recently lost a valued member of our pta team a lovely mummy of two children a little girl in y2 and a little boy in y4 this lovely lady was just 31 years old and lost her fight for life to cancer budget is tight in school we are a small rural school of 76 pupils and everyone has been effected by this loss all the children in our care worry now that they will loose a family member and pastorally i am ... \n",
223
+ "\n",
224
+ " is_heartfelt \n",
225
+ "0 True \n",
226
+ "1 True \n",
227
+ "2 True "
228
+ ]
229
+ },
230
+ "execution_count": 7,
231
+ "metadata": {},
232
+ "output_type": "execute_result"
233
+ }
234
+ ],
235
+ "source": [
236
+ "norm_text.head(3)"
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "markdown",
241
+ "id": "ca01ec92-9289-4a52-b972-60ff58d48d34",
242
+ "metadata": {},
243
+ "source": [
244
+ "## 2.2 Stopword Removal"
245
+ ]
246
+ },
247
+ {
248
+ "cell_type": "code",
249
+ "execution_count": 8,
250
+ "id": "ae999021-1b3d-48b6-b34e-e9e3dc6d7e16",
251
+ "metadata": {},
252
+ "outputs": [],
253
+ "source": [
254
+ "import spacy\n",
255
+ "import spacy_cleaner\n",
256
+ "from spacy_cleaner.processing import removers, mutators\n",
257
+ "\n",
258
+ "nlp = spacy.load('en_core_web_md')"
259
+ ]
260
+ },
261
+ {
262
+ "cell_type": "code",
263
+ "execution_count": 9,
264
+ "id": "afba3ebd-27b5-4171-84c7-155429242e98",
265
+ "metadata": {},
266
+ "outputs": [],
267
+ "source": [
268
+ "pipeline = spacy_cleaner.Cleaner(\n",
269
+ " nlp,\n",
270
+ " removers.remove_stopword_token,\n",
271
+ " mutators.mutate_lemma_token,\n",
272
+ ")"
273
+ ]
274
+ },
275
+ {
276
+ "cell_type": "code",
277
+ "execution_count": 10,
278
+ "id": "72a40f5e-3b16-4522-b696-8a6aab553649",
279
+ "metadata": {},
280
+ "outputs": [],
281
+ "source": [
282
+ "def clean_text_with_pipeline(text):\n",
283
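+ "    # pipeline.clean is called with a list of strings elsewhere in this notebook,\n",
284
+ "    # so wrap the single text in a list and unwrap the result\n",
285
+ "    return pipeline.clean([text])[0]"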
286
+ ]
287
+ },
288
+ {
289
+ "cell_type": "code",
290
+ "execution_count": 11,
291
+ "id": "3aa72b37-6d39-49a6-9321-e6f8bcdf46a5",
292
+ "metadata": {},
293
+ "outputs": [],
294
+ "source": [
295
+ "preprocessed_df = norm_text.copy()"
296
+ ]
297
+ },
298
+ {
299
+ "cell_type": "code",
300
+ "execution_count": 13,
301
+ "id": "326c9e80-a2c2-412e-b651-5424f1a0ba97",
302
+ "metadata": {},
303
+ "outputs": [
304
+ {
305
+ "name": "stderr",
306
+ "output_type": "stream",
307
+ "text": [
308
+ "Cleaning Progress: 100%|██████████████████████████████████████████| 131/131 [00:00<00:00, 313.08it/s]\n"
309
+ ]
310
+ }
311
+ ],
312
+ "source": [
313
+ "preprocessed_df['clean_text'] = pipeline.clean(preprocessed_df['text'].to_list())"
314
+ ]
315
+ },
316
+ {
317
+ "cell_type": "code",
318
+ "execution_count": 15,
319
+ "id": "900c578b-5a30-450f-8a9d-afd461fefb26",
320
+ "metadata": {},
321
+ "outputs": [
322
+ {
323
+ "data": {
324
+ "text/html": [
325
+ "<div>\n",
326
+ "<style scoped>\n",
327
+ " .dataframe tbody tr th:only-of-type {\n",
328
+ " vertical-align: middle;\n",
329
+ " }\n",
330
+ "\n",
331
+ " .dataframe tbody tr th {\n",
332
+ " vertical-align: top;\n",
333
+ " }\n",
334
+ "\n",
335
+ " .dataframe thead th {\n",
336
+ " text-align: right;\n",
337
+ " }\n",
338
+ "</style>\n",
339
+ "<table border=\"1\" class=\"dataframe\">\n",
340
+ " <thead>\n",
341
+ " <tr style=\"text-align: right;\">\n",
342
+ " <th></th>\n",
343
+ " <th>text</th>\n",
344
+ " <th>is_heartfelt</th>\n",
345
+ " <th>clean_text</th>\n",
346
+ " </tr>\n",
347
+ " </thead>\n",
348
+ " <tbody>\n",
349
+ " <tr>\n",
350
+ " <th>0</th>\n",
351
+ " <td>i would love for our school to be considered for some gardening equipment we lost a member of staff in february who had been at our school for 30 years and she was heavily involved with teaching the children in our inner city school about the environment and gardening since her death we have had countless children asking if we can use a piece of small ground in the playground and use it to her memory and plant seeds and make it a happy place mrsupham used to run a gardening club but she dona...</td>\n",
352
+ " <td>True</td>\n",
353
+ " <td>love school consider gardening equipment lose member staff february school 30 year heavily involved teach child inner city school environment gardening death countless child ask use piece small ground playground use memory plant seed happy place mrsupham run gardening club donate thing 50 child volunteer help positive outcome sad loss</td>\n",
354
+ " </tr>\n",
355
+ " <tr>\n",
356
+ " <th>1</th>\n",
357
+ " <td>our year 1 children have been inspired by a book called omar the bees and me where two children from different backgrounds bond over their interest in bees and decide to make a bee corridor between their school and the local park they send out envelopes of wildflower seeds to every house and building along the route and by the time the late spring comes the whole neighborhood is alive with flowers and insects our year 1 classes would love to do something like this in our community well need ...</td>\n",
358
+ " <td>True</td>\n",
359
+ " <td>year 1 child inspire book call omar bee child different background bond interest bee decide bee corridor school local park send envelope wildflower seed house building route time late spring come neighborhood alive flower insect year 1 class love like community need buy lot wildflower seed produce leaflet promote idea locally handdeliver envelope local area hope turn local area beefriendly community</td>\n",
360
+ " </tr>\n",
361
+ " <tr>\n",
362
+ " <th>2</th>\n",
363
+ " <td>i am the sen and pastoral lead in my school derryhale primary school co armagh n ireland we recently lost a valued member of our pta team a lovely mummy of two children a little girl in y2 and a little boy in y4 this lovely lady was just 31 years old and lost her fight for life to cancer budget is tight in school we are a small rural school of 76 pupils and everyone has been effected by this loss all the children in our care worry now that they will loose a family member and pastorally i am ...</td>\n",
364
+ " <td>True</td>\n",
365
+ " <td>sen pastoral lead school derryhale primary school co armagh n ireland recently lose value member pta team lovely mummy child little girl y2 little boy y4 lovely lady 31 year old lose fight life cancer budget tight school small rural school 76 pupil effect loss child care worry loose family member pastorally support child well resource twinkl support daddy granny child 10 day prior mummy die granda die little child lose beloved mummy granda 10 day love treat child school school trip christmas...</td>\n",
366
+ " </tr>\n",
367
+ " </tbody>\n",
368
+ "</table>\n",
369
+ "</div>"
370
+ ],
371
+ "text/plain": [
372
+ " text \\\n",
373
+ "0 i would love for our school to be considered for some gardening equipment we lost a member of staff in february who had been at our school for 30 years and she was heavily involved with teaching the children in our inner city school about the environment and gardening since her death we have had countless children asking if we can use a piece of small ground in the playground and use it to her memory and plant seeds and make it a happy place mrsupham used to run a gardening club but she dona... \n",
374
+ "1 our year 1 children have been inspired by a book called omar the bees and me where two children from different backgrounds bond over their interest in bees and decide to make a bee corridor between their school and the local park they send out envelopes of wildflower seeds to every house and building along the route and by the time the late spring comes the whole neighborhood is alive with flowers and insects our year 1 classes would love to do something like this in our community well need ... \n",
375
+ "2 i am the sen and pastoral lead in my school derryhale primary school co armagh n ireland we recently lost a valued member of our pta team a lovely mummy of two children a little girl in y2 and a little boy in y4 this lovely lady was just 31 years old and lost her fight for life to cancer budget is tight in school we are a small rural school of 76 pupils and everyone has been effected by this loss all the children in our care worry now that they will loose a family member and pastorally i am ... \n",
376
+ "\n",
377
+ " is_heartfelt \\\n",
378
+ "0 True \n",
379
+ "1 True \n",
380
+ "2 True \n",
381
+ "\n",
382
+ " clean_text \n",
383
+ "0 love school consider gardening equipment lose member staff february school 30 year heavily involved teach child inner city school environment gardening death countless child ask use piece small ground playground use memory plant seed happy place mrsupham run gardening club donate thing 50 child volunteer help positive outcome sad loss \n",
384
+ "1 year 1 child inspire book call omar bee child different background bond interest bee decide bee corridor school local park send envelope wildflower seed house building route time late spring come neighborhood alive flower insect year 1 class love like community need buy lot wildflower seed produce leaflet promote idea locally handdeliver envelope local area hope turn local area beefriendly community \n",
385
+ "2 sen pastoral lead school derryhale primary school co armagh n ireland recently lose value member pta team lovely mummy child little girl y2 little boy y4 lovely lady 31 year old lose fight life cancer budget tight school small rural school 76 pupil effect loss child care worry loose family member pastorally support child well resource twinkl support daddy granny child 10 day prior mummy die granda die little child lose beloved mummy granda 10 day love treat child school school trip christmas... "
386
+ ]
387
+ },
388
+ "execution_count": 15,
389
+ "metadata": {},
390
+ "output_type": "execute_result"
391
+ }
392
+ ],
393
+ "source": [
394
+ "preprocessed_df.head(3)"
395
+ ]
396
+ },
397
+ {
398
+ "cell_type": "code",
399
+ "execution_count": 16,
400
+ "id": "32bcaac6-1c7c-4c57-8686-7eb6bc6c8dad",
401
+ "metadata": {},
402
+ "outputs": [
403
+ {
404
+ "data": {
405
+ "text/plain": [
406
+ "Index(['text', 'clean_text', 'is_heartfelt'], dtype='object')"
407
+ ]
408
+ },
409
+ "execution_count": 16,
410
+ "metadata": {},
411
+ "output_type": "execute_result"
412
+ }
413
+ ],
414
+ "source": [
415
+ "preprocessed_df = preprocessed_df[['text', 'clean_text', 'is_heartfelt']]\n",
416
+ "\n",
417
+ "preprocessed_df.columns"
418
+ ]
419
+ },
420
+ {
421
+ "cell_type": "code",
422
+ "execution_count": 17,
423
+ "id": "407aad51-9680-4203-b86e-f8def0ca7731",
424
+ "metadata": {},
425
+ "outputs": [],
426
+ "source": [
427
+ "df = preprocessed_df.copy()"
428
+ ]
429
+ },
430
+ {
431
+ "cell_type": "code",
432
+ "execution_count": 18,
433
+ "id": "cb71ae69-af1c-4505-8f4a-4a13beb12d00",
434
+ "metadata": {},
435
+ "outputs": [
436
+ {
437
+ "data": {
438
+ "text/plain": [
439
+ "text 0\n",
440
+ "clean_text 0\n",
441
+ "is_heartfelt 0\n",
442
+ "dtype: int64"
443
+ ]
444
+ },
445
+ "execution_count": 18,
446
+ "metadata": {},
447
+ "output_type": "execute_result"
448
+ }
449
+ ],
450
+ "source": [
451
+ "df.isna().sum()"
452
+ ]
453
+ },
454
+ {
455
+ "cell_type": "markdown",
456
+ "id": "39ed2ff3-c65c-42c4-9499-78c58d9ee442",
457
+ "metadata": {},
458
+ "source": [
459
+ "# 3. Model Training"
460
+ ]
461
+ },
462
+ {
463
+ "cell_type": "code",
464
+ "execution_count": 47,
465
+ "id": "f2cf4443-b55b-4598-ac65-7176e8c55ef7",
466
+ "metadata": {},
467
+ "outputs": [],
468
+ "source": [
469
+ "from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold\n",
470
+ "from sklearn.pipeline import Pipeline\n",
471
+ "from sklearn.feature_extraction.text import TfidfVectorizer\n",
472
+ "from sklearn.linear_model import LogisticRegression\n",
473
+ "from sklearn.metrics import classification_report"
474
+ ]
475
+ },
476
+ {
477
+ "cell_type": "code",
478
+ "execution_count": 44,
479
+ "id": "3f675a40-d095-49e6-9293-1b2f1b78c611",
480
+ "metadata": {},
481
+ "outputs": [],
482
+ "source": [
483
+ "X, y = df['text'], df['is_heartfelt']  # 1. Features (normalised text) and labels\n",
484
+ "\n",
485
+ "# 2. Split out a hold-out set\n",
486
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
487
+ "    X, y, stratify=y, test_size=0.2, random_state=42\n",
488
+ ")\n",
489
+ "\n",
490
+ "# 3. Build a TF–IDF + Logistic Regression pipeline\n",
491
+ "baseline_pipe = Pipeline([\n",
492
+ "    ('tfidf', TfidfVectorizer(\n",
493
+ "        lowercase=True,\n",
494
+ "        ngram_range=(1, 2),\n",
495
+ "        min_df=2\n",
496
+ "    )),\n",
497
+ "    ('clf', LogisticRegression(\n",
498
+ "        solver='liblinear',\n",
499
+ "        class_weight='balanced',  # if classes are skewed\n",
500
+ "        random_state=42\n",
501
+ "    )),\n",
502
+ "])\n"
503
+ ]
504
+ },
505
+ {
506
+ "cell_type": "code",
507
+ "execution_count": 45,
508
+ "id": "5bc1dffa-5aca-4b2e-9e85-df9cadbf96c9",
509
+ "metadata": {},
510
+ "outputs": [
511
+ {
512
+ "name": "stdout",
513
+ "output_type": "stream",
514
+ "text": [
515
+ "Baseline CV F1: 0.916 ± 0.046\n",
516
+ " precision recall f1-score support\n",
517
+ "\n",
518
+ " False 1.00 0.91 0.95 11\n",
519
+ " True 0.94 1.00 0.97 16\n",
520
+ "\n",
521
+ " accuracy 0.96 27\n",
522
+ " macro avg 0.97 0.95 0.96 27\n",
523
+ "weighted avg 0.97 0.96 0.96 27\n",
524
+ "\n"
525
+ ]
526
+ }
527
+ ],
528
+ "source": [
529
+ "# 4. Quick cross-validation\n",
530
+ "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
531
+ "cv_scores = cross_val_score(\n",
532
+ "    baseline_pipe, X_train, y_train,\n",
533
+ "    cv=cv,\n",
534
+ "    scoring='f1'\n",
535
+ ")\n",
536
+ "print(f'Baseline CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}')\n",
537
+ "\n",
538
+ "# 5. Fit and evaluate on held-out test\n",
539
+ "baseline_pipe.fit(X_train, y_train)\n",
540
+ "y_pred = baseline_pipe.predict(X_test)\n",
541
+ "print(classification_report(y_test, y_pred))"
542
+ ]
543
+ },
544
+ {
545
+ "cell_type": "code",
546
+ "execution_count": 46,
547
+ "id": "be9b112f-f0a1-4132-9554-c145175e8486",
548
+ "metadata": {},
549
+ "outputs": [
550
+ {
551
+ "name": "stdout",
552
+ "output_type": "stream",
553
+ "text": [
554
+ "> Real=False, Pred=True\n",
555
+ "i would look to buy supplies for the whole school to use for science i know that some experiments use a lot of materials and would want to buy good kits for circuits i would also look into buying things to improve the learning and teaching of computing science in the school\n",
556
+ "---\n"
557
+ ]
558
+ }
559
+ ],
560
+ "source": [
561
+ "# Example: inspect errors\n",
562
+ "X_err = X_test[y_test != y_pred]\n",
563
+ "y_err_true = y_test[y_test != y_pred]\n",
564
+ "y_err_pred = y_pred[y_test != y_pred]\n",
565
+ "for text, true, pred in zip(X_err, y_err_true, y_err_pred):\n",
566
+ "    print(f\"> Real={true}, Pred={pred}\\n{text}\\n---\")\n"
567
+ ]
568
+ },
569
+ {
570
+ "cell_type": "code",
571
+ "execution_count": null,
572
+ "id": "79dd6ca7-4ffd-4350-a36b-48d51ce4c23a",
573
+ "metadata": {},
574
+ "outputs": [],
575
+ "source": [
576
+ "import joblib\n",
577
+ "\n",
578
+ "baseline_pipe.fit(X_train, y_train)\n",
579
+ "\n",
580
+ "joblib.dump(baseline_pipe, '../src/models/heartfelt_pipeline.joblib')\n",
581
+ "\n",
582
+ "# …later, or in a new script/notebook, load it back:\n",
583
+ "loaded_pipe = joblib.load('../src/models/heartfelt_pipeline.joblib')\n",
584
+ "\n",
585
+ "# Confirm it still works\n",
586
+ "y_pred = loaded_pipe.predict(X_test)\n"
587
+ ]
588
+ }
589
+ ],
590
+ "metadata": {
591
+ "kernelspec": {
592
+ "display_name": "Python 3 (ipykernel)",
593
+ "language": "python",
594
+ "name": "python3"
595
+ },
596
+ "language_info": {
597
+ "codemirror_mode": {
598
+ "name": "ipython",
599
+ "version": 3
600
+ },
601
+ "file_extension": ".py",
602
+ "mimetype": "text/x-python",
603
+ "name": "python",
604
+ "nbconvert_exporter": "python",
605
+ "pygments_lexer": "ipython3",
606
+ "version": "3.11.12"
607
+ }
608
+ },
609
+ "nbformat": 4,
610
+ "nbformat_minor": 5
611
+ }
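
A note on the normalisation step in section 2.1: because punctuation is deleted outright, tokens that were separated only by punctuation get fused in the notebook's output ("Mrs.Upham" becomes "mrsupham", "hand-deliver" becomes "handdeliver"). The following is a minimal sketch of one possible alternative, not the method used in this commit: it maps each punctuation character to a space before collapsing whitespace. The name `normalise_text_spaced` is hypothetical and does not appear in the notebook.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it.
_PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in string.punctuation})


def normalise_text_spaced(text):
    """Hypothetical variant of normalise_text that preserves word boundaries."""
    if not isinstance(text, str):
        return ''  # assumption: non-string input (e.g. NaN) maps to an empty string
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # split() with no arguments collapses whitespace runs and trims both ends
    return ' '.join(text.split())


print(normalise_text_spaced('Mrs.Upham used to hand-deliver seeds.'))
# prints: mrs upham used to hand deliver seeds
```

The trade-off is that contractions are split as well ("we'll" becomes "we ll"), so whether this helps the TF-IDF features would need to be checked against the cross-validation score rather than assumed.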