Mohit Chandra Sai Bogineni committed
Commit 8d1b471 · 1 Parent(s): 4bdc852
Files changed (15)
  1. .gitattributes +7 -0
  2. .gitignore +5 -0
  3. README.md +247 -103
  4. config.json +2 -14
  5. custom_tokenizer.py +7 -0
  6. data/categorizer +3 -0
  7. data_collection.py +188 -0
  8. folder +0 -0
  9. models/categorizer +3 -0
  10. qbmodel.py +255 -0
  11. question_categorizer.py +272 -0
  12. rand.py +20 -0
  13. requirements.txt +225 -0
  14. submission.py +49 -0
  15. tfidf_model.py +275 -0
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ data/categorizer filter=lfs diff=lfs merge=lfs -text
+ models/categorizer filter=lfs diff=lfs merge=lfs -text
+ __pycache__/QAPipeline.cpython-38.pyc filter=lfs diff=lfs merge=lfs -text
+ __pycache__/QBModelConfig.cpython-38.pyc filter=lfs diff=lfs merge=lfs -text
+ __pycache__/QBModelWrapper.cpython-38.pyc filter=lfs diff=lfs merge=lfs -text
+ __pycache__/qbmodel.cpython-38.pyc filter=lfs diff=lfs merge=lfs -text
+ __pycache__/tfidf_model.cpython-38.pyc filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,5 @@
+ *.json
+ <<<<<<< HEAD
+ *.pkl
+ =======
+ >>>>>>> 8f336f8225704206d8ba2ab4e229f71676bdcf0e
README.md CHANGED
@@ -1,199 +1,343 @@
  ---
- library_name: transformers
- tags: []
  ---
- # Model Card for Model ID
- <!-- Provide a quick summary of what the model is/does. -->
- ## Model Details
- ### Model Description
- <!-- Provide a longer summary of what this model is. -->
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
- ### Model Sources [optional]
- <!-- Provide the basic links for the model. -->
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
- ## Uses
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- ### Direct Use
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- [More Information Needed]
- ### Downstream Use [optional]
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- [More Information Needed]
- ### Out-of-Scope Use
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- [More Information Needed]
- ## Bias, Risks, and Limitations
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
- [More Information Needed]
- ### Recommendations
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
- ## How to Get Started with the Model
- Use the code below to get started with the model.
- [More Information Needed]
- ## Training Details
- ### Training Data
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- [More Information Needed]
- ### Training Procedure
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
- #### Preprocessing [optional]
- [More Information Needed]
- #### Training Hyperparameters
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
- #### Speeds, Sizes, Times [optional]
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- [More Information Needed]
- ## Evaluation
- <!-- This section describes the evaluation protocols and provides the results. -->
- ### Testing Data, Factors & Metrics
- #### Testing Data
- <!-- This should link to a Dataset Card if possible. -->
- [More Information Needed]
- #### Factors
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
- [More Information Needed]
- #### Metrics
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
- [More Information Needed]
- ### Results
- [More Information Needed]
- #### Summary
- ## Model Examination [optional]
- <!-- Relevant interpretability work for the model goes here -->
- [More Information Needed]
- ## Environmental Impact
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
- ## Technical Specifications [optional]
- ### Model Architecture and Objective
- [More Information Needed]
- ### Compute Infrastructure
- [More Information Needed]
- #### Hardware
- [More Information Needed]
- #### Software
- [More Information Needed]
- ## Citation [optional]
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
- **BibTeX:**
- [More Information Needed]
- **APA:**
- [More Information Needed]
- ## Glossary [optional]
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
- [More Information Needed]
- ## More Information [optional]
- [More Information Needed]
- ## Model Card Authors [optional]
- [More Information Needed]
- ## Model Card Contact
- [More Information Needed]
+ <<<<<<< HEAD
+ =======
  ---
+ license: mit
+ language:
+ - en
+ pipeline_tag: question-answering
  ---
+ >>>>>>> 8f336f8225704206d8ba2ab4e229f71676bdcf0e
+ The evaluation of this project is to answer trivia questions. You do
+ not need to do well at this task, but you should submit a system that
+ completes the task or create adversarial questions in that setting. This will help the whole class share data and
+ resources.
+
+ If you focus on something other than predicting answers, *that's fine*!
+
+ About the Data
+ ==============
+
+ Quiz bowl is an academic competition between schools in
+ English-speaking countries; hundreds of teams compete in dozens of
+ tournaments each year. Quiz bowl is different from Jeopardy, a recent
+ application area. While Jeopardy also uses signaling devices, these
+ are only usable after a question is completed (interrupting Jeopardy's
+ questions would make for bad television). Thus, Jeopardy is rapacious
+ classification followed by a race---among those who know the
+ answer---to punch a button first.
+
+ Here's an example of a quiz bowl question:
+
+ Expanding on a 1908 paper by Smoluchowski, he derived a formula for
+ the intensity of scattered light in media with fluctuating densities that
+ reduces to Rayleigh's law for ideal gases in The Theory of the
+ Opalescence of Homogeneous Fluids and Liquid Mixtures near the Critical
+ State. That research supported his theories of matter first developed
+ when he calculated the diffusion constant in terms of fundamental
+ parameters of the particles of a gas undergoing Brownian Motion. In
+ that same year, 1905, he also published On a Heuristic Point of View
+ Concerning the Production and Transformation of Light. That
+ explication of the photoelectric effect won him the 1921 Nobel in Physics.
+ For ten points, name this German physicist best known for his theory
+ of Relativity.
+
+ *ANSWER*: Albert _Einstein_
+
+ Two teams listen to the same question. Teams interrupt the question at
+ any point by "buzzing in"; if the answer is correct, the team gets
+ points and the next question is read. Otherwise, the team loses
+ points and the other team can answer.
+
+ You are welcome to use any *automatic* method to choose an answer. It
+ need not be similar to nor build on our provided systems. In addition to
+ the data we provide, you are welcome to use any external data *except*
+ our test quiz bowl questions (i.e., don't hack our server!). You are
+ welcome (and encouraged) to use any publicly available software, but
+ you may want to check on Piazza for suggestions as many tools are
+ better (or easier to use) than others.
+
+ If you don't like the interruptability of questions, you can also just answer entire questions. However, you must also output a confidence.
+ Competition
+ ==================
+ We will use the Dynabench website (https://dynabench.org/tasks/qa). If you remember the past workshop about Dynabench submission, this is the way to do it. The specific task name is "Grounded QA". Here, with the help of the video tutorial, you submit your QA model and assess how it did compared to others. The assessment will take place by testing your QA model on several QA test datasets, and your results and your competitors' will be visible on the leaderboard. Your goal is to rank the highest in terms of expected wins: you buzz in with probability proportional to your confidence, and if you're more right than the competition, you win.
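The expected-wins rule above can be sketched in a few lines under one plausible, simplified reading (this is our illustration, not the Dynabench scoring code; the names `confidence`, `correct`, and `rival_accuracy` are hypothetical):

```python
def expected_wins(confidence, correct, rival_accuracy):
    """Expected wins on one question under a simplified model:
    buzz with probability equal to your confidence; a buzz wins
    only if your answer is correct, otherwise the rival answers
    and wins with their accuracy."""
    win_if_buzz = 1.0 if correct else 0.0
    win_if_pass = 1.0 - rival_accuracy  # you win only if the rival misses
    return confidence * win_if_buzz + (1 - confidence) * win_if_pass

# A confident, correct answer scores close to a full win:
expected_wins(0.8, True, 0.5)  # 0.8 * 1.0 + 0.2 * 0.5 = 0.9
```

The point of the sketch: raising confidence only helps when you are actually right, which is why calibrating your confidence matters as much as raw accuracy.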
+ Writing Questions
+ ==================
+
+ Alternatively, you can also *write* 50 adversarial questions that
+ challenge modern NLP systems. These questions must be diverse in the
+ subjects asked about, the skills computers need to answer the
+ questions, and the entities in those questions. Remember that your questions should be *factual* and
+ *specific* enough for humans to answer, because your task is to stump
+ the computers relative to humans!
+
+ In addition to the raw questions, you will also need to create citations describing:
+ * Why the question is difficult for computers: include citations from the NLP/AI/ML literature
+ * Why the information in the question is correct: include citations from the sources you drew on to write the question
+ * Why the question is interesting: include scholarly / popular culture artifacts to prove that people care about this
+ * Why the question is pyramidal: discuss why your first clues are harder than your later clues
+
+ **Category**
+
+ We want questions from many domains such as Art, Literature, Geography, History,
+ Science, TV and Film, Music, Lifestyle, and Sport. The questions
+ should be written using all topics above (5 questions for each
+ category and 5 more for the remaining categories). Indicate in your
+ writeup which category you chose to write on for each question.
 
+ Art:
+
+ * Questions about works: Mona Lisa, Raft of the Medusa
+
+ * Questions about forms: color, contour, texture
+
+ * Questions about artists: Picasso, Monet, Leonardo da Vinci
+
+ * Questions about context: Renaissance, post-modernism, expressionism, surrealism
+
+ Literature:
+
+ * Questions about works: novels (1984), plays (The Lion and the Jewel), poems (Rubaiyat), criticism (Poetics)
+
+ * Questions about major characters or events in literature: The Death of Anna Karenina, Noboru Wataya, the Marriage of Hippolyta and Theseus
+
+ * Questions about literary movements (Sturm und Drang)
+
+ * Questions about translations
+
+ * Cross-cutting questions (appearances of Overcoats in novels)
+
+ * Common link questions (the literary output of a country/region)
+
+ Geography:
+
+ * Questions about location: names of capital, state, river
+
+ * Questions about the place: temperature, wind flow, humidity
+
+ History:
+
+ * When: When did the First World War start?
+
+ * Who: Who is called the Napoleon of Iran?
+
+ * Where: Where was the first Summer Olympics held?
+
+ * Which: Which is the oldest civilization in the world?
+
+ Science:
+
+ * Questions about terminology: The concept of gravity was discovered by which famous physicist?
+
+ * Questions about the experiment
+
+ * Questions about theory: The social action theory believes that individuals are influenced by this theory.
+
+ TV and Film:
+
+ * Quotes: What are the dying words of Charles Foster Kane in Citizen Kane?
+
+ * Title: What 1927 musical was the first "talkie"?
+
+ * Plot: In The Matrix, does Neo take the blue pill or the red pill?
+
+ Music:
+
+ * Singer: What singer has had a Billboard No. 1 hit in each of the last four decades?
+
+ * Band: Before Bleachers and fun., Jack Antonoff fronted what band?
+
+ * Title: What was Madonna's first top 10 hit?
+
+ * History: Which classical composer was deaf?
+
+ Lifestyle:
+
+ * Clothes: What clothing company, founded by a tennis player, has an alligator logo?
+
+ * Decoration: What was the first perfume sold by Coco Chanel?
+
+ Sport:
+
+ * Known facts: What sport is best known as the ‘king of sports’?
+
+ * Nationality: What’s the national sport of Canada?
+
+ * Sport player: The classic 1980 movie called Raging Bull is about which real-life boxer?
+
+ * Country: What country has competed the most times in the Summer Olympics yet hasn’t won any kind of medal?
+ **Diversity**
+
+ Other than category diversity, if you find an ingenious way of writing questions about underrepresented countries, you will get bonus points (indicate in your writeup which questions include the diversity component). You may decide which countries are underrepresented using your own reasonable criterion (e.g., a smaller population may indicate underrepresentation), but make sure to articulate this in your writeup.
+
+ * Run state-of-the-art QA systems on the questions to show they struggle; give individual results for each question and a summary over all questions
+
+ For an example of what the writeup for a single question should look like, see the adversarial HW:
+ https://github.com/Pinafore/nlp-hw/blob/master/adversarial/question.tex
+ Proposal
+ ==================
+
+ The project proposal is a one-page PDF document that describes:
+
+ * Who is on your team (team sizes can be between three and six
+ students, but six is really too big to be effective; my suggestion
+ is that most groups should be between four and five).
+
+ * What techniques you will explore
+
+ * Your timeline for completing the project (be realistic; you should
+ have your first submission in a week or two)
+
+ Submit the proposal on Gradescope, but make sure to include all group
+ members. If all group members are not included, you will lose points. Late days cannot be used on this
+ assignment.
+ Milestone 1
+ ======================
+
+ You'll have to update us on how things are going: what's
+ working, what isn't, and how does it change your timeline? How does it change your division of labor?
+
+ *Question Writing*: You'll need to have answers selected for all of
+ your questions and first drafts of at least 15 questions. This must
+ be submitted as a JSON file so that we can run computer QA systems on it.
+
+ *Project*: You'll need to have made a submission to the leaderboard with something that satisfies the API.
+
+ Submit a PDF updating on your progress to Gradescope. If all team
+ members are not on the submission, you will lose points.
+ Milestone 2
+ ===================
+
+ As before, provide an updated timeline / division of labor and your intermediary results.
+
+ *Question Writing*: You'll need to have reflected the feedback from the first questions and completed a first draft of at least 30 questions. You'll also need machine results on your questions and an overall evaluation of your human/computer accuracy.
+
+ *Project*: You'll need to have made a submission to the leaderboard with a working system (e.g., not just obeying the API, but actually getting reasonable answers).
+
+ Submit a PDF updating on your progress.
+ Final Presentation
+ ======================
+
+ The final presentation will be virtual (uploading a video). In
+ the final presentation you will:
+
+ * Explain what you did
+
+ * Who did what. For example, for the question writing project a team of five people might write: A wrote the first draft of questions. B and C verified they were initially answerable by a human. B ran computer systems to verify they were challenging to a computer. C edited the questions and increased the computer difficulty. D and E verified that the edited questions were still answerable by a human. D and E checked all of the questions for factual accuracy and created citations and the writeup.
+
+ * What challenges you had
+
+ * Review how well you did (based on the competition or your own metrics). If you do not use the course infrastructure to evaluate your project's work, you should talk about what alternative evaluations you used, why they're appropriate/fair, and how well you did on them.
+
+ * Provide an error analysis. An error analysis must contain examples from the
+ development set that you get wrong. You should show those sentences
+ and explain why (in terms of features or the model) they have the
+ wrong answer. You should have been doing this all along as you
+ derive new features, but this is your final inspection of
+ your errors. The feature or model problems you discover should not
+ be trivial features you could add easily. Instead, these should be
+ features or models that are difficult to correct. An error analysis
+ is not the same thing as simply presenting the error matrix, as it
+ does not inspect any individual examples. If you're writing questions, talk about examples of questions that didn't work out as intended.
+
+ * The linguistic motivation for your features / how you wrote the questions. This is a
+ computational linguistics class, so you should give precedence to
+ features / techniques that we use in this class (e.g., syntax,
+ morphology, part of speech, word sense, etc.). Given two features
+ that work equally well and one that is linguistically motivated,
+ we'll prefer the linguistically motivated one.
+
+ * Presumably you did many different things; how did they each
+ individually contribute to your final result?
+
+ Each group has 10 minutes to deliver their presentation. Please record the video, upload it to Google Drive, and include the link in your writeup submission.
+ Final Question Submission
+ ======================
+
+ Because we need to get the questions ready for the systems, upload your raw questions on May 10. This doesn't include the citations or other parts of the writeup.
+
+ System Submission
+ ======================
+
+ You must submit a version of your system by May 12. It may not be perfect, but this is what the question writing teams will use to test their results.
+
+ Your system should be sent directly to the professor and TAs in zip files, including the correct dependencies and working inference code. Your inference code should run successfully in the root directory (extracted from the zip folder) with the command:
+
+ ```
+ > python3 inference.py --data=evaluation_set.json
+ ```
+
+ The input will be a .json file in the same format as the file the adversarial question writing team submits. The output should also be a string.
+
+ If you have any notes or comments that we should be aware of while running your code, please include them in the folder as a .txt file. Also, dependency information should be included as a .txt file.
+
+ Please prepend your email title with [2024-CMSC 470 System Submission].
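A minimal skeleton of the `inference.py` entry point described above might look like the following (a sketch only: the entry field name `text` and the placeholder `answer_question` are our assumptions, so defer to the example file's actual format):

```python
import argparse
import json


def answer_question(text):
    # Placeholder: replace with your model's actual inference.
    return "Albert Einstein"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True,
                        help="Path to the evaluation .json file")
    args = parser.parse_args()

    with open(args.data, "r") as f:
        questions = json.load(f)

    # Emit one string answer per question, as the task requires.
    for entry in questions:
        print(answer_question(entry["text"]))


if __name__ == "__main__":
    main()
```

Running it against a local copy of the evaluation file before emailing the zip is a cheap way to catch missing dependencies.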
+ Project Writeup and JSON file
+ ======================
+
+ By May 17, submit your project writeup explaining what
+ you did and what results you achieved. This document should
+ make it clear:
+
+ * Why this is a good idea
+ * What you did
+ * Who did what
+ * Whether your technique worked or not
+
+ For systems, please do not go over 2500 words unless you have a really good reason.
+ Images are usually a much better use of space than words (there's no
+ limit on including images, but use judgement and be selective).
+
+ For question writing, you have one page (single spaced, two column) per question plus a two-page summary of results. Talk about how you organized the question writing, how you evaluated the questions, and a summary of the results. Along with your writeup, turn in a JSON file including the raw text of the question, the answer, and the category. The example JSON file is included in this directory. Make sure your JSON file is in the correct format and is loadable via the code below. Your submission will not be graded if it does not follow the format of the example JSON file.
+
+ ```
+ with open('path to your json file', 'r') as f:
+     data = json.load(f)
+ ```
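Before submitting, it is worth sanity-checking that the file both loads and carries the expected fields on every entry. A small validator sketch (the key names `text`, `answer`, and `category` are our assumption; defer to the example JSON file for the authoritative format):

```python
import json

REQUIRED_KEYS = {"text", "answer", "category"}  # assumed field names


def validate_questions(path):
    """Return a list of (index, missing_keys) pairs; empty means the file is OK."""
    with open(path, "r") as f:
        data = json.load(f)
    problems = []
    for i, entry in enumerate(data):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append((i, sorted(missing)))
    return problems
```

An empty return value means every entry has the required keys; any other result lists exactly which entries to fix.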
+ Grade
+ ======================
+
+ The grade will be out of 25 points, broken into five areas:
+
+ * _Presentation_: For your oral presentation, do you highlight what
+ you did and make people care? Did you use time well during the
+ presentation?
+
+ * _Writeup_: Does the writeup explain what you did in a way that is
+ clear and effective?
+
+ The final three areas are different between the system and the questions.
+
+ | | System | Questions |
+ |----------|:-------------:|------:|
+ | _Technical Soundness_ | Did you use the right tools for the job, and did you use them correctly? Were they relevant to this class? | Were your questions correct and accurately cited? |
+ | _Effort_ | Did you do what you said you would, and was it the right amount of effort? | Are the questions well-written, interesting, and thoroughly edited? |
+ | _Performance_ | How did your techniques perform in terms of accuracy, recall, etc.? | Is the human accuracy substantially higher than the computer accuracy? |
+
+ <<<<<<< HEAD
+ All members of the group will receive the same grade. It's impossible for the course staff to adjudicate Rashomon-style accounts of who did what, and the goal of a group project is for all team members to work together to create a cohesive project. While it makes sense to divide the work into distinct areas of responsibility, at grading time we have no way to know who really did what, so it's the group's responsibility to create a piece of output that reflects well on the whole group.
+ =======
+ All members of the group will receive the same grade. It's impossible for the course staff to adjudicate Rashomon-style accounts of who did what, and the goal of a group project is for all team members to work together to create a cohesive project. While it makes sense to divide the work into distinct areas of responsibility, at grading time we have no way to know who really did what, so it's the group's responsibility to create a piece of output that reflects well on the whole group.
+ >>>>>>> 8f336f8225704206d8ba2ab4e229f71676bdcf0e
config.json CHANGED
@@ -1,5 +1,4 @@
  {
- "_name_or_path": "backedman/TriviaAnsweringMachine8",
  "architectures": [
  "QBModelWrapper"
  ],
@@ -7,18 +6,7 @@
  "AutoConfig": "QBModelConfig.QBModelConfig",
  "AutoModelForQuestionAnswering": "QBModelWrapper.QBModelWrapper"
  },
- "custom_pipelines": {
- "demo-qa": {
- "impl": "QAPipeline.QApipeline",
- "pt": [
- "AutoModelForQuestionAnswering"
- ],
- "tf": [
- "TFAutoModelForQuestionAnswering"
- ]
- }
- },
  "model_type": "TFIDF-QA",
- "torch_dtype": "float16",
+ "torch_dtype": "torch.float16",
  "transformers_version": "4.40.1"
- }
+ }
custom_tokenizer.py ADDED
@@ -0,0 +1,7 @@
+ class CustomTokenizer:
+     def __init__(self):
+         pass
+
+     def __call__(self, text):
+         # No tokenization, return input text as-is
+         return text
data/categorizer ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c88b6bbd839788571ec706c83dc8dc51f3dcd3e2ba1569bc2d8f5525565b441
+ size 26940976
data_collection.py ADDED
@@ -0,0 +1,188 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ from bs4 import BeautifulSoup
3
+ import re
4
+ from tqdm import tqdm # Import tqdm for progress tracking
5
+ import sys
6
+ import question_categorizer as qc
7
+ import numpy as np
8
+ from question_categorizer import TextClassificationModel
9
+
10
+ qc_model = qc.TextClassificationModel.load_model("models/categorizer")
11
+
12
+ categories = ['Geography', 'Religion', 'Philosophy', 'Trash','Mythology', 'Literature','Science', 'Social Science', 'History', 'Current Events', 'Fine Arts']
13
+
14
+ def remove_newline(string):
15
+ return re.sub('\n+', ' ', string)
16
+
17
+ def clean_text(text, answer):
18
+ # Remove HTML tags
19
+ text = re.sub(r'<.*?>', '', text)
20
+
21
+ #text = re.sub(r'?','.',text)
22
+ text = text.replace('?','.')
23
+
24
+ # Clean the text further
25
+ text = re.sub(r'[^a-zA-Z.\s-]', '', text)
26
+
27
+
28
+
29
+ # Remove answer from text
30
+ try:
31
+ # Preprocess the answer to replace underscores with spaces
32
+ processed_answer = answer.replace('_', ' ')
33
+
34
+ # Remove parentheses from the processed answer
35
+ processed_answer = re.sub(r'\([^)]*\)', '', processed_answer)
36
+
37
+ # Replace all instances of the processed answer with an empty string, ignoring case
38
+ text = re.sub(re.escape(processed_answer), '', text, flags=re.IGNORECASE)
39
+ except Exception as e:
40
+ print("An error occurred during text cleaning:", e)
41
+ print("Text:", text)
42
+ print("Answer:", answer)
43
+
44
+ # Remove extra whitespaces
45
+ text = re.sub(r'\s+', ' ', text)
46
+
47
+ return text.strip()
48
+
49
+ def process_data():
50
+ #with open("data/JEOPARDY_QUESTIONS1.json", "r") as f:
51
+ # jeopardy_data = json.load(f)
52
+ jeopardy_data = []
53
+
54
+ wiki_files = [
55
+ ]
56
+
57
+ question_files = [
58
+ "qadata.json"]
59
+
60
+ wiki_data = []
61
+ question_data = []
62
+
63
+ for file_path in wiki_files:
64
+ with open('data/' + file_path, "r") as f:
65
+ wiki_data.extend(json.load(f))
66
+
67
+     for file_path in question_files:
+         with open('data/' + file_path, "r") as f:
+             question_data.extend(json.load(f))
+ 
+     with open("data/training_data.json", "w") as f:
+         training_data = []
+ 
+         # Process Jeopardy data
+         print("Processing Jeopardy data...")
+         for entry in tqdm(jeopardy_data):
+             question = entry["question"]
+             answer = str(entry["answer"])
+ 
+             # Strip HTML markup from the question text
+             soup = BeautifulSoup(question, 'html.parser')
+             clean_question = ''.join(soup.findAll(text=True, recursive=False))
+ 
+             # Get categories from qc_model; argwhere yields (row, col) pairs,
+             # so keep the column (class) indices
+             question_category = []
+             prediction = qc_model.predict(question)
+             predictions = np.argwhere(prediction >= 1.5)[:, 1]
+ 
+             for prediction_ind in predictions:
+                 question_category.append(categories[prediction_ind])
+ 
+             question_category.append('ALL')
+ 
+             training_entry = {
+                 "text": clean_question,
+                 "answer": answer,
+                 "category": question_category
+             }
+ 
+             training_data.append(training_entry)
+ 
+         # Process Wikipedia data
+         print("Processing Wikipedia data...")
+         for entry in tqdm(wiki_data):
+             page = str(entry["page"])
+             text = entry["text"]
+ 
+             if text == "":
+                 continue
+ 
+             text = remove_newline(text)
+             text = clean_text(text, page)
+ 
+             question_category = []
+             prediction = qc_model.predict(text)
+             predictions = np.argwhere(prediction >= 1.5)[:, 1]
+ 
+             for prediction_ind in predictions:
+                 question_category.append(categories[prediction_ind])
+ 
+             question_category.append('ALL')
+ 
+             training_entry = {
+                 "text": text,
+                 "answer": page,
+                 "category": question_category
+             }
+ 
+             training_data.append(training_entry)
+ 
+         print("Processing Misc data...")
+         for entry in tqdm(question_data):
+             answer = str(entry["answer"])
+             text = entry["text"]
+ 
+             if text == "" or answer == "":
+                 continue
+ 
+             text = remove_newline(text)
+             text = clean_text(text, answer)
+ 
+             question_category = []
+ 
+             # Skip entries the categorizer cannot handle rather than crashing
+             try:
+                 prediction = qc_model.predict(text)
+                 predictions = np.argwhere(prediction >= 1.5)[:, 1]
+             except Exception:
+                 print("answer: " + str(answer))
+                 print("text: " + str(text))
+                 continue
+ 
+             for prediction_ind in predictions:
+                 question_category.append(categories[prediction_ind])
+ 
+             question_category.append('ALL')
+ 
+             training_entry = {
+                 "text": text,
+                 "answer": answer,
+                 "category": question_category
+             }
+ 
+             training_data.append(training_entry)
+ 
+         json.dump(training_data, f, indent=4)
+ 
+ process_data()
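The threshold-then-append-`'ALL'` routing used above can be sketched in isolation (the category list matches the one in this repo; the logits are illustrative):

```python
import numpy as np

# Category list in the same order as the categorizer's output classes.
categories = ['Geography', 'Religion', 'Philosophy', 'Trash', 'Mythology',
              'Literature', 'Science', 'Social Science', 'History',
              'Current Events', 'Fine Arts']

def select_categories(logits, threshold=1.5):
    """Return every category whose logit clears the threshold, plus 'ALL'.

    `logits` is a 1 x num_class array, as produced by one forward pass.
    """
    # np.argwhere on a 2-D array yields (row, col) pairs; keep the columns.
    confident = np.argwhere(logits >= threshold)[:, 1]
    selected = [categories[i] for i in confident]
    selected.append('ALL')  # fallback bucket every question belongs to
    return selected

logits = np.array([[0.2, 2.1, -0.3, 0.0, 1.7, 0.4, 0.1, 0.0, 0.9, 0.0, 0.3]])
print(select_categories(logits))  # ['Religion', 'Mythology', 'ALL']
```

Questions that clear no threshold still land in the `'ALL'` bucket, so every example trains at least one tf-idf model.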
folder ADDED
File without changes
models/categorizer ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c88b6bbd839788571ec706c83dc8dc51f3dcd3e2ba1569bc2d8f5525565b441
+ size 26940976
qbmodel.py ADDED
@@ -0,0 +1,255 @@
+ from typing import List, Tuple
+ import json
+ 
+ import numpy as np
+ from tqdm import tqdm
+ 
+ import question_categorizer as qc
+ from question_categorizer import TextClassificationModel
+ from tfidf_model import NLPModel
+ 
+ 
+ class QuizBowlModel:
+ 
+     def __init__(self, clear=False):
+         """
+         Load your model(s) and whatever else you need in this function.
+ 
+         Do NOT load your model or resources in the guess_and_buzz() function,
+         as it will increase latency severely.
+         """
+         self.categories = ['Geography', 'Religion', 'Philosophy', 'Trash',
+                            'Mythology', 'Literature', 'Science', 'Social Science',
+                            'History', 'Current Events', 'Fine Arts', 'ALL']
+         self.tfidf_models = [None for _ in range(len(self.categories))]
+         self.qc_model = qc.TextClassificationModel.load_model("models/categorizer")
+ 
+         self.load_tfidf_models(clear=clear)
+ 
+     def guess_and_buzz(self, question_text: List[str]) -> List[Tuple[str, bool]]:
+         """
+         This function accepts a list of question strings, and returns a list of
+         tuples containing strings representing the guess and corresponding
+         booleans representing whether or not to buzz.
+ 
+         So, guess_and_buzz(["This is a question"]) should return [("answer", False)]
+ 
+         If you are using a deep learning model, try to use batched prediction
+         instead of iterating using a for loop.
+         """
+         guesses = []
+         curr_question = ""
+ 
+         for question in question_text:
+             curr_question += question + "."
+ 
+             confidence, answer = self.predict(curr_question)
+             buzz = confidence > 0.5
+ 
+             # Return (guess, buzz) tuples, matching the docstring contract
+             guesses.append((answer, buzz))
+ 
+         return guesses
+ 
+     def load_tfidf_models(self, clear=False):
+         print("loading tfidf models")
+ 
+         if not clear:
+             # Load each saved per-category model (including 'ALL') from disk
+             for category in range(len(self.categories)):
+                 if self.tfidf_models[category] is None:
+                     self.tfidf_models[category] = NLPModel().load(
+                         f"models/{self.categories[category]}_tfidf.pkl")
+         else:
+             # Start from fresh, untrained models
+             for category in range(len(self.categories)):
+                 if self.tfidf_models[category] is None:
+                     self.tfidf_models[category] = NLPModel()
+ 
+     def train(self, data):
+         # One bucket of training examples per category
+         training_data = [[] for _ in range(len(self.categories))]
+ 
+         with tqdm(total=len(data)) as pbar:
+             for data_point in data:
+                 text = data_point["text"]
+                 answer = data_point["answer"]
+                 categories = data_point["category"]
+ 
+                 for category in categories:
+                     category_ind = self.categories.index(category)
+                     training_data[category_ind].append({"text": text, "answer": answer})
+ 
+                 pbar.update(1)
+ 
+         for ind, category_data in enumerate(training_data):
+             self.tfidf_models[ind].process_data(category_data)
+ 
+             # Train, save, then unload each model to limit memory use
+             self.tfidf_models[ind].train_model()
+             self.tfidf_models[ind].save(f"models/{self.categories[ind]}_tfidf.pkl")
+             self.tfidf_models[ind] = None
+             training_data[ind] = []
+ 
+         print("Training complete.")
+ 
+     def predict(self, input_data, confidence_threshold=1.5):
+         # Get per-category confidence scores from the categorizer
+         category_confidences = self.qc_model.predict(input_data)
+ 
+         # Indices of categories whose confidence clears the threshold
+         confident_indices = (category_confidences > confidence_threshold).nonzero()[:, 1]
+ 
+         max_confidence = 0
+         max_answer = None
+         for category in confident_indices:
+             confidence, answer = self.tfidf_models[category].predict(input_data)
+ 
+             if confidence > max_confidence:
+                 max_confidence = confidence
+                 max_answer = answer
+ 
+         # Squash the unbounded similarity score into [0, 1)
+         return (np.tanh(max_confidence), max_answer)
+ 
+     def evaluate(self, input_data):
+         correct = 0
+         count = 0
+ 
+         with tqdm(total=len(input_data)) as pbar:
+             for data_point in input_data:
+                 count += 1
+                 text = data_point["text"]
+                 answer = data_point["answer"]
+ 
+                 answer_predict = self.predict(text)[1]
+                 if answer == answer_predict:
+                     correct += 1
+ 
+                 if count % 10 == 0:
+                     print(f'rolling average: {correct / count}')
+ 
+                 pbar.update(1)
+ 
+         return correct / len(input_data)
+ 
+ 
+ if __name__ == "__main__":
+     # Train a simple model on QB data, save it to a file
+     import argparse
+     parser = argparse.ArgumentParser()
+ 
+     parser.add_argument('--data', type=str)
+     parser.add_argument('--predict', type=str)
+     parser.add_argument('--clear', action='store_const', const=True, default=False)
+     parser.add_argument('--evaluate', type=str)
+ 
+     flags = parser.parse_args()
+ 
+     model = QuizBowlModel(clear=flags.clear)
+ 
+     if flags.data:
+         # --data takes a single JSON path
+         with open(flags.data, 'r') as data_file:
+             data_json = json.load(data_file)
+ 
+         model.train(data_json)
+ 
+     if flags.predict:
+         print(model.predict(flags.predict))
+ 
+     if flags.evaluate:
+         with open(flags.evaluate, 'r') as data_file:
+             data_json = json.load(data_file)
+         print(f'accuracy: {model.evaluate(data_json)}')
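The confidence squashing and buzz decision used by `predict` and `guess_and_buzz` above can be sketched in isolation (0.5 buzz threshold as in the code; the raw scores are illustrative):

```python
import math

def to_buzz(raw_confidence, threshold=0.5):
    """Squash an unbounded, non-negative similarity score into [0, 1)
    with tanh, then buzz only when it clears the threshold."""
    squashed = math.tanh(raw_confidence)
    return squashed, squashed > threshold

print(to_buzz(0.3))  # squashed ~0.29: do not buzz
print(to_buzz(2.0))  # squashed ~0.96: buzz
```

Since tanh is monotonic, the squashing never changes which candidate wins; it only normalizes the score so a fixed 0.5 threshold is meaningful across categories.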
question_categorizer.py ADDED
@@ -0,0 +1,272 @@
+ import time
+ import gzip
+ import json
+ import argparse
+ 
+ import numpy as np
+ import nltk
+ from nltk.tokenize import sent_tokenize
+ import torch
+ from torch import nn
+ from torch.utils.data import DataLoader
+ from torch.utils.data.dataset import random_split
+ from torchtext.data.functional import to_map_style_dataset
+ from torchtext.data.utils import get_tokenizer
+ from torchtext.vocab import build_vocab_from_iterator
+ 
+ nltk.download('punkt')
+ nltk.download('stopwords')
+ nltk.download('averaged_perceptron_tagger')
+ nltk.download('maxent_ne_chunker')
+ nltk.download('words')
+ nltk.download('wordnet')
+ 
+ 
+ class TextClassificationModel(nn.Module):
+     def __init__(self, vocab_size, embed_dim, num_class, vocab):
+         super().__init__()
+         self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
+         self.fc = nn.Linear(embed_dim, num_class)
+         self.init_weights()
+         self.vocab_size = vocab_size
+         self.emsize = embed_dim
+         self.num_class = num_class
+         self.vocab = vocab
+         self.tokenizer_convert = get_tokenizer("basic_english")
+         self.text_pipeline = self.tokenizer
+ 
+     def tokenizer(self, text):
+         return self.vocab(self.tokenizer_convert(text))
+ 
+     def init_weights(self):
+         initrange = 0.5
+         self.embedding.weight.data.uniform_(-initrange, initrange)
+         self.fc.weight.data.uniform_(-initrange, initrange)
+         self.fc.bias.data.zero_()
+ 
+     def forward(self, text, offsets):
+         embedded = self.embedding(text, offsets)
+         return self.fc(embedded)
+ 
+     def train_model(self, train_dataloader, valid_dataloader):
+         # Relies on the module-level EPOCHS, criterion, optimizer and scheduler
+         total_accu = None
+         for epoch in range(1, EPOCHS + 1):
+             epoch_start_time = time.time()
+ 
+             self.train()
+             total_acc, total_count = 0, 0
+             log_interval = 500
+ 
+             for idx, (label, text, offsets) in enumerate(train_dataloader):
+                 optimizer.zero_grad()
+                 predicted_label = self(text, offsets)
+                 loss = criterion(predicted_label, label)
+                 loss.backward()
+                 torch.nn.utils.clip_grad_norm_(self.parameters(), 0.1)
+                 optimizer.step()
+                 total_acc += (predicted_label.argmax(1) == label).sum().item()
+                 total_count += label.size(0)
+                 if idx % log_interval == 0 and idx > 0:
+                     print(
+                         "| epoch {:3d} | {:5d}/{:5d} batches "
+                         "| accuracy {:8.3f}".format(
+                             epoch, idx, len(train_dataloader), total_acc / total_count
+                         )
+                     )
+                     total_acc, total_count = 0, 0
+ 
+             accu_val = self.evaluate(valid_dataloader)
+             if total_accu is not None and total_accu > accu_val:
+                 scheduler.step()
+             else:
+                 total_accu = accu_val
+             print("-" * 59)
+             print(
+                 "| end of epoch {:3d} | time: {:5.2f}s | "
+                 "valid accuracy {:8.3f} ".format(
+                     epoch, time.time() - epoch_start_time, accu_val
+                 )
+             )
+             print("-" * 59)
+ 
+     def save_model(self, file_path):
+         # Save the vocab alongside the weights so the model reloads standalone
+         model_state = {
+             'state_dict': self.state_dict(),
+             'vocab_size': self.vocab_size,
+             'embed_dim': self.emsize,
+             'num_class': self.num_class,
+             'vocab': self.vocab
+         }
+         torch.save(model_state, file_path)
+         print("Model saved successfully.")
+ 
+     @classmethod
+     def load_model(cls, file_path):
+         model_state = torch.load(file_path, map_location=torch.device('cpu'))
+         vocab_size = model_state['vocab_size']
+         embed_dim = model_state['embed_dim']
+         num_class = model_state['num_class']
+         vocab = model_state['vocab']
+ 
+         model = cls(vocab_size, embed_dim, num_class, vocab)
+         model.load_state_dict(model_state['state_dict'])
+         model.eval()
+         print("Model loaded successfully.")
+         return model
+ 
+     def evaluate(self, dataloader):
+         self.eval()
+         total_acc, total_count = 0, 0
+ 
+         with torch.no_grad():
+             for idx, (label, text, offsets) in enumerate(dataloader):
+                 predicted_label = self(text, offsets)
+                 total_acc += (predicted_label.argmax(1) == label).sum().item()
+                 total_count += label.size(0)
+         return total_acc / total_count
+ 
+     def predict(self, text):
+         with torch.no_grad():
+             text = torch.tensor(self.text_pipeline(text))
+             # A single offset of 0 means the whole tensor is one "bag"
+             return self(text, torch.tensor([0]))
+ 
+     @staticmethod
+     def read_gz_json(file_path):
+         with gzip.open(file_path, 'rt', encoding='utf-8') as f:
+             data = json.load(f)
+             for obj in data:
+                 yield obj['text'], obj['category']
+ 
+     @staticmethod
+     def preprocess_text(text):
+         # Each sentence of a document becomes its own training example
+         return sent_tokenize(text)
+ 
+     @staticmethod
+     def data_iter(file_paths, categories):
+         categories = np.array(categories)
+ 
+         for path in file_paths:
+             for text, category in TextClassificationModel.read_gz_json(path):
+                 sentences = TextClassificationModel.preprocess_text(text)
+ 
+                 for sentence in sentences:
+                     yield np.where(categories == category)[0][0], sentence
+ 
+     @staticmethod
+     def collate_batch(batch):
+         label_list, text_list, offsets = [], [], [0]
+         for _label, _text in batch:
+             label_list.append(label_pipeline(_label))
+             processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
+             text_list.append(processed_text)
+             offsets.append(processed_text.size(0))
+         label_list = torch.tensor(label_list, dtype=torch.int64)
+         offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
+         text_list = torch.cat(text_list)
+         return label_list.to(device), text_list.to(device), offsets.to(device)
+ 
+ 
+ def parse_arguments():
+     parser = argparse.ArgumentParser(description="Text Classification Model")
+     parser.add_argument("--train_path", type=str, nargs='+', required=True, help="Path to the training data")
+     parser.add_argument("--test_path", type=str, nargs='+', required=True, help="Path to the test data")
+     parser.add_argument("--epochs", type=int, default=5, help="Number of epochs for training")
+     parser.add_argument("--lr", type=float, default=3, help="Learning rate")
+     parser.add_argument("--batch_size", type=int, default=64, help="Batch size for training")
+     return parser.parse_args()
+ 
+ 
+ if __name__ == '__main__':
+     args = parse_arguments()
+ 
+     categories = ['Geography', 'Religion', 'Philosophy', 'Trash', 'Mythology', 'Literature', 'Science', 'Social Science', 'History', 'Current Events', 'Fine Arts']
+ 
+     test_path = args.test_path
+     train_path = args.train_path
+ 
+     tokenizer = get_tokenizer("basic_english")
+     train_iter = iter(TextClassificationModel.data_iter(train_path, categories))
+ 
+     def yield_tokens(data_iter):
+         for _, text in data_iter:
+             yield tokenizer(text)
+ 
+     vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
+     vocab.set_default_index(vocab["<unk>"])
+ 
+     text_pipeline = lambda x: vocab(tokenizer(x))
+     label_pipeline = lambda x: int(x)
+ 
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ 
+     # Re-create the iterator (the vocab pass above exhausted it)
+     train_iter = iter(TextClassificationModel.data_iter(train_path, categories))
+     classes = set([label for (label, text) in train_iter])
+     num_class = len(classes)
+     vocab_size = len(vocab)
+     emsize = 64
+     model = TextClassificationModel(vocab_size, emsize, num_class, vocab).to(device)
+ 
+     # Hyperparameters
+     EPOCHS = args.epochs
+     LR = args.lr
+     BATCH_SIZE = args.batch_size
+ 
+     criterion = torch.nn.CrossEntropyLoss()
+     optimizer = torch.optim.SGD(model.parameters(), lr=LR)
+     scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
+ 
+     train_iter = iter(TextClassificationModel.data_iter(train_path, categories))
+     test_iter = iter(TextClassificationModel.data_iter(test_path, categories))
+     train_dataset = to_map_style_dataset(train_iter)
+     test_dataset = to_map_style_dataset(test_iter)
+     num_train = int(len(train_dataset) * 0.95)
+     split_train_, split_valid_ = random_split(
+         train_dataset, [num_train, len(train_dataset) - num_train]
+     )
+ 
+     train_dataloader = DataLoader(
+         split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=TextClassificationModel.collate_batch
+     )
+     valid_dataloader = DataLoader(
+         split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=TextClassificationModel.collate_batch
+     )
+     test_dataloader = DataLoader(
+         test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=TextClassificationModel.collate_batch
+     )
+ 
+     model.train_model(train_dataloader, valid_dataloader)
+ 
+     print("Checking the results of test dataset.")
+     accu_test = model.evaluate(test_dataloader)
+     print("test accuracy {:8.3f}".format(accu_test))
+ 
+     model.save_model("text_classification_model.pth")
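The offsets bookkeeping that `collate_batch` performs for `nn.EmbeddingBag` can be illustrated without torch (pure Python; the token IDs are illustrative):

```python
from itertools import accumulate

def make_offsets(token_lists):
    """Flatten variable-length token sequences into one list and compute
    EmbeddingBag-style offsets: the start index of each sequence in the
    flattened list (cumulative sum of lengths, dropping the last)."""
    lengths = [len(seq) for seq in token_lists]
    flat = [tok for seq in token_lists for tok in seq]
    offsets = [0] + list(accumulate(lengths[:-1]))
    return flat, offsets

flat, offsets = make_offsets([[4, 9, 2], [7], [3, 3]])
print(flat)     # [4, 9, 2, 7, 3, 3]
print(offsets)  # [0, 3, 4]
```

This is why `predict` above passes `torch.tensor([0])` as the offsets: a single offset of 0 marks the entire flattened tensor as one bag.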
rand.py ADDED
@@ -0,0 +1,20 @@
+ from constituent_treelib import ConstituentTree
+ 
+ # First, we have to provide a sentence that should be parsed
+ sentence = "I've got a machine learning task involving a large amount of text data."
+ 
+ # Then, we define the language that should be considered with respect to the underlying models
+ language = ConstituentTree.Language.English
+ 
+ # You can also specify the desired model for the language ("Small" is selected by default)
+ spacy_model_size = ConstituentTree.SpacyModelSize.Medium
+ 
+ # Next, we must create the necessary NLP pipeline.
+ # If you wish, you can instruct the library to download and install the models automatically
+ nlp = ConstituentTree.create_pipeline(language, spacy_model_size)  # , download_models=True
+ 
+ # Now, we can instantiate a ConstituentTree object and pass it the sentence and the NLP pipeline
+ tree = ConstituentTree(sentence, nlp)
+ 
+ # Finally, we can extract the phrases
+ tree.extract_all_phrases()
requirements.txt ADDED
@@ -0,0 +1,225 @@
+ accelerate==0.29.3
+ alabaster==0.7.8
+ altair==5.3.0
+ annotated-types==0.6.0
+ anyascii==0.3.2
+ anyio==4.2.0
+ asttokens==2.4.1
+ attrs==19.3.0
+ Automat==0.8.0
+ Babel==2.6.0
+ backcall==0.2.0
+ beautifulsoup4==4.12.3
+ benepar==0.2.0
+ blinker==1.7.0
+ blis==0.7.11
+ bs4==0.0.2
+ cachetools==5.3.3
+ catalogue==2.0.10
+ certifi==2019.11.28
+ chardet==3.0.4
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloud-init==23.4.4
+ cloudpathlib==0.16.0
+ colorama==0.4.3
+ command-not-found==0.3
+ confection==0.1.4
+ configobj==5.0.6
+ constantly==15.1.0
+ constituent-treelib==0.0.7
+ contractions==0.1.73
+ cryptography==2.8
+ cymem==2.0.8
+ dbus-python==1.2.16
+ decorator==5.1.1
+ distro==1.9.0
+ distro-info==0.23+ubuntu1.1
+ docopt==0.6.2
+ docutils==0.16
+ entrypoints==0.3
+ exceptiongroup==1.2.0
+ executing==2.0.1
+ falcon==3.1.3
+ feedparser==5.2.1
+ filelock==3.13.1
+ flask==3.0.2
+ fsspec==2023.12.2
+ gensim==4.3.2
+ gevent==23.9.1
+ gitdb==4.0.11
+ GitPython==3.1.43
+ gitsome==0.8.0
+ greenlet==3.0.3
+ h11==0.14.0
+ httpcore==1.0.2
+ httplib2==0.14.0
+ httpx==0.26.0
+ huggingface-hub==0.22.2
+ huspacy==0.11.0
+ hyperlink==19.0.0
+ idna==2.8
+ ijson==3.2.3
+ imagesize==1.2.0
+ importlib-metadata==7.1.0
+ importlib-resources==6.4.0
+ incremental==16.10.1
+ iniconfig==2.0.0
+ install==1.3.5
+ ipython==8.12.3
+ itsdangerous==2.1.2
+ jedi==0.19.1
+ Jinja2==3.1.3
+ joblib==1.3.2
+ jsonpatch==1.22
+ jsonpointer==2.0
+ jsonschema==3.2.0
+ keyring==18.0.1
+ langcodes==3.3.0
+ langid==1.1.6
+ language-selector==0.1
+ launchpadlib==1.10.13
+ lazr.restfulclient==0.14.2
+ lazr.uri==1.0.3
+ lxml==5.1.0
+ Mako==1.3.1
+ markdown-it-py==3.0.0
+ MarkupSafe==2.1.5
+ matplotlib-inline==0.1.7
+ mdurl==0.1.2
+ more-itertools==4.2.0
+ mpmath==1.3.0
+ murmurhash==1.0.10
+ netifaces==0.10.4
+ networkx==3.1
+ nltk==3.8.1
+ numpy==1.24.4
+ numpydoc==0.7.0
+ nvidia-cublas-cu12==12.1.3.1
+ nvidia-cuda-cupti-cu12==12.1.105
+ nvidia-cuda-nvrtc-cu12==12.1.105
+ nvidia-cuda-runtime-cu12==12.1.105
+ nvidia-cudnn-cu12==8.9.2.26
+ nvidia-cufft-cu12==11.0.2.54
+ nvidia-curand-cu12==10.3.2.106
+ nvidia-cusolver-cu12==11.4.5.107
+ nvidia-cusparse-cu12==12.1.0.106
+ nvidia-nccl-cu12==2.19.3
+ nvidia-nvjitlink-cu12==12.3.101
+ nvidia-nvtx-cu12==12.1.105
+ oauthlib==3.1.0
+ olefile==0.46
+ openai==1.9.0
+ packaging==20.3
+ pandas==2.0.0
+ parso==0.8.4
+ pdfkit==1.0.0
+ pexpect==4.6.0
+ pickleshare==0.7.5
+ pillow==10.3.0
+ pluggy==1.5.0
+ ply==3.11
+ portalocker==2.8.2
+ preshed==3.0.9
+ prompt-toolkit==3.0.43
+ protobuf==3.20.3
+ psutil==5.9.8
+ psycopg2-binary==2.9.6
+ pure-eval==0.2.2
+ pyahocorasick==2.1.0
+ pyarrow==16.0.0
+ pyasn1==0.4.2
+ pyasn1-modules==0.2.1
+ pydantic==2.5.3
+ pydantic-core==2.14.6
+ pydeck==0.9.0b1
+ pygments==2.17.2
+ PyGObject==3.36.0
+ PyHamcrest==1.9.0
+ pyinotify==0.9.6
+ PyJWT==1.7.1
+ pymacaroons==0.13.0
+ PyNaCl==1.3.0
+ pyOpenSSL==19.0.0
+ pyparsing==2.4.6
+ pyrsistent==0.15.5
+ pyserial==3.4
+ pytest==8.2.0
+ python-apt==2.0.1+ubuntu0.20.4.1
+ python-baseconv==1.2.2
+ python-dateutil==2.8.2
+ python-debian==0.1.36+ubuntu1.1
+ python-dotenv==0.10.5
+ pytz==2023.3
+ PyYAML==5.3.1
+ regex==2023.12.25
+ requests==2.31.0
+ requests-unixsocket==0.2.0
+ rich==13.7.1
+ roman==2.0.0
+ safetensors==0.4.3
+ scikit-learn==1.3.2
+ scipy==1.10.1
+ SecretStorage==2.3.1
+ sentencepiece==0.2.0
+ service-identity==18.1.0
+ setproctitle==1.1.10
+ simplejson==3.16.0
+ six==1.14.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sniffio==1.3.0
+ sos==4.5.6
+ soupsieve==2.5
+ spacy==3.7.2
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ Sphinx==1.8.5
+ srsly==2.4.8
+ ssh-import-id==5.10
+ stack-data==0.6.3
+ streamlit==1.33.0
+ svgling==0.4.0
+ svgwrite==1.4.3
+ sympy==1.12
+ systemd-python==234
+ tenacity==8.2.3
+ textblob==0.18.0.post0
+ textsearch==0.0.24
+ thinc==8.2.2
+ threadpoolctl==3.2.0
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch==2.2.2
+ torch-struct==0.5
+ torchtext==0.17.2
+ torchvision==0.16.2
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.14.3
+ transformers==4.40.1
+ triton==2.2.0
+ Twisted==18.9.0
+ typer==0.9.0
+ typing-extensions==4.9.0
+ tzdata==2023.3
+ ubuntu-pro-client==8001
+ ufw==0.36
+ unattended-upgrades==0.1
+ Unidecode==1.3.8
+ urllib3==1.26.9
+ wadllib==1.3.3
+ Wand==0.6.13
+ wasabi==1.1.2
+ watchdog==4.0.0
+ wcwidth==0.1.8
+ weasel==0.3.4
+ werkzeug==3.0.2
+ xonsh==0.9.13
+ zimply==1.1.4
+ zipp==3.18.1
+ zope.event==5.0
+ zope.interface==4.7.1
+ zstandard==0.22.0
submission.py ADDED
@@ -0,0 +1,49 @@
+ from QBModelConfig import QBModelConfig
+ from QBModelWrapper import QBModelWrapper
+ from QAPipeline import QApipeline
+ from transformers import AutoConfig, AutoModel, AutoModelForQuestionAnswering, TFAutoModelForQuestionAnswering
+ from transformers import pipeline
+ from transformers.pipelines import PIPELINE_REGISTRY
+ from huggingface_hub import Repository
+ 
+ # Register the custom config and model classes with the Auto* factories
+ AutoConfig.register("TFIDF-QA", QBModelConfig)
+ AutoModel.register(QBModelConfig, QBModelWrapper)
+ AutoModelForQuestionAnswering.register(QBModelConfig, QBModelWrapper)
+ 
+ QBModelConfig.register_for_auto_class()
+ QBModelWrapper.register_for_auto_class("AutoModel")
+ QBModelWrapper.register_for_auto_class("AutoModelForQuestionAnswering")
+ 
+ # Push the local repository to the Hub
+ repo = Repository("/mnt/c/Users/backe/Documents/GitHub/TriviaAnsweringMachine/")
+ repo.push_to_hub("TriviaAnsweringMachine10")
+ 
+ # Register the custom question-answering pipeline, then publish it
+ PIPELINE_REGISTRY.register_pipeline(
+     "demo-qa",
+     pipeline_class=QApipeline,
+     pt_model=AutoModelForQuestionAnswering,
+     tf_model=TFAutoModelForQuestionAnswering,
+ )
+ 
+ qa_pipe = pipeline("demo-qa", model="backedman/TriviaAnsweringMachine10", tokenizer="backedman/TriviaAnsweringMachine10")
+ qa_pipe.push_to_hub("TriviaAnsweringMachineREAL", safe_serialization=False)
tfidf_model.py ADDED
@@ -0,0 +1,275 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import nltk
3
+ from nltk.corpus import stopwords
4
+ from nltk.tokenize import word_tokenize, sent_tokenize
5
+ from nltk.stem import PorterStemmer, WordNetLemmatizer
6
+ from sklearn.feature_extraction.text import TfidfVectorizer
7
+ from sklearn.metrics.pairwise import cosine_similarity
8
+ import numpy as np
9
+ import os
10
+ import math
11
+ import pickle
12
+ import joblib
13
+ import multiprocessing
14
+ from concurrent.futures import ProcessPoolExecutor
15
+ from tqdm import tqdm # Import tqdm for progress tracking
16
+ from collections import defaultdict
17
+
18
+
19
+ nltk.download('punkt')
20
+ nltk.download('stopwords')
21
+ nltk.download('averaged_perceptron_tagger')
22
+ nltk.download('maxent_ne_chunker')
23
+ nltk.download('words')
24
+ nltk.download('wordnet')
25
+ nltk.download('omw-1.4')
26
+
27
+
28
+ # Helper function to map NLTK POS tags to WordNet POS tags
29
+ def get_wordnet_pos(treebank_tag):
30
+ if treebank_tag.startswith('J'):
31
+ return nltk.corpus.wordnet.ADJ
32
+ elif treebank_tag.startswith('V'):
33
+ return nltk.corpus.wordnet.VERB
34
+ elif treebank_tag.startswith('N'):
35
+ return nltk.corpus.wordnet.NOUN
36
+ elif treebank_tag.startswith('R'):
37
+ return nltk.corpus.wordnet.ADV
38
+ else:
39
+ return nltk.corpus.wordnet.NOUN
40
+
41
+ class NLPModel:
42
+ def __init__(self): # Initialize the model with necessary parameters
43
+ # Initialize model components (preprocessing, training, etc.)
44
+ #self.model
45
+
46
+ self.tfidf = TfidfVectorizer(tokenizer=self.tokenize, lowercase=False)
47
+
48
+ self.training_tfidf = None
49
+
50
+ #self.manager = multiprocessing.Manager()
51
+
52
+ self.flattened_sentences = []
53
+ self.training_tagged = []
54
+ self.answers = []
55
+
56
+
57
+
58
+ def tokenize(self, text):
59
+ # Your tokenization logic goes here
60
+ return text # No tokenization needed, return the input as-is
61
+
62
+     def preprocess_text(self, text):
+         # Split into sentences, then tokenize and normalize in batches.
+         sentences = sent_tokenize(text)
+
+         preprocessed_sentences = []
+         batch_size = 50  # adjust based on your system's capabilities
+         for i in range(0, len(sentences), batch_size):
+             batch_sentences = sentences[i:i + batch_size]
+             batch_words = [word_tokenize(sentence) for sentence in batch_sentences]
+
+             # Filter stop words
+             stop_words = set(stopwords.words('english'))
+             filtered_words = [[word for word in words if word.lower() not in stop_words] for words in batch_words]
+
+             # Stem
+             stemmer = PorterStemmer()
+             stemmed_words = [[stemmer.stem(word) for word in words] for words in filtered_words]
+
+             # Tag parts of speech
+             pos_tags = [nltk.pos_tag(words) for words in stemmed_words]
+
+             # Lemmatize, using the POS tag to pick the WordNet POS
+             lemmatizer = WordNetLemmatizer()
+             lemmatized_words = [[lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in pos] for pos in pos_tags]
+
+             preprocessed_sentences.extend(lemmatized_words)
+
+         return preprocessed_sentences
+
+     def process_data(self, data_json):
+         batch_size = 10000  # experiment with different batch sizes
+         num_processes = max(1, multiprocessing.cpu_count() // 2)
+
+         batches = [data_json[i:i + batch_size] for i in range(0, len(data_json), batch_size)]
+
+         sentence_answers = []
+         with ProcessPoolExecutor(max_workers=num_processes) as executor:
+             results = list(tqdm(executor.map(self.process_data_batch, batches), total=len(batches)))
+
+         for batch_result in results:
+             for result in batch_result:
+                 sentence_answers.extend(result)
+
+         # Group sentences by answer: merge all tokens that share an answer
+         # into one document, and append those merged pairs as extra
+         # training examples alongside the per-sentence pairs.
+         answer_groups = defaultdict(list)
+         for sentence, answer in sentence_answers:
+             answer_groups[answer].extend(sentence)
+         sentence_answers.extend([(sentence, answer) for answer, sentence in answer_groups.items()])
+
+         self.flattened_sentences.extend([x[0] for x in sentence_answers])
+         self.training_tagged.extend([x[1] for x in sentence_answers])
+
+     def process_data_batch(self, batch):
+         # Preprocess each question's text and pair every resulting
+         # sentence (token list) with that question's answer.
+         batch_results = []
+         for data in batch:
+             text = data["text"]
+             answer = data["answer"]
+             preprocessed_sentences = self.preprocess_text(text)
+             batch_results.append([(sentence, answer) for sentence in preprocessed_sentences])
+
+         return batch_results
+
+     def train_model(self):
+         # Fit the TF-IDF vectorizer on the collected token lists, then
+         # release them; only the fitted matrix is needed for prediction.
+         if self.flattened_sentences:
+             self.training_tfidf = self.tfidf.fit_transform(self.flattened_sentences)
+             self.flattened_sentences = []
+
+     def save(self, file_path):
+         model_data = {
+             'training_tagged': list(self.training_tagged),
+             'tfidf': self.tfidf,
+             'training_tfidf': self.training_tfidf
+         }
+         with open(file_path, 'wb') as f:
+             joblib.dump(model_data, f)
+
+     def load(self, file_path):
+         # joblib.load accepts a path directly, so no separate file handle
+         # is needed.
+         if os.path.exists(file_path):
+             model_data = joblib.load(file_path)
+             self.training_tagged = list(model_data['training_tagged'])
+             self.tfidf = model_data['tfidf']
+             self.training_tfidf = model_data['training_tfidf']
+
+         return self
+
+     def predict(self, input_data):
+         # Preprocess the query and project it into the training TF-IDF space.
+         new_text_processed = self.preprocess_text(input_data)
+         new_text_processed_tfidf = self.tfidf.transform(new_text_processed)
+
+         # Cosine similarity between each query sentence and every training document.
+         sentence_similarities = cosine_similarity(new_text_processed_tfidf, self.training_tfidf)
+
+         # For each answer, keep the highest similarity seen across all query sentences.
+         similarities_max = {}
+         answers = []
+         for similarity_row in sentence_similarities:
+             for answer, similarity in zip(self.training_tagged, similarity_row):
+                 if isinstance(answer, list):
+                     continue
+                 if answer not in similarities_max or similarity > similarities_max[answer]:
+                     similarities_max[answer] = similarity
+
+             if not answers:
+                 answers.extend(similarities_max.keys())
+
+         # Return the best similarity score and its answer.
+         total_similarities = np.array([similarities_max[answer] for answer in answers])
+         closest_index = np.argmax(total_similarities)
+         closest_answer = answers[closest_index]
+
+         return total_similarities[closest_index], closest_answer
+
+     def evaluate(self, test_data, labels):
+         # Evaluate the model on held-out data and return metrics (not yet implemented).
+         pass
+
+ # Additional functions for model tuning, hyperparameter optimization, etc.
+
+ if __name__ == "__main__":
+     # Train a simple model on QB data and save it to a file.
+     import argparse
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--data', type=str)
+     parser.add_argument('--model', type=str)
+     parser.add_argument('--predict', type=str)
+
+     flags = parser.parse_args()
+
+     model = NLPModel()
+
+     if flags.data:
+         with open(flags.data, 'r') as data_file:
+             data_json = json.load(data_file)
+
+         model.process_data(data_json)
+         model.train_model()
+         print(model.predict("My name is bobby, bobby newport. your name is jeff?"))
+         model.save("model.pkl")
+
+     if flags.model:
+         model.load(flags.model)
+
+     if flags.predict:
+         print(model.predict(flags.predict))
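The retrieval idea behind `NLPModel.predict` can be shown in miniature: fit a TF-IDF matrix over the known answers' text, then return the answer whose document is most cosine-similar to the query. This is a minimal sketch, not the repository's pipeline; the `corpus` contents and the `closest_answer` helper are illustrative stand-ins, and the heavy NLTK preprocessing above is skipped here.

```python
# Minimal sketch of TF-IDF retrieval as used by NLPModel.predict.
# The corpus and helper below are hypothetical examples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Map each candidate answer to a small text describing it.
corpus = {
    "Paris": "capital of France on the Seine",
    "Tokyo": "capital of Japan largest metropolitan area",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(corpus.values()))  # one row per answer

def closest_answer(query):
    # Project the query into the fitted TF-IDF space and score each row.
    sims = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    idx = int(np.argmax(sims))
    return list(corpus.keys())[idx], float(sims[idx])

answer, score = closest_answer("What is the capital of France?")
```

Because "France" appears only in the first document, the query scores highest against the "Paris" row; `NLPModel.predict` does the same comparison per query sentence and keeps the best score seen for each answer.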