init


Files changed (7)

.gitignore +2 -0
README.md +330 -6
data_collection.py +188 -0
qbmodel.py +255 -0
question_categorizer.py +272 -0
requirements.txt +225 -0
tfidf_model.py +275 -0

.gitignore ADDED

	@@ -0,0 +1,2 @@


+*.json
+*.pkl

README.md CHANGED

@@ -1,6 +1,330 @@
----
-license: mit
-language:
-- en
-pipeline_tag: question-answering
----

+The evaluation of this project is to answer trivia questions.  You do
+not need to do well at this task, but you should submit a system that
+completes the task or create adversarial questions in that setting.  This will help the whole class share data and
+resources.
+If you focus on something other than predicting answers, *that's fine*!
+About the Data
+==============
+Quiz bowl is an academic competition between schools in
+English-speaking countries; hundreds of teams compete in dozens of
+tournaments each year. Quiz bowl is different from Jeopardy, a recent
+application area.  While Jeopardy also uses signaling devices, these
+are only usable after a question is completed (interrupting Jeopardy's
+questions would make for bad television).  Thus, Jeopardy is rapacious
+classification followed by a race---among those who know the
+answer---to punch a button first.
+Here's an example of a quiz bowl question:
+Expanding on a 1908 paper by Smoluchowski, he derived a formula for
+the intensity of scattered light in media fluctuating densities that
+reduces to Rayleigh's law for ideal gases in The Theory of the
+Opalescence of Homogenous Fluids and Liquid Mixtures near the Critical
+State.  That research supported his theories of matter first developed
+when he calculated the diffusion constant in terms of fundamental
+parameters of the particles of a gas undergoing Brownian Motion.  In
+that same year, 1905, he also published On a Heuristic Point of View
+Concerning the Production and Transformation of Light.  That
+explication of the photoelectric effect won him 1921 Nobel in Physics.
+For ten points, name this German physicist best known for his theory
+of Relativity.
+*ANSWER*: Albert _Einstein_
+Two teams listen to the same question. Teams interrupt the question at
+any point by "buzzing in"; if the answer is correct, the team gets
+points and the next question is read.  Otherwise, the team loses
+points and the other team can answer.
+You are welcome to use any *automatic* method to choose an answer.  It
+need not be similar nor build on our provided systems.  In addition to
+the data we provide, you are welcome to use any external data *except*
+our test quiz bowl questions (i.e., don't hack our server!).  You are
+welcome (and encouraged) to use any publicly available software, but
+you may want to check on Piazza for suggestions as many tools are
+better (or easier to use) than others.
+If you don't like the interruptability of questions, you can also just answer entire questions.  However, you must also output a confidence.
+Competition
+==================
+We will use the Dynabench website (https://dynabench.org/tasks/qa). If you remember the past workshop about Dynabench submission, this is the way to do it. The specific task name is "Grounded QA". Here, with the help of the video tutorial, you submit your QA model and assess how it did compared to others. The assessment tests your QA model on several QA test datasets, and both your results and your competitors' results will be visible on the leaderboard. Your goal is to rank the highest in terms of expected wins: you buzz in with probability proportional to your confidence, and if you're more right than the competition, you win.
+Writing Questions
+==================
+Alternatively, you can also *write* 50 adversarial questions that
+challenge modern NLP systems. These questions must be diverse in the
+subjects asked about, the skills computers need to answer the
+questions, and the entities in those questions. Remember that your questions should be *factual* and
+*specific* enough for humans to answer, because your task is to stump
+the computers relative to humans!
+In addition to the raw questions, you will also need to create citations describing:
+* Why the question is difficult for computers: include citations from the NLP/AI/ML literature
+* Why the information in the question is correct: include citations from the sources you drew on to write the question
+* Why the question is interesting: include scholarly / popular culture artifacts to prove that people care about this
+* Why the question is pyramidal: discuss why your first clues are harder than your later clues
+**Category**
+We want questions from many domains such as Art, Literature, Geography, History,
+Science, TV and Film, Music, Lifestyle, and Sport. The questions
+should be written using all topics above (5 questions for each
+category and 5 more for the remaining categories). Indicate in your
+writeup which category you chose to write on for each question.
+Art:
+* Questions about works: Mona Lisa, Raft of the Medusa
+* Questions about forms: color, contour, texture
+* Questions about artists: Picasso, Monet, Leonardo da Vinci
+* Questions about context: Renaissance, post-modernism, expressionism, surrealism
+Literature:
+*	Questions about works: novels (1984), plays (The Lion and the Jewel), poems (Rubaiyat), criticism (Poetics)
+*	Questions about major characters or events in literature: The Death of Anna Karenina, Noboru Wataya, the Marriage of Hippolyta and Theseus
+*	Questions about literary movements (Sturm und Drang)
+*	Questions about translations
+*	Cross-cutting questions (appearances of Overcoats in novels)
+*	Common link questions (the literary output of a country/region)
+Geography:
+*	Questions about location: names of capital, state, river
+*	Questions about the place: temperature, wind flow, humidity
+History:
+*	When: When did the First World war start?
+*	Who: Who is called Napoleon of Iran?
+*	Where: Where was the first Summer Olympics held?
+*	Which: Which is the oldest civilization in the world?
+Science:
+*	Questions about terminology: The concept of gravity was discovered by which famous physicist?
+*	Questions about the experiment
+*	Questions about theory: The social action theory believes that individuals are influenced by this theory.
+TV and Film:
+*	Quotes: What are the dying words of Charles Foster Kane in Citizen Kane?
+*	Title: What 1927 musical was the first "talkie"?
+*	Plot: In The Matrix, does Neo take the blue pill or the red pill?
+Music:
+*	Singer: What singer has had a Billboard No. 1 hit in each of the last four decades?
+*	Band: Before Bleachers and fun., Jack Antonoff fronted what band?
+*	Title: What was Madonna's first top 10 hit?
+*	History: Which classical composer was deaf?
+Lifestyle:
+*	Clothes: What clothing company, founded by a tennis player, has an alligator logo?
+*	Decoration: What was the first perfume sold by Coco Chanel?
+Sport:
+*	Known facts: What sport is best known as the ‘king of sports’?
+*	Nationality: What’s the national sport of Canada?
+*	Sport player: The classic 1980 movie called Raging Bull is about which real-life boxer?
+*	Country: What country has competed the most times in the Summer Olympics yet hasn’t won any kind of medal?
+**Diversity**
+Other than category diversity, if you find an ingenious way of writing questions about underrepresented countries, you will get bonus points (indicate in your writeup which questions include the diversity component). You may decide which countries are underrepresented using your own reasonable criteria (e.g., a small population may indicate underrepresentation), but make sure to articulate this in your writeup.
+* Run state of the art QA systems on the questions to show they struggle, give individual results for each question and a summary over all questions
+For an example of what the writeup for a single question should look like, see the adversarial HW:
+https://github.com/Pinafore/nlp-hw/blob/master/adversarial/question.tex
+Proposal
+==================
+The project proposal is a one page PDF document that describes:
+* Who is on your team (team sizes can be between three and six
+  students, but six is really too big to be effective; my suggestion
+  is that most groups should be between four or five).
+* What techniques you will explore
+* Your timeline for completing the project (be realistic; you should
+  have your first submission in a week or two)
+Submit the proposal on Gradescope, but make sure to include all group
+members.  If all group members are not included, you will lose points.  Late days cannot be used on this
+assignment.
+Milestone 1
+======================
+You'll have to update how things are going: what's
+working, what isn't, and how does it change your timeline?  How does it change your division of labor?
+*Question Writing*: You'll need to have answers selected for all of
+your questions and first drafts of at least 15 questions.  This must
+be submitted as a JSON file so that we run computer QA systems on it.
+*Project*: You'll need to have made a submission to the leaderboard with something that satisfies the API.
+Submit a PDF updating on your progress to Gradescope.  If all team
+members are not on the submission, you will lose points.
+Milestone 2
+===================
+As before, provide an updated timeline / division of labor and your intermediate results.
+*Question Writing*: You'll need to have reflected the feedback from the first questions and completed a first draft of at least 30 questions.  You'll also need machine results to your questions and an overall evaluation of your human/computer accuracy.
+*Project*: You'll need to have made a submission to the leaderboard with a working system (i.e., one that not only obeys the API but actually gets reasonable answers).
+Submit a PDF updating on your progress.
+Final Presentation
+======================
+The final presentation will be virtual (uploading a video).  In
+the final presentation you will:
+* Explain what you did
+* Who did what.  For example, for the question writing project a team of five people might write: A wrote the first draft of questions.  B and C verified they were initially answerable by a human.  B ran computer systems to verify they were challenging to a computer.  C edited the questions and increased the computer difficulty.  D and E verified that the edited questions were still answerable by a human.  D and E checked all of the questions for factual accuracy and created citations and the writeup.
+* What challenges you had
+* Review how well you did (based on the competition or your own metrics).  If you do not use the course infrastructure to evaluate your project's work, you should talk about what alternative evaluations you used, why they're appropriate/fair, and how well you did on them.
+* Provide an error analysis.  An error analysis must contain examples from the
+  development set that you get wrong.  You should show those sentences
+  and explain why (in terms of features or the model) they have the
+  wrong answer.  You should have been doing this all along as you
+  derive new features, but this is your final inspection of
+  your errors. The feature or model problems you discover should not
+  be trivial features you could add easily.  Instead, these should be
+  features or models that are difficult to correct.  An error analysis
+  is not the same thing as simply presenting the error matrix, as it
+  does not inspect any individual examples.  If you're writing questions, talk about examples of questions that didn't work out as intended.
+* The linguistic motivation for your features / how you wrote the questions.  This is a
+  computational linguistics class, so you should give precedence to
+  features / techniques that we use in this class (e.g., syntax,
+  morphology, part of speech, word sense, etc.).  Given two features
+  that work equally well and one that is linguistically motivated,
+  we'll prefer the linguistically motivated one.
+* Presumably you did many different things; how did they each
+  individually contribute to your final result?
+Each group has 10 minutes to deliver their presentation. Please record the video, upload it to Google Drive, and include the link in your writeup submission.
+Final Question Submission
+======================
+Because we need to get the questions ready for the systems, upload your raw questions on May 10.  This doesn't include the citations or other parts of the writeup.
+System Submission
+======================
+You must submit a version of your system by May 12. It may not be perfect, but this is what the question writing teams will use to test their results.
+Your system should be sent directly to the professor and TAs in zip files, including the correct dependencies and a working inference code. Your inference code should run successfully in the root folder (extracted from zip folder) directory with the command:
+```
+> python3 inference.py --data=evaluation_set.json
+```
+The input will be a .json file () in the same format as the file the adversarial question writing team submits. The output should also be a string.
+If you have any notes or comments that we should be aware of while running your code, please include them in the folder as a .txt file. Also, dependency information should be included as a .txt file.
+Please prepend your email title with [2024-CMSC 470 System Submission].
+Project Writeup and JSON file
+======================
+By May 17, submit your project writeup explaining what
+you did and what results you achieved.  This document should
+make it clear:
+* Why this is a good idea
+* What you did
+* Who did what
+* Whether your technique worked or not
+For systems, please do not go over 2500 words unless you have a really good reason.
+Images are a much better use of space than words, usually (there's no
+limit on including images, but use judgement and be selective).
+For question writing, you have one page (single spaced, two column) per question plus a two page summary of results. Talk about how you organized the question writing, how you evaluated the questions, and a summary of the results.  Along with your writeup, turn in a JSON file including the raw text of the question, the answer, and the category. An example JSON file is included in this directory. Make sure your JSON file is in the correct format and can be loaded with the code below. Your submission will not be graded if it does not follow the format of the example JSON file.
+```
+with open('path to your json file', 'r') as f:
+    data = json.load(f)
+```
+Grade
+======================
+The grade will be out of 25 points, broken into five areas:
+* _Presentation_: For your oral presentation, do you highlight what
+  you did and make people care?  Did you use time well during the
+  presentation?
+* _Writeup_: Does the writeup explain what you did in a way that is
+  clear and effective?
+The final three areas are different between the system and the questions.
+|    |      System      |  Questions |
+|----------|:-------------:|------:|
+| _Technical Soundness_ |  Did you use the right tools for the job, and did you use them correctly?  Were they relevant to this class? | Were your questions correct and accurately cited? |
+| _Effort_ |  Did you do what you said you would, and was it the right amount of effort?  | Are the questions well-written, interesting, and thoroughly edited? |
+| _Performance_ | How did your techniques perform in terms of accuracy, recall, etc.? | Is the human accuracy substantially higher than the computer accuracy? |
+All members of the group will receive the same grade.  It's impossible for the course staff to adjudicate Rashomon-style accounts of who did what, and the goal of a group project is for all team members to work together to create a cohesive project.  While it makes sense to divide the work into distinct areas of responsibility, at grading time we have no way to know who really did what, so it's the group's responsibility to create a piece of output that reflects well on the whole group.
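
The buzzing rule from the README's Competition section (buzz in with probability proportional to your confidence) can be sketched as below. This is a minimal illustration, not the Dynabench scoring code: `should_buzz` and its `rng` parameter are hypothetical names, and the sketch assumes the confidence has already been scaled to the range [0, 1].

```python
import random


def should_buzz(confidence: float, rng: random.Random) -> bool:
    """Buzz with probability equal to a confidence in [0, 1] (hypothetical helper)."""
    # Clamp so that out-of-range scores cannot act as invalid probabilities.
    p = min(max(confidence, 0.0), 1.0)
    # rng.random() is uniform on [0, 1), so this is True with probability p.
    return rng.random() < p


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    buzzes = sum(should_buzz(0.8, rng) for _ in range(10_000))
    print(f"buzzed on {buzzes} of 10000 simulated questions")
```

Passing the random generator in explicitly keeps the helper easy to test; a system that prefers a deterministic rule could instead compare the confidence against a fixed threshold.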

data_collection.py ADDED

	@@ -0,0 +1,188 @@

+import json
+from bs4 import BeautifulSoup
+import re
+from tqdm import tqdm  # Import tqdm for progress tracking
+import sys
+import question_categorizer as qc
+import numpy as np
+from question_categorizer import TextClassificationModel
+qc_model = qc.TextClassificationModel.load_model("models/categorizer")
+categories = ['Geography', 'Religion', 'Philosophy', 'Trash','Mythology', 'Literature','Science', 'Social Science', 'History', 'Current Events', 'Fine Arts']
+def remove_newline(string):
+    return re.sub('\n+', ' ', string)
+def clean_text(text, answer):
+    # Remove HTML tags
+    text = re.sub(r'<.*?>', '', text)
+    #text = re.sub(r'?','.',text)
+    text = text.replace('?','.')
+    # Clean the text further
+    text = re.sub(r'[^a-zA-Z.\s-]', '', text)
+    # Remove answer from text
+    try:
+        # Preprocess the answer to replace underscores with spaces
+        processed_answer = answer.replace('_', ' ')
+        # Remove parentheses from the processed answer
+        processed_answer = re.sub(r'\([^)]*\)', '', processed_answer)
+        # Replace all instances of the processed answer with an empty string, ignoring case
+        text = re.sub(re.escape(processed_answer), '', text, flags=re.IGNORECASE)
+    except Exception as e:
+        print("An error occurred during text cleaning:", e)
+        print("Text:", text)
+        print("Answer:", answer)
+    # Remove extra whitespaces
+    text = re.sub(r'\s+', ' ', text)
+    return text.strip()
+def process_data():
+    #with open("data/JEOPARDY_QUESTIONS1.json", "r") as f:
+    #    jeopardy_data = json.load(f)
+    jeopardy_data = []
+    wiki_files = [
+    ]
+    question_files = [
+        "qadata.json"]
+    wiki_data = []
+    question_data = []
+    for file_path in wiki_files:
+        with open('data/' + file_path, "r") as f:
+            wiki_data.extend(json.load(f))
+    for file_path in question_files:
+        with open('data/' + file_path, "r") as f:
+            question_data.extend(json.load(f))
+    #print(question_data)
+    with open("data/training_data.json", "w") as f:
+        training_data = []
+        # Process Jeopardy data
+        print("Processing Jeopardy data...")
+        for entry in tqdm(jeopardy_data):
+            question = entry["question"]
+            answer = str(entry["answer"])
+            # Preprocess the text
+            soup = BeautifulSoup(question, 'html.parser')
+            clean_question = ''.join(soup.findAll(text=True, recursive=False))
+            question_category = []
+            # Get category from qc_model
+            prediction = qc_model.predict(question)
+            predictions = np.argwhere(prediction >= 1.5)[1]
+            for prediction_ind in predictions:
+                # Store data in array with respective index
+                question_category.append(categories[prediction_ind])
+            question_category.append('ALL')
+            training_entry = {
+                "text": clean_question,
+                "answer": answer,#,
+                # Mohit, put categorizing code here
+                "category": question_category
+            }
+            training_data.append(training_entry)
+        # Process Wikipedia data
+        print("Processing Wikipedia data...")
+        for entry in tqdm(wiki_data):
+            page = str(entry["page"])
+            text = entry["text"]
+            if(text == ""):
+                continue
+            text = remove_newline(text)
+            text = clean_text(text, page)
+            question_category = []
+            # Get category from qc_model
+            prediction = qc_model.predict(text)
+            predictions = np.argwhere(prediction >= 1.5)[1]
+            for prediction_ind in predictions:
+                # Store data in array with respective index
+                question_category.append(categories[prediction_ind])
+            question_category.append('ALL')
+            training_entry = {
+                "text": text,
+                "answer": page,
+                # Mohit, put categorizing code here
+                "category": question_category
+            }
+            training_data.append(training_entry)
+        print("Processing Misc data...")
+        for entry in tqdm(question_data):
+            answer = str(entry["answer"])
+            text = entry["text"]
+            if(text == "" or answer == ""):
+                continue
+            text = remove_newline(text)
+            text = clean_text(text, answer)
+            question_category = []
+            # Get category from qc_model
+            try:
+              prediction = qc_model.predict(text)
+              predictions = np.argwhere(prediction >= 1.5)[1]
+            except Exception:
+              print("answer: " + str(answer))
+              print("text:" + str(text))
+              continue
+            for prediction_ind in predictions:
+                # Store data in array with respective index
+                question_category.append(categories[prediction_ind])
+            question_category.append('ALL')
+            training_entry = {
+                "text": text,
+                "answer": answer,
+                # Mohit, put categorizing code here
+                "category": question_category
+            }
+            training_data.append(training_entry)
+        json.dump(training_data, f, indent=4)
+process_data()
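
The category-tagging loop that `process_data` repeats for each data source amounts to thresholding one score per category and always appending `'ALL'`. The sketch below shows that step with plain Python lists. It assumes the categorizer yields one score per entry of `categories`, in the same order; `select_categories` is a hypothetical helper, not part of this commit.

```python
from typing import List, Sequence

# Same order as the `categories` list in data_collection.py.
CATEGORIES = ['Geography', 'Religion', 'Philosophy', 'Trash', 'Mythology',
              'Literature', 'Science', 'Social Science', 'History',
              'Current Events', 'Fine Arts']


def select_categories(scores: Sequence[float], threshold: float = 1.5) -> List[str]:
    """Return every category whose score reaches the threshold, plus 'ALL'."""
    chosen = [name for name, score in zip(CATEGORIES, scores) if score >= threshold]
    # Every entry is also filed under the catch-all bucket.
    chosen.append('ALL')
    return chosen


if __name__ == "__main__":
    scores = [0.0] * len(CATEGORIES)
    scores[CATEGORIES.index('Science')] = 2.0
    scores[CATEGORIES.index('History')] = 1.5
    print(select_categories(scores))
```

Factoring the step out like this would let the three processing loops share one implementation instead of repeating it.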

qbmodel.py ADDED

	@@ -0,0 +1,255 @@

+from typing import List, Tuple
+import nltk
+import sklearn
+import question_categorizer as qc
+from question_categorizer import TextClassificationModel
+from tfidf_model import NLPModel
+import tfidf_model
+import transformers
+import numpy as np
+import pandas as pd
+import json
+from tqdm import tqdm
+from collections import defaultdict
+class QuizBowlModel:
+    def __init__(self, clear = False):
+        """
+        Load your model(s) and whatever else you need in this function.
+        Do NOT load your model or resources in the guess_and_buzz() function,
+        as it will increase latency severely.
+        """
+        self.categories = ['Geography', 'Religion', 'Philosophy', 'Trash','Mythology', 'Literature','Science', 'Social Science', 'History', 'Current Events', 'Fine Arts', 'ALL']
+        self.tfidf_models = [None for _ in range(len(self.categories))]
+        self.qc_model = qc.TextClassificationModel.load_model("models/categorizer")
+        self.load_tfidf_models(clear=clear)
+    def guess_and_buzz(self, question_text: List[str]) -> List[Tuple[str, bool]]:
+        """
+        This function accepts a list of question strings, and returns a list of tuples containing
+        strings representing the guess and corresponding booleans representing
+        whether or not to buzz.
+        So, guess_and_buzz(["This is a question"]) should return [("answer", False)]
+        If you are using a deep learning model, try to use batched prediction instead of
+        iterating using a for loop.
+        """
+        guesses = []
+        curr_question = ""
+        for question in question_text:
+            curr_question += question + "."
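+            confidence, answer = self.predict(curr_question)
+            # Buzz only when the squashed confidence clears 0.5
+            buzz = bool(confidence > 0.5)
+            # Return (guess, buzz) to match the order promised in the docstring
+            guesses.append((answer, buzz))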
+        return guesses
+    def load_tfidf_models(self, clear=False):
+        print("loading tfidf models")
+        # Create respective model if not exist
+        if not clear:
+            for category in range(len(self.categories)):
+                if self.tfidf_models[category] is None:
+                    self.tfidf_models[category] = NLPModel().load(f"models/{self.categories[category]}_tfidf.pkl")
+            self.tfidf_models[-1] = NLPModel().load(f"models/{'ALL'}_tfidf.pkl")
+        else:
+            for category in range(len(self.categories)):
+                if self.tfidf_models[category] is None:
+                    self.tfidf_models[category] = NLPModel()
+            print(self.tfidf_models)
+    def train(self, data):
+        # Create n empty lists, each index associated with the index of the category
+        training_data = [[] for _ in range(len(self.categories))]
+        with tqdm(total=len(data)) as pbar:
+            for data_point in data:
+                text = data_point["text"]
+                answer = data_point["answer"]
+                categories = data_point["category"]
+                for category in categories:
+                    category_ind = self.categories.index(category)
+                    training_data[category_ind].append({"text": text, "answer": answer})
+                pbar.update(1)
+        for ind,data in enumerate(training_data):
+            self.tfidf_models[ind].process_data(data)
+            # Train model
+            self.tfidf_models[ind].train_model()
+            # Save model
+            self.tfidf_models[ind].save(f"models/{self.categories[ind]}_tfidf.pkl")
+            self.tfidf_models[ind] = None
+            training_data[ind] = []
+            #Update progress bar
+            #pbar.update(1)
+        print("TRAINING DATA")
+        '''with tqdm(total=len(self.categories)) as pbar:
+            for category in range(len(self.categories)):
+                # Train model
+                self.tfidf_models[category].train_model()
+                # Save model
+                self.tfidf_models[category].save(f"models/{self.categories[category]}_tfidf.pkl")
+                # Unload model
+                #print(f'category {self.categories[category]} gets unloaded')
+                self.tfidf_models[category] = None
+                training_data[category] = None
+                pbar.update(1)'''
+        print("Training complete.")
+    def predict(self, input_data, confidence_threshold=1.5):
+        # Get category confidence scores from qc_model
+        category_confidences = self.qc_model.predict(input_data)
+        #print("Category confidences:", category_confidences)
+        # Find the indices of categories with confidence scores above the threshold
+        confident_indices = (category_confidences > confidence_threshold).nonzero()[:,1]
+        #print(confident_indices)
+        max_confidence = 0
+        max_answer = None
+        max_category = 0
+        for category in confident_indices:
+            #print(category)
+            confidence,answer = self.tfidf_models[category].predict(input_data)
+            if(confidence > max_confidence):
+                max_confidence = confidence
+                max_answer = answer
+                max_category = category
+        #max_confidence, max_answer = selected_model.predict(input_data)
+        #print("Prediction for category", self.categories[category], ":", max_answer, "with confidence", max_confidence)
+        return (np.tanh(max_confidence), max_answer)
+    def evaluate(self, input_data):
+        correct = 0
+        count = 0
+        with tqdm(total=len(input_data)) as pbar:
+          for data_point in input_data:
+              print(count % 10)
+              count += 1
+              text = data_point["text"]
+              answer = data_point["answer"]
+              answer_predict = self.predict(text)[1]
+              if(answer == answer_predict):
+                  correct += 1
+                  print(correct)
+              if(count % 10 == 0):
+                  average = float(correct)/count
+                  print(f'rolling average: {average}')
+              pbar.update(1)
+          accuracy = correct/len(input_data)
+          return accuracy
+if __name__ == "__main__":
+    # Train a simple model on QB data, save it to a file
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--data', type=str)
+    parser.add_argument('--model', type=str)
+    parser.add_argument('--predict', type=str)
+    parser.add_argument('--clear', action='store_const', const=True, default=False)
+    parser.add_argument('--evaluate', type=str)
+    flags = parser.parse_args()
+    model = None
+    print(flags.clear)
+    if flags.clear:
+        model = QuizBowlModel(clear=True)
+    else:
+        model = QuizBowlModel()
+    if flags.data:
+        # flags.data is a single path: load it once and train once
+        with open(flags.data, 'r') as data_file:
+            data_json = json.load(data_file)
+        model.train(data_json)
+            #print(model.predict("My name is bobby, bobby newport. your name is jeff?"))
+            #model.save("model.pkl")
+    if flags.model:
+        model.load(flags.model)
+    if flags.predict:
+        print(model.predict(flags.predict))
+    if flags.evaluate:
+        with open(flags.evaluate, 'r') as data_file:
+          data_json = json.load(data_file)
+          print(f'accuracy: {model.evaluate(data_json)}')
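
How `predict` and `guess_and_buzz` fit together can be sketched without the trained models. The stand-in predictors below are hypothetical substitutes for the per-category TF-IDF models, and `route_and_predict` is an illustrative name. Unlike `QuizBowlModel.guess_and_buzz`, this sketch treats each input string independently rather than accumulating text.

```python
import math
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# A predictor maps question text to (raw confidence, answer).
Predictor = Callable[[str], Tuple[float, str]]


def route_and_predict(text: str, models: Dict[str, Predictor],
                      active: Iterable[str]) -> Tuple[float, Optional[str]]:
    """Ask each active category model and keep the most confident answer."""
    best_conf, best_answer = 0.0, None
    for name in active:
        conf, answer = models[name](text)
        if conf > best_conf:
            best_conf, best_answer = conf, answer
    # tanh squashes the non-negative raw score into [0, 1).
    return math.tanh(best_conf), best_answer


def guess_and_buzz(questions: List[str], models: Dict[str, Predictor],
                   active: Iterable[str],
                   buzz_threshold: float = 0.5) -> List[Tuple[Optional[str], bool]]:
    """Return (guess, buzz) pairs, the order the docstring above asks for."""
    active = list(active)  # allow reuse across questions
    results = []
    for text in questions:
        conf, answer = route_and_predict(text, models, active)
        results.append((answer, conf > buzz_threshold))
    return results


if __name__ == "__main__":
    # Hypothetical stand-in predictors with fixed outputs.
    models = {
        "Science": lambda text: (2.0, "Albert_Einstein"),
        "History": lambda text: (0.3, "Napoleon"),
    }
    print(guess_and_buzz(["name this German physicist"], models,
                         ["Science", "History"]))
```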

question_categorizer.py ADDED

	@@ -0,0 +1,272 @@

+import time
+from torch.utils.data.dataset import random_split
+from torchtext.data.functional import to_map_style_dataset
+import torch
+import gzip
+import json
+import numpy as np
+import nltk
+from nltk.corpus import stopwords
+from nltk.tokenize import word_tokenize, sent_tokenize
+from nltk.stem import PorterStemmer, WordNetLemmatizer
+from torchtext.data.utils import get_tokenizer
+from torchtext.vocab import build_vocab_from_iterator
+from torch.utils.data import DataLoader
+import argparse
+from torch import nn
+import json
+nltk.download('punkt')
+nltk.download('stopwords')
+nltk.download('averaged_perceptron_tagger')
+nltk.download('maxent_ne_chunker')
+nltk.download('words')
+nltk.download('wordnet')
+class TextClassificationModel(nn.Module):
+    def __init__(self, vocab_size, embed_dim, num_class, vocab):
+        self.model = super(TextClassificationModel, self).__init__()
+        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
+        self.fc = nn.Linear(embed_dim, num_class)
+        self.init_weights()
+        self.vocab_size = vocab_size
+        self.emsize = embed_dim
+        self.num_class = num_class
+        self.vocab = vocab
+        self.text_pipeline = self.tokenizer
+        self.tokenizer_convert = get_tokenizer("basic_english")
+    def tokenizer(self, text):
+        return self.vocab(self.tokenizer_convert(text))
+    def init_weights(self):
+        initrange = 0.5
+        self.embedding.weight.data.uniform_(-initrange, initrange)
+        self.fc.weight.data.uniform_(-initrange, initrange)
+        self.fc.bias.data.zero_()
+    def forward(self, text, offsets):
+        embedded = self.embedding(text, offsets)
+        return self.fc(embedded)
+    def train_model(self, train_dataloader, valid_dataloader):
+        total_accu = None
+        for epoch in range(1, EPOCHS + 1):
+          epoch_start_time = time.time()
+          self.train()
+          total_acc, total_count = 0, 0
+          log_interval = 500
+          start_time = time.time()
+          for idx, (label, text, offsets) in enumerate(train_dataloader):
+              optimizer.zero_grad()
+              predicted_label = self(text, offsets)
+              loss = criterion(predicted_label, label)
+              loss.backward()
+              torch.nn.utils.clip_grad_norm_(self.parameters(), 0.1)
+              optimizer.step()
+              total_acc += (predicted_label.argmax(1) == label).sum().item()
+              total_count += label.size(0)
+              if idx % log_interval == 0 and idx > 0:
+                  elapsed = time.time() - start_time
+                  print(
+                      "| epoch {:3d} | {:5d}/{:5d} batches "
+                      "| accuracy {:8.3f}".format(
+                          epoch, idx, len(train_dataloader), total_acc / total_count
+                      )
+                  )
+                  total_acc, total_count = 0, 0
+                  start_time = time.time()
+          accu_val = self.evaluate(valid_dataloader)
+          if total_accu is not None and total_accu > accu_val:
+              scheduler.step()
+          else:
+              total_accu = accu_val
+          print("-" * 59)
+          print(
+              "| end of epoch {:3d} | time: {:5.2f}s | "
+              "valid accuracy {:8.3f} ".format(
+                  epoch, time.time() - epoch_start_time, accu_val
+              )
+          )
+          print("-" * 59)
+    #TODO: FIX THE LOADING MODEL
+    def save_model(self, file_path):
+        model_state = {
+            'state_dict': self.state_dict(),
+            'vocab_size': self.vocab_size,
+            'embed_dim': self.emsize,
+            'num_class': self.num_class,
+            'vocab': self.vocab
+        }
+        torch.save(model_state, file_path)
+        print("Model saved successfully.")
+    @classmethod
+    def load_model(cls, file_path):
+        # Rebuild the model from the checkpoint written by save_model
+        model_state = torch.load(file_path, map_location=torch.device('cpu'))
+        vocab_size = model_state['vocab_size']
+        embed_dim = model_state['embed_dim']
+        num_class = model_state['num_class']
+        vocab = model_state['vocab']
+        model = cls(vocab_size, embed_dim, num_class, vocab)
+        model.load_state_dict(model_state['state_dict'])
+        model.eval()
+        print("Model loaded successfully.")
+        return model
+    def evaluate(self, dataloader):
+      self.eval()
+      total_acc, total_count = 0, 0
+      with torch.no_grad():
+          for label, text, offsets in dataloader:
+              predicted_label = self(text, offsets)
+              total_acc += (predicted_label.argmax(1) == label).sum().item()
+              total_count += label.size(0)
+      return total_acc / total_count
+    def predict(self, text):
+      with torch.no_grad():
+          # Single example: all token ids form one bag starting at offset 0
+          text = torch.tensor(self.text_pipeline(text), dtype=torch.int64)
+          output = self(text, torch.tensor([0]))
+          return output
+    @staticmethod
+    def read_gz_json(file_path):
+        with gzip.open(file_path, 'rt', encoding='utf-8') as f:
+            data = json.load(f)
+            for obj in data:
+                yield obj['text'], obj['category']
+    @staticmethod
+    def preprocess_text(text):
+        sentences = sent_tokenize(text)
+        return sentences
+    @staticmethod
+    def data_iter(file_paths, categories):
+        categories = np.array(categories)
+        for path in file_paths:
+            for text, category in TextClassificationModel.read_gz_json(path):
+                sentences = TextClassificationModel.preprocess_text(text)
+                for sentence in sentences:
+                    yield np.where(categories == category)[0][0], sentence
+    @staticmethod
+    def collate_batch(batch):
+        label_list, text_list, offsets = [], [], [0]
+        for _label, _text in batch:
+            label_list.append(label_pipeline(_label))
+            processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
+            text_list.append(processed_text)
+            offsets.append(processed_text.size(0))
+        label_list = torch.tensor(label_list, dtype=torch.int64)
+        offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
+        text_list = torch.cat(text_list)
+        return label_list.to(device), text_list.to(device), offsets.to(device)
+def parse_arguments():
+    parser = argparse.ArgumentParser(description="Text Classification Model")
+    parser.add_argument("--train_path", type=str, nargs='+', required=True, help="Path to the training data")
+    parser.add_argument("--test_path", type=str, nargs='+', required=True, help="Path to the test data")
+    parser.add_argument("--epochs", type=int, default=5, help="Number of epochs for training")
+    parser.add_argument("--lr", type=float, default=3, help="Learning rate")
+    parser.add_argument("--batch_size", type=int, default=64, help="Batch size for training")
+    return parser.parse_args()
+if __name__ == '__main__':
+    args = parse_arguments()
+    categories = ['Geography', 'Religion', 'Philosophy', 'Trash', 'Mythology', 'Literature', 'Science', 'Social Science', 'History', 'Current Events', 'Fine Arts']
+    test_path = args.test_path
+    train_path = args.train_path
+    tokenizer = get_tokenizer("basic_english")
+    train_iter = iter(TextClassificationModel.data_iter(train_path, categories))
+    def yield_tokens(data_iter):
+        for _, text in data_iter:
+            yield tokenizer(text)
+    vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
+    vocab.set_default_index(vocab["<unk>"])
+    text_pipeline = lambda x: vocab(tokenizer(x))
+    label_pipeline = lambda x: int(x)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    dataloader = DataLoader(
+        train_iter, batch_size=8, shuffle=False, collate_fn=TextClassificationModel.collate_batch
+    )
+    # Labels are indices into `categories`, so size the output layer from that list
+    num_class = len(categories)
+    print(num_class)
+    vocab_size = len(vocab)
+    emsize = 64
+    model = TextClassificationModel(vocab_size, emsize, num_class, vocab).to(device)
+    print(model)
+    # Hyperparameters
+    EPOCHS = args.epochs  # epoch
+    LR = args.lr  # learning rate
+    BATCH_SIZE = args.batch_size  # batch size for training
+    criterion = torch.nn.CrossEntropyLoss()
+    optimizer = torch.optim.SGD(model.parameters(), lr=LR)
+    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
+    total_accu = None
+    train_iter = iter(TextClassificationModel.data_iter(train_path, categories))
+    test_iter = iter(TextClassificationModel.data_iter(test_path, categories))
+    train_dataset = to_map_style_dataset(train_iter)
+    test_dataset = to_map_style_dataset(test_iter)
+    num_train = int(len(train_dataset) * 0.95)
+    split_train_, split_valid_ = random_split(
+        train_dataset, [num_train, len(train_dataset) - num_train]
+    )
+    train_dataloader = DataLoader(
+        split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=TextClassificationModel.collate_batch
+    )
+    valid_dataloader = DataLoader(
+        split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=TextClassificationModel.collate_batch
+    )
+    test_dataloader = DataLoader(
+        test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=TextClassificationModel.collate_batch
+    )
+    model.train_model(train_dataloader,valid_dataloader)
+    print("Checking the results of test dataset.")
+    accu_test = model.evaluate(test_dataloader)
+    print("test accuracy {:8.3f}".format(accu_test))
+    model.save_model("text_classification_model.pth")
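
For reference, `collate_batch` in qbmodel.py packs a batch into one flat sequence of token ids plus the start offset of each example, which is the layout `nn.EmbeddingBag` consumes in `forward`. Below is a minimal pure-Python sketch of that offset arithmetic. The name `flatten_with_offsets` is hypothetical (it is not part of this commit), and plain lists stand in for tensors.

```python
from itertools import accumulate

def flatten_with_offsets(token_id_lists):
    """Concatenate per-example token ids and record where each example starts."""
    lengths = [len(ids) for ids in token_id_lists]
    # Mirror collate_batch: prepend 0, drop the last length, then take a running sum
    offsets = list(accumulate(([0] + lengths)[:-1]))
    flat = [tok for ids in token_id_lists for tok in ids]
    return flat, offsets

flat, offsets = flatten_with_offsets([[4, 7], [1], [9, 9, 2]])
print(flat)     # [4, 7, 1, 9, 9, 2]
print(offsets)  # [0, 2, 3]
```

Each offset marks the index in the flat list where one example begins, so the embedding layer can pool the tokens of each example separately without padding.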

requirements.txt ADDED Viewed

	@@ -0,0 +1,225 @@

+accelerate==0.29.3
+alabaster==0.7.8
+altair==5.3.0
+annotated-types==0.6.0
+anyascii==0.3.2
+anyio==4.2.0
+asttokens==2.4.1
+attrs==19.3.0
+Automat==0.8.0
+Babel==2.6.0
+backcall==0.2.0
+beautifulsoup4==4.12.3
+benepar==0.2.0
+blinker==1.7.0
+blis==0.7.11
+bs4==0.0.2
+cachetools==5.3.3
+catalogue==2.0.10
+certifi==2019.11.28
+chardet==3.0.4
+charset-normalizer==3.3.2
+click==8.1.7
+cloud-init==23.4.4
+cloudpathlib==0.16.0
+colorama==0.4.3
+command-not-found==0.3
+confection==0.1.4
+configobj==5.0.6
+constantly==15.1.0
+constituent-treelib==0.0.7
+contractions==0.1.73
+cryptography==2.8
+cymem==2.0.8
+dbus-python==1.2.16
+decorator==5.1.1
+distro==1.9.0
+distro-info==0.23+ubuntu1.1
+docopt==0.6.2
+docutils==0.16
+entrypoints==0.3
+exceptiongroup==1.2.0
+executing==2.0.1
+falcon==3.1.3
+feedparser==5.2.1
+filelock==3.13.1
+flask==3.0.2
+fsspec==2023.12.2
+gensim==4.3.2
+gevent==23.9.1
+gitdb==4.0.11
+GitPython==3.1.43
+gitsome==0.8.0
+greenlet==3.0.3
+h11==0.14.0
+httpcore==1.0.2
+httplib2==0.14.0
+httpx==0.26.0
+huggingface-hub==0.22.2
+huspacy==0.11.0
+hyperlink==19.0.0
+idna==2.8
+ijson==3.2.3
+imagesize==1.2.0
+importlib-metadata==7.1.0
+importlib-resources==6.4.0
+incremental==16.10.1
+iniconfig==2.0.0
+install==1.3.5
+ipython==8.12.3
+itsdangerous==2.1.2
+jedi==0.19.1
+Jinja2==3.1.3
+joblib==1.3.2
+jsonpatch==1.22
+jsonpointer==2.0
+jsonschema==3.2.0
+keyring==18.0.1
+langcodes==3.3.0
+langid==1.1.6
+language-selector==0.1
+launchpadlib==1.10.13
+lazr.restfulclient==0.14.2
+lazr.uri==1.0.3
+lxml==5.1.0
+Mako==1.3.1
+markdown-it-py==3.0.0
+MarkupSafe==2.1.5
+matplotlib-inline==0.1.7
+mdurl==0.1.2
+more-itertools==4.2.0
+mpmath==1.3.0
+murmurhash==1.0.10
+netifaces==0.10.4
+networkx==3.1
+nltk==3.8.1
+numpy==1.24.4
+numpydoc==0.7.0
+nvidia-cublas-cu12==12.1.3.1
+nvidia-cuda-cupti-cu12==12.1.105
+nvidia-cuda-nvrtc-cu12==12.1.105
+nvidia-cuda-runtime-cu12==12.1.105
+nvidia-cudnn-cu12==8.9.2.26
+nvidia-cufft-cu12==11.0.2.54
+nvidia-curand-cu12==10.3.2.106
+nvidia-cusolver-cu12==11.4.5.107
+nvidia-cusparse-cu12==12.1.0.106
+nvidia-nccl-cu12==2.19.3
+nvidia-nvjitlink-cu12==12.3.101
+nvidia-nvtx-cu12==12.1.105
+oauthlib==3.1.0
+olefile==0.46
+openai==1.9.0
+packaging==20.3
+pandas==2.0.0
+parso==0.8.4
+pdfkit==1.0.0
+pexpect==4.6.0
+pickleshare==0.7.5
+pillow==10.3.0
+pluggy==1.5.0
+ply==3.11
+portalocker==2.8.2
+preshed==3.0.9
+prompt-toolkit==3.0.43
+protobuf==3.20.3
+psutil==5.9.8
+psycopg2-binary==2.9.6
+pure-eval==0.2.2
+pyahocorasick==2.1.0
+pyarrow==16.0.0
+pyasn1==0.4.2
+pyasn1-modules==0.2.1
+pydantic==2.5.3
+pydantic-core==2.14.6
+pydeck==0.9.0b1
+pygments==2.17.2
+PyGObject==3.36.0
+PyHamcrest==1.9.0
+pyinotify==0.9.6
+PyJWT==1.7.1
+pymacaroons==0.13.0
+PyNaCl==1.3.0
+pyOpenSSL==19.0.0
+pyparsing==2.4.6
+pyrsistent==0.15.5
+pyserial==3.4
+pytest==8.2.0
+python-apt==2.0.1+ubuntu0.20.4.1
+python-baseconv==1.2.2
+python-dateutil==2.8.2
+python-debian==0.1.36+ubuntu1.1
+python-dotenv==0.10.5
+pytz==2023.3
+PyYAML==5.3.1
+regex==2023.12.25
+requests==2.31.0
+requests-unixsocket==0.2.0
+rich==13.7.1
+roman==2.0.0
+safetensors==0.4.3
+scikit-learn==1.3.2
+scipy==1.10.1
+SecretStorage==2.3.1
+sentencepiece==0.2.0
+service-identity==18.1.0
+setproctitle==1.1.10
+simplejson==3.16.0
+six==1.14.0
+smart-open==6.4.0
+smmap==5.0.1
+sniffio==1.3.0
+sos==4.5.6
+soupsieve==2.5
+spacy==3.7.2
+spacy-legacy==3.0.12
+spacy-loggers==1.0.5
+Sphinx==1.8.5
+srsly==2.4.8
+ssh-import-id==5.10
+stack-data==0.6.3
+streamlit==1.33.0
+svgling==0.4.0
+svgwrite==1.4.3
+sympy==1.12
+systemd-python==234
+tenacity==8.2.3
+textblob==0.18.0.post0
+textsearch==0.0.24
+thinc==8.2.2
+threadpoolctl==3.2.0
+tokenizers==0.19.1
+toml==0.10.2
+tomli==2.0.1
+toolz==0.12.1
+torch==2.2.2
+torch-struct==0.5
+torchtext==0.17.2
+torchvision==0.16.2
+tornado==6.4
+tqdm==4.66.1
+traitlets==5.14.3
+transformers==4.40.1
+triton==2.2.0
+Twisted==18.9.0
+typer==0.9.0
+typing-extensions==4.9.0
+tzdata==2023.3
+ubuntu-pro-client==8001
+ufw==0.36
+unattended-upgrades==0.1
+Unidecode==1.3.8
+urllib3==1.26.9
+wadllib==1.3.3
+Wand==0.6.13
+wasabi==1.1.2
+watchdog==4.0.0
+wcwidth==0.1.8
+weasel==0.3.4
+werkzeug==3.0.2
+xonsh==0.9.13
+zimply==1.1.4
+zipp==3.18.1
+zope.event==5.0
+zope.interface==4.7.1
+zstandard==0.22.0

tfidf_model.py ADDED Viewed

	@@ -0,0 +1,275 @@

+import json
+import nltk
+from nltk.corpus import stopwords
+from nltk.tokenize import word_tokenize, sent_tokenize
+from nltk.stem import PorterStemmer, WordNetLemmatizer
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import numpy as np
+import os
+import math
+import pickle
+import joblib
+import multiprocessing
+from concurrent.futures import ProcessPoolExecutor
+from tqdm import tqdm  # Import tqdm for progress tracking
+from collections import defaultdict
+nltk.download('punkt')
+nltk.download('stopwords')
+nltk.download('averaged_perceptron_tagger')
+nltk.download('maxent_ne_chunker')
+nltk.download('words')
+nltk.download('wordnet')
+nltk.download('omw-1.4')
+# Helper function to map NLTK POS tags to WordNet POS tags
+def get_wordnet_pos(treebank_tag):
+    if treebank_tag.startswith('J'):
+        return nltk.corpus.wordnet.ADJ
+    elif treebank_tag.startswith('V'):
+        return nltk.corpus.wordnet.VERB
+    elif treebank_tag.startswith('N'):
+        return nltk.corpus.wordnet.NOUN
+    elif treebank_tag.startswith('R'):
+        return nltk.corpus.wordnet.ADV
+    else:
+        return nltk.corpus.wordnet.NOUN
+class NLPModel:
+    def __init__(self):  # Initialize the model with necessary parameters
+        # Initialize model components (preprocessing, training, etc.)
+        #self.model
+        self.tfidf = TfidfVectorizer(tokenizer=self.tokenize, lowercase=False)
+        self.training_tfidf = None
+        #self.manager = multiprocessing.Manager()
+        self.flattened_sentences = []
+        self.training_tagged = []
+        self.answers = []
+    def tokenize(self, text):
+        # Documents reach the vectorizer as lists of tokens already
+        # (see preprocess_text), so pass them through unchanged
+        return text
+    def preprocess_text(self, text):
+        # Split into sentences; each sentence becomes one list of processed tokens
+        sentences = sent_tokenize(text)
+        preprocessed_sentences = []
+        batch_size = 50  # Adjust the batch size based on your system's capabilities
+        # Build these once instead of once per batch
+        stop_words = set(stopwords.words('english'))
+        stemmer = PorterStemmer()
+        lemmatizer = WordNetLemmatizer()
+        for i in range(0, len(sentences), batch_size):
+            batch_sentences = sentences[i:i + batch_size]
+            batch_words = [word_tokenize(sentence) for sentence in batch_sentences]
+            # Filtering Stop Words
+            filtered_words = [[word for word in words if word.lower() not in stop_words] for words in batch_words]
+            # Stemming
+            stemmed_words = [[stemmer.stem(word) for word in words] for words in filtered_words]
+            # Tagging Parts of Speech
+            pos_tags = [nltk.pos_tag(words) for words in stemmed_words]
+            # Lemmatizing
+            lemmatized_words = [[lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in pos] for pos in pos_tags]
+            preprocessed_sentences.extend(lemmatized_words)
+        return preprocessed_sentences
+    def process_data(self, data_json):
+        #print("Processing data in parallel...")
+        batch_size = 10000  # Experiment with different batch sizes
+        num_processes = max(1, multiprocessing.cpu_count() // 2)  # Use half the cores, but at least one
+        batches = [data_json[i:i + batch_size] for i in range(0, len(data_json), batch_size)]
+        #print('batches')
+        #training_tagged = []  # Initialize or clear self.training_tagged
+        sentence_answers = []
+        with ProcessPoolExecutor(max_workers=num_processes) as executor:
+            results = list(tqdm(executor.map(self.process_data_batch, batches), total=len(batches)))
+        #with multiprocessing.Pool() as pool:
+        #results = []
+        #for batch in batches:
+        #results.append(self.process_data_batch(batch))
+        for batch_result in results:
+            for result in batch_result:
+                sentence_answers.extend(result)
+            #print("here")
+        # Create a dictionary to group sentences by answer
+        answer_groups = defaultdict(list)
+        # Iterate through each (sentence, answer) pair in batch_results
+        for sentence, answer in sentence_answers:
+            answer_groups[answer].extend(sentence)
+        #print(list(answer_groups.items())[0])
+        # Create a new list with sentences grouped by answer
+        sentence_answers.extend([(sentence,answer) for answer, sentence in answer_groups.items()])
+        self.flattened_sentences.extend([x[0] for x in sentence_answers])
+        self.training_tagged.extend([x[1] for x in sentence_answers])
+        #print("Data processing complete.")
+    def process_data_batch(self, batch):
+        batch_results = []
+        for data in batch:
+            text = data["text"]
+            answer = data["answer"]
+            preprocessed_sentences = self.preprocess_text(text)
+            training_tagged = [(sentence, answer) for sentence in preprocessed_sentences]
+            #print(training_tagged)
+            batch_results.append(training_tagged)
+        #create another list where instead, the "sentence" of elements with the same answer are appended with each other
+        return batch_results
+    def train_model(self):
+        # Fit and transform the TF-IDF vectorizer
+        #print(self.flattened_sentences)
+        if(self.flattened_sentences):
+            self.training_tfidf = self.tfidf.fit_transform(self.flattened_sentences)
+            self.flattened_sentences = []
+            #self.
+        #print(self.training_tfidf)
+        #print(self.training_tagged)
+    def save(self, file_path):
+        model_data = {
+            'training_tagged': list(self.training_tagged),
+            'tfidf': self.tfidf,
+            'training_tfidf': self.training_tfidf
+        }
+        #print(model_data)
+        with open(file_path, 'wb') as f:
+            joblib.dump(model_data, f)
+    def load(self, file_path):
+        # Restore the state written by save(); a missing file leaves the model unchanged
+        if os.path.exists(file_path):
+            with open(file_path, 'rb') as f:
+                model_data = joblib.load(f)
+            self.training_tagged = list(model_data['training_tagged'])
+            self.tfidf = model_data['tfidf']
+            self.training_tfidf = model_data['training_tfidf']
+        return self
+    def predict(self, input_data):
+        # Preprocess input data
+        new_text_processed = self.preprocess_text(input_data)
+        new_text_processed_tfidf = self.tfidf.transform(new_text_processed)
+        training_tfidf = self.training_tfidf
+        # Calculate sentence similarities
+        sentence_similarities = cosine_similarity(new_text_processed_tfidf, training_tfidf)
+        # Initialize data structures
+        similarities_max = {}
+        answers = []
+        # Iterate over sentence similarities
+        for similarity_row in sentence_similarities:
+            for answer, similarity in zip(self.training_tagged, similarity_row):
+                if isinstance(answer, list):
+                    continue
+                # Update similarities_max only when the new similarity is greater
+                if answer not in similarities_max or similarity > similarities_max[answer]:
+                    similarities_max[answer] = similarity
+            if not answers:
+                answers.extend(similarities_max.keys())
+        # Take the best similarity recorded for each answer and return the top-scoring one
+        total_similarities = np.array([similarities_max[answer] for answer in answers])
+        closest_index = np.argmax(total_similarities)
+        closest_answer = answers[closest_index]
+        return total_similarities[closest_index], closest_answer
+    def evaluate(self, test_data, labels):
+        # Evaluate the performance of the model on test data
+        # Return evaluation metrics
+        pass
+    # Additional functions for model tuning, hyperparameter optimization, etc.
+if __name__ == "__main__":
+    # Train a simple model on QB data, save it to a file
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--data', type=str)
+    parser.add_argument('--model', type=str)
+    parser.add_argument('--predict', type=str)
+    flags = parser.parse_args()
+    model = NLPModel()
+    if flags.data:
+        with open(flags.data, 'r') as data_file:
+            data_json = json.load(data_file)
+            model.process_data(data_json)
+            model.train_model()
+            print(model.predict("My name is bobby, bobby newport. your name is jeff?"))
+            model.save("model.pkl")
+    if flags.model:
+        model.load(flags.model)
+    if flags.predict:
+        print(model.predict(flags.predict))
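
The aggregation in `NLPModel.predict` keeps, for each answer, the highest cosine similarity seen across all query sentences and training rows, then returns the best-scoring answer. Below is a minimal standalone sketch of that step, assuming the similarity matrix has already been computed. The name `best_answer` is hypothetical (it is not part of this commit), and plain lists stand in for the NumPy array.

```python
def best_answer(similarity_rows, row_answers):
    """Return (score, answer) for the answer with the highest similarity.

    similarity_rows: one list of scores per query sentence, aligned with
    row_answers, which holds the answer label of each training row.
    """
    best = {}
    for row in similarity_rows:
        for answer, score in zip(row_answers, row):
            # Keep only the largest score observed for each answer
            if answer not in best or score > best[answer]:
                best[answer] = score
    answer = max(best, key=best.get)
    return best[answer], answer

rows = [[0.1, 0.8, 0.3], [0.5, 0.2, 0.9]]
labels = ["Einstein", "Newton", "Einstein"]
print(best_answer(rows, labels))  # (0.9, 'Einstein')
```

Taking the maximum rather than the sum means a single strongly matching sentence decides the answer, regardless of how many training rows share that label.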