Update README.md
README.md
CHANGED
|
@@ -1,3 +1,112 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- rajpurkar/squad
|
| 5 |
+
- google-research-datasets/natural_questions
|
| 6 |
+
- hotpotqa/hotpot_qa
|
| 7 |
+
pipeline_tag: question-answering
|
| 8 |
+
---
|
| 9 |
+
# Applied Deep Learning
|
| 10 |
+
- Sanju Debnath
|
| 11 |
+
- Project Type: Bring your own method
|
| 12 |
+
|
| 13 |
+
## Structure
|
| 14 |
+
- `data/` contains the data used for the project (after running `load_data.py` and downloading the Natural Questions dataset)
|
| 15 |
+
- `distilbert.py` contains the code for the DistilBERT model and the Dataset. A function for testing the functionality is in there too.
|
| 16 |
+
- `distilbert.ipynb` contains the creation and training of the DistilBERT model
|
| 17 |
+
- `distilbert.model` is the trained DistilBERT model
|
| 18 |
+
- `distilbert_reuse.model` is the question answering model
|
| 19 |
+
- `load_data.py` contains the code for loading the data and preprocessing it. We also split it up into smaller files to load in the Dataset later on.
|
| 20 |
+
- `qa_model.py` contains the code for the different QA models. It also defines a separate Dataset class and a method for testing the models.
|
| 21 |
+
- `qa_model.ipynb` contains the creation and training of the QA models.
|
| 22 |
+
- `requirements.txt` contains the requirements for the project
|
| 23 |
+
- `utils.py` contains some helper functions for the project. It contains the functions to evaluate the models and a way to visualise the trained parameters for each model.
|
| 24 |
+
- `application.py` contains the Streamlit application that ties everything together
|
| 25 |
+
|
| 26 |
+
## How to run
|
| 27 |
+
- Install the requirements with `pip install -r requirements.txt`
|
| 28 |
+
- Run `load_data.py` to download the data and preprocess it (follow the documentation in the file regarding the natural questions dataset)
|
| 29 |
+
- Run `distilbert.ipynb` to train the DistilBERT model
|
| 30 |
+
- Run `qa_model.ipynb` to train the QA models
|
| 31 |
+
- Run `streamlit run application.py` to run the streamlit app
|
| 32 |
+
|
| 33 |
+
## Project
|
| 34 |
+
1. Create my own DistilBERT model using the OpenWebText dataset from Hugging Face (https://huggingface.co/datasets/openwebtext) - 20h (active work; training takes a lot longer)
|
| 35 |
+
- I initially wanted to use the Oscar dataset (https://huggingface.co/datasets/oscar) or the TriviaQA dataset (https://huggingface.co/datasets/mandarjoshi/trivia_qa), but they required too much storage
|
| 36 |
+
- I will train a MaskedLM model myself. However, my computational resources are limited, so if my model's performance is not sufficient, I will use the Hugging Face model (https://huggingface.co/distilbert-base-cased) instead
|
| 37 |
+
2. Current methods often fine-tune the whole model on a specific task. I believe that multi-task learning is extremely useful, so I want to fix (freeze) the DistilBERT weights and train only a head to do question answering - 30h
|
| 38 |
+
- Dataset: SQuAD (https://paperswithcode.com/dataset/squad), also Natural Questions (https://paperswithcode.com/dataset/natural-questions)
|
| 39 |
+
- The idea is to have one common corpus and specific heads, rather than a separate model for every single task
|
| 40 |
+
- In particular, I want to evaluate whether it is really necessary to fine-tune the base model too, as it already contains a model of the language. Ideally, task-specific heads could make up for the lack of fine-tuning of the base model.
|
| 41 |
+
- If the performance of the model is comparable, this could reduce training efforts and resources
|
| 42 |
+
- Either add another Bert Layer per task or just the multi-head self-attention layer (see next section)
|
| 43 |
+
3. Application - 10h
|
| 44 |
+
- A GUI that lets people enter a context (base text) and a question, and returns an answer.
|
| 45 |
+
- Will contain some SQuAD questions as examples.
|
| 46 |
+
4. Report - 2h
|
| 47 |
+
5. Presentation - 2h
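
Item 2 of the plan above fixes the base model's weights and trains only a task-specific head. The following is a minimal PyTorch sketch of that idea, not the code from `qa_model.py`. The class name `FrozenBaseQA`, the small stand-in encoder (used so the sketch runs without downloading DistilBERT), and the single linear layer as head are all assumptions made for illustration; the plan above considers a BERT layer or a multi-head self-attention layer as the head instead.

```python
import torch
from torch import nn


class FrozenBaseQA(nn.Module):
    """Hypothetical wrapper: a frozen encoder plus a trainable span head."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # base weights stay fixed
        # Assumed head: one start logit and one end logit per token.
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, embeddings: torch.Tensor):
        with torch.no_grad():  # no gradients flow into the base model
            hidden = self.encoder(embeddings)  # (batch, seq_len, hidden)
        start_logits, end_logits = self.head(hidden).unbind(dim=-1)
        return start_logits, end_logits


# Stand-in for the pretrained base model (assumption, not DistilBERT itself).
hidden_size = 32
encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
model = FrozenBaseQA(encoder, hidden_size)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# Dummy batch: 2 sequences of 10 token embeddings with gold span positions.
x = torch.randn(2, 10, hidden_size)
start_pos = torch.tensor([1, 4])
end_pos = torch.tensor([3, 6])

start_logits, end_logits = model(x)
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(start_logits, start_pos) + loss_fn(end_logits, end_pos)
loss.backward()
optimizer.step()
```

Because the encoder never receives gradients, each additional task would only add the parameters of its own head, which is the potential saving in training effort that the plan refers to.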
|
| 48 |
+
|
| 49 |
+
## Goal
|
| 50 |
+
Training the DistilBERT model was quite straightforward, as I mostly used what Hugging Face provides, so the only real challenge was downloading the dataset. Training is also very compute-intensive, and I did not have the resources to train the model to full convergence. The DistilBERT model can be found in `distilbert.ipynb` and is fully functional.
|
| 51 |
+
* Error Metric: I ended at a cross-entropy loss of about 0.2 on both the training and the test set. The default configuration works well, as the model did not overfit.
|
| 52 |
+
* DistilBERT is primarily trained for masked prediction, so I ran some manual sanity tests to see which words are predicted. The predictions usually make sense (although not always entirely), and the grammar is usually correct too.
|
| 53 |
+
* e.g. "It seems important to tackle the climate [MASK]." gave change (19%), crisis (12%), issues (5.8%), which are all appropriate in the context.
|
| 54 |
+
|
| 55 |
+
Now for the Question Answering model.
|
| 56 |
+
|
| 57 |
+
* Error Metric:
|
| 58 |
+
* We use the CrossEntropy loss to train the QA model
|
| 59 |
+
* For evaluation, we use the F-1 score and the Exact Match (EM). These are also the metrics used for the SQuAD competition (https://rajpurkar.github.io/SQuAD-explorer/).
|
| 60 |
+
* The definitions are retrieved from here (https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html#Metrics-for-QA).
|
| 61 |
+
* EM: 1 if the prediction exactly matches the original, 0 otherwise
|
| 62 |
+
* F-1: Computed over the individual words of the prediction against those of the ground-truth (GT) answer, based on the number of shared words. Precision is the ratio of shared words to the number of words in the prediction; recall is the ratio of shared words to the number of words in the GT. F-1 is the harmonic mean of precision and recall.
|
| 63 |
+
* Target for Error Metric:
|
| 64 |
+
* EM: 0.6
|
| 65 |
+
* F-1: 0.7
|
| 66 |
+
* Achieved value: I almost achieved the target for both metrics. I ultimately stopped because I had already spent a lot of time on the project and considered the results reasonable.
|
| 67 |
+
* EM: 0.52
|
| 68 |
+
* F-1: 0.67
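
The EM and F-1 definitions above can be sketched in plain Python as follows. This is an illustration, not the evaluation code in `utils.py`: the function names are made up for this example, and the normalization step (lowercasing, dropping punctuation and English articles) is an assumption modeled on common SQuAD-style evaluation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split into words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the normalized prediction equals the normalized answer, else 0."""
    return int(normalize(prediction) == normalize(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    """Word-level F-1 between the prediction and the ground-truth answer."""
    pred_tokens = normalize(prediction)
    gt_tokens = normalize(ground_truth)
    if not pred_tokens or not gt_tokens:
        # Two empty answers count as a match, a single empty one as a miss.
        return float(pred_tokens == gt_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it occurs in both answers.
    shared = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "in Paris France" against the answer "Paris" shares one word, so precision is 1/3, recall is 1, and F-1 is 0.5, while EM is 0.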
|
| 69 |
+
|
| 70 |
+
Amount of time for each task:
|
| 71 |
+
* DistilBERT model: ~20h (without training time). This was very similar to what I estimated, because I relied heavily on the Hugging Face library. Loading the data was easy and the data is already very clean.
|
| 72 |
+
* QA model: ~40h (without training time). This was a lot of effort, as my first approach didn't work and I had to build a basic proof-of-concept (POC) model first to arrive at the final architecture.
|
| 73 |
+
* Application: 2h. Streamlit was really easy to use and fairly straightforward.
|
| 74 |
+
|
| 75 |
+
## Data
|
| 76 |
+
- Aaron Gokaslan et al. OpenWebText Corpus. 2019. https://skylion007.github.io/OpenWebTextCorpus/: **OpenWebText**
|
| 77 |
+
- Open source replication of the WebText dataset from OpenAI.
|
| 78 |
+
- They scraped web pages, with a focus on quality. They looked at the Reddit up- and downvotes to determine the quality of the resource.
|
| 79 |
+
- The dataset will be used to train the DistilBERT model using language masking.
|
| 80 |
+
- Rajpurkar et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text. 2016. https://rajpurkar.github.io/SQuAD-explorer/: **SQuAD**
|
| 81 |
+
- Stanford Question Answering Dataset
|
| 82 |
+
- Collection of question-answer pairs, where the answer is a sequence of tokens in the given context text.
|
| 83 |
+
- Very diverse because it was created using crowdsourcing.
|
| 84 |
+
- Kwiatkowski et al. Natural Questions: a Benchmark for Question Answering Research. 2019. https://ai.google.com/research/NaturalQuestions/: **Natural Questions**
|
| 85 |
+
- Also a question-answer dataset, where each question is a Google query paired with a Wikipedia page that contains the answer.
|
| 86 |
+
- Very similar to the SQuAD dataset.
|
| 87 |
+
- Yang, Zhilin et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. 2018. https://hotpotqa.github.io/: **HotpotQA**
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
## Related Papers
|
| 91 |
+
- Sanh, Victor et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108. 2019.: https://arxiv.org/abs/1910.01108v4
|
| 92 |
+
- The choice of DistilBERT, as opposed to BERT, RoBERTa or XLNet is primarily based on the size of the network and training time
|
| 93 |
+
- I hope that the slight performance degradation will be compensated for by the fine-tuned head
|
| 94 |
+
- Ł. Maziarka and T. Danel. Multitask Learning Using BERT with Task-Embedded Attention. 2021 International Joint Conference on Neural Networks (IJCNN). 2021, pp. 1-6: https://ieeexplore.ieee.org/document/9533990
|
| 95 |
+
- In the paper they add task-specific parameters to the original model, hence, they change the baseline BERT
|
| 96 |
+
- "One possible solution is to add the task-specific, randomly initialized BERT_LAYERS at the top of the model."
|
| 97 |
+
- This is an interesting approach
|
| 98 |
+
- However, it increases the parameters drastically
|
| 99 |
+
- "We could prune the number of parameters in this setting, by adding only the multi-head self-attention layer,
|
| 100 |
+
without the position-wise feed-forward network."
|
| 101 |
+
- This would also be an interesting approach to investigate
|
| 102 |
+
- Jia, Qinjin et al. ALL-IN-ONE: Multi-Task Learning BERT models for Evaluating Peer Assessments. ArXiv abs/2110.03895. 2021.: https://arxiv.org/abs/2110.03895
|
| 103 |
+
- The authors compared single-task fine-tuned models (BERT and DistilBERT) with multitask models
|
| 104 |
+
- They added one Dense layer on top of the base model for single-task, and three Dense layers for multitask
|
| 105 |
+
- They did not fix the base model's weights, though. Instead, they fine-tuned it on multiple tasks, adding up the cross-entropy of each task to create the loss function
|
| 106 |
+
- El Mekki et al. BERT-based Multi-Task Model for Country and Province Level MSA and Dialectal Arabic Identification. WANLP. 2021.: https://aclanthology.org/2021.wanlp-1.31/
|
| 107 |
+
- The authors use a BERT model (MARBERT), followed by task-specific attention layers and classifiers
|
| 108 |
+
- They do not fix the weights of the BERT model either
|
| 109 |
+
- Jia et al. Large-scale Transfer Learning for Low-resource Spoken Language Understanding. ArXiv abs/2008.05671. 2020.: https://arxiv.org/abs/2008.05671
|
| 110 |
+
- This paper deals with Spoken Language Understanding (SLU)
|
| 111 |
+
- The authors test two architectures: one where they fine-tune the BERT model, and one where they fix its weights and add a task-specific head on top
|
| 112 |
+
- They conclude: "Results in Table 4 indicate that both strategies have abilities of improving the performance of SLU model."
|