{
"mcqs": {
"1": {
"câu hỏi": "According to the text above, what is the main difference between the Transformer architecture and earlier models?",
"lựa chọn": {
"a": "It uses recurrent networks to model long-range dependencies",
"b": "It relies entirely on attention mechanisms, with no recurrent components",
"c": "It applies convolutional layers to compute hidden representations",
"d": "It uses only point-wise feed-forward layers, with no attention"
},
"đáp án": "It relies entirely on attention mechanisms, with no recurrent components"
},
"2": {
"câu hỏi": "During training of the models described, which optimization algorithm was used?",
"lựa chọn": {
"a": "Adam optimizer",
"b": "Stochastic Gradient Descent (SGD)",
"c": "RMSprop",
"d": "Adagrad"
},
"đáp án": "Adam optimizer"
},
"3": {
"câu hỏi": "According to Table 3, what BLEU score does the base Transformer model achieve on the English-to-German development set (newstest2013)?",
"lựa chọn": {
"a": "25.8",
"b": "24.9",
"c": "26.4",
"d": "23.7"
},
"đáp án": "25.8"
},
"4": {
"câu hỏi": "According to the document, how many warmup steps were used during training?",
"lựa chọn": {
"a": "2000",
"b": "4000",
"c": "8000",
"d": "10000"
},
"đáp án": "4000"
},
"5": {
"câu hỏi": "According to the text, what BLEU score does the Transformer (big) model achieve on the WMT 2014 English-to-German translation task?",
"lựa chọn": {
"a": "28.4",
"b": "30.0",
"c": "26.5",
"d": "27.0"
},
"đáp án": "28.4"
}
},
"validation": {
"1": {
"supported_by_embeddings": true,
"max_similarity": 0.7559994459152222,
"evidence": [
{
"idx": 9,
"page": 2,
"score": 0.7559994459152222,
"text": "To the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention and discuss its advantages over models such as [17, 18] and [9]. **3** **Model Architecture**\n\n\nMost competitive neural sequence transduction models have an encoder-decoder structure [ 5, 2, 35 ]. Here, the encoder maps an input sequence of symbol representations ( _x_ 1 _, ..., x_ _n_ ) to a sequence\nof continuous representations **z** = ( _z_ 1 _, ..., z_ _n_ ) . Given **z**, the decoder then generates an output\nsequence ( _y_ 1 _, ..., y_ _m_ ) of symbols one element at a time. At each step the model is auto-regressive\n\n[10], consuming the previously generated symbols as additional input when generating the next. 2"
},
{
"idx": 5,
"page": 3,
"score": 0.6882933974266052,
"text": "Figure 1: The Transformer - model architecture.\n\n\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n\n\n**3.1** **Encoder and Decoder Stacks**\n\nthe two sub-layers, followed by layer normalization [ 1 ]. That is, the output of each sub-layer is\nLayerNorm( _x_ + Sublayer( _x_ )), where Sublayer( _x_ ) is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension _d_ model = 512.\n\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position _i_ can depend only on the known outputs at positions less than _i_ .\n\n\n**3.2** **Attention**\n\n\n3"
},
{
"idx": 11,
"page": 10,
"score": 0.6689025163650513,
"text": "We\nplan to extend the Transformer to problems involving input and output modalities other than text and\nto investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs\nsuch as images, audio and video. Making generation less sequential is another research goals of ours. The code we used to train and evaluate our models is available at `[https://github.com/](https://github.com/tensorflow/tensor2tensor)`\n`[tensorflow/tensor2tensor](https://github.com/tensorflow/tensor2tensor)` . **Acknowledgements** We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful\ncomments, corrections and inspiration. **References**\n\n\n[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. _arXiv preprint_\n_[arXiv:1607.06450](http://arxiv.org/abs/1607.06450)_, 2016. [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. _CoRR_, abs/1409.0473, 2014. [3] Denny Britz, A..."
}
],
"model_verdict": {
"supported": true,
"confidence": 0.99,
"evidence": "the Transformer is the first transduction model relying entirely on self-attention ... without using sequence‑aligned RNNs or convolution.",
"reason": "Context explicitly states Transformer relies fully on attention and has no recurrent components."
}
},
"2": {
"supported_by_embeddings": true,
"max_similarity": 0.6170728206634521,
"evidence": [
{
"idx": 33,
"page": 10,
"score": 0.6170728206634521,
"text": "Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the\nRecurrent Neural Network Grammar [8]. In contrast to RNN sequence-to-sequence models [ 37 ], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences. **7** **Conclusion**\n\n\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014\nEnglish-to-French translation tasks, we achieve a new state of the art. In the former task our best\nmodel outperforms even all previously reported ensembl..."
},
{
"idx": 3,
"page": 9,
"score": 0.5712530016899109,
"text": "This task presents specific challenges: the output is subject to strong structural\nconstraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence\nmodels have not been able to attain state-of-the-art results in small-data regimes [37]. We trained a 4-layer transformer with _d_ _model_ = 1024 on the Wall Street Journal (WSJ) portion of the\nPenn Treebank [ 25 ], about 40K training sentences. We also trained it in a semi-supervised setting,\nusing the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences\n\n[ 37 ]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens\nfor the semi-supervised setting. We performed only a small number of experiments to select the dropout, both attention and residual\n(section 5.4), learning rates and beam size on the Section 22 development set, all other parameters\nremained unchanged from the English-to-German base translation model. During inference, we\n\n\n9"
},
{
"idx": 10,
"page": 8,
"score": 0.5459271669387817,
"text": "**Label Smoothing** During training, we employed label smoothing of value _ϵ_ _ls_ = 0 _._ 1 [ 36 ]. This\nhurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score. **6** **Results**\n\n\n**6.1** **Machine Translation**\n\n\nOn the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)\nin Table 2) outperforms the best previously reported models (including ensembles) by more than 2 _._ 0\nBLEU, establishing a new state-of-the-art BLEU score of 28 _._ 4 . The configuration of this model is\nlisted in the bottom line of Table 3. Training took 3 _._ 5 days on 8 P100 GPUs. Even our base model\nsurpasses all previously published models and ensembles, at a fraction of the training cost of any of\nthe competitive models. On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41 _._ 0,\noutperforming all of the previously published single models, at less than 1 _/_ 4 the training cost of the\nprevious..."
}
],
"model_verdict": {
"supported": false,
"confidence": 0.99,
"evidence": "",
"reason": "The provided passage contains no information about which optimization algorithm was used, so the answer 'Adam optimizer' cannot be verified."
}
},
"3": {
"supported_by_embeddings": true,
"max_similarity": 0.7403339743614197,
"evidence": [
{
"idx": 24,
"page": 8,
"score": 0.7403339743614197,
"text": "Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the\nEnglish-to-German and English-to-French newstest2014 tests at a fraction of the training cost. BLEU Training Cost (FLOPs)\nModel\n\nEN-DE EN-FR EN-DE EN-FR\nByteNet [18] 23.75\nDeep-Att + PosUnk [39] 39.2 1 _._ 0 _·_ 10 [20]\n\nGNMT + RL [38] 24.6 39.92 2 _._ 3 _·_ 10 [19] 1 _._ 4 _·_ 10 [20]\n\nConvS2S [9] 25.16 40.46 9 _._ 6 _·_ 10 [18] 1 _._ 5 _·_ 10 [20]\n\nMoE [32] 26.03 40.56 2 _._ 0 _·_ 10 [19] 1 _._ 2 _·_ 10 [20]\n\nDeep-Att + PosUnk Ensemble [39] 40.4 8 _._ 0 _·_ 10 [20]\n\nGNMT + RL Ensemble [38] 26.30 41.16 1 _._ 8 _·_ 10 [20] 1 _._ 1 _·_ 10 [21]\n\nConvS2S Ensemble [9] 26.36 **41.29** 7 _._ 7 _·_ 10 [19] 1 _._ 2 _·_ 10 [21]\n\nTransformer (base model) 27.3 38.1 **3** _**.**_ **3** _**·**_ **10** **[18]**\n\nTransformer (big) **28.4** **41.8** 2 _._ 3 _·_ 10 [19]\n\n\n**Residual Dropout** We apply dropout [ 33 ] to the output of each sub-layer, before it is added to the\nsub-layer input and normalized. ..."
},
{
"idx": 1,
"page": 9,
"score": 0.6753625273704529,
"text": "Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base\nmodel. All metrics are on the English-to-German translation development set, newstest2013. Listed\nperplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to\nper-word perplexities. |Col1|train<br>N d d h d d P ϵ<br>model ff k v drop ls steps|PPL BLEU params<br>(dev) (dev) ×106|\n|---|---|---|\n|base|6<br>512<br>2048<br>8<br>64<br>64<br>0.1<br>0.1<br>100K|4.92<br>25.8<br>65|\n|(A)|1<br>512<br>512<br>4<br>128<br>128<br>16<br>32<br>32<br>32<br>16<br>16|5.29<br>24.9<br>5.00<br>25.5<br>4.91<br>25.8<br>5.01<br>25.4|\n|(B)|16<br>32|5.16<br>25.1<br>58<br>5.01<br>25.4<br>60|\n|(C)|2<br>4<br>8<br>256<br>32<br>32<br>1024<br>128<br>128<br>1024<br>4096|6.11<br>23.7<br>36<br>5.19<br>25.3<br>50<br>4.88<br>25.5<br>80<br>5.75<br>24.5<br>28<br>4.66<br>26.0<br>168<br>5.12<br>25.4<br>53<br>4.75<br>26.2<br>90|\n|(D)|0.0<br>0.2<br>0.0<br>0.2|5.77<br>24.6<br>4.95<br>25.5<br..."
},
{
"idx": 9,
"page": 2,
"score": 0.5718259811401367,
"text": "To the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention and discuss its advantages over models such as [17, 18] and [9]. **3** **Model Architecture**\n\n\nMost competitive neural sequence transduction models have an encoder-decoder structure [ 5, 2, 35 ]. Here, the encoder maps an input sequence of symbol representations ( _x_ 1 _, ..., x_ _n_ ) to a sequence\nof continuous representations **z** = ( _z_ 1 _, ..., z_ _n_ ) . Given **z**, the decoder then generates an output\nsequence ( _y_ 1 _, ..., y_ _m_ ) of symbols one element at a time. At each step the model is auto-regressive\n\n[10], consuming the previously generated symbols as additional input when generating the next. 2"
}
],
"model_verdict": {
"supported": true,
"confidence": 0.99,
"evidence": "base|6 512 2048 8 64 64 0.1 0.1 100K|4.92 25.8 65",
"reason": "Table 3 lists the Transformer (base) model with BLEU = 25.8 on the newstest2013 development set, matching the answer."
}
},
"4": {
"supported_by_embeddings": true,
"max_similarity": 0.6373076438903809,
"evidence": [
{
"idx": 36,
"page": 7,
"score": 0.6373076438903809,
"text": "Each training\nbatch contained a set of sentence pairs containing approximately 25000 source tokens and 25000\ntarget tokens. **5.2** **Hardware and Schedule**\n\n\nWe trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using\nthe hyperparameters described throughout the paper, each training step took about 0.4 seconds. We\ntrained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the\nbottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps\n(3.5 days). **5.3** **Optimizer**\n\n\nWe used the Adam optimizer [ 20 ] with _β_ 1 = 0 _._ 9, _β_ 2 = 0 _._ 98 and _ϵ_ = 10 _[−]_ [9] . We varied the learning\nrate over the course of training, according to the formula:\n\n\n_lrate_ = _d_ _[−]_ model [0] _[.]_ [5] _[·]_ [ min(] _[step]_ [_] _[num]_ _[−]_ [0] _[.]_ [5] _[, step]_ [_] _[num][ ·][ warmup]_ [_] _[steps]_ _[−]_ [1] _[.]_ [5] [)] (3)\n\n\nThis corresponds to increasing the learning rate linearly f..."
}
],
"model_verdict": {
"supported": true,
"confidence": 0.99,
"evidence": "We used warmup_steps = 4000.",
"reason": "Context explicitly states that warmup steps were set to 4000, matching the answer."
}
},
"5": {
"supported_by_embeddings": true,
"max_similarity": 0.7000005841255188,
"evidence": [
{
"idx": 24,
"page": 8,
"score": 0.7000005841255188,
"text": "Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the\nEnglish-to-German and English-to-French newstest2014 tests at a fraction of the training cost. BLEU Training Cost (FLOPs)\nModel\n\nEN-DE EN-FR EN-DE EN-FR\nByteNet [18] 23.75\nDeep-Att + PosUnk [39] 39.2 1 _._ 0 _·_ 10 [20]\n\nGNMT + RL [38] 24.6 39.92 2 _._ 3 _·_ 10 [19] 1 _._ 4 _·_ 10 [20]\n\nConvS2S [9] 25.16 40.46 9 _._ 6 _·_ 10 [18] 1 _._ 5 _·_ 10 [20]\n\nMoE [32] 26.03 40.56 2 _._ 0 _·_ 10 [19] 1 _._ 2 _·_ 10 [20]\n\nDeep-Att + PosUnk Ensemble [39] 40.4 8 _._ 0 _·_ 10 [20]\n\nGNMT + RL Ensemble [38] 26.30 41.16 1 _._ 8 _·_ 10 [20] 1 _._ 1 _·_ 10 [21]\n\nConvS2S Ensemble [9] 26.36 **41.29** 7 _._ 7 _·_ 10 [19] 1 _._ 2 _·_ 10 [21]\n\nTransformer (base model) 27.3 38.1 **3** _**.**_ **3** _**·**_ **10** **[18]**\n\nTransformer (big) **28.4** **41.8** 2 _._ 3 _·_ 10 [19]\n\n\n**Residual Dropout** We apply dropout [ 33 ] to the output of each sub-layer, before it is added to the\nsub-layer input and normalized. ..."
},
{
"idx": 1,
"page": 9,
"score": 0.5974264144897461,
"text": "Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base\nmodel. All metrics are on the English-to-German translation development set, newstest2013. Listed\nperplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to\nper-word perplexities. |Col1|train<br>N d d h d d P ϵ<br>model ff k v drop ls steps|PPL BLEU params<br>(dev) (dev) ×106|\n|---|---|---|\n|base|6<br>512<br>2048<br>8<br>64<br>64<br>0.1<br>0.1<br>100K|4.92<br>25.8<br>65|\n|(A)|1<br>512<br>512<br>4<br>128<br>128<br>16<br>32<br>32<br>32<br>16<br>16|5.29<br>24.9<br>5.00<br>25.5<br>4.91<br>25.8<br>5.01<br>25.4|\n|(B)|16<br>32|5.16<br>25.1<br>58<br>5.01<br>25.4<br>60|\n|(C)|2<br>4<br>8<br>256<br>32<br>32<br>1024<br>128<br>128<br>1024<br>4096|6.11<br>23.7<br>36<br>5.19<br>25.3<br>50<br>4.88<br>25.5<br>80<br>5.75<br>24.5<br>28<br>4.66<br>26.0<br>168<br>5.12<br>25.4<br>53<br>4.75<br>26.2<br>90|\n|(D)|0.0<br>0.2<br>0.0<br>0.2|5.77<br>24.6<br>4.95<br>25.5<br..."
},
{
"idx": 32,
"page": 1,
"score": 0.5703283548355103,
"text": "Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including\nensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eight GPUs, a small fraction of the training costs of the\nbest models from the literature. We show that the Transformer generalizes well to\nother tasks by applying it successfully to English constituency parsing both with\nlarge and limited training data. _∗_ Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started\nthe effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and\nhas been crucially involved in eve..."
}
],
"model_verdict": {
"supported": true,
"confidence": 0.99,
"evidence": "Transformer (big) **28.4** ...; Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task",
"reason": "Context explicitly states that the Transformer (big) model achieved a BLEU score of 28.4 on the English‑German WMT 2014 task."
}
}
}
}