Odysseas Sierepeklis committed on
Commit
62a49dc
1 Parent(s): bd6b963

Add fine-tuned BERT models on different QA datasets

.DS_Store ADDED
Binary file (10.2 kB).
 
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ mixed_best/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
37
+ squad-v2_best/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
38
+ te-cde_best/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,138 @@
1
- ---
2
- license: mit
3
- ---
1
+ # Fine-Tuned BERT Models for Thermoelectric Materials Question Answering
2
+
3
+ ## Introduction
4
+
5
+ This repository contains three BERT models fine-tuned for question-answering (QA) tasks related to thermoelectric materials. The models are trained on different datasets to evaluate their performance on specialised QA tasks in the field of materials science.
6
+
7
+ We present a method for auto-generating a large question-answering dataset about thermoelectric materials for language model applications. The method was used to generate a dataset with sentence-wide contexts from a database of thermoelectric material records. The dataset was compared against SQuAD-v2, as well as a mixed combination of the two datasets. Hyperparameter optimisation was employed to fine-tune BERT models on each dataset, and the three best-performing models were then compared on a manually annotated test set of thermoelectric material paragraph contexts, with questions spanning material names, five different properties, and temperatures during recording. The best BERT model fine-tuned on the mixed dataset outperforms the other two models when evaluated on the test dataset, indicating that mixing datasets with different semantic and syntactic scopes might be a beneficial approach to improving performance on specialised question-answering tasks.
8
+
9
+ ## Models Included
10
+
11
+ 1. **squad-v2_best**
12
+
13
+ Description: Fine-tuned on the SQuAD-v2 dataset, which is a widely used benchmark for QA tasks. \
14
+ Dataset: SQuAD-v2 \
15
+ Location: squad-v2_best/
16
+
17
+ 2. **te-cde_best**
18
+
19
+ Description: Fine-tuned on a thermoelectric materials-specific dataset generated using our auto-generation method. \
20
+ Dataset: Thermoelectric Materials QA Dataset (TE-CDE) \
21
+ Location: te-cde_best/
22
+
23
+ 3. **mixed_best**
24
+
25
+ Description: Fine-tuned on a mixed dataset combining SQuAD-v2 and the thermoelectric materials dataset to enhance performance on specialised QA tasks. \
26
+ Dataset: Combination of SQuAD-v2 and TE-CDE \
27
+ Location: mixed_best/
28
+
29
+ ## Dataset Details
30
+
31
+ **SQuAD-v2**
32
+
33
+ A reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
34
+ Some questions are unanswerable, adding complexity to the QA task.
35
+
36
+ **Thermoelectric Materials QA Dataset (TE-CDE)**
37
+
38
+ Auto-generated dataset containing QA pairs about thermoelectric materials.
39
+ Contexts are sentence-wide excerpts from a database of thermoelectric material records.
40
+ Questions cover:
41
+ - Material names
42
+ - Five different properties
43
+ - Temperatures during recording
44
+
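The auto-generation method described above can be sketched with a simple template approach. The record field names and question templates below are illustrative assumptions, not the actual TE-CDE generation pipeline:

```python
def generate_qa_pairs(record):
    """Template-based QA generation from a single (hypothetical) thermoelectric
    material record; the real TE-CDE pipeline is more involved."""
    context = (f"{record['material']} demonstrated a {record['property']} of "
               f"{record['value']} at {record['temperature']}.")
    return [
        {"question": f"What is the value of the {record['property']}?",
         "context": context, "answer": record["value"]},
        {"question": f"Which material has a {record['property']} of {record['value']}?",
         "context": context, "answer": record["material"]},
        {"question": f"At what temperature was the {record['property']} recorded?",
         "context": context, "answer": record["temperature"]},
    ]
```

Each generated answer is guaranteed to appear verbatim in its context, which is what makes extractive (span-based) QA training possible.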
45
+ **Mixed Dataset**
46
+
47
+ A combination of SQuAD-v2 and TE-CDE datasets.
48
+ Aims to leverage the strengths of both general-purpose and domain-specific data.
49
+
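One plausible way to build such a mixed dataset is to concatenate the two example lists and shuffle them; a minimal sketch (the actual mixing proportions and any deduplication used for mixed_best are not specified here):

```python
import random

def mix_datasets(general_examples, domain_examples, seed=0):
    """Concatenate two QA example lists and shuffle them deterministically,
    so general-purpose and domain-specific examples interleave during training."""
    mixed = list(general_examples) + list(domain_examples)
    random.Random(seed).shuffle(mixed)
    return mixed
```

Fixing the seed keeps the mixed training order reproducible across runs.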
50
+ ## Training Details
51
+
52
+ - Base Model: BERT Base Uncased
53
+ - Hyperparameter Optimisation: Employed to find the best-performing model for each dataset.
54
+ - Training Parameters:
55
+   - Epochs: Adjusted per dataset based on validation loss.
56
+   - Batch Size: Optimised during training.
57
+   - Learning Rate: Tuned using grid search.
58
+
59
+ ## Evaluation Metrics
60
+
61
+ - Evaluation Dataset: Manually annotated test set of thermoelectric material paragraph contexts.
62
+ - Metrics Used:
63
+   - Exact Match (EM): Measures the percentage of predictions that match any one of the ground-truth answers exactly.
64
+   - F1 Score: Harmonic mean of precision and recall, considering overlap between the prediction and ground-truth answers.
65
+
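For reference, EM and token-level F1 can be computed as in the standard SQuAD evaluation convention; the answer-normalisation steps below are the usual SQuAD ones, assumed rather than taken from the exact evaluation script used here:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalised prediction equals the normalised ground truth, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With multiple ground-truth answers, each metric is taken as the maximum over the references, and scores are averaged over the test set.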
66
+ ### Performance Comparison
67
+ | Model | Exact Match (EM) | F1 Score |
68
+ | --- | --- | --- |
+ | squad-v2_best | 57.60% | 61.82% |
69
+ | te-cde_best | 65.39% | 69.78% |
70
+ | mixed_best | 67.92% | 72.29% |
71
+
72
+ ## Usage Instructions
73
+
74
+ ### Installing Dependencies
75
+
76
+ ```bash
77
+ pip install transformers
78
+ ```
79
+
80
+ ### Loading a Model
81
+
82
+ Replace `model_name` with one of the following:
83
+
84
+ - `odysie/bert-finetuned-qa-datasets/squad-v2_best`
85
+ - `odysie/bert-finetuned-qa-datasets/te-cde_best`
86
+ - `odysie/bert-finetuned-qa-datasets/mixed_best`
87
+
88
+ ```python
89
+ from transformers import BertForQuestionAnswering, BertTokenizer
90
+
91
+ model_name = "odysie/bert-finetuned-qa-datasets/mixed_best"
92
+
93
+ tokenizer = BertTokenizer.from_pretrained(model_name)
94
+ model = BertForQuestionAnswering.from_pretrained(model_name)
95
+
96
+ # Example question and context
97
+ question = "What is the chemical formula for water?"
98
+ context = "Water is a molecule composed of two hydrogen atoms and one oxygen atom, with the chemical formula H2O."
99
+
100
+ # Tokenize input
101
+ inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
102
+
103
+ # Get model predictions
104
+ outputs = model(**inputs)
105
+ start_scores = outputs.start_logits
106
+ end_scores = outputs.end_logits
107
+
108
+ # Get the most likely beginning and end of answer with the argmax of the score
109
+ start_index = start_scores.argmax()
110
+ end_index = end_scores.argmax()
111
+
112
+ # Convert tokens to answer
113
+ tokens = inputs["input_ids"][0][start_index : end_index + 1]
114
+ answer = tokenizer.decode(tokens)
115
+
116
+ print(f"Answer: {answer}")
117
+ ```
118
+
119
+ ## License
120
+
121
+ This project is licensed under the Apache 2.0 License.
122
+
123
+
124
+ ## Citation
125
+
126
+ If you use these models in your research or application, please cite our work:
127
+
128
+ ```bibtex
129
+ (PENDING)
130
+ @article{
131
+ ...
132
+ }
133
+ ```
135
+
136
+ ## Acknowledgments
137
+
138
+ We thank the contributors of the SQuAD-v2 dataset and the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
mixed_best/README.md ADDED
@@ -0,0 +1,126 @@
1
+ # mixed_best
2
+
3
+ ## Overview
4
+
5
+ This model is a fine-tuned version of BERT Base Uncased on a mix of the SQuAD-v2 and TE-CDE datasets. It is optimised for specialised question-answering tasks in the field of thermoelectric materials, across five seminal quantities (the thermoelectric figure of merit, thermal conductivity, Seebeck coefficient, electrical conductivity, and power factor), while still performing well on general questions, and can be used to extract answers from given contexts.
6
+
7
+ ## Model Details
8
+
9
+ - Model Type: BERT Base Uncased
10
+ - Fine-Tuned On: Mixed Dataset
11
+ - Language: English
12
+ - License: Apache 2.0
13
+ - Tags: bert, question-answering, transformers, fine-tuned, thermoelectric materials
14
+
15
+ ## Usage
16
+ ### Installation
17
+
18
+ Make sure you have the Transformers library installed:
19
+
20
+ ```bash
21
+ pip install transformers
22
+ ```
23
+
24
+ ### Loading the Model
25
+
26
+ ```python
27
+ from transformers import BertForQuestionAnswering, BertTokenizer
28
+
29
+ model_name = "odysie/bert-finetuned-qa-datasets/mixed_best"
30
+
31
+ tokenizer = BertTokenizer.from_pretrained(model_name)
32
+ model = BertForQuestionAnswering.from_pretrained(model_name)
33
+
34
+ # To load a different checkpoint, replace "mixed_best" in model_name above with "squad-v2_best" or "te-cde_best"
35
+ ```
36
+
37
+ ### Example Usage
38
+
39
+ ```python
40
+ from transformers import BertForQuestionAnswering, BertTokenizer
41
+ import torch
42
+
43
+ model_name = "odysie/bert-finetuned-qa-datasets/squad-v2_best"
44
+
45
+ tokenizer = BertTokenizer.from_pretrained(model_name)
46
+ model = BertForQuestionAnswering.from_pretrained(model_name)
47
+
48
+ # Sample question and context
49
+ question = "What is the value of the Seebeck coefficient?"
50
+ context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."
51
+
52
+ # Tokenize input
53
+ inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
54
+ input_ids = inputs["input_ids"].tolist()[0]
55
+
56
+ # Get model output
57
+ outputs = model(**inputs)
58
+ answer_start_scores = outputs.start_logits
59
+ answer_end_scores = outputs.end_logits
60
+
61
+ # Find the tokens with the highest `start` and `end` scores
62
+ answer_start = torch.argmax(answer_start_scores)
63
+ answer_end = torch.argmax(answer_end_scores) + 1
64
+
65
+ # Convert tokens to answer
66
+ answer = tokenizer.convert_tokens_to_string(
67
+ tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
68
+ )
69
+
70
+ print(f"Question: {question}")
71
+ print(f"Answer: {answer}")
72
+ ```
73
+
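The greedy span selection in the example above (argmax of the start and end logits) can be isolated as a small helper. Note that it does not enforce start ≤ end jointly or exclude special tokens, which a production decoder would; this helper is an illustration, not part of the released code:

```python
def extract_answer_span(tokens, start_scores, end_scores):
    """Greedy QA span decoding: take the argmax start and argmax end token
    positions and join the tokens in between (inclusive)."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    if start > end:  # degenerate prediction; a real decoder would search spans jointly
        return ""
    return " ".join(tokens[start:end + 1])
```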
74
+ ### Training hyperparameters
75
+
76
+ The following hyperparameters were used during training:
77
+ - learning_rate: 6.257686103023713e-05
78
+ - train_batch_size: 8
79
+ - eval_batch_size: 8
80
+ - seed: 0
81
+ - distributed_type: multi-GPU
82
+ - num_devices: 16
83
+ - total_train_batch_size: 128
84
+ - total_eval_batch_size: 128
85
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
86
+ - lr_scheduler_type: linear
87
+ - lr_scheduler_warmup_ratio: 0.08443391405864548
88
+ - num_epochs: 5.0
89
+ - mixed_precision_training: Native AMP
90
+
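As a sanity check on these settings, the `linear` scheduler with a warmup ratio can be re-implemented as the usual linear warmup/decay rule (the step counts in the test are illustrative; this mirrors, but is not taken from, the Transformers scheduler code):

```python
def linear_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate under linear warmup for warmup_ratio * total_steps steps,
    then linear decay to zero, as in a `linear` scheduler with warmup."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```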
91
+ ### Framework versions
92
+
93
+ - Transformers 4.41.0
94
+ - Pytorch 2.3.0+cu121
95
+ - Datasets 2.19.1
96
+ - Tokenizers 0.19.1
97
+
98
+ ## Dataset
99
+ ### Description
100
+
101
+ SQuAD-v2 combines the 100,000 questions in SQuAD-v1.1 with over 50,000 unanswerable questions. This dataset tests the ability of a model not only to answer questions when possible but also to abstain from answering when the question is unanswerable based on the context.
102
+
103
+ ### Link
104
+
105
+ SQuAD-v2: https://rajpurkar.github.io/SQuAD-explorer/
106
+
107
+ ## License
108
+
109
+ This project is licensed under the Apache 2.0 License.
110
+
111
+
112
+ ## Citation
113
+
114
+ If you use these models in your research or application, please cite our work:
115
+
116
+ ```bibtex
117
+ (PENDING)
118
+ @article{
119
+ ...
120
+ }
121
+ ```
123
+
124
+ ## Acknowledgments
125
+
126
+ We thank the contributors of the SQuAD-v2 dataset and the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
mixed_best/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "bert-base-uncased",
3
+ "architectures": [
4
+ "BertForQuestionAnswering"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float16",
22
+ "transformers_version": "4.41.0",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30522
26
+ }
mixed_best/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ce6682bc301b640dfff04fb949fe418529aa4c698fccf8a4e7756055029b1d8e
3
+ size 435638182
mixed_best/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
mixed_best/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
mixed_best/tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": true,
47
+ "mask_token": "[MASK]",
48
+ "model_max_length": 512,
49
+ "pad_token": "[PAD]",
50
+ "sep_token": "[SEP]",
51
+ "strip_accents": null,
52
+ "tokenize_chinese_chars": true,
53
+ "tokenizer_class": "BertTokenizer",
54
+ "unk_token": "[UNK]"
55
+ }
mixed_best/vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
squad-v2_best/README.md ADDED
@@ -0,0 +1,126 @@
1
+ # squad-v2_best
2
+
3
+ ## Overview
4
+
5
+ This model is a fine-tuned version of BERT Base Uncased on the SQuAD-v2 dataset. It is optimised for question-answering tasks and can be used to extract answers from given contexts.
6
+
7
+ ## Model Details
8
+
9
+ - Model Type: BERT Base Uncased
10
+ - Fine-Tuned On: SQuAD-v2
11
+ - Language: English
12
+ - License: Apache 2.0
13
+ - Tags: bert, question-answering, transformers, fine-tuned
14
+
15
+ ## Usage
16
+ ### Installation
17
+
18
+ Make sure you have the Transformers library installed:
19
+
20
+ ```bash
21
+ pip install transformers
22
+ ```
23
+
24
+ ### Loading the Model
25
+
26
+ ```python
27
+ from transformers import BertForQuestionAnswering, BertTokenizer
28
+
29
+ model_name = "odysie/bert-finetuned-qa-datasets/squad-v2_best"
30
+
31
+ tokenizer = BertTokenizer.from_pretrained(model_name)
32
+ model = BertForQuestionAnswering.from_pretrained(model_name)
33
+
34
+ # To load a different checkpoint, replace "squad-v2_best" in model_name above with "te-cde_best" or "mixed_best"
35
+ ```
36
+
37
+ ### Example Usage
38
+
39
+ ```python
40
+ from transformers import BertForQuestionAnswering, BertTokenizer
41
+ import torch
42
+
43
+ model_name = "odysie/bert-finetuned-qa-datasets/squad-v2_best"
44
+
45
+ tokenizer = BertTokenizer.from_pretrained(model_name)
46
+ model = BertForQuestionAnswering.from_pretrained(model_name)
47
+
48
+ # Sample question and context
49
+ question = "What is the value of the Seebeck coefficient?"
50
+ context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."
51
+
52
+ # Tokenize input
53
+ inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
54
+ input_ids = inputs["input_ids"].tolist()[0]
55
+
56
+ # Get model output
57
+ outputs = model(**inputs)
58
+ answer_start_scores = outputs.start_logits
59
+ answer_end_scores = outputs.end_logits
60
+
61
+ # Find the tokens with the highest `start` and `end` scores
62
+ answer_start = torch.argmax(answer_start_scores)
63
+ answer_end = torch.argmax(answer_end_scores) + 1
64
+
65
+ # Convert tokens to answer
66
+ answer = tokenizer.convert_tokens_to_string(
67
+ tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
68
+ )
69
+
70
+ print(f"Question: {question}")
71
+ print(f"Answer: {answer}")
72
+ ```
73
+
74
+ ### Training hyperparameters
75
+
76
+ The following hyperparameters were used during training:
77
+ - learning_rate: 1.5218292681575764e-05
78
+ - train_batch_size: 1
79
+ - eval_batch_size: 8
80
+ - seed: 0
81
+ - distributed_type: multi-GPU
82
+ - num_devices: 16
83
+ - total_train_batch_size: 16
84
+ - total_eval_batch_size: 128
85
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
86
+ - lr_scheduler_type: linear
87
+ - lr_scheduler_warmup_ratio: 0.0961958191102116
88
+ - num_epochs: 3.0
89
+ - mixed_precision_training: Native AMP
90
+
91
+ ### Framework versions
92
+
93
+ - Transformers 4.41.0
94
+ - Pytorch 2.3.0+cu121
95
+ - Datasets 2.19.1
96
+ - Tokenizers 0.19.1
97
+
98
+ ## Dataset
99
+ ### Description
100
+
101
+ SQuAD-v2 combines the 100,000 questions in SQuAD-v1.1 with over 50,000 unanswerable questions. This dataset tests the ability of a model not only to answer questions when possible but also to abstain from answering when the question is unanswerable based on the context.
102
+
103
+ ### Link
104
+
105
+ SQuAD-v2: https://rajpurkar.github.io/SQuAD-explorer/
106
+
107
+ ## License
108
+
109
+ This project is licensed under the Apache 2.0 License.
110
+
111
+
112
+ ## Citation
113
+
114
+ If you use these models in your research or application, please cite our work:
115
+
116
+ ```bibtex
117
+ (PENDING)
118
+ @article{
119
+ ...
120
+ }
121
+ ```
123
+
124
+ ## Acknowledgments
125
+
126
+ We thank the contributors of the SQuAD-v2 dataset and the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
squad-v2_best/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "bert-base-uncased",
3
+ "architectures": [
4
+ "BertForQuestionAnswering"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float16",
22
+ "transformers_version": "4.41.0",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30522
26
+ }
squad-v2_best/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:419696115b900e686f06c2168accf8e35904d2ed3c62e774dc857c1a9c6c5c81
3
+ size 435638182
squad-v2_best/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
squad-v2_best/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
squad-v2_best/tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": true,
47
+ "mask_token": "[MASK]",
48
+ "model_max_length": 512,
49
+ "pad_token": "[PAD]",
50
+ "sep_token": "[SEP]",
51
+ "strip_accents": null,
52
+ "tokenize_chinese_chars": true,
53
+ "tokenizer_class": "BertTokenizer",
54
+ "unk_token": "[UNK]"
55
+ }
squad-v2_best/vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
te-cde_best/README.md ADDED
@@ -0,0 +1,124 @@
1
+ # te-cde_best
2
+
3
+ ## Overview
4
+
5
+ This model is a fine-tuned version of BERT Base Uncased on the automatically generated TE-CDE dataset. It is optimised for specialised question-answering tasks in the field of thermoelectric materials, across five seminal quantities (the thermoelectric figure of merit, thermal conductivity, Seebeck coefficient, electrical conductivity, and power factor), and can be used to extract answers from given contexts.
6
+
7
+ ## Model Details
8
+
9
+ - Model Type: BERT Base Uncased
10
+ - Fine-Tuned On: TE-CDE
11
+ - Language: English
12
+ - License: Apache 2.0
13
+ - Tags: bert, question-answering, transformers, fine-tuned, thermoelectric materials
14
+
15
+ ## Usage
16
+ ### Installation
17
+
18
+ Make sure you have the Transformers library installed:
19
+
20
+ ```bash
21
+ pip install transformers
22
+ ```
23
+
24
+ ### Loading the Model
25
+
26
+ ```python
27
+ from transformers import BertForQuestionAnswering, BertTokenizer
28
+
29
+ model_name = "odysie/bert-finetuned-qa-datasets/te-cde_best"
30
+
31
+ tokenizer = BertTokenizer.from_pretrained(model_name)
32
+ model = BertForQuestionAnswering.from_pretrained(model_name)
33
+
34
+ # To load a different checkpoint, replace "te-cde_best" in model_name above with "squad-v2_best" or "mixed_best"
35
+ ```
36
+
37
+ ### Example Usage
38
+
39
+ ```python
40
+ from transformers import BertForQuestionAnswering, BertTokenizer
41
+ import torch
42
+
43
+ model_name = "odysie/bert-finetuned-qa-datasets/te-cde_best"
44
+
45
+ tokenizer = BertTokenizer.from_pretrained(model_name)
46
+ model = BertForQuestionAnswering.from_pretrained(model_name)
47
+
48
+ # Sample question and context
49
+ question = "What is the value of the Seebeck coefficient?"
50
+ context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."
51
+
52
+ # Tokenize input
53
+ inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
54
+ input_ids = inputs["input_ids"].tolist()[0]
55
+
56
+ # Get model output
57
+ outputs = model(**inputs)
58
+ answer_start_scores = outputs.start_logits
59
+ answer_end_scores = outputs.end_logits
60
+
61
+ # Find the tokens with the highest `start` and `end` scores
62
+ answer_start = torch.argmax(answer_start_scores)
63
+ answer_end = torch.argmax(answer_end_scores) + 1
64
+
65
+ # Convert tokens to answer
66
+ answer = tokenizer.convert_tokens_to_string(
67
+ tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
68
+ )
69
+
70
+ print(f"Question: {question}")
71
+ print(f"Answer: {answer}")
72
+ ```
73
+
74
+ ### Training hyperparameters
75
+
76
+ The following hyperparameters were used during training:
77
+ - learning_rate: 7.113287430580505e-05
78
+ - train_batch_size: 16
79
+ - eval_batch_size: 8
80
+ - seed: 0
81
+ - distributed_type: multi-GPU
82
+ - num_devices: 16
83
+ - total_train_batch_size: 256
84
+ - total_eval_batch_size: 128
85
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
86
+ - lr_scheduler_type: linear
87
+ - lr_scheduler_warmup_ratio: 0.1466796240672773
88
+ - num_epochs: 13.0
89
+ - mixed_precision_training: Native AMP
90
+
91
+ ### Framework versions
92
+
93
+ - Transformers 4.41.0
94
+ - Pytorch 2.3.0+cu121
95
+ - Datasets 2.19.1
96
+ - Tokenizers 0.19.1
97
+
98
+ ## Dataset
99
+ ### Description
100
+
101
+ TE-CDE contains 99,757 questions automatically generated from a thermoelectric materials database, across five different properties: the thermoelectric figure of merit, thermal conductivity, Seebeck coefficient, electrical conductivity, and power factor. 66,508 questions are answerable from the context, and 33,249 are not.
102
+
103
+ ## License
104
+
105
+ This project is licensed under the Apache 2.0 License.
106
+
107
+ ## Citation
108
+
109
+ If you use these models in your research or application, please cite our work:
110
+
111
+ ```bibtex
112
+ (PENDING)
113
+ @article{
114
+ ...
115
+ }
116
+ ```
118
+
119
+ ## Acknowledgments
120
+
121
+ We thank the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
122
+
123
+
124
+
te-cde_best/config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_name_or_path": "bert-base-uncased",
3
+ "architectures": [
4
+ "BertForQuestionAnswering"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float16",
22
+ "transformers_version": "4.41.0",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 30522
26
+ }
te-cde_best/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:725cd8053135df5c3fc66e21b1387dd452862a4e208d062fa40a3621ed6ee48a
3
+ size 435638182
te-cde_best/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
te-cde_best/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
te-cde_best/tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": true,
47
+ "mask_token": "[MASK]",
48
+ "model_max_length": 512,
49
+ "pad_token": "[PAD]",
50
+ "sep_token": "[SEP]",
51
+ "strip_accents": null,
52
+ "tokenize_chinese_chars": true,
53
+ "tokenizer_class": "BertTokenizer",
54
+ "unk_token": "[UNK]"
55
+ }
te-cde_best/vocab.txt ADDED
The diff for this file is too large to render. See raw diff