Odysseas Sierepeklis committed on
Commit · 62a49dc
1 Parent(s): bd6b963
Add fine-tuned BERT models on different QA datasets
Browse files
- .DS_Store +0 -0
- .gitattributes +3 -0
- README.md +138 -3
- mixed_best/README.md +126 -0
- mixed_best/config.json +26 -0
- mixed_best/pytorch_model.bin +3 -0
- mixed_best/special_tokens_map.json +7 -0
- mixed_best/tokenizer.json +0 -0
- mixed_best/tokenizer_config.json +55 -0
- mixed_best/vocab.txt +0 -0
- squad-v2_best/README.md +126 -0
- squad-v2_best/config.json +26 -0
- squad-v2_best/pytorch_model.bin +3 -0
- squad-v2_best/special_tokens_map.json +7 -0
- squad-v2_best/tokenizer.json +0 -0
- squad-v2_best/tokenizer_config.json +55 -0
- squad-v2_best/vocab.txt +0 -0
- te-cde_best/README.md +124 -0
- te-cde_best/config.json +26 -0
- te-cde_best/pytorch_model.bin +3 -0
- te-cde_best/special_tokens_map.json +7 -0
- te-cde_best/tokenizer.json +0 -0
- te-cde_best/tokenizer_config.json +55 -0
- te-cde_best/vocab.txt +0 -0
.DS_Store
ADDED
Binary file (10.2 kB)
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+mixed_best/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+squad-v2_best/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+te-cde_best/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
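For reference, the same tracking rules could have been produced with `git lfs track` (a sketch; the lines may equally have been added to .gitattributes by hand):

```bash
# Each command appends a matching filter=lfs line to .gitattributes.
git lfs track "mixed_best/pytorch_model.bin"
git lfs track "squad-v2_best/pytorch_model.bin"
git lfs track "te-cde_best/pytorch_model.bin"
```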
README.md
CHANGED
@@ -1,3 +1,138 @@
-
-
-
# Fine-Tuned BERT Models for Thermoelectric Materials Question Answering

## Introduction

This repository contains three BERT models fine-tuned for question-answering (QA) tasks related to thermoelectric materials. The models are trained on different datasets to evaluate their performance on specialised QA tasks in the field of materials science.

We present a method for auto-generating a large question-answering dataset about thermoelectric materials for language-model applications. The method was used to generate a dataset with sentence-wide contexts from a database of thermoelectric material records. This dataset was contrasted with SQuAD-v2, as well as with a mixed combination of the two datasets. Hyperparameter optimisation was employed to fine-tune BERT models on each dataset, and the three best-performing models were then compared on a manually annotated test set of thermoelectric material paragraph contexts, with questions spanning material names, five different properties, and the temperatures at which measurements were recorded. The best BERT model fine-tuned on the mixed dataset outperforms the other two models when evaluated on the test dataset, indicating that mixing datasets with different semantic and syntactic scopes might be a beneficial approach to improving performance on specialised question-answering tasks.
## Models Included

1. **squad-v2_best**

   Description: Fine-tuned on the SQuAD-v2 dataset, which is a widely used benchmark for QA tasks. \
   Dataset: SQuAD-v2 \
   Location: squad-v2_best/

2. **te-cde_best**

   Description: Fine-tuned on a thermoelectric-materials-specific dataset generated using our auto-generation method. \
   Dataset: Thermoelectric Materials QA Dataset (TE-CDE) \
   Location: te-cde_best/

3. **mixed_best**

   Description: Fine-tuned on a mixed dataset combining SQuAD-v2 and the thermoelectric materials dataset to enhance performance on specialised QA tasks. \
   Dataset: Combination of SQuAD-v2 and TE-CDE \
   Location: mixed_best/
## Dataset Details

**SQuAD-v2**

A reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. Some questions are unanswerable, adding complexity to the QA task.

**Thermoelectric Materials QA Dataset (TE-CDE)**

An auto-generated dataset containing QA pairs about thermoelectric materials. Contexts are sentence-wide excerpts from a database of thermoelectric material records. Questions cover:

- Material names
- Five different properties
- Temperatures during recording

**Mixed Dataset**

A combination of the SQuAD-v2 and TE-CDE datasets, aiming to leverage the strengths of both general-purpose and domain-specific data.
## Training Details

- Base Model: BERT Base Uncased
- Hyperparameter Optimisation: Employed to find the best-performing model for each dataset.
- Training Parameters:
  - Epochs: Adjusted per dataset based on validation loss.
  - Batch Size: Optimised during training.
  - Learning Rate: Tuned using grid search.
## Evaluation Metrics

- Evaluation Dataset: Manually annotated test set of thermoelectric material paragraph contexts.
- Metrics Used:
  - Exact Match (EM): Measures the percentage of predictions that match any one of the ground-truth answers exactly.
  - F1 Score: Harmonic mean of precision and recall, considering overlap between the prediction and ground-truth answers.
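For reference, a minimal sketch of how SQuAD-style EM and token-level F1 are typically computed (our illustration of the standard definitions, not the exact evaluation script used here):

```python
from collections import Counter

def exact_match(prediction, ground_truths):
    # EM: the normalised prediction must equal one of the ground-truth answers.
    norm = lambda s: " ".join(s.lower().split())
    return any(norm(prediction) == norm(t) for t in ground_truths)

def f1_score(prediction, ground_truths):
    # Token-level F1 against the best-matching ground-truth answer.
    def f1(pred_tokens, true_tokens):
        overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(true_tokens)
        return 2 * precision * recall / (precision + recall)
    pred_tokens = prediction.lower().split()
    return max(f1(pred_tokens, t.lower().split()) for t in ground_truths)
```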
### Performance Comparison

| Model | Exact Match (EM) | F1 Score |
|---|---|---|
| squad-v2_best | 57.60% | 61.82% |
| te-cde_best | 65.39% | 69.78% |
| mixed_best | 67.92% | 72.29% |
## Usage Instructions

### Installing Dependencies

```bash
pip install transformers
```

### Loading a Model

Replace `model_name` with one of the following:

- "odysie/bert-finetuned-qa-datasets/squad-v2_best"
- "odysie/bert-finetuned-qa-datasets/te-cde_best"
- "odysie/bert-finetuned-qa-datasets/mixed_best"
```python
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = "odysie/bert-finetuned-qa-datasets/mixed_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Example question and context
question = "What is the chemical formula for water?"
context = "Water is a molecule composed of two hydrogen atoms and one oxygen atom, with the chemical formula H2O."

# Tokenize input
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")

# Get model predictions
outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Get the most likely beginning and end of the answer with the argmax of the scores
start_index = start_scores.argmax()
end_index = end_scores.argmax()

# Convert tokens to answer
tokens = inputs["input_ids"][0][start_index : end_index + 1]
answer = tokenizer.decode(tokens)

print(f"Answer: {answer}")
```
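Note that the identifiers above point at subfolders of a single repository; depending on your `transformers` version, loading may require passing the repository id and a `subfolder` argument instead. A minimal sketch, assuming the checkpoints live in subfolders of `odysie/bert-finetuned-qa-datasets`:

```python
from transformers import BertForQuestionAnswering, BertTokenizer

# Assumption: the three checkpoints are stored as subfolders of one repository.
repo_id = "odysie/bert-finetuned-qa-datasets"
subfolder = "mixed_best"  # or "squad-v2_best" / "te-cde_best"

tokenizer = BertTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = BertForQuestionAnswering.from_pretrained(repo_id, subfolder=subfolder)
```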
## License

This project is licensed under Apache 2.0.
## Citation

If you use these models in your research or application, please cite our work:

```bibtex
(PENDING)

@article{
...
}
```
## Acknowledgments

We thank the contributors of the SQuAD-v2 dataset and the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
mixed_best/README.md
ADDED
@@ -0,0 +1,126 @@
### Model Name

## Overview

This model is a fine-tuned version of BERT Base Uncased on a mix of the SQuAD-v2 and TE-CDE datasets. It is optimised for specialised question-answering tasks in the field of thermoelectric materials, across five key quantities (the thermoelectric figure of merit, thermal conductivity, Seebeck coefficient, electrical conductivity, and power factor), while still performing well on general questions, and can be used to extract answers from given contexts.
## Model Details

- Model Type: BERT Base Uncased
- Fine-Tuned On: Mixed Dataset
- Language: English
- License: Apache 2.0
- Tags: bert, question-answering, transformers, fine-tuned, thermoelectric materials
## Usage

### Installation

Make sure you have the Transformers library installed:

```bash
pip install transformers
```

### Loading the Model
```python
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = "odysie/bert-finetuned-qa-datasets/mixed_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Replace the final path component with squad-v2_best, te-cde_best, or
# mixed_best to load one of the other checkpoints.
```
### Example Usage

```python
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

model_name = "odysie/bert-finetuned-qa-datasets/mixed_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Sample question and context
question = "What is the value of the Seebeck coefficient?"
context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."

# Tokenize input
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Get model output
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Find the tokens with the highest `start` and `end` scores
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1

# Convert tokens to answer
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

print(f"Question: {question}")
print(f"Answer: {answer}")
```
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 6.257686103023713e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 0
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 128
- total_eval_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.08443391405864548
- num_epochs: 5.0
- mixed_precision_training: Native AMP
### Framework versions

- Transformers 4.41.0
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## Dataset

### Description

SQuAD-v2 combines the 100,000 questions in SQuAD-v1.1 with over 50,000 unanswerable questions. This dataset tests the ability of a model not only to answer questions when possible, but also to abstain from answering when the question is unanswerable based on the context. For this checkpoint, SQuAD-v2 was mixed with the TE-CDE dataset described in the repository README.

### Link

SQuAD-v2: https://rajpurkar.github.io/SQuAD-explorer/
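Because unanswerable SQuAD-v2 questions are conventionally trained with the answer span collapsed onto the [CLS] token (position 0), a prediction with start and end both at 0 can be treated as an abstention. A minimal sketch, assuming this checkpoint follows that convention:

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = "odysie/bert-finetuned-qa-datasets/mixed_best"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

def answer_or_abstain(question, context):
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    if start == 0 and end == 0:
        return None  # span collapsed onto [CLS]: treated as "no answer"
    return tokenizer.decode(inputs["input_ids"][0][start : end + 1])
```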
## License

This project is licensed under Apache 2.0.
## Citation

If you use these models in your research or application, please cite our work:

```bibtex
(PENDING)

@article{
...
}
```
## Acknowledgments

We thank the contributors of the SQuAD-v2 dataset and the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
mixed_best/config.json
ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.41.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
mixed_best/pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce6682bc301b640dfff04fb949fe418529aa4c698fccf8a4e7756055029b1d8e
size 435638182
mixed_best/special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
mixed_best/tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
mixed_best/tokenizer_config.json
ADDED
@@ -0,0 +1,55 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
mixed_best/vocab.txt
ADDED
The diff for this file is too large to render. See raw diff.
squad-v2_best/README.md
ADDED
@@ -0,0 +1,126 @@
### Model Name

## Overview

This model is a fine-tuned version of BERT Base Uncased on the SQuAD-v2 dataset. It is optimised for question-answering tasks and can be used to extract answers from given contexts.
## Model Details

- Model Type: BERT Base Uncased
- Fine-Tuned On: SQuAD-v2
- Language: English
- License: Apache 2.0
- Tags: bert, question-answering, transformers, fine-tuned
## Usage

### Installation

Make sure you have the Transformers library installed:

```bash
pip install transformers
```

### Loading the Model
```python
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = "odysie/bert-finetuned-qa-datasets/squad-v2_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Replace the final path component with squad-v2_best, te-cde_best, or
# mixed_best to load one of the other checkpoints.
```
### Example Usage

```python
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

model_name = "odysie/bert-finetuned-qa-datasets/squad-v2_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Sample question and context
question = "What is the value of the Seebeck coefficient?"
context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."

# Tokenize input
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Get model output
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Find the tokens with the highest `start` and `end` scores
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1

# Convert tokens to answer
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

print(f"Question: {question}")
print(f"Answer: {answer}")
```
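The same extraction can be written more compactly with the high-level `pipeline` API; a sketch, assuming the checkpoint path resolves the same way as in the example above (`handle_impossible_answer=True` lets the pipeline return an empty answer for unanswerable questions):

```python
from transformers import pipeline

qa = pipeline("question-answering", model="odysie/bert-finetuned-qa-datasets/squad-v2_best")

result = qa(
    question="What is the value of the Seebeck coefficient?",
    context="Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K.",
    handle_impossible_answer=True,  # allow an empty answer when no span fits
)
print(result["answer"], result["score"])
```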
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1.5218292681575764e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 16
- total_eval_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.0961958191102116
- num_epochs: 3.0
- mixed_precision_training: Native AMP
### Framework versions

- Transformers 4.41.0
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## Dataset

### Description

SQuAD-v2 combines the 100,000 questions in SQuAD-v1.1 with over 50,000 unanswerable questions. This dataset tests the ability of a model not only to answer questions when possible, but also to abstain from answering when the question is unanswerable based on the context.

### Link

SQuAD-v2: https://rajpurkar.github.io/SQuAD-explorer/
## License

This project is licensed under Apache 2.0.
## Citation

If you use these models in your research or application, please cite our work:

```bibtex
(PENDING)

@article{
...
}
```
## Acknowledgments

We thank the contributors of the SQuAD-v2 dataset and the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
squad-v2_best/config.json
ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.41.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
squad-v2_best/pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:419696115b900e686f06c2168accf8e35904d2ed3c62e774dc857c1a9c6c5c81
size 435638182
squad-v2_best/special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
squad-v2_best/tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
squad-v2_best/tokenizer_config.json
ADDED
@@ -0,0 +1,55 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
squad-v2_best/vocab.txt
ADDED
The diff for this file is too large to render. See raw diff.
te-cde_best/README.md
ADDED
@@ -0,0 +1,124 @@
### Model Name

## Overview

This model is a fine-tuned version of BERT Base Uncased on the automatically generated TE-CDE dataset. It is optimised for specialised question-answering tasks in the field of thermoelectric materials, across five key quantities (the thermoelectric figure of merit, thermal conductivity, Seebeck coefficient, electrical conductivity, and power factor), and can be used to extract answers from given contexts.
## Model Details

- Model Type: BERT Base Uncased
- Fine-Tuned On: TE-CDE
- Language: English
- License: Apache 2.0
- Tags: bert, question-answering, transformers, fine-tuned, thermoelectric materials
## Usage

### Installation

Make sure you have the Transformers library installed:

```bash
pip install transformers
```

### Loading the Model
```python
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = "odysie/bert-finetuned-qa-datasets/te-cde_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Replace the final path component with squad-v2_best, te-cde_best, or
# mixed_best to load one of the other checkpoints.
```
### Example Usage

```python
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

model_name = "odysie/bert-finetuned-qa-datasets/te-cde_best"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Sample question and context
question = "What is the value of the Seebeck coefficient?"
context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."

# Tokenize input
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Get model output
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Find the tokens with the highest `start` and `end` scores
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1

# Convert tokens to answer
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

print(f"Question: {question}")
print(f"Answer: {answer}")
```
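Since TE-CDE targets five properties, the same extraction can be run once per property; a sketch (the question phrasing is our own illustration, not necessarily the template used to generate the dataset):

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = "odysie/bert-finetuned-qa-datasets/te-cde_best"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

context = "Cu2Sn0.93Ag0.07Se3 demonstrated a Seebeck coefficient of 1.2 VK-1 at 300 K."
properties = [
    "thermoelectric figure of merit",
    "thermal conductivity",
    "Seebeck coefficient",
    "electrical conductivity",
    "power factor",
]

for prop in properties:
    inputs = tokenizer(f"What is the value of the {prop}?", context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
    print(f"{prop}: {answer}")
```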
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 7.113287430580505e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 0
- distributed_type: multi-GPU
- num_devices: 16
- total_train_batch_size: 256
- total_eval_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1466796240672773
- num_epochs: 13.0
- mixed_precision_training: Native AMP
### Framework versions

- Transformers 4.41.0
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## Dataset

### Description

TE-CDE contains 99,757 questions automatically generated from a thermoelectric materials database, across five different properties: the thermoelectric figure of merit, thermal conductivity, Seebeck coefficient, electrical conductivity, and power factor. 66,508 questions (roughly two thirds) are answerable from the context, and 33,249 are not.
## License

This project is licensed under Apache 2.0.
## Citation

If you use these models in your research or application, please cite our work:

```bibtex
(PENDING)

@article{
...
}
```
## Acknowledgments

We thank the developers of the Hugging Face Transformers library for providing valuable resources that made this work possible.
te-cde_best/config.json
ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.41.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
te-cde_best/pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:725cd8053135df5c3fc66e21b1387dd452862a4e208d062fa40a3621ed6ee48a
size 435638182
te-cde_best/special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
te-cde_best/tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
te-cde_best/tokenizer_config.json
ADDED
@@ -0,0 +1,55 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
te-cde_best/vocab.txt
ADDED
The diff for this file is too large to render. See raw diff.