|
|
--- |
|
|
library_name: transformers |
|
|
base_model: allenai/scibert_scivocab_cased |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
- classification |
|
|
metrics: |
|
|
- precision |
|
|
- recall |
|
|
- f1 |
|
|
- accuracy |
|
|
model-index: |
|
|
- name: results_bert-finetuned-ner |
|
|
results: [] |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- JonyC/ScienceGlossary |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
<b><span style="color:red;">IMPORTANT! Please read the usage instructions below.</span></b>
|
|
|
|
|
## Model description |
|
|
|
|
|
This model recognizes scientific terms in a given text. The recommended way to use it is as follows:
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForTokenClassification
from nltk.tokenize import word_tokenize  # requires the NLTK tokenizer data, e.g. nltk.download("punkt")
import torch

tokenizer = AutoTokenizer.from_pretrained("JonyC/results_bert-finetuned-ner")
model = AutoModelForTokenClassification.from_pretrained("JonyC/results_bert-finetuned-ner")
model.eval()

# Pre-split the text into words so predictions can be aligned back to whole words
words = word_tokenize("scientific_text")  # replace "scientific_text" with your own text
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)

# Get model predictions for the tokenized input
with torch.no_grad():
    logits = model(**{key: value.to(model.device) for key, value in inputs.items()}).logits

# One predicted label ID per subword token
predictions = torch.argmax(logits, dim=2)[0]

# Align subword predictions back to the original words:
# special tokens ([CLS], [SEP], [PAD]) have a word id of None, and each word
# takes the label predicted for its first subword.
words_output, pred_labels = [], []
word_ids = inputs.word_ids()
previous_word_id = None
for token_index, word_id in enumerate(word_ids):
    if word_id is None:  # skip special tokens
        continue
    if word_id != previous_word_id:  # first subword of a new word
        words_output.append(words[word_id])
        pred_labels.append(model.config.id2label[predictions[token_index].item()])
    previous_word_id = word_id

for w, p in zip(words_output, pred_labels):
    print(f"Word: {w}, Predicted Label: {p}")
|
|
``` |
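As an alternative to the loop above, the generic `token-classification` pipeline can be used for quick experiments. This is a minimal sketch, not the recommended path from this card: it relies on the model's own subword tokenizer instead of NLTK, and the example sentence is purely illustrative.

```python
from transformers import pipeline

# Minimal sketch: the pipeline groups subword tokens into entity spans itself,
# so no NLTK pre-tokenization is needed here.
ner = pipeline(
    "token-classification",
    model="JonyC/results_bert-finetuned-ner",
    aggregation_strategy="simple",
)

for entity in ner("Quantum computers use qubits, which rely on superposition and entanglement."):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```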
|
|
|
|
|
|
|
|
## Example usage |
|
|
Given the following text: |
|
|
"Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition. |
|
|
One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are. |
|
|
This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors. |
|
|
Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers." |
|
|
|
|
|
The results are:
|
|
``` |
|
|
Word: qubits, Predicted Label: I-Scns. |
|
|
Word: superposition, Predicted Label: B-Scns. |
|
|
Word: entanglement, Predicted Label: B-Scns. |
|
|
Word: qubits, Predicted Label: I-Scns. |
|
|
Word: qubits, Predicted Label: I-Scns. |
|
|
``` |
|
|
|
|
|
(All other words are labeled 'O', meaning they are not science terms.)
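If only the recognized terms are needed, the predicted labels can simply be filtered. A minimal sketch that reuses the `words_output` and `pred_labels` lists from the code in the Model description:

```python
# Keep only the words whose predicted label is not 'O' (i.e., the science terms)
science_terms = [w for w, p in zip(words_output, pred_labels) if p != "O"]
print(science_terms)
```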
|
|
|
|
|
# results_bert-finetuned-ner |
|
|
|
|
|
This model is a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) on the [JonyC/ScienceGlossary](https://huggingface.co/datasets/JonyC/ScienceGlossary) dataset. |
|
|
It achieves the following results on the evaluation set: |
|
|
- Loss: 0.2219 |
|
|
- Precision: 0.7689 |
|
|
- Recall: 0.7441 |
|
|
- F1: 0.7563 |
|
|
- Accuracy: 0.9336 |
|
|
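Precision, recall, and F1 are entity-level scores. Assuming the standard `seqeval` metric used in Trainer-based token-classification examples (an assumption; the exact evaluation code is not included in this card), they can be computed from predicted and reference tag sequences as follows:

```python
import evaluate

# Sketch only: each element is the list of BIO tags for one sentence.
seqeval = evaluate.load("seqeval")
scores = seqeval.compute(
    predictions=[["O", "B-Scns", "I-Scns", "O"]],
    references=[["O", "B-Scns", "I-Scns", "O"]],
)
print(scores["overall_precision"], scores["overall_recall"],
      scores["overall_f1"], scores["overall_accuracy"])
```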
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 3e-05 |
|
|
- train_batch_size: 8 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 25 |
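For reference, these settings map roughly to the `TrainingArguments` below. This is a sketch only: the output directory and evaluation strategy are assumptions, not details taken from the original training run.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; output_dir and eval_strategy are assumed.
training_args = TrainingArguments(
    output_dir="results_bert-finetuned-ner",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=25,
    lr_scheduler_type="linear",
    seed=42,
    optim="adamw_torch",
    eval_strategy="epoch",
)
```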
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy | |
|
|
|:-------------:|:-----:|:------:|:---------------:|:---------:|:------:|:------:|:--------:| |
|
|
| 0.139 | 1.0 | 9399 | 0.1158 | 0.9515 | 0.9230 | 0.9370 | 0.9755 | |
|
|
| 0.1003 | 2.0 | 18798 | 0.1766 | 0.9570 | 0.8907 | 0.9226 | 0.9716 | |
|
|
| 0.1119 | 3.0 | 28197 | 0.2278 | 0.9844 | 0.8075 | 0.8872 | 0.9608 | |
|
|
| 0.1204 | 4.0 | 37596 | 0.2130 | 0.9796 | 0.8226 | 0.8943 | 0.9623 | |
|
|
| 0.0983 | 5.0 | 46995 | 0.1947 | 0.9707 | 0.8390 | 0.9001 | 0.9669 | |
|
|
| 0.1313 | 6.0 | 56394 | 0.1767 | 0.8988 | 0.9261 | 0.9123 | 0.9669 | |
|
|
| 0.1012 | 7.0 | 65793 | 0.1513 | 0.9528 | 0.8946 | 0.9228 | 0.9744 | |
|
|
| 0.1264 | 8.0 | 75192 | 0.1829 | 0.8573 | 0.7993 | 0.8273 | 0.9611 | |
|
|
| 0.1521 | 9.0 | 84591 | 0.1943 | 0.9182 | 0.8471 | 0.8812 | 0.9650 | |
|
|
| 0.6277 | 10.0 | 93990 | 0.6086 | 0.0 | 0.0 | 0.0 | 0.8039 | |
|
|
| 0.4465 | 11.0 | 103389 | 0.2022 | 0.8728 | 0.8514 | 0.8620 | 0.9639 | |
|
|
| 0.1114 | 12.0 | 112788 | 0.1885 | 0.7967 | 0.8172 | 0.8068 | 0.9595 | |
|
|
| 0.1492 | 13.0 | 122187 | 0.2386 | 0.7724 | 0.6562 | 0.7096 | 0.9226 | |
|
|
| 0.1785 | 14.0 | 131586 | 0.2137 | 0.5960 | 0.7145 | 0.6499 | 0.9296 | |
|
|
| 0.1496 | 15.0 | 140985 | 0.2184 | 0.7454 | 0.7620 | 0.7536 | 0.9325 | |
|
|
| 0.1458 | 16.0 | 150384 | 0.2195 | 0.7639 | 0.7437 | 0.7536 | 0.9304 | |
|
|
| 0.1241 | 17.0 | 159783 | 0.2271 | 0.7737 | 0.7406 | 0.7568 | 0.9341 | |
|
|
| 0.1266 | 18.0 | 169182 | 0.2281 | 0.6259 | 0.6962 | 0.6592 | 0.9334 | |
|
|
| 0.1313 | 19.0 | 178581 | 0.2125 | 0.7702 | 0.7534 | 0.7617 | 0.9349 | |
|
|
| 0.1416 | 20.0 | 187980 | 0.2258 | 0.7707 | 0.7464 | 0.7583 | 0.9332 | |
|
|
| 0.1237 | 21.0 | 197379 | 0.2374 | 0.7691 | 0.7410 | 0.7548 | 0.9331 | |
|
|
| 0.1184 | 22.0 | 206778 | 0.2297 | 0.7598 | 0.7371 | 0.7483 | 0.9327 | |
|
|
| 0.1278 | 23.0 | 216177 | 0.2134 | 0.7695 | 0.7402 | 0.7546 | 0.9335 | |
|
|
| 0.1195 | 24.0 | 225576 | 0.2171 | 0.7701 | 0.7441 | 0.7569 | 0.9332 | |
|
|
| 0.1249 | 25.0 | 234975 | 0.2219 | 0.7689 | 0.7441 | 0.7563 | 0.9336 | |
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.47.0 |
|
|
- Pytorch 2.5.1+cu124 |
|
|
- Datasets 3.2.0 |
|
|
- Tokenizers 0.21.0 |