|
|
--- |
|
|
library_name: transformers |
|
|
base_model: allenai/scibert_scivocab_cased |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
- classification |
|
|
metrics: |
|
|
- precision |
|
|
- recall |
|
|
- f1 |
|
|
- accuracy |
|
|
model-index: |
|
|
- name: results_bert-finetuned-ner |
|
|
results: [] |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- JonyC/ScienceGlossary |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
<b><span style="color:red;">IMPORTANT! Please read the usage instructions below.</span></b>
|
|
|
|
|
## Model description |
|
|
|
|
|
This model recognizes scientific terms in a given text. The recommended way to use it is as follows:
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForTokenClassification
from nltk.tokenize import word_tokenize  # requires the NLTK tokenizer data, e.g. nltk.download("punkt")
import torch

tokenizer = AutoTokenizer.from_pretrained("JonyC/results_bert-finetuned-ner")
model = AutoModelForTokenClassification.from_pretrained("JonyC/results_bert-finetuned-ner")
model.eval()

# Pre-split the text into words so predictions can be aligned back to whole words
words = word_tokenize("scientific_text")  # replace "scientific_text" with your own text
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)

# Get model predictions for the tokenized input
with torch.no_grad():
    logits = model(**{key: value.to(model.device) for key, value in inputs.items()}).logits

# One predicted label ID per subword token
predictions = torch.argmax(logits, dim=2)[0]

# Align subword predictions back to the original words:
# special tokens ([CLS], [SEP], [PAD]) have a word id of None, and each word
# takes the label predicted for its first subword.
words_output, pred_labels = [], []
word_ids = inputs.word_ids()
previous_word_id = None
for token_index, word_id in enumerate(word_ids):
    if word_id is None:  # skip special tokens
        continue
    if word_id != previous_word_id:  # first subword of a new word
        words_output.append(words[word_id])
        pred_labels.append(model.config.id2label[predictions[token_index].item()])
    previous_word_id = word_id

for w, p in zip(words_output, pred_labels):
    print(f"Word: {w}, Predicted Label: {p}")
|
|
``` |
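As an alternative to the loop above, the generic `token-classification` pipeline can be used for quick experiments. This is a minimal sketch, not the recommended path from this card: it relies on the model's own subword tokenizer instead of NLTK, and the example sentence is purely illustrative.

```python
from transformers import pipeline

# Minimal sketch: the pipeline groups subword tokens into entity spans itself,
# so no NLTK pre-tokenization is needed here.
ner = pipeline(
    "token-classification",
    model="JonyC/results_bert-finetuned-ner",
    aggregation_strategy="simple",
)

for entity in ner("Quantum computers use qubits, which rely on superposition and entanglement."):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```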
|
|
|
|
|
|
|
|
## Example usage |
|
|
Given the following text: |
|
|
"Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition. |
|
|
One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are. |
|
|
This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors. |
|
|
Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers." |
|
|
|
|
|
The results are:
|
|
``` |
|
|
Word: qubits, Predicted Label: I-Scns. |
|
|
Word: superposition, Predicted Label: B-Scns. |
|
|
Word: entanglement, Predicted Label: B-Scns. |
|
|
Word: qubits, Predicted Label: I-Scns. |
|
|
Word: qubits, Predicted Label: I-Scns. |
|
|
``` |
|
|
|
|
|
(All other words are labeled 'O', meaning they are not science terms.)
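If only the recognized terms are needed, the predicted labels can simply be filtered. A minimal sketch that reuses the `words_output` and `pred_labels` lists from the code in the Model description:

```python
# Keep only the words whose predicted label is not 'O' (i.e., the science terms)
science_terms = [w for w, p in zip(words_output, pred_labels) if p != "O"]
print(science_terms)
```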
|
|
|
|
|
# results_bert-finetuned-ner |
|
|
|
|
|
This model is a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) on the [JonyC/ScienceGlossary](https://huggingface.co/datasets/JonyC/ScienceGlossary) dataset. |
|
|
It achieves the following results on the evaluation set: |
|
|
- Loss: 0.2219 |
|
|
- Precision: 0.7689 |
|
|
- Recall: 0.7441 |
|
|
- F1: 0.7563 |
|
|
- Accuracy: 0.9336 |
|
|
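Precision, recall, and F1 are entity-level scores. Assuming the standard `seqeval` metric used in Trainer-based token-classification examples (an assumption; the exact evaluation code is not included in this card), they can be computed from predicted and reference tag sequences as follows:

```python
import evaluate

# Sketch only: each element is the list of BIO tags for one sentence.
seqeval = evaluate.load("seqeval")
scores = seqeval.compute(
    predictions=[["O", "B-Scns", "I-Scns", "O"]],
    references=[["O", "B-Scns", "I-Scns", "O"]],
)
print(scores["overall_precision"], scores["overall_recall"],
      scores["overall_f1"], scores["overall_accuracy"])
```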
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 3e-05 |
|
|
- train_batch_size: 8 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 25 |
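For reference, these settings map roughly to the `TrainingArguments` below. This is a sketch only: the output directory and evaluation strategy are assumptions, not details taken from the original training run.

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above; output_dir and eval_strategy are assumed.
training_args = TrainingArguments(
    output_dir="results_bert-finetuned-ner",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=25,
    lr_scheduler_type="linear",
    seed=42,
    optim="adamw_torch",
    eval_strategy="epoch",
)
```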
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy | |
|
|
|:-------------:|:-----:|:------:|:---------------:|:---------:|:------:|:------:|:--------:| |
|
|
| 0.139 | 1.0 | 9399 | 0.1158 | 0.9515 | 0.9230 | 0.9370 | 0.9755 | |
|
|
| 0.1003 | 2.0 | 18798 | 0.1766 | 0.9570 | 0.8907 | 0.9226 | 0.9716 | |
|
|
| 0.1119 | 3.0 | 28197 | 0.2278 | 0.9844 | 0.8075 | 0.8872 | 0.9608 | |
|
|
| 0.1204 | 4.0 | 37596 | 0.2130 | 0.9796 | 0.8226 | 0.8943 | 0.9623 | |
|
|
| 0.0983 | 5.0 | 46995 | 0.1947 | 0.9707 | 0.8390 | 0.9001 | 0.9669 | |
|
|
| 0.1313 | 6.0 | 56394 | 0.1767 | 0.8988 | 0.9261 | 0.9123 | 0.9669 | |
|
|
| 0.1012 | 7.0 | 65793 | 0.1513 | 0.9528 | 0.8946 | 0.9228 | 0.9744 | |
|
|
| 0.1264 | 8.0 | 75192 | 0.1829 | 0.8573 | 0.7993 | 0.8273 | 0.9611 | |
|
|
| 0.1521 | 9.0 | 84591 | 0.1943 | 0.9182 | 0.8471 | 0.8812 | 0.9650 | |
|
|
| 0.6277 | 10.0 | 93990 | 0.6086 | 0.0 | 0.0 | 0.0 | 0.8039 | |
|
|
| 0.4465 | 11.0 | 103389 | 0.2022 | 0.8728 | 0.8514 | 0.8620 | 0.9639 | |
|
|
| 0.1114 | 12.0 | 112788 | 0.1885 | 0.7967 | 0.8172 | 0.8068 | 0.9595 | |
|
|
| 0.1492 | 13.0 | 122187 | 0.2386 | 0.7724 | 0.6562 | 0.7096 | 0.9226 | |
|
|
| 0.1785 | 14.0 | 131586 | 0.2137 | 0.5960 | 0.7145 | 0.6499 | 0.9296 | |
|
|
| 0.1496 | 15.0 | 140985 | 0.2184 | 0.7454 | 0.7620 | 0.7536 | 0.9325 | |
|
|
| 0.1458 | 16.0 | 150384 | 0.2195 | 0.7639 | 0.7437 | 0.7536 | 0.9304 | |
|
|
| 0.1241 | 17.0 | 159783 | 0.2271 | 0.7737 | 0.7406 | 0.7568 | 0.9341 | |
|
|
| 0.1266 | 18.0 | 169182 | 0.2281 | 0.6259 | 0.6962 | 0.6592 | 0.9334 | |
|
|
| 0.1313 | 19.0 | 178581 | 0.2125 | 0.7702 | 0.7534 | 0.7617 | 0.9349 | |
|
|
| 0.1416 | 20.0 | 187980 | 0.2258 | 0.7707 | 0.7464 | 0.7583 | 0.9332 | |
|
|
| 0.1237 | 21.0 | 197379 | 0.2374 | 0.7691 | 0.7410 | 0.7548 | 0.9331 | |
|
|
| 0.1184 | 22.0 | 206778 | 0.2297 | 0.7598 | 0.7371 | 0.7483 | 0.9327 | |
|
|
| 0.1278 | 23.0 | 216177 | 0.2134 | 0.7695 | 0.7402 | 0.7546 | 0.9335 | |
|
|
| 0.1195 | 24.0 | 225576 | 0.2171 | 0.7701 | 0.7441 | 0.7569 | 0.9332 | |
|
|
| 0.1249 | 25.0 | 234975 | 0.2219 | 0.7689 | 0.7441 | 0.7563 | 0.9336 | |
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.47.0 |
|
|
- Pytorch 2.5.1+cu124 |
|
|
- Datasets 3.2.0 |
|
|
- Tokenizers 0.21.0 |