---
library_name: transformers
tags:
- Science
- NER
- token-classification
- scientific-term-detection
license: apache-2.0
datasets:
- JonyC/ScienceGlossary-NER_fit
language:
- en
base_model:
- allenai/scibert_scivocab_uncased
pipeline_tag: token-classification
---
<b><span style="color:red;">IMPORTANT! READ THIS!</span></b> 
# BEST USE:
The dataset was first tokenized with spaCy for better model results, so even though you can use the model with the pipeline API (described in [**Direct Use**](https://huggingface.co/JonyC/scibert-NER-finetuned-improved/blob/main/README.md#direct-use)), it is highly recommended to use it as follows:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import spacy

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("JonyC/scibert-NER-finetuned-improved")
model = AutoModelForTokenClassification.from_pretrained("JonyC/scibert-NER-finetuned-improved")
id2label = model.config.id2label

def predict_scibert_labels(sentence):
    # Step 1: SpaCy tokenization
    words = [token.text for token in nlp(sentence) if not token.is_space]
    # Alternative: also drop tokens spaCy tags as named entities:
    # words = [token.text for token in nlp(sentence) if not token.is_space and not token.ent_type_]
    # Step 2: Tokenize with SciBERT using the words
    inputs = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt",
        truncation=True,
        padding=True
    )

    with torch.no_grad():
        outputs = model(**inputs).logits  # (1, seq_len, num_labels)
    
    predictions = torch.argmax(outputs, dim=-1).squeeze().tolist()
    word_ids = inputs.word_ids()

    # Step 3: Align predictions to original words (skip subwords)
    final_tokens = []
    final_labels = []

    previous_word_idx = None
    for i, word_idx in enumerate(word_ids):
        if word_idx is None or word_idx == previous_word_idx:
            continue
        label = id2label[predictions[i]]
        final_tokens.append(words[word_idx])
        final_labels.append(label)
        previous_word_idx = word_idx

    return list(zip(final_tokens, final_labels))
```
Calling `predict_scibert_labels("CRISPR-Cas9 is a powerful tool for genome editing.")` produces:
```
CRISPR          -> B-Scns
-               -> I-Scns
Cas9            -> I-Scns
is              -> O
a               -> O
powerful        -> O
tool            -> O
for             -> O
genome          -> B-Scns
editing         -> O
.               -> O
```
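The subword-alignment step (Step 3 in the function above) can be illustrated in isolation with hand-made values; the `word_ids` and `predictions` below are hypothetical stand-ins for what the tokenizer and model would return:

```python
# word_ids as a fast tokenizer reports them for ["CRISPR", "-", "Cas9"]:
# None marks special tokens ([CLS]/[SEP]); a repeated index marks a
# subword continuation (here "Cas9" -> "cas" + "##9").
word_ids = [None, 0, 1, 2, 2, None]
predictions = ["O", "B-Scns", "I-Scns", "I-Scns", "I-Scns", "O"]
words = ["CRISPR", "-", "Cas9"]

aligned, previous = [], None
for i, widx in enumerate(word_ids):
    if widx is None or widx == previous:
        continue  # skip special tokens and subword pieces
    aligned.append((words[widx], predictions[i]))
    previous = widx

print(aligned)  # [('CRISPR', 'B-Scns'), ('-', 'I-Scns'), ('Cas9', 'I-Scns')]
```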
# Model Card for scibert-NER-finetuned-improved

This model is a fine-tuned version of `allenai/scibert_scivocab_uncased` for scientific terms/phrases detection in text. It is trained on a custom dataset [JonyC/ScienceGlossary-NER_fit](https://huggingface.co/JonyC/ScienceGlossary-NER_fit) for Named Entity Recognition (NER), aiming to identify scientific terms in a variety of academic and technical texts.

## Model Details

### Model Description

This model has been fine-tuned for the task of scientific term detection. It classifies tokens using BIO tags, with the `Scns` label (as `B-Scns`/`I-Scns`) denoting scientific terms and the `O` label marking non-scientific tokens. The model has been trained on a custom dataset, which makes it effective for extracting scientific terms from academic and technical texts.

- **Developed by:** JonyC
- **Model type:** BERT-based token classifier
- **Language(s) (NLP):**  English
- **License:** Apache 2.0
- **Finetuned from model:** allenai/scibert_scivocab_uncased

## Uses


### Direct Use
This model can be used directly to detect scientific terms in text. You can apply it to any text from which you want to extract scientific terminology, using the Hugging Face `pipeline` API like this:
```python
from transformers import pipeline

# Use a model from Hugging Face Hub directly in the notebook
pipe = pipeline("token-classification", model="JonyC/scibert-NER-finetuned-improved")
sentence = "CRISPR-Cas9 is a powerful tool for genome editing."

result = pipe(sentence)
result
```
results:
```
[{'entity': 'B-Scns', 'score': 0.9897461, 'index': 1, 'word': 'crispr', 'start': 0, 'end': 6},
 {'entity': 'I-Scns', 'score': 0.9474513, 'index': 2, 'word': '-', 'start': 6, 'end': 7},
 {'entity': 'I-Scns', 'score': 0.97595257, 'index': 3, 'word': 'cas', 'start': 7, 'end': 10},
 {'entity': 'I-Scns', 'score': 0.9894609, 'index': 4, 'word': '##9', 'start': 10, 'end': 11},
 {'entity': 'B-Scns', 'score': 0.999246, 'index': 10, 'word': 'genome', 'start': 35, 'end': 41}]
```

### Out-of-Scope Use

This model is not intended for general-purpose NER tasks outside the scope of scientific term detection. It is likely to perform poorly on unrelated tasks, such as recognizing people, organizations, or locations.

## Bias, Risks, and Limitations
While this model is designed to identify scientific terms, it may have trouble recognizing terms from specialized subfields of science that were not represented in the training data. Additionally, the model might struggle with terms that are ambiguous or have multiple meanings depending on the context.
The model was trained mainly on less familiar terms, so it tends to recognize highly specialized terms better than common ones (e.g., it does not label "computer" as scientific). It is also biased towards labeling people's names and place names as scientific terms, so you may want to combine it with spaCy's NER.

### Recommendations
Users should be cautious when applying the model to texts that differ significantly from the training data. The model is optimized for extracting scientific terms but may not perform well for general NER tasks. Further fine-tuning might be necessary for specific domains or types of terminology.
For best results, use spaCy NER to remove named entities before or after applying the model.
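One way to apply that recommendation is to re-label tokens that spaCy tags as named entities. A minimal post-filtering sketch, with the model output and the spaCy entity tokens mocked as plain Python values for illustration:

```python
# Hypothetical model output; in practice, build spacy_entity_tokens from
# token.text for tokens whose token.ent_type_ is non-empty.
predictions = [("Einstein", "B-Scns"), ("studied", "O"), ("photons", "B-Scns")]
spacy_entity_tokens = {"Einstein"}  # spaCy would tag this as PERSON

filtered = [(tok, "O" if tok in spacy_entity_tokens else label)
            for tok, label in predictions]
print(filtered)  # [('Einstein', 'O'), ('studied', 'O'), ('photons', 'B-Scns')]
```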

## Training Details

### Training Data
The model was fine-tuned on the JonyC/ScienceGlossary-NER_fit dataset, which consists of scientific texts annotated with scientific terms.

### Training Procedure
The model was trained with the following parameters:

- Epochs: 10
- Learning rate: 3e-6
- Batch size: 16
- Weight decay: 0.05
- Scheduler: cosine learning-rate decay
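Under these settings, a `transformers.TrainingArguments` configuration would look roughly like the following sketch; `output_dir` is illustrative, and any setting not listed in this card is an assumption:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="scibert-ner",           # hypothetical path
    num_train_epochs=10,
    learning_rate=3e-6,
    per_device_train_batch_size=16,
    weight_decay=0.05,
    lr_scheduler_type="cosine",
    fp16=True,                          # fp16 mixed precision (see below)
)
```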

#### Preprocessing 
Preprocessing is documented in the dataset card: [JonyC/ScienceGlossary-NER_fit](https://huggingface.co/JonyC/ScienceGlossary-NER_fit)

#### Training Hyperparameters
- Training regime: fp16 mixed precision
- Optimizer: AdamW
- Loss function: cross-entropy loss

## Evaluation
The evaluation results on the validation set are:
**Validation set:**
- Precision: 0.9283
- Recall: 0.941
- F1: 0.9346
- Accuracy: 0.9833
- Loss: 0.05465

The model was also evaluated on a held-out test set.
**Test set:**
- Precision: 0.9278
- Recall: 0.9403
- F1: 0.9341
- Accuracy: 0.9834
- Loss: 0.05307
### Testing Data, Factors & Metrics

#### Testing Data
[JonyC/ScienceGlossary-NER_fit](https://huggingface.co/JonyC/ScienceGlossary-NER_fit)

#### Metrics
The evaluation metrics include precision, recall, F1 score, and accuracy, which are standard for token classification tasks. These metrics provide a comprehensive understanding of the model's ability to identify scientific terms.
```python
import evaluate

metric = evaluate.load("seqeval")
```
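seqeval scores predictions at the entity-span level rather than per token. The span extraction it performs on BIO tags can be sketched roughly like this (an illustration, not seqeval's actual implementation):

```python
def bio_spans(labels):
    """Collect (start, end) entity spans from a BIO label sequence."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif lab == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        # an "I-" tag simply extends the current span
    if start is not None:
        spans.append((start, len(labels)))
    return spans

print(bio_spans(["B-Scns", "I-Scns", "O", "B-Scns"]))  # [(0, 2), (3, 4)]
```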

### Results
On the test set, the model achieved an accuracy of 98.34% with an F1 score of 0.934, demonstrating its effectiveness at detecting scientific terms.
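As a quick sanity check, the reported F1 follows from precision and recall via F1 = 2PR / (P + R):

```python
# Test-set precision and recall reported above
p, r = 0.9278, 0.9403
f1 = 2 * p * r / (p + r)
print(f"{f1:.4f}")  # ~0.9340, consistent with the reported 0.9341 up to input rounding
```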

## Citation
**BibTeX:**
```bibtex
@misc{scibert-NER-finetuned-improved,
  author = {JonyC},
  title = {SciBERT for Scientific Term Detection},
  year = {2025},
  url = {https://huggingface.co/JonyC/scibert-NER-finetuned-improved}
}
```
**APA:**
JonyC. (2025). SciBERT for Scientific Term Detection. Hugging Face. https://huggingface.co/JonyC/scibert-NER-finetuned-improved

Author: JonyC

## Model Card Contact
For questions, contributions, or collaborations, feel free to contact me: 📧 jonicohen97@gmail.com