Fine-tuned m3rg-iitd/matscibert on the CHEMDNER corpus for chemical named entity recognition (NER) in biomedical and scientific text.
The model identifies and highlights chemical compound names, drug names, and chemical formulas in free text.
| Property | Value |
|---|---|
| Base model | m3rg-iitd/matscibert (BERT-base, domain pre-trained on 2M+ materials science papers) |
| Task | Token classification — NER |
| Labels | O, B-CHEM, I-CHEM |
| Training data | CHEMDNER corpus via kjappelbaum/chemnlp-chemdner |
| Framework | HuggingFace Transformers + Trainer API |
| Hardware | NVIDIA Quadro P2000 (4 GB VRAM) |
| Label | Description | Example |
|---|---|---|
| O | Outside: not a chemical | reacts, with, is |
| B-CHEM | Beginning of a chemical entity | nitric (start of "nitric oxide") |
| I-CHEM | Inside a chemical entity | oxide (continuation of "nitric oxide") |
When an aggregation strategy is set, the pipeline merges each B-CHEM token and the I-CHEM tokens that follow it into a single CHEM span.
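The merging of B-CHEM/I-CHEM tokens into spans can be illustrated with a minimal pure-Python sketch. This is not the pipeline's own implementation: the helper name `merge_bio` is hypothetical, it works on whole words rather than subword tokens, and it simply drops an I- tag that has no preceding B- tag, which is an assumption made for this example.

```python
from typing import List, Tuple


def merge_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration only.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span; flush any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open span) closes the span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    # Flush a span that runs to the end of the sequence.
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["nitric", "oxide", "reacts", "with", "oxygen"]
tags = ["B-CHEM", "I-CHEM", "O", "O", "B-CHEM"]
print(merge_bio(tokens, tags))
# [('CHEM', 'nitric oxide'), ('CHEM', 'oxygen')]
```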
Evaluated on the CHEMDNER validation set (6 808 examples) using seqeval (entity-span level):
| Metric | Score |
|---|---|
| F1 | 0.9146 |
| Precision | 0.9075 |
| Recall | 0.9219 |
| Accuracy (token) | 0.9927 |
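To make "entity-span level" concrete, the sketch below counts a prediction as correct only when its start, end, and type all match a gold span exactly. It is a simplified stand-in written for this card, not seqeval's code: the names `extract_spans` and `span_prf` are hypothetical, and edge cases such as an I- tag without a preceding B- tag are ignored here, so results may differ from seqeval on malformed tag sequences. The last lines check that the reported F1 is the harmonic mean of the reported precision and recall.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect exact entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a final span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the open span
        if start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        gs, ps = extract_spans(g), extract_spans(p)
        tp += len(gs & ps)
        n_pred += len(ps)
        n_gold += len(gs)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the first predicted span is too short, so it does not count.
gold = [["B-CHEM", "I-CHEM", "O", "B-CHEM"]]
pred = [["B-CHEM", "O", "O", "B-CHEM"]]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)

# Consistency check on the reported scores.
p, r = 0.9075, 0.9219
print(round(2 * p * r / (p + r), 4))  # 0.9146
```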
pipeline
from transformers import pipeline

ner = pipeline(
    "ner",
    model="teman67/matscibert-chem-ner",
    aggregation_strategy="simple",  # merge B-CHEM/I-CHEM tokens into CHEM spans
)

text = "Nitric oxide reacts with oxygen to form nitrogen dioxide."
results = ner(text)

for entity in results:
    print(f"{entity['word']:<25} {entity['entity_group']} ({entity['score']:.1%})")
Output:
nitric oxide              CHEM (100.0%)
oxygen                    CHEM (100.0%)
nitrogen dioxide          CHEM (99.9%)
AutoModel
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("teman67/matscibert-chem-ner")
model = AutoModelForTokenClassification.from_pretrained("teman67/matscibert-chem-ner")

inputs = tokenizer("Aspirin inhibits COX-1 and COX-2 enzymes.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

predictions = outputs.logits.argmax(-1)  # most likely label id per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if not token.startswith("##") and token not in ("[CLS]", "[SEP]"):
        print(f"{token:<20} {label}")
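The loop above prints only the first piece of each split word and skips the `##` continuations. If whole words are wanted instead, one option is to glue the continuation pieces back onto the first piece and keep the first piece's label. The sketch below shows that idea in plain Python. The helper name `wordpieces_to_words` is hypothetical, and the tokens and labels in the example are made up for illustration rather than taken from the model.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def wordpieces_to_words(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Rejoin '##' continuation pieces and keep the first piece's label.

    Hypothetical helper for illustration only.
    """
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # special tokens carry no word content
        if token.startswith("##") and words:
            # Continuation piece: append its text to the previous word.
            text, first_label = words[-1]
            words[-1] = (text + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Made-up tokenization and labels, for illustration only.
tokens = ["[CLS]", "asp", "##irin", "inhibits", "[SEP]"]
labels = ["O", "B-CHEM", "I-CHEM", "O", "O"]
print(wordpieces_to_words(tokens, labels))
# [('aspirin', 'B-CHEM'), ('inhibits', 'O')]
```

Pieces that the tokenizer emits without a `##` prefix, which is typically the case around punctuation such as the hyphen in "COX-1", are not rejoined by this sketch.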
| Hyperparameter | Value |
|---|---|
| Epochs | 5 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimiser | AdamW |
| LR schedule | Linear decay with warmup |
| Best model selection | Highest validation F1 |
Training data split from the CHEMDNER corpus:
| Split | Examples |
|---|---|
| Train | 6 796 |
| Validation | 6 808 |
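As a rough illustration of what the hyperparameters and split sizes above imply, the sketch below derives the number of optimiser steps and a linear warmup-then-decay learning rate. It assumes a single device, no gradient accumulation, and a final partial batch that is kept. None of these are stated in the card, so the actual step counts in training may differ. The function `learning_rate` is a hypothetical helper, not the Trainer's scheduler.

```python
import math

# Values taken from the tables above.
TRAIN_EXAMPLES = 6796
BATCH_SIZE = 8
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_RATIO = 0.1

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)  # 850 4250 425
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the learning rate reaches its peak of 2e-5 after 425 steps, about half an epoch, and falls to zero at step 4250.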
Source code available at: github.com/teman67/Fine-tuning-Materials-Scientific-NER-
Intended for:
Limitations:
Predicts a single entity type (CHEM); does not distinguish subtypes (drugs, elements, formulas, etc.).
License: MIT