MatSciBERT — Chemical Named Entity Recognition

A fine-tuned version of `m3rg-iitd/matscibert`, trained on the CHEMDNER corpus for chemical named entity recognition (NER) in biomedical and scientific text.

The model identifies chemical compound names, drug names, and chemical formulas in free text.

Model Description

Property	Value
Base model	`m3rg-iitd/matscibert` (BERT-base architecture, initialised from SciBERT and further pre-trained on ~150K materials science papers)
Task	Token classification — NER
Labels	`O`, `B-CHEM`, `I-CHEM`
Training data	CHEMDNER corpus via `kjappelbaum/chemnlp-chemdner`
Framework	Hugging Face Transformers (Trainer API)
Hardware	NVIDIA Quadro P2000 (4 GB VRAM)

Labels

Label	Description	Example
`O`	Outside — not a chemical	reacts, with, is
`B-CHEM`	Beginning of a chemical entity	nitric (start of "nitric oxide")
`I-CHEM`	Inside a chemical entity	oxide (continuation of "nitric oxide")

When an aggregation strategy is set (e.g. `aggregation_strategy="simple"`), the pipeline merges adjacent B-CHEM/I-CHEM tokens into single CHEM spans.
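The merging of B-CHEM/I-CHEM tokens into spans can be illustrated with a small sketch. This is not the Transformers implementation; `bio_to_spans` is a hypothetical helper that operates on word-level labels and only shows the general idea behind BIO decoding.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, int, int]]:
    """Merge word-level BIO labels into (text, start_index, end_index) spans."""
    spans = []
    start = None  # index of the first token of the span being built
    for i, label in enumerate(labels):
        if label == "B-CHEM":
            if start is not None:  # close the previous span first
                spans.append((" ".join(tokens[start:i]), start, i))
            start = i
        elif label == "I-CHEM":
            if start is None:  # stray I- tag: treat it as the start of a span
                start = i
        else:  # "O"
            if start is not None:
                spans.append((" ".join(tokens[start:i]), start, i))
                start = None
    if start is not None:  # span running to the end of the sequence
        spans.append((" ".join(tokens[start:]), start, len(tokens)))
    return spans

tokens = ["nitric", "oxide", "reacts", "with", "oxygen"]
labels = ["B-CHEM", "I-CHEM", "O", "O", "B-CHEM"]
print(bio_to_spans(tokens, labels))
# [('nitric oxide', 0, 2), ('oxygen', 4, 5)]
```

The end index is exclusive, so `tokens[start:end]` recovers the words of each span.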

Evaluation Results

Evaluated on the CHEMDNER validation set (6 808 examples) using seqeval. Precision, recall, and F1 are computed at the entity-span level:

Metric	Score
F1	0.9146
Precision	0.9075
Recall	0.9219
Accuracy (token)	0.9927
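As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two rows. The helper name `f1_score` below is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values taken from the table above
print(round(f1_score(0.9075, 0.9219), 4))
# 0.9146
```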

Usage

With `pipeline`

from transformers import pipeline

ner = pipeline(
    "ner",
    model="teman67/matscibert-chem-ner",
    aggregation_strategy="simple",
)

text = "Nitric oxide reacts with oxygen to form nitrogen dioxide."
results = ner(text)

for entity in results:
    print(f"{entity['word']:<25} {entity['entity_group']}  ({entity['score']:.1%})")

Output:

nitric oxide              CHEM  (100.0%)
oxygen                    CHEM  (100.0%)
nitrogen dioxide          CHEM  (99.9%)
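With an aggregation strategy set, each result dict also carries `start` and `end` character offsets, which can be used to mark entities in the original string. The sketch below is one possible post-processing step, not part of the model. `highlight` and the `min_score` threshold are hypothetical, and the `entities` list is a hand-written stand-in for the pipeline output so the example runs without downloading the model.

```python
from typing import Dict, List

def highlight(text: str, entities: List[Dict], min_score: float = 0.5) -> str:
    """Wrap each predicted span in [[...]] using its character offsets."""
    out = []
    cursor = 0  # position in `text` up to which output has been written
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["score"] < min_score or ent["start"] < cursor:
            continue  # skip low-confidence or overlapping spans
        out.append(text[cursor:ent["start"]])
        out.append("[[" + text[ent["start"]:ent["end"]] + "]]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)

text = "Nitric oxide reacts with oxygen to form nitrogen dioxide."
# Hand-written stand-in for `ner(text)`; offsets are character positions in `text`
entities = [
    {"entity_group": "CHEM", "score": 1.0, "word": "nitric oxide", "start": 0, "end": 12},
    {"entity_group": "CHEM", "score": 1.0, "word": "oxygen", "start": 25, "end": 31},
    {"entity_group": "CHEM", "score": 0.999, "word": "nitrogen dioxide", "start": 40, "end": 56},
]
print(highlight(text, entities))
# [[Nitric oxide]] reacts with [[oxygen]] to form [[nitrogen dioxide]].
```

Slicing the original text by offset preserves the original casing, whereas the `word` field is lowercased by the tokenizer.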

With `AutoModel`

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the fine-tuned tokenizer and token-classification model
tokenizer = AutoTokenizer.from_pretrained("teman67/matscibert-chem-ner")
model = AutoModelForTokenClassification.from_pretrained("teman67/matscibert-chem-ner")

inputs = tokenizer("Aspirin inhibits COX-1 and COX-2 enzymes.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Pick the highest-scoring label for every WordPiece token
predictions = outputs.logits.argmax(-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Skip special tokens and "##" continuation pieces, so each word is
# represented by its first piece and that piece's label
for token, label in zip(tokens, labels):
    if not token.startswith("##") and token not in ("[CLS]", "[SEP]"):
        print(f"{token:<20} {label}")
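Because the loop above skips `##` continuation pieces, a word that the tokenizer splits is printed as its first piece only. A minimal sketch of rejoining the pieces into whole words follows. `merge_wordpieces` is a hypothetical helper, and the tokens and labels in the example are illustrative rather than real model output.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Rejoin '##' continuation pieces and keep the first piece's label."""
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            # Append the piece (without '##') to the previous word
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words

# Illustrative input: the actual tokenization may split words differently
tokens = ["[CLS]", "asp", "##irin", "inhibits", "[SEP]"]
labels = ["O", "B-CHEM", "I-CHEM", "O", "O"]
print(merge_wordpieces(tokens, labels))
# [('aspirin', 'B-CHEM'), ('inhibits', 'O')]
```

Taking the first piece's label is one common convention; a majority vote over the pieces would be another reasonable choice.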

Training Details

Hyperparameter	Value
Epochs	5
Batch size	8
Learning rate	2e-5
Weight decay	0.01
Warmup ratio	0.1
Max sequence length	128
Optimiser	AdamW
LR schedule	Linear decay with warmup
Best model selection	Highest validation F1

Data splits drawn from the CHEMDNER corpus:

Split	Examples
Train	6 796
Validation	6 808
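To make the schedule concrete, the sketch below derives the step counts implied by the tables above and evaluates a linear warmup followed by linear decay. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. None of these is stated in this card, and `linear_warmup_decay` is an illustrative function, not the training code itself.

```python
import math

def linear_warmup_decay(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at `step`: linear warmup to `peak_lr`, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)

# Values from the hyperparameter and split tables
train_examples, batch_size, epochs = 6796, 8, 5
warmup_ratio, peak_lr = 0.1, 2e-5

# Assumption: one optimiser step per batch, last partial batch included
steps_per_epoch = math.ceil(train_examples / batch_size)  # 850
total_steps = steps_per_epoch * epochs                    # 4250
warmup_steps = round(warmup_ratio * total_steps)          # 425

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_decay(step, total_steps, warmup_steps, peak_lr))
```

Under these assumptions the learning rate reaches its peak of 2e-5 at step 425 and returns to zero at step 4250.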

Training Code

Source code available at: github.com/teman67/Fine-tuning-Materials-Scientific-NER-

Intended Use & Limitations

Intended for:

Extracting chemical and drug names from biomedical literature
Pre-processing step for downstream chemistry/materials science NLP tasks
Scientific text mining pipelines

Limitations:

Trained on PubMed/patent text; performance may degrade on text from other domains or registers
Recognises a single entity type (CHEM); does not distinguish subtypes (drugs, elements, formulas, etc.)
Inputs longer than 128 WordPiece tokens are truncated, so chemicals at the end of long passages may be missed
No entity disambiguation is performed; ambiguous terms (e.g. Mercury the planet vs. mercury the element) may be tagged as CHEM regardless of the intended sense
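One way to work around the length limit is to tag long passages in overlapping windows and map the spans back to the original text. The sketch below is an assumption-laden illustration, not part of the released code. `tag_long_text` is hypothetical, `tag_fn` stands for any callable that returns dicts with `start`/`end` character offsets (such as the pipeline above), and the window size is counted in whitespace-separated words as a rough, conservative proxy for WordPiece tokens.

```python
import re
from typing import Callable, Dict, List

def tag_long_text(text: str, tag_fn: Callable[[str], List[Dict]],
                  max_words: int = 80, overlap: int = 20) -> List[Dict]:
    """Run `tag_fn` over overlapping word windows and map spans back to `text`."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    # Record the character offsets of every whitespace-separated word
    words, pos = [], 0
    for w in text.split():
        start = text.index(w, pos)
        words.append((start, start + len(w)))
        pos = start + len(w)

    found = {}
    for i in range(0, len(words), max_words - overlap):
        window = words[i:i + max_words]
        w_start, w_end = window[0][0], window[-1][1]
        for ent in tag_fn(text[w_start:w_end]):
            # Shift chunk-relative offsets to positions in the full text
            key = (ent["start"] + w_start, ent["end"] + w_start)
            found.setdefault(key, {**ent, "start": key[0], "end": key[1]})
        if i + max_words >= len(words):
            break  # the last window already reached the end of the text
    return [found[k] for k in sorted(found)]

def fake_tagger(chunk: str) -> List[Dict]:
    # Hypothetical stand-in for `ner(chunk)`: tags every occurrence of "oxygen"
    return [{"entity_group": "CHEM", "start": m.start(), "end": m.end()}
            for m in re.finditer("oxygen", chunk)]

text = " ".join(["filler"] * 70 + ["oxygen"] + ["filler"] * 130)
spans = tag_long_text(text, fake_tagger)
print([(s["start"], s["end"], text[s["start"]:s["end"]]) for s in spans])
# [(490, 496, 'oxygen')]
```

Here the entity falls inside the overlap of the first two windows and is reported once, because identical spans are de-duplicated. An entity cut by a window boundary can still surface as a partial span, so the overlap should be larger than the longest expected entity.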

References

MatSciBERT: Gupta et al., "MatSciBERT: A materials domain language model for text mining and information extraction", npj Computational Materials, 2022. doi:10.1038/s41524-022-00784-w
CHEMDNER: Krallinger et al., "The CHEMDNER corpus of chemicals and drugs and its annotation principles", Journal of Cheminformatics, 2015. doi:10.1186/1758-2946-7-S1-S2

License

MIT

Model size: 0.1B params (Safetensors, tensor type F32)
