Shoriful025 / biomedical_ner_roberta_base


	---
	language:
	- en
	tags:
	- ner
	- biomedical
	- token-classification
	- roberta
	license: apache-2.0
	datasets:
	- bc5cdr
	- ncbi_disease
	---

	# biomedical_ner_roberta_base

	## Overview

	`biomedical_ner_roberta_base` is a token classification model fine-tuned for Named Entity Recognition (NER) in the biomedical domain. It extracts entities from scientific abstracts, clinical notes, and other medical literature.

	The model identifies three primary entity types using the BIO labeling scheme:
	* DISEASE: Pathological conditions, signs, and symptoms.
	* CHEMICAL: Drugs, medications, and chemical compounds.
	* GENE: Genes, proteins, and related molecular structures.
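To make the BIO scheme concrete, here is a minimal sketch of how per-token tags can be grouped back into entity spans. The helper name `bio_to_spans` and the whitespace-level tokens are illustrative assumptions; the model itself works on sub-word tokens, and the pipeline shown under "How to use" performs this grouping for you.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag, which is dropped) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical word-level example, hand-tagged for illustration.
tokens = ["metformin", "for", "type", "2", "diabetes"]
tags = ["B-CHEMICAL", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
print(bio_to_spans(tokens, tags))
```

The `B-` prefix marks the first token of an entity and `I-` marks its continuation, which is what lets two adjacent entities of the same type be told apart.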

	## Model Architecture

	This model fine-tunes `roberta-base` with a token classification head (`RobertaForTokenClassification`). It was trained on a composite dataset that includes BC5CDR (the BioCreative V CDR task corpus) and the NCBI Disease corpus.

	- Base Model: RoBERTa Base (12 layers, hidden size 768, 12 attention heads, about 125M parameters).
	- Task: Token Classification (7 labels: O, B-DISEASE, I-DISEASE, B-CHEMICAL, I-CHEMICAL, B-GENE, I-GENE).
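The seven labels follow directly from the three entity types: one `O` label plus a `B-`/`I-` pair per type. The sketch below builds the corresponding label maps. The index order is an assumption that mirrors the list above; the authoritative mapping is whatever `id2label` the model's `config.json` contains.

```python
ENTITY_TYPES = ["DISEASE", "CHEMICAL", "GENE"]

# "O" first, then a B-/I- pair per entity type (index order is assumed).
labels = ["O"] + [
    f"{prefix}-{entity_type}"
    for entity_type in ENTITY_TYPES
    for prefix in ("B", "I")
]

id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))
print(labels)
```

Maps like these are what one would typically pass as `id2label` and `label2id` (together with `num_labels=len(labels)`) when loading a base checkpoint for fine-tuning with `AutoModelForTokenClassification.from_pretrained`.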

	## Intended Use

	This model is intended for researchers and developers working with biomedical text data.

	- Information Extraction: Automated parsing of PubMed abstracts to identify key biomedical concepts.
	- Knowledge Graph Construction: Linking genes, drugs, and diseases discovered in text to structured knowledge bases.
	- Clinical Text Mining: Assisting in extracting relevant information from unstructured electronic health records (EHRs).
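As a rough illustration of the information extraction and knowledge graph use cases, the sketch below groups recognized entities by type so they can be linked to a knowledge base afterwards. The function `group_entities`, the `min_score` threshold, and the sample scores are hypothetical; the input only mimics the dictionaries that a `transformers` NER pipeline returns with `aggregation_strategy="simple"`.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect unique, normalized entity mentions per entity type."""
    grouped = defaultdict(set)
    for entity in results:
        # Skip low-confidence predictions (threshold is an arbitrary choice).
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].add(entity["word"].strip().lower())
    return {label: sorted(mentions) for label, mentions in grouped.items()}


# Hypothetical pipeline output; the scores are made up for illustration.
sample_results = [
    {"entity_group": "CHEMICAL", "word": " metformin", "score": 0.99},
    {"entity_group": "DISEASE", "word": " type 2 diabetes", "score": 0.98},
    {"entity_group": "CHEMICAL", "word": " Metformin", "score": 0.97},
    {"entity_group": "GENE", "word": " SLC22A1", "score": 0.30},
]

print(group_entities(sample_results))
```

Lowercasing is only a crude normalization step; mapping mentions to identifiers in a structured knowledge base requires a separate entity linking stage that this model does not provide.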

	### How to use

	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

	# Placeholder repository ID: replace it with the model's actual ID on the Hub.
	model_name = "your_username/biomedical_ner_roberta_base"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
	nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

	text = "The patient was treated with metformin for type 2 diabetes, but showed resistance related to the SLC22A1 gene variant."
	results = nlp(text)

	# Each result is a dict with the entity text, its label, and a confidence score.
	for entity in results:
	    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")

	# Expected Output structure:
	# Entity: metformin, Label: CHEMICAL, Score: 0.99...
	# Entity: type 2 diabetes, Label: DISEASE, Score: 0.98...
	# Entity: SLC22A1, Label: GENE, Score: 0.97...