---
language:
- en
tags:
- ner
- biomedical
- token-classification
- roberta
license: apache-2.0
datasets:
- bc5cdr
- ncbi_disease
---

# biomedical_ner_roberta_base

## Overview

`biomedical_ner_roberta_base` is a token classification model specifically fine-tuned for Named Entity Recognition (NER) in the biomedical domain. It is designed to extract entities from scientific abstracts, clinical notes, and medical literature.

The model identifies three primary entity types using the BIO labeling scheme:

* **DISEASE**: Pathological conditions, signs, and symptoms.
* **CHEMICAL**: Drugs, medications, and chemical compounds.
* **GENE**: Genes, proteins, and related molecular structures.

## Model Architecture

This model is based on the `roberta-base` architecture, fine-tuned using `RobertaForTokenClassification`. It was trained on a composite dataset including BC5CDR (BioCreative V CDR task corpus) and the NCBI Disease corpus.

- **Base Model:** RoBERTa Base (12 layers, 768 hidden dimension, 12 heads, 125M parameters).
- **Task:** Token Classification (7 labels: O, B-DISEASE, I-DISEASE, B-CHEMICAL, I-CHEMICAL, B-GENE, I-GENE).

## Intended Use

This model is intended for researchers and developers working with biomedical text data.

- **Information Extraction:** Automated parsing of PubMed abstracts to identify key biomedical concepts.
- **Knowledge Graph Construction:** Linking genes, drugs, and diseases discovered in text to structured knowledge bases.
- **Clinical Text Mining:** Assisting in extracting relevant information from unstructured electronic health records (EHRs).

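
To make the BIO scheme listed under Model Architecture concrete, here is a minimal sketch of how per-token tags can be merged into entity spans before they are linked or stored. The function name `bio_to_spans` and the example tokens are illustrative assumptions and are not part of this model's code. The `transformers` pipeline performs a comparable grouping when an aggregation strategy is set.

```python
from typing import List, Optional, Tuple

# The seven labels listed in this model card (BIO scheme over three entity types).
LABELS = ["O", "B-DISEASE", "I-DISEASE", "B-CHEMICAL", "I-CHEMICAL", "B-GENE", "I-GENE"]


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration only.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity (if any) and skip this token.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sequence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["treated", "with", "metformin", "for", "type", "2", "diabetes"]
tags = ["O", "O", "B-CHEMICAL", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
print(bio_to_spans(tokens, tags))
# [('metformin', 'CHEMICAL'), ('type 2 diabetes', 'DISEASE')]
```

In this sketch an `I-` tag with no matching open entity is dropped. Treating it as the start of a new entity is an equally reasonable choice, depending on how conservative the downstream linking should be.
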
### How to use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Placeholder repository ID; replace it with the model's actual Hub path.
model_name = "your_username/biomedical_ner_roberta_base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = (
    "The patient was treated with metformin for type 2 diabetes, "
    "but showed resistance related to the SLC22A1 gene variant."
)
results = nlp(text)

# Each result is a dict with the merged span text, its label, and a confidence score.
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")

# Expected Output structure:
# Entity: metformin, Label: CHEMICAL, Score: 0.99...
# Entity: type 2 diabetes, Label: DISEASE, Score: 0.98...
# Entity: SLC22A1, Label: GENE, Score: 0.97...
```