tsantos commited on
Commit
88915db
·
1 Parent(s): 6641c71

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -8,7 +8,7 @@ tags:
8
  # PathologyBERT - Masked Language Model with Breast Pathology Specimens.
9
 
10
  Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Recently, several studies have explored the utility and efficacy of contextual models in the clinical, medical, and biomedical domains ([BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [ClinicalBERT](https://aclanthology.org/W19-1909/), [SciBERT](https://arxiv.org/abs/1903.10676), [BlueBERT](https://arxiv.org/abs/1906.05474)
11
- However, while there is a growing interest in developing language models for more specific domains, the current trend appears to prefer re-training general-domain models on specialized corpora rather than developing models from the ground up with specialized vocabulary. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. However, in fields requiring specialized terminology, such as pathology, these models often fail to perform adequately. One of the major reasons for this limitation is because BERT employs [Word-Pieces](https://www.semanticscholar.org/paper/Google%27s-Neural-Machine-Translation-System%3A-the-Gap-Wu-Schuster/dbde7dfa6cae81df8ac19ef500c42db96c3d1edd) for unsupervised input tokenization, a technique that relies on a predetermined set of Word-Pieces. The vocabulary is built such that it contains the most commonly used words or subword units and as a result, any new words can be represented by frequent subwords. Although WordPiece was built to handle suffixes and complex compound words, it often fails with domain-specific terms. For example, while [ClinicalBERT](https://aclanthology.org/W19-1909/) successfully tokenizes the word ``endpoint" as ['end', ##point], it tokenize the word "carcinoma" as ['car', '##cin', '##oma'] in which the word lost its actual meaning and replaced by some non-relevant junk words, such as `car'. The words which was replaced by the junk pieces, may not play the original role in deriving the contextual representation of the sentence or the paragraph, even when analyzed by the powerful transformer models.
12
 
13
 
14
  To facilitate research on language representations in the pathology domain and assist researchers in addressing the current limitations and advancing cancer research, we preset PathologyBERT, a pre-trained masked language model trained on Histopathology Specimens Reports.
 
8
  # PathologyBERT - Masked Language Model with Breast Pathology Specimens.
9
 
10
  Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Recently, several studies have explored the utility and efficacy of contextual models in the clinical, medical, and biomedical domains ([BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [ClinicalBERT](https://aclanthology.org/W19-1909/), [SciBERT](https://arxiv.org/abs/1903.10676), [BlueBERT](https://arxiv.org/abs/1906.05474)
11
+ However, while there is a growing interest in developing language models for more specific domains, the current trend appears to prefer re-training general-domain models on specialized corpora rather than developing models from the ground up with specialized vocabulary. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. However, in fields requiring specialized terminology, such as pathology, these models often fail to perform adequately. One of the major reasons for this limitation is because BERT employs [Word-Pieces](https://www.semanticscholar.org/paper/Google%27s-Neural-Machine-Translation-System%3A-the-Gap-Wu-Schuster/dbde7dfa6cae81df8ac19ef500c42db96c3d1edd) for unsupervised input tokenization, a technique that relies on a predetermined set of Word-Pieces. The vocabulary is built such that it contains the most commonly used words or subword units and as a result, any new words can be represented by frequent subwords. Although WordPiece was built to handle suffixes and complex compound words, it often fails with domain-specific terms. For example, while [ClinicalBERT](https://aclanthology.org/W19-1909/) successfully tokenizes the word 'endpoint' as ['end', '##point'], it tokenize the word 'carcinoma' as ['car', '##cin', '##oma'] in which the word lost its actual meaning and replaced by some non-relevant junk words, such as `car'. The words which was replaced by the junk pieces, may not play the original role in deriving the contextual representation of the sentence or the paragraph, even when analyzed by the powerful transformer models.
12
 
13
 
14
  To facilitate research on language representations in the pathology domain and assist researchers in addressing the current limitations and advancing cancer research, we preset PathologyBERT, a pre-trained masked language model trained on Histopathology Specimens Reports.