clinitokenizer is a sentence tokenizer for clinical text to split unstructured text from clinical text (such as Electronic Medical Records) into individual sentences.

To use this model, see the clinitokenizer repository.

General English sentence tokenizers are often unable to correctly parse medical abbreviations, jargon, and other conventions often used in medical records (see "Motivating Examples" section below). clinitokenizer is specifically trained on medical record data and can perform better in these situations (conversely, for non-domain specific use, using more general sentence tokenizers may yield better results).

The model has been trained on multiple datasets provided by i2b2 (now n2c2). Please visit the n2c2 site to request access to the dataset.

Downloads last month: 6