---
license: mit
---

# Model Details

##### Model Name: NumericBERT
##### Model Type: Transformer
##### Architecture: BERT
##### Training Method: Masked Language Modeling (MLM)
##### Training Data: MIMIC-IV lab values
##### Training Hyperparameters:
- Optimizer: AdamW
- Learning Rate: 5e-5
- Masking Rate: 20%
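The original training script is not reproduced in this card. The following is a minimal sketch of how one MLM step with these hyperparameters could look using Hugging Face `transformers`; the vocabulary size, mask-token id, and the simplified masking scheme (standard BERT also substitutes random/unchanged tokens for some masked positions) are illustrative assumptions, not details from the original run.

```python
# Minimal MLM training sketch. Assumptions: Hugging Face transformers,
# a custom vocabulary of encoded lab tokens; VOCAB_SIZE and MASK_ID are
# illustrative placeholders, not values from the original run.
import torch
from torch.optim import AdamW
from transformers import BertConfig, BertForMaskedLM

VOCAB_SIZE = 5000   # assumption: size of the encoded-lab-token vocabulary
MASK_ID = 4         # assumption: id of the [MASK] token in that vocabulary

model = BertForMaskedLM(BertConfig(vocab_size=VOCAB_SIZE))
optimizer = AdamW(model.parameters(), lr=5e-5)   # optimizer and LR from this card

def mlm_step(input_ids: torch.Tensor) -> float:
    """One MLM update: mask 20% of the tokens and predict them."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < 0.20    # 20% masking rate from this card
    labels[~mask] = -100                         # loss only on masked positions
    masked = input_ids.clone()
    masked[mask] = MASK_ID
    loss = model(input_ids=masked, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```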
### Tokenization
Tokenizer: Custom numeric-to-text mapping using the TextEncoder class.
### Text Encoding Process
The encoder converts non-negative integers into uppercase letter-based representations, allowing numerical values to be expressed as sequences of letters.
Each lab value is first scaled, then converted into its corresponding letters using this predefined mapping.
Finally, the encoded value is tagged with the ID of its lab column ('Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc').
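The TextEncoder class itself is not included in this card; the sketch below only illustrates the scale-then-map style of encoding described above. The digit-to-letter mapping, the scaling factor, and the `ID_LETTERS` output format are all assumptions.

```python
# Illustrative sketch of the numeric-to-text encoding described above.
# The real TextEncoder is not reproduced here; the digit-to-letter map,
# scaling factor, and output format are assumptions for illustration.
DIGIT_TO_LETTER = {str(d): chr(ord("A") + d) for d in range(10)}  # '0'->'A' ... '9'->'J'

def encode_value(value: float, lab_id: str, scale: int = 10) -> str:
    """Scale a lab value to a non-negative integer, map each digit to an
    uppercase letter, and tag the result with its lab ID."""
    scaled = round(value * scale)            # assumption: fixed-point scaling
    letters = "".join(DIGIT_TO_LETTER[d] for d in str(scaled))
    return f"{lab_id}_{letters}"             # assumption: 'ID_LETTERS' token format

# Example: a sodium value of 140.0 scales to 1400 and encodes as 'Sod_BEAA'.
print(encode_value(140.0, "Sod"))
```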
### Training Data Preprocessing
- Column Selection: Numeric values are taken from the following lab columns: 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
- Text Encoding: The numeric values are encoded into text as described above (see the sketch after this list).
- Masking: 20% of the tokens are randomly masked during training.
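As a rough illustration of the column-selection and encoding steps (not the original pipeline), assuming a pandas DataFrame of lab values and the hypothetical `encode_value` helper from the sketch above:

```python
# Sketch of the preprocessing step. Assumptions: a pandas DataFrame of
# lab values and the hypothetical encode_value helper defined earlier;
# tokens are space-joined into one sequence per row.
import pandas as pd

LAB_COLUMNS = ["Bic", "Crt", "Pot", "Sod", "Ure", "Hgb", "Plt", "Wbc"]

def encode_row(row: pd.Series) -> str:
    """Turn one row of lab values into a sequence of encoded tokens."""
    return " ".join(encode_value(row[col], col) for col in LAB_COLUMNS)

labs = pd.DataFrame([{"Bic": 24.0, "Crt": 1.1, "Pot": 4.2, "Sod": 140.0,
                      "Ure": 15.0, "Hgb": 13.5, "Plt": 250.0, "Wbc": 7.8}])
corpus = labs.apply(encode_row, axis=1).tolist()
# Masking is applied afterwards, as in the training sketch further up.
```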
### Model Output
During training, the model outputs predictions for the masked positions.
These predictions are produced in the same encoded text representation.
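As a rough sketch of reading those predictions (again assuming the `model` and `MASK_ID` placeholders from the training sketch above), the most likely token at each masked position can be taken from the MLM head's logits; decoding back to numeric values would then invert the letter mapping:

```python
# Illustrative inference sketch; model and MASK_ID are the placeholders
# from the training sketch above, not names from the original code.
import torch

def predict_masked(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Return the most likely vocabulary id at every masked position."""
    model.eval()
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits   # (batch, seq_len, vocab)
    return logits.argmax(dim=-1)[input_ids == MASK_ID]
```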
### Limitations and Considerations
- Numeric Data Representation: The model relies on a custom text representation of numeric data, which may not capture all of the complex patterns present in the original numeric values.
- Training Data Source: The model is trained on MIMIC-IV numeric data, and its performance may be influenced by the characteristics and biases of that dataset.
### Contact Information
For inquiries or additional information, please contact:

David Restrepo
davidres@mit.edu
MIT Critical Data