Robert Gale committed on
Commit · 2395368
1 Parent(s): d71247f
fiwjeoi
README.md CHANGED
@@ -4,6 +4,10 @@ BORT is a pretrained LLM that is designed to accept a mixture of English phoneme

 > Robert Gale, Alexandra C. Salem, Gerasimos Fergadiotis, and Steven Bedrick. 2023. **Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT).** In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP-2023), pages TBD, Online. Association for Computational Linguistics.

+## Acknowledgements
+
+This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award 5R01DC015999 (Principal Investigators: Bedrick \& Fergadiotis). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
+
 ## Limitations

 The models presented here were trained with the basic inventory of English phonemes found in CMUDict. However, a more fine-grained phonetic analysis would require a pronunciation dictionary with more narrowly defined entries. Additionally, while this paper focused on models trained with English-only resources (pre-trained BART-BASE, English Wikipedia text, CMUDict, and the English AphasiaBank), the techniques should be applicable to non-English language models as well. Finally, from a clinical standpoint, the model we describe in this paper assumes the existence of transcribed input (from either a manual or automated source, discussed in detail in §2.1 of the paper); in its current form, this represents a limitation to its clinical implementation, though not to its use in research settings with archival or newly-transcribed datasets.
@@ -14,6 +18,12 @@ Our use of the AphasiaBank data was governed by the TalkBank consortium's data u
 Limitations exist regarding accents and dialect, which in turn would affect the scenarios in which a system based on our model could (and should) be used.
 It should also be noted that these models and any derived technology are not meant to be tools to diagnose medical conditions, a task best left to qualified clinicians.

+## Wikipedia Dataset Used in Pre-Training
+
+The BPE-tokenized version of the dataset, including metadata used in word transforms.
+
+- **Dataset** (upload ETA ≤ ACL 2023)
+
 ## Usage

 ### Downloading BORT
@@ -93,14 +103,3 @@ Out: Due to its coastal location, Long Beach winter temperatures are milder th
 In: Due to its coastal location, lɔŋfɝd winter temperatures are milder than most of the state.
 Out: Due to its coastal location, Longford winter temperatures are milder than most of the state.
 ```
-
-
-## Wikipedia Dataset Used in Pre-Training
-
-The BPE-tokenized version of the dataset, including metadata used in word transforms.
-
-- **Dataset** (upload ETA ≤ ACL 2023)
-
-## Acknowledgements
-
-This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award 5R01DC015999 (Principal Investigators: Bedrick \& Fergadiotis). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.