Robert Gale commited on
Commit
2395368
·
1 Parent(s): d71247f
Files changed (1) hide show
  1. README.md +10 -11
README.md CHANGED
@@ -4,6 +4,10 @@ BORT is a pretrained LLM that is designed to accept a mixture of English phoneme
4
 
5
  > Robert Gale, Alexandra C. Salem, Gerasimos Fergadiotis, and Steven Bedrick. 2023. **Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT).** In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP-2023), pages TBD, Online. Association for Computational Linguistics.
6
 
 
 
 
 
7
  ## Limitations
8
 
9
  The models presented here were trained with the basic inventory of English phonemes found in CMUDict. However, a more fine-grained phonetic analysis would require a pronunciation dictionary with more narrowly defined entries. Additionally, while this paper focused on models trained with English-only resources (pre-trained BART-BASE, English Wikipedia text, CMUDict, and the English AphasiaBank), the techniques should be applicable to non-English language models as well. Finally, from a clinical standpoint, the model we describe in this paper assumes the existence of transcribed input (from either a manual or automated source, discussed in detail in §2.1 of the paper; in its current form, this represents a limitation to its clinical implementation, though not to its use in research settings with archival or newly-transcribed datasets.
@@ -14,6 +18,12 @@ Our use of the AphasiaBank data was governed by the TalkBank consortium's data u
14
  Limitations exist regarding accents and dialect, which in turn would affect the scenarios in which a system based on our model could (and should) be used.
15
  It should also be noted that these models and any derived technology are not meant to be tools to diagnose medical conditions, a task best left to qualified clinicians.
16
 
 
 
 
 
 
 
17
  ## Usage
18
 
19
  ### Downloading BORT
@@ -93,14 +103,3 @@ Out: Due to its coastal location, Long Beach winter temperatures are milder th
93
  In: Due to its coastal location, lɔŋfɝd winter temperatures are milder than most of the state.
94
  Out: Due to its coastal location, Longford winter temperatures are milder than most of the state.
95
  ```
96
-
97
-
98
- ## Wikipedia Dataset Used in Pre-Training
99
-
100
- The BPE-tokenized version of the dataset, including metadata used in word transforms.
101
-
102
- - **Dataset** (upload ETA ≤ ACL 2023)
103
-
104
- ## Acknowledgements
105
-
106
- This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award 5R01DC015999 (Principal Investigators: Bedrick \& Fergadiotis). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
 
4
 
5
  > Robert Gale, Alexandra C. Salem, Gerasimos Fergadiotis, and Steven Bedrick. 2023. **Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT).** In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP-2023), pages TBD, Online. Association for Computational Linguistics.
6
 
7
+ ## Acknowledgements
8
+
9
+ This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award 5R01DC015999 (Principal Investigators: Bedrick \& Fergadiotis). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
10
+
11
  ## Limitations
12
 
13
  The models presented here were trained with the basic inventory of English phonemes found in CMUDict. However, a more fine-grained phonetic analysis would require a pronunciation dictionary with more narrowly defined entries. Additionally, while this paper focused on models trained with English-only resources (pre-trained BART-BASE, English Wikipedia text, CMUDict, and the English AphasiaBank), the techniques should be applicable to non-English language models as well. Finally, from a clinical standpoint, the model we describe in this paper assumes the existence of transcribed input (from either a manual or automated source, discussed in detail in §2.1 of the paper; in its current form, this represents a limitation to its clinical implementation, though not to its use in research settings with archival or newly-transcribed datasets.
 
18
  Limitations exist regarding accents and dialect, which in turn would affect the scenarios in which a system based on our model could (and should) be used.
19
  It should also be noted that these models and any derived technology are not meant to be tools to diagnose medical conditions, a task best left to qualified clinicians.
20
 
21
+ ## Wikipedia Dataset Used in Pre-Training
22
+
23
+ The BPE-tokenized version of the dataset, including metadata used in word transforms.
24
+
25
+ - **Dataset** (upload ETA ≤ ACL 2023)
26
+
27
  ## Usage
28
 
29
  ### Downloading BORT
 
103
  In: Due to its coastal location, lɔŋfɝd winter temperatures are milder than most of the state.
104
  Out: Due to its coastal location, Longford winter temperatures are milder than most of the state.
105
  ```