Robert Gale committed 9368dd9 (1 parent: 5b2649b)

README.md (changed):
Limitations exist regarding accents and dialect, which in turn would affect the scenarios in which a system based on our model could (and should) be used.
It should also be noted that these models and any derived technology are not meant to be tools to diagnose medical conditions, a task best left to qualified clinicians.

## Usage

### Downloading BORT

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("rcgale/bort-test")
model = BartForConditionalGeneration.from_pretrained("rcgale/bort-test")
```

The above uses the default variant, `bort-pr-sp-noisy`. Each variant from the paper can be retrieved by specifying the `variant` argument, like so:

```python
BartForConditionalGeneration.from_pretrained("rcgale/bort-test", variant="bort-sp")
```

The following variants are available, pre-trained on the specified proportion of each task:

| Code               | Pronunciation | Spelling | Noise |
|--------------------|---------------|----------|-------|
| `bort-pr`          | 10%           | —        | —     |
| `bort-sp`          | —             | 10%      | —     |
| `bort-pr-sp`       | 10%           | 10%      | —     |
| `bort-pr-noisy`    | 10%           | —        | 5%    |
| `bort-sp-noisy`    | —             | 10%      | 5%    |
| `bort-pr-sp-noisy` | 10%           | 10%      | 5%    |
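The variant codes in the table follow a regular pattern: `bort`, plus `-pr` and/or `-sp` for the pre-training tasks, plus `-noisy` when noise was added. As a convenience, a small helper can assemble a code from task flags; `bort_variant` is a hypothetical name used here for illustration, not part of the released package:

```python
def bort_variant(pronunciation: bool = False, spelling: bool = False, noisy: bool = False) -> str:
    """Assemble a BORT variant code from the task flags listed in the table above."""
    parts = ["bort"]
    if pronunciation:
        parts.append("pr")
    if spelling:
        parts.append("sp")
    if noisy:
        parts.append("noisy")
    return "-".join(parts)

# e.g. the fully combined (default) variant:
# BartForConditionalGeneration.from_pretrained("rcgale/bort-test",
#                                              variant=bort_variant(True, True, True))
```

At least one task flag should be set; the six combinations in the table are the ones with released weights.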

## Basic usage

```python
in_texts = [
    "Due to its coastal location, lɔŋ ·aɪlən·d winter temperatures are milder than most of the state.",
    "Due to its coastal location, lɔŋ ·b·iʧ winter temperatures are milder than most of the state.",
    "Due to its coastal location, Long ·b·iʧ winter temperatures are milder than most of the state.",
    "Due to its coastal location, lɔŋfɝd winter temperatures are milder than most of the state.",
]

inputs = tokenizer(in_texts, return_tensors="pt", padding=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=2048)
decoded = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

for in_text, out_text in zip(in_texts, decoded):
    print(f"In: \t{in_text}")
    print(f"Out: \t{out_text}")
    print()
```
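The IPA spans in the example inputs are interleaved with `·` delimiter characters. If you want to display or compare the IPA strings without those delimiters, a simple strip suffices; `strip_markers` is a hypothetical convenience for illustration, not part of the model's API:

```python
def strip_markers(text: str) -> str:
    """Remove the `·` delimiter characters used in the IPA example inputs."""
    return text.replace("·", "")

print(strip_markers("lɔŋ ·aɪlən·d"))  # prints: lɔŋ aɪlənd
```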

## Wikipedia Dataset Used in Pre-Training

The BPE-tokenized version of the dataset, including metadata used in word transforms.