Robert Gale committed
Commit 9368dd9 · 1 Parent(s): 5b2649b
Files changed (1):
  1. README.md (+46 -7)
README.md CHANGED
@@ -14,15 +14,54 @@ Our use of the AphasiaBank data was governed by the TalkBank consortium's data u
  Limitations exist regarding accents and dialect, which in turn would affect the scenarios in which a system based on our model could (and should) be used.
  It should also be noted that these models and any derived technology are not meant to be tools to diagnose medical conditions, a task best left to qualified clinicians.

- ## Pre-trained Model Variants
-
- - **BORT-PR** (upload ETA ≤ ACL 2023)
- - **BORT-SP** (upload ETA ≤ ACL 2023)
- - **BORT-PR-SP** (upload ETA ≤ ACL 2023)
- - **BORT-PR-NOISY** (upload ETA ≤ ACL 2023)
- - **BORT-SP-NOISY** (upload ETA ≤ ACL 2023)
- - **BORT-PR-SP-NOISY** (upload ETA ≤ ACL 2023)
+ ## Usage
+
+ ### Downloading BORT
+ ```python
+ from transformers import AutoTokenizer, BartForConditionalGeneration
+
+ tokenizer = AutoTokenizer.from_pretrained("rcgale/bort-test")
+ model = BartForConditionalGeneration.from_pretrained("rcgale/bort-test")
+ ```
+
+ The above uses the default variant, `bort-pr-sp-noisy`. Each variant from the paper can be retrieved by specifying the `variant` argument, like so:
+
+ ```python
+ BartForConditionalGeneration.from_pretrained("rcgale/bort-test", variant="bort-sp")
+ ```
+
+ The following variants are available, pre-trained on the specified proportion of each task:
+
+ | Code               | Pronunciation | Spelling | Noise |
+ |--------------------|---------------|----------|-------|
+ | `bort-pr`          | 10%           | —        | —     |
+ | `bort-sp`          | —             | 10%      | —     |
+ | `bort-pr-sp`       | 10%           | 10%      | —     |
+ | `bort-pr-noisy`    | 10%           | —        | 5%    |
+ | `bort-sp-noisy`    | —             | 10%      | 5%    |
+ | `bort-pr-sp-noisy` | 10%           | 10%      | 5%    |
+
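For quick programmatic reference, the table above can be mirrored as a small dictionary. This is purely illustrative (the `VARIANT_TASKS` name and the helper below are not part of the released package); the variant codes and proportions are transcribed directly from the table:

```python
# Pre-training task proportions for each released BORT variant,
# transcribed from the table above ("—" entries become 0.0).
# Illustrative only; this dictionary is not shipped with the model.
VARIANT_TASKS = {
    "bort-pr":          {"pronunciation": 0.10, "spelling": 0.00, "noise": 0.00},
    "bort-sp":          {"pronunciation": 0.00, "spelling": 0.10, "noise": 0.00},
    "bort-pr-sp":       {"pronunciation": 0.10, "spelling": 0.10, "noise": 0.00},
    "bort-pr-noisy":    {"pronunciation": 0.10, "spelling": 0.00, "noise": 0.05},
    "bort-sp-noisy":    {"pronunciation": 0.00, "spelling": 0.10, "noise": 0.05},
    "bort-pr-sp-noisy": {"pronunciation": 0.10, "spelling": 0.10, "noise": 0.05},
}

def variants_with(task):
    """Return the variant codes whose pre-training included the given task."""
    return [name for name, tasks in VARIANT_TASKS.items() if tasks[task] > 0]
```

For example, `variants_with("noise")` lists the three `*-noisy` variants.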
+ ### Basic usage
+
+ ```python
+ in_texts = [
+     "Due to its coastal location, lɔŋ ·aɪlən·d winter temperatures are milder than most of the state.",
+     "Due to its coastal location, lɔŋ ·b·iʧ winter temperatures are milder than most of the state.",
+     "Due to its coastal location, Long ·b·iʧ winter temperatures are milder than most of the state.",
+     "Due to its coastal location, lɔŋfɝd winter temperatures are milder than most of the state.",
+ ]
+
+ inputs = tokenizer(in_texts, return_tensors="pt", padding=True)
+ summary_ids = model.generate(
+     inputs["input_ids"],
+     attention_mask=inputs["attention_mask"],  # needed with a padded batch
+     num_beams=2,
+     min_length=0,
+     max_length=2048,
+ )
+ decoded = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+
+ for in_text, out_text in zip(in_texts, decoded):
+     print(f"In: \t{in_text}")
+     print(f"Out: \t{out_text}")
+     print()
+ ```
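The tokenize → generate → decode steps above can be bundled into a small helper. This is a hypothetical convenience function, not part of the released model; `tokenizer` and `model` are the objects loaded via `from_pretrained` above, and the attention mask is passed explicitly because the batch is padded:

```python
def normalize_batch(texts, tokenizer, model, num_beams=2, max_length=2048):
    """Tokenize a batch of texts, generate with the given model, and decode.

    Hypothetical wrapper around the generation steps shown above;
    `tokenizer` and `model` come from from_pretrained.
    """
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # required for padded batches
        num_beams=num_beams,
        min_length=0,
        max_length=max_length,
    )
    return tokenizer.batch_decode(
        summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
```

Usage: `decoded = normalize_batch(in_texts, tokenizer, model)`.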
  ## Wikipedia Dataset Used in Pre-Training

  The BPE-tokenized version of the dataset, including metadata used in word transforms.