## About
This model takes a single word as input and splits it into syllables. I built it by pre-training a T5 model on a syllable dataset I scraped from the internet, using a custom tokenizer that is effectively character-based. It works reasonably well in my limited tests, but the output may be unpredictable for multi-word input, numbers, or non-English characters. It can, however, handle trailing punctuation.
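Since the model is only expected to behave well on a single English word, it can help to screen input before calling it. The sketch below is one possible pre-check and is not part of the model or its tokenizer: the helper name `looks_like_single_word` and its regular expression are assumptions for illustration, encoding "ASCII letters followed by optional trailing punctuation".

```python
import re

# Hypothetical pre-check (an assumption, not part of the model):
# one run of ASCII letters, optionally followed by trailing punctuation.
_SINGLE_WORD = re.compile(r"[A-Za-z]+[.,;:!?]*")


def looks_like_single_word(text: str) -> bool:
    """Return True if `text` is a single English-looking word."""
    return _SINGLE_WORD.fullmatch(text) is not None


for candidate in ["syllabizer", "vowel,", "two words", "route66", "naïve"]:
    print(candidate, looks_like_single_word(candidate))
```

Input that fails the check can still be passed to the model; the check only flags cases where the output is more likely to be unpredictable.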
## Calling the Model
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the model and its character-level tokenizer from the Hugging Face Hub.
model = T5ForConditionalGeneration.from_pretrained('imjeffhi/syllabizer')
tokenizer = AutoTokenizer.from_pretrained('imjeffhi/syllabizer')

def generate_output(word):
    """Return the syllables of `word` as a space-separated string."""
    tokens = tokenizer(word, return_tensors='pt')
    # Deterministic decoding (no sampling); [0] selects the single generated sequence.
    output = model.generate(**tokens, do_sample=False, max_length=30, early_stopping=True)[0]
    return tokenizer.decode(output, skip_special_tokens=True)

syllables = generate_output('syllabizer')
print(syllables)
```
The model returns the syllables as a single space-separated string:
```
syl la biz er
```
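Because the output is a plain string, a small amount of post-processing turns it into a list and lets you sanity-check it. The following is a minimal sketch; `split_syllables` and `is_faithful` are hypothetical helper names, and the check assumes a correct result should spell the input word once the syllables are joined.

```python
def split_syllables(generated: str) -> list[str]:
    """Split the model's space-separated output into a list of syllables."""
    # split() with no argument also ignores leading, trailing, or repeated spaces.
    return generated.split()


def is_faithful(word: str, syllables: list[str]) -> bool:
    """Check that the syllables spell the original word (case-insensitive)."""
    return "".join(syllables).lower() == word.lower()


syllables = split_syllables("syl la biz er")
print(syllables)                             # ['syl', 'la', 'biz', 'er']
print(len(syllables))                        # 4
print(is_faithful("syllabizer", syllables))  # True
```

A failed check suggests the model added or dropped characters. If the input carries trailing punctuation, strip it before comparing, since the generated syllables may not include it.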
## Using pipelines to syllabize sentences
With a pipeline you can syllabize an entire sentence or paragraph word by word and convert each result into a list of syllables:
```python
from pprint import pprint
from transformers import pipeline

syllabizer_pipe = pipeline('text2text-generation', model='imjeffhi/syllabizer', tokenizer='imjeffhi/syllabizer')
sentence = "A unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant alone, or by any of these sounds preceded, followed, or surrounded by one or more consonants."
# The model expects one word per input, so split the sentence on whitespace.
words = sentence.split()
# Run all words as a single batch without sampling.
output = syllabizer_pipe(words, batch_size=len(words), do_sample=False, max_length=30, early_stopping=True)
# Map each original word to its list of syllables.
pprint([{word: gen_text['generated_text'].split()} for word, gen_text in zip(words, output)])
```
This outputs the following:
```
[{'A': ['a']},
 {'unit': ['u', 'nit']},
 {'of': ['of']},
 {'spoken': ['spok', 'en']},
 {'language': ['lan', 'guage']},
 {'consisting': ['con', 'sis', 'ting']},
 {'of': ['of']},
 {'a': ['a']},
 {'single': ['sing', 'le']},
 {'uninterrupted': ['un', 'in', 'ter', 'rupt', 'ed']},
 {'sound': ['sound']},
 {'formed': ['formed']},
 {'by': ['by']},
 {'a': ['a']},
 {'vowel,': ['vow', 'el']},
 {'diphthong,': ['diph', 'thong']},
 {'or': ['or']},
 {'syllabic': ['syl', 'la', 'bic']},
 {'consonant': ['con', 'so', 'nant']},
 {'alone,': ['a', 'lone']},
 {'or': ['or']},
 {'by': ['by']},
 {'any': ['an', 'y']},
 {'of': ['of']},
 {'these': ['these']},
 {'sounds': ['sounds']},
 {'preceded,': ['pre', 'ced', 'ed']},
 {'followed,': ['fol', 'lowed']},
 {'or': ['or']},
 {'surrounded': ['sur', 'round', 'ed']},
 {'by': ['by']},
 {'one': ['one']},
 {'or': ['or']},
 {'more': ['more']},
 {'consonants.': ['con', 'so', 'nants']}]
```
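In the output above the syllable lists come back without the trailing punctuation of the original words. If you want a hyphenated sentence, one option is to rejoin each list and re-attach the punctuation from the original word. This is only a sketch: `hyphenate` is a hypothetical helper, the `-` separator is an arbitrary choice, and the sample entries are copied from the output above.

```python
import string


def hyphenate(word: str, syllables: list[str], sep: str = "-") -> str:
    """Join syllables with `sep` and re-attach the word's trailing punctuation."""
    stripped = word.rstrip(string.punctuation)
    trailing = word[len(stripped):]
    return sep.join(syllables) + trailing


# A few entries copied from the output above.
sample = [
    {"spoken": ["spok", "en"]},
    {"vowel,": ["vow", "el"]},
    {"consonants.": ["con", "so", "nants"]},
]

print(" ".join(hyphenate(w, s) for entry in sample for w, s in entry.items()))
# spok-en vow-el, con-so-nants.
```

Capitalization is not restored by this helper: the syllables are used exactly as the model returned them, so `'A'` stays `'a'`.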