imjeffhi/syllabizer

Tags: text2text-generation, text-generation-inference


	## About
	This model takes a single word as input and splits it into syllables. I built it by pre-training a T5 model on a syllable dataset that I scraped from the internet, using a custom tokenizer that is effectively character-based. It works reasonably well in my limited tests, but the output may be unpredictable for inputs that contain multiple words, numbers, or non-English characters. It can, however, handle trailing punctuation.
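
Given the input caveats above, it may help to screen inputs before sending them to the model. The following is a minimal sketch, not part of the model or of `transformers`: the helper name `is_supported_input` is hypothetical, and it assumes that "English characters" means ASCII letters and that trailing punctuation is limited to a few common marks.

```python
import re

# Hypothetical pre-check (an assumption, not the author's code):
# one run of ASCII letters, optionally followed by trailing punctuation.
_SINGLE_WORD = re.compile(r"[A-Za-z]+[.,;:!?]*")

def is_supported_input(text: str) -> bool:
    """Return True if `text` looks like a single English word,
    optionally with trailing punctuation."""
    return _SINGLE_WORD.fullmatch(text) is not None

# Single words (with or without trailing punctuation) pass;
# multiple words, digits, and non-ASCII letters are rejected.
for candidate in ["syllabizer", "vowel,", "two words", "abc123", "naïve"]:
    print(candidate, is_supported_input(candidate))
```

Inputs that fail the check are not necessarily wrong for the model; they simply fall into the cases where the output may be unpredictable.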

	## Calling the Model
	```python
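	from transformers import AutoTokenizer, T5ForConditionalGeneration

	# Load the model and its custom tokenizer from the Hugging Face Hub
	model = T5ForConditionalGeneration.from_pretrained('imjeffhi/syllabizer')
	tokenizer = AutoTokenizer.from_pretrained('imjeffhi/syllabizer')

	def generate_output(word):
	    # Tokenize a single word and return PyTorch tensors
	    tokens = tokenizer(word, return_tensors='pt')
	    # Deterministic decoding (no sampling), capped at 30 tokens
	    output = model.generate(**tokens, do_sample=False, max_length=30, early_stopping=True)[0]
	    # Decode the generated ids back to text, dropping special tokens
	    return tokenizer.decode(output, skip_special_tokens=True)

	syllables = generate_output('syllabizer')
	print(syllables)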
	```
	The model returns the syllables as a single space-separated string. See output below.
	```
	syl la biz er
	```
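
Because the output is a plain space-separated string, turning it into a list, a syllable count, or a hyphenated form takes only standard string methods. A small sketch, using the documented output for `'syllabizer'` as a literal so that it runs without loading the model (the helper name `to_syllable_list` is made up for illustration):

```python
from typing import List

def to_syllable_list(generated: str) -> List[str]:
    """Split the model's space-separated output into a list of syllables."""
    # split() with no argument also tolerates stray leading/trailing spaces
    return generated.split()

syllables = to_syllable_list("syl la biz er")
print(syllables)             # ['syl', 'la', 'biz', 'er']
print(len(syllables))        # 4
print("-".join(syllables))   # syl-la-biz-er
```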
	## Using pipelines to syllabize sentences
	To syllabize an entire sentence or paragraph, split it into words and pass the list to a `text2text-generation` pipeline. The following code also converts each result into a list of syllables:
	```python
	from transformers import pipeline

	syllabizer_pipe = pipeline('text2text-generation', model='imjeffhi/syllabizer', tokenizer='imjeffhi/syllabizer')

	sentence = "A unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant alone, or by any of these sounds preceded, followed, or surrounded by one or more consonants."
	# The model expects one word per input, so split the sentence on spaces
	words = sentence.split(" ")
	# Run all words in a single batch with deterministic decoding
	output = syllabizer_pipe(words, batch_size=len(words), do_sample=False, max_length=30, early_stopping=True)

	# Map each original word to its list of syllables
	[{words[i]: gen_text['generated_text'].split(" ")} for i, gen_text in enumerate(output)]
	```

	This outputs the following:
	```
	[{'A': ['a']},
	{'unit': ['u', 'nit']},
	{'of': ['of']},
	{'spoken': ['spok', 'en']},
	{'language': ['lan', 'guage']},
	{'consisting': ['con', 'sis', 'ting']},
	{'of': ['of']},
	{'a': ['a']},
	{'single': ['sing', 'le']},
	{'uninterrupted': ['un', 'in', 'ter', 'rupt', 'ed']},
	{'sound': ['sound']},
	{'formed': ['formed']},
	{'by': ['by']},
	{'a': ['a']},
	{'vowel,': ['vow', 'el']},
	{'diphthong,': ['diph', 'thong']},
	{'or': ['or']},
	{'syllabic': ['syl', 'la', 'bic']},
	{'consonant': ['con', 'so', 'nant']},
	{'alone,': ['a', 'lone']},
	{'or': ['or']},
	{'by': ['by']},
	{'any': ['an', 'y']},
	{'of': ['of']},
	{'these': ['these']},
	{'sounds': ['sounds']},
	{'preceded,': ['pre', 'ced', 'ed']},
	{'followed,': ['fol', 'lowed']},
	{'or': ['or']},
	{'surrounded': ['sur', 'round', 'ed']},
	{'by': ['by']},
	{'one': ['one']},
	{'or': ['or']},
	{'more': ['more']},
	{'consonants.': ['con', 'so', 'nants']}]
	```
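
In the output above, trailing punctuation is dropped from the syllables (`'vowel,'` becomes `['vow', 'el']`) and `'A'` comes back lowercased. If you want a hyphenated sentence with the punctuation restored, one possible post-processing step is sketched below. The function `hyphenate` and the set of punctuation marks it re-attaches are assumptions for illustration; the sample data is copied from the output above so the snippet runs without the model.

```python
from typing import Dict, List

def hyphenate(results: List[Dict[str, List[str]]], sep: str = "-") -> str:
    """Join each word's syllables with `sep` and re-attach any trailing
    punctuation found on the original word (the dict key)."""
    pieces = []
    for entry in results:
        for word, syllables in entry.items():
            # Trailing punctuation is whatever rstrip removes from the key
            stripped = word.rstrip(".,;:!?")
            trailing = word[len(stripped):]
            pieces.append(sep.join(syllables) + trailing)
    return " ".join(pieces)

# A few entries copied from the documented output
sample = [
    {'A': ['a']},
    {'unit': ['u', 'nit']},
    {'vowel,': ['vow', 'el']},
    {'consonants.': ['con', 'so', 'nants']},
]
print(hyphenate(sample))  # a u-nit vow-el, con-so-nants.
```

The casing follows the model output rather than the original word, so restoring capital letters would need an extra step.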