	# Voice Design

	Voice Design mode lets you describe the desired speaker with a set of speaker attributes, passed through the `instruct` parameter.
	No reference audio is needed: the model generates a matching voice on the fly.

	## Quick Example

	```python
	import torch
	from omnivoice import OmniVoice

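	# Load the pretrained model onto the first GPU in half precision.
	model = OmniVoice.from_pretrained(
	    "k2-fsa/OmniVoice",
	    device_map="cuda:0",
	    dtype=torch.float16,
	)

	# Describe the speaker with `instruct`; no reference audio is passed.
	audio = model.generate(
	    text="This is a test for voice design.",
	    instruct="female, young adult, high pitch, british accent",
	)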
	```

	## How It Works

	The `instruct` parameter accepts a comma-separated string of speaker attributes.
	Each attribute belongs to a category (gender, age, pitch, style, accent,
	or dialect). Within a category, only one attribute may be selected at a time.
	Attributes from different categories can be freely combined.

	The model auto-detects the language of the instruct text and normalises it
	internally, so you can write in English, Chinese, or a mix of both.
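
The one-attribute-per-category rule can be checked before calling the model. The sketch below is not part of OmniVoice: `check_instruct` and the `CATEGORIES` table are hypothetical names, the English values are copied from the Supported Attributes tables in this document, and Chinese values are left out for brevity.

```python
# Hypothetical pre-flight check; not part of the OmniVoice API.
# English attribute values are copied from the Supported Attributes tables.
CATEGORIES = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {
        "very low pitch", "low pitch", "moderate pitch",
        "high pitch", "very high pitch",
    },
    "style": {"whisper"},
    "accent": {
        "american accent", "british accent", "australian accent",
        "canadian accent", "indian accent", "chinese accent",
        "korean accent", "japanese accent", "portuguese accent",
        "russian accent",
    },
}


def check_instruct(instruct: str) -> dict[str, str]:
    """Map each category to its chosen attribute.

    Raises ValueError for an unknown attribute or for two attributes
    from the same category.
    """
    chosen: dict[str, str] = {}
    for raw in instruct.split(","):
        attr = raw.strip().lower()
        if not attr:
            continue  # tolerate stray or trailing commas
        category = next(
            (name for name, values in CATEGORIES.items() if attr in values),
            None,
        )
        if category is None:
            raise ValueError(f"unknown attribute: {attr!r}")
        if category in chosen:
            raise ValueError(
                f"{category} given twice: {chosen[category]!r} and {attr!r}"
            )
        chosen[category] = attr
    return chosen
```

Calling `check_instruct("male, female")` raises a `ValueError` because both attributes fall in the gender category, while `check_instruct("female")` returns a single-entry mapping.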

	## Supported Attributes

	### Gender

	| English | Chinese |
	|---------|---------|
	| male | 男 |
	| female | 女 |

	### Age

	| English | Chinese |
	|---------|---------|
	| child | 儿童 |
	| teenager | 少年 |
	| young adult | 青年 |
	| middle-aged | 中年 |
	| elderly | 老年 |

	### Pitch

	| English | Chinese |
	|---------|---------|
	| very low pitch | 极低音调 |
	| low pitch | 低音调 |
	| moderate pitch | 中音调 |
	| high pitch | 高音调 |
	| very high pitch | 极高音调 |

	### Style

	| English | Chinese |
	|---------|---------|
	| whisper | 耳语 |

	### English Accent

	Only effective when the synthesis text is in English.

	| Accent |
	|--------|
	| american accent |
	| british accent |
	| australian accent |
	| canadian accent |
	| indian accent |
	| chinese accent |
	| korean accent |
	| japanese accent |
	| portuguese accent |
	| russian accent |

	### Chinese Dialect

	Only effective when the synthesis text is in Chinese.

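	| Dialect | English gloss (for reference) |
	|---------|-------------------------------|
	| 河南话 | Henan dialect |
	| 陕西话 | Shaanxi dialect |
	| 四川话 | Sichuan dialect |
	| 贵州话 | Guizhou dialect |
	| 云南话 | Yunnan dialect |
	| 桂林话 | Guilin dialect |
	| 济南话 | Jinan dialect |
	| 石家庄话 | Shijiazhuang dialect |
	| 甘肃话 | Gansu dialect |
	| 宁夏话 | Ningxia dialect |
	| 青岛话 | Qingdao dialect |
	| 东北话 | Northeastern dialect |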

	## Writing Instruct Strings

	Separate attributes with commas: half-width `,` for English and full-width `，`
	for Chinese. Mismatched separators are fixed automatically.

	```
	# English
	"female, young adult, high pitch, british accent"

	# Chinese
	"女，青年，高音调，四川话"

	# Mixed (auto-normalised)
	"female, young adult, 四川话"
	```
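
How the library normalises separators internally is not shown in this document. As an illustration of the idea only, the hypothetical `split_instruct` helper below splits on either comma width, trims whitespace, and lowercases each attribute.

```python
import re


def split_instruct(instruct: str) -> list[str]:
    """Split an instruct string on half-width or full-width commas.

    Hypothetical helper, not the OmniVoice implementation. Each attribute
    is stripped and lowercased; empty pieces are dropped.
    """
    parts = re.split(r"[,，]", instruct)
    return [part.strip().lower() for part in parts if part.strip()]
```

With this sketch, the English, Chinese, and mixed strings above each split into a clean list of attributes regardless of which comma they use.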

	### Tips

	- Combine freely across categories: `"male, elderly, low pitch, whisper"`.
	- Leave it to the model: omit attributes you don't care about and the model
	  fills in the rest. For example, `"female"` alone is valid.
	- Case-insensitive: `"Male"`, `"MALE"`, and `"male"` are all accepted; the code normalizes them to lower case.
	- Accent vs. dialect: English accents apply only to English speech, and Chinese dialects apply only to Chinese speech.
	- Attribute combinations: because of training data limitations, some combinations may not work well, and the model may ignore certain attributes in a combination. If the output doesn't match your expectation, try simplifying the instruct string.
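
One way to act on the last tip is to generate progressively simpler instruct strings and try them in order. This is a minimal sketch under that assumption; `simplified_variants` is a hypothetical helper and not part of OmniVoice.

```python
from itertools import combinations


def simplified_variants(instruct: str) -> list[str]:
    """List shorter instruct strings, longest first, excluding the original.

    Hypothetical helper: attributes keep their original order, and every
    subset with at least one attribute is produced.
    """
    attrs = [a.strip() for a in instruct.split(",") if a.strip()]
    variants = []
    for size in range(len(attrs) - 1, 0, -1):
        for combo in combinations(attrs, size):
            variants.append(", ".join(combo))
    return variants
```

Each variant can then be passed as `instruct` to `model.generate(...)` until the output matches what you want. The number of variants grows quickly with the number of attributes, so in practice you may prefer to drop one attribute at a time.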
