Upload folder using huggingface_hub
- cotlet/1.ipynb +0 -0
- cotlet/2.ipynb +0 -0
- cotlet/3.ipynb +0 -0
- cotlet/__pycache__/phon.cpython-311.pyc +0 -0
- cotlet/__pycache__/utils.cpython-311.pyc +0 -0
- cotlet/cell_output.log +3 -0
- cotlet/hallucinate.csv +48 -0
- cotlet/phon.py +158 -0
- cotlet/sanity_check.py +46 -0
- cotlet/utils.py +1003 -0
cotlet/1.ipynb
ADDED
The diff for this file is too large to render.
cotlet/2.ipynb
ADDED
The diff for this file is too large to render.
cotlet/3.ipynb
ADDED
The diff for this file is too large to render.
cotlet/__pycache__/phon.cpython-311.pyc
ADDED
Binary file (5.55 kB).
cotlet/__pycache__/utils.cpython-311.pyc
ADDED
Binary file (24.8 kB).
cotlet/cell_output.log
ADDED
@@ -0,0 +1,3 @@
+Logging started. Output will be saved to /home/austin/disk2/llmvcs/tt/cotlet/cell_output.log every 5 seconds.
+Finding audio files...
+Finding audio files...
cotlet/hallucinate.csv
ADDED
@@ -0,0 +1,48 @@
| 1 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__01/Shinichiro_Miki__01_chunk1929.wav|kojomi oniːtɕaɴ da ka kambarɯsaɴ da ka no eikʲoɯ o ɯkesɯgi dʑa neː no ka? omae wa! aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː.|4
|
| 2 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/kamiya_hiroshi/Kamiya_Hiroshi_02/Kamiya_Hiroshi_02_chunk2670.wav|ɴ, ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa , ɯwa .|5
|
| 3 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_02/Sakurai_Takahiro_02_chunk290.wav|ɯɯ ɯɯ ɯɯ ɯɯ.|1
|
| 4 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk2123.wav|eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː eː.|2
|
| 5 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__02/Shinichiro_Miki__02_chunk1211.wav|ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ... ɕi no bɯ...|4
|
| 6 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_02/Sakurai_Takahiro_02_chunk37.wav|ʔɴ?ʔɴ? ɯwaː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː jɯkɯɯɯɯɯ kɯɯɯɯ kɯɯɯɯ kɯɯɯ kɯɯɯ kɯɯɯ kɯɯɯ kɯɯɯ kɯɯ kɯɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯ kɯ.|1
|
| 7 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/kamiya_hiroshi/Kamiya_Hiroshi_02/Kamiya_Hiroshi_02_chunk2676.wav|ɯ, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa, ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯwa,ɯ.|5
|
| 8 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk258.wav|aɽi enai aɽi enai aɽi enai aɽi enai aɽi enai aɽi enai aɽi enai aɽi enai aɽi enai.|4
|
| 9 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__01/Shinichiro_Miki__01_chunk1916.wav|doɯ... doɯ... doɯ... doɯ...|4
|
| 10 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1723.wav|aite no kʲotae o ɽijoɯ sɯrɯ sempoɯnʲaɴ. kʲɯɯsoɯ neko o kamɯ, to iɯ kotowaza ga arɯkaɽainʲaɴ. neko ga toɽa o kandaʔte okaɕikɯ wa niai daɽoɯ. soɽe ni...ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯ.|2
|
| 11 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1975.wav|aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː itai... itai... itai... atsɯi... itai... atsɯi... atsɯi... atsɯi... atsɯi...|2
|
| 12 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk204.wav|kʲaː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː a.|4
|
| 13 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_02/Sakurai_Takahiro_02_chunk634.wav|ʔte, oi oi oi oi oi oi oi!|1
|
| 14 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__02/Shinichiro_Miki__02_chunk1373.wav|ʔɴ...te... ɯwa!! ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw ɯw .|4
|
| 15 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk491.wav|tada, “çinoɯma ɕi no eɯ ma ɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ.|2
|
| 16 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/chiwa_saito/Chiwa_Saito_01/Chiwa_Saito_01_chunk119.wav|ɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ, ɯɯɴ.|3
|
| 17 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/kamiya_hiroshi/Kamiya_Hiroshi_01/Kamiya_Hiroshi_01_chunk1491.wav|ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ, ɴʔ,.|5
|
| 18 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/kamiya_hiroshi/Kamiya_Hiroshi_01/Kamiya_Hiroshi_01_chunk914.wav|nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, nɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ, ɯɯʔ,.|5
|
| 19 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1801.wav|ɯ o?nʲa aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː a.|2
|
| 20 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_02/Sakurai_Takahiro_02_chunk57.wav|gʲaː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː.|1
|
| 21 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk2136.wav|ɯ!! ɯ!! ɯ!!|4
|
| 22 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__02/Shinichiro_Miki__02_chunk398.wav|do, do, do, do, do, do, do, do, do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do,do!|4
|
| 23 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_02/Sakurai_Takahiro_02_chunk2832.wav|nani o saɽeta... nani o saɽeta... nani o saɽeta... nani o saɽeta... nani o saɽeta... nani o saɽeta... nani o saɽeta...|1
|
| 24 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakamoto_maya/Sakamoto_Maya_03/Sakamoto_Maya_03_chunk544.wav|ʔɴ, maː na. ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɽ ɯɯɯ, maː na.|6
|
| 25 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_01/Sawashiro_Miyuki_01_chunk1450.wav|aɽaɽagikɯɴ ni taiɕite na no ka, oɕinosaɴ ni taiɕite na no ka, arɯiha wataɕi ni taiɕite na no ka...|2
|
| 26 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakamoto_maya/Sakamoto_Maya_02/Sakamoto_Maya_02_chunk311.wav|tona ni ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɕɯɽeba naɽakɯte ɯ.|6
|
| 27 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_01/Sakurai_Takahiro_01_chunk6.wav|itɕi. ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯ.|1
|
| 28 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__01/Shinichiro_Miki__01_chunk1524.wav|kʲaː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː.|4
|
| 29 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_02/Sawashiro_Miyuki_02_chunk1106.wav|koɽe ga kazokɯ no kaiwa na no ka, to ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɕɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯɯkaɴ ɯkaɴ ɯkaɴ ɯkaɴ ɯkaɴ ɯkaɴ ɯka.|2
|
| 30 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk771.wav|aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː.|4
|
| 31 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/chiwa_saito/Chiwa_Saito_03/Chiwa_Saito_03_chunk344.wav|eː to, neko, bokɯ ga ima kaɽa iɯ bɯɴɕoɯ o fɯkɯɕoɯ ɕiɽo. naname ɯ ɯɯ ɽiɴ do no naɽabi de ɕɯɯnɯɯ ɕɯɯ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ ɽoɴ .|3
|
| 32 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki__01/Shinichiro_Miki__01_chunk456.wav|waɽaʔte waɽaʔte waɽaʔte waɽaʔte waɽaʔte.|4
|
| 33 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_02/Sawashiro_Miyuki_02_chunk1018.wav|tonikakɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯ.|2
|
| 34 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1526.wav|haʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaʔhaw wa haʔhaʔhaʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔhaw waʔtɽi desɯ ne.|2
|
| 35 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakurai_takahiro/Sakurai_Takahiro_03/Sakurai_Takahiro_03_chunk1489.wav|ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ.|1
|
| 36 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1933.wav|aɽaɽagikɯɴ! aɽaɽagikɯɴ! aɽaɽagikɯɴ!|2
|
| 37 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1735.wav|nʲa aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː.|2
|
| 38 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk354.wav|nani mo iʔteneː jo, nani mo iʔteneːʔte, nani mo naː.|4
|
| 39 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/horie_yui/Horie_Yui_01/Horie_Yui_01_chunk1622.wav|«haʔhaʔhaʔhaʔ, itai, itai, haʔhaʔ, itai, itai, itai...».|0
|
| 40 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakamoto_maya/Sakamoto_Maya_03/Sakamoto_Maya_03_chunk1924.wav|nemaki gawaɽi no jɯkata wa, ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ ɕɯɯrʲɯɯ.|6
|
| 41 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk2177.wav|kojomi oniːtɕaɴ kojomi oniːtɕaɴ kojomi oniːtɕaɴ kojomi oniːtɕaɴ kojomi oniːtɕaɴ kojomi oniːtɕaɴ kojomi oniːtɕaɴ.|4
|
| 42 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/horie_yui/Horie_Yui_01/Horie_Yui_01_chunk814.wav|kondo tsɯkiçitɕaɴ no hoɯ ni, ɽ ɽ ɽ ɽ ɽi ɽi ɽi ɽi ɽi ɽi ɽi ɽi ɽi ɽi ɽi ɽi ɽi no doɯkoɯ o kiːte okoɯ.|0
|
| 43 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/shinichiro_miki/Shinichiro_Miki_03/Shinichiro_Miki_03_chunk1105.wav|ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɕɯɯ ɯɯ ɕɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ .|4
|
| 44 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk2136.wav|i!!? aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aː aɯ oɯ kɯɯɯɯ kɯɯɯ kɯɯɯ kɯɯɯ kɯɯɯ kɯɯɯ kɯɯ kɯɯɯ kɯɯ kɯɯɯ kɯɯ kɯɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯɯ kɯ k.|2
|
| 45 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk2134.wav|a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, aɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯɯ.|2
|
| 46 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sakamoto_maya/Sakamoto_Maya_02/Sakamoto_Maya_02_chunk877.wav|ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ ɯɯ.|6
|
| 47 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/kamiya_hiroshi/Kamiya_Hiroshi_01/Kamiya_Hiroshi_01_chunk1765.wav|na na ga tsɯ na na nitɕi.|5
|
| 48 |
+
/home/austin/disk2/llmvcs/tt/stylekan/Data/moe_res/monogatari/monogatari_voices/monogatari_split/sawashiro_miyuki/Sawashiro_Miyuki_03/Sawashiro_Miyuki_03_chunk1792.wav|a tɕi, a tɕi, a tɕi, a tɕi, a tɕi, a tɕi, a tɕi, a tɕi!|2
|
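Most rows in hallucinate.csv share one failure: a single token repeated many times inside a `path|phonemes|id` record. Below is a minimal sketch of how such rows could be flagged automatically. The helper names (`longest_token_run`, `looks_hallucinated`, `flag_rows`), the punctuation stripping, and the threshold of 5 are assumptions for illustration and are not taken from the repository. The third column is assumed to be a speaker index and is ignored here.

```python
import csv
import io
from itertools import groupby


# Hypothetical helper: length of the longest run of identical consecutive tokens.
def longest_token_run(phonemes: str) -> int:
    # Strip surrounding punctuation so "ɯwa," and "ɯwa" count as the same token.
    tokens = [t.strip(".,!?…") for t in phonemes.split()]
    tokens = [t for t in tokens if t]
    return max((len(list(group)) for _, group in groupby(tokens)), default=0)


# Hypothetical rule: a row is suspicious once one token repeats max_run times in a row.
def looks_hallucinated(phonemes: str, max_run: int = 5) -> bool:
    return longest_token_run(phonemes) >= max_run


def flag_rows(lines, max_run: int = 5):
    # Rows are pipe-delimited: wav path | phoneme string | id.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2 and looks_hallucinated(row[1], max_run):
            yield row[0]


sample = io.StringIO(
    "a.wav|eː eː eː eː eː eː eː eː.|2\n"
    "b.wav|nani mo iʔteneː jo, nani mo naː.|4\n"
)
flagged = list(flag_rows(sample))
print(flagged)
```

This only catches single-token runs separated by whitespace. Multi-word loops such as `ɕi no bɯ... ɕi no bɯ...` or comma-joined runs such as `do,do,do` would need an n-gram or character-level check instead.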
cotlet/phon.py
ADDED
@@ -0,0 +1,158 @@
+import re
+from cotlet.utils import *
+import cutlet
+
+katsu = cutlet.Cutlet(ensure_ascii=False)
+katsu.use_foreign_spelling = False
+
+def process_japanese_text(ml):
+    # Check for small characters and replace them
+    if any(char in ml for char in "ぁぃぅぇぉ"):
+
+        ml = ml.replace("ぁ", "あ")
+        ml = ml.replace("ぃ", "い")
+        ml = ml.replace("ぅ", "う")
+        ml = ml.replace("ぇ", "え")
+        ml = ml.replace("ぉ", "お")
+
+    # Cutlet is initialized once at module level (see `katsu` above)
+
+    # Convert to romaji and apply transformations
+    # output = katsu.romaji(ml, capitalize=False).lower()
+
+    output = katsu.romaji(apply_transformations(alphabetreading(ml)), capitalize=False).lower()
+
+    # Replace specific romaji sequences
+    if 'j' in output:
+        output = output.replace('j', "dʑ")
+    if 'tt' in output:
+        output = output.replace('tt', "ʔt")
+    if 't t' in output:
+        output = output.replace('t t', "ʔt")
+    if ' ʔt' in output:
+        output = output.replace(' ʔt', "ʔt")
+    if 'ssh' in output:
+        output = output.replace('ssh', "ɕɕ")
+
+    # Convert romaji to IPA
+    output = Roma2IPA(convert_numbers_in_string(output))
+
+
+    output = hira2ipa(output)
+
+    # Apply additional transformations
+    output = replace_chars_2(output)
+    output = replace_repeated_chars(replace_tashdid_2(output))
+    output = nasal_mapper(output)
+
+    # Final adjustments
+    if " ɴ" in output:
+        output = output.replace(" ɴ", "ɴ")
+
+    if ' neɽitai ' in output:
+        output = output.replace(' neɽitai ', "naɽitai")
+
+    if 'harɯdʑisama' in output:
+        output = output.replace('harɯdʑisama', "arɯdʑisama")
+
+
+    if "ki ni ɕinai" in output:
+        output = re.sub(r'(?<!\s)ki ni ɕinai', r' ki ni ɕinai', output)
+
+    if 'ʔt' in output:
+        output = re.sub(r'(?<!\s)ʔt', r'ʔt', output)  # NOTE: no-op as written, the replacement equals the match
+
+    if 'de aɽoɯ' in output:
+        output = re.sub(r'(?<!\s)de aɽoɯ', r' de aɽoɯ', output)
+
+
+    return output.lstrip()
+
+# def replace_repeating_patterns(text):
+#     def replace_repeats(match):
+#         pattern = match.group(1)
+#         if len(match.group(0)) // len(pattern) >= 3:
+#             return pattern + "~~~"
+#         return match.group(0)
+
+#     # Pattern for space-separated repeats
+#     pattern1 = r'((?:\S+\s+){1,5}?)(?:\1){2,}'
+#     # Pattern for continuous repeats without spaces
+#     pattern2 = r'(.+?)\1{2,}'
+
+#     text = re.sub(pattern1, replace_repeats, text)
+#     text = re.sub(pattern2, replace_repeats, text)
+#     return text
+
+
+def replace_repeating_a(output):
+    # Define patterns and their replacements (applied in order)
+    patterns = [
+        (r'(aː)\s*\1+\s*', r'\1~'),  # Replace a repeated "aː" ("aː aː") with "aː~"
+        (r'(aːa)\s*aː', r'\1~'),  # Replace "aːa aː" with "aːa~" (the next rule turns it into "aː~~")
+        (r'aːa', r'aː~'),  # Replace "aːa" with "aː~"
+        (r'naː\s*aː', r'naː~'),  # Replace "naː aː" with "naː~"
+        (r'(oː)\s*\1+\s*', r'\1~'),  # Replace a repeated "oː" ("oː oː") with "oː~"
+        (r'(oːo)\s*oː', r'\1~'),  # Replace "oːo oː" with "oːo~" (the next rule turns it into "oː~~")
+        (r'oːo', r'oː~'),  # Replace "oːo" with "oː~"
+        (r'(eː)\s*\1+\s*', r'\1~'),
+        (r'(e)\s*\1+\s*', r'\1~'),
+        (r'(eːe)\s*eː', r'\1~'),
+        (r'eːe', r'eː~'),
+        (r'neː\s*eː', r'neː~'),
+    ]
+
+
+    # Apply each pattern to the output
+    for pattern, replacement in patterns:
+        output = re.sub(pattern, replacement, output)
+
+    return output
+
+def phonemize(text):
+
+    if "っ" in text:
+        text = text.replace("っ","ʔ")
+
+    output = post_fix(process_japanese_text(text))
+    #output = text
+
+    if " ɴ" in output:
+        output = output.replace(" ɴ", "ɴ")
+    if "y" in output:
+        output = output.replace("y", "j")
+    if "ɯa" in output:
+        output = output.replace("ɯa", "wa")
+
+    if "a aː" in output:
+        output = output.replace("a aː","a~")
+    if "a a" in output:
+        output = output.replace("a a","a~")
+
+
+
+
+
+    output = replace_repeating_a((output))
+    output = re.sub(r'\s+~', '~', output)
+
+    if "oː~o oː~ o" in output:
+        output = output.replace("oː~o oː~ o","oː~~~~~~")
+    if "aː~aː" in output:
+        output = output.replace("aː~aː","aː~~~")
+    if "oɴ naː" in output:
+        output = output.replace("oɴ naː","onnaː")
+    if "aː~~ aː" in output:
+        output = output.replace("aː~~ aː","aː~~~~")
+    if "oː~o" in output:
+        output = output.replace("oː~o","oː~~")
+    if "oː~~o o" in output:
+        output = output.replace("oː~~o o","oː~~~~") # yeah I'm too tired to learn regex how did you know
+
+    output = random_space_fix(output)
+    output = random_sym_fix(output) # fixing some symbols, if they have a specific white space such as miku& sakura -> miku ando sakura
+    output = random_sym_fix_no_space(output) # same as above but for those without white space such as miku&sakura -> miku ando sakura
+
+    return output.lstrip()
+
+# def process_row(row):
+#     return {'phonemes': [phonemize(word) for word in row['phonemes']]}
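`replace_repeating_a` in phon.py lists one rule per vowel, and the commented-out `replace_repeating_patterns` points toward a more general rule. The following is a minimal sketch of that general idea, assuming whitespace-separated tokens. The function `collapse_token_runs`, its `min_repeats` and `marker` parameters, and the choice to emit a single `~` are hypothetical and not the author's method.

```python
import re


# Hypothetical generalisation, not part of phon.py: collapse a run of identical
# whitespace-separated tokens into one token followed by a marker.
def collapse_token_runs(text: str, min_repeats: int = 3, marker: str = "~") -> str:
    # (?<!\S) and (?!\S) anchor the match on whole tokens; the backreference \1
    # must occur at least (min_repeats - 1) more times after the first token.
    pattern = re.compile(r"(?<!\S)(\S+)(?:\s+\1){%d,}(?!\S)" % (min_repeats - 1))
    return pattern.sub(lambda m: m.group(1) + marker, text)


print(collapse_token_runs("kʲaː aː aː aː aː aː"))
print(collapse_token_runs("itai itai"))
```

A run shorter than `min_repeats` is left unchanged, and a token with trailing punctuation (for example `aː.`) is treated as a different token, so punctuation would have to be handled before or after this step.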
cotlet/sanity_check.py
ADDED
@@ -0,0 +1,46 @@
+import csv
+import wave
+import os
+from tqdm import tqdm
+def verify_wav_file(file_path):
+    try:
+        with wave.open(file_path, 'rb') as wav_file:
+            # Try to read some basic properties
+            channels = wav_file.getnchannels()
+            sample_width = wav_file.getsampwidth()
+            framerate = wav_file.getframerate()
+            frames = wav_file.getnframes()
+
+            # If we got here, the file is likely valid
+            return True
+    except Exception as e:
+        print(f"Error processing {file_path}: {str(e)}")
+        return False
+
+def main():
+    csv_path = "/home/austin/disk1/stts-zs_cleaning/data/filename.csv"
+    total_files = 0
+    valid_files = 0
+
+    with open(csv_path, 'r', encoding='utf-8', newline='') as csv_file:
+        csv_reader = csv.reader(csv_file, delimiter='|')
+        for row in tqdm(csv_reader,desc="Verifying files", unit="file"):
+            if row:  # Check if the row is not empty
+                wav_path = row[0]
+                total_files += 1
+
+                if os.path.exists(wav_path):
+                    if verify_wav_file(wav_path):
+                        valid_files += 1
+                    else:
+                        print(f"File is corrupted or invalid: {wav_path}")
+                else:
+                    print(f"File does not exist: {wav_path}")
+
+    print("\nVerification completed.")
+    print(f"Total files checked: {total_files}")
+    print(f"Valid files: {valid_files}")
+    print(f"Invalid or missing files: {total_files - valid_files}")
+
+if __name__ == "__main__":
+    main()
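`verify_wav_file` in sanity_check.py reads the channel count, sample width, frame rate, and frame count, but uses them only as a readability check. As a small sketch, the same standard-library `wave` calls could also report each clip's duration, which may help spot clips that are too short or too long for their transcript. The helper `wav_duration_seconds` and the demo file are hypothetical and not part of the commit.

```python
import os
import tempfile
import wave


# Hypothetical helper, not part of sanity_check.py.
def wav_duration_seconds(file_path: str) -> float:
    with wave.open(file_path, "rb") as wav_file:
        framerate = wav_file.getframerate()
        frames = wav_file.getnframes()
    # Duration is the number of frames divided by frames per second.
    return frames / framerate


# Demo: write one second of 16-bit mono silence at 16 kHz, then measure it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "silence.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    duration = wav_duration_seconds(demo_path)

print(duration)
```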
cotlet/utils.py
ADDED
@@ -0,0 +1,1003 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
|
| 2 |
+
import re
|
| 3 |
+
import cutlet
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
formal_to_informal = {
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
'ワタクシ': 'わたし',
|
| 11 |
+
'チカコ':'しゅうこ',
|
| 12 |
+
"タノヒト":"ほかのひと",
|
| 13 |
+
|
| 14 |
+
# Add more mappings as needed
|
| 15 |
+
}
|
| 16 |
+
|
| 17 |
+
formal_to_informal2 = {
|
| 18 |
+
|
| 19 |
+
"たのひと":"ほかのひと",
|
| 20 |
+
"すうは": "かずは",
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
# Add more mappings as needed
|
| 24 |
+
}
|
| 25 |
+
|
| 26 |
+
formal_to_informal3 = {
|
| 27 |
+
|
| 28 |
+
"%":"%",
|
| 29 |
+
"@": "あっとさいん",
|
| 30 |
+
"$":"どる",
|
| 31 |
+
"#":"はっしゅたぐ",
|
| 32 |
+
"$":"どる",
|
| 33 |
+
"#":"はっしゅたぐ",
|
| 34 |
+
"何が":"なにが",
|
| 35 |
+
|
| 36 |
+
"何も":"なにも",
|
| 37 |
+
"何か":"なにか",
|
| 38 |
+
# "奏":"かなで",
|
| 39 |
+
"何は":"なには",
|
| 40 |
+
"お父様":"おとうさま",
|
| 41 |
+
"お兄様":"おにいさま",
|
| 42 |
+
"何を":"なにを",
|
| 43 |
+
"良い":"いい",
|
| 44 |
+
"李衣菜":"りいな",
|
| 45 |
+
"志希":"しき",
|
| 46 |
+
"種":"たね",
|
| 47 |
+
"方々":"かたがた",
|
| 48 |
+
"颯":"はやて",
|
| 49 |
+
"茄子さん":"かこさん",
|
| 50 |
+
"茄子ちゃん":"かこちゃん",
|
| 51 |
+
"涼ちゃん":"りょうちゃん",
|
| 52 |
+
"涼さん":"りょうさん",
|
| 53 |
+
"紗枝":"さえ",
|
| 54 |
+
"文香":"ふみか",
|
| 55 |
+
"私":"わたし",
|
| 56 |
+
"周子":"しゅうこ",
|
| 57 |
+
"イェ":"いえ",
|
| 58 |
+
"可憐":"かれん",
|
| 59 |
+
"加蓮":"かれん",
|
| 60 |
+
"・":".",
|
| 61 |
+
"方の":"かたの",
|
| 62 |
+
"気に":"きに",
|
| 63 |
+
"唯さん":"ゆいさん",
|
| 64 |
+
"唯ちゃん":"ゆいちゃん",
|
| 65 |
+
"聖ちゃん":"ひじりちゃん",
|
| 66 |
+
"他の":"ほかの",
|
| 67 |
+
"他に":"ほかに",
|
| 68 |
+
"一生懸命":"いっしょうけんめい",
|
| 69 |
+
"楓さん":"かえでさん",
|
| 70 |
+
"楓ちゃん":"かえでちゃん",
|
| 71 |
+
"内から":"ないから",
|
| 72 |
+
"の下で":"のしたで",
|
| 73 |
+
|
| 74 |
+
}
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
mapper = dict([
|
| 78 |
+
|
| 79 |
+
("仕方","しかた"),
|
| 80 |
+
("明日","あした"),
|
| 81 |
+
('私',"わたし"),
|
| 82 |
+
("従妹","いとこ"),
|
| 83 |
+
|
| 84 |
+
("1人","ひとり"),
|
| 85 |
+
("2人","ふたり"),
|
| 86 |
+
|
| 87 |
+
("一期","いちご"),
|
| 88 |
+
("一会","いちえ"),
|
| 89 |
+
|
| 90 |
+
("♪","!"),
|
| 91 |
+
("?","?"),
|
| 92 |
+
|
| 93 |
+
("どんな方","どんなかた"),
|
| 94 |
+
("ふたり暮らし","ふたりぐらし"),
|
| 95 |
+
|
| 96 |
+
("新年","しんねん"),
|
| 97 |
+
("来年","らいねん"),
|
| 98 |
+
("去年","きょねん"),
|
| 99 |
+
("壮年","そうねん"),
|
| 100 |
+
("今年","ことし"),
|
| 101 |
+
|
| 102 |
+
("昨年","さくねん"),
|
| 103 |
+
("本年","ほんねん"),
|
| 104 |
+
("平年","へいねん"),
|
| 105 |
+
("閏年","うるうどし"),
|
| 106 |
+
("初年","しょねん"),
|
| 107 |
+
("少年","しょうねん"),
|
| 108 |
+
("多年","たねん"),
|
| 109 |
+
("青年","せいねん"),
|
| 110 |
+
("中年","ちゅうねん"),
|
| 111 |
+
("老年","ろうねん"),
|
| 112 |
+
("成年","せいねん"),
|
| 113 |
+
("幼年","ようねん"),
|
| 114 |
+
("前年","ぜんねん"),
|
| 115 |
+
("元年","がんねん"),
|
| 116 |
+
("経年","けいねん"),
|
| 117 |
+
("当年","とうねん"),
|
| 118 |
+
|
| 119 |
+
("明年","みょうねん"),
|
| 120 |
+
("歳年","さいねん"),
|
| 121 |
+
("数年","すうねん"),
|
| 122 |
+
("半年","はんとし"),
|
| 123 |
+
("後年","こうねん"),
|
| 124 |
+
("実年","じつねん"),
|
| 125 |
+
("年年","ねんねん"),
|
| 126 |
+
("連年","れんねん"),
|
| 127 |
+
("暦年","れきねん"),
|
| 128 |
+
("各年","かくねん"),
|
| 129 |
+
("全年","ぜんねん"),
|
| 130 |
+
|
| 131 |
+
("年を","としを"),
|
| 132 |
+
("年が","としが"),
|
| 133 |
+
("年も","としも"),
|
| 134 |
+
("年は","としは"),
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
("奏ちゃん","かなでちゃん"),
|
| 138 |
+
("負けず嫌い","まけずぎらい"),
|
| 139 |
+
("貴方","あなた"),
|
| 140 |
+
("貴女","あなた"),
|
| 141 |
+
("貴男","あなた"),
|
| 142 |
+
|
| 143 |
+
("その節","そのせつ"),
|
| 144 |
+
|
| 145 |
+
("何し","なにし"),
|
| 146 |
+
("何する","なにする"),
|
| 147 |
+
|
| 148 |
+
("心さん","しんさん"),
|
| 149 |
+
("心ちゃん","しんちゃん"),
|
| 150 |
+
|
| 151 |
+
("乃々","のの"),
|
| 152 |
+
|
| 153 |
+
("身体の","からだの"),
|
| 154 |
+
("身体が","からだが"),
|
| 155 |
+
("身体を","からだを"),
|
| 156 |
+
("身体は","からだは"),
|
| 157 |
+
("身体に","からだに"),
|
| 158 |
+
("正念場","しょうねんば"),
|
| 159 |
+
("言う","いう"),
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
("一回","いっかい"),
|
| 163 |
+
("一曲","いっきょく"),
|
| 164 |
+
("一日","いちにち"),
|
| 165 |
+
("一言","ひとこと"),
|
| 166 |
+
("一杯","いっぱい"),
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
("方が","ほうが"),
|
| 170 |
+
("縦輪城","じゅうりんしろ"),
|
| 171 |
+
("深息","しんそく"),
|
| 172 |
+
("家人","かじん"),
|
| 173 |
+
("お返し","おかえし"),
|
| 174 |
+
("化物語","ばけものがたり"),
|
| 175 |
+
("阿良々木暦","あららぎこよみ"),
|
| 176 |
+
("何より","なにより")
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
])
|
| 180 |
+
|
| 181 |
+
|
| 182 |
+
# Merge all dictionaries into one
|
| 183 |
+
all_transformations = {**formal_to_informal, **formal_to_informal2, **formal_to_informal3, **mapper}
|
| 184 |
+
|
| 185 |
+
def apply_transformations(text, transformations = all_transformations):
    # Apply every replacement in dict insertion order; earlier rules can affect later ones.
    for key, value in transformations.items():
        text = text.replace(key, value)
    return text
|
| 189 |
+
|
| 190 |
+
|
| 191 |
+
def number_to_japanese(num):
    # Only integers from 0 to 9999 are supported.
    if not isinstance(num, int) or num < 0 or num > 9999:
        return "Invalid input"

    digits = ["", "いち", "に", "さん", "よん", "ご", "ろく", "なな", "はち", "きゅう"]
    tens = ["", "じゅう", "にじゅう", "さんじゅう", "よんじゅう", "ごじゅう", "ろくじゅう", "ななじゅう", "はちじゅう", "きゅうじゅう"]
    hundreds = ["", "ひゃく", "にひゃく", "さんびゃく", "よんひゃく", "ごひゃく", "ろっぴゃく", "ななひゃく", "はっぴゃく", "きゅうひゃく"]
    thousands = ["", "せん", "にせん", "さんぜん", "よんせん", "ごせん", "ろくせん", "ななせん", "はっせん", "きゅうせん"]

    if num == 0:
        return "ゼロ"

    # Build the reading from the most significant digit down.
    result = ""
    if num >= 1000:
        result += thousands[num // 1000]
        num %= 1000
    if num >= 100:
        result += hundreds[num // 100]
        num %= 100
    if num >= 10:
        result += tens[num // 10]
        num %= 10
    if num > 0:
        result += digits[num]

    return result
|
| 217 |
+
|
| 218 |
+
def convert_numbers_in_string(input_string):
    # Regular expression to find numbers in the string
    number_pattern = re.compile(r'\d+')

    # Function to replace numbers with their Japanese pronunciation
    def replace_with_japanese(match):
        num = int(match.group())
        return number_to_japanese(num)

    # Replace all occurrences of numbers in the string
    converted_string = number_pattern.sub(replace_with_japanese, input_string)
    return converted_string
|
| 230 |
+
|
| 231 |
+
|
| 232 |
+
|
| 233 |
+
roma_mapper = dict([
|
| 234 |
+
|
| 235 |
+
################################
|
| 236 |
+
|
| 237 |
+
("my","mʲ"),
|
| 238 |
+
("by","bʲ"),
|
| 239 |
+
("ny","nʲ"),
|
| 240 |
+
("ry","rʲ"),
|
| 241 |
+
("si","sʲ"),
|
| 242 |
+
("ky","kʲ"),
|
| 243 |
+
("gy","gʲ"),
|
| 244 |
+
("dy","dʲ"),
|
| 245 |
+
("di","dʲ"),
|
| 246 |
+
("fi","fʲ"),
|
| 247 |
+
("fy","fʲ"),
|
| 248 |
+
("ch","tɕ"),
|
| 249 |
+
("sh","ɕ"),
|
| 250 |
+
|
| 251 |
+
################################
|
| 252 |
+
|
| 253 |
+
("a","a"),
|
| 254 |
+
("i","i"),
|
| 255 |
+
("u","ɯ"),
|
| 256 |
+
("e","e"),
|
| 257 |
+
("o","o"),
|
| 258 |
+
("ka","ka"),
|
| 259 |
+
("ki","ki"),
|
| 260 |
+
("ku","kɯ"),
|
| 261 |
+
("ke","ke"),
|
| 262 |
+
("ko","ko"),
|
| 263 |
+
("sa","sa"),
|
| 264 |
+
("shi","ɕi"),
|
| 265 |
+
("su","sɯ"),
|
| 266 |
+
("se","se"),
|
| 267 |
+
("so","so"),
|
| 268 |
+
("ta","ta"),
|
| 269 |
+
("chi","tɕi"),
|
| 270 |
+
("tsu","tsɯ"),
|
| 271 |
+
("te","te"),
|
| 272 |
+
("to","to"),
|
| 273 |
+
("na","na"),
|
| 274 |
+
("ni","ni"),
|
| 275 |
+
("nu","nɯ"),
|
| 276 |
+
("ne","ne"),
|
| 277 |
+
("no","no"),
|
| 278 |
+
("ha","ha"),
|
| 279 |
+
("hi","çi"),
|
| 280 |
+
("fu","ɸɯ"),
|
| 281 |
+
("he","he"),
|
| 282 |
+
("ho","ho"),
|
| 283 |
+
("ma","ma"),
|
| 284 |
+
("mi","mi"),
|
| 285 |
+
("mu","mɯ"),
|
| 286 |
+
("me","me"),
|
| 287 |
+
("mo","mo"),
|
| 288 |
+
("ra","ɽa"),
|
| 289 |
+
("ri","ɽi"),
|
| 290 |
+
("ru","ɽɯ"),
|
| 291 |
+
("re","ɽe"),
|
| 292 |
+
("ro","ɽo"),
|
| 293 |
+
("ga","ga"),
|
| 294 |
+
("gi","gi"),
|
| 295 |
+
("gu","gɯ"),
|
| 296 |
+
("ge","ge"),
|
| 297 |
+
("go","go"),
|
| 298 |
+
("za","za"),
|
| 299 |
+
("ji","dʑi"),
|
| 300 |
+
("zu","zɯ"),
|
| 301 |
+
("ze","ze"),
|
| 302 |
+
("zo","zo"),
|
| 303 |
+
("da","da"),
|
| 304 |
+
|
| 305 |
+
|
| 306 |
+
("zu","zɯ"),
|
| 307 |
+
("de","de"),
|
| 308 |
+
("do","do"),
|
| 309 |
+
("ba","ba"),
|
| 310 |
+
("bi","bi"),
|
| 311 |
+
("bu","bɯ"),
|
| 312 |
+
("be","be"),
|
| 313 |
+
("bo","bo"),
|
| 314 |
+
("pa","pa"),
|
| 315 |
+
("pi","pi"),
|
| 316 |
+
("pu","pɯ"),
|
| 317 |
+
("pe","pe"),
|
| 318 |
+
("po","po"),
|
| 319 |
+
("ya","ja"),
|
| 320 |
+
("yu","jɯ"),
|
| 321 |
+
("yo","jo"),
|
| 322 |
+
("wa","wa"),
|
| 323 |
+
|
| 324 |
+
|
| 325 |
+
|
| 326 |
+
|
| 327 |
+
("a","a"),
|
| 328 |
+
("i","i"),
|
| 329 |
+
("u","ɯ"),
|
| 330 |
+
("e","e"),
|
| 331 |
+
("o","o"),
|
| 332 |
+
("wa","wa"),
|
| 333 |
+
("o","o"),
|
| 334 |
+
|
| 335 |
+
|
| 336 |
+
("wo","o")])
|
| 337 |
+
|
| 338 |
+
nasal_sound = dict([
|
| 339 |
+
# before m, p, b
|
| 340 |
+
("ɴm","mm"),
|
| 341 |
+
("ɴb", "mb"),
|
| 342 |
+
("ɴp", "mp"),
|
| 343 |
+
|
| 344 |
+
# before k, g
|
| 345 |
+
("ɴk","ŋk"),
|
| 346 |
+
("ɴg", "ŋg"),
|
| 347 |
+
|
| 348 |
+
# before t, d, n, s, z, ɽ
|
| 349 |
+
("ɴt","nt"),
|
| 350 |
+
("ɴd", "nd"),
|
| 351 |
+
("ɴn","nn"),
|
| 352 |
+
("ɴs", "ns"),
|
| 353 |
+
("ɴz","nz"),
|
| 354 |
+
("ɴɽ", "nɽ"),
|
| 355 |
+
|
| 356 |
+
("ɴɲ", "ɲɲ"),
|
| 357 |
+
|
| 358 |
+
])
|
| 359 |
+
|
| 360 |
+
def Roma2IPA(text):
    # Apply the romaji-to-IPA table in insertion order (digraphs first, then single morae).
    for k, v in roma_mapper.items():
        text = text.replace(k, v)

    return text
|
| 367 |
+
|
| 368 |
+
def nasal_mapper(text):
    # Assimilate the moraic nasal ɴ to the place of articulation of the following consonant.
    for k, v in nasal_sound.items():
        text = text.replace(k, v)

    return text
|
| 376 |
+
|
| 377 |
+
def alphabetreading(text):
    # Spell out Latin letters with their katakana letter names.
    alphabet_dict = {"A": "エイ",
                     "B": "ビー",
                     "C": "シー",
                     "D": "ディー",
                     "E": "イー",
                     "F": "エフ",
                     "G": "ジー",
                     "H": "エイチ",
                     "I":"アイ",
                     "J":"ジェイ",
                     "K":"ケイ",
                     "L":"エル",
                     "M":"エム",
                     "N":"エヌ",
                     "O":"オー",
                     "P":"ピー",
                     "Q":"キュー",
                     "R":"アール",
                     "S":"エス",
                     "T":"ティー",
                     "U":"ユー",
                     "V":"ヴィー",
                     "W":"ダブリュー",
                     "X":"エックス",
                     "Y":"ワイ",
                     "Z":"ゼッド"}
    text = text.upper()
    text_ret = ""
    for t in text:
        if t in alphabet_dict:
            text_ret += alphabet_dict[t]
        else:
            # Characters without a letter name are passed through unchanged.
            text_ret += t
    return text_ret
|
| 412 |
+
|
| 413 |
+
|
| 414 |
+
roma_mapper_plus_2 = {
|
| 415 |
+
|
| 416 |
+
"bjo":'bʲo',
|
| 417 |
+
"rjo":"rʲo",
|
| 418 |
+
"kjo":"kʲo",
|
| 419 |
+
"kyu":"kʲu",
|
| 420 |
+
|
| 421 |
+
}
|
| 422 |
+
|
| 423 |
+
def replace_repeated_chars(input_string):
    # Collapse a doubled vowel (e.g. "aa") into the vowel plus a length mark ("aː").
    result = []
    i = 0
    while i < len(input_string):
        if i + 1 < len(input_string) and input_string[i] == input_string[i + 1] and input_string[i] in 'aiueo':
            result.append(input_string[i] + 'ː')
            i += 2
        else:
            result.append(input_string[i])
            i += 1
    return ''.join(result)
|
| 434 |
+
|
| 435 |
+
|
| 436 |
+
def replace_chars_2(text, mapping=roma_mapper_plus_2):
    # Try longer keys first so that they are not shadowed by shorter ones.
    sorted_keys = sorted(mapping.keys(), key=len, reverse=True)

    pattern = '|'.join(re.escape(key) for key in sorted_keys)

    def replace(match):
        key = match.group(0)
        return mapping.get(key, key)

    # A single regex pass: replaced text is not scanned again.
    return re.sub(pattern, replace, text)
|
| 449 |
+
|
| 450 |
+
|
| 451 |
+
def replace_tashdid_2(s):
    # Despite its name, this set also holds digits and punctuation that must not be treated as geminates.
    vowels = 'aiueoɯ0123456789.?!_。؟?!...@@##$$%%^^&&**()()_+=[「」]></\`~~―ー∺"'
    result = []

    i = 0
    while i < len(s):
        # Same consonant on both sides of a single space: replace the first copy and the space with ʔ.
        if i < len(s) - 2 and s[i].lower() == s[i + 2].lower() and s[i].lower() not in vowels and s[i + 1] == ' ':
            result.append('ʔ')
            result.append(s[i + 2])
            i += 3
        # Doubled consonant: replace the first copy with ʔ.
        elif i < len(s) - 1 and s[i].lower() == s[i + 1].lower() and s[i].lower() not in vowels:
            result.append('ʔ')
            result.append(s[i + 1])
            i += 2
        else:
            result.append(s[i])
            i += 1

    return ''.join(result)
|
| 470 |
+
|
| 471 |
+
def replace_tashdid(input_string):
    result = []
    i = 0
    while i < len(input_string):
        if i + 1 < len(input_string) and input_string[i] == input_string[i + 1] and input_string[i] not in 'aiueo':
            result.append('ʔ')
            result.append(input_string[i])
            i += 2  # Skip the next character as it is already processed
        else:
            result.append(input_string[i])
            i += 1
    return ''.join(result)
|
| 483 |
+
|
| 484 |
+
def hira2ipa(text, roma_mapper=roma_mapper):
    keys_set = set(roma_mapper.keys())
    special_rule = ("n", "ɴ")

    transformed_text = []
    i = 0

    while i < len(text):
        if text[i] == special_rule[0]:
            # An "n" that ends the text, or is not followed by a single-character
            # key of roma_mapper (the vowels), becomes the moraic nasal ɴ.
            if i + 1 == len(text) or text[i + 1] not in keys_set:
                transformed_text.append(special_rule[1])
            else:
                transformed_text.append(text[i])
        else:
            transformed_text.append(text[i])

        i += 1

    return ''.join(transformed_text)
|
| 503 |
+
|
| 504 |
+
k_mapper = dict([
|
| 505 |
+
("ゔぁ","ba"),
|
| 506 |
+
("ゔぃ","bi"),
|
| 507 |
+
("ゔぇ","be"),
|
| 508 |
+
("ゔぉ","bo"),
|
| 509 |
+
("ゔゃ","bʲa"),
|
| 510 |
+
("ゔゅ","bʲɯ"),
|
| 511 |
+
("ゔゃ","bʲa"),
|
| 512 |
+
("ゔょ","bʲo"),
|
| 513 |
+
|
| 514 |
+
("ゔ","bɯ"),
|
| 515 |
+
|
| 516 |
+
("あぁ"," aː"),
|
| 517 |
+
("いぃ"," iː"),
|
| 518 |
+
("いぇ"," je"),
|
| 519 |
+
("いゃ"," ja"),
|
| 520 |
+
("うぅ"," ɯː"),
|
| 521 |
+
("えぇ"," eː"),
|
| 522 |
+
("おぉ"," oː"),
|
| 523 |
+
("かぁ"," kaː"),
|
| 524 |
+
("きぃ"," kiː"),
|
| 525 |
+
("くぅ","kɯː"),
|
| 526 |
+
("くゃ","kʲa"),
|
| 527 |
+
("くゅ","kʲɯ"),
|
| 528 |
+
("くょ","kʲo"),
|
| 529 |
+
("けぇ","keː"),
|
| 530 |
+
("こぉ","koː"),
|
| 531 |
+
("がぁ","gaː"),
|
| 532 |
+
("ぎぃ","giː"),
|
| 533 |
+
("ぐぅ","gɯː"),
|
| 534 |
+
("ぐゃ","gʲa"),
|
| 535 |
+
("ぐゅ","gʲɯ"),
|
| 536 |
+
("ぐょ","gʲo"),
|
| 537 |
+
("げぇ","geː"),
|
| 538 |
+
("ごぉ","goː"),
|
| 539 |
+
("さぁ","saː"),
|
| 540 |
+
("しぃ","ɕiː"),
|
| 541 |
+
("すぅ","sɯː"),
|
| 542 |
+
("すゃ","sʲa"),
|
| 543 |
+
("すゅ","sʲɯ"),
|
| 544 |
+
("すょ","sʲo"),
|
| 545 |
+
("せぇ","seː"),
|
| 546 |
+
("そぉ","soː"),
|
| 547 |
+
("ざぁ","zaː"),
|
| 548 |
+
("じぃ","dʑiː"),
|
| 549 |
+
("ずぅ","zɯː"),
|
| 550 |
+
("ずゃ","zʲa"),
|
| 551 |
+
("ずゅ","zʲɯ"),
|
| 552 |
+
("ずょ","zʲo"),
|
| 553 |
+
("ぜぇ","zeː"),
|
| 554 |
+
("ぞぉ","zoː"),
|
| 555 |
+
("たぁ","taː"),
|
| 556 |
+
("ちぃ","tɕiː"),
|
| 557 |
+
("つぁ","tsa"),
|
| 558 |
+
("つぃ","tsi"),
|
| 559 |
+
("つぅ","tsɯː"),
|
| 560 |
+
("つゃ","tɕa"),
|
| 561 |
+
("つゅ","tɕɯ"),
|
| 562 |
+
("つょ","tɕo"),
|
| 563 |
+
("つぇ","tse"),
|
| 564 |
+
("つぉ","tso"),
|
| 565 |
+
("てぇ","teː"),
|
| 566 |
+
("とぉ","toː"),
|
| 567 |
+
("だぁ","daː"),
|
| 568 |
+
("ぢぃ","dʑiː"),
|
| 569 |
+
("づぅ","dɯː"),
|
| 570 |
+
("づゃ","zʲa"),
|
| 571 |
+
("づゅ","zʲɯ"),
|
| 572 |
+
("づょ","zʲo"),
|
| 573 |
+
("でぇ","deː"),
|
| 574 |
+
("どぉ","doː"),
|
| 575 |
+
("なぁ","naː"),
|
| 576 |
+
("にぃ","niː"),
|
| 577 |
+
("ぬぅ","nɯː"),
|
| 578 |
+
("ぬゃ","nʲa"),
|
| 579 |
+
("ぬゅ","nʲɯ"),
|
| 580 |
+
("ぬょ","nʲo"),
|
| 581 |
+
("ねぇ","neː"),
|
| 582 |
+
("のぉ","noː"),
|
| 583 |
+
("はぁ","haː"),
|
| 584 |
+
("ひぃ","çiː"),
|
| 585 |
+
("ふぅ","ɸɯː"),
|
| 586 |
+
("ふゃ","ɸʲa"),
|
| 587 |
+
("ふゅ","ɸʲɯ"),
|
| 588 |
+
("ふょ","ɸʲo"),
|
| 589 |
+
("へぇ","heː"),
|
| 590 |
+
("ほぉ","hoː"),
|
| 591 |
+
("ばぁ","baː"),
|
| 592 |
+
("びぃ","biː"),
|
| 593 |
+
("ぶぅ","bɯː"),
|
| 594 |
+
("ぶゃ","bʲa"),
|
| 595 |
+
("ぶゅ","bʲɯ"),
|
| 596 |
+
("ぶょ","bʲo"),
|
| 597 |
+
("べぇ","beː"),
|
| 598 |
+
("ぼぉ","boː"),
|
| 599 |
+
("ぱぁ","paː"),
|
| 600 |
+
("ぴぃ","piː"),
|
| 601 |
+
("ぷぅ","pɯː"),
|
| 602 |
+
("ぷゃ","pʲa"),
|
| 603 |
+
("ぷゅ","pʲɯ"),
|
| 604 |
+
("ぷょ","pʲo"),
|
| 605 |
+
("ぺぇ","peː"),
|
| 606 |
+
("ぽぉ","poː"),
|
| 607 |
+
("まぁ","maː"),
|
| 608 |
+
("みぃ","miː"),
|
| 609 |
+
("むぅ","mɯː"),
|
| 610 |
+
("むゃ","mʲa"),
|
| 611 |
+
("むゅ","mʲɯ"),
|
| 612 |
+
("むょ","mʲo"),
|
| 613 |
+
("めぇ","meː"),
|
| 614 |
+
("もぉ","moː"),
|
| 615 |
+
("やぁ","jaː"),
|
| 616 |
+
("ゆぅ","jɯː"),
|
| 617 |
+
("ゆゃ","jaː"),
|
| 618 |
+
("ゆゅ","jɯː"),
|
| 619 |
+
("ゆょ","joː"),
|
| 620 |
+
("よぉ","joː"),
|
| 621 |
+
("らぁ","ɽaː"),
|
| 622 |
+
("りぃ","ɽiː"),
|
| 623 |
+
("るぅ","ɽɯː"),
|
| 624 |
+
("るゃ","ɽʲa"),
|
| 625 |
+
("るゅ","ɽʲɯ"),
|
| 626 |
+
("るょ","ɽʲo"),
|
| 627 |
+
("れぇ","ɽeː"),
|
| 628 |
+
("ろぉ","ɽoː"),
|
| 629 |
+
("わぁ","ɯaː"),
|
| 630 |
+
("をぉ","oː"),
|
| 631 |
+
|
| 632 |
+
("う゛","bɯ"),
|
| 633 |
+
("でぃ","di"),
|
| 634 |
+
("でぇ","deː"),
|
| 635 |
+
("でゃ","dʲa"),
|
| 636 |
+
("でゅ","dʲɯ"),
|
| 637 |
+
("でょ","dʲo"),
|
| 638 |
+
("てぃ","ti"),
|
| 639 |
+
("てぇ","teː"),
|
| 640 |
+
("てゃ","tʲa"),
|
| 641 |
+
("てゅ","tʲɯ"),
|
| 642 |
+
("てょ","tʲo"),
|
| 643 |
+
("すぃ","si"),
|
| 644 |
+
("ずぁ","zɯa"),
|
| 645 |
+
("ずぃ","zi"),
|
| 646 |
+
("ずぅ","zɯ"),
|
| 647 |
+
("ずゃ","zʲa"),
|
| 648 |
+
("ずゅ","zʲɯ"),
|
| 649 |
+
("ずょ","zʲo"),
|
| 650 |
+
("ずぇ","ze"),
|
| 651 |
+
("ずぉ","zo"),
|
| 652 |
+
("きゃ","kʲa"),
|
| 653 |
+
("きゅ","kʲɯ"),
|
| 654 |
+
("きょ","kʲo"),
|
| 655 |
+
("しゃ","ɕʲa"),
|
| 656 |
+
("しゅ","ɕʲɯ"),
|
| 657 |
+
("しぇ","ɕʲe"),
|
| 658 |
+
("しょ","ɕʲo"),
|
| 659 |
+
("ちゃ","tɕa"),
|
| 660 |
+
("ちゅ","tɕɯ"),
|
| 661 |
+
("ちぇ","tɕe"),
|
| 662 |
+
("ちょ","tɕo"),
|
| 663 |
+
("とぅ","tɯ"),
|
| 664 |
+
("とゃ","tʲa"),
|
| 665 |
+
("とゅ","tʲɯ"),
|
| 666 |
+
("とょ","tʲo"),
|
| 667 |
+
("どぁ","doa"),
|
| 668 |
+
("どぅ","dɯ"),
|
| 669 |
+
("どゃ","dʲa"),
|
| 670 |
+
("どゅ","dʲɯ"),
|
| 671 |
+
("どょ","dʲo"),
|
| 672 |
+
("どぉ","doː"),
|
| 673 |
+
("にゃ","nʲa"),
|
| 674 |
+
("にゅ","nʲɯ"),
|
| 675 |
+
("にょ","nʲo"),
|
| 676 |
+
("ひゃ","çʲa"),
|
| 677 |
+
("ひゅ","çʲɯ"),
|
| 678 |
+
("ひょ","çʲo"),
|
| 679 |
+
("みゃ","mʲa"),
|
| 680 |
+
("みゅ","mʲɯ"),
|
| 681 |
+
("みょ","mʲo"),
|
| 682 |
+
("りゃ","ɽʲa"),
|
| 683 |
+
("りぇ","ɽʲe"),
|
| 684 |
+
("りゅ","ɽʲɯ"),
|
| 685 |
+
("りょ","ɽʲo"),
|
| 686 |
+
("ぎゃ","gʲa"),
|
| 687 |
+
("ぎゅ","gʲɯ"),
|
| 688 |
+
("ぎょ","gʲo"),
|
| 689 |
+
("ぢぇ","dʑe"),
|
| 690 |
+
("ぢゃ","dʑa"),
|
| 691 |
+
("ぢゅ","dʑɯ"),
|
| 692 |
+
("ぢょ","dʑo"),
|
| 693 |
+
("じぇ","dʑe"),
|
| 694 |
+
("じゃ","dʑa"),
|
| 695 |
+
("じゅ","dʑɯ"),
|
| 696 |
+
("じょ","dʑo"),
|
| 697 |
+
("びゃ","bʲa"),
|
| 698 |
+
("びゅ","bʲɯ"),
|
| 699 |
+
("びょ","bʲo"),
|
| 700 |
+
("ぴゃ","pʲa"),
|
| 701 |
+
("ぴゅ","pʲɯ"),
|
| 702 |
+
("ぴょ","pʲo"),
|
| 703 |
+
("うぁ","ɯa"),
|
| 704 |
+
("うぃ","ɯi"),
|
| 705 |
+
("うぇ","ɯe"),
|
| 706 |
+
("うぉ","ɯo"),
|
| 707 |
+
("うゃ","ɯʲa"),
|
| 708 |
+
("うゅ","ɯʲɯ"),
|
| 709 |
+
("うょ","ɯʲo"),
|
| 710 |
+
("ふぁ","ɸa"),
|
| 711 |
+
("ふぃ","ɸi"),
|
| 712 |
+
("ふぅ","ɸɯ"),
|
| 713 |
+
("ふゃ","ɸʲa"),
|
| 714 |
+
("ふゅ","ɸʲɯ"),
|
| 715 |
+
("ふょ","ɸʲo"),
|
| 716 |
+
("ふぇ","ɸe"),
|
| 717 |
+
("ふぉ","ɸo"),
|
| 718 |
+
|
| 719 |
+
("あ"," a"),
|
| 720 |
+
("い"," i"),
|
| 721 |
+
("う","ɯ"),
|
| 722 |
+
("え"," e"),
|
| 723 |
+
("お"," o"),
|
| 724 |
+
("か"," ka"),
|
| 725 |
+
("き"," ki"),
|
| 726 |
+
("く"," kɯ"),
|
| 727 |
+
("け"," ke"),
|
| 728 |
+
("こ"," ko"),
|
| 729 |
+
("さ"," sa"),
|
| 730 |
+
("し"," ɕi"),
|
| 731 |
+
("す"," sɯ"),
|
| 732 |
+
("せ"," se"),
|
| 733 |
+
("そ"," so"),
|
| 734 |
+
("た"," ta"),
|
| 735 |
+
("ち"," tɕi"),
|
| 736 |
+
("つ"," tsɯ"),
|
| 737 |
+
("て"," te"),
|
| 738 |
+
("と"," to"),
|
| 739 |
+
("な"," na"),
|
| 740 |
+
("に"," ni"),
|
| 741 |
+
("ぬ"," nɯ"),
|
| 742 |
+
("ね"," ne"),
|
| 743 |
+
("の"," no"),
|
| 744 |
+
("は"," ha"),
|
| 745 |
+
("ひ"," çi"),
|
| 746 |
+
("ふ"," ɸɯ"),
|
| 747 |
+
("へ"," he"),
|
| 748 |
+
("ほ"," ho"),
|
| 749 |
+
("ま"," ma"),
|
| 750 |
+
("み"," mi"),
|
| 751 |
+
("む"," mɯ"),
|
| 752 |
+
("め"," me"),
|
| 753 |
+
("も"," mo"),
|
| 754 |
+
("ら"," ɽa"),
|
| 755 |
+
("り"," ɽi"),
|
| 756 |
+
("る"," ɽɯ"),
|
| 757 |
+
("れ"," ɽe"),
|
| 758 |
+
("ろ"," ɽo"),
|
| 759 |
+
("が"," ga"),
|
| 760 |
+
("ぎ"," gi"),
|
| 761 |
+
("ぐ"," gɯ"),
|
| 762 |
+
("げ"," ge"),
|
| 763 |
+
("ご"," go"),
|
| 764 |
+
("ざ"," za"),
|
| 765 |
+
("じ"," dʑi"),
|
| 766 |
+
("ず"," zɯ"),
|
| 767 |
+
("ぜ"," ze"),
|
| 768 |
+
("ぞ"," zo"),
|
| 769 |
+
("だ"," da"),
|
| 770 |
+
("ぢ"," dʑi"),
|
| 771 |
+
("づ"," zɯ"),
|
| 772 |
+
("で"," de"),
|
| 773 |
+
("ど"," do"),
|
| 774 |
+
("ば"," ba"),
|
| 775 |
+
("び"," bi"),
|
| 776 |
+
("ぶ"," bɯ"),
|
| 777 |
+
("べ"," be"),
|
| 778 |
+
("ぼ"," bo"),
|
| 779 |
+
("ぱ"," pa"),
|
| 780 |
+
("ぴ"," pi"),
|
| 781 |
+
("ぷ"," pɯ"),
|
| 782 |
+
("ぺ"," pe"),
|
| 783 |
+
("ぽ"," po"),
|
| 784 |
+
("や"," ja"),
|
| 785 |
+
("ゆ"," jɯ"),
|
| 786 |
+
("よ"," jo"),
|
| 787 |
+
("わ"," wa"),
|
| 788 |
+
("ゐ"," i"),
|
| 789 |
+
("ゑ"," e"),
|
| 790 |
+
("ん"," ɴ"),
|
| 791 |
+
("っ"," ʔ"),
|
| 792 |
+
("ー"," ː"),
|
| 793 |
+
|
| 794 |
+
("ぁ"," a"),
|
| 795 |
+
("ぃ"," i"),
|
| 796 |
+
("ぅ"," ɯ"),
|
| 797 |
+
("ぇ"," e"),
|
| 798 |
+
("ぉ"," o"),
|
| 799 |
+
("ゎ"," ɯa"),
|
| 800 |
+
("ぉ"," o"),
|
| 801 |
+
("っ","?"),
|
| 802 |
+
|
| 803 |
+
("を","o")
|
| 804 |
+
|
| 805 |
+
])
|
| 806 |
+
|
| 807 |
+
|
| 808 |
+
def post_fix(text):
    # Convert any kana left in the text to IPA, using k_mapper in insertion order.
    for k, v in k_mapper.items():
        text = text.replace(k, v)

    return text
|
| 815 |
+
|
| 816 |
+
|
| 817 |
+
|
| 818 |
+
|
| 819 |
+
sym_ws = dict([
|
| 820 |
+
|
| 821 |
+
("$ ","dorɯ"),
|
| 822 |
+
("$ ","dorɯ"),
|
| 823 |
+
|
| 824 |
+
("〇 ","marɯ"),
|
| 825 |
+
("¥ ","eɴ"),
|
| 826 |
+
|
| 827 |
+
("# ","haʔɕɯ tagɯ"),
|
| 828 |
+
("# ","haʔɕɯ tagɯ"),
|
| 829 |
+
|
| 830 |
+
("& ","ando"),
|
| 831 |
+
("& ","ando"),
|
| 832 |
+
|
| 833 |
+
("% ","paːsento"),
|
| 834 |
+
("% ","paːsento"),
|
| 835 |
+
|
| 836 |
+
("@ ","aʔto saiɴ"),
|
| 837 |
+
("@ ","aʔto saiɴ")
|
| 838 |
+
|
| 839 |
+
|
| 840 |
+
|
| 841 |
+
])
|
| 842 |
+
|
| 843 |
+
def random_sym_fix(text): # with space
    # Replace a symbol followed by a space with its reading, padded with spaces.
    for k, v in sym_ws.items():
        text = text.replace(k, f" {v} ")

    return text
|
| 850 |
+
|
| 851 |
+
|
| 852 |
+
sym_ns = dict([
|
| 853 |
+
|
| 854 |
+
("$","dorɯ"),
|
| 855 |
+
("$","dorɯ"),
|
| 856 |
+
|
| 857 |
+
("〇","marɯ"),
|
| 858 |
+
("¥","eɴ"),
|
| 859 |
+
|
| 860 |
+
("#","haʔɕɯ tagɯ"),
|
| 861 |
+
("#","haʔɕɯ tagɯ"),
|
| 862 |
+
|
| 863 |
+
("&","ando"),
|
| 864 |
+
("&","ando"),
|
| 865 |
+
|
| 866 |
+
("%","paːsento"),
|
| 867 |
+
("%","paːsento"),
|
| 868 |
+
|
| 869 |
+
("@","aʔto saiɴ"),
|
| 870 |
+
("@","aʔto saiɴ"),
|
| 871 |
+
|
| 872 |
+
("~","—"),
|
| 873 |
+
("kʲɯɯdʑɯɯkʲɯɯ.kʲɯɯdʑɯɯ","kʲɯɯdʑɯɯ kʲɯɯ teɴ kʲɯɯdʑɯɯ")
|
| 874 |
+
|
| 875 |
+
|
| 876 |
+
|
| 877 |
+
|
| 878 |
+
|
| 879 |
+
])
|
| 880 |
+
|
| 881 |
+
def random_sym_fix_no_space(text):
    # Replace a bare symbol with its reading, padded with spaces.
    for k, v in sym_ns.items():
        text = text.replace(k, f" {v} ")

    return text
|
| 888 |
+
|
| 889 |
+
|
| 890 |
+
spaces = dict([
|
| 891 |
+
|
| 892 |
+
("ɯ ɴ","ɯɴ"),
|
| 893 |
+
("na ɴ ","naɴ "),
|
| 894 |
+
(" mina ", " miɴna "),
|
| 895 |
+
("ko ɴ ni tɕi ha","konnitɕiwa"),
|
| 896 |
+
("ha i","hai"),
|
| 897 |
+
("boɯtɕama","boʔtɕama"),
|
| 898 |
+
("i eːi","ieːi"),
|
| 899 |
+
("taiɕɯtsɯdʑoɯ","taiɕitsɯdʑoɯ"),
|
| 900 |
+
("soɴna ka ze ni","soɴna fɯɯ ni"),
|
| 901 |
+
(" i e ","ke "),
|
| 902 |
+
("�",""),
|
| 903 |
+
("×"," batsɯ "),
|
| 904 |
+
("se ka ɯndo","sekaɯndo"),
|
| 905 |
+
("i i","iː"),
|
| 906 |
+
("i tɕi","itɕi"),
|
| 907 |
+
("ka i","kai"),
|
| 908 |
+
("naɴ ga","nani ga"),
|
| 909 |
+
("i eː i","ieːi"),
|
| 910 |
+
|
| 911 |
+
("naɴ koɽe","nani koɽe"),
|
| 912 |
+
("naɴ soɽe","nani soɽe"),
|
| 913 |
+
(" ɕeɴ "," seɴ "),
|
| 914 |
+
|
| 915 |
+
# ("konna","koɴna"),
|
| 916 |
+
# ("sonna"," soɴna "),
|
| 917 |
+
# ("anna","aɴna"),
|
| 918 |
+
# ("nn","ɴn"),
|
| 919 |
+
|
| 920 |
+
("en ","eɴ "),
|
| 921 |
+
("in ","iɴ "),
|
| 922 |
+
("an ","aɴ "),
|
| 923 |
+
("on ","oɴ "),
|
| 924 |
+
("ɯn ","ɯɴ "),
|
| 925 |
+
# ("nd","ɴd"),
|
| 926 |
+
|
| 927 |
+
("koɴd o","kondo"),
|
| 928 |
+
("ko ɴ d o","kondo"),
|
| 929 |
+
("ko ɴ do","kondo"),
|
| 930 |
+
|
| 931 |
+
("oanitɕaɴ","oniːtɕaɴ"),
|
| 932 |
+
("oanisaɴ","oniːsaɴ"),
|
| 933 |
+
("oanisama","oniːsama"),
|
| 934 |
+
("hoːmɯrɯɴɯ","hoːmɯrɯːmɯ"),
|
| 935 |
+
("so ɴ na ","sonna"),
|
| 936 |
+
(" sonna "," sonna "),
|
| 937 |
+
(" konna "," konna "),
|
| 938 |
+
("ko ɴ na ","konna"),
|
| 939 |
+
(" ko to "," koto "),
|
| 940 |
+
("edʑdʑi","eʔtɕi"),
|
| 941 |
+
(" edʑdʑ "," eʔtɕi "),
|
| 942 |
+
(" dʑdʑ "," dʑiːdʑiː "),
|
| 943 |
+
("secɯnd","sekaɯndo"),
|
| 944 |
+
|
| 945 |
+
("ɴɯ","nɯ"),
|
| 946 |
+
("ɴe","ne"),
|
| 947 |
+
("ɴo","no"),
|
| 948 |
+
("ɴa","na"),
|
| 949 |
+
("ɴi","ni"),
|
| 950 |
+
("ɴʲ","nʲ"),
|
| 951 |
+
|
| 952 |
+
("hotond o","hotondo"),
|
| 953 |
+
("hakoɴd e","hakoɴde"),
|
| 954 |
+
("gakɯtɕi ɽi","gaʔtɕiɽi "),
|
| 955 |
+
|
| 956 |
+
(" ʔ","ʔ"),
|
| 957 |
+
("ʔ ","ʔ"),
|
| 958 |
+
|
| 959 |
+
("-","ː"),
|
| 960 |
+
("- ","ː"),
|
| 961 |
+
("--","~ː"),
|
| 962 |
+
("~","—"),
|
| 963 |
+
("、",","),
|
| 964 |
+
|
| 965 |
+
(" ː","ː"),
|
| 966 |
+
('ka nade',"kanade"),
|
| 967 |
+
|
| 968 |
+
("ohahasaɴ","okaːsaɴ"),
|
| 969 |
+
(" "," "),
|
| 970 |
+
("viː","bɯiː"),
|
| 971 |
+
("ːː","ː—"),
|
| 972 |
+
|
| 973 |
+
("d ʑ","dʑ"),
|
| 974 |
+
("d a","da"),
|
| 975 |
+
("d e","de"),
|
| 976 |
+
("d o","do"),
|
| 977 |
+
("d ɯ","dɯ"),
|
| 978 |
+
|
| 979 |
+
("niːɕiki","ni iɕiki"),
|
| 980 |
+
("anitɕaɴ","niːtɕaɴ"),
|
| 981 |
+
("daiːtɕi","dai itɕi"),
|
| 982 |
+
|
| 983 |
+
("naɴ sono","nani sono"),
|
| 984 |
+
("naɴ kono","nani kono"),
|
| 985 |
+
("naɴ ano","nani ano"), # works around a Cutlet misreading
|
| 986 |
+
(" niːtaɽa"," ni itaɽa"),
|
| 987 |
+
("doɽamaɕiːd","doɽama ɕiːdʲi"),
|
| 988 |
+
("aɴ ta","anta"),
|
| 989 |
+
("aɴta","anta"),
|
| 990 |
+
("naniːʔteɴ","nani iʔteɴ"),
|
| 991 |
+
("niːkite","ni ikite")
|
| 992 |
+
|
| 993 |
+
])
|
| 994 |
+
|
| 995 |
+
|
| 996 |
+
|
| 997 |
+
def random_space_fix(text):
    # Apply the spacing and reading corrections from `spaces` in insertion order.
    for k, v in spaces.items():
        text = text.replace(k, v)

    return text
|
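The table-driven helpers in this file (`post_fix`, `Roma2IPA`, `random_space_fix`) call `str.replace` once per entry, so the result depends on dict order: a short key that runs first can consume part of a longer one. The sketch below contrasts that with the approach `replace_chars_2` already takes, a single regex pass that prefers longer keys. `SAMPLE_MAP`, `replace_in_order` and `replace_longest_first` are hypothetical names introduced only for this illustration, and the two entries are an excerpt in the style of `k_mapper`, deliberately ordered short key first to show the pitfall. It is not the real table or the author's method.

```python
import re

# Hypothetical two-entry excerpt in the style of k_mapper (assumption, not the full table).
# The short key is listed first on purpose to expose the ordering problem.
SAMPLE_MAP = {
    "き": " ki",
    "きゃ": "kʲa",
}


def replace_in_order(text, mapping):
    """Sequential str.replace in dict order, as the helpers in this file do."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


def replace_longest_first(text, mapping):
    """Replace all keys in one regex pass, trying longer keys before shorter ones."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


if __name__ == "__main__":
    print(repr(replace_in_order("きゃき", SAMPLE_MAP)))       # ' kiゃ ki'
    print(repr(replace_longest_first("きゃき", SAMPLE_MAP)))  # 'kʲa ki'
```

In the real `k_mapper` the multi-character keys are listed before the single kana, which is why the sequential version works there; the single-pass variant simply removes that dependence on entry order.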