---
language:
- de
tags:
- portal
- GLaDOS
- turret
base_model: rhasspy/piper-voices
---
* GLaDOS voice model, trained on German Portal 1 and Portal 2 game files
** Model description
This model uses a checkpoint from the [[https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/de/de_DE/thorsten/high][Thorsten High]] model as a base and fine-tunes it
on voice lines taken directly from the Portal 1 and Portal 2 game files, replicating
the German GLaDOS voice for piper.
Training was performed on an RTX 4000 for more than 3000 epochs.
This repository contains two voice models:
- =de_DE-glados-high.onnx= and =de_DE-glados-high.onnx.json=: GLaDOS herself
- =de_DE-glados-turret-high.onnx= and =de_DE-glados-turret-high.onnx.json=: Fine-tuned on top of the GLaDOS model above to sound like the turret voice.
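To get a first impression, the models can be used with piper directly. A minimal sketch, assuming piper is installed and the model files sit in the current directory (the sample sentence is made up):
#+begin_src sh
# Synthesize a test sentence with the GLaDOS voice (reads text from stdin)
echo 'Das war ein Witz. Fette Chance.' | \
  piper --model de_DE-glados-high.onnx --output_file test.wav
#+end_src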
** Dataset & Training
This repo also includes /hints on how to build the training dataset/ as well as the toolchain used for preparing and training the model.
The reasons:
- The training data is the intellectual property of Valve (I cannot include it here for obvious reasons)
- Training a model for piper (as of early 2025) relies on old/outdated tools from 2021, and getting everything up
  and running can be super frustrating
*Requirements*
- A PC with an NVIDIA GPU and the proprietary NVIDIA drivers, CUDA, Docker + Docker Compose as well as the nvidia-container-toolkit installed (see the sanity check below)
- Ideally use a Linux system (WSL is untested but might work)
- Basic Linux and Python knowledge
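To verify that Docker can actually see your GPU, a quick sanity check like the following should print your GPU (the CUDA image tag is just an example; any recent one works):
#+begin_src sh
# Runs nvidia-smi inside a container via the nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
#+end_src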
*** Build the training dataset
**** Extract the files from the game
The training dataset has been extracted from the Portal 1 and Portal 2 game files.
For legal reasons, they are not included in this repo, but you can easily extract them from the
game files via [[https://developer.valvesoftware.com/wiki/VPKEdit][VPKEdit]].
*Portal 1*:
1. Switch the game to the desired language (here: German) via Steam
2. Navigate to =<steam>/steamapps/common/Portal/portal= and open =portal_pak_dir.vpk= with VPKEdit
3. Inside =portal_pak_dir.vpk=, navigate to =sound/vo/aperture_ai= and extract all =*.wav= files into the folder =raw= inside this git repo
*Portal 2:*
1. Switch the game to the desired language (here: German) via Steam
2. Navigate to =<steam>/steamapps/common/portal 2=. Select the subfolder matching the language (here =portal2_german=) and open =pak01_dir.vpk= via VPKEdit
3. Inside =pak01_dir.vpk=, navigate to =sound/vo/glados= and extract all =*.wav= files (but no subfolders) to the folder =raw= inside this git repo
*Portal 2 DLC 1*:
1. Repeat steps 1 and 2 for *Portal 2* above, but now select the =portal2_dlc1_<your language>= folder (if it exists; here, =portal2_dlc1_german= does). Open =pak01_dir.vpk= with VPKEdit
2. Repeat step 3 of *Portal 2* above, but copy the files to =raw= in this git repo
**** Transcode the files
We need to transcode the files: the Portal 1 files are 44.1 kHz WAV, while the Portal 2 files are MP3 (despite their =.wav= extension).
For training, we need 16-bit (LE) mono PCM WAV with one of the sample rates shown below, depending on the model quality we want to train:
- x-low, low: 16000 Hz
- medium, high: 22050 Hz
NOTE: In principle, we could also train on 44100 Hz, but piper-train would then need to be modified for *training* and *inference*, as it only supports the sample rates listed above.
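You can check what the extracted files actually contain with =file= (the exact output varies):
#+begin_src sh
# Inspect a few extracted samples; the Portal 2 files will typically
# show up as MPEG audio rather than RIFF/WAV
file raw/*.wav | head -n 5
#+end_src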
Run the following command (needs =ffmpeg= to be installed)
#+begin_src sh
# Before running the script, first edit the sample rate you want
./0_transcode.sh
#+end_src
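If you want to transcode a single file manually instead, the conversion boils down to something like this (a sketch; the file names and the 22050 Hz rate for a high model are examples):
#+begin_src sh
# 16-bit little-endian PCM, mono, 22050 Hz
ffmpeg -i raw/sample.wav -ar 22050 -ac 1 -c:a pcm_s16le raw/sample_22050.wav
#+end_src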
**** Sort by good/bad samples
_Now the annoying part_: Listen to all voice samples, one by one, and sort them into good (same voice style, no degradation in quality, no additional non-voice parts or mumbling etc.) and bad (the opposite).
I have written a helper script for this purpose: *1_sort_good_bad.py* (read the comments in it).
_But hold your horses_: Before you perform this annoying job, which can take several hours, note that I expect the quality of the voice lines to be similar across languages. So you can use my
script =1_from_good.py=, which uses the =good.txt= file to tag voice samples as *good* or *bad*, based on the decisions I made while listening to GLaDOS myself.
Run the following command
#+begin_src sh
./1_from_good.py
#+end_src
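For reference, =good.txt= is assumed here to simply list the samples judged as good, one file name per line (the names below are made up):
#+begin_src sh
# Hypothetical peek at the first entries
head -n 2 good.txt
# glados_chamber_01.wav
# glados_chamber_02.wav
#+end_src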
**** Transcribe
Now we need to transcribe the files. For this, we need =faster-whisper=. The easiest way to install and use it is via Docker.
But before you do that, you should edit the file =2_transcribe.py= and select the language and model you want to use.
Run this to build the docker container(s)
#+begin_src sh
docker compose up --build -d
docker exec -it transcribe bash
#+end_src
You should now be in the =transcribe= docker container. Run
#+begin_src sh
./2_transcribe.py
#+end_src
This will yield a new file =metadata.csv=. _Once transcription has finished, copy this file to =raw_good=_.
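For orientation: piper's preprocessing expects the LJSpeech-style =id|transcription= layout, so the file should look roughly like this (the lines below are invented):
#+begin_src sh
head -n 2 raw_good/metadata.csv
# sample_001|Willkommen im Aperture Science Enrichment Center.
# sample_002|Das war ein Witz.
#+end_src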
*** Training
For this, you should use the Docker container provided by this repo.
But before you do that, you need to configure the following files:
- =3_gen_traindata.sh=: Edit the sample rate (16000 for x-low and low, 22050 for medium and high models) and the language code (en, de, ru, fr, ...)
- =4_train.sh=: Edit the QUALITY, BATCHSIZE and PHONEME_MAX parameters to suit your training hardware.
  Also select the CHKPOINT to start from: you ideally do not want to train from scratch but rather from an already existing checkpoint.
  Grab [[https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main][one from the piper people]] that fits the model (x-low, low, medium, high) and language that you want to train.
  Copy it to =checkpoints= within this repo (see the example below).
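For example, a base checkpoint could be fetched like this (the exact =.ckpt= file name differs per voice, so look it up in the checkpoint repo first; the path below is a placeholder):
#+begin_src sh
# Download a base checkpoint into checkpoints/
wget -P checkpoints/ \
  'https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/de/de_DE/thorsten/high/<checkpoint>.ckpt'
#+end_src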
Now run the following within this repo (if you haven't already done so for transcription)
#+begin_src sh
docker compose up --build -d
docker exec -it training bash
#+end_src
This will build and enter the training container and also expose training metrics via TensorBoard at http://127.0.0.1:6006
From inside the container, you now need to generate the training data for the training process
#+begin_src sh
./3_gen_traindata.sh
#+end_src
And now, you are ready for training. Simply run
#+begin_src sh
./4_train.sh
#+end_src
inside the container.
In case you need to stop and later resume training, just point the =CHKPOINT= variable in =./4_train.sh= to the checkpoint you want to continue from.
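A sketch of what that could look like (the path is a made-up example; pytorch-lightning, which piper-train builds on, stores checkpoints under =lightning_logs/version_*/checkpoints/=):
#+begin_src sh
# In 4_train.sh: resume from the last checkpoint of the previous run
CHKPOINT='lightning_logs/version_0/checkpoints/last.ckpt'
#+end_src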
*** Export the final model
After training has finished (either the loss flattened out or you hit the max epoch limit), you need to export the model to the ONNX format.
First, edit =5_export.sh= and set the name as well as the checkpoint you want to export the model from (generally the last checkpoint written by =4_train.sh=).
Still inside the training docker container, run this command
#+begin_src sh
./5_export.sh
#+end_src
This will generate a =<model_name>.onnx= and a =<model_name>.onnx.json= file. The latter needs to be adjusted: open it in a text editor and navigate to the line where it reads
#+begin_src json
"dataset": "",
#+end_src
and replace the empty string with this model's name (here: =de_DE-glados-high=)
#+begin_src json
"dataset": "de_DE-glados-high",
#+end_src
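If you prefer doing this on the command line, a one-liner like the following works as well (assuming the =dataset= field is still empty):
#+begin_src sh
sed -i 's/"dataset": ""/"dataset": "de_DE-glados-high"/' de_DE-glados-high.onnx.json
#+end_src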
These two files can now be used by piper.