---
language:
- de
tags:
- portal
- GLaDOS
- turret
base_model: rhasspy/piper-voices
---
* GLaDOS voice model, trained on German Portal 1 and Portal 2 game files
** Model description
This model uses a checkpoint from the [[https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/de/de_DE/thorsten/high][Thorsten High]] model as a base and fine-tunes it
on the voice lines coming directly from the game files of Portal 1 and Portal 2,
in order to replicate the German GLaDOS voice for piper.
Training has been performed on an RTX 4000 for more than 3000 epochs.
You will find two voice models in here:
- =de_DE-glados-high.onnx= and =de_DE-glados-high.onnx.json=: GLaDOS herself
- =de_DE-glados-turret-high.onnx= and =de_DE-glados-turret-high.onnx.json=: fine-tuned on top of the above model to sound like the turret voice.
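Once downloaded, the models can be used like any other piper voice. For example (the text and output filename are just placeholders):
#+begin_src sh
echo 'Hallo und willkommen im Aperture Science Enrichment Center.' | \
  piper --model de_DE-glados-high.onnx --output_file glados.wav
#+end_src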
** Dataset & Training
This repo also contains /hints on how to build the training dataset/ as well as the toolchain used for preparing and training the model.
Reasons being:
- The training data is intellectual property of and copyrighted by Valve (I cannot include it here for obvious reasons)
- Training a model for piper (as of early 2025) relies on old/outdated tools from 2021 and getting everything up
and running can be super frustrating
*Requirements*
- A PC with an NVIDIA GPU and the proprietary NVIDIA drivers, CUDA, Docker + Docker Compose as well as the nvidia-container-toolkit installed
- Ideally use a Linux system (WSL is untested but might work)
- Basic Linux and Python knowledge
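A quick sanity check that Docker can actually reach the GPU (the standard nvidia-container-toolkit smoke test):
#+begin_src sh
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
#+end_src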
*** Build the training dataset
**** Extract the files from the game
The training dataset has been extracted from the Portal 1 and Portal 2 game files.
For legal reasons, they are not included in this repo. But you can easily extract them from the
game files via [[https://developer.valvesoftware.com/wiki/VPKEdit][VPKEdit]].
*Portal 1*:
- Switch the game to the desired language (Here: German) via Steam
- Navigate to =<steam>/steamapps/common/Portal/portal= and open =portal_pak_dir.vpk= with VPKEdit
- Inside =portal_pak_dir.vpk=, navigate to =sound/vo/aperture_ai= and extract all =*.wav= files into the folder =raw= inside this git repo
*Portal 2:*
- Switch the game to the desired language (Here: German) via Steam
- Navigate to =<steam>/steamapps/common/portal 2=. Select the subfolder matching the language (here =portal2_german=) and open =pak01_dir.vpk= via VPKEdit
- Inside =pak01_dir.vpk=, navigate to =sound/vo/glados= and extract all =*.wav= files (but no subfolders) to the folder =raw= inside this git repo
*Portal 2 DLC 1*:
- Repeat steps 1 and 2 of *Portal 2* above, but now select the =portal2_dlc1_<your language>= folder (if it exists). Here, =portal2_dlc1_german= does exist. Open =pak01_dir.vpk= with VPKEdit
- Repeat step 3 of *Portal 2* above but copy the files to =raw= in this git repo
**** Transcode the files
We need to transcode the files: the Portal 1 files are 44.1 kHz WAV while the Portal 2 files are MP3.
For training, we need WAV, 16 bit (LE), mono PCM with one of the sample rates shown below, depending on the model quality we want to train.
- x-low, low: 16000 Hz
- medium, high: 22050 Hz
NOTE: In principle, we could also train on 44100 Hz, however piper-train then needs to be modified for *training* and *inference*, as out of the box it only supports the sample rates listed above.
Run the following command (needs =ffmpeg= to be installed)
#+begin_src sh
# Before running the script, first edit the sample rate that you want
./0_transcode.sh
#+end_src
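For orientation, the per-file conversion boils down to an ffmpeg call along these lines (just a sketch; the output folder name =transcoded= is a placeholder and the actual script may differ):
#+begin_src sh
# 22050 Hz (for medium/high models), mono, 16-bit little-endian PCM WAV
mkdir -p transcoded
for f in raw/*.wav; do
  ffmpeg -y -i "$f" -ar 22050 -ac 1 -c:a pcm_s16le "transcoded/$(basename "$f")"
done
#+end_src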
**** Sort by good/bad samples
_Now the annoying part_: Listen to all voice samples, one by one, and sort them into good (same voice style, no degradation in quality, no additional non-voice parts or mumbling etc.) and bad (the opposite)
I have written a helper script for this purpose: =1_sort_good_bad.py= (read the comments in it).
_But hold your horses_: Before you perform this annoying job, which can take several hours: I expect the quality of the voice lines to be similar across languages. So you can use my
script =1_from_good.py=, which uses the =good.txt= file to tag voice samples as *good* or *bad*, based on the decisions I made while listening to GLaDOS myself.
Run the following command
#+begin_src sh
./1_from_good.py
#+end_src
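For reference, the tagging that =1_from_good.py= performs is conceptually along these lines (a simplified sketch, not the actual script; the =raw_bad= folder name and the one-filename-per-line format of =good.txt= are assumptions):
#+begin_src python
#!/usr/bin/env python3
# Conceptual sketch: copy samples listed in good.txt to raw_good, the rest to raw_bad
import shutil
from pathlib import Path

good_names = {line.strip() for line in open("good.txt", encoding="utf-8") if line.strip()}

good_dir, bad_dir = Path("raw_good"), Path("raw_bad")
good_dir.mkdir(exist_ok=True)
bad_dir.mkdir(exist_ok=True)

for wav in Path("raw").glob("*.wav"):
    target = good_dir if wav.name in good_names else bad_dir
    shutil.copy(wav, target / wav.name)
#+end_src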
**** Transcribe
Now we need to transcribe the files. For this, we need =faster-whisper=. The easiest way to install and use it is via Docker.
But before you do that, you should edit the file =2_transcribe.py= and select the language and model you want to use.
Run this to build the docker container(s)
#+begin_src sh
docker compose up --build -d
docker exec -it transcribe bash
#+end_src
You should now be in the =transcribe= docker container. Run
#+begin_src sh
./2_transcribe.py
#+end_src
This will yield a new file =metadata.csv=. _Copy this file to =raw_good= once transcription has finished_
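If you prefer to adapt the transcription step yourself, its core looks roughly like this (a sketch, not the actual =2_transcribe.py=; the whisper model size, the language, =raw_good= as input folder and the piper-style =id|text= line format are assumptions):
#+begin_src python
#!/usr/bin/env python3
# Rough sketch of the transcription step (not the actual 2_transcribe.py):
# transcribe every wav in raw_good/ and write one "id|text" line per file to metadata.csv
from pathlib import Path
from faster_whisper import WhisperModel

# Model size, device and language are assumptions - adjust to your setup
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

with open("metadata.csv", "w", encoding="utf-8") as f:
    for wav in sorted(Path("raw_good").glob("*.wav")):
        segments, _info = model.transcribe(str(wav), language="de")
        text = " ".join(seg.text.strip() for seg in segments)
        f.write(f"{wav.stem}|{text}\n")
#+end_src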
*** Training
For this, you should use the Docker container provided by this repo.
But before you do that, you need to configure the following files:
- =3_gen_traindata.sh=: Edit the samplerate (16000 for x-low and low, 22050 for medium and high models) and the language code (en, de, ru, fr, ...)
- =4_train.sh=: Edit the QUALITY, BATCHSIZE and PHONEME_MAX parameters to suit your training hardware.
Also select the CHKPOINT to start from: you ideally do not want to train from scratch but rather from an already existing checkpoint.
Grab [[https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main][one from the piper people]] that fits the model quality (x-low, low, medium, high) and language that you want to train.
Copy it to =checkpoints= within this repo.
Now run the following within this repo (if you haven't already done so for transcription)
#+begin_src sh
docker compose up --build -d
docker exec -it training bash
#+end_src
This will build and enter the training container and also export training metrics via tensorboard at http://127.0.0.1:6006
From inside the container, you now need to generate your traindata for the training process
#+begin_src sh
./3_gen_traindata.sh
#+end_src
And now, you are ready for training. Simply run
#+begin_src sh
./4_train.sh
#+end_src
inside the container.
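For orientation: these two scripts presumably wrap piper's own preprocessing and training entry points, roughly along these lines (placeholder paths and values, not the literal content of the scripts):
#+begin_src sh
# 3_gen_traindata.sh, roughly: preprocess the dataset for piper_train
python3 -m piper_train.preprocess \
  --language de \
  --input-dir /path/to/dataset/ \
  --output-dir /path/to/traindata/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050

# 4_train.sh, roughly: fine-tune from an existing checkpoint
python3 -m piper_train \
  --dataset-dir /path/to/traindata/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 10000 \
  --resume_from_checkpoint checkpoints/<base-model>.ckpt \
  --checkpoint-epochs 1 \
  --precision 32
#+end_src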
In case you need to stop and later resume training, just change the path to the checkpoint by setting the CHKPOINT variable in =./4_train.sh= accordingly.
*** Export the final model
After training has finished (either the loss has flattened off or you hit the max epoch limit), you need to export the model to the ONNX format.
First, edit =5_export.sh= and set the model name as well as the checkpoint you want to export the model from (generally the last checkpoint written by =4_train.sh=).
Still from inside the training docker container, run this command
#+begin_src sh
./5_export.sh
#+end_src
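Under the hood, this export presumably boils down to piper's ONNX exporter plus copying the generated config (placeholder paths):
#+begin_src sh
python3 -m piper_train.export_onnx \
  /path/to/last.ckpt \
  de_DE-glados-high.onnx
cp /path/to/traindata/config.json de_DE-glados-high.onnx.json
#+end_src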
This will generate a =<model_name>.onnx= and a =<model_name>.onnx.json= file. The latter needs to be adjusted: open it in a text editor and navigate to the line that reads
#+begin_src json
"dataset": "",
#+end_src
and replace "" with this model's name (here: "<model_name>")
#+begin_src json
"dataset": "de_DE-glados-high"
#+end_src
These two files can now be used by piper.