<h1 align="center">torchcrepe</h1>
<div align="center">

[PyPI](https://pypi.python.org/pypi/torchcrepe)
[License: MIT](https://opensource.org/licenses/MIT)
[Downloads](https://pepy.tech/project/torchcrepe)

</div>
PyTorch implementation of the CREPE [1] pitch tracker. The original TensorFlow
implementation can be found [here](https://github.com/marl/crepe/). The
provided model weights were obtained by converting the "tiny" and "full" models
using [MMdnn](https://github.com/microsoft/MMdnn), an open-source model
management framework.
## Installation

Perform the system-dependent PyTorch install using the instructions found
[here](https://pytorch.org/).

`pip install torchcrepe`
## Usage

### Computing pitch and periodicity from audio

```python
import torchcrepe


# Load audio
audio, sr = torchcrepe.load.audio( ... )

# Here we'll use a 5 millisecond hop length
hop_length = int(sr / 200.)

# Provide a sensible frequency range for your domain (upper limit is 2006 Hz)
# This would be a reasonable range for speech
fmin = 50
fmax = 550

# Select a model capacity--one of "tiny" or "full"
model = 'tiny'

# Choose a device to use for inference
device = 'cuda:0'

# Pick a batch size that doesn't cause memory errors on your gpu
batch_size = 2048

# Compute pitch using first gpu
pitch = torchcrepe.predict(audio,
                           sr,
                           hop_length,
                           fmin,
                           fmax,
                           model,
                           batch_size=batch_size,
                           device=device)
```
A periodicity metric similar to the CREPE confidence score can also be
extracted by passing `return_periodicity=True` to `torchcrepe.predict`.
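A minimal sketch of this, assuming the call then returns a
`(pitch, periodicity)` pair:

```python
# Return periodicity alongside pitch (assumed to come back as a pair)
pitch, periodicity = torchcrepe.predict(audio,
                                        sr,
                                        hop_length,
                                        fmin,
                                        fmax,
                                        model,
                                        batch_size=batch_size,
                                        device=device,
                                        return_periodicity=True)
```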
### Decoding

By default, `torchcrepe` uses Viterbi decoding on the softmax of the network
output. This is different from the original implementation, which uses a
weighted average near the argmax of the binary cross-entropy probabilities.

The argmax operation can cause double/half-frequency errors. These can be
removed by penalizing large pitch jumps via Viterbi decoding. The `decode`
submodule provides several options for decoding.
```python
# Decode using Viterbi decoding (default)
torchcrepe.predict(..., decoder=torchcrepe.decode.viterbi)

# Decode using weighted argmax (as in the original implementation)
torchcrepe.predict(..., decoder=torchcrepe.decode.weighted_argmax)

# Decode using argmax
torchcrepe.predict(..., decoder=torchcrepe.decode.argmax)
```
### Filtering and thresholding

When periodicity is low, the pitch is less reliable. For some problems, it
makes sense to mask these less reliable pitch values. However, the periodicity
can be noisy, and the pitch has quantization artifacts. `torchcrepe` provides
the submodules `filter` and `threshold` for this purpose. The filter and
threshold parameters should be tuned to your data. For clean speech, a 10-20
millisecond window with a threshold of 0.21 has worked.
```python
# We'll use a 15 millisecond window assuming a hop length of 5 milliseconds
win_length = 3

# Median filter the noisy confidence value
periodicity = torchcrepe.filter.median(periodicity, win_length)

# Remove inharmonic regions
pitch = torchcrepe.threshold.At(.21)(pitch, periodicity)

# Optionally smooth pitch to remove quantization artifacts
pitch = torchcrepe.filter.mean(pitch, win_length)
```
For more fine-grained control over pitch thresholding, see
`torchcrepe.threshold.Hysteresis`. This is especially useful for removing
spurious voiced regions caused by noise in the periodicity values, but it
has more parameters and may require more manual tuning to your data.
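A minimal sketch, assuming `Hysteresis` can be constructed with its default
parameters and, like `At`, is called with the pitch and periodicity:

```python
# Assumes default Hysteresis parameters and an At-style callable interface
pitch = torchcrepe.threshold.Hysteresis()(pitch, periodicity)
```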
CREPE was not trained on silent audio. Therefore, it sometimes assigns high
confidence to pitch bins in silent regions. You can use
`torchcrepe.threshold.Silence` to manually set the periodicity in silent
regions to zero.
```python
periodicity = torchcrepe.threshold.Silence(-60.)(periodicity,
                                                 audio,
                                                 sr,
                                                 hop_length)
```
### Computing the CREPE model output activations

```python
batch = next(torchcrepe.preprocess(audio, sr, hop_length))
probabilities = torchcrepe.infer(batch)
```
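`torchcrepe.preprocess` is used above as a generator, so longer signals can be
processed one batch at a time. A minimal sketch, assuming each batch can be
inferred independently and the per-batch activations concatenated:

```python
import torch

# Run inference batch-by-batch and concatenate the activations
probabilities = torch.cat(
    [torchcrepe.infer(batch)
     for batch in torchcrepe.preprocess(audio, sr, hop_length)],
    dim=0)
```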
### Computing the CREPE embedding space

As in Differentiable Digital Signal Processing [2], this uses the output of the
fifth max-pooling layer as a pretrained pitch embedding.

```python
embeddings = torchcrepe.embed(audio, sr, hop_length)
```
### Computing from files

`torchcrepe` defines the following functions, which are convenient for
predicting directly from audio files on disk. Each of these functions also
takes a `device` argument that can be used for device placement (e.g.,
`device='cuda:0'`).
```python
torchcrepe.predict_from_file(audio_file, ...)
torchcrepe.predict_from_file_to_file(
    audio_file, output_pitch_file, output_periodicity_file, ...)
torchcrepe.predict_from_files_to_files(
    audio_files, output_pitch_files, output_periodicity_files, ...)

torchcrepe.embed_from_file(audio_file, ...)
torchcrepe.embed_from_file_to_file(audio_file, output_file, ...)
torchcrepe.embed_from_files_to_files(audio_files, output_files, ...)
```
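For example, a minimal sketch with hypothetical filenames, assuming these
functions accept the same keyword arguments as `torchcrepe.predict`:

```python
# 'speech.wav', 'pitch.pt', and 'periodicity.pt' are hypothetical filenames
torchcrepe.predict_from_file_to_file('speech.wav',
                                     'pitch.pt',
                                     'periodicity.pt',
                                     hop_length=hop_length,
                                     fmin=50,
                                     fmax=550,
                                     model='tiny',
                                     device='cuda:0')
```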
### Command-line interface

```bash
usage: python -m torchcrepe
    [-h]
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
    [--hop_length HOP_LENGTH]
    [--output_periodicity_files OUTPUT_PERIODICITY_FILES [OUTPUT_PERIODICITY_FILES ...]]
    [--embed]
    [--fmin FMIN]
    [--fmax FMAX]
    [--model MODEL]
    [--decoder DECODER]
    [--gpu GPU]
    [--no_pad]

optional arguments:
  -h, --help            show this help message and exit
  --audio_files AUDIO_FILES [AUDIO_FILES ...]
                        The audio files to process
  --output_files OUTPUT_FILES [OUTPUT_FILES ...]
                        The files to save pitch or embedding
  --hop_length HOP_LENGTH
                        The hop length of the analysis window
  --output_periodicity_files OUTPUT_PERIODICITY_FILES [OUTPUT_PERIODICITY_FILES ...]
                        The files to save periodicity
  --embed               Perform embedding instead of pitch prediction
  --fmin FMIN           The minimum frequency allowed
  --fmax FMAX           The maximum frequency allowed
  --model MODEL         The model capacity. One of "tiny" or "full"
  --decoder DECODER     The decoder to use. One of "argmax", "viterbi", or
                        "weighted_argmax"
  --gpu GPU             The gpu to perform inference on
  --no_pad              Disable padding of the audio
```
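For example, to predict pitch for a single file on the first GPU (the
filenames here are hypothetical):

```bash
python -m torchcrepe \
    --audio_files speech.wav \
    --output_files pitch.pt \
    --model tiny \
    --gpu 0
```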
## Tests

The module tests can be run as follows.

```bash
pip install pytest
pytest
```
## References

[1] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A Convolutional
Representation for Pitch Estimation,” in 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).

[2] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable
Digital Signal Processing,” in 2020 International Conference on Learning
Representations (ICLR).