<h1 align="center">TorchFCPE</h1>

## Overview
TorchFCPE (Fast Context-based Pitch Estimation) is a PyTorch-based library for audio pitch extraction and MIDI conversion. This README is a quick guide to using the library for audio pitch inference and MIDI extraction.

Note: the MIDI extractor in FCPE quantizes MIDI from f0 using non-neural-network methods.

Note: I won't be updating FCPE (or the benchmark) anytime soon, but I will release a version with cleaned-up code no later than next year.
## Installation

Before using the library, make sure you have the necessary dependencies installed:

```bash
pip install torchfcpe
```
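To confirm the install, a one-line import check can help (a smoke test only, not part of the library's documented workflow):

```python
# Smoke test: both imports below should succeed after installation
import torchfcpe
from torchfcpe import spawn_bundled_infer_model
```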
## Usage

### 1. Audio Pitch Inference

```python
from torchfcpe import spawn_bundled_infer_model
import torch
import librosa

# Configure device and target hop size
device = 'cpu'  # or 'cuda' if using a GPU
sr = 16000  # Sample rate
hop_size = 160  # Hop size for processing

# Load and preprocess audio
audio, sr = librosa.load('test.wav', sr=sr)
audio = librosa.to_mono(audio)
audio_length = len(audio)
f0_target_length = (audio_length // hop_size) + 1
audio = torch.from_numpy(audio).float().unsqueeze(0).unsqueeze(-1).to(device)

# Load the model
model = spawn_bundled_infer_model(device=device)

# Perform pitch inference
f0 = model.infer(
    audio,
    sr=sr,
    decoder_mode='local_argmax',  # Recommended mode
    threshold=0.006,  # Threshold for V/UV decision
    f0_min=80,  # Minimum pitch
    f0_max=880,  # Maximum pitch
    interp_uv=False,  # Whether to interpolate unvoiced frames (disabled here)
    output_interp_target_length=f0_target_length,  # Interpolate to target length
)
print(f0)
```
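A short sketch of typical post-processing for the returned tensor. It assumes the output shape is `(1, n_frames, 1)` and that unvoiced frames are reported as 0 when `interp_uv=False`; verify both against your installed version:

```python
import numpy as np

# Flatten to a 1-D pitch track in Hz (assumed shape: (1, n_frames, 1))
f0_hz = f0.squeeze().detach().cpu().numpy()
voiced = f0_hz > 0  # assumed convention: unvoiced frames are 0 when interp_uv=False
if voiced.any():
    print(f"voiced frames: {voiced.mean():.1%}")
    print(f"median voiced pitch: {np.median(f0_hz[voiced]):.1f} Hz")
```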
### 2. MIDI Extraction

```python
# Extract MIDI from audio
midi = model.extact_midi(
    audio,
    sr=sr,
    decoder_mode='local_argmax',  # Recommended mode
    threshold=0.006,  # Threshold for V/UV decision
    f0_min=80,  # Minimum pitch
    f0_max=880,  # Maximum pitch
    output_path="test.mid",  # Save MIDI to file
)
print(midi)
```
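To sanity-check the exported file, you can open it with any MIDI library. A minimal sketch using the third-party `mido` package (`pip install mido`; not a torchfcpe dependency), assuming the `output_path` above was used:

```python
import mido

# Load and summarize the exported MIDI file
mid = mido.MidiFile("test.mid")
print(f"tracks: {len(mid.tracks)}, duration: {mid.length:.2f} s")
for msg in mid.tracks[0][:10]:  # print the first few events
    print(msg)
```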
### Notes

- **Inference Parameters:**
  - `audio`: Input audio as a `torch.Tensor`.
  - `sr`: Sample rate of the audio.
  - `decoder_mode` (Optional): Decoding mode; 'local_argmax' is recommended.
  - `threshold` (Optional): Threshold for the voiced/unvoiced (V/UV) decision; default is 0.006.
  - `f0_min` (Optional): Minimum pitch value; default is 80 Hz.
  - `f0_max` (Optional): Maximum pitch value; default is 880 Hz.
  - `interp_uv` (Optional): Whether to interpolate unvoiced frames; default is False.
  - `output_interp_target_length` (Optional): Length to which the output pitch should be interpolated. An end-to-end sketch using these parameters follows this list.
- **MIDI Extraction Parameters:**
  - `audio`: Input audio as a `torch.Tensor`.
  - `sr`: Sample rate of the audio.
  - `decoder_mode` (Optional): Decoding mode; 'local_argmax' is recommended.
  - `threshold` (Optional): Threshold for the voiced/unvoiced (V/UV) decision; default is 0.006.
  - `f0_min` (Optional): Minimum pitch value; default is 80 Hz.
  - `f0_max` (Optional): Maximum pitch value; default is 880 Hz.
  - `output_path` (Optional): File path to save the MIDI file. If not provided, only the MIDI structure is returned.
  - `tempo` (Optional): BPM for the MIDI file. If None, the BPM is predicted automatically.
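As an end-to-end illustration of the inference parameters above, here is a small helper sketch. The function name and its defaults are hypothetical, not part of the library:

```python
import librosa
import torch
from torchfcpe import spawn_bundled_infer_model

def extract_f0(path: str, device: str = 'cpu', sr: int = 16000, hop_size: int = 160) -> torch.Tensor:
    """Illustrative helper: load an audio file and return its f0 track."""
    audio, sr = librosa.load(path, sr=sr)
    audio = torch.from_numpy(audio).float().unsqueeze(0).unsqueeze(-1).to(device)
    model = spawn_bundled_infer_model(device=device)  # in practice, spawn once and reuse
    return model.infer(
        audio,
        sr=sr,
        decoder_mode='local_argmax',
        threshold=0.006,
        f0_min=80,
        f0_max=880,
        output_interp_target_length=(audio.shape[1] // hop_size) + 1,
    )

f0 = extract_f0('test.wav')
```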
## Additional Features

- **Model as a PyTorch Module:**
  You can use the model as a standard PyTorch module. For example:

```python
# Change device
model = model.to(device)

# Compile model
model = torch.compile(model)
```
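Since inference needs no gradients, wrapping calls in `torch.inference_mode()` skips autograd bookkeeping; this is standard PyTorch practice, nothing FCPE-specific. Note that `torch.compile` requires PyTorch 2.0 or newer.

```python
# Run inference without autograd overhead (standard PyTorch practice)
model.eval()
with torch.inference_mode():
    f0 = model.infer(audio, sr=sr)
```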
## Paper

If you find our work useful, please consider citing the paper:

```
@misc{luo2025fcpefastcontextbasedpitch,
      title={FCPE: A Fast Context-based Pitch Estimation Model},
      author={Yuxin Luo and Ruoyi Zhang and Lu-Chuan Liu and Tianyu Li and Hangyu Liu},
      year={2025},
      eprint={2509.15140},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.15140},
}
```
### Important details

The model used in our paper is DDSP-200K; you can download it here: [DDSP-200K Model](https://huggingface.co/ChiTu/FCPE/tree/main).

There is also an earlier model, available here: [FCPE-Previous](/torchfcpe/assets/fcpe_c_v001.pt).

More information about the experiments will be released after the paper is accepted or rejected.