| # utils/audio | |
| Helper module for audio processing. | |
| These functions and classes are only used internally, | |
| meaning an end-user shouldn't need to access anything here. | |
| * [utils/audio](#module_utils/audio) | |
| * _static_ | |
| * [.RawAudio](#module_utils/audio.RawAudio) | |
| * [`new RawAudio(audio, sampling_rate)`](#new_module_utils/audio.RawAudio_new) | |
| * [`.toWav()`](#module_utils/audio.RawAudio+toWav) ⇒ <code>ArrayBuffer</code> | |
| * [`.toBlob()`](#module_utils/audio.RawAudio+toBlob) ⇒ <code>Blob</code> | |
| * [`.save(path)`](#module_utils/audio.RawAudio+save) | |
| * [`.read_audio(url, sampling_rate)`](#module_utils/audio.read_audio) ⇒ <code>Promise.<Float32Array></code> | |
| * [`~audio`](#module_utils/audio.read_audio..audio) : <code>Float32Array</code> | |
| * [`.hanning(M)`](#module_utils/audio.hanning) ⇒ <code>Float64Array</code> | |
| * [`.hamming(M)`](#module_utils/audio.hamming) ⇒ <code>Float64Array</code> | |
| * [`.mel_filter_bank(num_frequency_bins, num_mel_filters, min_frequency, max_frequency, sampling_rate, [norm], [mel_scale], [triangularize_in_mel_space])`](#module_utils/audio.mel_filter_bank) ⇒ <code>Array.<Array<number>></code> | |
| * [`.spectrogram(waveform, window, frame_length, hop_length, options)`](#module_utils/audio.spectrogram) ⇒ [<code>Promise.<Tensor></code>](#Tensor) | |
| * [`.window_function(window_length, name, options)`](#module_utils/audio.window_function) ⇒ <code>Float64Array</code> | |
| * _inner_ | |
| * [`~generalized_cosine_window(M, a_0)`](#module_utils/audio..generalized_cosine_window) ⇒ <code>Float64Array</code> | |
| * [`~hertz_to_mel(freq, [mel_scale])`](#module_utils/audio..hertz_to_mel) ⇒ <code>T</code> | |
| * [`~mel_to_hertz(mels, [mel_scale])`](#module_utils/audio..mel_to_hertz) ⇒ <code>T</code> | |
| * [`~_create_triangular_filter_bank(fft_freqs, filter_freqs)`](#module_utils/audio.._create_triangular_filter_bank) ⇒ <code>Array.<Array<number>></code> | |
| * [`~linspace(start, end, num)`](#module_utils/audio..linspace) ⇒ | |
| * [`~padReflect(array, left, right)`](#module_utils/audio..padReflect) ⇒ <code>T</code> | |
| * [`~_db_conversion_helper(spectrogram, factor, reference, min_value, db_range)`](#module_utils/audio.._db_conversion_helper) ⇒ <code>T</code> | |
| * [`~amplitude_to_db(spectrogram, [reference], [min_value], [db_range])`](#module_utils/audio..amplitude_to_db) ⇒ <code>T</code> | |
| * [`~power_to_db(spectrogram, [reference], [min_value], [db_range])`](#module_utils/audio..power_to_db) ⇒ <code>T</code> | |
| * [`~encodeWAV(samples, rate)`](#module_utils/audio..encodeWAV) ⇒ <code>ArrayBuffer</code> | |
| * * * | |
| <a id="module_utils/audio.RawAudio" class="group"></a> | |
| ## utils/audio.RawAudio | |
| **Kind**: static class of [<code>utils/audio</code>](#module_utils/audio) | |
| * [.RawAudio](#module_utils/audio.RawAudio) | |
| * [`new RawAudio(audio, sampling_rate)`](#new_module_utils/audio.RawAudio_new) | |
| * [`.toWav()`](#module_utils/audio.RawAudio+toWav) ⇒ <code>ArrayBuffer</code> | |
| * [`.toBlob()`](#module_utils/audio.RawAudio+toBlob) ⇒ <code>Blob</code> | |
| * [`.save(path)`](#module_utils/audio.RawAudio+save) | |
| * * * | |
| <a id="new_module_utils/audio.RawAudio_new" class="group"></a> | |
| ### `new RawAudio(audio, sampling_rate)` | |
| Create a new `RawAudio` object. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>audio</td><td><code>Float32Array</code></td><td><p>Audio data</p> | |
| </td> | |
| </tr><tr> | |
| <td>sampling_rate</td><td><code>number</code></td><td><p>Sampling rate of the audio data</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio.RawAudio+toWav" class="group"></a> | |
| ### `rawAudio.toWav()` ⇒ <code>ArrayBuffer</code> | |
| Convert the audio to a WAV file buffer. | |
| **Kind**: instance method of [<code>RawAudio</code>](#module_utils/audio.RawAudio) | |
| **Returns**: <code>ArrayBuffer</code> - The WAV file. | |
| * * * | |
| <a id="module_utils/audio.RawAudio+toBlob" class="group"></a> | |
| ### `rawAudio.toBlob()` ⇒ <code>Blob</code> | |
| Convert the audio to a blob. | |
| **Kind**: instance method of [<code>RawAudio</code>](#module_utils/audio.RawAudio) | |
| * * * | |
| <a id="module_utils/audio.RawAudio+save" class="group"></a> | |
| ### `rawAudio.save(path)` | |
| Save the audio to a WAV file. | |
| **Kind**: instance method of [<code>RawAudio</code>](#module_utils/audio.RawAudio) | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>path</td><td><code>string</code></td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio.read_audio" class="group"></a> | |
| ## `utils/audio.read_audio(url, sampling_rate)` ⇒ <code>Promise.<Float32Array></code> | |
| Helper function to read audio from a path/URL. | |
| **Kind**: static method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Promise.<Float32Array></code> - The decoded audio as a `Float32Array`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>url</td><td><code>string</code> | <code>URL</code></td><td><p>The path/URL to load the audio from.</p> | |
| </td> | |
| </tr><tr> | |
| <td>sampling_rate</td><td><code>number</code></td><td><p>The sampling rate to use when decoding the audio.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio.read_audio..audio" class="group"></a> | |
| ### `read_audio~audio` : <code>Float32Array</code> | |
| **Kind**: inner property of [<code>read_audio</code>](#module_utils/audio.read_audio) | |
| * * * | |
| <a id="module_utils/audio.hanning" class="group"></a> | |
| ## `utils/audio.hanning(M)` ⇒ <code>Float64Array</code> | |
| Generates a Hanning window of length M. | |
| See https://numpy.org/doc/stable/reference/generated/numpy.hanning.html for more information. | |
| **Kind**: static method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Float64Array</code> - The generated Hanning window. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>M</td><td><code>number</code></td><td><p>The length of the Hanning window to generate.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio.hamming" class="group"></a> | |
| ## `utils/audio.hamming(M)` ⇒ <code>Float64Array</code> | |
| Generates a Hamming window of length M. | |
| See https://numpy.org/doc/stable/reference/generated/numpy.hamming.html for more information. | |
| **Kind**: static method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Float64Array</code> - The generated Hamming window. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>M</td><td><code>number</code></td><td><p>The length of the Hamming window to generate.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio.mel_filter_bank" class="group"></a> | |
| ## `utils/audio.mel_filter_bank(num_frequency_bins, num_mel_filters, min_frequency, max_frequency, sampling_rate, [norm], [mel_scale], [triangularize_in_mel_space])` ⇒ <code>Array.<Array<number>></code> | |
| Creates a frequency bin conversion matrix used to obtain a mel spectrogram. This is called a *mel filter bank*, and | |
| various implementations exist, which differ in the number of filters, the shape of the filters, the way the filters | |
| are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these | |
| features is to approximate the non-linear human perception of the variation in pitch with respect to the frequency. | |
| **Kind**: static method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Array.<Array<number>></code> - Triangular filter bank matrix, which is a 2D array of shape (`num_frequency_bins`, `num_mel_filters`). | |
| This is a projection matrix to go from a spectrogram to a mel spectrogram. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>num_frequency_bins</td><td><code>number</code></td><td><p>Number of frequency bins (should be the same as <code>n_fft // 2 + 1</code> | |
| where <code>n_fft</code> is the size of the Fourier Transform used to compute the spectrogram).</p> | |
| </td> | |
| </tr><tr> | |
| <td>num_mel_filters</td><td><code>number</code></td><td><p>Number of mel filters to generate.</p> | |
| </td> | |
| </tr><tr> | |
| <td>min_frequency</td><td><code>number</code></td><td><p>Lowest frequency of interest in Hz.</p> | |
| </td> | |
| </tr><tr> | |
| <td>max_frequency</td><td><code>number</code></td><td><p>Highest frequency of interest in Hz. This should not exceed <code>sampling_rate / 2</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>sampling_rate</td><td><code>number</code></td><td><p>Sample rate of the audio waveform.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[norm]</td><td><code>string</code></td><td><p>If <code>"slaney"</code>, divide the triangular mel weights by the width of the mel band (area normalization).</p> | |
| </td> | |
| </tr><tr> | |
| <td>[mel_scale]</td><td><code>string</code></td><td><p>The mel frequency scale to use, <code>"htk"</code> or <code>"slaney"</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[triangularize_in_mel_space]</td><td><code>boolean</code></td><td><p>If this option is enabled, the triangular filter is applied in mel space rather than frequency space. | |
| This should be set to <code>true</code> in order to get the same results as <code>torchaudio</code> when computing mel filters.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio.spectrogram" class="group"></a> | |
| ## `utils/audio.spectrogram(waveform, window, frame_length, hop_length, options)` ⇒ [<code>Promise.<Tensor></code>](#Tensor) | |
| Calculates a spectrogram over one waveform using the Short-Time Fourier Transform. | |
| This function can create the following kinds of spectrograms: | |
| - amplitude spectrogram (`power = 1.0`) | |
| - power spectrogram (`power = 2.0`) | |
| - complex-valued spectrogram (`power = null`) | |
| - log spectrogram (use `log_mel` argument) | |
| - mel spectrogram (provide `mel_filters`) | |
| - log-mel spectrogram (provide `mel_filters` and `log_mel`) | |
| In this implementation, the window is assumed to be zero-padded to have the same size as the analysis frame. | |
| A padded window can be obtained from `window_function()`. The FFT input buffer may be larger than the analysis frame, | |
| typically the next power of two. | |
| **Kind**: static method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: [<code>Promise.<Tensor></code>](#Tensor) - Spectrogram of shape `(num_frequency_bins, length)` (regular spectrogram) or shape `(num_mel_filters, length)` (mel spectrogram). | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>waveform</td><td><code>Float32Array</code> | <code>Float64Array</code></td><td></td><td><p>The input waveform of shape <code>(length,)</code>. This must be a single real-valued, mono waveform.</p> | |
| </td> | |
| </tr><tr> | |
| <td>window</td><td><code>Float32Array</code> | <code>Float64Array</code></td><td></td><td><p>The windowing function to apply of shape <code>(frame_length,)</code>, including zero-padding if necessary. The actual window length may be | |
| shorter than <code>frame_length</code>, but we're assuming the array has already been zero-padded.</p> | |
| </td> | |
| </tr><tr> | |
| <td>frame_length</td><td><code>number</code></td><td></td><td><p>The length of the analysis frames in samples (a.k.a., <code>fft_length</code>).</p> | |
| </td> | |
| </tr><tr> | |
| <td>hop_length</td><td><code>number</code></td><td></td><td><p>The stride between successive analysis frames in samples.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td></td> | |
| </tr><tr> | |
| <td>[options.fft_length]</td><td><code>number</code></td><td><code></code></td><td><p>The size of the FFT buffer in samples. This determines how many frequency bins the spectrogram will have. | |
| For optimal speed, this should be a power of two. If <code>null</code>, uses <code>frame_length</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.power]</td><td><code>number</code></td><td><code>1.0</code></td><td><p>If 1.0, returns the amplitude spectrogram. If 2.0, returns the power spectrogram. If <code>null</code>, returns complex numbers.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.center]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to pad the waveform so that frame <code>t</code> is centered around time <code>t * hop_length</code>. If <code>false</code>, frame | |
| <code>t</code> will start at time <code>t * hop_length</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.pad_mode]</td><td><code>string</code></td><td><code>"reflect"</code></td><td><p>Padding mode used when <code>center</code> is <code>true</code>. Possible values are: <code>"constant"</code> (pad with zeros), | |
| <code>"edge"</code> (pad with edge values), <code>"reflect"</code> (pads with mirrored values).</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.onesided]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>If <code>true</code>, only computes the positive frequencies and returns a spectrogram containing <code>fft_length // 2 + 1</code> | |
| frequency bins. If <code>false</code>, also computes the negative frequencies and returns <code>fft_length</code> frequency bins.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.preemphasis]</td><td><code>number</code></td><td><code></code></td><td><p>Coefficient for the pre-emphasis filter (a first-order filter that boosts high frequencies), applied before the DFT.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.preemphasis_htk_flavor]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to apply the pre-emphasis filter in the HTK flavor.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.mel_filters]</td><td><code>Array.<Array<number>></code></td><td><code></code></td><td><p>The mel filter bank of shape <code>(num_freq_bins, num_mel_filters)</code>. | |
| If supplied, applies this filter bank to create a mel spectrogram.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.mel_floor]</td><td><code>number</code></td><td><code>1e-10</code></td><td><p>Minimum value of mel frequency banks.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.log_mel]</td><td><code>string</code></td><td><code>null</code></td><td><p>How to convert the spectrogram to log scale. Possible options are: | |
| <code>null</code> (don't convert), <code>"log"</code> (take the natural logarithm), <code>"log10"</code> (take the base-10 logarithm), <code>"dB"</code> (convert to decibels). | |
| Can only be used when <code>power</code> is not <code>null</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.reference]</td><td><code>number</code></td><td><code>1.0</code></td><td><p>Sets the input spectrogram value that corresponds to 0 dB. For example, use <code>max(spectrogram)[0]</code> to set | |
| the loudest part to 0 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.min_value]</td><td><code>number</code></td><td><code>1e-10</code></td><td><p>The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking <code>log(0)</code>. | |
| For a power spectrogram, the default of <code>1e-10</code> corresponds to a minimum of -100 dB. For an amplitude spectrogram, the value <code>1e-5</code> corresponds to -100 dB. | |
| Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.db_range]</td><td><code>number</code></td><td><code></code></td><td><p>Sets the maximum dynamic range in decibels. For example, if <code>db_range = 80</code>, the difference between the | |
| peak value and the smallest value will never be more than 80 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.remove_dc_offset]</td><td><code>boolean</code></td><td><code></code></td><td><p>Subtract mean from waveform on each frame, applied before pre-emphasis. This should be set to <code>true</code> in | |
| order to get the same results as <code>torchaudio.compliance.kaldi.fbank</code> when computing mel filters.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.max_num_frames]</td><td><code>number</code></td><td><code></code></td><td><p>If provided, limits the number of frames to compute to this value.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.min_num_frames]</td><td><code>number</code></td><td><code></code></td><td><p>If provided, ensures the number of frames to compute is at least this value.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.do_pad]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>If <code>true</code>, pads the output spectrogram to have <code>max_num_frames</code> frames.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.transpose]</td><td><code>boolean</code></td><td><code>false</code></td><td><p>If <code>true</code>, the returned spectrogram will have shape <code>(num_frames, num_frequency_bins/num_mel_filters)</code>. If <code>false</code>, the returned spectrogram will have shape <code>(num_frequency_bins/num_mel_filters, num_frames)</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.mel_offset]</td><td><code>number</code></td><td><code>0</code></td><td><p>Offset to add to the mel spectrogram to avoid taking the log of zero.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
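
The effect of `power` and `onesided` can be illustrated on a single analysis frame. The following is a minimal sketch, not the module's implementation: it uses a naive DFT instead of an FFT, the helper name `frameSpectrum` is hypothetical, and it only covers the real-valued cases (`power` of 1.0 or 2.0), not the complex-valued output.

```javascript
// Sketch only: one-sided spectrum of a single windowed frame via a naive DFT.
// `frameSpectrum` is a hypothetical helper, not part of utils/audio.
function frameSpectrum(frame, window, fftLength, power = 1.0) {
  // One-sided output keeps the non-negative frequencies: fft_length // 2 + 1 bins.
  const numBins = Math.floor(fftLength / 2) + 1;
  const out = new Float64Array(numBins);
  for (let k = 0; k < numBins; ++k) {
    let re = 0;
    let im = 0;
    // Samples past frame.length are treated as zeros (zero-padding up to fftLength).
    for (let n = 0; n < frame.length; ++n) {
      const x = frame[n] * window[n];
      const angle = (-2 * Math.PI * k * n) / fftLength;
      re += x * Math.cos(angle);
      im += x * Math.sin(angle);
    }
    // power = 1.0 gives the amplitude, power = 2.0 gives the power.
    out[k] = Math.pow(Math.hypot(re, im), power);
  }
  return out;
}

const frame = new Float64Array([1, 1, 1, 1]);
const rectangular = new Float64Array([1, 1, 1, 1]);
const amplitude = frameSpectrum(frame, rectangular, 4, 1.0);
const powerSpec = frameSpectrum(frame, rectangular, 4, 2.0);
```

For a constant frame, all of the energy lands in the first (DC) bin, and the power value there is the square of the amplitude value.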
| * * * | |
| <a id="module_utils/audio.window_function" class="group"></a> | |
| ## `utils/audio.window_function(window_length, name, options)` ⇒ <code>Float64Array</code> | |
| Returns an array containing the specified window. | |
| **Kind**: static method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Float64Array</code> - The window of shape `(window_length,)` or `(frame_length,)`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>window_length</td><td><code>number</code></td><td></td><td><p>The length of the window in samples.</p> | |
| </td> | |
| </tr><tr> | |
| <td>name</td><td><code>string</code></td><td></td><td><p>The name of the window function.</p> | |
| </td> | |
| </tr><tr> | |
| <td>options</td><td><code>Object</code></td><td></td><td><p>Additional options.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.periodic]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether the window is periodic or symmetric.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.frame_length]</td><td><code>number</code></td><td><code></code></td><td><p>The length of the analysis frames in samples. | |
| Provide a value for <code>frame_length</code> if the window is smaller than the frame length, so that it will be zero-padded.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[options.center]</td><td><code>boolean</code></td><td><code>true</code></td><td><p>Whether to center the window inside the FFT buffer. Only used when <code>frame_length</code> is provided.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
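
The zero-padding controlled by `frame_length` and `center` can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the helper name `padWindow` is hypothetical, and the sketch assumes that centering places `floor((frame_length - window_length) / 2)` zeros on the left.

```javascript
// Sketch only: place a short window inside a longer, zero-filled frame.
// `padWindow` is a hypothetical helper, not part of utils/audio.
function padWindow(window, frameLength, center = true) {
  if (window.length > frameLength) {
    throw new Error("window is longer than frameLength");
  }
  // A new Float64Array is already filled with zeros.
  const padded = new Float64Array(frameLength);
  // Assumption: when centering, any odd leftover sample of padding goes to the right.
  const offset = center ? Math.floor((frameLength - window.length) / 2) : 0;
  padded.set(window, offset);
  return padded;
}

const centered = padWindow(new Float64Array([1, 1]), 4, true);
const leftAligned = padWindow(new Float64Array([1, 1]), 4, false);
```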
| * * * | |
| <a id="module_utils/audio..generalized_cosine_window" class="group"></a> | |
| ## `utils/audio~generalized_cosine_window(M, a_0)` ⇒ <code>Float64Array</code> | |
| Helper function to generate windows that are special cases of the generalized cosine window. | |
| See https://www.mathworks.com/help/signal/ug/generalized-cosine-windows.html for more information. | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Float64Array</code> - The generated window. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>M</td><td><code>number</code></td><td><p>Number of points in the output window. If zero or less, an empty array is returned.</p> | |
| </td> | |
| </tr><tr> | |
| <td>a_0</td><td><code>number</code></td><td><p>Offset for the generalized cosine window.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
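
As a rough illustration, the Hanning and Hamming windows are the `a_0 = 0.5` and `a_0 = 0.54` cases of the symmetric generalized cosine window `w[n] = a_0 - (1 - a_0) * cos(2πn / (M - 1))`. The sketch below is a minimal reimplementation for explanation only. The camel-case name is hypothetical, and returning `[1]` for `M = 1` is an assumption borrowed from NumPy's behaviour.

```javascript
// Sketch only: symmetric generalized cosine window.
// `generalizedCosineWindow` is a hypothetical name used for illustration.
function generalizedCosineWindow(M, a_0) {
  if (M < 1) return new Float64Array(0);      // zero or less: empty window
  if (M === 1) return new Float64Array([1]);  // assumption: single point is 1, as in NumPy
  const window = new Float64Array(M);
  const step = (2 * Math.PI) / (M - 1);
  for (let n = 0; n < M; ++n) {
    window[n] = a_0 - (1 - a_0) * Math.cos(n * step);
  }
  return window;
}

const hann = generalizedCosineWindow(5, 0.5);     // Hanning window
const hamming = generalizedCosineWindow(5, 0.54); // Hamming window
```

With these coefficients the Hanning window tapers to zero at both ends, while the Hamming window stops at 0.08.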
| * * * | |
| <a id="module_utils/audio..hertz_to_mel" class="group"></a> | |
| ## `utils/audio~hertz_to_mel(freq, [mel_scale])` ⇒ <code>T</code> | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>freq</td><td><code>T</code></td><td></td> | |
| </tr><tr> | |
| <td>[mel_scale]</td><td><code>string</code></td><td><code>"htk"</code></td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio..mel_to_hertz" class="group"></a> | |
| ## `utils/audio~mel_to_hertz(mels, [mel_scale])` ⇒ <code>T</code> | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>mels</td><td><code>T</code></td><td></td> | |
| </tr><tr> | |
| <td>[mel_scale]</td><td><code>string</code></td><td><code>"htk"</code></td> | |
| </tr> </tbody> | |
| </table> | |
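
For reference, the `"htk"` mel scale is commonly defined as `mel = 2595 * log10(1 + f / 700)`. The sketch below shows that formula and its inverse for scalar inputs only. The function names are hypothetical, and the `"slaney"` scale as well as array inputs (the generic `T` above) are left out.

```javascript
// Sketch only: HTK-style conversion between hertz and mels for single numbers.
// These helper names are hypothetical and not part of utils/audio.
function hertzToMelHtk(freq) {
  return 2595 * Math.log10(1 + freq / 700);
}

function melToHertzHtk(mels) {
  return 700 * (Math.pow(10, mels / 2595) - 1);
}

const mels = hertzToMelHtk(1000);       // close to 1000 mels by construction of the scale
const roundTrip = melToHertzHtk(mels);  // recovers roughly 1000 Hz
```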
| * * * | |
| <a id="module_utils/audio.._create_triangular_filter_bank" class="group"></a> | |
| ## `utils/audio~_create_triangular_filter_bank(fft_freqs, filter_freqs)` ⇒ <code>Array.<Array<number>></code> | |
| Creates a triangular filter bank. | |
| Adapted from torchaudio and librosa. | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>Array.<Array<number>></code> - The triangular filter bank, of shape `(num_frequency_bins, num_mel_filters)`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>fft_freqs</td><td><code>Float64Array</code></td><td><p>Discrete frequencies of the FFT bins in Hz, of shape <code>(num_frequency_bins,)</code>.</p> | |
| </td> | |
| </tr><tr> | |
| <td>filter_freqs</td><td><code>Float64Array</code></td><td><p>Center frequencies of the triangular filters to create, in Hz, of shape <code>(num_mel_filters,)</code>.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
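
One way to picture a triangular filter bank is shown below. This is a hedged sketch, not the module's code: the name `triangularFilterBank` is hypothetical, and it assumes that `filterFreqs` holds `K + 2` band-edge frequencies for `K` filters (the lower edge, `K` centers, and the upper edge), which the table above does not state.

```javascript
// Sketch only: triangular filters defined by consecutive (left, center, right) edges.
// Assumption: filterFreqs has K + 2 strictly increasing entries for K filters.
function triangularFilterBank(fftFreqs, filterFreqs) {
  const numFilters = filterFreqs.length - 2;
  // Result has shape (num_frequency_bins, num_filters).
  const bank = Array.from({ length: fftFreqs.length }, () => new Array(numFilters).fill(0));
  for (let j = 0; j < fftFreqs.length; ++j) {
    for (let i = 0; i < numFilters; ++i) {
      const left = filterFreqs[i];
      const center = filterFreqs[i + 1];
      const right = filterFreqs[i + 2];
      const rising = (fftFreqs[j] - left) / (center - left);
      const falling = (right - fftFreqs[j]) / (right - center);
      // The triangle is the lower of the two slopes, clamped at zero outside the band.
      bank[j][i] = Math.max(0, Math.min(rising, falling));
    }
  }
  return bank;
}

const bank = triangularFilterBank([0, 1, 2, 3, 4], [0, 2, 4]);
```

In this small example there is a single filter that rises from 0 Hz to its peak at 2 Hz and falls back to zero at 4 Hz.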
| * * * | |
| <a id="module_utils/audio..linspace" class="group"></a> | |
| ## `utils/audio~linspace(start, end, num)` ⇒ | |
| Return evenly spaced numbers over a specified interval. | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: `num` evenly spaced samples, calculated over the interval `[start, end]`. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>start</td><td><code>number</code></td><td><p>The starting value of the sequence.</p> | |
| </td> | |
| </tr><tr> | |
| <td>end</td><td><code>number</code></td><td><p>The end value of the sequence.</p> | |
| </td> | |
| </tr><tr> | |
| <td>num</td><td><code>number</code></td><td><p>Number of samples to generate.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
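
A minimal sketch of this behaviour is given below. It is written for illustration only, and returning a `Float64Array` is an assumption, since the return type is not listed above.

```javascript
// Sketch only: evenly spaced samples over the closed interval [start, end].
// Assumption: the result is a Float64Array.
function linspace(start, end, num) {
  const out = new Float64Array(num);
  if (num === 1) {
    out[0] = start; // a single sample is just the starting value
    return out;
  }
  const step = (end - start) / (num - 1);
  for (let i = 0; i < num; ++i) {
    out[i] = start + i * step;
  }
  return out;
}

const samples = linspace(0, 1, 5); // five samples, both endpoints included
```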
| * * * | |
| <a id="module_utils/audio..padReflect" class="group"></a> | |
| ## `utils/audio~padReflect(array, left, right)` ⇒ <code>T</code> | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>T</code> - The padded array. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>array</td><td><code>T</code></td><td><p>The array to pad.</p> | |
| </td> | |
| </tr><tr> | |
| <td>left</td><td><code>number</code></td><td><p>The amount of padding to add to the left.</p> | |
| </td> | |
| </tr><tr> | |
| <td>right</td><td><code>number</code></td><td><p>The amount of padding to add to the right.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
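
Reflect padding can be sketched as below. This assumes NumPy-style `"reflect"` semantics, where the values are mirrored around the first and last elements without repeating them. The source does not spell out this detail, so treat it as an assumption of the sketch. It also assumes a non-empty input.

```javascript
// Sketch only: reflect padding that mirrors around the edges without repeating them.
// Assumptions: NumPy-style "reflect" semantics and a non-empty input array.
function padReflect(array, left, right) {
  const n = array.length;
  const padded = new array.constructor(n + left + right);
  const period = 2 * (n - 1); // length of one full back-and-forth pass
  for (let i = 0; i < padded.length; ++i) {
    let j = i - left; // position relative to the original array
    if (period === 0) {
      j = 0; // a single-element array can only repeat itself
    } else {
      j = ((j % period) + period) % period; // wrap into [0, period)
      if (j >= n) j = period - j;           // fold the second half back
    }
    padded[i] = array[j];
  }
  return padded;
}

const padded = padReflect(new Float32Array([1, 2, 3, 4]), 2, 2);
```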
| * * * | |
| <a id="module_utils/audio.._db_conversion_helper" class="group"></a> | |
| ## `utils/audio~_db_conversion_helper(spectrogram, factor, reference, min_value, db_range)` ⇒ <code>T</code> | |
| Helper function to compute `amplitude_to_db` and `power_to_db`. | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>spectrogram</td><td><code>T</code></td> | |
| </tr><tr> | |
| <td>factor</td><td><code>number</code></td> | |
| </tr><tr> | |
| <td>reference</td><td><code>number</code></td> | |
| </tr><tr> | |
| <td>min_value</td><td><code>number</code></td> | |
| </tr><tr> | |
| <td>db_range</td><td><code>number</code></td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio..amplitude_to_db" class="group"></a> | |
| ## `utils/audio~amplitude_to_db(spectrogram, [reference], [min_value], [db_range])` ⇒ <code>T</code> | |
| Converts an amplitude spectrogram to the decibel scale. This computes `20 * log10(spectrogram / reference)`, | |
| using basic logarithm properties for numerical stability. NOTE: Operates in-place. | |
| The motivation behind applying the log function on the (mel) spectrogram is that humans do not hear loudness on a | |
| linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it. | |
| This means that large variations in energy may not sound all that different if the sound is loud to begin with. | |
| This compression operation makes the (mel) spectrogram features match more closely what humans actually hear. | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>T</code> - The modified spectrogram in decibels. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>spectrogram</td><td><code>T</code></td><td></td><td><p>The input amplitude (mel) spectrogram.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[reference]</td><td><code>number</code></td><td><code>1.0</code></td><td><p>Sets the input spectrogram value that corresponds to 0 dB. | |
| For example, use <code>np.max(spectrogram)</code> to set the loudest part to 0 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[min_value]</td><td><code>number</code></td><td><code>1e-5</code></td><td><p>The spectrogram will be clipped to this minimum value before conversion to decibels, | |
| to avoid taking <code>log(0)</code>. The default of <code>1e-5</code> corresponds to a minimum of -100 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[db_range]</td><td><code>number</code></td><td><code></code></td><td><p>Sets the maximum dynamic range in decibels. For example, if <code>db_range = 80</code>, the | |
| difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
| * * * | |
| <a id="module_utils/audio..power_to_db" class="group"></a> | |
| ## `utils/audio~power_to_db(spectrogram, [reference], [min_value], [db_range])` ⇒ <code>T</code> | |
| Converts a power spectrogram to the decibel scale. This computes `10 * log10(spectrogram / reference)`, | |
| using basic logarithm properties for numerical stability. NOTE: Operates in-place. | |
| The motivation behind applying the log function on the (mel) spectrogram is that humans do not hear loudness on a | |
| linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it. | |
| This means that large variations in energy may not sound all that different if the sound is loud to begin with. | |
| This compression operation makes the (mel) spectrogram features match more closely what humans actually hear. | |
| Based on the implementation of `librosa.power_to_db`. | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>T</code> - The modified spectrogram in decibels. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Default</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>spectrogram</td><td><code>T</code></td><td></td><td><p>The input power (mel) spectrogram. Note that a power spectrogram has the amplitudes squared!</p> | |
| </td> | |
| </tr><tr> | |
| <td>[reference]</td><td><code>number</code></td><td><code>1.0</code></td><td><p>Sets the input spectrogram value that corresponds to 0 dB. | |
| For example, use <code>np.max(spectrogram)</code> to set the loudest part to 0 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[min_value]</td><td><code>number</code></td><td><code>1e-10</code></td><td><p>The spectrogram will be clipped to this minimum value before conversion to decibels, | |
| to avoid taking <code>log(0)</code>. The default of <code>1e-10</code> corresponds to a minimum of -100 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr><tr> | |
| <td>[db_range]</td><td><code>number</code></td><td><code></code></td><td><p>Sets the maximum dynamic range in decibels. For example, if <code>db_range = 80</code>, the | |
| difference between the peak value and the smallest value will never be more than 80 dB. Must be greater than zero.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
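
The decibel conversions described for `amplitude_to_db` and `power_to_db` differ only in the multiplier (20 versus 10). A minimal sketch of the shared steps follows: clip to `min_value`, compute `factor * (log10(x) - log10(reference))`, then optionally limit the dynamic range. The name `dbConversion` and the exact order of operations are assumptions made for illustration, not a copy of the module's helper.

```javascript
// Sketch only: in-place decibel conversion shared by amplitude (factor 20)
// and power (factor 10) spectrograms. `dbConversion` is a hypothetical name.
function dbConversion(spectrogram, factor, reference = 1.0, minValue = 1e-10, dbRange = null) {
  if (reference <= 0) throw new Error("reference must be greater than zero");
  if (minValue <= 0) throw new Error("minValue must be greater than zero");
  // log(x / ref) = log(x) - log(ref), which avoids dividing tiny numbers.
  const referenceLog = Math.log10(Math.max(minValue, reference));
  let peak = -Infinity;
  for (let i = 0; i < spectrogram.length; ++i) {
    const clipped = Math.max(minValue, spectrogram[i]); // avoid log(0)
    spectrogram[i] = factor * (Math.log10(clipped) - referenceLog);
    peak = Math.max(peak, spectrogram[i]);
  }
  if (dbRange !== null) {
    if (dbRange <= 0) throw new Error("dbRange must be greater than zero");
    const floor = peak - dbRange; // nothing may fall more than dbRange below the peak
    for (let i = 0; i < spectrogram.length; ++i) {
      spectrogram[i] = Math.max(spectrogram[i], floor);
    }
  }
  return spectrogram;
}

const powerDb = dbConversion(new Float64Array([1, 10, 100]), 10);
const amplitudeDb = dbConversion(new Float64Array([0, 1]), 20, 1.0, 1e-5);
```

Because the array is modified in place, callers who still need the linear values should pass a copy.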
| * * * | |
| <a id="module_utils/audio..encodeWAV" class="group"></a> | |
| ## `utils/audio~encodeWAV(samples, rate)` ⇒ <code>ArrayBuffer</code> | |
| Encode audio data to a WAV file. | |
| WAV file specs: https://en.wikipedia.org/wiki/WAV#WAV_File_header | |
| Adapted from https://www.npmjs.com/package/audiobuffer-to-wav | |
| **Kind**: inner method of [<code>utils/audio</code>](#module_utils/audio) | |
| **Returns**: <code>ArrayBuffer</code> - The WAV audio buffer. | |
| <table> | |
| <thead> | |
| <tr> | |
| <th>Param</th><th>Type</th><th>Description</th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr> | |
| <td>samples</td><td><code>Float32Array</code></td><td><p>The audio samples.</p> | |
| </td> | |
| </tr><tr> | |
| <td>rate</td><td><code>number</code></td><td><p>The sample rate.</p> | |
| </td> | |
| </tr> </tbody> | |
| </table> | |
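
To make the WAV layout concrete, the sketch below writes a 44-byte header followed by the sample data. It is a simplified illustration under stated assumptions, not the module's encoder: it assumes mono audio stored as 32-bit IEEE float samples (format code 3), and the name `encodeWavFloat32` is hypothetical.

```javascript
// Sketch only: minimal mono WAV encoder with 32-bit float samples.
// Assumptions: one channel, format code 3 (IEEE float), little-endian fields.
function encodeWavFloat32(samples, rate) {
  const bytesPerSample = 4;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeString = (offset, text) => {
    for (let i = 0; i < text.length; ++i) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeString(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);          // file size minus the first 8 bytes
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true);                    // size of the fmt chunk
  view.setUint16(20, 3, true);                     // audio format: 3 = IEEE float
  view.setUint16(22, 1, true);                     // number of channels
  view.setUint32(24, rate, true);                  // sample rate
  view.setUint32(28, rate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);        // block align
  view.setUint16(34, 32, true);                    // bits per sample
  writeString(36, "data");
  view.setUint32(40, dataSize, true);              // size of the sample data

  for (let i = 0; i < samples.length; ++i) {
    view.setFloat32(44 + i * bytesPerSample, samples[i], true);
  }
  return buffer;
}

const wav = encodeWavFloat32(new Float32Array([0, 0.5, -0.5]), 16000);
```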
| * * * | |