GokseninYuksel committed
Commit de3ba83 (verified) · 1 Parent(s): a42e312

Upload feature extractor
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for use of the model without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for use of the model when fine-tuned for a task, or when plugged into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
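+ As a minimal sketch in the meantime: the repository id below is a placeholder, and `trust_remote_code=True` is
+ assumed to be required because the extractor class is resolved through the `auto_map` entry in
+ `preprocessor_config.json`.
+
+ ```python
+ import numpy as np
+ from transformers import AutoFeatureExtractor
+
+ # Placeholder repo id; substitute the actual Hub path of this repository.
+ extractor = AutoFeatureExtractor.from_pretrained("<user>/<repo>", trust_remote_code=True)
+
+ # One second of 4-channel first-order-ambisonics audio (W, X, Y, Z) at 32 kHz.
+ audio = np.random.randn(4, 32000).astype(np.float32)
+ inputs = extractor([audio], sampling_rate=32000)
+ print(inputs["input_values"].shape)  # (batch, 7, time_frames, num_mel_bins)
+ ```
+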
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
feature_extraction_gramt_ambisonics.py ADDED
@@ -0,0 +1,237 @@
+ from typing import Callable, Optional, Union
+
+ import numpy as np
+ import torch
+ from torch import Tensor
+ from torchaudio.transforms import MelScale, Spectrogram
+
+ from transformers import BatchFeature, SequenceFeatureExtractor
+ from transformers.utils import TensorType
+
+
+ class FeatureExtractor(torch.nn.Module):
+
+     def __init__(
+         self,
+         sample_rate: int = 32000,
+         n_fft: int = 400,
+         win_length: Optional[int] = None,
+         hop_length: Optional[int] = None,
+         f_min: float = 0.0,
+         f_max: Optional[float] = None,
+         pad: int = 0,
+         n_mels: int = 128,
+         window_fn: Callable[..., Tensor] = torch.hann_window,
+         power: Optional[float] = None,
+         normalized: bool = False,
+         wkwargs: Optional[dict] = None,
+         center: bool = True,
+         pad_mode: str = "reflect",
+         onesided: Optional[bool] = None,
+         norm: Optional[str] = None,
+         mel_scale: str = "htk",
+     ) -> None:
+         super().__init__()
+
+         self.sample_rate = sample_rate
+         self.power = power
+         self.n_fft = n_fft
+         self.win_length = win_length if win_length is not None else n_fft
+         self.hop_length = hop_length if hop_length is not None else self.win_length // 2
+         self.pad = pad
+         self.normalized = normalized
+         self.n_mels = n_mels  # number of mel frequency bins
+         self.f_max = f_max
+         self.f_min = f_min
+         self.eps = 1e-6
+         # Complex spectrogram (power=None here) so that phase is available for
+         # the intensity-vector computation below; the STFT is always one-sided.
+         self.spectrogram = Spectrogram(
+             n_fft=self.n_fft,
+             win_length=self.win_length,
+             hop_length=self.hop_length,
+             pad=self.pad,
+             window_fn=window_fn,
+             power=None,
+             normalized=self.normalized,
+             wkwargs=wkwargs,
+             center=center,
+             pad_mode=pad_mode,
+             onesided=True,
+         )
+         self.mel_scale = MelScale(
+             self.n_mels, self.sample_rate, self.f_min, self.f_max, self.n_fft // 2 + 1, norm, mel_scale
+         )
+         self.processed_spec = None
+
+     def _get_foa_intensity_vectors(self, linear_spectra):
+         """
+         Convert FOA (First Order Ambisonic) linear spectra to intensity vectors.
+
+         Args:
+             linear_spectra: Complex tensor of shape (batch, 4, freq_bins, time_frames),
+                 where the 4 channels are [W, X, Y, Z].
+
+         Returns:
+             foa_iv: Tensor of shape (batch, 3, n_mels, time_frames).
+         """
+
+         # Extract the W channel (omnidirectional component); keep the channel dim.
+         W = linear_spectra[:, [0], ...]
+         XYZ = linear_spectra[:, 1:, ...]
+
+         # Compute intensity vectors using the complex conjugate:
+         # I = 2 * Re(conj(W) * [X, Y, Z])
+         I = 2 * torch.real(torch.conj(W) * XYZ)
+
+         # Compute energy with an epsilon for numerical stability:
+         # E = eps + |W|^2 + |X|^2 + |Y|^2 + |Z|^2
+         W_power = torch.squeeze(torch.abs(W) ** 2, dim=1)
+         xyz_power = torch.sum(torch.abs(XYZ) ** 2, dim=1)
+         E = self.eps + W_power + xyz_power
+
+         # Normalize the intensity vectors by the total energy.
+         I_norm = I / E.unsqueeze(dim=1)
+
+         # Project the normalized intensity vectors onto the mel scale.
+         foa_iv = self.mel_scale(I_norm)
+
+         return foa_iv
+
+     def forward(self, audio):
+         # Complex STFT of all four channels.
+         spec = self.spectrogram(audio)
+         # Per-channel log-mel spectrogram from the magnitude raised to `power`.
+         power_spec = torch.abs(spec) ** self.power
+         mel_spec = torch.log(self.mel_scale(power_spec) + self.eps)
+         # Three intensity-vector channels derived from the complex STFT.
+         foa_aiv = self._get_foa_intensity_vectors(spec)
+         # Concatenate along the channel dim: 4 log-mel + 3 intensity-vector channels.
+         return torch.cat([mel_spec, foa_aiv], dim=1)
+
+
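+ # Illustrative shape sketch (not part of the original module): with the settings
+ # used by AmbisonicsFeatureExtractor below (n_fft=1024, hop_length=320,
+ # n_mels=128, power=2.0), a one-second FOA clip at 32 kHz yields four log-mel
+ # channels plus three intensity-vector channels, concatenated along dim 1.
+ #
+ #     fe = FeatureExtractor(sample_rate=32000, n_fft=1024, win_length=1024,
+ #                           hop_length=320, f_min=50, f_max=16000,
+ #                           n_mels=128, power=2.0)
+ #     audio = torch.randn(1, 4, 32000)  # (batch, [W, X, Y, Z], samples)
+ #     fe(audio).shape                   # -> (1, 7, n_mels, time_frames)
+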
+ class AmbisonicsFeatureExtractor(SequenceFeatureExtractor):
+     r"""
+     Constructs an Ambisonics (FOA) feature extractor.
+
+     This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+     most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+     This class extracts log-mel filter bank features and first-order-ambisonics intensity vectors from raw audio
+     using TorchAudio and batches them.
+
+     Args:
+         feature_size (`int`, *optional*, defaults to 1):
+             The feature dimension of the extracted features.
+         sampling_rate (`int`, *optional*, defaults to 32000):
+             The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
+         num_mel_bins (`int`, *optional*, defaults to 128):
+             Number of Mel-frequency bins.
+         padding_value (`float`, *optional*, defaults to 0.0):
+             The value used to pad shorter sequences.
+     """
+
+     in_channels = 4
+     feature_extractor_type = "gram-ambisonics"
+
+     def __init__(
+         self,
+         feature_size=1,
+         sampling_rate=32000,
+         num_mel_bins=128,
+         padding_value=0.0,
+         **kwargs,
+     ):
+         super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+         self.num_mel_bins = num_mel_bins
+
+     def _extract_fbank_features(
+         self,
+         waveform: np.ndarray,
+     ) -> torch.Tensor:
+         """
+         Get log-mel and intensity-vector features using TorchAudio. The waveform is expected as floating-point
+         samples; mono and stereo inputs are replicated to four channels to match the FOA layout.
+         """
+         melspec = FeatureExtractor(
+             sample_rate=self.sampling_rate,
+             n_fft=1024,
+             win_length=1024,
+             hop_length=self.sampling_rate // 100,  # 10 ms hop
+             f_min=50,
+             f_max=self.sampling_rate // 2,
+             n_mels=self.num_mel_bins,
+             power=2.0,
+         )
+
+         # Accept both numpy arrays and tensors.
+         waveform = torch.as_tensor(waveform)
+         # Heuristic: a 2-D input whose first dimension is much larger than any
+         # plausible channel count is assumed samples-first, so transpose it.
+         if (waveform.ndim == 2) and (waveform.shape[0] > 100):
+             waveform = waveform.transpose(1, 0)
+         if waveform.ndim == 1:
+             waveform = waveform.unsqueeze(0)
+
+         # Handle mono/stereo/FOA channel counts consistently.
+         if waveform.shape[0] == 1:
+             # Mono: replicate the single channel four times.
+             waveform = torch.cat([waveform, waveform, waveform, waveform], dim=0).unsqueeze(0)
+             log_mel = melspec(waveform).transpose(3, 2)[0]
+             return log_mel
+         elif waveform.shape[0] == 2:
+             # Stereo: keep the first channel (preserving the channel dim) and replicate it.
+             waveform = waveform[0:1]
+             waveform = torch.cat([waveform, waveform, waveform, waveform], dim=0).unsqueeze(0)
+             log_mel = melspec(waveform).transpose(3, 2)[0]
+             return log_mel
+         elif waveform.shape[0] == 4:
+             # Native FOA input: [W, X, Y, Z].
+             log_mel = melspec(waveform.unsqueeze(0)).transpose(3, 2)[0]
+             return log_mel
+         else:
+             raise ValueError("Unknown channel count")
+
+     def _normalize_audio(self, audio_data, target_dBFS=-14.0):
+         rms = torch.sqrt(torch.mean(audio_data**2))  # RMS of the audio
+         if rms == 0:  # Avoid division by zero for completely silent audio
+             return audio_data
+         current_dBFS = 20 * torch.log10(rms)  # Convert RMS to dBFS
+         gain_dB = target_dBFS - current_dBFS  # Required gain in dB
+         gain_linear = 10 ** (gain_dB / 20)  # Convert the gain from dB to linear scale
+         normalized_audio = audio_data * gain_linear  # Apply the gain to the audio data
+         return normalized_audio
+
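+     # Worked example (illustrative): a full-scale sine has RMS ≈ 0.707, i.e.
+     # about -3 dBFS, so normalizing to -14 dBFS applies a linear gain of
+     # 10 ** ((-14 - (-3)) / 20) ≈ 0.28.
+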
+     def __call__(
+         self,
+         raw_speech: Union[np.ndarray, list[float], list[np.ndarray], list[list[float]]],
+         sampling_rate: Optional[int] = None,
+         return_tensors: Optional[Union[str, TensorType]] = None,
+         **kwargs,
+     ) -> BatchFeature:
+         """
+         Main method to featurize and prepare one or several sequence(s) for the model.
+
+         Args:
+             raw_speech (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`):
+                 The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
+                 values, a list of numpy arrays or a list of lists of float values.
+             sampling_rate (`int`, *optional*):
+                 The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
+                 `sampling_rate` at the forward call to prevent silent errors.
+             return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                 If set, will return tensors instead of lists of python values. Acceptable values are:
+
+                 - `'tf'`: Return TensorFlow `tf.constant` objects.
+                 - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                 - `'np'`: Return Numpy `np.ndarray` objects.
+         """
+
+         if sampling_rate is not None:
+             if sampling_rate != self.sampling_rate:
+                 raise ValueError(
+                     f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                     f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
+                     f" {self.sampling_rate} and not {sampling_rate}."
+                 )
+
+         # Extract features per clip. Note that pad_sequence pads along the leading
+         # (channel) dimension, so clips are expected to share the same time length.
+         features = [self._extract_fbank_features(waveform) for waveform in raw_speech]
+         features = torch.nn.utils.rnn.pad_sequence(features, batch_first=True)
+         return BatchFeature({"input_values": features}, tensor_type=return_tensors)
+
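+
+ if __name__ == "__main__":
+     # Minimal smoke test (illustrative sketch, not part of the original file):
+     # featurize one second of random 4-channel FOA audio at 32 kHz.
+     extractor = AmbisonicsFeatureExtractor(sampling_rate=32000, num_mel_bins=128)
+     dummy = np.random.randn(4, 32000).astype(np.float32)
+     batch = extractor([dummy], sampling_rate=32000, return_tensors="pt")
+     print(batch["input_values"].shape)  # (1, 7, time_frames, num_mel_bins)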
preprocessor_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "auto_map": {
+     "AutoFeatureExtractor": "feature_extraction_gramt_ambisonics.AmbisonicsFeatureExtractor"
+   },
+   "feature_extractor_type": "AmbisonicsFeatureExtractor",
+   "feature_size": 1,
+   "num_mel_bins": 128,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 32000
+ }