gradio app
- .gitignore +5 -0
- LICENSE +21 -0
- README.md +320 -13
- app.py +71 -0
- config.py +28 -0
- data_util/audioset_classes.py +1393 -0
- data_util/audioset_strong.py +329 -0
- data_util/dcase2016task2.py +280 -0
- data_util/transforms.py +195 -0
- ex_audioset_strong.py +504 -0
- ex_dcase2016task2.py +517 -0
- helpers/augment.py +225 -0
- helpers/decode.py +72 -0
- helpers/encode.py +230 -0
- helpers/score.py +384 -0
- helpers/utils.py +12 -0
- images/downstream_task_results.png +0 -0
- inference.py +126 -0
- models/asit/ASIT_wrapper.py +60 -0
- models/asit/data_transformations.py +29 -0
- models/asit/utils.py +540 -0
- models/asit/vision_transformer.py +316 -0
- models/atstframe/ATSTF_wrapper.py +105 -0
- models/atstframe/audio_transformer.py +253 -0
- models/atstframe/transformer.py +112 -0
- models/beats/BEATs.py +183 -0
- models/beats/BEATs_wrapper.py +56 -0
- models/beats/Tokenizers.py +172 -0
- models/beats/backbone.py +783 -0
- models/beats/modules.py +218 -0
- models/beats/quantizer.py +215 -0
- models/frame_mn/Frame_MN_wrapper.py +75 -0
- models/frame_mn/block_types.py +189 -0
- models/frame_mn/model.py +356 -0
- models/frame_mn/utils.py +93 -0
- models/frame_passt/fpasst.py +963 -0
- models/frame_passt/fpasst_wrapper.py +86 -0
- models/frame_passt/preprocess.py +147 -0
- models/frame_passt/vit_helpers.py +399 -0
- models/m2d/M2D_wrapper.py +52 -0
- models/m2d/portable_m2d.py +410 -0
- models/prediction_wrapper.py +213 -0
- models/seq_models.py +40 -0
- models/transformer_wrapper.py +19 -0
- requirements.txt +17 -0
- resources/README.md +1 -0
- resources/best_model_BEATs.pth +3 -0
- resources/eval_durations.csv +0 -0
- resources/labelvocabulary.csv +89 -0
.gitignore
ADDED
@@ -0,0 +1,5 @@
__pycache__/
*.pyc
*.pyo
*.pyd
__init__.pyc
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Florian Schmid

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED
@@ -1,13 +1,320 @@
# Effective Pre-Training of Audio Transformers for Sound Event Detection

In this repository, we publish pre-trained models and code for the ICASSP'25 paper: [**Effective Pre-Training of Audio Transformers for Sound Event Detection**](https://arxiv.org/abs/2409.09546).

In this paper, we propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. For five transformers, we show that this additional pre-training step leads to substantial performance improvements on frame-level downstream tasks. We release all model checkpoints and hope that they will help researchers improve tasks that require high-quality frame-level representations.

This repository includes:
* All pre-trained checkpoints and model files (see [here](https://github.com/fschmid56/PretrainedSED/releases/tag/v0.0.1))
* A script that demonstrates how the pre-trained checkpoints can be loaded and used for inference (see [here](https://github.com/fschmid56/PretrainedSED/blob/main/inference.py))
* A table outlining the external checkpoints used in this work (see [here](https://github.com/fschmid56/PretrainedSED?tab=readme-ov-file#model-checkpoints))
* The evaluation routine on the AudioSet frame-level annotations (see [here](https://github.com/fschmid56/PretrainedSED?tab=readme-ov-file#run-audioset-strong-evaluation))
* The AudioSet Strong training routine (see [here](https://github.com/fschmid56/PretrainedSED?tab=readme-ov-file#audioset-strong-pre-training))
* The ensemble logits for the AudioSet Strong dataset (see [here](https://github.com/fschmid56/PretrainedSED?tab=readme-ov-file#download-ensemble-pseudo-labels))
* A script demonstrating how the pre-trained transformers can be fine-tuned on a downstream task (see [here](ex_dcase2016task2.py))
* **New:** two low-complexity SED models ('frame_mn10' with 3.83M parameters and 'frame_mn06' with 1.62M parameters)

## Setting up Environment

1. If needed, create a new environment with Python 3.9 and activate it:

```bash
conda create -n ptsed python=3.9 cython
conda activate ptsed
```

2. Install the PyTorch build that suits your system. For example:

```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# or for cuda >= 12.1
pip3 install torch torchvision torchaudio
```

3. Install the requirements:

```bash
pip3 install -r requirements.txt
```

4. Install the package for mp3 decoding:

```bash
CFLAGS='-O3 -march=native' pip install https://github.com/f0k/minimp3py/archive/master.zip
```

## Inference

The script [inference.py](inference.py) demonstrates how to load a pre-trained model and run sound event detection on an audio file of arbitrary length.

```bash
python inference.py --cuda --model_name="BEATs" --audio_file="test_files/752547__iscence__milan_metro_coming_in_station.wav"
```

The argument ```model_name``` specifies the transformer used for inference, and the corresponding pre-trained model checkpoint is automatically downloaded and placed in the folder [resources](resources).

The argument ```audio_file``` specifies the path to a single audio file. There is one [example file](test_files/752547__iscence__milan_metro_coming_in_station.wav) included. More example files can be downloaded from the [GitHub release](https://github.com/fschmid56/PretrainedSED/releases/tag/v0.0.1).

**Low-complexity** inference with the customized MobileNet:

```bash
python inference.py --cuda --model_name="frame_mn06" --audio_file="test_files/752547__iscence__milan_metro_coming_in_station.wav"
```

## Model Checkpoints

The following is a list of checkpoints that we have created and worked with in our paper. For external checkpoints, we provide the download link. "Checkpoint Name" refers to the respective names in our [GitHub release](https://github.com/fschmid56/PretrainedSED/releases/tag/v0.0.1). **All model checkpoints** are automatically downloaded by running the code, or can be manually downloaded from the [GitHub release](https://github.com/fschmid56/PretrainedSED/releases/tag/v0.0.1).

| Model | Pre-Training | Checkpoint Name | External Download Link | Reference |
|---|---|---|---|---|
| BEATs | SSL | BEATs_ssl.pt | [here](https://1drv.ms/u/s!AqeByhGUtINrgcpxJUNDxg4eU0r-vA?e=qezPJ5) | [[1]](https://arxiv.org/pdf/2212.09058) |
| BEATs | Weak | BEATs_weak.pt | [here](https://1drv.ms/u/s!AqeByhGUtINrgcpke6_lRSZEKD5j2Q?e=A3FpOf) | [[1]](https://arxiv.org/pdf/2212.09058) |
| BEATs | Strong | BEATs_strong_1.pt | ours | [[1]](https://arxiv.org/pdf/2212.09058) |
| ATST-Frame | SSL | ATST-F_ssl.pt | [here](https://drive.google.com/file/d/1bGJSZWlAIIJ6GL5Id5dW0PTB72DL-QDQ/view?usp=sharing) | [[2]](https://arxiv.org/pdf/2306.04186) |
| ATST-Frame | Weak | ATST-F_weak.pt | [here](https://drive.google.com/file/d/1_xb0_n3UNbUG_pH1vLHTviLfsaSfCzxz/view?usp=drive_link) | [[2]](https://arxiv.org/pdf/2306.04186) |
| ATST-Frame | Strong | ATST-F_strong_1.pt | ours | [[2]](https://arxiv.org/pdf/2306.04186) |
| fPaSST | SSL | fpasst_im.pt | [here](https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth) | [[3]](https://arxiv.org/pdf/2110.05069), [[4]](https://arxiv.org/pdf/2407.12997) |
| fPaSST | Weak | fpasst_weak.pt | ours | [[3]](https://arxiv.org/pdf/2110.05069), [[4]](https://arxiv.org/pdf/2407.12997) |
| fPaSST | Strong | fpasst_strong_1.pt | ours | [[3]](https://arxiv.org/pdf/2110.05069), [[4]](https://arxiv.org/pdf/2407.12997) |
| ASiT | SSL | ASIT_ssl.pt | [here](https://drive.google.com/file/d/11eaOU40jonpYZ3u_XI-XUSSWclv8qeR7/view?usp=drive_link) | [[5]](https://arxiv.org/pdf/2211.13189) |
| ASiT | Weak | ASIT_weak.pt | ours | [[5]](https://arxiv.org/pdf/2211.13189) |
| ASiT | Strong | ASIT_strong_1.pt | ours | [[5]](https://arxiv.org/pdf/2211.13189) |
| M2D | SSL | M2D_ssl.pt | [here](https://github.com/nttcslab/m2d/releases/download/v0.3.0/m2d_clap_vit_base-80x1001p16x16-240128_AS-FT_enconly.zip) | [[6]](https://arxiv.org/pdf/2406.02032) |
| M2D | Weak | M2D_weak.pt | [here](https://github.com/nttcslab/m2d/releases/download/v0.3.0/m2d_clap_vit_base-80x1001p16x16-240128_AS-FT_enconly.zip) | [[6]](https://arxiv.org/pdf/2406.02032) |
| M2D | Strong | M2D_strong_1.pt | ours | [[6]](https://arxiv.org/pdf/2406.02032) |
| Customized MobileNet | Strong | frame_mn06.pt | ours | **NEW** |
| Customized MobileNet | Strong | frame_mn10.pt | ours | **NEW** |

## AudioSet Strong pre-training

### Prepare Dataset

1. Follow the steps described [here](https://github.com/kkoutini/PaSST/tree/main/audioset#experiments-on-audioset) to obtain AudioSet, encoded as mp3 files and packed into HDF5 format.

   You will end up with a directory containing three HDF5 files:
   * balanced_train_segments_mp3.hdf
   * unbalanced_train_segments_mp3.hdf
   * eval_segments_mp3.hdf

2. We use the [Huggingface datasets](https://huggingface.co/docs/datasets/index) API for fast and memory-efficient loading of the dataset. The file [hf_dataset_gen/audioset_strong.py](hf_dataset_gen/audioset_strong.py) takes the dataset from Step 1 and converts it into a Huggingface dataset.

   Adapt the paths marked as TODOs in [hf_dataset_gen/audioset_strong.py](hf_dataset_gen/audioset_strong.py) (2x: the HDF5 path and the target path for the HF dataset).

3. Create the Huggingface dataset:

```
cd hf_dataset_gen
python audioset_strong.py
```

4. The path to the dataset is specified via an environment variable. When you access the dataset for training or evaluation, set the environment variable. For example, in our case, the Huggingface dataset path is set to

```/share/hel/datasets/HF_datasets/local/audioset_strong```

and therefore we set the following environment variable:

```
export HF_DATASETS_CACHE=/share/hel/datasets/HF_datasets/cache/
```

### Download ensemble pseudo labels

If you want to train on AudioSet Strong using Knowledge Distillation as described in the paper, you will have to download the ensemble logits from [Zenodo](https://zenodo.org/records/14626113). The HDF5 file contains filenames (YouTube IDs) matched with the corresponding ensemble logits under the keys "filenames" and "strong_logits". The ensemble logits for one file are of shape 447 x 250 (number of classes x timeframes at 40 ms resolution) and are stored in float16 format to save space.

Check out [this code piece](https://github.com/fschmid56/PretrainedSED/blob/f62e9fb1566254766396cce0343a2de4156d3015/data_util/transforms.py#L37) if you want to learn how the pseudo labels are loaded.

For training, the pseudo-label file can simply be set via the command line: ```--pseudo_labels_file=<location>```

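For a quick sanity check of the downloaded file, the logits can be read directly with `h5py`. The snippet below is a minimal sketch that assumes only the keys and shapes described above ("filenames", "strong_logits", 447 classes x 250 frames, float16); the file name is a placeholder, and the actual loading used during training is the linked code in `data_util/transforms.py`.

```python
import h5py
import numpy as np

# Placeholder path; use the file downloaded from Zenodo.
with h5py.File("ensemble_pseudo_labels.hdf5", "r") as f:
    filenames = f["filenames"][:]         # YouTube IDs, one entry per clip
    logits = f["strong_logits"][0]        # first clip: shape (447, 250), dtype float16
    logits = logits.astype(np.float32)    # upcast before using as a distillation target
print(filenames[0], logits.shape)
```
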
### Run AudioSet Strong training

Example: train ATST-F, pre-trained on AudioSet Weak, with an RNN on top, using the balanced sampler and a wavmix augmentation probability of 1.0:

```
python ex_audioset_strong.py --model_name=ATST-F --seq_model_type=rnn --use_balanced_sampler --pretrained=weak --wavmix_p=1.0
```

Check out the results: https://api.wandb.ai/links/cp_tobi/tphswm5k

Example: train ATST-F using Knowledge Distillation:

```
python ex_audioset_strong.py --model_name=ATST-F --pretrained=weak --n_epochs=120 --wavmix_p=0.5 --freq_warp_p=0 --filter_augment_p=0 --mixstyle_p=0 --max_lr=1e-4 --distillation_loss_weight=0.9 --pseudo_labels_file=<path_to_pseudo_label_file_from_Zenodo>
```

Check out the results: https://api.wandb.ai/links/cp_tobi/2eh4cz80

### Run AudioSet Strong evaluation

Evaluate the AudioSet Strong pre-trained checkpoint of ATST-F:

```
python ex_audioset_strong.py --model_name=ATST-F --pretrained=strong --evaluate
```

If everything is set up correctly, this should give a `val/psds1_macro_averaged` of around 46.

## Fine-Tuning on Downstream Task

We demonstrate how the pre-trained transformers can be fine-tuned for a downstream sound event detection task by applying them to [DCASE 2016 Task 2](https://dcase.community/challenge2016/task-sound-event-detection-in-synthetic-audio-results). This task focuses on detecting office sounds and is part of the [HEAR benchmark](https://hearbenchmark.com/hear-tasks.html).

### Obtain DCASE 2016 Task 2 Dataset in HEAR format

Follow the instructions on the [HEAR website](https://hearbenchmark.com/hear-tasks.html) to download the dataset at a 16 kHz sampling rate. After completing the setup, your file tree should look similar to this:

```
hear_datasets/tasks/dcase2016_task2-hear2021-full/
├── 16000
├── 48000
├── labelvocabulary.csv
├── task_metadata.json
├── test.json
├── train.json
└── valid.json
```

The ```16000``` folder contains audio files sampled at 16 kHz.

### Run Fine-Tuning

The main script for fine-tuning is [ex_dcase2016task2.py](ex_dcase2016task2.py).

To fine-tune the full ATST-F model, pre-trained on AudioSet Strong, with a layer-wise learning rate decay of 0.95, use the following command:

```
python ex_dcase2016task2.py --task_path=hear_datasets/tasks/dcase2016_task2-hear2021-full --model_name=ATST-F --pretrained=strong --lr_decay=0.95
```

To train only the linear prediction head on top of the frozen BEATs transformer, also pre-trained on AudioSet Strong, use this command:

```
python ex_dcase2016task2.py --task_path=hear_datasets/tasks/dcase2016_task2-hear2021-full --model_name=BEATs --pretrained=strong --transformer_frozen --max_lr=2e-1 --mixup_p=0 --wavmix_p=0 --no_adamw --weight_decay=0 --n_epochs=500
```

## Results & Ablation Studies

This section presents the main results reported [in the paper](https://arxiv.org/pdf/2409.09546), along with additional ablation studies, including teacher model performances, comparisons of different sequence models, and evaluations using the DESED baseline system setup. The additional ablation studies were requested by the ICASSP'25 reviewers.

* All results represent averages over three independent runs.
* For AudioSet Strong, we employ the threshold-independent PSDS1 [7] metric to ensure fine-grained temporal evaluation.

### Student Model Performances on AudioSet Strong (*from paper*)

* For the *Li et al. [2]* row, we reproduced their AudioSet Strong [training pipeline](https://github.com/Audio-WestlakeU/audiossl).
* Alongside the **Proposed Pipeline**, we include ablation studies for three settings: no KD, no RNN in the teacher models, and no pre-training on AudioSet Weak (no Step 2).

|                       | **ATST-F** | **BEATs** | **fPaSST** | **M2D**  | **ASiT** |
|-----------------------|------------|-----------|------------|----------|----------|
| **Li et al. [2]**     | 40.9       | 36.5      | 38.7       | 36.9     | 37.0     |
| **Proposed Pipeline** | **45.8**   | **46.5**  | **45.4**   | **46.3** | **46.2** |
| **-- without KD**     | 41.8       | 44.1      | 40.7       | 41.1     | 40.9     |
| **-- without RNN**    | 45.7       | 45.8      | 45.3       | 46.0     | 46.1     |
| **-- without Step 2** | 45.7       | 46.3      | 45.2       | 44.9     | **46.2** |

**Conclusions:**
* The significant performance gap to [2] stems mainly from our three design choices (KD, RNNs, Step 2), but also from improvements in training on AudioSet Strong, including balanced sampling and aggressive data augmentation.
* Knowledge Distillation (KD) has the most substantial impact, underlining the effectiveness of the ensemble-KD approach.
* RNNs in the teacher models and pre-training on AudioSet Weak offer modest improvements but are justified by their low additional cost. Notably, they do not increase student model complexity, and AudioSet Weak checkpoints are publicly available for most transformers.

### Teacher Model Performances on AudioSet Strong (*additional results*)

* The table below shows the teacher model results for each transformer.
* The column **Avg. Ind.** represents the average performance across all single models in the row.
* The column **Ensemble** represents the performance of the ensemble consisting of all models in the respective row.

|                               | **ATST-F** | **BEATs** | **fPaSST** | **M2D**  | **ASiT** | **Avg. Ind.** | **Ensemble** |
|-------------------------------|------------|-----------|------------|----------|----------|---------------|--------------|
| **Proposed Teacher Pipeline** | 43.3       | **45.8**  | **43.3**   | **44.1** | **43.3** | **44.9**      | **47.1**     |
| **-- without RNN**            | 41.8       | 44.1      | 40.7       | 41.1     | 40.9     | 41.7          | 46.2         |
| **-- without Step 2**         | **43.5**   | 34.4      | 40.9       | 43.8     | 43.2     | 41.2          | 46.5         |

**Conclusions:**
* *Ensemble Performance*: The *Ensemble* column reflects the teacher ensemble performances used for Knowledge Distillation (KD) in the table above.
* *Impact of RNNs and Step 2*: Incorporating RNNs and Step 2 (AudioSet Weak pre-training) notably enhances single-model teacher performance, with the exception of ATST-F without Step 2.
* *Benefits of Ensembling*: While individual model performances show considerable variability (Avg. Ind.), ensembling stabilizes and elevates overall performance, as evidenced by the smaller differences in the *Ensemble* column.
* *BEATs-Specific Insights*: BEATs excels in the *Proposed Teacher Pipeline* and *without RNN* settings but underperforms in the *without Step 2* configuration. This discrepancy may be attributed to its unique SSL pre-training routine and longer sequence length (resulting from more tokens being extracted from the input).

### Teacher Model Performances with different Sequence Models (*additional results*)

* The use of an additional sequence model on top of the AudioSet Weak pre-trained transformers stems from our hypothesis that adding capacity specifically for temporally-strong predictions can enhance performance.
* The table below shows teacher model performances for various sequence models added on top of the transformers before training on AudioSet Strong. The paper uses BiGRUs (RNN) as they deliver the best performance.
* We investigated 4 different sequence models:
  * RNN: BiGRUs
  * Attention: Multi-Head Self-Attention with rotary position embeddings
  * Transformer (TF): Transformer Encoder blocks with rotary position embeddings
  * [MAMBA](https://arxiv.org/abs/2312.00752): Implementation from [mambapy](https://github.com/alxndrTL/mamba.py)
* We varied the inner dimension (*dim*) and the number of layers (\<Model Type\>:\<#layers\>; e.g., TF:2 means two Transformer layers were added on top of the pre-trained transformer).
* **Setup Notes**:
  * Ablations were performed using **ATST-F** due to its computational efficiency.
  * Performance without a sequence model was **41.8 PSDS1**.
  * Removing the top Transformer layers, which may overfit to AudioSet Weak labels, decreased performance.
  * For MAMBA, only a single layer was feasible due to memory constraints.

|              | RNN:1 | RNN:2     | RNN:3 | TF:1      | TF:2  | TF:3      | ATT:1 | ATT:2     | ATT:3 | MAMBA:1   |
|:-------------|:-----:|:---------:|:-----:|:---------:|:-----:|:---------:|:-----:|:---------:|:-----:|:---------:|
| **dim=256**  | 8.72  | 3.76      | 3.10  | 34.25     | 34.62 | 34.05     | 40.08 | 39.70     | 39.55 | 40.27     |
| **dim=512**  | 40.62 | 7.26      | 0.12  | 40.41     | 41.11 | 40.30     | 41.78 | 41.91     | 41.95 | 41.25     |
| **dim=1024** | 42.74 | 42.75     | 43.00 | 42.69     | 42.22 | 42.20     | 42.44 | **42.45** | 42.08 | **41.97** |
| **dim=2048** | 43.41 | **43.43** | 42.66 | **42.90** | 38.94 | **42.90** | 41.58 | 41.59     | 41.42 | 41.72     |

**Conclusions:**
* *Best model type*: The highest performance was achieved with 2 BiGRU layers, followed by Transformer, Self-Attention, and MAMBA. All sequence models improved performance compared to using no additional sequence model, though MAMBA's gains were marginal.
* *Inner Dimension*: Larger inner dimensions consistently led to better performance across all sequence models. Significant improvements required dimensions ≥1024, while smaller dimensions (e.g., 256) often degraded performance, with severe failures for BiGRU. We believe that large inner dimensions are essential due to the high number of classes (447) in AudioSet Strong.
* *Number of layers*: Performance was relatively insensitive to the number of layers for most sequence models, with optimal results often achieved with just 1–2 layers.

### Downstream Task Performances (*from paper*)

* Three frame-level downstream tasks:
  * DCASE 2023 Task 4: Domestic Environment Sound Event Detection (*DESED*), metric: PSDS1
  * DCASE 2016 Task 2 (*DC16-T2*), metric: onset F-measure
  * MAESTRO 5hr (*MAESTRO*), metric: onset F-measure
* For DESED, we followed a simplified setup in line with [2], excluding unsupervised data (no mean teacher approach) and the additional CRNN component of the [DCASE 2023 Task 4 baseline system](https://github.com/DCASE-REPO/DESED_task/tree/master/recipes/dcase2023_task4_baseline). While state-of-the-art approaches such as [4] and [8] leverage advanced techniques (e.g., multi-stage/multi-iteration training, sophisticated data augmentation, and interpolation consistency training), we deliberately avoided these complexities, as the focus is on a precise evaluation of pre-training quality.

![Downstream task results](images/downstream_task_results.png)

**Conclusions:**
* *In-Domain Tasks*: The pipeline demonstrates strong, consistent improvements for all transformers on *DESED* and *DC16-T2*, showcasing its effectiveness for in-domain tasks.
* *Out-of-Domain Task*: Results on *MAESTRO* (piano pitch prediction) are inconclusive. This limitation suggests that the proposed pre-training strategy yields substantial gains only when audio and labels are similar to the AudioSet ontology.
* *Simplified DESED Setup*: Despite the simplified setup (no CRNN, no unsupervised data), performance remains comparable to the [DCASE 2023 Task 4 baseline system](https://github.com/DCASE-REPO/DESED_task/tree/master/recipes/dcase2023_task4_baseline).

#### DESED Baseline Setup (*additional results*)

To complement the simplified DESED setup presented above, the table below provides results for the [DCASE 2023 Task 4 baseline system](https://github.com/DCASE-REPO/DESED_task/tree/master/recipes/dcase2023_task4_baseline) setup for ATST-F and BEATs. Note that the hyperparameters were not extensively tuned, and the data setup may differ slightly from the original baseline.

| **Model** | **Checkpoint**   | **Notes**           | **Performance** |
|-----------|------------------|---------------------|-----------------|
| ATST-F    | Step 1 (SSL)     |                     | 42.7            |
| ATST-F    | Step 2 (AS weak) |                     | 47.1            |
| ATST-F    | Full Pipeline    |                     | 50.4            |
| ATST-F    | Full Pipeline    | dropped 2 TF layers | **51.1**        |
| BEATs     | Step 1 (SSL)     |                     | 39.7            |
| BEATs     | Step 2 (AS weak) |                     | 48.1            |
| BEATs     | Full Pipeline    |                     | 48.6            |
| BEATs     | Full Pipeline    | dropped 2 TF layers | **51.1**        |

**Conclusions**:
* The *Full Pipeline* substantially improves performance over *Step 1 (SSL)* and *Step 2 (AS weak)* for both ATST-F and BEATs.
* Dropping the last two Transformer layers notably enhances results, suggesting that the final layers may focus on AudioSet Strong label-specific features, while earlier layers provide more general, transferable embeddings that benefit the DESED task. We will conduct further experiments to find out whether dropping Transformer layers generalizes to other tasks or is specific to DESED.

# References

[1] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, "BEATs: Audio pre-training with acoustic tokenizers," in Proceedings of the International Conference on Machine Learning (ICML), 2023.

[2] X. Li, N. Shao, and X. Li, "Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1336–1351, 2024.

[3] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, "Efficient training of audio transformers with patchout," in Proceedings of the Interspeech Conference, 2022.

[4] F. Schmid, P. Primus, T. Morocutti, J. Greif, and G. Widmer, "Multi-iteration multi-stage fine-tuning of transformers for sound event detection with heterogeneous datasets," in Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2024.

[5] S. Atito, M. Awais, W. Wang, M. D. Plumbley, and J. Kittler, "ASiT: Local-global audio spectrogram vision transformer for event classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3684–3693, 2024.

[6] D. Niizumi, D. Takeuchi, Y. Ohishi, N. Harada, M. Yasuda, S. Tsubaki, and K. Imoto, "M2D-CLAP: Masked modeling duo meets CLAP for learning general-purpose audio-language representation," in Proceedings of the Interspeech Conference, 2024.

[7] J. Ebbers, R. Haeb-Umbach, and R. Serizel, "Threshold independent evaluation of sound event detection scores," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[8] N. Shao, X. Li, and X. Li, "Fine-tune the pretrained ATST model for sound event detection," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
app.py
ADDED
@@ -0,0 +1,71 @@
import numpy as np
import gradio as gr
from models.atstframe.ATSTF_wrapper import ATSTWrapper
from models.beats.BEATs_wrapper import BEATsWrapper
from models.frame_passt.fpasst_wrapper import FPaSSTWrapper
from models.m2d.M2D_wrapper import M2DWrapper
from models.asit.ASIT_wrapper import ASiTWrapper
from models.frame_mn.Frame_MN_wrapper import FrameMNWrapper
from models.prediction_wrapper import PredictionsWrapper
from models.frame_mn.utils import NAME_TO_WIDTH
import torch
from torch import nn
import pandas as pd


class TransformerClassifier(nn.Module):
    def __init__(self, model, n_classes):
        super(TransformerClassifier, self).__init__()
        self.model = model
        self.linear = nn.Linear(model.embed_dim, n_classes)

    def forward(self, x):
        mel = self.model.mel_forward(x)
        features = self.model(mel).squeeze(1)
        return self.linear(features)


def get_model(model_name):
    if model_name == "BEATs":
        beats = BEATsWrapper()
        model = PredictionsWrapper(beats, checkpoint=None, head_type=None, seq_len=1)
    elif model_name == "ATST-F":
        atst = ATSTWrapper()
        model = PredictionsWrapper(atst, checkpoint=None, head_type=None, seq_len=1)
    elif model_name == "fpasst":
        fpasst = FPaSSTWrapper()
        model = PredictionsWrapper(fpasst, checkpoint=None, head_type=None, seq_len=1)
    elif model_name == "M2D":
        m2d = M2DWrapper()
        model = PredictionsWrapper(m2d, checkpoint=None, head_type=None, seq_len=1,
                                   embed_dim=m2d.m2d.cfg.feature_d)
    elif model_name == "ASIT":
        asit = ASiTWrapper()
        model = PredictionsWrapper(asit, checkpoint=None, head_type=None, seq_len=1)
    elif model_name.startswith("frame_mn"):
        width = NAME_TO_WIDTH(model_name)
        frame_mn = FrameMNWrapper(width)
        embed_dim = frame_mn.state_dict()['frame_mn.features.16.1.bias'].shape[0]
        model = PredictionsWrapper(frame_mn, checkpoint=None, head_type=None, seq_len=1, embed_dim=embed_dim)
    else:
        raise NotImplementedError(f"Model {model_name} not (yet) implemented")
    main_model = TransformerClassifier(model, n_classes=88)
    # main_model.compile()
    main_model.load_state_dict(torch.load(f"resources/best_model_{model_name}.pth", map_location='cpu'))
    print(main_model)
    main_model.eval()
    return main_model


model = get_model("BEATs")
label_mapping = pd.read_csv("resources/labelvocabulary.csv", header=None, index_col=0).to_dict()[1]


def classify_pitch(input_audio):
    # Run NSynth pitch classification on the recorded or uploaded audio clip.
    waveform = torch.from_numpy(input_audio[1]).float()  # gr.Audio yields a (sample_rate, samples) tuple
    output = model(waveform.unsqueeze(0))
    output = output.detach().cpu().numpy()
    output = np.argmax(output, axis=1)
    return int(label_mapping[str(output.item())])


demo = gr.Interface(classify_pitch, gr.Audio(max_length=4), "number", title="NSynth Pitch Classification")
demo.launch()
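One detail worth noting about the Gradio glue code above: with the numpy audio type, `gr.Audio` hands the callback a `(sample_rate, samples)` tuple, and the samples are typically integer-valued PCM. The sketch below shows one way such input could be normalized before being passed to the model; it is an illustrative assumption about preprocessing, not code from this commit.

```python
import numpy as np

def to_float_waveform(input_audio):
    """Convert a Gradio (sample_rate, samples) tuple to a mono float32 waveform in [-1, 1]."""
    sr, samples = input_audio
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim > 1:                 # downmix multi-channel audio to mono
        samples = samples.mean(axis=1)
    if np.abs(samples).max() > 1.0:      # heuristic: rescale int16-range audio
        samples = samples / 32768.0
    return sr, samples
```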
config.py
ADDED
@@ -0,0 +1,28 @@
RESOURCES_FOLDER = "resources"
GITHUB_RELEASE_URL = "https://github.com/fschmid56/PretrainedSED/releases/download/v0.0.1/"

# checkpoints
CHECKPOINT_URLS = {}

# strong
CHECKPOINT_URLS['BEATs_strong_1'] = GITHUB_RELEASE_URL + "BEATs_strong_1.pt"
CHECKPOINT_URLS['ATST-F_strong_1'] = GITHUB_RELEASE_URL + "ATST-F_strong_1.pt"
CHECKPOINT_URLS['ASIT_strong_1'] = GITHUB_RELEASE_URL + "ASIT_strong_1.pt"
CHECKPOINT_URLS['fpasst_strong_1'] = GITHUB_RELEASE_URL + "fpasst_strong_1.pt"
CHECKPOINT_URLS['M2D_strong_1'] = GITHUB_RELEASE_URL + "M2D_strong_1.pt"
for width in ['06', '10']:
    CHECKPOINT_URLS[f'frame_mn{width}_strong_1'] = GITHUB_RELEASE_URL + f'frame_mn{width}_strong_1.pt'

# weak
CHECKPOINT_URLS['BEATs_weak'] = GITHUB_RELEASE_URL + "BEATs_weak.pt"
CHECKPOINT_URLS['ATST-F_weak'] = GITHUB_RELEASE_URL + "ATST-F_weak.pt"
CHECKPOINT_URLS['ASIT_weak'] = GITHUB_RELEASE_URL + "ASIT_weak.pt"
CHECKPOINT_URLS['fpasst_weak'] = GITHUB_RELEASE_URL + "fpasst_weak.pt"
CHECKPOINT_URLS['M2D_weak'] = GITHUB_RELEASE_URL + "M2D_weak.pt"

# ssl
CHECKPOINT_URLS['BEATs_ssl'] = GITHUB_RELEASE_URL + "BEATs_ssl.pt"
CHECKPOINT_URLS['ATST-F_ssl'] = GITHUB_RELEASE_URL + "ATST-F_ssl.pt"
CHECKPOINT_URLS['ASIT_ssl'] = GITHUB_RELEASE_URL + "ASIT_ssl.pt"
CHECKPOINT_URLS['fpasst_ssl'] = GITHUB_RELEASE_URL + "fpasst_ssl.pt"
CHECKPOINT_URLS['M2D_ssl'] = GITHUB_RELEASE_URL + "M2D_ssl.pt"
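These URLs are what the "automatically downloaded" behaviour mentioned in the README resolves against. The snippet below is a minimal sketch of how a checkpoint could be fetched manually using this mapping; the destination layout is an assumption, and the repository's own helper functions may handle this differently.

```python
import os
import torch
from config import CHECKPOINT_URLS, RESOURCES_FOLDER

name = "BEATs_strong_1"
dst = os.path.join(RESOURCES_FOLDER, name + ".pt")
if not os.path.exists(dst):
    os.makedirs(RESOURCES_FOLDER, exist_ok=True)
    # torch.hub handles the HTTP download and writes the file to dst
    torch.hub.download_url_to_file(CHECKPOINT_URLS[name], dst)
state_dict = torch.load(dst, map_location="cpu")
```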
data_util/audioset_classes.py
ADDED
@@ -0,0 +1,1393 @@
as_strong_train_classes = ['Accelerating, revving, vroom',
|
| 2 |
+
'Air brake',
|
| 3 |
+
'Air conditioning',
|
| 4 |
+
'Air horn, truck horn',
|
| 5 |
+
'Aircraft',
|
| 6 |
+
'Aircraft engine',
|
| 7 |
+
'Alarm',
|
| 8 |
+
'Alarm clock',
|
| 9 |
+
'Alert',
|
| 10 |
+
'Ambulance (siren)',
|
| 11 |
+
'Animal',
|
| 12 |
+
'Applause',
|
| 13 |
+
'Arrow',
|
| 14 |
+
'Artillery fire',
|
| 15 |
+
'Audio logo',
|
| 16 |
+
'Babbling',
|
| 17 |
+
'Baby cry, infant cry',
|
| 18 |
+
'Baby laughter',
|
| 19 |
+
'Background noise',
|
| 20 |
+
'Bang',
|
| 21 |
+
'Bark',
|
| 22 |
+
'Basketball bounce',
|
| 23 |
+
'Bathroom sounds',
|
| 24 |
+
'Bathtub (filling or washing)',
|
| 25 |
+
'Battle cry',
|
| 26 |
+
'Bee, wasp, etc.',
|
| 27 |
+
'Beep, bleep',
|
| 28 |
+
'Bell',
|
| 29 |
+
'Bellow',
|
| 30 |
+
'Belly laugh',
|
| 31 |
+
'Bicycle bell',
|
| 32 |
+
'Bicycle, tricycle',
|
| 33 |
+
'Bird',
|
| 34 |
+
'Bird flight, flapping wings',
|
| 35 |
+
'Bird vocalization, bird call, bird song',
|
| 36 |
+
'Biting',
|
| 37 |
+
'Bleat',
|
| 38 |
+
'Blender, food processor',
|
| 39 |
+
'Boat, Water vehicle',
|
| 40 |
+
'Boiling',
|
| 41 |
+
'Boing',
|
| 42 |
+
'Booing',
|
| 43 |
+
'Boom',
|
| 44 |
+
'Bouncing',
|
| 45 |
+
'Bow-wow',
|
| 46 |
+
'Breaking',
|
| 47 |
+
'Breathing',
|
| 48 |
+
'Brief tone',
|
| 49 |
+
'Burping, eructation',
|
| 50 |
+
'Burst, pop',
|
| 51 |
+
'Bus',
|
| 52 |
+
'Busy signal',
|
| 53 |
+
'Buzz',
|
| 54 |
+
'Buzzer',
|
| 55 |
+
'Cacophony',
|
| 56 |
+
'Camera',
|
| 57 |
+
'Canidae, wild dogs, wolves',
|
| 58 |
+
'Cap gun',
|
| 59 |
+
'Car',
|
| 60 |
+
'Car alarm',
|
| 61 |
+
'Car passing by',
|
| 62 |
+
'Carbon monoxide detector, CO detector',
|
| 63 |
+
'Cart',
|
| 64 |
+
'Cash register',
|
| 65 |
+
'Cat',
|
| 66 |
+
'Caterwaul',
|
| 67 |
+
'Cattle, bovinae',
|
| 68 |
+
'Caw',
|
| 69 |
+
'Cellphone buzz, vibrating alert',
|
| 70 |
+
'Chain',
|
| 71 |
+
'Chainsaw',
|
| 72 |
+
'Change ringing (campanology)',
|
| 73 |
+
'Channel, environment and background',
|
| 74 |
+
'Chant',
|
| 75 |
+
'Cheering',
|
| 76 |
+
'Chewing, mastication',
|
| 77 |
+
'Chicken, rooster',
|
| 78 |
+
'Child singing',
|
| 79 |
+
'Child speech, kid speaking',
|
| 80 |
+
'Children playing',
|
| 81 |
+
'Children shouting',
|
| 82 |
+
'Chime',
|
| 83 |
+
'Chipmunk',
|
| 84 |
+
'Chirp tone',
|
| 85 |
+
'Chirp, tweet',
|
| 86 |
+
'Choir',
|
| 87 |
+
'Chop',
|
| 88 |
+
'Chopping (food)',
|
| 89 |
+
'Chorus effect',
|
| 90 |
+
'Chuckle, chortle',
|
| 91 |
+
'Church bell',
|
| 92 |
+
'Civil defense siren',
|
| 93 |
+
'Clang',
|
| 94 |
+
'Clapping',
|
| 95 |
+
'Clatter',
|
| 96 |
+
'Clickety-clack',
|
| 97 |
+
'Clicking',
|
| 98 |
+
'Clip-clop',
|
| 99 |
+
'Clock',
|
| 100 |
+
'Cluck',
|
| 101 |
+
'Clunk',
|
| 102 |
+
'Coin (dropping)',
|
| 103 |
+
'Computer keyboard',
|
| 104 |
+
'Conversation',
|
| 105 |
+
'Coo',
|
| 106 |
+
'Cough',
|
| 107 |
+
'Cowbell',
|
| 108 |
+
'Crack',
|
| 109 |
+
'Crackle',
|
| 110 |
+
'Creak',
|
| 111 |
+
'Cricket',
|
| 112 |
+
'Croak',
|
| 113 |
+
'Crockery breaking and smashing',
|
| 114 |
+
'Crow',
|
| 115 |
+
'Crowd',
|
| 116 |
+
'Crowing, cock-a-doodle-doo',
|
| 117 |
+
'Crumpling, crinkling',
|
| 118 |
+
'Crunch',
|
| 119 |
+
'Crushing',
|
| 120 |
+
'Crying, sobbing',
|
| 121 |
+
'Cupboard open or close',
|
| 122 |
+
'Cutlery, silverware',
|
| 123 |
+
'Deformable shell',
|
| 124 |
+
"Dental drill, dentist's drill",
|
| 125 |
+
'Dial tone',
|
| 126 |
+
'Digestive',
|
| 127 |
+
'Ding',
|
| 128 |
+
'Ding-dong',
|
| 129 |
+
'Dishes, pots, and pans',
|
| 130 |
+
'Distortion',
|
| 131 |
+
'Dog',
|
| 132 |
+
'Domestic animals, pets',
|
| 133 |
+
'Dong, bong',
|
| 134 |
+
'Donkey, ass',
|
| 135 |
+
'Door',
|
| 136 |
+
'Doorbell',
|
| 137 |
+
'Drawer open or close',
|
| 138 |
+
'Drill',
|
| 139 |
+
'Drip',
|
| 140 |
+
'Duck call (hunting tool)',
|
| 141 |
+
'Ducks, geese, waterfowl',
|
| 142 |
+
'Echo',
|
| 143 |
+
'Effects unit',
|
| 144 |
+
'Electric rotor drone, quadcopter',
|
| 145 |
+
'Electric shaver, electric razor',
|
| 146 |
+
'Electric toothbrush',
|
| 147 |
+
'Electronic tuner',
|
| 148 |
+
'Emergency vehicle',
|
| 149 |
+
'Engine',
|
| 150 |
+
'Engine knocking',
|
| 151 |
+
'Engine starting',
|
| 152 |
+
'Environmental noise',
|
| 153 |
+
'Error signal',
|
| 154 |
+
'Eruption',
|
| 155 |
+
'Explosion',
|
| 156 |
+
'Fart',
|
| 157 |
+
'Female singing',
|
| 158 |
+
'Female speech, woman speaking',
|
| 159 |
+
'Filing (rasp)',
|
| 160 |
+
'Fill (with liquid)',
|
| 161 |
+
'Finger snapping',
|
| 162 |
+
'Fire',
|
| 163 |
+
'Fire alarm',
|
| 164 |
+
'Fire engine, fire truck (siren)',
|
| 165 |
+
'Firecracker',
|
| 166 |
+
'Fireworks',
|
| 167 |
+
'Fixed-wing aircraft, airplane',
|
| 168 |
+
'Fizz',
|
| 169 |
+
'Flap',
|
| 170 |
+
'Fly, housefly',
|
| 171 |
+
'Foghorn',
|
| 172 |
+
'Fowl',
|
| 173 |
+
'Frog',
|
| 174 |
+
'Frying (food)',
|
| 175 |
+
'Fusillade',
|
| 176 |
+
'Gargling',
|
| 177 |
+
'Gasp',
|
| 178 |
+
'Gears',
|
| 179 |
+
'Generic impact sounds',
|
| 180 |
+
'Giggle',
|
| 181 |
+
'Glass',
|
| 182 |
+
'Glass chink, clink',
|
| 183 |
+
'Glass shatter',
|
| 184 |
+
'Goat',
|
| 185 |
+
'Gobble',
|
| 186 |
+
'Grind',
|
| 187 |
+
'Groan',
|
| 188 |
+
'Growling',
|
| 189 |
+
'Grunt',
|
| 190 |
+
'Gull, seagull',
|
| 191 |
+
'Gunshot, gunfire',
|
| 192 |
+
'Gurgling, bubbling',
|
| 193 |
+
'Gush',
|
| 194 |
+
'Hair dryer',
|
| 195 |
+
'Hammer',
|
| 196 |
+
'Hands',
|
| 197 |
+
'Heart sounds, heartbeat',
|
| 198 |
+
'Heavy engine (low frequency)',
|
| 199 |
+
'Helicopter',
|
| 200 |
+
'Hiccup',
|
| 201 |
+
'Hiss',
|
| 202 |
+
'Honk',
|
| 203 |
+
'Hoot',
|
| 204 |
+
'Horse',
|
| 205 |
+
'Howl',
|
| 206 |
+
'Howl (wind)',
|
| 207 |
+
'Hubbub, speech noise, speech babble',
|
| 208 |
+
'Hum',
|
| 209 |
+
'Human group actions',
|
| 210 |
+
'Human locomotion',
|
| 211 |
+
'Human sounds',
|
| 212 |
+
'Human voice',
|
| 213 |
+
'Humming',
|
| 214 |
+
'Ice cream truck, ice cream van',
|
| 215 |
+
'Idling',
|
| 216 |
+
'Insect',
|
| 217 |
+
'Inside, large room or hall',
|
| 218 |
+
'Inside, public space',
|
| 219 |
+
'Inside, small room',
|
| 220 |
+
'Jackhammer',
|
| 221 |
+
'Jet engine',
|
| 222 |
+
'Jingle bell',
|
| 223 |
+
'Jingle, tinkle',
|
| 224 |
+
'Kettle whistle',
|
| 225 |
+
'Keypress tone',
|
| 226 |
+
'Keys jangling',
|
| 227 |
+
'Kitchen and dining room sounds',
|
| 228 |
+
'Knife',
|
| 229 |
+
'Knock',
|
| 230 |
+
'Laughter',
|
| 231 |
+
'Lawn mower',
|
| 232 |
+
'Light engine (high frequency)',
|
| 233 |
+
'Liquid',
|
| 234 |
+
'Livestock, farm animals, working animals',
|
| 235 |
+
'Lock',
|
| 236 |
+
'Machine gun',
|
| 237 |
+
'Mains hum',
|
| 238 |
+
'Male singing',
|
| 239 |
+
'Male speech, man speaking',
|
| 240 |
+
'Mantra',
|
| 241 |
+
'Mechanical bell',
|
| 242 |
+
'Mechanical fan',
|
| 243 |
+
'Mechanisms',
|
| 244 |
+
'Medium engine (mid frequency)',
|
| 245 |
+
'Meow',
|
| 246 |
+
'Microphone',
|
| 247 |
+
'Microwave oven',
|
| 248 |
+
'Moo',
|
| 249 |
+
'Mosquito',
|
| 250 |
+
'Motor vehicle (road)',
|
| 251 |
+
'Motorboat, speedboat',
|
| 252 |
+
'Motorcycle',
|
| 253 |
+
'Mouse',
|
| 254 |
+
'Music',
'Narration, monologue',
'Neigh, whinny',
'Noise',
'Non-motorized land vehicle',
'Ocean',
'Oink',
'Other sourceless',
'Outside, urban or manmade',
'Owl',
'Packing tape, duct tape',
'Pant',
'Pant (dog)',
'Paper rustling',
'Patter',
'Pig',
'Pigeon, dove',
'Ping',
'Plop',
'Police car (siren)',
'Pour',
'Power saw, circular saw, table saw',
'Power tool',
'Power windows, electric windows',
'Printer',
'Propeller, airscrew',
'Puff',
'Pulleys',
'Pulse',
'Pump (liquid)',
'Purr',
'Quack',
'Race car, auto racing',
'Radio',
'Rail transport',
'Railroad car, train wagon',
'Rain',
'Rain on surface',
'Raindrop',
'Rapping',
'Ratchet, pawl',
'Rattle',
'Refrigerator',
'Respiratory sounds',
'Reverberation',
'Reversing beeps',
'Ringing tone, ringback tone',
'Ringtone',
'Roar',
'Roaring cats (lions, tigers)',
'Rodents, rats, mice',
'Roll',
'Rowboat, canoe, kayak',
'Rub',
'Rumble',
'Run',
'Rustle',
'Sailboat, sailing ship',
'Sanding',
'Sawing',
'Scissors',
'Scrape',
'Scratch',
'Screaming',
'Screech',
'Sewing machine',
'Sheep',
'Ship',
'Shout',
'Shower',
'Shuffle',
'Shuffling cards',
'Sigh',
'Sine wave',
'Singing',
'Single-lens reflex camera',
'Sink (filling or washing)',
'Siren',
'Sizzle',
'Skateboard',
'Slam',
'Slap, smack',
'Sliding door',
'Slosh',
'Slurp, drinking straw',
'Smash, crash',
'Smoke detector, smoke alarm',
'Snake',
'Snap',
'Sneeze',
'Snicker',
'Sniff',
'Snoring',
'Snort',
'Snort (horse)',
'Sonar',
'Sonic boom',
'Sound effect',
'Sound equipment',
'Sound reproduction',
'Speech',
'Speech synthesizer',
'Splash, splatter',
'Splinter',
'Spray',
'Squawk',
'Squeak',
'Squeal',
'Squish',
'Stairs',
'Static',
'Steam',
'Steam whistle',
'Stir',
'Stomach rumble',
'Stomp, stamp',
'Stream, river',
'Subway, metro, underground',
'Surface contact',
'Sweeping',
'Synthetic singing',
'Tap',
'Tap dance',
'Tape hiss',
'Tearing',
'Telephone',
'Telephone bell ringing',
'Telephone dialing, DTMF',
'Television',
'Throat clearing',
'Thump, thud',
'Thunder',
'Thunderstorm',
'Thunk',
'Tick',
'Tick-tock',
'Tire squeal, skidding',
'Toilet flush',
'Tools',
'Toothbrush',
'Traffic noise, roadway noise',
'Train',
'Train horn',
'Train wheels squealing',
'Train whistle',
'Trickle, dribble',
'Truck',
'Tuning fork',
'Turkey',
'Typewriter',
'Typing',
'Unknown sound',
'Vacuum cleaner',
'Vehicle',
'Vehicle horn, car horn, honking, toot',
'Velcro, hook and loop fastener',
'Video game sound',
'Wail, moan',
'Walk, footsteps',
'Washing machine',
'Water',
'Water tap, faucet',
'Waterfall',
'Waves, surf',
'Whack, thwack',
'Whale vocalization',
'Wheeze',
'Whimper',
'Whimper (dog)',
'Whip',
'Whir',
'Whispering',
'Whistle',
'Whistling',
'White noise, pink noise',
'Whoop',
'Whoosh, swoosh, swish',
'Wild animals',
'Wildfire',
'Wind',
'Wind chime',
'Wind noise (microphone)',
'Windscreen wiper, windshield wiper',
'Wobble',
'Wolf-whistling',
'Wood',
'Writing',
'Yak',
'Yawn',
'Yell',
'Yip',
'Yodeling',
'Zing',
'Zipper (clothing)']

as_strong_eval_classes = ['Accelerating, revving, vroom',
'Air brake',
'Air conditioning',
'Air horn, truck horn',
'Aircraft',
'Aircraft engine',
'Alarm',
'Alarm clock',
'Ambulance (siren)',
'Animal',
'Applause',
'Arrow',
'Artillery fire',
'Audio logo',
'Babbling',
'Baby cry, infant cry',
'Baby laughter',
'Background noise',
'Bang',
'Bark',
'Basketball bounce',
'Bathtub (filling or washing)',
'Battle cry',
'Bee, wasp, etc.',
'Beep, bleep',
'Bell',
'Bellow',
'Belly laugh',
'Bicycle bell',
'Bicycle, tricycle',
'Bird',
'Bird flight, flapping wings',
'Bird vocalization, bird call, bird song',
'Biting',
'Bleat',
'Blender, food processor',
'Boat, Water vehicle',
'Boiling',
'Boing',
'Boom',
'Bouncing',
'Bow-wow',
'Breaking',
'Breathing',
'Brief tone',
'Burping, eructation',
'Burst, pop',
'Bus',
'Busy signal',
'Buzz',
'Buzzer',
'Cacophony',
'Camera',
'Canidae, wild dogs, wolves',
'Cap gun',
'Car',
'Car alarm',
'Car passing by',
'Cart',
'Cash register',
'Cat',
'Caterwaul',
'Cattle, bovinae',
'Caw',
'Cellphone buzz, vibrating alert',
'Chainsaw',
'Change ringing (campanology)',
'Chant',
'Cheering',
'Chewing, mastication',
'Chicken, rooster',
'Child singing',
'Child speech, kid speaking',
'Children playing',
'Children shouting',
'Chime',
'Chipmunk',
'Chirp tone',
'Chirp, tweet',
'Choir',
'Chop',
'Chopping (food)',
'Chorus effect',
'Chuckle, chortle',
'Church bell',
'Civil defense siren',
'Clang',
'Clapping',
'Clatter',
'Clickety-clack',
'Clicking',
'Clip-clop',
'Clock',
'Cluck',
'Coin (dropping)',
'Computer keyboard',
'Conversation',
'Coo',
'Cough',
'Cowbell',
'Crack',
'Crackle',
'Creak',
'Cricket',
'Croak',
'Crockery breaking and smashing',
'Crow',
'Crowd',
'Crowing, cock-a-doodle-doo',
'Crumpling, crinkling',
'Crunch',
'Crushing',
'Crying, sobbing',
'Cupboard open or close',
'Cutlery, silverware',
"Dental drill, dentist's drill",
'Dial tone',
'Ding',
'Ding-dong',
'Dishes, pots, and pans',
'Distortion',
'Dog',
'Domestic animals, pets',
'Door',
'Doorbell',
'Drawer open or close',
'Drill',
'Drip',
'Ducks, geese, waterfowl',
'Echo',
'Effects unit',
'Electric rotor drone, quadcopter',
'Electric shaver, electric razor',
'Electric toothbrush',
'Electronic tuner',
'Emergency vehicle',
'Engine',
'Engine knocking',
'Engine starting',
'Environmental noise',
'Eruption',
'Explosion',
'Fart',
'Female singing',
'Female speech, woman speaking',
'Filing (rasp)',
'Fill (with liquid)',
'Finger snapping',
'Fire',
'Fire alarm',
'Fire engine, fire truck (siren)',
'Firecracker',
'Fireworks',
'Fixed-wing aircraft, airplane',
'Flap',
'Fly, housefly',
'Foghorn',
'Fowl',
'Frog',
'Frying (food)',
'Fusillade',
'Gargling',
'Gasp',
'Gears',
'Generic impact sounds',
'Giggle',
'Glass',
'Glass chink, clink',
'Glass shatter',
'Goat',
'Gobble',
'Groan',
'Growling',
'Grunt',
'Gunshot, gunfire',
'Gurgling, bubbling',
'Gush',
'Hair dryer',
'Hammer',
'Hands',
'Heart murmur',
'Heart sounds, heartbeat',
'Heavy engine (low frequency)',
'Helicopter',
'Hiccup',
'Hiss',
'Honk',
'Hoot',
'Horse',
'Howl',
'Howl (wind)',
'Hubbub, speech noise, speech babble',
'Hum',
'Human sounds',
'Human voice',
'Humming',
'Ice cream truck, ice cream van',
'Idling',
'Insect',
'Inside, large room or hall',
'Inside, public space',
'Inside, small room',
'Jackhammer',
'Jet engine',
'Jingle bell',
'Jingle, tinkle',
'Keys jangling',
'Kitchen and dining room sounds',
'Knock',
'Laughter',
'Lawn mower',
'Light engine (high frequency)',
'Liquid',
'Livestock, farm animals, working animals',
'Machine gun',
'Mains hum',
'Male singing',
'Male speech, man speaking',
'Mantra',
'Mechanical fan',
'Mechanisms',
'Medium engine (mid frequency)',
'Meow',
'Microwave oven',
'Moo',
'Mosquito',
'Motor vehicle (road)',
'Motorboat, speedboat',
'Motorcycle',
'Mouse',
'Music',
'Narration, monologue',
'Neigh, whinny',
'Noise',
'Non-motorized land vehicle',
'Ocean',
'Oink',
'Outside, rural or natural',
'Outside, urban or manmade',
'Owl',
'Packing tape, duct tape',
'Pant',
'Pant (dog)',
'Paper rustling',
'Patter',
'Pig',
'Pigeon, dove',
'Ping',
'Plop',
'Police car (siren)',
'Pour',
'Power saw, circular saw, table saw',
'Power tool',
'Power windows, electric windows',
'Printer',
'Propeller, airscrew',
'Pulleys',
'Pulse',
'Pump (liquid)',
'Purr',
'Quack',
'Race car, auto racing',
'Radio',
'Rail transport',
'Railroad car, train wagon',
'Rain',
'Rain on surface',
'Raindrop',
'Rapping',
'Ratchet, pawl',
'Rattle',
'Respiratory sounds',
'Reverberation',
'Reversing beeps',
'Ringing tone, ringback tone',
'Ringtone',
'Roar',
'Roaring cats (lions, tigers)',
'Rodents, rats, mice',
'Roll',
'Rowboat, canoe, kayak',
'Rub',
'Rumble',
'Run',
'Rustle',
'Sailboat, sailing ship',
'Sanding',
'Sawing',
'Scissors',
'Scrape',
'Scratch',
'Screaming',
'Sewing machine',
'Sheep',
'Ship',
'Shout',
'Shower',
'Shuffle',
'Shuffling cards',
'Sigh',
'Silence',
'Sine wave',
'Singing',
'Single-lens reflex camera',
'Sink (filling or washing)',
'Siren',
'Sizzle',
'Skateboard',
'Slam',
'Slap, smack',
'Sliding door',
'Slosh',
'Smash, crash',
'Smoke detector, smoke alarm',
'Snake',
'Sneeze',
'Snicker',
'Sniff',
'Snoring',
'Snort',
'Snort (horse)',
'Sonar',
'Sound effect',
'Sound equipment',
'Source-ambiguous sounds',
'Specific impact sounds',
'Speech',
'Speech synthesizer',
'Splash, splatter',
'Splinter',
'Spray',
'Squawk',
'Squeak',
'Squeal',
'Squish',
'Stairs',
'Static',
'Steam',
'Steam whistle',
'Stir',
'Stomach rumble',
'Stomp, stamp',
'Stream, river',
'Studio recording',
'Subway, metro, underground',
'Surface contact',
'Synthetic singing',
'Tap',
'Tap dance',
'Tearing',
'Telephone',
'Telephone bell ringing',
'Telephone dialing, DTMF',
'Television',
'Throat clearing',
'Throbbing',
'Thump, thud',
'Thunder',
'Thunderstorm',
'Thunk',
'Tick',
'Tick-tock',
'Tire squeal, skidding',
'Toilet flush',
'Tools',
'Toothbrush',
'Traffic noise, roadway noise',
'Train',
'Train horn',
'Train wheels squealing',
'Train whistle',
'Trickle, dribble',
'Truck',
'Tuning fork',
'Turkey',
'Typewriter',
'Typing',
'Unknown sound',
'Unmodified field recording',
'Vacuum cleaner',
'Vehicle',
'Vehicle horn, car horn, honking, toot',
'Velcro, hook and loop fastener',
'Vibration',
'Video game sound',
'Wail, moan',
'Walk, footsteps',
'Washing machine',
'Water',
'Water tap, faucet',
'Waterfall',
'Waves, surf',
'Whack, thwack',
'Whale vocalization',
'Wheeze',
'Whimper',
'Whimper (dog)',
'Whip',
'Whir',
'Whispering',
'Whistle',
'Whistling',
'White noise, pink noise',
'Whoop',
'Whoosh, swoosh, swish',
'Wild animals',
'Wind',
'Wind chime',
'Wind noise (microphone)',
'Wood',
'Writing',
'Yawn',
'Yell',
'Yip',
'Yodeling',
'Zipper (clothing)']

as_weak_classes = ['A capella',
'Accelerating, revving, vroom',
'Accordion',
'Acoustic guitar',
'Afrobeat',
'Air brake',
'Air conditioning',
'Air horn, truck horn',
'Aircraft',
'Aircraft engine',
'Alarm',
'Alarm clock',
'Ambient music',
'Ambulance (siren)',
'Angry music',
'Animal',
'Applause',
'Arrow',
'Artillery fire',
'Babbling',
'Baby cry, infant cry',
'Baby laughter',
'Background music',
'Bagpipes',
'Bang',
'Banjo',
'Bark',
'Basketball bounce',
'Bass drum',
'Bass guitar',
'Bathtub (filling or washing)',
'Battle cry',
'Beatboxing',
'Bee, wasp, etc.',
'Beep, bleep',
'Bell',
'Bellow',
'Belly laugh',
'Bicycle',
'Bicycle bell',
'Bird',
'Bird flight, flapping wings',
'Bird vocalization, bird call, bird song',
'Biting',
'Bleat',
'Blender',
'Bluegrass',
'Blues',
'Boat, Water vehicle',
'Boiling',
'Boing',
'Boom',
'Bouncing',
'Bow-wow',
'Bowed string instrument',
'Brass instrument',
'Breaking',
'Breathing',
'Burping, eructation',
'Burst, pop',
'Bus',
'Busy signal',
'Buzz',
'Buzzer',
'Cacophony',
'Camera',
'Canidae, dogs, wolves',
'Cap gun',
'Car',
'Car alarm',
'Car passing by',
'Carnatic music',
'Cash register',
'Cat',
'Caterwaul',
'Cattle, bovinae',
'Caw',
'Cello',
'Chainsaw',
'Change ringing (campanology)',
'Chant',
'Chatter',
'Cheering',
'Chewing, mastication',
'Chicken, rooster',
'Child singing',
'Child speech, kid speaking',
'Children playing',
'Children shouting',
'Chime',
'Chink, clink',
'Chirp tone',
'Chirp, tweet',
'Choir',
'Chop',
'Chopping (food)',
'Chorus effect',
'Christian music',
'Christmas music',
'Chuckle, chortle',
'Church bell',
'Civil defense siren',
'Clang',
'Clapping',
'Clarinet',
'Classical music',
'Clatter',
'Clickety-clack',
'Clicking',
'Clip-clop',
'Clock',
'Cluck',
'Coin (dropping)',
'Computer keyboard',
'Conversation',
'Coo',
'Cough',
'Country',
'Cowbell',
'Crack',
'Crackle',
'Creak',
'Cricket',
'Croak',
'Crow',
'Crowd',
'Crowing, cock-a-doodle-doo',
'Crumpling, crinkling',
'Crunch',
'Crushing',
'Crying, sobbing',
'Cupboard open or close',
'Cutlery, silverware',
'Cymbal',
'Dance music',
"Dental drill, dentist's drill",
'Dial tone',
'Didgeridoo',
'Ding',
'Ding-dong',
'Disco',
'Dishes, pots, and pans',
'Distortion',
'Dog',
'Domestic animals, pets',
'Door',
'Doorbell',
'Double bass',
'Drawer open or close',
'Drill',
'Drip',
'Drum',
'Drum and bass',
'Drum kit',
'Drum machine',
'Drum roll',
'Dubstep',
'Duck',
'Echo',
'Effects unit',
'Electric guitar',
'Electric piano',
'Electric shaver, electric razor',
'Electric toothbrush',
'Electronic dance music',
'Electronic music',
'Electronic organ',
'Electronic tuner',
'Electronica',
'Emergency vehicle',
'Engine',
'Engine knocking',
'Engine starting',
'Environmental noise',
'Eruption',
'Exciting music',
'Explosion',
'Fart',
'Female singing',
'Female speech, woman speaking',
'Field recording',
'Filing (rasp)',
'Fill (with liquid)',
'Finger snapping',
'Fire',
'Fire alarm',
'Fire engine, fire truck (siren)',
'Firecracker',
'Fireworks',
'Fixed-wing aircraft, airplane',
'Flamenco',
'Flap',
'Flute',
'Fly, housefly',
'Foghorn',
'Folk music',
'Fowl',
'French horn',
'Frog',
'Frying (food)',
'Funk',
'Funny music',
'Fusillade',
'Gargling',
'Gasp',
'Gears',
'Giggle',
'Glass',
'Glockenspiel',
'Goat',
'Gobble',
'Gong',
'Goose',
'Gospel music',
'Groan',
'Growling',
'Grunge',
'Grunt',
'Guitar',
'Gunshot, gunfire',
'Gurgling',
'Gush',
'Hair dryer',
'Hammer',
'Hammond organ',
'Hands',
'Happy music',
'Harmonic',
'Harmonica',
'Harp',
'Harpsichord',
'Heart murmur',
'Heart sounds, heartbeat',
'Heavy engine (low frequency)',
'Heavy metal',
'Helicopter',
'Hi-hat',
'Hiccup',
'Hip hop music',
'Hiss',
'Honk',
'Hoot',
'Horse',
'House music',
'Howl',
'Hubbub, speech noise, speech babble',
'Hum',
'Humming',
'Ice cream truck, ice cream van',
'Idling',
'Independent music',
'Insect',
'Inside, large room or hall',
'Inside, public space',
'Inside, small room',
'Jackhammer',
'Jazz',
'Jet engine',
'Jingle (music)',
'Jingle bell',
'Jingle, tinkle',
'Keyboard (musical)',
'Keys jangling',
'Knock',
'Laughter',
'Lawn mower',
'Light engine (high frequency)',
'Liquid',
'Livestock, farm animals, working animals',
'Lullaby',
'Machine gun',
'Mains hum',
'Male singing',
'Male speech, man speaking',
'Mallet percussion',
'Mandolin',
'Mantra',
'Maraca',
'Marimba, xylophone',
'Mechanical fan',
'Mechanisms',
'Medium engine (mid frequency)',
'Meow',
'Microwave oven',
'Middle Eastern music',
'Moo',
'Mosquito',
'Motor vehicle (road)',
'Motorboat, speedboat',
'Motorcycle',
'Mouse',
'Music',
'Music for children',
'Music of Africa',
'Music of Asia',
'Music of Bollywood',
'Music of Latin America',
'Musical instrument',
'Narration, monologue',
'Neigh, whinny',
'New-age music',
'Noise',
'Ocean',
'Oink',
'Opera',
'Orchestra',
'Organ',
'Outside, rural or natural',
'Outside, urban or manmade',
'Owl',
'Pant',
'Patter',
'Percussion',
'Piano',
'Pig',
'Pigeon, dove',
'Ping',
'Pink noise',
'Pizzicato',
'Plop',
'Plucked string instrument',
'Police car (siren)',
'Pop music',
'Pour',
'Power tool',
'Power windows, electric windows',
'Printer',
'Progressive rock',
'Propeller, airscrew',
'Psychedelic rock',
'Pulleys',
'Pulse',
'Pump (liquid)',
'Punk rock',
'Purr',
'Quack',
'Race car, auto racing',
'Radio',
'Rail transport',
'Railroad car, train wagon',
'Rain',
'Rain on surface',
'Raindrop',
'Rapping',
'Ratchet, pawl',
'Rattle',
'Rattle (instrument)',
'Reggae',
'Reverberation',
'Reversing beeps',
'Rhythm and blues',
'Rimshot',
'Ringtone',
'Roar',
'Roaring cats (lions, tigers)',
'Rock and roll',
'Rock music',
'Rodents, rats, mice',
'Roll',
'Rowboat, canoe, kayak',
'Rub',
'Rumble',
'Run',
'Rustle',
'Rustling leaves',
'Sad music',
'Sailboat, sailing ship',
'Salsa music',
'Sampler',
'Sanding',
'Sawing',
'Saxophone',
'Scary music',
'Scissors',
'Scrape',
'Scratch',
'Scratching (performance technique)',
'Screaming',
'Sewing machine',
'Shatter',
'Sheep',
'Ship',
'Shofar',
'Shout',
'Shuffle',
'Shuffling cards',
'Sidetone',
'Sigh',
'Silence',
'Sine wave',
'Singing',
'Singing bowl',
'Single-lens reflex camera',
'Sink (filling or washing)',
'Siren',
'Sitar',
'Sizzle',
'Ska',
'Skateboard',
'Skidding',
'Slam',
'Slap, smack',
'Sliding door',
'Slosh',
'Smash, crash',
'Smoke detector, smoke alarm',
'Snake',
'Snare drum',
'Sneeze',
'Snicker',
'Sniff',
'Snoring',
'Snort',
'Sonar',
'Song',
'Soul music',
'Sound effect',
'Soundtrack music',
'Speech',
'Speech synthesizer',
'Splash, splatter',
'Splinter',
'Spray',
'Squawk',
'Squeak',
'Squeal',
'Squish',
'Static',
'Steam',
'Steam whistle',
'Steel guitar, slide guitar',
'Steelpan',
'Stir',
'Stomach rumble',
'Stream',
'String section',
'Strum',
'Subway, metro, underground',
'Swing music',
'Synthesizer',
'Synthetic singing',
'Tabla',
'Tambourine',
'Tap',
'Tapping (guitar technique)',
'Tearing',
'Techno',
'Telephone',
'Telephone bell ringing',
'Telephone dialing, DTMF',
'Television',
'Tender music',
'Theme music',
'Theremin',
'Throat clearing',
'Throbbing',
'Thump, thud',
'Thunder',
'Thunderstorm',
'Thunk',
'Tick',
'Tick-tock',
'Timpani',
'Tire squeal',
'Toilet flush',
'Tools',
'Toot',
'Toothbrush',
'Traditional music',
'Traffic noise, roadway noise',
'Train',
'Train horn',
'Train wheels squealing',
'Train whistle',
'Trance music',
'Trickle, dribble',
'Trombone',
'Truck',
'Trumpet',
'Tubular bells',
'Tuning fork',
'Turkey',
'Typewriter',
'Typing',
'Ukulele',
'Vacuum cleaner',
'Vehicle',
'Vehicle horn, car horn, honking',
'Vibraphone',
'Vibration',
'Video game music',
'Violin, fiddle',
'Vocal music',
'Wail, moan',
'Walk, footsteps',
'Water',
'Water tap, faucet',
'Waterfall',
'Waves, surf',
'Wedding music',
'Whack, thwack',
'Whale vocalization',
'Wheeze',
'Whimper',
'Whimper (dog)',
'Whip',
'Whir',
'Whispering',
'Whistle',
'Whistling',
'White noise',
'Whoop',
'Whoosh, swoosh, swish',
'Wild animals',
'Wind',
'Wind chime',
'Wind instrument, woodwind instrument',
'Wind noise (microphone)',
'Wood',
'Wood block',
'Writing',
'Yell',
'Yip',
'Yodeling',
'Zing',
'Zipper (clothing)',
'Zither'
]
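The three vocabularies above differ in coverage (temporally-strong training classes, the strong evaluation subset, and the clip-level weak AudioSet classes), and downstream code indexes into them by position. As a hedged illustration of how such a vocabulary is typically consumed, a minimal sketch using only the lists defined in this file (the helper name `multi_hot` is made up for this example):

```python
# Illustrative sketch only: map class names to indices and build a multi-hot target.
import torch

from data_util.audioset_classes import as_strong_train_classes

label_to_idx = {label: i for i, label in enumerate(as_strong_train_classes)}


def multi_hot(labels):
    """Return a float multi-hot vector over the strong training vocabulary."""
    target = torch.zeros(len(as_strong_train_classes))
    for label in labels:
        target[label_to_idx[label]] = 1.0
    return target


# e.g. multi_hot(['Speech', 'Music']) has ones at the 'Speech' and 'Music' positions.
```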
data_util/audioset_strong.py
ADDED
@@ -0,0 +1,329 @@
import os
from time import perf_counter
import datasets
import numpy as np
import pandas as pd
import torch
from torch.utils.data import (
    Dataset as TorchDataset,
    DistributedSampler,
    WeightedRandomSampler,
)

from data_util.audioset_classes import as_strong_train_classes
from data_util.transforms import (
    Mp3DecodeTransform,
    SequentialTransform,
    AddPseudoLabelsTransform,
    strong_label_transform,
    target_transform
)

logger = datasets.logging.get_logger(__name__)


def init_hf_config(max_shard_size="2GB", verbose=True, in_mem_max=None):
    datasets.config.MAX_SHARD_SIZE = max_shard_size
    if verbose:
        datasets.logging.set_verbosity_info()
    if in_mem_max is not None:
        datasets.config.IN_MEMORY_MAX_SIZE = in_mem_max


def get_hf_local_path(path, local_datasets_path=None):
    if local_datasets_path is None:
        local_datasets_path = os.environ.get(
            "HF_DATASETS_LOCAL",
            os.path.join(os.environ.get("HF_DATASETS_CACHE"), "../local"),
        )
    path = os.path.join(local_datasets_path, path)
    return path


class catchtime:
    # context to measure loading time: https://stackoverflow.com/questions/33987060/python-context-manager-that-measures-time
    def __init__(self, debug_print="Time", logger=logger):
        self.debug_print = debug_print
        self.logger = logger

    def __enter__(self):
        self.start = perf_counter()
        return self

    def __exit__(self, type, value, traceback):
        self.time = perf_counter() - self.start
        readout = f"{self.debug_print}: {self.time:.3f} seconds"
        self.logger.info(readout)


def merge_overlapping_events(sample):
    events = pd.DataFrame(sample['events'][0])
    events = events.sort_values(by='onset')
    sample['events'] = [None]

    for l in events['event_label'].unique():
        rows = []
        for i, r in events.loc[events['event_label'] == l].iterrows():
            if len(rows) == 0 or rows[-1]['offset'] < r['onset']:
                rows.append(r)
            else:
                onset = min(rows[-1]['onset'], r['onset'])
                offset = max(rows[-1]['offset'], r['offset'])
                rows[-1]['onset'] = onset
                rows[-1]['offset'] = offset
        if sample["events"][0] is None:
            sample['events'][0] = pd.DataFrame(rows)
        else:
            sample["events"][0] = pd.concat([sample['events'][0], pd.DataFrame(rows)])
    return sample


def get_training_dataset(
    label_encoder,
    audio_length=10.0,
    sample_rate=16000,
    wavmix_p=0.0,
    pseudo_labels_file=None,
):
    init_hf_config()

    decode_transform = Mp3DecodeTransform(
        sample_rate=sample_rate, max_length=audio_length, debug_info_key="filename"
    )

    ds_list = []

    with catchtime("Loading audioset_strong"):
        as_ds = datasets.load_from_disk(get_hf_local_path("audioset_strong"))

    # label encode transformation
    if label_encoder is not None:
        # set list of label names to be encoded
        label_encoder.labels = as_strong_train_classes
        encode_label_fun = lambda x: strong_label_transform(x, strong_label_encoder=label_encoder)
    else:
        encode_label_fun = lambda x: x

    as_transforms = [
        decode_transform,
        merge_overlapping_events,
        encode_label_fun,
        target_transform,
    ]

    if pseudo_labels_file:
        as_transforms.append(AddPseudoLabelsTransform(pseudo_labels_file=pseudo_labels_file).add_pseudo_label_transform)

    as_ds.set_transform(SequentialTransform(as_transforms))

    ds_list.append(as_ds["balanced_train"])
    ds_list.append(as_ds["unbalanced_train"])
    dataset = torch.utils.data.ConcatDataset(ds_list)

    if wavmix_p > 0:
        print("Using Wavmix!")
        dataset = MixupDataset(dataset, rate=wavmix_p)
    return dataset


def get_eval_dataset(
    label_encoder,
    audio_length=10.0,
    sample_rate=16000
):
    init_hf_config()
    ds_list = []

    decode_transform = Mp3DecodeTransform(
        sample_rate=sample_rate, max_length=audio_length, debug_info_key="filename"
    )

    with catchtime(f"Loading audioset:"):
        as_ds = datasets.load_from_disk(get_hf_local_path("audioset_strong"))

    # label encode transformation
    if label_encoder is not None:
        label_encoder.labels = as_strong_train_classes
        encode_label_fun = lambda x: strong_label_transform(x, strong_label_encoder=label_encoder)
    else:
        encode_label_fun = lambda x: x

    as_transforms = [
        decode_transform,
        merge_overlapping_events,
        encode_label_fun,
        target_transform
    ]
    as_ds.set_transform(SequentialTransform(as_transforms))
    as_ds_eval = (
        as_ds["eval"]
    )
    ds_list.append(as_ds_eval)
    dataset = torch.utils.data.ConcatDataset(ds_list)
    return dataset


def get_full_dataset(label_encoder, audio_length=10.0, sample_rate=16000):
    init_hf_config()

    decode_transform = Mp3DecodeTransform(
        sample_rate=sample_rate, max_length=audio_length, debug_info_key="filename"
    )

    with catchtime(f"Loading audioset:"):
        as_ds = datasets.load_from_disk(get_hf_local_path("audioset_strong"))

    # label encode transformation
    if label_encoder is not None:
        label_encoder.labels = as_strong_train_classes
        encode_label_fun = lambda x: strong_label_transform(x, strong_label_encoder=label_encoder)
    else:
        encode_label_fun = lambda x: x

    as_transforms = [
        decode_transform,
        merge_overlapping_events,
        encode_label_fun,
    ]

    as_ds.set_transform(SequentialTransform(as_transforms))
    ds_list = []
    ds_list.append(as_ds["balanced_train"])
    ds_list.append(as_ds["unbalanced_train"])
    ds_list.append(as_ds["eval"])

    dataset = torch.utils.data.ConcatDataset(ds_list)
    return dataset


def get_uniform_sample_weights(dataset):
    """
    :return: float tensor of shape len(full_training_set) representing the weights of each sample.
    """
    return torch.ones(len(dataset)).float()


def get_temporal_count_balanced_sample_weights(dataset, sample_weight_offset=30,
                                               save_folder="/share/rk8/shared/as_strong"):
    """
    :return: float tensor of shape len(full_training_set) representing the weights of each sample.
    """
    # the order of balanced_train_hdf5, unbalanced_train_hdf5 is important.
    # should match get_full_training_set
    os.makedirs(save_folder, exist_ok=True)
    save_file = os.path.join(save_folder, f"weights_temporal_count_offset_{sample_weight_offset}.pt")
    if os.path.exists(save_file):
        return torch.load(save_file)

    from tqdm import tqdm

    all_y = []
    for sample in tqdm(dataset, desc="Calculating sample weights."):
        all_y.append(sample["event_count"])
    all_y = torch.from_numpy(np.stack(all_y, axis=0))
    per_class = all_y.long().sum(0).float().reshape(1, -1)  # frequencies per class

    per_class = sample_weight_offset + per_class  # offset low freq classes
    if sample_weight_offset > 0:
        print(f"Warning: sample_weight_offset={sample_weight_offset} minnow={per_class.min()}")
    per_class_weights = 1000. / per_class
    all_weight = all_y * per_class_weights
    all_weight = all_weight.sum(dim=1)

    torch.save(all_weight, save_file)
    return all_weight


class MixupDataset(TorchDataset):
    """ Mixing Up wave forms
    """

    def __init__(self, dataset, beta=2, rate=0.5):
        self.beta = beta
        self.rate = rate
        self.dataset = dataset
        print(f"Mixing up waveforms from dataset of len {len(dataset)}")

    def __getitem__(self, index):
        if torch.rand(1) < self.rate:
            batch1 = self.dataset[index]
            idx2 = torch.randint(len(self.dataset), (1,)).item()
            batch2 = self.dataset[idx2]
            x1, x2 = batch1['audio'], batch2['audio']
            y1, y2 = batch1['strong'], batch2['strong']
            if 'pseudo_strong' in batch1:
                p1, p2 = batch1['pseudo_strong'], batch2['pseudo_strong']
            l = np.random.beta(self.beta, self.beta)
            l = max(l, 1. - l)
            x1 = x1 - x1.mean()
            x2 = x2 - x2.mean()
            x = (x1 * l + x2 * (1. - l))
            x = x - x.mean()
            batch1['audio'] = x
            batch1['strong'] = (y1 * l + y2 * (1. - l))
            if 'pseudo_strong' in batch1:
                batch1['pseudo_strong'] = (p1 * l + p2 * (1. - l))
            return batch1
        return self.dataset[index]

    def __len__(self):
        return len(self.dataset)


class DistributedSamplerWrapper(DistributedSampler):
    def __init__(
        self, sampler, dataset, num_replicas=None, rank=None, shuffle: bool = True
    ):
        super(DistributedSamplerWrapper, self).__init__(
            dataset, num_replicas, rank, shuffle
        )
        # source: @awaelchli https://github.com/PyTorchLightning/pytorch-lightning/issues/3238
        self.sampler = sampler

    def __iter__(self):
        if self.sampler.generator is None:
            self.sampler.generator = torch.Generator()
            self.sampler.generator.manual_seed(self.seed + self.epoch)
        indices = list(self.sampler)
        if self.epoch < 2:
            logger.info(
                f"\n DistributedSamplerWrapper (rank {self.rank}) : {indices[:3]} \n\n"
            )
        indices = indices[self.rank : self.total_size : self.num_replicas]
        return iter(indices)


def get_weighted_sampler(
    samples_weights,
    epoch_len=100_000,
    sampler_replace=False,
):
    num_nodes = int(os.environ.get("WORLD_SIZE", 1))
    ddp = int(os.environ.get("DDP", 1))
    num_nodes = max(ddp, num_nodes)
    rank = int(os.environ.get("NODE_RANK", 0))
    return DistributedSamplerWrapper(
        sampler=WeightedRandomSampler(
            samples_weights, num_samples=epoch_len, replacement=sampler_replace
        ),
        dataset=range(epoch_len),
        num_replicas=num_nodes,
        rank=rank,
    )


if __name__ == "__main__":
    from helpers.encode import ManyHotEncoder

    encoder = ManyHotEncoder([], 10., 160, net_pooling=4, fs=16_000)

    train_ds = get_training_dataset(
        encoder, audio_length=10.0, sample_rate=16_000
    )

    valid_ds = get_eval_dataset(
        encoder, audio_length=10.0, sample_rate=16_000
    )

    print("Len train dataset: ", len(train_ds))
    print("Len valid dataset: ", len(valid_ds))
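The sampling utilities in this file are meant to be combined: per-clip weights feed a `WeightedRandomSampler`, which is then wrapped for distributed training. A minimal, hedged usage sketch built only from functions defined above (the `DataLoader` batch size and worker count are illustrative, not values prescribed by this repository):

```python
# Illustrative sketch: wire the dataset, per-sample weights, and the DDP-aware sampler together.
from torch.utils.data import DataLoader

from data_util.audioset_strong import (
    get_training_dataset,
    get_uniform_sample_weights,
    get_weighted_sampler,
)
from helpers.encode import ManyHotEncoder

encoder = ManyHotEncoder([], 10., 160, net_pooling=4, fs=16_000)
train_set = get_training_dataset(encoder, audio_length=10.0, sample_rate=16_000)

# uniform weights shown here; get_temporal_count_balanced_sample_weights is the balanced alternative
weights = get_uniform_sample_weights(train_set)
sampler = get_weighted_sampler(weights, epoch_len=100_000)

loader = DataLoader(train_set, sampler=sampler, batch_size=16, num_workers=8)  # assumed values
```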
data_util/dcase2016task2.py
ADDED
@@ -0,0 +1,280 @@
import json
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
import soundfile as sf
import torch
from intervaltree import IntervalTree
from torch.utils.data import Dataset


class FixCropDataset(Dataset):
    """
    Read in a JSON file and return audio and audio filenames
    """

    def __init__(self, data: Dict,
                 audio_dir: Path,
                 sample_rate: int,
                 label_fps: int,
                 label_to_idx: Dict,
                 nlabels: int):
        self.clip_len = 120
        self.target_len = 10
        self.pieces_per_clip = self.clip_len // self.target_len
        self.filenames = list(data.keys())
        self.audio_dir = audio_dir
        assert self.audio_dir.is_dir(), f"{audio_dir} is not a directory"
        self.sample_rate = sample_rate
        # all files are 120 seconds long, split them into 12 x 10 second pieces
        self.pieces = []
        self.labels = []
        self.timestamps = []
        for filename in self.filenames:
            self.pieces += [(filename, i) for i in range(self.pieces_per_clip)]
            labels = data[filename]
            frame_len = 1000 / label_fps
            timestamps = np.arange(label_fps * self.clip_len) * frame_len + 0.5 * frame_len
            timestamp_labels = get_labels_for_timestamps(labels, timestamps)
            ys = []
            for timestamp_label in timestamp_labels:
                timestamp_label_idxs = [label_to_idx[str(event)] for event in timestamp_label]
                y_timestamp = label_to_binary_vector(timestamp_label_idxs, nlabels)
                ys.append(y_timestamp)
            ys = torch.stack(ys)
            frames_per_clip = ys.size(0) // self.pieces_per_clip
            self.labels += [ys[frames_per_clip * i: frames_per_clip * (i + 1)] for i in range(self.pieces_per_clip)]
            self.timestamps += [timestamps[frames_per_clip * i: frames_per_clip * (i + 1)] for i in
                                range(self.pieces_per_clip)]

        assert len(self.labels) == len(self.pieces) == len(self.filenames) * self.pieces_per_clip

    def __len__(self):
        return len(self.pieces)

    def __getitem__(self, idx):
        filename = self.pieces[idx][0]
        piece = self.pieces[idx][1]
        audio_path = self.audio_dir.joinpath(filename)
        audio, sr = sf.read(str(audio_path), dtype=np.float32)
        assert sr == self.sample_rate
        start = self.sample_rate * piece * self.target_len
        end = start + self.sample_rate * self.target_len
        audio = audio[start:end]
        return audio, self.labels[idx].transpose(0, 1), filename, self.timestamps[idx]


class RandomCropDataset(Dataset):
    """
    Read in a JSON file and return audio and audio filenames
    """

    def __init__(self, data: Dict,
                 audio_dir: Path,
                 sample_rate: int,
                 label_fps: int,
                 label_to_idx: Dict,
                 nlabels: int):
        self.clip_len = 120
        self.target_len = 10
        self.pieces_per_clip = self.clip_len // self.target_len
        self.filenames = list(data.keys())
        self.audio_dir = audio_dir
        assert self.audio_dir.is_dir(), f"{audio_dir} is not a directory"
        self.sample_rate = sample_rate
        self.label_fps = label_fps
        # all files are 120 seconds long, randomly crop 10 seconds snippets
        self.labels = []
        self.timestamps = []
        for filename in self.filenames:
            labels = data[filename]
            frame_len = 1000 / label_fps
            timestamps = np.arange(label_fps * self.clip_len) * frame_len + 0.5 * frame_len
            timestamp_labels = get_labels_for_timestamps(labels, timestamps)
            ys = []
            for timestamp_label in timestamp_labels:
                timestamp_label_idxs = [label_to_idx[str(event)] for event in timestamp_label]
                y_timestamp = label_to_binary_vector(timestamp_label_idxs, nlabels)
                ys.append(y_timestamp)
            ys = torch.stack(ys)
            self.labels.append(ys)
            self.timestamps.append(timestamps)

        assert len(self.labels) == len(self.filenames)

    def __len__(self):
        return len(self.filenames) * self.clip_len // self.target_len

    def __getitem__(self, idx):
        idx = idx % len(self.filenames)
        filename = self.filenames[idx]
        audio_path = self.audio_dir.joinpath(filename)
        audio, sr = sf.read(str(audio_path), dtype=np.float32)
        assert sr == self.sample_rate

        # crop random 10 seconds piece
        labels_to_pick = self.target_len * self.label_fps
        max_offset = len(self.labels[idx]) - labels_to_pick + 1
        offset = torch.randint(max_offset, (1,)).item()
        labels = self.labels[idx][offset:offset + labels_to_pick]
        scale = self.sample_rate // self.label_fps
        audio = audio[offset * scale:offset * scale + labels_to_pick * scale]
        timestamps = self.timestamps[idx][offset:offset + labels_to_pick]
        return audio, labels.transpose(0, 1), filename, timestamps


def get_training_dataset(
    task_path,
    sample_rate=16000,
    label_fps=25,
    wavmix_p=0.0,
    random_crop=True
):
    task_path = Path(task_path)

    label_vocab, nlabels = label_vocab_nlabels(task_path)
    label_to_idx = label_vocab_as_dict(label_vocab, key="label", value="idx")

    train_fold = task_path.joinpath("train.json")
    audio_dir = task_path.joinpath(str(sample_rate), "train")
    train_fold_data = json.load(train_fold.open())
    if random_crop:
        dataset = RandomCropDataset(train_fold_data, audio_dir, sample_rate, label_fps, label_to_idx, nlabels)
    else:
        dataset = FixCropDataset(train_fold_data, audio_dir, sample_rate, label_fps, label_to_idx, nlabels)
    if wavmix_p > 0:
        dataset = MixupDataset(dataset, rate=wavmix_p)
    return dataset


def get_validation_dataset(
    task_path,
    sample_rate=16000,
    label_fps=25,
):
    task_path = Path(task_path)

    label_vocab, nlabels = label_vocab_nlabels(task_path)
    label_to_idx = label_vocab_as_dict(label_vocab, key="label", value="idx")

    valid_fold = task_path.joinpath("valid.json")
    audio_dir = task_path.joinpath(str(sample_rate), "valid")
    valid_fold_data = json.load(valid_fold.open())
    dataset = FixCropDataset(valid_fold_data, audio_dir, sample_rate, label_fps, label_to_idx, nlabels)
    return dataset


def get_test_dataset(
    task_path,
    sample_rate=16000,
    label_fps=25,
):
    task_path = Path(task_path)

    label_vocab, nlabels = label_vocab_nlabels(task_path)
    label_to_idx = label_vocab_as_dict(label_vocab, key="label", value="idx")

    test_fold = task_path.joinpath("test.json")
    audio_dir = task_path.joinpath(str(sample_rate), "test")
    test_fold_data = json.load(test_fold.open())
    dataset = FixCropDataset(test_fold_data, audio_dir, sample_rate, label_fps, label_to_idx, nlabels)
    return dataset


def get_labels_for_timestamps(labels: List, timestamps: np.ndarray) -> List:
    # A list of labels present at each timestamp
    tree = IntervalTree()
    # Add all events to the label tree
    for event in labels:
        # We add 0.0001 so that the end also includes the event
        tree.addi(event["start"], event["end"] + 0.0001, event["label"])

    timestamp_labels = []
    # Update the binary vector of labels with intervals for each timestamp
    for j, t in enumerate(timestamps):
        interval_labels: List[str] = [interval.data for interval in tree[t]]
        timestamp_labels.append(interval_labels)
        # If we want to store the timestamp too
        # labels_for_sound.append([float(t), interval_labels])

    assert len(timestamp_labels) == len(timestamps)
    return timestamp_labels


def label_vocab_nlabels(task_path: Path) -> Tuple[pd.DataFrame, int]:
    label_vocab = pd.read_csv(task_path.joinpath("labelvocabulary.csv"))

    nlabels = len(label_vocab)
    assert nlabels == label_vocab["idx"].max() + 1
    return (label_vocab, nlabels)


def label_vocab_as_dict(df: pd.DataFrame, key: str, value: str) -> Dict:
    """
    Returns a dictionary of the label vocabulary mapping the label column to
    the idx column. key sets whether the label or idx is the key in the dict. The
    other column will be the value.
    """
    if key == "label":
        # Make sure the key is a string
        df["label"] = df["label"].astype(str)
        value = "idx"
    else:
        assert key == "idx", "key argument must be either 'label' or 'idx'"
        value = "label"
    return df.set_index(key).to_dict()[value]


def label_to_binary_vector(label: List, num_labels: int) -> torch.Tensor:
    """
    Converts a list of labels into a binary vector
    Args:
        label: list of integer labels
        num_labels: total number of labels

    Returns:
        A float Tensor that is multi-hot binary vector
    """
    # Lame special case for multilabel with no labels
    if len(label) == 0:
        # BCEWithLogitsLoss wants float not long targets
        binary_labels = torch.zeros((num_labels,), dtype=torch.float)
    else:
        binary_labels = torch.zeros((num_labels,)).scatter(0, torch.tensor(label), 1.0)

    # Validate the binary vector we just created
    assert set(torch.where(binary_labels == 1.0)[0].numpy()) == set(label)
    return binary_labels


class MixupDataset(Dataset):
    """ Mixing Up wave forms
    """

    def __init__(self, dataset, beta=0.2, rate=0.5):
        self.beta = beta
        self.rate = rate
        self.dataset = dataset
        print(f"Mixing up waveforms from dataset of len {len(dataset)}")

    def __getitem__(self, index):
        if torch.rand(1) < self.rate:
            batch1 = self.dataset[index]
            idx2 = torch.randint(len(self.dataset), (1,)).item()
            batch2 = self.dataset[idx2]
            x1, x2 = batch1[0], batch2[0]
            y1, y2 = batch1[1], batch2[1]
            l = np.random.beta(self.beta, self.beta)
            l = max(l, 1. - l)
            x1 = x1 - x1.mean()
            x2 = x2 - x2.mean()
            x = (x1 * l + x2 * (1. - l))
            x = x - x.mean()
            y = (y1 * l + y2 * (1. - l))
            return x, y, batch1[2], batch1[3]
        return self.dataset[index]

    def __len__(self):
        return len(self.dataset)
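`get_labels_for_timestamps` and `label_to_binary_vector` together turn an event list into framewise multi-hot targets. A small, hedged example with made-up events and a two-class vocabulary (the event times, label names, and frame rate below are illustrative only):

```python
# Illustrative sketch: framewise targets from a toy event list (values are made up).
import numpy as np
import torch

from data_util.dcase2016task2 import get_labels_for_timestamps, label_to_binary_vector

events = [
    {"start": 0.0, "end": 1000.0, "label": "clearthroat"},  # times in ms, names are examples
    {"start": 500.0, "end": 1500.0, "label": "cough"},
]
label_to_idx = {"clearthroat": 0, "cough": 1}

timestamps = np.arange(0, 2000.0, 40.0)  # 25 fps -> one frame every 40 ms
frame_labels = get_labels_for_timestamps(events, timestamps)
targets = torch.stack(
    [label_to_binary_vector([label_to_idx[l] for l in frame], 2) for frame in frame_labels]
)
print(targets.shape)  # (50, 2): frames x classes
```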
data_util/transforms.py
ADDED
|
@@ -0,0 +1,195 @@
|
| 1 |
+
import os
|
| 2 |
+
|
| 3 |
+
import datasets
|
| 4 |
+
import h5py
|
| 5 |
+
import numpy as np
|
| 6 |
+
import pandas as pd
|
| 7 |
+
import torch
|
| 8 |
+
import torchaudio
|
| 9 |
+
|
| 10 |
+
from data_util.audioset_classes import as_strong_train_classes
|
| 11 |
+
|
| 12 |
+
## Transforms with a similar style to https://github.com/descriptinc/audiotools/blob/master/audiotools/data/transforms.py
|
| 13 |
+
logger = datasets.logging.get_logger(__name__)
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
def target_transform(sample):
|
| 17 |
+
del sample["labels"]
|
| 18 |
+
del sample["label_ids"]
|
| 19 |
+
return sample
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
def strong_label_transform(sample, strong_label_encoder=None):
|
| 23 |
+
assert strong_label_encoder is not None
|
| 24 |
+
events = pd.DataFrame(sample['events'][0])
|
| 25 |
+
events = events[events['event_label'].isin(set(as_strong_train_classes))]
|
| 26 |
+
strong = strong_label_encoder.encode_strong_df(events).T
|
| 27 |
+
sample["strong"] = [strong]
|
| 28 |
+
sample["event_count"] = [strong.sum(1)]
|
| 29 |
+
# encode ground truth events as string - we will use this for evaluation
|
| 30 |
+
sample["gt_string"] = ["++".join([";;".join([str(e[0]), str(e[1]), e[2]]) for e in
|
| 31 |
+
zip(sample['events'][0]['onset'], sample['events'][0]['offset'],
|
| 32 |
+
sample['events'][0]['event_label'])])]
|
| 33 |
+
del sample['events']
|
| 34 |
+
return sample
|
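# Illustrative example (hypothetical labels): two events (onset, offset, label) are encoded as the string
# "0.0;;2.5;;Speech++3.1;;7.2;;Dog", which is split again on "++" and ";;" during evaluation.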
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class AddPseudoLabelsTransform:
|
| 38 |
+
def __init__(self, pseudo_labels_file):
|
| 39 |
+
self.pseudo_labels_file = pseudo_labels_file
|
| 40 |
+
|
| 41 |
+
if self.pseudo_labels_file is not None:
|
| 42 |
+
# fetch dict of positions for each example
|
| 43 |
+
self.ex2pseudo_idx = {}
|
| 44 |
+
f = h5py.File(self.pseudo_labels_file, "r")
|
| 45 |
+
for i, fname in enumerate(f["filenames"]):
|
| 46 |
+
self.ex2pseudo_idx[fname.decode("UTF-8")] = i
|
| 47 |
+
self._opened_pseudo_hdf5 = None
|
| 48 |
+
|
| 49 |
+
@property
|
| 50 |
+
def pseudo_hdf5_file(self):
|
| 51 |
+
if self._opened_pseudo_hdf5 is None:
|
| 52 |
+
self._opened_pseudo_hdf5 = h5py.File(self.pseudo_labels_file, "r")
|
| 53 |
+
return self._opened_pseudo_hdf5
|
| 54 |
+
|
| 55 |
+
def add_pseudo_label_transform(self, sample):
|
| 56 |
+
indices = [self.ex2pseudo_idx[fn[:-len(".mp3")]] for fn in sample['filename']]  # strip the ".mp3" suffix (rstrip would also drop trailing 'm'/'p'/'3'/'.')
|
| 57 |
+
pseudo_strong = [torch.from_numpy(np.stack(self.pseudo_hdf5_file["strong_logits"][index])).float()
|
| 58 |
+
for index in indices]
|
| 59 |
+
pseudo_strong = [torch.sigmoid(pseudo_strong[i]) for i in range(len(pseudo_strong))]
|
| 60 |
+
sample['pseudo_strong'] = pseudo_strong
|
| 61 |
+
return sample
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
class SequentialTransform:
|
| 65 |
+
"""Apply a sequence of transforms to a batch."""
|
| 66 |
+
|
| 67 |
+
def __init__(self, transforms):
|
| 68 |
+
"""
|
| 69 |
+
Args:
|
| 70 |
+
transforms: list of transforms to apply
|
| 71 |
+
"""
|
| 72 |
+
self.transforms = transforms
|
| 73 |
+
|
| 74 |
+
def append(self, transform):
|
| 75 |
+
self.transforms.append(transform)
|
| 76 |
+
|
| 77 |
+
def __call__(self, batch):
|
| 78 |
+
for t in self.transforms:
|
| 79 |
+
batch = t(batch)
|
| 80 |
+
return batch
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
class Mp3DecodeTransform:
|
| 84 |
+
def __init__(
|
| 85 |
+
self,
|
| 86 |
+
mp3_bytes_key="mp3_bytes",
|
| 87 |
+
audio_key="audio",
|
| 88 |
+
sample_rate=32000,
|
| 89 |
+
max_length=10.0,
|
| 90 |
+
min_length=None,
|
| 91 |
+
random_sample_crop=True,
|
| 92 |
+
allow_resample=True,
|
| 93 |
+
resampling_method="sinc_interp_kaiser",
|
| 94 |
+
keep_mp3_bytes=False,
|
| 95 |
+
debug_info_key=None,
|
| 96 |
+
):
|
| 97 |
+
"""Decode mp3 bytes to audio waveform
|
| 98 |
+
|
| 99 |
+
Args:
|
| 100 |
+
mp3_bytes_key (str, optional): The key to mp3 bytes in the input batch. Defaults to "mp3_bytes".
|
| 101 |
+
audio_key (str, optional): The key to save the decoded audio in the output batch. Defaults to "audio".
|
| 102 |
+
sample_rate (int, optional): The expected output sample rate of the decoded audio. Defaults to 32000.
|
| 103 |
+
max_length (int, float, optional): the maximum output audio length in seconds if float, otherwise in samples. Defaults to 10.
|
| 104 |
+
min_length (int, float, optional): the minimum output audio length in seconds if float, otherwise in samples. Defaults to max_length.
|
| 105 |
+
random_sample_crop (bool, optional): Randomly crop the audio to max_length if it is longer, otherwise return the first crop. Defaults to True.
|
| 106 |
+
allow_resample (bool, optional): Resample the signal if the sampling rates don't match. Defaults to True.
|
| 107 |
+
resampling_method (str, optional): resampling method passed to torchaudio.transforms.Resample. Defaults to "sinc_interp_kaiser".
|
| 108 |
+
keep_mp3_bytes (bool, optional): keep the original bytes in the output dict. Defaults to False.
|
| 109 |
+
|
| 110 |
+
Raises:
|
| 111 |
+
Exception: if minimp3py is not installed
|
| 112 |
+
"""
|
| 113 |
+
self.mp3_bytes_key = mp3_bytes_key
|
| 114 |
+
self.audio_key = audio_key
|
| 115 |
+
self.sample_rate = sample_rate
|
| 116 |
+
self.max_length = max_length
|
| 117 |
+
if min_length is None:
|
| 118 |
+
min_length = max_length
|
| 119 |
+
self.min_length = min_length
|
| 120 |
+
self.random_sample_crop = random_sample_crop
|
| 121 |
+
self.allow_resample = allow_resample
|
| 122 |
+
self.resampling_method = resampling_method
|
| 123 |
+
self.keep_mp3_bytes = keep_mp3_bytes
|
| 124 |
+
self.debug_info_key = debug_info_key
|
| 125 |
+
self.resamplers_cache = {}
|
| 126 |
+
try:
|
| 127 |
+
import minimp3py # noqa: F401
|
| 128 |
+
except ImportError:
|
| 129 |
+
raise Exception(
|
| 130 |
+
"minimp3py is not installed, please install it using: `CFLAGS='-O3 -march=native' pip install https://github.com/f0k/minimp3py/archive/master.zip`"
|
| 131 |
+
)
|
| 132 |
+
|
| 133 |
+
def __call__(self, batch):
|
| 134 |
+
import minimp3py
|
| 135 |
+
|
| 136 |
+
data_list = batch[self.mp3_bytes_key]
|
| 137 |
+
if self.debug_info_key is not None:
|
| 138 |
+
file_name_list = batch[self.debug_info_key]
|
| 139 |
+
else:
|
| 140 |
+
file_name_list = range(len(data_list))
|
| 141 |
+
audio_list = []
|
| 142 |
+
for data, file_name in zip(data_list, file_name_list):
|
| 143 |
+
try:
|
| 144 |
+
duration, ch, sr = minimp3py.probe(data)
|
| 145 |
+
if isinstance(self.max_length, float):
|
| 146 |
+
max_length = int(self.max_length * sr)
|
| 147 |
+
else:
|
| 148 |
+
max_length = int(self.max_length * sr // self.sample_rate)
|
| 149 |
+
offset = 0
|
| 150 |
+
if self.random_sample_crop and duration > max_length:
|
| 151 |
+
max_offset = max(int(duration - max_length), 0) + 1
|
| 152 |
+
offset = torch.randint(max_offset, (1,)).item()
|
| 153 |
+
waveform, _ = minimp3py.read(data, start=offset, length=max_length)
|
| 154 |
+
waveform = waveform[:, 0] # 0 for the first channel only
|
| 155 |
+
if waveform.dtype != "float32":
|
| 156 |
+
raise RuntimeError("Unexpected wave type")
|
| 157 |
+
|
| 158 |
+
waveform = torch.from_numpy(waveform)
|
| 159 |
+
if len(waveform) == 0:
|
| 160 |
+
logger.warning(
|
| 161 |
+
f"Empty waveform for {file_name}, duration {duration}, offset {offset}, max_length {max_length}, sr {sr}, ch {ch}"
|
| 162 |
+
)
|
| 163 |
+
elif sr != self.sample_rate:
|
| 164 |
+
assert self.allow_resample, f"Unexpected sample rate {sr} instead of {self.sample_rate} at {file_name}"
|
| 165 |
+
if self.resamplers_cache.get(sr) is None:
|
| 166 |
+
self.resamplers_cache[sr] = torchaudio.transforms.Resample(
|
| 167 |
+
sr,
|
| 168 |
+
self.sample_rate,
|
| 169 |
+
resampling_method=self.resampling_method,
|
| 170 |
+
)
|
| 171 |
+
waveform = self.resamplers_cache[sr](waveform)
|
| 172 |
+
min_length = self.min_length
|
| 173 |
+
if isinstance(self.min_length, float):
|
| 174 |
+
min_length = int(self.min_length * self.sample_rate)
|
| 175 |
+
if min_length is not None and len(waveform) < min_length:
|
| 176 |
+
waveform = torch.concatenate(
|
| 177 |
+
(
|
| 178 |
+
waveform,
|
| 179 |
+
torch.zeros(
|
| 180 |
+
min_length - len(waveform),
|
| 181 |
+
dtype=waveform.dtype,
|
| 182 |
+
device=waveform.device,
|
| 183 |
+
),
|
| 184 |
+
),
|
| 185 |
+
dim=0,
|
| 186 |
+
)
|
| 187 |
+
audio_list.append(waveform)
|
| 188 |
+
except Exception as e:
|
| 189 |
+
print(f"Error decoding {file_name}: {e}")
|
| 190 |
+
raise e
|
| 191 |
+
batch[self.audio_key] = audio_list
|
| 192 |
+
batch["sampling_rate"] = [self.sample_rate] * len(audio_list)
|
| 193 |
+
if not self.keep_mp3_bytes:
|
| 194 |
+
del batch[self.mp3_bytes_key]
|
| 195 |
+
return batch
|
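A minimal sketch of chaining these transforms on a single batch (`raw_bytes` is an assumption standing in for the mp3 bytes of one clip):

    decode = Mp3DecodeTransform(sample_rate=32000, max_length=10.0, debug_info_key="filename")
    pipeline = SequentialTransform([decode])
    batch = pipeline({"mp3_bytes": [raw_bytes], "filename": ["example.mp3"]})
    waveform = batch["audio"][0]  # ~10 s mono waveform at 32 kHz, zero-padded if the clip is shorter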
ex_audioset_strong.py
ADDED
|
@@ -0,0 +1,504 @@
|
| 1 |
+
import numpy as np
|
| 2 |
+
import pandas as pd
|
| 3 |
+
import torch
|
| 4 |
+
from torch.utils.data import DataLoader
|
| 5 |
+
import argparse
|
| 6 |
+
import torch.nn as nn
|
| 7 |
+
import wandb
|
| 8 |
+
import transformers
|
| 9 |
+
import random
|
| 10 |
+
import pytorch_lightning as pl
|
| 11 |
+
from pytorch_lightning.loggers import WandbLogger
|
| 12 |
+
import sed_scores_eval
|
| 13 |
+
|
| 14 |
+
from helpers.decode import batched_decode_preds
|
| 15 |
+
from helpers.encode import ManyHotEncoder
|
| 16 |
+
from models.atstframe.ATSTF_wrapper import ATSTWrapper
|
| 17 |
+
from models.beats.BEATs_wrapper import BEATsWrapper
|
| 18 |
+
from models.frame_passt.fpasst_wrapper import FPaSSTWrapper
|
| 19 |
+
from models.m2d.M2D_wrapper import M2DWrapper
|
| 20 |
+
from models.asit.ASIT_wrapper import ASiTWrapper
|
| 21 |
+
from models.prediction_wrapper import PredictionsWrapper
|
| 22 |
+
from helpers.augment import frame_shift, time_mask, mixup, filter_augmentation, mixstyle, RandomResizeCrop
|
| 23 |
+
from helpers.utils import worker_init_fn
|
| 24 |
+
from data_util.audioset_strong import get_training_dataset, get_eval_dataset
|
| 25 |
+
from data_util.audioset_strong import get_temporal_count_balanced_sample_weights, get_uniform_sample_weights, \
|
| 26 |
+
get_weighted_sampler
|
| 27 |
+
from data_util.audioset_classes import as_strong_train_classes, as_strong_eval_classes
|
| 28 |
+
from models.frame_mn.Frame_MN_wrapper import FrameMNWrapper
|
| 29 |
+
from models.frame_mn.utils import NAME_TO_WIDTH
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class PLModule(pl.LightningModule):
|
| 33 |
+
def __init__(self, config, encoder):
|
| 34 |
+
super().__init__()
|
| 35 |
+
self.config = config
|
| 36 |
+
self.encoder = encoder
|
| 37 |
+
|
| 38 |
+
if config.pretrained == "scratch":
|
| 39 |
+
checkpoint = None
|
| 40 |
+
elif config.pretrained == "ssl":
|
| 41 |
+
checkpoint = "ssl"
|
| 42 |
+
elif config.pretrained == "weak":
|
| 43 |
+
checkpoint = "weak"
|
| 44 |
+
elif config.pretrained == "strong":
|
| 45 |
+
checkpoint = "strong_1"
|
| 46 |
+
else:
|
| 47 |
+
raise ValueError(f"Unknown pretrained checkpoint: {config.pretrained}")
|
| 48 |
+
|
| 49 |
+
# load transformer model
|
| 50 |
+
if config.model_name == "BEATs":
|
| 51 |
+
beats = BEATsWrapper()
|
| 52 |
+
model = PredictionsWrapper(beats, checkpoint=f"BEATs_{checkpoint}" if checkpoint else None,
|
| 53 |
+
seq_model_type=config.seq_model_type)
|
| 54 |
+
elif config.model_name == "ATST-F":
|
| 55 |
+
atst = ATSTWrapper()
|
| 56 |
+
model = PredictionsWrapper(atst, checkpoint=f"ATST-F_{checkpoint}" if checkpoint else None,
|
| 57 |
+
seq_model_type=config.seq_model_type)
|
| 58 |
+
elif config.model_name == "fpasst":
|
| 59 |
+
fpasst = FPaSSTWrapper()
|
| 60 |
+
model = PredictionsWrapper(fpasst, checkpoint=f"fpasst_{checkpoint}" if checkpoint else None,
|
| 61 |
+
seq_model_type=config.seq_model_type)
|
| 62 |
+
elif config.model_name == "M2D":
|
| 63 |
+
m2d = M2DWrapper()
|
| 64 |
+
model = PredictionsWrapper(m2d, checkpoint=f"M2D_{checkpoint}" if checkpoint else None,
|
| 65 |
+
seq_model_type=config.seq_model_type,
|
| 66 |
+
embed_dim=m2d.m2d.cfg.feature_d)
|
| 67 |
+
elif config.model_name == "ASIT":
|
| 68 |
+
asit = ASiTWrapper()
|
| 69 |
+
model = PredictionsWrapper(asit, checkpoint=f"ASIT_{checkpoint}" if checkpoint else None,
|
| 70 |
+
seq_model_type=config.seq_model_type)
|
| 71 |
+
elif config.model_name.startswith("frame_mn"):
|
| 72 |
+
width = NAME_TO_WIDTH(config.model_name)
|
| 73 |
+
frame_mn = FrameMNWrapper(width)
|
| 74 |
+
embed_dim = frame_mn.state_dict()['frame_mn.features.16.1.bias'].shape[0]
|
| 75 |
+
model = PredictionsWrapper(frame_mn, checkpoint=f"{config.model_name}_strong_1", embed_dim=embed_dim)
|
| 76 |
+
else:
|
| 77 |
+
raise NotImplementedError(f"Model {config.model_name} not (yet) implemented")
|
| 78 |
+
|
| 79 |
+
self.model = model
|
| 80 |
+
|
| 81 |
+
# prepare ingredients for knowledge distillation
|
| 82 |
+
assert 0 <= config.distillation_loss_weight <= 1, "Lambda for Knowledge Distillation must be between 0 and 1."
|
| 83 |
+
self.strong_loss = nn.BCEWithLogitsLoss()
|
| 84 |
+
|
| 85 |
+
self.freq_warp = RandomResizeCrop((1, 1.0), time_scale=(1.0, 1.0))
|
| 86 |
+
|
| 87 |
+
self.val_durations_df = pd.read_csv(f"resources/eval_durations.csv",
|
| 88 |
+
sep=",", header=None, names=["filename", "duration"])
|
| 89 |
+
self.val_predictions_strong = {}
|
| 90 |
+
self.val_ground_truth = {}
|
| 91 |
+
self.val_duration = {}
|
| 92 |
+
self.val_loss = []
|
| 93 |
+
|
| 94 |
+
def forward(self, batch):
|
| 95 |
+
x = batch["audio"]
|
| 96 |
+
mel = self.model.mel_forward(x)
|
| 97 |
+
y_strong, _ = self.model(mel)
|
| 98 |
+
return y_strong
|
| 99 |
+
|
| 100 |
+
def get_optimizer(
|
| 101 |
+
self, lr, adamw=False, weight_decay=0.01, betas=(0.9, 0.999)
|
| 102 |
+
):
|
| 103 |
+
# we split the parameters into two groups, one for the pretrained model and one for the downstream model
|
| 104 |
+
# we also split each of them into <=1 dimensional and >=2 dimensional parameters, so we can only
|
| 105 |
+
# apply weight decay to the >=2 dimensional parameters, thus excluding biases and batch norms, an idea from NanoGPT
|
| 106 |
+
params_leq1D = []
|
| 107 |
+
params_geq2D = []
|
| 108 |
+
|
| 109 |
+
for name, param in self.model.named_parameters():
|
| 110 |
+
if param.requires_grad:
|
| 111 |
+
if param.ndimension() >= 2:
|
| 112 |
+
params_geq2D.append(param)
|
| 113 |
+
else:
|
| 114 |
+
params_leq1D.append(param)
|
| 115 |
+
|
| 116 |
+
param_groups = [
|
| 117 |
+
{'params': params_leq1D, 'lr': lr},
|
| 118 |
+
{'params': params_geq2D, 'lr': lr, 'weight_decay': weight_decay},
|
| 119 |
+
]
|
| 120 |
+
|
| 121 |
+
if weight_decay > 0:
|
| 122 |
+
assert adamw
|
| 123 |
+
assert len(param_groups) > 0
|
| 124 |
+
if adamw:
|
| 125 |
+
print(f"\nUsing adamw weight_decay={weight_decay}!\n")
|
| 126 |
+
return torch.optim.AdamW(param_groups, lr=lr, betas=betas)
|
| 127 |
+
return torch.optim.Adam(param_groups, lr=lr, betas=betas)
|
| 128 |
+
|
| 129 |
+
def get_lr_scheduler(
|
| 130 |
+
self,
|
| 131 |
+
optimizer,
|
| 132 |
+
num_training_steps,
|
| 133 |
+
schedule_mode="cos",
|
| 134 |
+
gamma: float = 0.999996,
|
| 135 |
+
num_warmup_steps=20000,
|
| 136 |
+
lr_end=2e-7,
|
| 137 |
+
):
|
| 138 |
+
if schedule_mode in {"exp"}:
|
| 139 |
+
return torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma)
|
| 140 |
+
if schedule_mode in {"cosine", "cos"}:
|
| 141 |
+
return transformers.get_cosine_schedule_with_warmup(
|
| 142 |
+
optimizer,
|
| 143 |
+
num_warmup_steps=num_warmup_steps,
|
| 144 |
+
num_training_steps=num_training_steps,
|
| 145 |
+
)
|
| 146 |
+
if schedule_mode in {"linear"}:
|
| 147 |
+
print("Linear schedule!")
|
| 148 |
+
return transformers.get_polynomial_decay_schedule_with_warmup(
|
| 149 |
+
optimizer,
|
| 150 |
+
num_warmup_steps=num_warmup_steps,
|
| 151 |
+
num_training_steps=num_training_steps,
|
| 152 |
+
power=1.0,
|
| 153 |
+
lr_end=lr_end,
|
| 154 |
+
)
|
| 155 |
+
raise RuntimeError(f"schedule_mode={schedule_mode} Unknown.")
|
| 156 |
+
|
| 157 |
+
def configure_optimizers(self):
|
| 158 |
+
"""
|
| 159 |
+
This is the way PyTorch Lightning requires optimizers and learning rate schedulers to be defined.
|
| 160 |
+
The specified items are used automatically in the optimization loop (no need to call optimizer.step() yourself).
|
| 161 |
+
:return: dict containing optimizer and learning rate scheduler
|
| 162 |
+
"""
|
| 163 |
+
optimizer = self.get_optimizer(self.config.max_lr, adamw=self.config.adamw,
|
| 164 |
+
weight_decay=self.config.weight_decay)
|
| 165 |
+
|
| 166 |
+
num_training_steps = self.trainer.estimated_stepping_batches
|
| 167 |
+
|
| 168 |
+
scheduler = self.get_lr_scheduler(optimizer, num_training_steps,
|
| 169 |
+
schedule_mode=self.config.schedule_mode,
|
| 170 |
+
lr_end=self.config.lr_end)
|
| 171 |
+
lr_scheduler_config = {
|
| 172 |
+
"scheduler": scheduler,
|
| 173 |
+
"interval": "step",
|
| 174 |
+
"frequency": 1
|
| 175 |
+
}
|
| 176 |
+
return [optimizer], [lr_scheduler_config]
|
| 177 |
+
|
| 178 |
+
def training_step(self, train_batch, batch_idx):
|
| 179 |
+
"""
|
| 180 |
+
:param train_batch: contains one batch from train dataloader
|
| 181 |
+
:param batch_idx
|
| 182 |
+
:return: a dict containing at least loss that is used to update model parameters, can also contain
|
| 183 |
+
other items that can be processed in 'training_epoch_end' to log other metrics than loss
|
| 184 |
+
"""
|
| 185 |
+
|
| 186 |
+
x = train_batch["audio"]
|
| 187 |
+
labels = train_batch['strong']
|
| 188 |
+
if 'pseudo_strong' in train_batch:
|
| 189 |
+
pseudo_labels = train_batch['pseudo_strong']
|
| 190 |
+
else:
|
| 191 |
+
# create dummy pseudo labels
|
| 192 |
+
pseudo_labels = torch.zeros_like(labels)
|
| 193 |
+
assert self.config.distillation_loss_weight == 0
|
| 194 |
+
|
| 195 |
+
mel = self.model.mel_forward(x)
|
| 196 |
+
|
| 197 |
+
# time rolling
|
| 198 |
+
if self.config.frame_shift_range > 0:
|
| 199 |
+
mel, labels, pseudo_labels = frame_shift(
|
| 200 |
+
mel,
|
| 201 |
+
labels,
|
| 202 |
+
pseudo_labels=pseudo_labels,
|
| 203 |
+
net_pooling=self.encoder.net_pooling,
|
| 204 |
+
shift_range=self.config.frame_shift_range
|
| 205 |
+
)
|
| 206 |
+
|
| 207 |
+
# mixup
|
| 208 |
+
if self.config.mixup_p > random.random():
|
| 209 |
+
mel, labels, pseudo_labels = mixup(
|
| 210 |
+
mel,
|
| 211 |
+
targets=labels,
|
| 212 |
+
pseudo_strong=pseudo_labels
|
| 213 |
+
)
|
| 214 |
+
|
| 215 |
+
# mixstyle
|
| 216 |
+
if self.config.mixstyle_p > random.random():
|
| 217 |
+
mel = mixstyle(
|
| 218 |
+
mel
|
| 219 |
+
)
|
| 220 |
+
|
| 221 |
+
# time masking
|
| 222 |
+
if self.config.max_time_mask_size > 0:
|
| 223 |
+
mel, labels, pseudo_labels = time_mask(
|
| 224 |
+
mel,
|
| 225 |
+
labels,
|
| 226 |
+
pseudo_labels=pseudo_labels,
|
| 227 |
+
net_pooling=self.encoder.net_pooling,
|
| 228 |
+
max_mask_ratio=self.config.max_time_mask_size
|
| 229 |
+
)
|
| 230 |
+
|
| 231 |
+
# frequency masking
|
| 232 |
+
if self.config.filter_augment_p > random.random():
|
| 233 |
+
mel, _ = filter_augmentation(
|
| 234 |
+
mel
|
| 235 |
+
)
|
| 236 |
+
|
| 237 |
+
# frequency warping
|
| 238 |
+
if self.config.freq_warp_p > random.random():
|
| 239 |
+
mel = mel.squeeze(1)
|
| 240 |
+
mel = self.freq_warp(mel)
|
| 241 |
+
mel = mel.unsqueeze(1)
|
| 242 |
+
|
| 243 |
+
# forward through network; use strong head
|
| 244 |
+
y_hat_strong, _ = self.model(mel)
|
| 245 |
+
|
| 246 |
+
strong_supervised_loss = self.strong_loss(y_hat_strong, labels)
|
| 247 |
+
|
| 248 |
+
if self.config.distillation_loss_weight > 0:
|
| 249 |
+
strong_distillation_loss = self.strong_loss(y_hat_strong, pseudo_labels)
|
| 250 |
+
else:
|
| 251 |
+
strong_distillation_loss = torch.tensor(0., device=y_hat_strong.device, dtype=y_hat_strong.dtype)
|
| 252 |
+
|
| 253 |
+
loss = self.config.distillation_loss_weight * strong_distillation_loss \
|
| 254 |
+
+ (1 - self.config.distillation_loss_weight) * strong_supervised_loss
|
| 255 |
+
|
| 256 |
+
# logging
|
| 257 |
+
self.log('epoch', self.current_epoch)
|
| 258 |
+
for i, param_group in enumerate(self.trainer.optimizers[0].param_groups):
|
| 259 |
+
self.log(f'trainer/lr_optimizer_{i}', param_group['lr'])
|
| 260 |
+
self.log("train/loss", loss.detach().cpu(), prog_bar=True)
|
| 261 |
+
self.log("train/strong_supervised_loss", strong_supervised_loss.detach().cpu())
|
| 262 |
+
self.log("train/strong_distillation_loss", strong_distillation_loss.detach().cpu())
|
| 263 |
+
|
| 264 |
+
return loss
|
| 265 |
+
|
| 266 |
+
def validation_step(self, val_batch, batch_idx):
|
| 267 |
+
# bring ground truth into shape needed for evaluation
|
| 268 |
+
for f, gt_string in zip(val_batch["filename"], val_batch["gt_string"]):
|
| 269 |
+
f = f[:-len(".mp3")]
|
| 270 |
+
events = [e.split(";;") for e in gt_string.split("++")]
|
| 271 |
+
self.val_ground_truth[f] = [(float(e[0]), float(e[1]), e[2]) for e in events]
|
| 272 |
+
self.val_duration[f] = self.val_durations_df[self.val_durations_df["filename"] == f]["duration"].values[0]
|
| 273 |
+
|
| 274 |
+
y_hat_strong = self(val_batch)
|
| 275 |
+
y_strong = val_batch["strong"]
|
| 276 |
+
|
| 277 |
+
loss = self.strong_loss(y_hat_strong, y_strong)
|
| 278 |
+
self.val_loss.append(loss.cpu())
|
| 279 |
+
|
| 280 |
+
scores_raw, scores_postprocessed, prediction_dfs = batched_decode_preds(
|
| 281 |
+
y_hat_strong.float(),
|
| 282 |
+
val_batch['filename'],
|
| 283 |
+
self.encoder,
|
| 284 |
+
median_filter=self.config.median_window
|
| 285 |
+
)
|
| 286 |
+
|
| 287 |
+
self.val_predictions_strong.update(
|
| 288 |
+
scores_postprocessed
|
| 289 |
+
)
|
| 290 |
+
|
| 291 |
+
def on_validation_epoch_end(self):
|
| 292 |
+
gt_unique_events = set([e[2] for f, events in self.val_ground_truth.items() for e in events])
|
| 293 |
+
train_unique_events = set(self.encoder.labels)
|
| 294 |
+
# evaluate on all classes that are in both train and test sets (407 classes)
|
| 295 |
+
class_intersection = gt_unique_events.intersection(train_unique_events)
|
| 296 |
+
|
| 297 |
+
assert len(class_intersection) == len(set(as_strong_train_classes).intersection(as_strong_eval_classes)) == 407, \
|
| 298 |
+
f"Intersection unique events. Expected: {len(set(as_strong_train_classes).intersection(as_strong_eval_classes))}," \
|
| 299 |
+
f" Actual: {len(class_intersection)}"
|
| 300 |
+
|
| 301 |
+
# filter ground truth according to class_intersection
|
| 302 |
+
val_ground_truth = {fid: [event for event in self.val_ground_truth[fid] if event[2] in class_intersection]
|
| 303 |
+
for fid in self.val_ground_truth}
|
| 304 |
+
# drop audios without events - aligned with DESED evaluation procedure
|
| 305 |
+
val_ground_truth = {fid: events for fid, events in val_ground_truth.items() if len(events) > 0}
|
| 306 |
+
# keep only corresponding audio durations
|
| 307 |
+
audio_durations = {
|
| 308 |
+
fid: self.val_duration[fid] for fid in val_ground_truth.keys()
|
| 309 |
+
}
|
| 310 |
+
|
| 311 |
+
# filter files in predictions
|
| 312 |
+
as_strong_preds = {
|
| 313 |
+
fid: self.val_predictions_strong[fid] for fid in val_ground_truth.keys()
|
| 314 |
+
}
|
| 315 |
+
# filter classes in predictions
|
| 316 |
+
unused_classes = list(set(self.encoder.labels).difference(class_intersection))
|
| 317 |
+
for f, df in as_strong_preds.items():
|
| 318 |
+
df.drop(columns=list(unused_classes), axis=1, inplace=True)
|
| 319 |
+
|
| 320 |
+
segment_based_pauroc = sed_scores_eval.segment_based.auroc(
|
| 321 |
+
as_strong_preds,
|
| 322 |
+
val_ground_truth,
|
| 323 |
+
audio_durations,
|
| 324 |
+
max_fpr=0.1,
|
| 325 |
+
segment_length=1.0,
|
| 326 |
+
num_jobs=1
|
| 327 |
+
)
|
| 328 |
+
|
| 329 |
+
psds1 = sed_scores_eval.intersection_based.psds(
|
| 330 |
+
as_strong_preds,
|
| 331 |
+
val_ground_truth,
|
| 332 |
+
audio_durations,
|
| 333 |
+
dtc_threshold=0.7,
|
| 334 |
+
gtc_threshold=0.7,
|
| 335 |
+
cttc_threshold=None,
|
| 336 |
+
alpha_ct=0,
|
| 337 |
+
alpha_st=1,
|
| 338 |
+
num_jobs=1
|
| 339 |
+
)
|
| 340 |
+
|
| 341 |
+
# "val/psds1_macro_averaged" is psds1 without penalization for performance
|
| 342 |
+
# variations across classes
|
| 343 |
+
logs = {"val/loss": torch.as_tensor(self.val_loss).mean().cuda(),
|
| 344 |
+
"val/psds1": psds1[0],
|
| 345 |
+
"val/psds1_macro_averaged": np.array([v for k, v in psds1[1].items()]).mean(),
|
| 346 |
+
"val/pauroc": segment_based_pauroc[0]['mean'],
|
| 347 |
+
}
|
| 348 |
+
|
| 349 |
+
self.log_dict(logs, sync_dist=False)
|
| 350 |
+
self.val_predictions_strong = {}
|
| 351 |
+
self.val_ground_truth = {}
|
| 352 |
+
self.val_duration = {}
|
| 353 |
+
self.val_loss = []
|
| 354 |
+
|
| 355 |
+
|
| 356 |
+
def train(config):
|
| 357 |
+
# Train Models on temporally-strong portion of AudioSet.
|
| 358 |
+
|
| 359 |
+
# logging is done using wandb
|
| 360 |
+
wandb_logger = WandbLogger(
|
| 361 |
+
project="PTSED",
|
| 362 |
+
notes="Pre-Training Transformers for Sound Event Detection on AudioSet Strong.",
|
| 363 |
+
tags=["AudioSet Strong", "Sound Event Detection", "Pseudo Labels", "Knowledge Disitillation"],
|
| 364 |
+
config=config,
|
| 365 |
+
name=config.experiment_name
|
| 366 |
+
)
|
| 367 |
+
|
| 368 |
+
# encoder manages encoding and decoding of model predictions
|
| 369 |
+
encoder = ManyHotEncoder(as_strong_train_classes)
|
| 370 |
+
|
| 371 |
+
train_set = get_training_dataset(encoder, wavmix_p=config.wavmix_p,
|
| 372 |
+
pseudo_labels_file=config.pseudo_labels_file)
|
| 373 |
+
eval_set = get_eval_dataset(encoder)
|
| 374 |
+
|
| 375 |
+
if config.use_balanced_sampler:
|
| 376 |
+
sample_weights = get_temporal_count_balanced_sample_weights(train_set, save_folder="resources")
|
| 377 |
+
else:
|
| 378 |
+
sample_weights = get_uniform_sample_weights(train_set)
|
| 379 |
+
|
| 380 |
+
train_sampler = get_weighted_sampler(sample_weights, epoch_len=config.epoch_len)
|
| 381 |
+
|
| 382 |
+
# train dataloader
|
| 383 |
+
train_dl = DataLoader(dataset=train_set,
|
| 384 |
+
sampler=train_sampler,
|
| 385 |
+
worker_init_fn=worker_init_fn,
|
| 386 |
+
num_workers=config.num_workers,
|
| 387 |
+
batch_size=config.batch_size,
|
| 388 |
+
shuffle=False)
|
| 389 |
+
|
| 390 |
+
# eval dataloader
|
| 391 |
+
eval_dl = DataLoader(dataset=eval_set,
|
| 392 |
+
worker_init_fn=worker_init_fn,
|
| 393 |
+
num_workers=config.num_workers,
|
| 394 |
+
batch_size=config.batch_size)
|
| 395 |
+
|
| 396 |
+
# create PyTorch Lightning module
|
| 397 |
+
pl_module = PLModule(config, encoder)
|
| 398 |
+
|
| 399 |
+
# create the PyTorch Lightning trainer by specifying the number of epochs to train, the logger,
|
| 400 |
+
# on which kind of device(s) to train and possible callbacks
|
| 401 |
+
trainer = pl.Trainer(max_epochs=config.n_epochs,
|
| 402 |
+
logger=wandb_logger,
|
| 403 |
+
accelerator='auto',
|
| 404 |
+
devices=config.num_devices,
|
| 405 |
+
precision=config.precision,
|
| 406 |
+
num_sanity_val_steps=0,
|
| 407 |
+
check_val_every_n_epoch=config.check_val_every_n_epoch
|
| 408 |
+
)
|
| 409 |
+
|
| 410 |
+
# start training and validation for the specified number of epochs
|
| 411 |
+
trainer.fit(pl_module, train_dl, eval_dl)
|
| 412 |
+
|
| 413 |
+
wandb.finish()
|
| 414 |
+
|
| 415 |
+
|
| 416 |
+
def evaluate(config):
|
| 417 |
+
# only evaluation of pre-trained models
|
| 418 |
+
# encoder manages encoding and decoding of model predictions
|
| 419 |
+
encoder = ManyHotEncoder(as_strong_train_classes)
|
| 420 |
+
eval_set = get_eval_dataset(encoder)
|
| 421 |
+
|
| 422 |
+
# eval dataloader
|
| 423 |
+
eval_dl = DataLoader(dataset=eval_set,
|
| 424 |
+
worker_init_fn=worker_init_fn,
|
| 425 |
+
num_workers=config.num_workers,
|
| 426 |
+
batch_size=config.batch_size)
|
| 427 |
+
|
| 428 |
+
# create pytorch lightening module
|
| 429 |
+
pl_module = PLModule(config, encoder)
|
| 430 |
+
|
| 431 |
+
# create the pytorch lightening trainer by specifying the number of epochs to train, the logger,
|
| 432 |
+
# on which kind of device(s) to train and possible callbacks
|
| 433 |
+
trainer = pl.Trainer(max_epochs=config.n_epochs,
|
| 434 |
+
accelerator='auto',
|
| 435 |
+
devices=config.num_devices,
|
| 436 |
+
precision=config.precision,
|
| 437 |
+
num_sanity_val_steps=0,
|
| 438 |
+
check_val_every_n_epoch=config.check_val_every_n_epoch)
|
| 439 |
+
|
| 440 |
+
# start evaluation
|
| 441 |
+
trainer.validate(pl_module, eval_dl)
|
| 442 |
+
|
| 443 |
+
|
| 444 |
+
if __name__ == '__main__':
|
| 445 |
+
parser = argparse.ArgumentParser(description='Configuration Parser. ')
|
| 446 |
+
|
| 447 |
+
# general
|
| 448 |
+
parser.add_argument('--experiment_name', type=str, default="AudioSet_Strong")
|
| 449 |
+
parser.add_argument('--batch_size', type=int, default=256)
|
| 450 |
+
parser.add_argument('--num_workers', type=int, default=16)
|
| 451 |
+
parser.add_argument('--num_devices', type=int, default=1)
|
| 452 |
+
parser.add_argument('--precision', type=int, default=16)
|
| 453 |
+
parser.add_argument('--evaluate', action='store_true', default=False)
|
| 454 |
+
parser.add_argument('--check_val_every_n_epoch', type=int, default=5)
|
| 455 |
+
|
| 456 |
+
# model
|
| 457 |
+
parser.add_argument('--model_name', type=str,
|
| 458 |
+
choices=["ATST-F", "BEATs", "fpasst", "M2D", "ASIT"] + \
|
| 459 |
+
[f"frame_mn{width}" for width in ["06", "10"]],
|
| 460 |
+
default="ATST-F") # used also for training
|
| 461 |
+
# "scratch" = no pretraining
|
| 462 |
+
# "ssl" = SSL pre-trained
|
| 463 |
+
# "weak" = AudioSet Weak pre-trained
|
| 464 |
+
# "strong" = AudioSet Strong pre-trained
|
| 465 |
+
parser.add_argument('--pretrained', type=str, choices=["scratch", "ssl", "weak", "strong"],
|
| 466 |
+
default="weak")
|
| 467 |
+
parser.add_argument('--seq_model_type', type=str, choices=["rnn"],
|
| 468 |
+
default=None)
|
| 469 |
+
|
| 470 |
+
# training
|
| 471 |
+
parser.add_argument('--n_epochs', type=int, default=30)
|
| 472 |
+
parser.add_argument('--use_balanced_sampler', action='store_true', default=False)
|
| 473 |
+
parser.add_argument('--distillation_loss_weight', type=float, default=0.0)
|
| 474 |
+
parser.add_argument('--epoch_len', type=int, default=100000)
|
| 475 |
+
parser.add_argument('--median_window', type=int, default=9)
|
| 476 |
+
|
| 477 |
+
# augmentation
|
| 478 |
+
parser.add_argument('--wavmix_p', type=float, default=0.8)
|
| 479 |
+
parser.add_argument('--freq_warp_p', type=float, default=0.8)
|
| 480 |
+
parser.add_argument('--filter_augment_p', type=float, default=0.8)
|
| 481 |
+
parser.add_argument('--frame_shift_range', type=float, default=0.125) # in seconds
|
| 482 |
+
parser.add_argument('--mixup_p', type=float, default=0.3)
|
| 483 |
+
parser.add_argument('--mixstyle_p', type=float, default=0.3)
|
| 484 |
+
parser.add_argument('--max_time_mask_size', type=float, default=0.0)
|
| 485 |
+
|
| 486 |
+
# optimizer
|
| 487 |
+
parser.add_argument('--adamw', action='store_true', default=False)
|
| 488 |
+
parser.add_argument('--weight_decay', type=float, default=0.0)
|
| 489 |
+
|
| 490 |
+
# lr schedule
|
| 491 |
+
parser.add_argument('--schedule_mode', type=str, default="cos")
|
| 492 |
+
parser.add_argument('--max_lr', type=float, default=7e-5)
|
| 493 |
+
parser.add_argument('--lr_end', type=float, default=2e-7)
|
| 494 |
+
parser.add_argument('--warmup_steps', type=int, default=5000)
|
| 495 |
+
|
| 496 |
+
# knowledge distillation
|
| 497 |
+
parser.add_argument('--pseudo_labels_file', type=str,
|
| 498 |
+
default=None)
|
| 499 |
+
|
| 500 |
+
args = parser.parse_args()
|
| 501 |
+
if args.evaluate:
|
| 502 |
+
evaluate(args)
|
| 503 |
+
else:
|
| 504 |
+
train(args)
|
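Illustrative invocations (all flags are defined in the argument parser above; a pseudo-labels file is only needed for knowledge distillation):

    python ex_audioset_strong.py --model_name BEATs --pretrained weak --seq_model_type rnn --batch_size 64
    python ex_audioset_strong.py --model_name BEATs --pretrained strong --evaluate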
ex_dcase2016task2.py
ADDED
|
@@ -0,0 +1,517 @@
|
| 1 |
+
import argparse
|
| 2 |
+
import random
|
| 3 |
+
from pathlib import Path
|
| 4 |
+
from typing import Dict
|
| 5 |
+
|
| 6 |
+
import pytorch_lightning as pl
|
| 7 |
+
import torch
|
| 8 |
+
import torch.nn as nn
|
| 9 |
+
import transformers
|
| 10 |
+
from einops import rearrange
|
| 11 |
+
from pytorch_lightning.loggers import WandbLogger
|
| 12 |
+
from torch.utils.data import DataLoader
|
| 13 |
+
|
| 14 |
+
import wandb
|
| 15 |
+
from data_util.dcase2016task2 import (get_training_dataset, get_validation_dataset, get_test_dataset,
|
| 16 |
+
label_vocab_nlabels, label_vocab_as_dict)
|
| 17 |
+
from helpers.augment import frame_shift, time_mask, mixup, filter_augmentation, mixstyle, RandomResizeCrop
|
| 18 |
+
from helpers.score import get_events_for_all_files, combine_target_events, EventBasedScore, SegmentBasedScore
|
| 19 |
+
from helpers.utils import worker_init_fn
|
| 20 |
+
from models.asit.ASIT_wrapper import ASiTWrapper
|
| 21 |
+
from models.atstframe.ATSTF_wrapper import ATSTWrapper
|
| 22 |
+
from models.beats.BEATs_wrapper import BEATsWrapper
|
| 23 |
+
from models.frame_passt.fpasst_wrapper import FPaSSTWrapper
|
| 24 |
+
from models.m2d.M2D_wrapper import M2DWrapper
|
| 25 |
+
from models.prediction_wrapper import PredictionsWrapper
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
class PLModule(pl.LightningModule):
|
| 29 |
+
def __init__(self, config):
|
| 30 |
+
super().__init__()
|
| 31 |
+
self.config = config
|
| 32 |
+
|
| 33 |
+
if config.pretrained == "scratch":
|
| 34 |
+
checkpoint = None
|
| 35 |
+
elif config.pretrained == "ssl":
|
| 36 |
+
checkpoint = "ssl"
|
| 37 |
+
elif config.pretrained == "weak":
|
| 38 |
+
checkpoint = "weak"
|
| 39 |
+
elif config.pretrained == "strong":
|
| 40 |
+
checkpoint = "strong_1"
|
| 41 |
+
else:
|
| 42 |
+
raise ValueError(f"Unknown pretrained checkpoint: {config.pretrained}")
|
| 43 |
+
|
| 44 |
+
# load transformer model
|
| 45 |
+
if config.model_name == "BEATs":
|
| 46 |
+
beats = BEATsWrapper()
|
| 47 |
+
model = PredictionsWrapper(beats, checkpoint=f"BEATs_{checkpoint}" if checkpoint else None,
|
| 48 |
+
seq_model_type=config.seq_model_type,
|
| 49 |
+
n_classes_strong=self.config.n_classes)
|
| 50 |
+
elif config.model_name == "ATST-F":
|
| 51 |
+
atst = ATSTWrapper()
|
| 52 |
+
model = PredictionsWrapper(atst, checkpoint=f"ATST-F_{checkpoint}" if checkpoint else None,
|
| 53 |
+
seq_model_type=config.seq_model_type,
|
| 54 |
+
n_classes_strong=self.config.n_classes)
|
| 55 |
+
elif config.model_name == "fpasst":
|
| 56 |
+
fpasst = FPaSSTWrapper()
|
| 57 |
+
model = PredictionsWrapper(fpasst, checkpoint=f"fpasst_{checkpoint}" if checkpoint else None,
|
| 58 |
+
seq_model_type=config.seq_model_type,
|
| 59 |
+
n_classes_strong=self.config.n_classes)
|
| 60 |
+
elif config.model_name == "M2D":
|
| 61 |
+
m2d = M2DWrapper()
|
| 62 |
+
model = PredictionsWrapper(m2d, checkpoint=f"M2D_{checkpoint}" if checkpoint else None,
|
| 63 |
+
seq_model_type=config.seq_model_type,
|
| 64 |
+
n_classes_strong=self.config.n_classes,
|
| 65 |
+
embed_dim=m2d.m2d.cfg.feature_d)
|
| 66 |
+
elif config.model_name == "ASIT":
|
| 67 |
+
asit = ASiTWrapper()
|
| 68 |
+
model = PredictionsWrapper(asit, checkpoint=f"ASIT_{checkpoint}" if checkpoint else None,
|
| 69 |
+
seq_model_type=config.seq_model_type,
|
| 70 |
+
n_classes_strong=self.config.n_classes)
|
| 71 |
+
else:
|
| 72 |
+
raise NotImplementedError(f"Model {config.model_name} not (yet) implemented")
|
| 73 |
+
|
| 74 |
+
self.model = model
|
| 75 |
+
self.strong_loss = nn.BCEWithLogitsLoss()
|
| 76 |
+
|
| 77 |
+
self.freq_warp = RandomResizeCrop((1, 1.0), time_scale=(1.0, 1.0))
|
| 78 |
+
|
| 79 |
+
task_path = Path(self.config.task_path)
|
| 80 |
+
label_vocab, nlabels = label_vocab_nlabels(task_path)
|
| 81 |
+
self.label_to_idx = label_vocab_as_dict(label_vocab, key="label", value="idx")
|
| 82 |
+
|
| 83 |
+
self.idx_to_label: Dict[int, str] = {
|
| 84 |
+
idx: label for (label, idx) in self.label_to_idx.items()
|
| 85 |
+
}
|
| 86 |
+
|
| 87 |
+
self.event_onset_200ms_fms = EventBasedScore(
|
| 88 |
+
label_to_idx=self.label_to_idx,
|
| 89 |
+
name="event_onset_200ms_fms",
|
| 90 |
+
scores=("f_measure", "precision", "recall"),
|
| 91 |
+
params={"evaluate_onset": True, "evaluate_offset": False, "t_collar": 0.2}
|
| 92 |
+
)
|
| 93 |
+
|
| 94 |
+
self.event_onset_50ms_fms = EventBasedScore(
|
| 95 |
+
label_to_idx=self.label_to_idx,
|
| 96 |
+
name="event_onset_50ms_fms",
|
| 97 |
+
scores=("f_measure", "precision", "recall"),
|
| 98 |
+
params={"evaluate_onset": True, "evaluate_offset": False, "t_collar": 0.05}
|
| 99 |
+
)
|
| 100 |
+
|
| 101 |
+
self.segment_1s_er = SegmentBasedScore(
|
| 102 |
+
label_to_idx=self.label_to_idx,
|
| 103 |
+
name="segment_1s_er",
|
| 104 |
+
scores=("error_rate",),
|
| 105 |
+
params={"time_resolution": 1.0},
|
| 106 |
+
maximize=False,
|
| 107 |
+
)
|
| 108 |
+
|
| 109 |
+
self.postprocessing_grid = {
|
| 110 |
+
"median_filter_ms": [
|
| 111 |
+
250
|
| 112 |
+
],
|
| 113 |
+
"min_duration": [
|
| 114 |
+
125
|
| 115 |
+
]
|
| 116 |
+
}
|
| 117 |
+
|
| 118 |
+
self.preds, self.tgts, self.fnames, self.timestamps = [], [], [], []
|
| 119 |
+
|
| 120 |
+
def forward(self, audio):
|
| 121 |
+
mel = self.model.mel_forward(audio)
|
| 122 |
+
y_strong, _ = self.model(mel)
|
| 123 |
+
return y_strong
|
| 124 |
+
|
| 125 |
+
def separate_params(self):
|
| 126 |
+
pt_params = []
|
| 127 |
+
seq_params = []
|
| 128 |
+
head_params = []
|
| 129 |
+
|
| 130 |
+
for name, p in self.named_parameters():
|
| 131 |
+
name = name[len("model."):]
|
| 132 |
+
if name.startswith('model'):
|
| 133 |
+
# the transformer
|
| 134 |
+
pt_params.append(p)
|
| 135 |
+
elif name.startswith('seq_model'):
|
| 136 |
+
# the optional sequence model
|
| 137 |
+
seq_params.append(p)
|
| 138 |
+
elif name.startswith('strong_head') or name.startswith('weak_head'):
|
| 139 |
+
# the prediction head
|
| 140 |
+
head_params.append(p)
|
| 141 |
+
else:
|
| 142 |
+
raise ValueError(f"Unexpected key in model: {name}")
|
| 143 |
+
|
| 144 |
+
if self.model.has_separate_params():
|
| 145 |
+
# split parameters into groups according to their depth in the network
|
| 146 |
+
# based on this, we can apply layer-wise learning rate decay
|
| 147 |
+
pt_params = self.model.separate_params()
|
| 148 |
+
else:
|
| 149 |
+
if self.config.lr_decay != 1.0:
|
| 150 |
+
raise ValueError(f"Model has no separate_params function. Can't apply layer-wise lr decay, but "
|
| 151 |
+
f"learning rate decay is set to {self.config.lr_decay}.")
|
| 152 |
+
|
| 153 |
+
return pt_params, seq_params, head_params
|
| 154 |
+
|
| 155 |
+
def get_optimizer(
|
| 156 |
+
self,
|
| 157 |
+
lr,
|
| 158 |
+
lr_decay=1.0,
|
| 159 |
+
transformer_lr=None,
|
| 160 |
+
transformer_frozen=False,
|
| 161 |
+
adamw=False,
|
| 162 |
+
weight_decay=0.01,
|
| 163 |
+
betas=(0.9, 0.999)
|
| 164 |
+
):
|
| 165 |
+
pt_params, seq_params, head_params = self.separate_params()
|
| 166 |
+
|
| 167 |
+
param_groups = [
|
| 168 |
+
{'params': head_params, 'lr': lr}, # model head (besides base model and seq model)
|
| 169 |
+
]
|
| 170 |
+
|
| 171 |
+
if transformer_frozen:
|
| 172 |
+
for p in pt_params + seq_params:
|
| 173 |
+
if isinstance(p, list):
|
| 174 |
+
for p_i in p:
|
| 175 |
+
p_i.detach_()
|
| 176 |
+
else:
|
| 177 |
+
p.detach_()
|
| 178 |
+
else:
|
| 179 |
+
if transformer_lr is None:
|
| 180 |
+
transformer_lr = lr
|
| 181 |
+
if isinstance(pt_params, list) and isinstance(pt_params[0], list):
|
| 182 |
+
# apply lr decay
|
| 183 |
+
scale_lrs = [transformer_lr * (lr_decay ** i) for i in range(1, len(pt_params) + 1)]
|
| 184 |
+
param_groups = param_groups + [{"params": pt_params[i], "lr": scale_lrs[i]} for i in
|
| 185 |
+
range(len(pt_params))]
|
| 186 |
+
else:
|
| 187 |
+
param_groups.append(
|
| 188 |
+
{'params': pt_params, 'lr': transformer_lr}, # pretrained model
|
| 189 |
+
)
|
| 190 |
+
param_groups.append(
|
| 191 |
+
{'params': seq_params, 'lr': lr},  # sequence model
|
| 192 |
+
)
|
| 193 |
+
|
| 194 |
+
# do not apply weight decay to biases and batch norms
|
| 195 |
+
param_groups_split = []
|
| 196 |
+
for param_group in param_groups:
|
| 197 |
+
params_1D, params_2D = [], []
|
| 198 |
+
lr = param_group['lr']
|
| 199 |
+
for param in param_group['params']:
|
| 200 |
+
if param.ndimension() >= 2:
|
| 201 |
+
params_2D.append(param)
|
| 202 |
+
elif param.ndimension() <= 1:
|
| 203 |
+
params_1D.append(param)
|
| 204 |
+
param_groups_split += [{'params': params_2D, 'lr': lr, 'weight_decay': weight_decay},
|
| 205 |
+
{'params': params_1D, 'lr': lr}]
|
| 206 |
+
if weight_decay > 0:
|
| 207 |
+
assert adamw
|
| 208 |
+
if adamw:
|
| 209 |
+
print(f"\nUsing adamw weight_decay={weight_decay}!\n")
|
| 210 |
+
return torch.optim.AdamW(param_groups_split, lr=lr, weight_decay=weight_decay, betas=betas)
|
| 211 |
+
return torch.optim.Adam(param_groups_split, lr=lr, betas=betas)
|
| 212 |
+
|
| 213 |
+
def get_lr_scheduler(
|
| 214 |
+
self,
|
| 215 |
+
optimizer,
|
| 216 |
+
num_training_steps,
|
| 217 |
+
schedule_mode="cos",
|
| 218 |
+
gamma: float = 0.999996,
|
| 219 |
+
num_warmup_steps=4000,
|
| 220 |
+
lr_end=1e-7,
|
| 221 |
+
):
|
| 222 |
+
if schedule_mode in {"exp"}:
|
| 223 |
+
return torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma)
|
| 224 |
+
if schedule_mode in {"cosine", "cos"}:
|
| 225 |
+
return transformers.get_cosine_schedule_with_warmup(
|
| 226 |
+
optimizer,
|
| 227 |
+
num_warmup_steps=num_warmup_steps,
|
| 228 |
+
num_training_steps=num_training_steps,
|
| 229 |
+
)
|
| 230 |
+
if schedule_mode in {"linear"}:
|
| 231 |
+
print("Linear schedule!")
|
| 232 |
+
return transformers.get_polynomial_decay_schedule_with_warmup(
|
| 233 |
+
optimizer,
|
| 234 |
+
num_warmup_steps=num_warmup_steps,
|
| 235 |
+
num_training_steps=num_training_steps,
|
| 236 |
+
power=1.0,
|
| 237 |
+
lr_end=lr_end,
|
| 238 |
+
)
|
| 239 |
+
raise RuntimeError(f"schedule_mode={schedule_mode} Unknown.")
|
| 240 |
+
|
| 241 |
+
def configure_optimizers(self):
|
| 242 |
+
"""
|
| 243 |
+
This is the way PyTorch Lightning requires optimizers and learning rate schedulers to be defined.
|
| 244 |
+
The specified items are used automatically in the optimization loop (no need to call optimizer.step() yourself).
|
| 245 |
+
:return: dict containing optimizer and learning rate scheduler
|
| 246 |
+
"""
|
| 247 |
+
optimizer = self.get_optimizer(self.config.max_lr,
|
| 248 |
+
lr_decay=self.config.lr_decay,
|
| 249 |
+
transformer_lr=self.config.transformer_lr,
|
| 250 |
+
transformer_frozen=self.config.transformer_frozen,
|
| 251 |
+
adamw=False if self.config.no_adamw else True,
|
| 252 |
+
weight_decay=self.config.weight_decay)
|
| 253 |
+
|
| 254 |
+
num_training_steps = self.trainer.estimated_stepping_batches
|
| 255 |
+
|
| 256 |
+
scheduler = self.get_lr_scheduler(optimizer, num_training_steps,
|
| 257 |
+
schedule_mode=self.config.schedule_mode,
|
| 258 |
+
lr_end=self.config.lr_end)
|
| 259 |
+
lr_scheduler_config = {
|
| 260 |
+
"scheduler": scheduler,
|
| 261 |
+
"interval": "step",
|
| 262 |
+
"frequency": 1
|
| 263 |
+
}
|
| 264 |
+
return [optimizer], [lr_scheduler_config]
|
| 265 |
+
|
| 266 |
+
def training_step(self, train_batch, batch_idx):
|
| 267 |
+
"""
|
| 268 |
+
:param train_batch: contains one batch from train dataloader
|
| 269 |
+
:param batch_idx
|
| 270 |
+
:return: a dict containing at least loss that is used to update model parameters, can also contain
|
| 271 |
+
other items that can be processed in 'training_epoch_end' to log other metrics than loss
|
| 272 |
+
"""
|
| 273 |
+
|
| 274 |
+
audios, labels, fnames, timestamps = train_batch
|
| 275 |
+
|
| 276 |
+
if self.config.transformer_frozen:
|
| 277 |
+
self.model.model.eval()
|
| 278 |
+
self.model.seq_model.eval()
|
| 279 |
+
mel = self.model.mel_forward(audios)
|
| 280 |
+
|
| 281 |
+
# time rolling
|
| 282 |
+
if self.config.frame_shift_range > 0:
|
| 283 |
+
mel, labels = frame_shift(
|
| 284 |
+
mel,
|
| 285 |
+
labels,
|
| 286 |
+
shift_range=self.config.frame_shift_range
|
| 287 |
+
)
|
| 288 |
+
|
| 289 |
+
# mixup
|
| 290 |
+
if self.config.mixup_p > random.random():
|
| 291 |
+
mel, labels = mixup(
|
| 292 |
+
mel,
|
| 293 |
+
targets=labels
|
| 294 |
+
)
|
| 295 |
+
|
| 296 |
+
# mixstyle
|
| 297 |
+
if self.config.mixstyle_p > random.random():
|
| 298 |
+
mel = mixstyle(
|
| 299 |
+
mel
|
| 300 |
+
)
|
| 301 |
+
|
| 302 |
+
# time masking
|
| 303 |
+
if self.config.max_time_mask_size > 0:
|
| 304 |
+
mel, labels, pseudo_labels = time_mask(
|
| 305 |
+
mel,
|
| 306 |
+
labels,
|
| 307 |
+
max_mask_ratio=self.config.max_time_mask_size
|
| 308 |
+
)
|
| 309 |
+
|
| 310 |
+
# frequency masking
|
| 311 |
+
if self.config.filter_augment_p > random.random():
|
| 312 |
+
mel, _ = filter_augmentation(
|
| 313 |
+
mel
|
| 314 |
+
)
|
| 315 |
+
|
| 316 |
+
# frequency warping
|
| 317 |
+
if self.config.freq_warp_p > random.random():
|
| 318 |
+
mel = mel.squeeze(1)
|
| 319 |
+
mel = self.freq_warp(mel)
|
| 320 |
+
mel = mel.unsqueeze(1)
|
| 321 |
+
|
| 322 |
+
# forward through network; use strong head
|
| 323 |
+
y_hat_strong, _ = self.model(mel)
|
| 324 |
+
|
| 325 |
+
loss = self.strong_loss(y_hat_strong, labels)
|
| 326 |
+
|
| 327 |
+
# logging
|
| 328 |
+
self.log('epoch', self.current_epoch)
|
| 329 |
+
for i, param_group in enumerate(self.trainer.optimizers[0].param_groups):
|
| 330 |
+
self.log(f'trainer/lr_optimizer_{i}', param_group['lr'])
|
| 331 |
+
self.log("train/loss", loss.detach().cpu(), prog_bar=True)
|
| 332 |
+
|
| 333 |
+
return loss
|
| 334 |
+
|
| 335 |
+
def _score_step(self, batch):
|
| 336 |
+
audios, labels, fnames, timestamps = batch
|
| 337 |
+
|
| 338 |
+
strong_preds = self.forward(audios)
|
| 339 |
+
|
| 340 |
+
self.preds.append(strong_preds)
|
| 341 |
+
self.tgts.append(labels)
|
| 342 |
+
self.fnames.append(fnames)
|
| 343 |
+
self.timestamps.append(timestamps)
|
| 344 |
+
|
| 345 |
+
def _score_epoch_end(self, name="val"):
|
| 346 |
+
preds = torch.cat(self.preds)
|
| 347 |
+
tgts = torch.cat(self.tgts)
|
| 348 |
+
fnames = [item for sublist in self.fnames for item in sublist]
|
| 349 |
+
timestamps = torch.cat(self.timestamps)
|
| 350 |
+
val_loss = self.strong_loss(preds, tgts)
|
| 351 |
+
self.log(f"{name}/loss", val_loss, prog_bar=True)
|
| 352 |
+
|
| 353 |
+
# the following function expects one prediction per timestamp (sequence dimension must be flattened)
|
| 354 |
+
seq_len = preds.size(-1)
|
| 355 |
+
preds = rearrange(preds, 'bs c t -> (bs t) c').float()
|
| 356 |
+
timestamps = rearrange(timestamps, 'bs t -> (bs t)').float()
|
| 357 |
+
fnames = [fname for fname in fnames for _ in range(seq_len)]
|
| 358 |
+
|
| 359 |
+
predicted_events_by_postprocessing = get_events_for_all_files(
|
| 360 |
+
preds,
|
| 361 |
+
fnames,
|
| 362 |
+
timestamps,
|
| 363 |
+
self.idx_to_label,
|
| 364 |
+
self.postprocessing_grid
|
| 365 |
+
)
|
| 366 |
+
|
| 367 |
+
# we only have one postprocessing configuration (aligned with the HEAR challenge)
|
| 368 |
+
key = list(predicted_events_by_postprocessing.keys())[0]
|
| 369 |
+
predicted_events = predicted_events_by_postprocessing[key]
|
| 370 |
+
|
| 371 |
+
# load ground truth for test fold
|
| 372 |
+
task_path = Path(self.config.task_path)
|
| 373 |
+
test_target_events = combine_target_events(["valid" if name == "val" else "test"], task_path)
|
| 374 |
+
onset_fms = self.event_onset_200ms_fms(predicted_events, test_target_events)
|
| 375 |
+
onset_fms_50 = self.event_onset_50ms_fms(predicted_events, test_target_events)
|
| 376 |
+
segment_1s_er = self.segment_1s_er(predicted_events, test_target_events)
|
| 377 |
+
|
| 378 |
+
self.log(f"{name}/onset_fms", onset_fms[0][1])
|
| 379 |
+
self.log(f"{name}/onset_fms_50", onset_fms_50[0][1])
|
| 380 |
+
self.log(f"{name}/segment_1s_er", segment_1s_er[0][1])
|
| 381 |
+
|
| 382 |
+
# free buffers
|
| 383 |
+
self.preds, self.tgts, self.fnames, self.timestamps = [], [], [], []
|
| 384 |
+
|
| 385 |
+
def validation_step(self, batch, batch_idx):
|
| 386 |
+
self._score_step(batch)
|
| 387 |
+
|
| 388 |
+
def on_validation_epoch_end(self):
|
| 389 |
+
self._score_epoch_end(name="val")
|
| 390 |
+
|
| 391 |
+
def test_step(self, batch, batch_idx):
|
| 392 |
+
self._score_step(batch)
|
| 393 |
+
|
| 394 |
+
def on_test_epoch_end(self):
|
| 395 |
+
self._score_epoch_end(name="test")
|
| 396 |
+
|
| 397 |
+
|
| 398 |
+
def train(config):
|
| 399 |
+
# Example for fine-tuning pre-trained transformers on a downstream task.
|
| 400 |
+
|
| 401 |
+
# logging is done using wandb
|
| 402 |
+
wandb_logger = WandbLogger(
|
| 403 |
+
project="PTSED",
|
| 404 |
+
notes="Downstream Training on office sound event detection.",
|
| 405 |
+
tags=["DCASE 2016 Task 2", "Sound Event Detection"],
|
| 406 |
+
config=config,
|
| 407 |
+
name=config.experiment_name
|
| 408 |
+
)
|
| 409 |
+
|
| 410 |
+
train_set = get_training_dataset(config.task_path, wavmix_p=config.wavmix_p)
|
| 411 |
+
val_ds = get_validation_dataset(config.task_path)
|
| 412 |
+
test_ds = get_test_dataset(config.task_path)
|
| 413 |
+
|
| 414 |
+
# train dataloader
|
| 415 |
+
train_dl = DataLoader(dataset=train_set,
|
| 416 |
+
worker_init_fn=worker_init_fn,
|
| 417 |
+
num_workers=config.num_workers,
|
| 418 |
+
batch_size=config.batch_size,
|
| 419 |
+
shuffle=True)
|
| 420 |
+
|
| 421 |
+
# validation dataloader
|
| 422 |
+
valid_dl = DataLoader(dataset=val_ds,
|
| 423 |
+
worker_init_fn=worker_init_fn,
|
| 424 |
+
num_workers=config.num_workers,
|
| 425 |
+
batch_size=config.batch_size,
|
| 426 |
+
shuffle=False,
|
| 427 |
+
drop_last=False)
|
| 428 |
+
|
| 429 |
+
# test dataloader
|
| 430 |
+
test_dl = DataLoader(dataset=test_ds,
|
| 431 |
+
worker_init_fn=worker_init_fn,
|
| 432 |
+
num_workers=config.num_workers,
|
| 433 |
+
batch_size=config.batch_size,
|
| 434 |
+
shuffle=False,
|
| 435 |
+
drop_last=False)
|
| 436 |
+
|
| 437 |
+
# create PyTorch Lightning module
|
| 438 |
+
pl_module = PLModule(config)
|
| 439 |
+
|
| 440 |
+
# create the PyTorch Lightning trainer by specifying the number of epochs to train, the logger,
|
| 441 |
+
# on which kind of device(s) to train and possible callbacks
|
| 442 |
+
trainer = pl.Trainer(max_epochs=config.n_epochs,
|
| 443 |
+
logger=wandb_logger,
|
| 444 |
+
accelerator='auto',
|
| 445 |
+
devices=config.num_devices,
|
| 446 |
+
precision=config.precision,
|
| 447 |
+
num_sanity_val_steps=0,
|
| 448 |
+
check_val_every_n_epoch=config.check_val_every_n_epoch
|
| 449 |
+
)
|
| 450 |
+
|
| 451 |
+
# start training and validation for the specified number of epochs
|
| 452 |
+
trainer.fit(
|
| 453 |
+
pl_module,
|
| 454 |
+
train_dataloaders=train_dl,
|
| 455 |
+
val_dataloaders=valid_dl,
|
| 456 |
+
)
|
| 457 |
+
|
| 458 |
+
test_results = trainer.test(pl_module, dataloaders=test_dl)
|
| 459 |
+
print(test_results)
|
| 460 |
+
wandb.finish()
|
| 461 |
+
|
| 462 |
+
|
| 463 |
+
if __name__ == '__main__':
|
| 464 |
+
parser = argparse.ArgumentParser(description='Configuration Parser. ')
|
| 465 |
+
|
| 466 |
+
# general
|
| 467 |
+
parser.add_argument('--task_path', type=str, required=True)
|
| 468 |
+
parser.add_argument('--experiment_name', type=str, default="DCASE2016Task2")
|
| 469 |
+
parser.add_argument('--batch_size', type=int, default=256)
|
| 470 |
+
parser.add_argument('--num_workers', type=int, default=16)
|
| 471 |
+
parser.add_argument('--num_devices', type=int, default=1)
|
| 472 |
+
parser.add_argument('--precision', type=int, default=16)
|
| 473 |
+
parser.add_argument('--check_val_every_n_epoch', type=int, default=10)
|
| 474 |
+
|
| 475 |
+
# model
|
| 476 |
+
parser.add_argument('--model_name', type=str,
|
| 477 |
+
choices=["ATST-F", "BEATs", "fpasst", "M2D", "ASIT"],
|
| 478 |
+
default="ATST-F") # used also for training
|
| 479 |
+
# "scratch" = no pretraining
|
| 480 |
+
# "ssl" = SSL pre-trained
|
| 481 |
+
# "weak" = AudioSet Weak pre-trained
|
| 482 |
+
# "strong" = AudioSet Strong pre-trained
|
| 483 |
+
parser.add_argument('--pretrained', type=str, choices=["scratch", "ssl", "weak", "strong"],
|
| 484 |
+
default="strong")
|
| 485 |
+
parser.add_argument('--seq_model_type', type=str, choices=["rnn"],
|
| 486 |
+
default=None)
|
| 487 |
+
parser.add_argument('--n_classes', type=int, default=11)
|
| 488 |
+
|
| 489 |
+
# training
|
| 490 |
+
parser.add_argument('--n_epochs', type=int, default=300)
|
| 491 |
+
|
| 492 |
+
# augmentation
|
| 493 |
+
parser.add_argument('--wavmix_p', type=float, default=0.5)
|
| 494 |
+
parser.add_argument('--freq_warp_p', type=float, default=0.0)
|
| 495 |
+
parser.add_argument('--filter_augment_p', type=float, default=0.0)
|
| 496 |
+
parser.add_argument('--frame_shift_range', type=float, default=0.0) # in seconds
|
| 497 |
+
parser.add_argument('--mixup_p', type=float, default=0.5)
|
| 498 |
+
parser.add_argument('--mixstyle_p', type=float, default=0.0)
|
| 499 |
+
parser.add_argument('--max_time_mask_size', type=float, default=0.0)
|
| 500 |
+
|
| 501 |
+
# optimizer
|
| 502 |
+
parser.add_argument('--no_adamw', action='store_true', default=False)
|
| 503 |
+
parser.add_argument('--weight_decay', type=float, default=0.001)
|
| 504 |
+
parser.add_argument('--transformer_frozen', action='store_true', dest='transformer_frozen',
|
| 505 |
+
default=False,
|
| 506 |
+
help='Disable training for the transformer.')
|
| 507 |
+
|
| 508 |
+
# lr schedule
|
| 509 |
+
parser.add_argument('--schedule_mode', type=str, default="cos")
|
| 510 |
+
parser.add_argument('--max_lr', type=float, default=1.06e-4)
|
| 511 |
+
parser.add_argument('--transformer_lr', type=float, default=None)
|
| 512 |
+
parser.add_argument('--lr_decay', type=float, default=1.0)
|
| 513 |
+
parser.add_argument('--lr_end', type=float, default=1e-7)
|
| 514 |
+
parser.add_argument('--warmup_steps', type=int, default=100)
|
| 515 |
+
|
| 516 |
+
args = parser.parse_args()
|
| 517 |
+
train(args)
|
helpers/augment.py
ADDED
|
@@ -0,0 +1,225 @@
import torch
import random
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions.beta import Beta


def frame_shift(mels, labels, embeddings=None, pseudo_labels=None,
                net_pooling=4, shift_range=0.125):
    bsz, channels, n_bands, frames = mels.shape
    abs_shift_mel = int(frames * shift_range)

    if embeddings is not None:
        embed_frames = embeddings.shape[-1]
        embed_pool_fact = frames / embed_frames

    for bindx in range(bsz):
        shift = int(random.gauss(0, abs_shift_mel))
        mels[bindx] = torch.roll(mels[bindx], shift, dims=-1)
        label_shift = -abs(shift) / net_pooling if shift < 0 else shift / net_pooling
        label_shift = round(label_shift)
        labels[bindx] = torch.roll(labels[bindx], label_shift, dims=-1)

        if pseudo_labels is not None:
            pseudo_labels[bindx] = torch.roll(pseudo_labels[bindx], label_shift, dims=-1)

        if embeddings is not None:
            embed_shift = -abs(shift) / embed_pool_fact if shift < 0 else shift / embed_pool_fact
            embed_shift = round(embed_shift)
            embeddings[bindx] = torch.roll(embeddings[bindx], embed_shift, dims=-1)

    out_args = [mels]
    if embeddings is not None:
        out_args.append(embeddings)
    out_args.append(labels)
    if pseudo_labels is not None:
        out_args.append(pseudo_labels)
    return tuple(out_args)


def time_mask(features, labels, embeddings=None, pseudo_labels=None, net_pooling=4,
              min_mask_ratio=0.05, max_mask_ratio=0.2):
    _, _, n_frame = labels.shape

    if embeddings is not None:
        embed_frames = embeddings.shape[-1]
        embed_pool_fact = embed_frames / n_frame

    t_width = torch.randint(low=int(n_frame * min_mask_ratio), high=int(n_frame * max_mask_ratio), size=(1,))
    t_low = torch.randint(low=0, high=n_frame - t_width[0], size=(1,))
    features[:, :, :, t_low * net_pooling:(t_low + t_width) * net_pooling] = 0
    labels[:, :, t_low:t_low + t_width] = 0

    if pseudo_labels is not None:
        # mask the pseudo labels over the same region
        pseudo_labels[:, :, t_low:t_low + t_width] = 0

    if embeddings is not None:
        low = round((t_low * embed_pool_fact).item())
        high = round(((t_low + t_width) * embed_pool_fact).item())
        embeddings[..., low:high] = 0

    out_args = [features]

    if embeddings is not None:
        out_args.append(embeddings)
    out_args.append(labels)
    if pseudo_labels is not None:
        out_args.append(pseudo_labels)
    return tuple(out_args)


def mixup(data, embeddings=None, targets=None, pseudo_strong=None, alpha=0.2, beta=0.2, return_mix_coef=False):
    with torch.no_grad():
        batch_size = data.size(0)
        c = np.random.beta(alpha, beta, size=batch_size)
        c = np.maximum(c, 1 - c)

        perm = torch.randperm(batch_size)
        cd = torch.tensor(c, dtype=data.dtype, device=data.device).view(batch_size, *([1] * (data.ndim - 1)))
        mixed_data = cd * data + (1 - cd) * data[perm, :]

        if embeddings is not None:
            ce = torch.tensor(c, dtype=embeddings.dtype, device=embeddings.device).view(batch_size, *([1] * (embeddings.ndim - 1)))
            mixed_embeddings = ce * embeddings + (1 - ce) * embeddings[perm, :]

        if targets is not None:
            ct = torch.tensor(c, dtype=data.dtype, device=data.device).view(batch_size, *([1] * (targets.ndim - 1)))
            mixed_target = torch.clamp(
                ct * targets + (1 - ct) * targets[perm, :], min=0, max=1
            )

        if pseudo_strong is not None:
            cp = torch.tensor(c, dtype=pseudo_strong.dtype, device=pseudo_strong.device).view(batch_size,
                                                                                              *([1] * (pseudo_strong.ndim - 1)))
            mixed_pseudo_strong = cp * pseudo_strong + (1 - cp) * pseudo_strong[perm, :]

        out_args = [mixed_data]
        if embeddings is not None:
            out_args.append(mixed_embeddings)
        if targets is not None:
            out_args.append(mixed_target)
        if pseudo_strong is not None:
            out_args.append(mixed_pseudo_strong)

        if return_mix_coef:
            out_args.append(perm)
            out_args.append(c)
        return tuple(out_args)


def filt_aug_(features, db_range=(-6, 6), n_band=(3, 6), min_bw=6):
    batch_size, channels, n_freq_bin, _ = features.shape
    n_freq_band = torch.randint(low=n_band[0], high=n_band[1], size=(1,)).item()  # [low, high)
    if n_freq_band > 1:
        while n_freq_bin - n_freq_band * min_bw + 1 < 0:
            min_bw -= 1
        band_bndry_freqs = torch.sort(torch.randint(0, n_freq_bin - n_freq_band * min_bw + 1,
                                                    (n_freq_band - 1,)))[0] + \
                           torch.arange(1, n_freq_band) * min_bw
        band_bndry_freqs = torch.cat((torch.tensor([0]), band_bndry_freqs, torch.tensor([n_freq_bin])))

        band_factors = torch.rand((batch_size, n_freq_band + 1)).to(features) * (db_range[1] - db_range[0]) + db_range[0]
        freq_filt = torch.ones((batch_size, n_freq_bin, 1)).to(features)
        for i in range(n_freq_band):
            for j in range(batch_size):
                freq_filt[j, band_bndry_freqs[i]:band_bndry_freqs[i + 1], :] = \
                    torch.linspace(band_factors[j, i], band_factors[j, i + 1],
                                   band_bndry_freqs[i + 1] - band_bndry_freqs[i]).unsqueeze(-1)
        freq_filt = 10 ** (freq_filt / 20)
        return features * freq_filt.unsqueeze(1)
    else:
        return features


def filter_augmentation(features, n_transform=1, filter_db_range=(-6, 6),
                        filter_bands=(3, 6), filter_minimum_bandwidth=6):
    if n_transform == 2:
        feature_list = []
        for _ in range(n_transform):
            features_temp = features
            features_temp = filt_aug_(features_temp, db_range=filter_db_range, n_band=filter_bands,
                                      min_bw=filter_minimum_bandwidth)
            feature_list.append(features_temp)
        return feature_list
    elif n_transform == 1:
        features = filt_aug_(features, db_range=filter_db_range, n_band=filter_bands,
                             min_bw=filter_minimum_bandwidth)
        return [features, features]
    else:
        return [features, features]


def mixstyle(x, alpha=0.4, eps=1e-6):
    batch_size = x.size(0)

    # frequency-wise statistics
    f_mu = x.mean(dim=3, keepdim=True)
    f_var = x.var(dim=3, keepdim=True)

    f_sig = (f_var + eps).sqrt()  # compute instance standard deviation
    f_mu, f_sig = f_mu.detach(), f_sig.detach()  # block gradients
    x_normed = (x - f_mu) / f_sig  # normalize input
    lmda = Beta(alpha, alpha).sample((batch_size, 1, 1, 1)).to(x.device, dtype=x.dtype)  # sample instance-wise convex weights
    lmda = torch.max(lmda, 1 - lmda)
    perm = torch.randperm(batch_size).to(x.device)  # generate shuffling indices
    f_mu_perm, f_sig_perm = f_mu[perm], f_sig[perm]  # shuffling
    mu_mix = f_mu * lmda + f_mu_perm * (1 - lmda)  # generate mixed mean
    sig_mix = f_sig * lmda + f_sig_perm * (1 - lmda)  # generate mixed standard deviation
    x = x_normed * sig_mix + mu_mix  # denormalize input using the mixed statistics
    return x


class RandomResizeCrop(nn.Module):
    """Random Resize Crop block.

    Args:
        virtual_crop_scale: Virtual crop area `(F ratio, T ratio)` in ratio to input size.
        freq_scale: Random frequency range `(min, max)`.
        time_scale: Random time frame range `(min, max)`.
    """

    def __init__(self, virtual_crop_scale=(1.0, 1.5), freq_scale=(0.6, 1.0), time_scale=(0.6, 1.5)):
        super().__init__()
        self.virtual_crop_scale = virtual_crop_scale
        self.freq_scale = freq_scale
        self.time_scale = time_scale
        self.interpolation = 'bicubic'
        assert time_scale[1] >= 1.0 and freq_scale[1] >= 1.0

    @staticmethod
    def get_params(virtual_crop_size, in_size, time_scale, freq_scale):
        canvas_h, canvas_w = virtual_crop_size
        src_h, src_w = in_size
        h = np.clip(int(np.random.uniform(*freq_scale) * src_h), 1, canvas_h)
        w = np.clip(int(np.random.uniform(*time_scale) * src_w), 1, canvas_w)
        i = random.randint(0, canvas_h - h) if canvas_h > h else 0
        j = random.randint(0, canvas_w - w) if canvas_w > w else 0
        return i, j, h, w

    def forward(self, lms):
        # create an empty virtual crop area and copy the input log-mel spectrogram to its center
        virtual_crop_size = [int(s * c) for s, c in zip(lms.shape[-2:], self.virtual_crop_scale)]
        virtual_crop_area = (torch.zeros((lms.shape[0], virtual_crop_size[0], virtual_crop_size[1]))
                             .to(torch.float).to(lms.device))
        _, lh, lw = virtual_crop_area.shape
        c, h, w = lms.shape
        x, y = (lw - w) // 2, (lh - h) // 2
        virtual_crop_area[:, y:y + h, x:x + w] = lms
        # get random area
        i, j, h, w = self.get_params(virtual_crop_area.shape[-2:], lms.shape[-2:], self.time_scale, self.freq_scale)
        crop = virtual_crop_area[:, i:i + h, j:j + w]
        # resize the random crop back to the original spectrogram size
        lms = F.interpolate(crop.unsqueeze(1), size=lms.shape[-2:],
                            mode=self.interpolation, align_corners=True).squeeze(1)
        return lms.float()

    def __repr__(self):
        format_string = self.__class__.__name__ + f'(virtual_crop_size={self.virtual_crop_scale}'
        format_string += ', time_scale={0}'.format(tuple(round(s, 4) for s in self.time_scale))
        format_string += ', freq_scale={0})'.format(tuple(round(r, 4) for r in self.freq_scale))
        return format_string
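The augmentations above operate directly on batched log-mel tensors and their frame-level targets. A minimal sketch of chaining them on a dummy batch; all shapes, the pooling factor, and the probabilities are illustrative assumptions, not values prescribed by this repository:

```python
import torch

from helpers.augment import frame_shift, mixup, time_mask, mixstyle

# hypothetical batch: 8 clips, 1 channel, 128 mel bands, 1000 spectrogram frames,
# with frame-level targets pooled by a factor of 4 (250 label frames, 11 classes)
mels = torch.randn(8, 1, 128, 1000)
labels = torch.randint(0, 2, (8, 11, 250)).float()

mels, labels = frame_shift(mels, labels, net_pooling=4, shift_range=0.125)
mels, labels = mixup(mels, targets=labels, alpha=0.2, beta=0.2)
mels, labels = time_mask(mels, labels, net_pooling=4, min_mask_ratio=0.05, max_mask_ratio=0.2)
mels = mixstyle(mels, alpha=0.4)
```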
helpers/decode.py
ADDED
|
@@ -0,0 +1,72 @@
"""
Code from:
https://github.com/DCASE-REPO/DESED_task
"""

from pathlib import Path

import numpy as np
import pandas as pd
import scipy
from sed_scores_eval.base_modules.scores import create_score_dataframe


def batched_decode_preds(
    strong_preds,
    filenames,
    encoder,
    thresholds=[0.5],
    median_filter=None,
    pad_indx=None,
):
    """Decode a batch of predictions to dataframes. Each threshold gives a different dataframe,
    stored in a dictionary.

    Args:
        strong_preds: torch.Tensor, batch of strong predictions.
        filenames: list, the list of filenames of the current batch.
        encoder: ManyHotEncoder object, object used to decode predictions.
        thresholds: list, the list of thresholds to be used for predictions.
        median_filter: int, the number of frames for which to apply median window (smoothing).
        pad_indx: list, the list of indexes which have been used for padding.

    Returns:
        dict of predictions, where each key is a threshold and the value is the DataFrame of predictions.
    """
    # Init a dataframe per threshold
    scores_raw = {}
    scores_postprocessed = {}
    prediction_dfs = {}
    for threshold in thresholds:
        prediction_dfs[threshold] = pd.DataFrame()

    for j in range(strong_preds.shape[0]):  # over batches
        audio_id = Path(filenames[j]).stem
        filename = audio_id + ".wav"
        c_scores = strong_preds[j]
        if pad_indx is not None:
            true_len = int(c_scores.shape[-1] * pad_indx[j].item())
            c_scores = c_scores[:true_len]
        c_scores = c_scores.transpose(0, 1).detach().cpu().numpy()
        scores_raw[audio_id] = create_score_dataframe(
            scores=c_scores,
            timestamps=encoder._frame_to_time(np.arange(len(c_scores) + 1)),
            event_classes=encoder.labels,
        )
        if median_filter is not None:
            c_scores = scipy.ndimage.filters.median_filter(c_scores, (median_filter, 1))
        scores_postprocessed[audio_id] = create_score_dataframe(
            scores=c_scores,
            timestamps=encoder._frame_to_time(np.arange(len(c_scores) + 1)),
            event_classes=encoder.labels,
        )
        for c_th in thresholds:
            pred = c_scores > c_th
            pred = encoder.decode_strong(pred)
            pred = pd.DataFrame(pred, columns=["event_label", "onset", "offset"])
            pred["filename"] = filename
            prediction_dfs[c_th] = pd.concat(
                [prediction_dfs[c_th], pred], ignore_index=True
            )

    return scores_raw, scores_postprocessed, prediction_dfs
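A hedged usage sketch of the decoding step, tying it to the `ManyHotEncoder` defined in `helpers/encode.py`; the scores are dummy tensors and the class and file names are hypothetical:

```python
import torch

from helpers.decode import batched_decode_preds
from helpers.encode import ManyHotEncoder

encoder = ManyHotEncoder(["Speech", "Dog"], audio_len=10, frame_hop=160, net_pooling=4, fs=16000)

# dummy strong predictions: (batch=1, n_classes=2, n_frames=encoder.n_frames)
preds = torch.sigmoid(torch.randn(1, 2, encoder.n_frames))

scores_raw, scores_post, dfs = batched_decode_preds(
    preds, ["example_clip.wav"], encoder, thresholds=[0.5], median_filter=7
)
print(dfs[0.5])  # DataFrame with event_label, onset, offset, filename
```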
helpers/encode.py
ADDED
|
@@ -0,0 +1,230 @@
"""
Code from:
https://github.com/DCASE-REPO/DESED_task
"""

from collections import OrderedDict
import numpy as np
import pandas as pd

from dcase_util.data import DecisionEncoder


class ManyHotEncoder:
    """
    Adapted after the DecisionEncoder.find_contiguous_regions method in
    https://github.com/DCASE-REPO/dcase_util/blob/master/dcase_util/data/decisions.py

    Encode labels into numpy arrays where 1 corresponds to presence of the class and 0 to absence.
    Multiple 1s can appear on the same line; this is the multi-label case.
    Args:
        labels: list, the classes which will be encoded
        n_frames: int, (Default value = None) only useful for strong labels. The number of frames of a segment.
    Attributes:
        labels: list, the classes which will be encoded
        n_frames: int, only useful for strong labels. The number of frames of a segment.
    """

    def __init__(
        self, labels, audio_len=10, frame_hop=160, net_pooling=4, fs=16000
    ):
        if type(labels) in [np.ndarray, np.array]:
            labels = labels.tolist()
        elif isinstance(labels, (dict, OrderedDict)):
            labels = list(labels.keys())
        self.labels = labels
        self.audio_len = audio_len
        self.frame_hop = frame_hop
        self.fs = fs
        self.net_pooling = net_pooling
        n_frames = self.audio_len * self.fs
        self.n_frames = int(int((n_frames / self.frame_hop)) / self.net_pooling)

    def encode_weak(self, labels):
        """ Encode a list of weak labels into a numpy array

        Args:
            labels: list, list of labels to encode (to a vector of 0 and 1)

        Returns:
            numpy.array
            A vector containing 1 for each label, and 0 everywhere else
        """
        # useful for tensor empty labels
        if type(labels) is str:
            if labels == "empty":
                y = np.zeros(len(self.labels)) - 1
                return y
            else:
                labels = labels.split(",")
        if type(labels) is pd.DataFrame:
            if labels.empty:
                labels = []
            elif "event_label" in labels.columns:
                labels = labels["event_label"]
        y = np.zeros(len(self.labels))
        for label in labels:
            if not pd.isna(label):
                i = self.labels.index(label)
                y[i] = 1
        return y

    def _time_to_frame(self, time):
        samples = time * self.fs
        frame = (samples) / self.frame_hop
        return np.clip(frame / self.net_pooling, a_min=0, a_max=self.n_frames)

    def _frame_to_time(self, frame):
        frame = frame * self.net_pooling / (self.fs / self.frame_hop)
        return np.clip(frame, a_min=0, a_max=self.audio_len)

    def encode_strong_df(self, label_df):
        """Encode a list (or pandas DataFrame or Series) of strong labels; they correspond to a given filename

        Args:
            label_df: pandas DataFrame or Series, contains filename, onset (in frames) and offset (in frames)
                If only a filename (no onset, offset) is specified, it will return the event on all the frames.
                onset and offset should be in frames
        Returns:
            numpy.array
            Encoded labels, 1 where the label is present, 0 otherwise
        """

        assert any(
            [x is not None for x in [self.audio_len, self.frame_hop]]
        )

        samples_len = self.n_frames
        if type(label_df) is str:
            if label_df == "empty":
                y = np.zeros((samples_len, len(self.labels))) - 1
                return y
        y = np.zeros((samples_len, len(self.labels)))
        if type(label_df) is pd.DataFrame:
            if {"onset", "offset", "event_label"}.issubset(label_df.columns):
                for _, row in label_df.iterrows():
                    if not pd.isna(row["event_label"]):
                        i = self.labels.index(row["event_label"])
                        onset = int(self._time_to_frame(row["onset"]))
                        offset = int(np.ceil(self._time_to_frame(row["offset"])))
                        if "confidence" in label_df.columns:
                            y[onset:offset, i] = row["confidence"]  # support confidence
                        else:
                            y[
                                onset:offset, i
                            ] = 1  # means offset not included (hypothesis of overlapping frames, so ok)

        elif type(label_df) in [
            pd.Series,
            list,
            np.ndarray,
        ]:  # list of lists or list of strings
            if type(label_df) is pd.Series:
                if {"onset", "offset", "event_label"}.issubset(
                    label_df.index
                ):  # means only one value
                    if not pd.isna(label_df["event_label"]):
                        i = self.labels.index(label_df["event_label"])
                        onset = int(self._time_to_frame(label_df["onset"]))
                        offset = int(np.ceil(self._time_to_frame(label_df["offset"])))

                        if "confidence" in label_df.columns:
                            y[onset:offset, i] = label_df["confidence"]
                        else:
                            y[onset:offset, i] = 1
                    return y

            for event_label in label_df:
                # List of strings, so weak labels to be encoded as strong
                if type(event_label) is str:
                    if event_label != "":
                        i = self.labels.index(event_label)
                        y[:, i] = 1

                # List of lists, with [label, onset, offset]
                elif len(event_label) == 3:
                    if event_label[0] != "":
                        i = self.labels.index(event_label[0])
                        onset = int(self._time_to_frame(event_label[1]))
                        offset = int(np.ceil(self._time_to_frame(event_label[2])))
                        y[onset:offset, i] = 1
                # List of lists, with [label, onset, offset, confidence]
                elif len(event_label) == 4:
                    if event_label[0] != "":
                        i = self.labels.index(event_label[0])
                        onset = int(self._time_to_frame(event_label[1]))
                        offset = int(np.ceil(self._time_to_frame(event_label[2])))
                        y[onset:offset, i] = event_label[3]

                else:
                    raise NotImplementedError(
                        "cannot encode strong, type mismatch: {}".format(
                            type(event_label)
                        )
                    )

        else:
            raise NotImplementedError(
                "To encode_strong, type is pandas.DataFrame with onset, offset and event_label "
                "columns, or it is a list or pandas Series of event labels, "
                "type given: {}".format(type(label_df))
            )
        return y

    def decode_weak(self, labels):
        """ Decode the encoded weak labels
        Args:
            labels: numpy.array, the encoded labels to be decoded

        Returns:
            list
            Decoded labels, list of strings

        """
        result_labels = []
        for i, value in enumerate(labels):
            if value == 1:
                result_labels.append(self.labels[i])
        return result_labels

    def decode_strong(self, labels):
        """ Decode the encoded strong labels
        Args:
            labels: numpy.array, the encoded labels to be decoded
        Returns:
            list
            Decoded labels, list of lists: [[label, onset, offset], ...]

        """
        result_labels = []
        for i, label_column in enumerate(labels.T):
            change_indices = DecisionEncoder().find_contiguous_regions(label_column)

            # append [label, onset, offset] to the result list
            for row in change_indices:
                result_labels.append(
                    [
                        self.labels[i],
                        self._frame_to_time(row[0]),
                        self._frame_to_time(row[1]),
                    ]
                )
        return result_labels

    def state_dict(self):
        return {
            "labels": self.labels,
            "audio_len": self.audio_len,
            "frame_hop": self.frame_hop,
            "net_pooling": self.net_pooling,
            "fs": self.fs,
        }

    @classmethod
    def load_state_dict(cls, state_dict):
        labels = state_dict["labels"]
        audio_len = state_dict["audio_len"]
        frame_hop = state_dict["frame_hop"]
        net_pooling = state_dict["net_pooling"]
        fs = state_dict["fs"]
        return cls(labels, audio_len, frame_hop, net_pooling, fs)
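A small round-trip sketch (made-up classes and event times) showing how strong labels are rasterised onto the pooled frame grid and decoded back into events:

```python
import pandas as pd

from helpers.encode import ManyHotEncoder

encoder = ManyHotEncoder(["Speech", "Dog"], audio_len=10, frame_hop=160, net_pooling=4, fs=16000)

df = pd.DataFrame([
    {"event_label": "Speech", "onset": 0.5, "offset": 2.0},
    {"event_label": "Dog", "onset": 3.0, "offset": 3.4},
])
grid = encoder.encode_strong_df(df)         # (encoder.n_frames, 2) array, here (250, 2)
events = encoder.decode_strong(grid > 0.5)  # back to [[label, onset, offset], ...] in seconds
print(events)
```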
helpers/score.py
ADDED
|
@@ -0,0 +1,384 @@
"""
score functions from: https://hearbenchmark.com/hear-tasks.html
"""

import json
from collections import ChainMap
from pathlib import Path
from typing import Dict, Optional, Tuple, Union, List, Any

import more_itertools
import numpy as np
import sed_eval
import torch
from dcase_util.containers import MetaDataContainer
from scipy.ndimage import median_filter
from sklearn.model_selection import ParameterGrid
from tqdm import tqdm


def validate_score_return_type(ret: Union[Tuple[Tuple[str, float], ...], float]):
    """
    Valid return types for the metric are
        - tuple(tuple(string: name of the subtype, float: the value)): This is the
          case with sed_eval metrics. They can return (("f_measure", value),
          ("precision", value), ...), depending on the scores the metric is
          supposed to return. This is set as the `scores` attribute in the metric.
        - float: Standard metric behaviour

    The downstream prediction pipeline is able to handle these two types.
    In case of the tuple return type, the value of the first entry in the
    tuple will be used as an optimisation criterion wherever required.
    For instance, if the return is (("f_measure", value), ("precision", value)),
    the value corresponding to the f_measure will be used (for instance in
    early stopping if this metric is the primary score for the task).
    """
    if isinstance(ret, tuple):
        assert all(
            type(s) == tuple and type(s[0]) == str and type(s[1]) == float for s in ret
        ), (
            "If the return type of the score is a tuple, all the elements "
            "in the tuple should be tuples of type (string, float)"
        )
    elif isinstance(ret, float):
        pass
    else:
        raise ValueError(
            f"Return type {type(ret)} is unexpected. Return type of "
            "the score function should either be a "
            "tuple(tuple) or float. "
        )


class ScoreFunction:
    """
    A simple abstract base class for score functions
    """

    # TODO: Remove label_to_idx?
    def __init__(
        self,
        label_to_idx: Dict[str, int],
        name: Optional[str] = None,
        maximize: bool = True,
    ):
        """
        :param label_to_idx: Map from label string to integer index.
        :param name: Override the name of this scoring function.
        :param maximize: Maximize this score? (Otherwise, it's a loss or energy
            we want to minimize, and technically isn't a score.)
        """
        self.label_to_idx = label_to_idx
        if name:
            self.name = name
        self.maximize = maximize

    def __call__(self, *args, **kwargs) -> Union[Tuple[Tuple[str, float], ...], float]:
        """
        Calls the compute function of the metric and, after validating the output,
        returns the metric score
        """
        ret = self._compute(*args, **kwargs)
        validate_score_return_type(ret)
        return ret

    def _compute(
        self, predictions: Any, targets: Any, **kwargs
    ) -> Union[Tuple[Tuple[str, float], ...], float]:
        """
        Compute the score based on the predictions and targets.
        This is a private function; the metric should be used as a functor
        by calling the `__call__` method, which calls this and also validates
        the return type
        """
        raise NotImplementedError("Inheriting classes must implement this function")

    def __str__(self):
        return self.name


class SoundEventScore(ScoreFunction):
    """
    Scores for sound event detection tasks using sed_eval
    """

    # Score class must be defined in inheriting classes
    score_class: sed_eval.sound_event.SoundEventMetrics = None

    def __init__(
        self,
        label_to_idx: Dict[str, int],
        scores: Tuple[str],
        params: Dict = None,
        name: Optional[str] = None,
        maximize: bool = True,
    ):
        """
        :param scores: Scores to use, from the list of overall SED eval scores.
            The first score in the tuple will be the primary score for this metric
        :param params: Parameters to pass to the scoring function,
            see inheriting children for details.
        """
        if params is None:
            params = {}
        super().__init__(label_to_idx=label_to_idx, name=name, maximize=maximize)
        self.scores = scores
        self.params = params
        assert self.score_class is not None

    def _compute(
        self, predictions: Dict, targets: Dict, **kwargs
    ) -> Tuple[Tuple[str, float], ...]:
        # Containers of events for sed_eval
        reference_event_list = self.sed_eval_event_container(targets)
        estimated_event_list = self.sed_eval_event_container(predictions)

        # This will break in Python < 3.6 if the dict order is not
        # the insertion order I think. I'm a little worried about this line
        scores = self.score_class(
            event_label_list=list(self.label_to_idx.keys()), **self.params
        )

        for filename in predictions:
            scores.evaluate(
                reference_event_list=reference_event_list.filter(filename=filename),
                estimated_event_list=estimated_event_list.filter(filename=filename),
            )

        # results_overall_metrics returns a pretty large nested selection of scores,
        # with dicts of scores keyed on the type of scores, like f_measure, error_rate,
        # accuracy
        nested_overall_scores: Dict[
            str, Dict[str, float]
        ] = scores.results_overall_metrics()
        # Open up nested overall scores
        overall_scores: Dict[str, float] = dict(
            ChainMap(*nested_overall_scores.values())
        )
        # Return the required scores as tuples. The scores are returned in the
        # order they are passed in the `scores` argument
        return tuple([(score, overall_scores[score]) for score in self.scores])

    @staticmethod
    def sed_eval_event_container(
        x: Dict[str, List[Dict[str, Any]]]
    ) -> MetaDataContainer:
        # Reformat event list for sed_eval
        reference_events = []
        for filename, event_list in x.items():
            for event in event_list:
                reference_events.append(
                    {
                        # Convert from ms to seconds for sed_eval
                        "event_label": str(event["label"]),
                        "event_onset": event["start"] / 1000.0,
                        "event_offset": event["end"] / 1000.0,
                        "file": filename,
                    }
                )
        return MetaDataContainer(reference_events)


class EventBasedScore(SoundEventScore):
    """
    event-based scores - the ground truth and system output are compared at
    event instance level;

    See https://tut-arg.github.io/sed_eval/generated/sed_eval.sound_event.EventBasedMetrics.html # noqa: E501
    for params.
    """

    score_class = sed_eval.sound_event.EventBasedMetrics


class SegmentBasedScore(SoundEventScore):
    """
    segment-based scores - the ground truth and system output are compared in a
    fixed time grid; sound events are marked as active or inactive in each segment;

    See https://tut-arg.github.io/sed_eval/sound_event.html#sed_eval.sound_event.SegmentBasedMetrics # noqa: E501
    for params.
    """

    score_class = sed_eval.sound_event.SegmentBasedMetrics


def get_events_for_all_files(
    predictions: torch.Tensor,
    filenames: List[str],
    timestamps: torch.Tensor,
    idx_to_label: Dict[int, str],
    postprocessing_grid: Dict[str, List[float]],
    postprocessing: Optional[Tuple[Tuple[str, Any], ...]] = None,
) -> Dict[Tuple[Tuple[str, Any], ...], Dict[str, List[Dict[str, Union[str, float]]]]]:
    """
    Produces lists of events from a set of frame-based label probabilities.
    The input prediction tensor may contain frame predictions from a set of different
    files concatenated together. file_timestamps has a list of filenames and
    timestamps for each frame in the predictions tensor.

    We split the predictions into separate tensors based on the filename and compute
    events based on those individually.

    If no postprocessing is specified (during training), we try a
    variety of ways of postprocessing the predictions into events,
    from the postprocessing_grid including median filtering and
    minimum event length.

    If postprocessing is specified (during test, chosen at the best
    validation epoch), we use this postprocessing.

    Args:
        predictions: a tensor of frame-based multi-label predictions.
        filenames: a list of filenames where each entry corresponds
            to a frame in the predictions tensor.
        timestamps: a list of timestamps where each entry corresponds
            to a frame in the predictions tensor.
        idx_to_label: Index to label mapping.
        postprocessing: See above.

    Returns:
        A dictionary from filtering params to the following values:
        A dictionary of lists of events keyed on the filename slug.
        The event list is of dicts of the following format:
            {"label": str, "start": float ms, "end": float ms}
    """
    # This probably could be more efficient if we make the assumption that
    # timestamps are in sorted order. But this makes sure of it.
    assert predictions.shape[0] == len(filenames)
    assert predictions.shape[0] == len(timestamps)
    event_files: Dict[str, Dict[float, torch.Tensor]] = {}
    for i, (filename, timestamp) in enumerate(zip(filenames, timestamps)):
        slug = Path(filename).name

        # Key on the slug to be consistent with the ground truth
        if slug not in event_files:
            event_files[slug] = {}

        # Save the predictions for the file keyed on the timestamp
        event_files[slug][float(timestamp)] = predictions[i]

    # Create events for all the different files. Store all the events as a dictionary
    # with the same format as the ground truth from the luigi pipeline.
    # Ex) { slug -> [{"label" : "woof", "start": 0.0, "end": 2.32}, ...], ...}
    event_dict: Dict[
        Tuple[Tuple[str, Any], ...], Dict[str, List[Dict[str, Union[float, str]]]]
    ] = {}
    if postprocessing:
        postprocess = postprocessing
        event_dict[postprocess] = {}
        for slug, timestamp_predictions in event_files.items():
            event_dict[postprocess][slug] = create_events_from_prediction(
                timestamp_predictions, idx_to_label, **dict(postprocess)
            )
    else:
        postprocessing_confs = list(ParameterGrid(postprocessing_grid))
        for postprocess_dict in tqdm(postprocessing_confs):
            postprocess = tuple(postprocess_dict.items())
            event_dict[postprocess] = {}
            for slug, timestamp_predictions in event_files.items():
                event_dict[postprocess][slug] = create_events_from_prediction(
                    timestamp_predictions, idx_to_label, **postprocess_dict
                )

    return event_dict


def create_events_from_prediction(
    prediction_dict: Dict[float, torch.Tensor],
    idx_to_label: Dict[int, str],
    threshold: float = 0.5,
    median_filter_ms: float = 150,
    min_duration: float = 60.0,
) -> List[Dict[str, Union[float, str]]]:
    """
    Takes a set of prediction tensors keyed on timestamps and generates events.
    (This is for one particular audio scene.)
    We convert the prediction tensor to a binary label based on the threshold value. Any
    events occurring at adjacent timestamps are considered to be part of the same event.
    This loops through and creates events for each label class.
    We optionally apply median filtering to predictions.
    We disregard events that are shorter than min_duration milliseconds.

    Args:
        prediction_dict: A dictionary of predictions keyed on timestamp
            {timestamp -> prediction}. The prediction is a tensor of label
            probabilities.
        idx_to_label: Index to label mapping.
        threshold: Threshold for determining whether to apply a label
        min_duration: the minimum duration in milliseconds for an
            event to be included.

    Returns:
        A list of dicts with keys "label", "start", and "end"
    """
    # Make sure the timestamps are in the correct order
    timestamps = np.array(sorted(prediction_dict.keys()))

    # Create a sorted numpy matrix of frame level predictions for this file. We convert
    # to a numpy array here before applying a median filter.
    predictions = np.stack(
        [prediction_dict[t].detach().cpu().numpy() for t in timestamps]
    )

    # Optionally apply a median filter here to smooth out events.
    ts_diff = np.mean(np.diff(timestamps))
    if median_filter_ms:
        filter_width = int(round(median_filter_ms / ts_diff))
        if filter_width:
            predictions = median_filter(predictions, size=(filter_width, 1))

    # Convert probabilities to binary vectors based on threshold
    predictions = (predictions > threshold).astype(np.int8)

    events = []
    for label in range(predictions.shape[1]):
        for group in more_itertools.consecutive_groups(
            np.where(predictions[:, label])[0]
        ):
            grouptuple = tuple(group)
            assert (
                tuple(sorted(grouptuple)) == grouptuple
            ), f"{sorted(grouptuple)} != {grouptuple}"
            startidx, endidx = (grouptuple[0], grouptuple[-1])

            start = timestamps[startidx]
            end = timestamps[endidx]
            # Add event if greater than the minimum duration threshold
            if end - start >= min_duration:
                events.append(
                    {"label": idx_to_label[label], "start": start, "end": end}
                )

    # This is just for pretty output, not really necessary
    events.sort(key=lambda k: k["start"])
    return events


def combine_target_events(split_names: List[str], task_path):
    """
    This combines the target events from the list of splits and
    returns the combined target events. This is useful when combining
    multiple folds of data to create the training or validation
    dataloader. For example, in k-fold, the training data-loader
    might be made from the first 4/5 folds, and calling this function
    with [fold00, fold01, fold02, fold03] will return the
    aggregated target events across all the folds
    """
    combined_target_events: Dict = {}
    for split_name in split_names:
        target_events = json.load(
            task_path.joinpath(f"{split_name}.json").open()
        )
        common_keys = set(combined_target_events.keys()).intersection(
            target_events.keys()
        )
        assert len(common_keys) == 0, (
            "Target events from one split should not override "
            "target events from another. This is very unlikely as the "
            "target_event is keyed on the files which are distinct for "
            "each split"
        )
        combined_target_events.update(target_events)
    return combined_target_events
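For orientation, a minimal sketch of the event-creation step in isolation; the probabilities are random, the 50 ms grid and the label map are made up:

```python
import torch

from helpers.score import create_events_from_prediction

idx_to_label = {0: "Speech", 1: "Dog"}
# predictions every 50 ms over 2 seconds: {timestamp in ms -> tensor of 2 class probabilities}
prediction_dict = {t * 50.0: torch.rand(2) for t in range(40)}

events = create_events_from_prediction(
    prediction_dict, idx_to_label, threshold=0.5, median_filter_ms=150, min_duration=60.0
)
print(events)  # e.g. [{'label': 'Speech', 'start': 250.0, 'end': 600.0}, ...]
```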
helpers/utils.py
ADDED
|
@@ -0,0 +1,12 @@
import torch
import numpy as np
import random


def worker_init_fn(x):
    # offset the base seed per worker so the random streams are not nearly identical
    seed = (torch.initial_seed() + x * 1000) % 2 ** 31

    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    return
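The function is intended to be passed to PyTorch's `DataLoader` so that every worker draws from a distinct random stream; a minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from helpers.utils import worker_init_fn

ds = TensorDataset(torch.randn(32, 4))
dl = DataLoader(ds, batch_size=8, num_workers=2, worker_init_fn=worker_init_fn)
```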
images/downstream_task_results.png
ADDED
|
inference.py
ADDED
|
@@ -0,0 +1,126 @@
import argparse
import librosa
import torch

from data_util import audioset_classes
from helpers.decode import batched_decode_preds
from helpers.encode import ManyHotEncoder
from models.atstframe.ATSTF_wrapper import ATSTWrapper
from models.beats.BEATs_wrapper import BEATsWrapper
from models.frame_passt.fpasst_wrapper import FPaSSTWrapper
from models.m2d.M2D_wrapper import M2DWrapper
from models.asit.ASIT_wrapper import ASiTWrapper
from models.frame_mn.Frame_MN_wrapper import FrameMNWrapper
from models.prediction_wrapper import PredictionsWrapper
from models.frame_mn.utils import NAME_TO_WIDTH


def sound_event_detection(args):
    """
    Running Sound Event Detection on an audio clip.
    """
    device = torch.device('cuda') if args.cuda and torch.cuda.is_available() else torch.device('cpu')
    model_name = args.model_name

    if model_name == "BEATs":
        beats = BEATsWrapper()
        model = PredictionsWrapper(beats, checkpoint="BEATs_strong_1")
    elif model_name == "ATST-F":
        atst = ATSTWrapper()
        model = PredictionsWrapper(atst, checkpoint="ATST-F_strong_1")
    elif model_name == "fpasst":
        fpasst = FPaSSTWrapper()
        model = PredictionsWrapper(fpasst, checkpoint="fpasst_strong_1")
    elif model_name == "M2D":
        m2d = M2DWrapper()
        model = PredictionsWrapper(m2d, checkpoint="M2D_strong_1", embed_dim=m2d.m2d.cfg.feature_d)
    elif model_name == "ASIT":
        asit = ASiTWrapper()
        model = PredictionsWrapper(asit, checkpoint="ASIT_strong_1")
    elif model_name.startswith("frame_mn"):
        width = NAME_TO_WIDTH(model_name)
        frame_mn = FrameMNWrapper(width)
        embed_dim = frame_mn.state_dict()['frame_mn.features.16.1.bias'].shape[0]
        model = PredictionsWrapper(frame_mn, checkpoint=f"{model_name}_strong_1", embed_dim=embed_dim)
    else:
        raise NotImplementedError(f"Model {model_name} not (yet) implemented")

    model.eval()
    model.to(device)

    sample_rate = 16_000  # all our models are trained on 16 kHz audio
    segment_duration = 10  # all models are trained on 10-second pieces
    segment_samples = segment_duration * sample_rate

    # load audio
    (waveform, _) = librosa.core.load(args.audio_file, sr=sample_rate, mono=True)
    waveform = torch.from_numpy(waveform[None, :]).to(device)
    waveform_len = waveform.shape[1]

    audio_len = waveform_len / sample_rate  # in seconds
    print("Audio length (seconds): ", audio_len)

    # encoder manages decoding of model predictions into dataframes
    # containing event labels, onsets and offsets
    encoder = ManyHotEncoder(audioset_classes.as_strong_train_classes, audio_len=audio_len)

    # split audio file into 10-second chunks
    num_chunks = waveform_len // segment_samples + (waveform_len % segment_samples != 0)
    all_predictions = []

    # Process each 10-second chunk
    for i in range(num_chunks):
        start_idx = i * segment_samples
        end_idx = min((i + 1) * segment_samples, waveform_len)
        waveform_chunk = waveform[:, start_idx:end_idx]

        # Pad the last chunk if it's shorter than 10 seconds
        if waveform_chunk.shape[1] < segment_samples:
            pad_size = segment_samples - waveform_chunk.shape[1]
            waveform_chunk = torch.nn.functional.pad(waveform_chunk, (0, pad_size))

        # Run inference for each chunk
        with torch.no_grad():
            mel = model.mel_forward(waveform_chunk)
            y_strong, _ = model(mel)

        # Collect predictions
        all_predictions.append(y_strong)

    # Concatenate all predictions along the time axis
    y_strong = torch.cat(all_predictions, dim=2)
    # convert into probabilities
    y_strong = torch.sigmoid(y_strong)

    (
        scores_unprocessed,
        scores_postprocessed,
        decoded_predictions
    ) = batched_decode_preds(
        y_strong.float(),
        [args.audio_file],
        encoder,
        median_filter=args.median_window,
        thresholds=args.detection_thresholds,
    )

    for th in decoded_predictions:
        print("***************************************")
        print(f"Detected events using threshold {th}:")
        print(decoded_predictions[th].sort_values(by="onset"))
        print("***************************************")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Example of parser. ')
    # model names: [BEATs, ASIT, ATST-F, fpasst, M2D]
    parser.add_argument('--model_name', type=str, default='BEATs')
    parser.add_argument('--audio_file', type=str,
                        default='test_files/752547__iscence__milan_metro_coming_in_station.wav')
    parser.add_argument('--detection_thresholds', type=float, default=(0.1, 0.2, 0.5))
    parser.add_argument('--median_window', type=float, default=9)
    parser.add_argument('--cuda', action='store_true', default=False)
    args = parser.parse_args()

    assert args.model_name in ["BEATs", "ASIT", "ATST-F", "fpasst", "M2D"] or args.model_name.startswith("frame_mn")
    sound_event_detection(args)
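Besides the command line (e.g. `python inference.py --model_name ATST-F --audio_file <wav> --cuda`), the entry point can be driven from Python; a minimal sketch with a placeholder audio path:

```python
from types import SimpleNamespace

from inference import sound_event_detection

args = SimpleNamespace(
    model_name="ATST-F",                   # or BEATs, fpasst, M2D, ASIT, frame_mn*
    audio_file="my_recording.wav",         # placeholder path
    detection_thresholds=(0.1, 0.2, 0.5),
    median_window=9,
    cuda=False,
)
sound_event_detection(args)
```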
models/asit/ASIT_wrapper.py
ADDED
|
@@ -0,0 +1,60 @@
| 1 |
+
from models.asit.data_transformations import DataAugmentation
|
| 2 |
+
from models.asit.vision_transformer import vit_base
|
| 3 |
+
from models.transformer_wrapper import BaseModelWrapper
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
class ASiTWrapper(BaseModelWrapper):
|
| 7 |
+
def __init__(self) -> None:
|
| 8 |
+
super().__init__()
|
| 9 |
+
self.asit_mel = DataAugmentation()
|
| 10 |
+
self.asit = vit_base(
|
| 11 |
+
patch_size=[16, 16],
|
| 12 |
+
audio_size=[128, 592],
|
| 13 |
+
stride=[16, 16],
|
| 14 |
+
in_chans=1,
|
| 15 |
+
num_classes=0
|
| 16 |
+
)
|
| 17 |
+
|
| 18 |
+
def mel_forward(self, x):
|
| 19 |
+
return self.asit_mel(x)
|
| 20 |
+
|
| 21 |
+
def forward(self, spec):
|
| 22 |
+
return self.asit(spec)
|
| 23 |
+
|
| 24 |
+
def separate_params(self):
|
| 25 |
+
pt_params = [[], [], [], [], [], [], [], [], [], [], [], []]
|
| 26 |
+
for k, p in self.named_parameters():
|
| 27 |
+
if any(['cls_token' in k,
|
| 28 |
+
'pos_embed' in k,
|
| 29 |
+
'norm_stats' in k,
|
| 30 |
+
'patch_embed' in k]):
|
| 31 |
+
pt_params[0].append(p)
|
| 32 |
+
elif 'blocks.0.' in k:
|
| 33 |
+
pt_params[0].append(p)
|
| 34 |
+
elif 'blocks.1.' in k:
|
| 35 |
+
pt_params[1].append(p)
|
| 36 |
+
elif 'blocks.2.' in k:
|
| 37 |
+
pt_params[2].append(p)
|
| 38 |
+
elif 'blocks.3.' in k:
|
| 39 |
+
pt_params[3].append(p)
|
| 40 |
+
elif 'blocks.4.' in k:
|
| 41 |
+
pt_params[4].append(p)
|
| 42 |
+
elif 'blocks.5.' in k:
|
| 43 |
+
pt_params[5].append(p)
|
| 44 |
+
elif 'blocks.6.' in k:
|
| 45 |
+
pt_params[6].append(p)
|
| 46 |
+
elif 'blocks.7.' in k:
|
| 47 |
+
pt_params[7].append(p)
|
| 48 |
+
elif 'blocks.8.' in k:
|
| 49 |
+
pt_params[8].append(p)
|
| 50 |
+
elif 'blocks.9.' in k:
|
| 51 |
+
pt_params[9].append(p)
|
| 52 |
+
elif 'blocks.10.' in k:
|
| 53 |
+
pt_params[10].append(p)
|
| 54 |
+
elif 'blocks.11.' in k:
|
| 55 |
+
pt_params[11].append(p)
|
| 56 |
+
elif 'asit.norm.weight' in k or 'asit.norm.bias' in k:
|
| 57 |
+
pt_params[11].append(p)
|
| 58 |
+
else:
|
| 59 |
+
raise ValueError(f"Check separate params for ASiT! Unknown key: {k}")
|
| 60 |
+
return list(reversed(pt_params))
|
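
A minimal usage sketch for this wrapper, assuming the repository's `models` package is importable; the waveform length and resulting shapes are illustrative, the wrapper itself only defines `mel_forward` and `forward`:

```python
# Illustrative usage of ASiTWrapper on a random 10 s mono clip at 16 kHz.
import torch

wrapper = ASiTWrapper().eval()
waveform = torch.randn(1, 160000)              # (batch, samples)
with torch.no_grad():
    spec = wrapper.mel_forward(waveform)       # Kaldi fbank features, (batch, 1, n_mels, frames)
    tokens = wrapper(spec)                     # ViT token sequence, (batch, 1 + n_patches, 768)
```
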
models/asit/data_transformations.py
ADDED
|
@@ -0,0 +1,29 @@
| 1 |
+
import torch
|
| 2 |
+
import torch.nn.functional
|
| 3 |
+
import torchaudio
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
class DataAugmentation(object):
|
| 7 |
+
def __init__(self, data_mean=-4.2677393, data_std=4.5689974, num_mel_bins=128, sample_rate=16000):
|
| 8 |
+
self.data_mean = data_mean
|
| 9 |
+
self.data_std = data_std
|
| 10 |
+
self.num_mel_bins = num_mel_bins
|
| 11 |
+
self.sample_rate = sample_rate
|
| 12 |
+
|
| 13 |
+
def _wav2fbank(self, waveform):
|
| 14 |
+
waveform = (waveform - waveform.mean())
|
| 15 |
+
fbank = torchaudio.compliance.kaldi.fbank(waveform, htk_compat=True, sample_frequency=self.sample_rate,
|
| 16 |
+
use_energy=False,
|
| 17 |
+
window_type='hanning', num_mel_bins=self.num_mel_bins, dither=0.0,
|
| 18 |
+
frame_shift=10)
|
| 19 |
+
return fbank
|
| 20 |
+
|
| 21 |
+
def convert_waveform(self, waveform):
|
| 22 |
+
w = self._wav2fbank(waveform)
|
| 23 |
+
fbank = (w - self.data_mean) / (self.data_std * 2)
|
| 24 |
+
fbank = fbank.unsqueeze(0)
|
| 25 |
+
return fbank
|
| 26 |
+
|
| 27 |
+
def __call__(self, batch):
|
| 28 |
+
# apply convert_waveform to each sample of the batch and return the result
|
| 29 |
+
return torch.stack([self.convert_waveform(sample.reshape(1, -1)) for sample in batch]).permute(0, 1, 3, 2)
|
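
As a quick sanity check on the normalization in `convert_waveform`: because of the `* 2` in the denominator, a log-mel value one dataset standard deviation above the mean maps to 0.5 rather than 1.0 (the values below are the constructor defaults shown above):

```python
# Worked example of the (x - mean) / (2 * std) normalization used above.
data_mean, data_std = -4.2677393, 4.5689974
x = data_mean + data_std                       # one std above the dataset mean
normalized = (x - data_mean) / (data_std * 2)
assert abs(normalized - 0.5) < 1e-9
```
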
models/asit/utils.py
ADDED
|
@@ -0,0 +1,540 @@
| 1 |
+
import warnings
|
| 2 |
+
|
| 3 |
+
warnings.filterwarnings("ignore")
|
| 4 |
+
|
| 5 |
+
import os
|
| 6 |
+
import sys
|
| 7 |
+
import time
|
| 8 |
+
import math
|
| 9 |
+
import random
|
| 10 |
+
import datetime
|
| 11 |
+
import subprocess
|
| 12 |
+
from collections import defaultdict, deque
|
| 13 |
+
|
| 14 |
+
import numpy as np
|
| 15 |
+
import torch
|
| 16 |
+
import torch.distributed as dist
|
| 17 |
+
|
| 18 |
+
import argparse
|
| 19 |
+
|
| 20 |
+
from numpy.random import randint
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def GMML_replace_list(samples, corrup_prev, masks_prev, drop_type='noise', max_replace=0.35, align=16):
|
| 24 |
+
rep_drop = 1 if drop_type == '' else (1 / (len(drop_type.split('-')) + 1))
|
| 25 |
+
|
| 26 |
+
n_imgs = samples.size()[0] # this is batch size, but in case bad inistance happened while loading
|
| 27 |
+
samples_aug = samples.detach().clone()
|
| 28 |
+
masks = torch.zeros_like(samples_aug)
|
| 29 |
+
for i in range(n_imgs):
|
| 30 |
+
idx_rnd = randint(0, n_imgs)
|
| 31 |
+
if random.random() < rep_drop:
|
| 32 |
+
samples_aug[i], masks[i] = GMML_drop_rand_patches(samples_aug[i], samples[idx_rnd], max_replace=max_replace,
|
| 33 |
+
align=align)
|
| 34 |
+
else:
|
| 35 |
+
samples_aug[i], masks[i] = corrup_prev[i], masks_prev[i]
|
| 36 |
+
|
| 37 |
+
return samples_aug, masks
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def GMML_drop_rand_patches(X, X_rep=None, drop_type='noise', max_replace=0.7, align=16, max_block_sz=0.3):
|
| 41 |
+
#######################
|
| 42 |
+
# max_replace: percentage of image to be replaced
|
| 43 |
+
# align: align corruption with the patch sizes
|
| 44 |
+
# max_block_sz: percentage of the maximum block to be dropped
|
| 45 |
+
#######################
|
| 46 |
+
|
| 47 |
+
np.random.seed()
|
| 48 |
+
C, H, W = X.size()
|
| 49 |
+
n_drop_pix = np.random.uniform(min(0.5, max_replace), max_replace) * H * W
|
| 50 |
+
mx_blk_height = int(H * max_block_sz)
|
| 51 |
+
mx_blk_width = int(W * max_block_sz)
|
| 52 |
+
|
| 53 |
+
align = max(1, align)
|
| 54 |
+
|
| 55 |
+
mask = torch.zeros_like(X)
|
| 56 |
+
drop_t = np.random.choice(drop_type.split('-'))
|
| 57 |
+
|
| 58 |
+
while mask[0].sum() < n_drop_pix:
|
| 59 |
+
|
| 60 |
+
####### get a random block to replace
|
| 61 |
+
rnd_r = (randint(0, H - align) // align) * align
|
| 62 |
+
rnd_c = (randint(0, W - align) // align) * align
|
| 63 |
+
|
| 64 |
+
rnd_h = min(randint(align, mx_blk_height), H - rnd_r)
|
| 65 |
+
rnd_h = round(rnd_h / align) * align
|
| 66 |
+
rnd_w = min(randint(align, mx_blk_width), W - rnd_c)
|
| 67 |
+
rnd_w = round(rnd_w / align) * align
|
| 68 |
+
|
| 69 |
+
if X_rep is not None:
|
| 70 |
+
X[:, rnd_r:rnd_r + rnd_h, rnd_c:rnd_c + rnd_w] = X_rep[:, rnd_r:rnd_r + rnd_h,
|
| 71 |
+
rnd_c:rnd_c + rnd_w].detach().clone()
|
| 72 |
+
else:
|
| 73 |
+
if drop_t == 'noise':
|
| 74 |
+
X[:, rnd_r:rnd_r + rnd_h, rnd_c:rnd_c + rnd_w] = torch.empty((C, rnd_h, rnd_w), dtype=X.dtype,
|
| 75 |
+
device=X.device).normal_()
|
| 76 |
+
elif drop_t == 'zeros':
|
| 77 |
+
X[:, rnd_r:rnd_r + rnd_h, rnd_c:rnd_c + rnd_w] = torch.zeros((C, rnd_h, rnd_w), dtype=X.dtype,
|
| 78 |
+
device=X.device)
|
| 79 |
+
else:
|
| 80 |
+
####### get a random block to replace from
|
| 81 |
+
rnd_r2 = (randint(0, H - rnd_h) // align) * align
|
| 82 |
+
rnd_c2 = (randint(0, W - rnd_w) // align) * align
|
| 83 |
+
|
| 84 |
+
X[:, rnd_r:rnd_r + rnd_h, rnd_c:rnd_c + rnd_w] = X[:, rnd_r2:rnd_r2 + rnd_h,
|
| 85 |
+
rnd_c2:rnd_c2 + rnd_w].detach().clone()
|
| 86 |
+
|
| 87 |
+
mask[:, rnd_r:rnd_r + rnd_h, rnd_c:rnd_c + rnd_w] = 1
|
| 88 |
+
|
| 89 |
+
return X, mask
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
class collate_batch(object): # replace from other images
|
| 93 |
+
def __init__(self, drop_replace=0., drop_align=1):
|
| 94 |
+
self.drop_replace = drop_replace
|
| 95 |
+
self.drop_align = drop_align
|
| 96 |
+
|
| 97 |
+
def __call__(self, batch):
|
| 98 |
+
batch = torch.utils.data.dataloader.default_collate(batch)
|
| 99 |
+
|
| 100 |
+
if self.drop_replace > 0:
|
| 101 |
+
batch[0][1][0], batch[0][2][0] = GMML_replace_list(batch[0][0][0], batch[0][1][0], batch[0][2][0],
|
| 102 |
+
max_replace=self.drop_replace, align=self.drop_align)
|
| 103 |
+
batch[0][1][1], batch[0][2][1] = GMML_replace_list(batch[0][0][1], batch[0][1][1], batch[0][2][1],
|
| 104 |
+
max_replace=self.drop_replace, align=self.drop_align)
|
| 105 |
+
|
| 106 |
+
return batch
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def clip_gradients(model, clip):
|
| 110 |
+
norms = []
|
| 111 |
+
for name, p in model.named_parameters():
|
| 112 |
+
if p.grad is not None:
|
| 113 |
+
param_norm = p.grad.data.norm(2)
|
| 114 |
+
norms.append(param_norm.item())
|
| 115 |
+
clip_coef = clip / (param_norm + 1e-6)
|
| 116 |
+
if clip_coef < 1:
|
| 117 |
+
p.grad.data.mul_(clip_coef)
|
| 118 |
+
return norms
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
def cancel_gradients_last_layer(epoch, model, freeze_last_layer):
|
| 122 |
+
if epoch >= freeze_last_layer:
|
| 123 |
+
return
|
| 124 |
+
for n, p in model.named_parameters():
|
| 125 |
+
if "last_layer" in n:
|
| 126 |
+
p.grad = None
|
| 127 |
+
|
| 128 |
+
|
| 129 |
+
def restart_from_checkpoint(ckp_path, run_variables=None, **kwargs):
|
| 130 |
+
"""
|
| 131 |
+
Re-start from checkpoint
|
| 132 |
+
"""
|
| 133 |
+
if not os.path.isfile(ckp_path):
|
| 134 |
+
return
|
| 135 |
+
print("Found checkpoint at {}".format(ckp_path))
|
| 136 |
+
|
| 137 |
+
# open checkpoint file
|
| 138 |
+
checkpoint = torch.load(ckp_path, map_location="cpu")
|
| 139 |
+
|
| 140 |
+
# key is what to look for in the checkpoint file
|
| 141 |
+
# value is the object to load
|
| 142 |
+
# example: {'state_dict': model}
|
| 143 |
+
for key, value in kwargs.items():
|
| 144 |
+
if key in checkpoint and value is not None:
|
| 145 |
+
try:
|
| 146 |
+
msg = value.load_state_dict(checkpoint[key], strict=False)
|
| 147 |
+
print("=> loaded '{}' from checkpoint '{}' with msg {}".format(key, ckp_path, msg))
|
| 148 |
+
except TypeError:
|
| 149 |
+
try:
|
| 150 |
+
msg = value.load_state_dict(checkpoint[key])
|
| 151 |
+
print("=> loaded '{}' from checkpoint: '{}'".format(key, ckp_path))
|
| 152 |
+
except ValueError:
|
| 153 |
+
print("=> failed to load '{}' from checkpoint: '{}'".format(key, ckp_path))
|
| 154 |
+
else:
|
| 155 |
+
print("=> key '{}' not found in checkpoint: '{}'".format(key, ckp_path))
|
| 156 |
+
|
| 157 |
+
# re load variable important for the run
|
| 158 |
+
if run_variables is not None:
|
| 159 |
+
for var_name in run_variables:
|
| 160 |
+
if var_name in checkpoint:
|
| 161 |
+
run_variables[var_name] = checkpoint[var_name]
|
| 162 |
+
|
| 163 |
+
|
| 164 |
+
def cosine_scheduler(base_value, final_value, epochs, niter_per_ep, warmup_epochs=0, start_warmup_value=0):
|
| 165 |
+
warmup_schedule = np.array([])
|
| 166 |
+
warmup_iters = warmup_epochs * niter_per_ep
|
| 167 |
+
if warmup_epochs > 0:
|
| 168 |
+
warmup_schedule = np.linspace(start_warmup_value, base_value, warmup_iters)
|
| 169 |
+
|
| 170 |
+
iters = np.arange(epochs * niter_per_ep - warmup_iters)
|
| 171 |
+
schedule = final_value + 0.5 * (base_value - final_value) * (1 + np.cos(np.pi * iters / len(iters)))
|
| 172 |
+
|
| 173 |
+
schedule = np.concatenate((warmup_schedule, schedule))
|
| 174 |
+
assert len(schedule) == epochs * niter_per_ep
|
| 175 |
+
return schedule
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
def bool_flag(s):
|
| 179 |
+
"""
|
| 180 |
+
Parse boolean arguments from the command line.
|
| 181 |
+
"""
|
| 182 |
+
FALSY_STRINGS = {"off", "false", "0"}
|
| 183 |
+
TRUTHY_STRINGS = {"on", "true", "1"}
|
| 184 |
+
if s.lower() in FALSY_STRINGS:
|
| 185 |
+
return False
|
| 186 |
+
elif s.lower() in TRUTHY_STRINGS:
|
| 187 |
+
return True
|
| 188 |
+
else:
|
| 189 |
+
raise argparse.ArgumentTypeError("invalid value for a boolean flag")
|
| 190 |
+
|
| 191 |
+
|
| 192 |
+
def fix_random_seeds(seed=31):
|
| 193 |
+
"""
|
| 194 |
+
Fix random seeds.
|
| 195 |
+
"""
|
| 196 |
+
torch.manual_seed(seed)
|
| 197 |
+
torch.cuda.manual_seed_all(seed)
|
| 198 |
+
np.random.seed(seed)
|
| 199 |
+
|
| 200 |
+
|
| 201 |
+
class SmoothedValue(object):
|
| 202 |
+
"""Track a series of values and provide access to smoothed values over a
|
| 203 |
+
window or the global series average.
|
| 204 |
+
"""
|
| 205 |
+
|
| 206 |
+
def __init__(self, window_size=20, fmt=None):
|
| 207 |
+
if fmt is None:
|
| 208 |
+
fmt = "{median:.6f} ({global_avg:.6f})"
|
| 209 |
+
self.deque = deque(maxlen=window_size)
|
| 210 |
+
self.total = 0.0
|
| 211 |
+
self.count = 0
|
| 212 |
+
self.fmt = fmt
|
| 213 |
+
|
| 214 |
+
def update(self, value, n=1):
|
| 215 |
+
self.deque.append(value)
|
| 216 |
+
self.count += n
|
| 217 |
+
self.total += value * n
|
| 218 |
+
|
| 219 |
+
def synchronize_between_processes(self):
|
| 220 |
+
"""
|
| 221 |
+
Warning: does not synchronize the deque!
|
| 222 |
+
"""
|
| 223 |
+
if not is_dist_avail_and_initialized():
|
| 224 |
+
return
|
| 225 |
+
t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
|
| 226 |
+
dist.barrier()
|
| 227 |
+
dist.all_reduce(t)
|
| 228 |
+
t = t.tolist()
|
| 229 |
+
self.count = int(t[0])
|
| 230 |
+
self.total = t[1]
|
| 231 |
+
|
| 232 |
+
@property
|
| 233 |
+
def median(self):
|
| 234 |
+
d = torch.tensor(list(self.deque))
|
| 235 |
+
return d.median().item()
|
| 236 |
+
|
| 237 |
+
@property
|
| 238 |
+
def avg(self):
|
| 239 |
+
d = torch.tensor(list(self.deque), dtype=torch.float32)
|
| 240 |
+
return d.mean().item()
|
| 241 |
+
|
| 242 |
+
@property
|
| 243 |
+
def global_avg(self):
|
| 244 |
+
return self.total / self.count
|
| 245 |
+
|
| 246 |
+
@property
|
| 247 |
+
def max(self):
|
| 248 |
+
return max(self.deque)
|
| 249 |
+
|
| 250 |
+
@property
|
| 251 |
+
def value(self):
|
| 252 |
+
return self.deque[-1]
|
| 253 |
+
|
| 254 |
+
def __str__(self):
|
| 255 |
+
return self.fmt.format(
|
| 256 |
+
median=self.median,
|
| 257 |
+
avg=self.avg,
|
| 258 |
+
global_avg=self.global_avg,
|
| 259 |
+
max=self.max,
|
| 260 |
+
value=self.value)
|
| 261 |
+
|
| 262 |
+
|
| 263 |
+
def reduce_dict(input_dict, average=True):
|
| 264 |
+
"""
|
| 265 |
+
Args:
|
| 266 |
+
input_dict (dict): all the values will be reduced
|
| 267 |
+
average (bool): whether to do average or sum
|
| 268 |
+
Reduce the values in the dictionary from all processes so that all processes
|
| 269 |
+
have the averaged results. Returns a dict with the same fields as
|
| 270 |
+
input_dict, after reduction.
|
| 271 |
+
"""
|
| 272 |
+
world_size = get_world_size()
|
| 273 |
+
if world_size < 2:
|
| 274 |
+
return input_dict
|
| 275 |
+
with torch.no_grad():
|
| 276 |
+
names = []
|
| 277 |
+
values = []
|
| 278 |
+
# sort the keys so that they are consistent across processes
|
| 279 |
+
for k in sorted(input_dict.keys()):
|
| 280 |
+
names.append(k)
|
| 281 |
+
values.append(input_dict[k])
|
| 282 |
+
values = torch.stack(values, dim=0)
|
| 283 |
+
dist.all_reduce(values)
|
| 284 |
+
if average:
|
| 285 |
+
values /= world_size
|
| 286 |
+
reduced_dict = {k: v for k, v in zip(names, values)}
|
| 287 |
+
return reduced_dict
|
| 288 |
+
|
| 289 |
+
|
| 290 |
+
class MetricLogger(object):
|
| 291 |
+
def __init__(self, delimiter="\t"):
|
| 292 |
+
self.meters = defaultdict(SmoothedValue)
|
| 293 |
+
self.delimiter = delimiter
|
| 294 |
+
|
| 295 |
+
def update(self, **kwargs):
|
| 296 |
+
for k, v in kwargs.items():
|
| 297 |
+
if isinstance(v, torch.Tensor):
|
| 298 |
+
v = v.item()
|
| 299 |
+
assert isinstance(v, (float, int))
|
| 300 |
+
self.meters[k].update(v)
|
| 301 |
+
|
| 302 |
+
def __getattr__(self, attr):
|
| 303 |
+
if attr in self.meters:
|
| 304 |
+
return self.meters[attr]
|
| 305 |
+
if attr in self.__dict__:
|
| 306 |
+
return self.__dict__[attr]
|
| 307 |
+
raise AttributeError("'{}' object has no attribute '{}'".format(
|
| 308 |
+
type(self).__name__, attr))
|
| 309 |
+
|
| 310 |
+
def __str__(self):
|
| 311 |
+
loss_str = []
|
| 312 |
+
for name, meter in self.meters.items():
|
| 313 |
+
loss_str.append(
|
| 314 |
+
"{}: {}".format(name, str(meter))
|
| 315 |
+
)
|
| 316 |
+
return self.delimiter.join(loss_str)
|
| 317 |
+
|
| 318 |
+
def synchronize_between_processes(self):
|
| 319 |
+
for meter in self.meters.values():
|
| 320 |
+
meter.synchronize_between_processes()
|
| 321 |
+
|
| 322 |
+
def add_meter(self, name, meter):
|
| 323 |
+
self.meters[name] = meter
|
| 324 |
+
|
| 325 |
+
def log_every(self, iterable, print_freq, header=None):
|
| 326 |
+
i = 0
|
| 327 |
+
if not header:
|
| 328 |
+
header = ''
|
| 329 |
+
start_time = time.time()
|
| 330 |
+
end = time.time()
|
| 331 |
+
iter_time = SmoothedValue(fmt='{avg:.6f}')
|
| 332 |
+
data_time = SmoothedValue(fmt='{avg:.6f}')
|
| 333 |
+
space_fmt = ':' + str(len(str(len(iterable)))) + 'd'
|
| 334 |
+
if torch.cuda.is_available():
|
| 335 |
+
log_msg = self.delimiter.join([
|
| 336 |
+
header,
|
| 337 |
+
'[{0' + space_fmt + '}/{1}]',
|
| 338 |
+
'eta: {eta}',
|
| 339 |
+
'{meters}',
|
| 340 |
+
'time: {time}',
|
| 341 |
+
'data: {data}',
|
| 342 |
+
'max mem: {memory:.0f}'
|
| 343 |
+
])
|
| 344 |
+
else:
|
| 345 |
+
log_msg = self.delimiter.join([
|
| 346 |
+
header,
|
| 347 |
+
'[{0' + space_fmt + '}/{1}]',
|
| 348 |
+
'eta: {eta}',
|
| 349 |
+
'{meters}',
|
| 350 |
+
'time: {time}',
|
| 351 |
+
'data: {data}'
|
| 352 |
+
])
|
| 353 |
+
MB = 1024.0 * 1024.0
|
| 354 |
+
for obj in iterable:
|
| 355 |
+
data_time.update(time.time() - end)
|
| 356 |
+
yield obj
|
| 357 |
+
iter_time.update(time.time() - end)
|
| 358 |
+
if i % print_freq == 0 or i == len(iterable) - 1:
|
| 359 |
+
eta_seconds = iter_time.global_avg * (len(iterable) - i)
|
| 360 |
+
eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
|
| 361 |
+
if torch.cuda.is_available():
|
| 362 |
+
print(log_msg.format(
|
| 363 |
+
i, len(iterable), eta=eta_string,
|
| 364 |
+
meters=str(self),
|
| 365 |
+
time=str(iter_time), data=str(data_time),
|
| 366 |
+
memory=torch.cuda.max_memory_allocated() / MB))
|
| 367 |
+
else:
|
| 368 |
+
print(log_msg.format(
|
| 369 |
+
i, len(iterable), eta=eta_string,
|
| 370 |
+
meters=str(self),
|
| 371 |
+
time=str(iter_time), data=str(data_time)))
|
| 372 |
+
i += 1
|
| 373 |
+
end = time.time()
|
| 374 |
+
total_time = time.time() - start_time
|
| 375 |
+
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
|
| 376 |
+
print('{} Total time: {} ({:.6f} s / it)'.format(
|
| 377 |
+
header, total_time_str, total_time / len(iterable)))
|
| 378 |
+
|
| 379 |
+
|
| 380 |
+
def get_sha():
|
| 381 |
+
cwd = os.path.dirname(os.path.abspath(__file__))
|
| 382 |
+
|
| 383 |
+
def _run(command):
|
| 384 |
+
return subprocess.check_output(command, cwd=cwd).decode('ascii').strip()
|
| 385 |
+
|
| 386 |
+
sha = 'N/A'
|
| 387 |
+
diff = "clean"
|
| 388 |
+
branch = 'N/A'
|
| 389 |
+
try:
|
| 390 |
+
sha = _run(['git', 'rev-parse', 'HEAD'])
|
| 391 |
+
subprocess.check_output(['git', 'diff'], cwd=cwd)
|
| 392 |
+
diff = _run(['git', 'diff-index', 'HEAD'])
|
| 393 |
+
diff = "has uncommited changes" if diff else "clean"
|
| 394 |
+
branch = _run(['git', 'rev-parse', '--abbrev-ref', 'HEAD'])
|
| 395 |
+
except Exception:
|
| 396 |
+
pass
|
| 397 |
+
message = f"sha: {sha}, status: {diff}, branch: {branch}"
|
| 398 |
+
return message
|
| 399 |
+
|
| 400 |
+
|
| 401 |
+
def is_dist_avail_and_initialized():
|
| 402 |
+
if not dist.is_available():
|
| 403 |
+
return False
|
| 404 |
+
if not dist.is_initialized():
|
| 405 |
+
return False
|
| 406 |
+
return True
|
| 407 |
+
|
| 408 |
+
|
| 409 |
+
def get_world_size():
|
| 410 |
+
if not is_dist_avail_and_initialized():
|
| 411 |
+
return 1
|
| 412 |
+
return dist.get_world_size()
|
| 413 |
+
|
| 414 |
+
|
| 415 |
+
def get_rank():
|
| 416 |
+
if not is_dist_avail_and_initialized():
|
| 417 |
+
return 0
|
| 418 |
+
return dist.get_rank()
|
| 419 |
+
|
| 420 |
+
|
| 421 |
+
def is_main_process():
|
| 422 |
+
return get_rank() == 0
|
| 423 |
+
|
| 424 |
+
|
| 425 |
+
def save_on_master(*args, **kwargs):
|
| 426 |
+
if is_main_process():
|
| 427 |
+
torch.save(*args, **kwargs)
|
| 428 |
+
|
| 429 |
+
|
| 430 |
+
def setup_for_distributed(is_master):
|
| 431 |
+
"""
|
| 432 |
+
This function disables printing when not in master process
|
| 433 |
+
"""
|
| 434 |
+
import builtins as __builtin__
|
| 435 |
+
builtin_print = __builtin__.print
|
| 436 |
+
|
| 437 |
+
def print(*args, **kwargs):
|
| 438 |
+
force = kwargs.pop('force', False)
|
| 439 |
+
if is_master or force:
|
| 440 |
+
builtin_print(*args, **kwargs)
|
| 441 |
+
|
| 442 |
+
__builtin__.print = print
|
| 443 |
+
|
| 444 |
+
|
| 445 |
+
def init_distributed_mode(args):
|
| 446 |
+
# launched with torch.distributed.launch
|
| 447 |
+
if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
|
| 448 |
+
args.rank = int(os.environ["RANK"])
|
| 449 |
+
args.world_size = int(os.environ['WORLD_SIZE'])
|
| 450 |
+
args.gpu = int(os.environ['LOCAL_RANK'])
|
| 451 |
+
# launched with submitit on a slurm cluster
|
| 452 |
+
elif 'SLURM_PROCID' in os.environ:
|
| 453 |
+
args.rank = int(os.environ['SLURM_PROCID'])
|
| 454 |
+
args.gpu = args.rank % torch.cuda.device_count()
|
| 455 |
+
elif torch.cuda.is_available():
|
| 456 |
+
print('Will run the code on one GPU.')
|
| 457 |
+
args.rank, args.gpu, args.world_size = 0, 0, 1
|
| 458 |
+
os.environ['MASTER_ADDR'] = '127.0.0.1'
|
| 459 |
+
os.environ['MASTER_PORT'] = '29500'
|
| 460 |
+
else:
|
| 461 |
+
print('Does not support training without GPU.')
|
| 462 |
+
sys.exit(1)
|
| 463 |
+
|
| 464 |
+
args.distributed = True
|
| 465 |
+
dist.init_process_group(
|
| 466 |
+
backend="nccl",
|
| 467 |
+
init_method=args.dist_url,
|
| 468 |
+
world_size=args.world_size,
|
| 469 |
+
rank=args.rank,
|
| 470 |
+
)
|
| 471 |
+
|
| 472 |
+
torch.cuda.set_device(args.gpu)
|
| 473 |
+
print('| distributed init (rank {}): {}'.format(
|
| 474 |
+
args.rank, args.dist_url), flush=True)
|
| 475 |
+
dist.barrier()
|
| 476 |
+
setup_for_distributed(args.rank == 0)
|
| 477 |
+
|
| 478 |
+
|
| 479 |
+
def accuracy(output, target, topk=(1,)):
|
| 480 |
+
"""Computes the accuracy over the k top predictions for the specified values of k"""
|
| 481 |
+
maxk = max(topk)
|
| 482 |
+
batch_size = target.size(0)
|
| 483 |
+
_, pred = output.topk(maxk, 1, True, True)
|
| 484 |
+
pred = pred.t()
|
| 485 |
+
correct = pred.eq(target.reshape(1, -1).expand_as(pred))
|
| 486 |
+
return [correct[:k].reshape(-1).float().sum(0) * 100. / batch_size for k in topk]
|
| 487 |
+
|
| 488 |
+
|
| 489 |
+
def _no_grad_trunc_normal_(tensor, mean, std, a, b):
|
| 490 |
+
# Cut & paste from PyTorch official master until it's in a few official releases - RW
|
| 491 |
+
# Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
|
| 492 |
+
def norm_cdf(x):
|
| 493 |
+
# Computes standard normal cumulative distribution function
|
| 494 |
+
return (1. + math.erf(x / math.sqrt(2.))) / 2.
|
| 495 |
+
|
| 496 |
+
if (mean < a - 2 * std) or (mean > b + 2 * std):
|
| 497 |
+
warnings.warn("mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
|
| 498 |
+
"The distribution of values may be incorrect.",
|
| 499 |
+
stacklevel=2)
|
| 500 |
+
|
| 501 |
+
with torch.no_grad():
|
| 502 |
+
# Values are generated by using a truncated uniform distribution and
|
| 503 |
+
# then using the inverse CDF for the normal distribution.
|
| 504 |
+
# Get upper and lower cdf values
|
| 505 |
+
l = norm_cdf((a - mean) / std)
|
| 506 |
+
u = norm_cdf((b - mean) / std)
|
| 507 |
+
|
| 508 |
+
# Uniformly fill tensor with values from [l, u], then translate to
|
| 509 |
+
# [2l-1, 2u-1].
|
| 510 |
+
tensor.uniform_(2 * l - 1, 2 * u - 1)
|
| 511 |
+
|
| 512 |
+
# Use inverse cdf transform for normal distribution to get truncated
|
| 513 |
+
# standard normal
|
| 514 |
+
tensor.erfinv_()
|
| 515 |
+
|
| 516 |
+
# Transform to proper mean, std
|
| 517 |
+
tensor.mul_(std * math.sqrt(2.))
|
| 518 |
+
tensor.add_(mean)
|
| 519 |
+
|
| 520 |
+
# Clamp to ensure it's in the proper range
|
| 521 |
+
tensor.clamp_(min=a, max=b)
|
| 522 |
+
return tensor
|
| 523 |
+
|
| 524 |
+
|
| 525 |
+
def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
|
| 526 |
+
return _no_grad_trunc_normal_(tensor, mean, std, a, b)
|
| 527 |
+
|
| 528 |
+
|
| 529 |
+
def get_params_groups(model):
|
| 530 |
+
regularized = []
|
| 531 |
+
not_regularized = []
|
| 532 |
+
for name, param in model.named_parameters():
|
| 533 |
+
if not param.requires_grad:
|
| 534 |
+
continue
|
| 535 |
+
# we do not regularize biases nor Norm parameters
|
| 536 |
+
if name.endswith(".bias") or len(param.shape) == 1:
|
| 537 |
+
not_regularized.append(param)
|
| 538 |
+
else:
|
| 539 |
+
regularized.append(param)
|
| 540 |
+
return [{'params': regularized}, {'params': not_regularized, 'weight_decay': 0.}]
|
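
Of the utilities above, `cosine_scheduler` is the one most likely to be reused elsewhere; a short sketch of its behaviour with made-up hyperparameters:

```python
# cosine_scheduler: linear warmup followed by cosine decay, one value per iteration.
import numpy as np

sched = cosine_scheduler(base_value=1e-3, final_value=1e-5,
                         epochs=10, niter_per_ep=100, warmup_epochs=1)
assert len(sched) == 10 * 100
# sched[0] == 0.0 (warmup starts at start_warmup_value),
# sched[99] == 1e-3 (end of warmup reaches base_value),
# sched[-1] is approximately 1e-5 (cosine decay toward final_value).
```
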
models/asit/vision_transformer.py
ADDED
|
@@ -0,0 +1,316 @@
| 1 |
+
from functools import partial
|
| 2 |
+
|
| 3 |
+
import torch
|
| 4 |
+
import torch.nn as nn
|
| 5 |
+
|
| 6 |
+
from models.asit.utils import trunc_normal_
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def drop_path(x, drop_prob: float = 0., training: bool = False):
|
| 10 |
+
if drop_prob == 0. or not training:
|
| 11 |
+
return x
|
| 12 |
+
keep_prob = 1 - drop_prob
|
| 13 |
+
shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
|
| 14 |
+
random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
|
| 15 |
+
random_tensor.floor_() # binarize
|
| 16 |
+
output = x.div(keep_prob) * random_tensor
|
| 17 |
+
return output
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
class DropPath(nn.Module):
|
| 21 |
+
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
def __init__(self, drop_prob=None):
|
| 25 |
+
super(DropPath, self).__init__()
|
| 26 |
+
self.drop_prob = drop_prob
|
| 27 |
+
|
| 28 |
+
def forward(self, x):
|
| 29 |
+
return drop_path(x, self.drop_prob, self.training)
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class Mlp(nn.Module):
|
| 33 |
+
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
|
| 34 |
+
super().__init__()
|
| 35 |
+
out_features = out_features or in_features
|
| 36 |
+
hidden_features = hidden_features or in_features
|
| 37 |
+
self.fc1 = nn.Linear(in_features, hidden_features)
|
| 38 |
+
self.act = act_layer()
|
| 39 |
+
self.fc2 = nn.Linear(hidden_features, out_features)
|
| 40 |
+
self.drop = nn.Dropout(drop)
|
| 41 |
+
|
| 42 |
+
def forward(self, x):
|
| 43 |
+
x = self.fc1(x)
|
| 44 |
+
x = self.act(x)
|
| 45 |
+
x = self.drop(x)
|
| 46 |
+
x = self.fc2(x)
|
| 47 |
+
x = self.drop(x)
|
| 48 |
+
return x
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
class Attention(nn.Module):
|
| 52 |
+
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
|
| 53 |
+
super().__init__()
|
| 54 |
+
self.num_heads = num_heads
|
| 55 |
+
head_dim = dim // num_heads
|
| 56 |
+
self.scale = qk_scale or head_dim ** -0.5
|
| 57 |
+
|
| 58 |
+
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
|
| 59 |
+
self.attn_drop = nn.Dropout(attn_drop)
|
| 60 |
+
self.proj = nn.Linear(dim, dim)
|
| 61 |
+
self.proj_drop = nn.Dropout(proj_drop)
|
| 62 |
+
|
| 63 |
+
def forward(self, x):
|
| 64 |
+
B, N, C = x.shape
|
| 65 |
+
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
|
| 66 |
+
q, k, v = qkv[0], qkv[1], qkv[2]
|
| 67 |
+
|
| 68 |
+
attn = (q @ k.transpose(-2, -1)) * self.scale
|
| 69 |
+
attn = attn.softmax(dim=-1)
|
| 70 |
+
attn = self.attn_drop(attn)
|
| 71 |
+
|
| 72 |
+
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
|
| 73 |
+
x = self.proj(x)
|
| 74 |
+
x = self.proj_drop(x)
|
| 75 |
+
return x, attn
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
class Block(nn.Module):
|
| 79 |
+
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
|
| 80 |
+
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
|
| 81 |
+
super().__init__()
|
| 82 |
+
self.norm1 = norm_layer(dim)
|
| 83 |
+
self.attn = Attention(
|
| 84 |
+
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
|
| 85 |
+
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
|
| 86 |
+
self.norm2 = norm_layer(dim)
|
| 87 |
+
mlp_hidden_dim = int(dim * mlp_ratio)
|
| 88 |
+
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
|
| 89 |
+
|
| 90 |
+
def forward(self, x, return_attention=False):
|
| 91 |
+
y, attn = self.attn(self.norm1(x))
|
| 92 |
+
if return_attention:
|
| 93 |
+
return attn
|
| 94 |
+
x = x + self.drop_path(y)
|
| 95 |
+
x = x + self.drop_path(self.mlp(self.norm2(x)))
|
| 96 |
+
return x
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
class PatchEmbed(nn.Module):
|
| 100 |
+
""" Image to Patch Embedding
|
| 101 |
+
"""
|
| 102 |
+
|
| 103 |
+
def __init__(self, img_size=[1024, 128], patch_size=[16, 16], in_chans=3, embed_dim=768):
|
| 104 |
+
super().__init__()
|
| 105 |
+
num_patches = (img_size[0] // patch_size[0]) * (img_size[1] // patch_size[1])
|
| 106 |
+
self.img_size = img_size
|
| 107 |
+
self.patch_size = patch_size
|
| 108 |
+
self.num_patches = num_patches
|
| 109 |
+
|
| 110 |
+
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
|
| 111 |
+
|
| 112 |
+
def forward(self, x):
|
| 113 |
+
B, C, H, W = x.shape
|
| 114 |
+
x = self.proj(x).flatten(2).transpose(1, 2)
|
| 115 |
+
return x
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
class VisionTransformer(nn.Module):
|
| 119 |
+
""" Vision Transformer """
|
| 120 |
+
|
| 121 |
+
def __init__(self, audio_size=[1024, 128], patch_size=[16, 16], in_chans=3, num_classes=0, embed_dim=768, depth=12,
|
| 122 |
+
num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
|
| 123 |
+
drop_path_rate=0., norm_layer=nn.LayerNorm, **kwargs):
|
| 124 |
+
super().__init__()
|
| 125 |
+
self.num_features = self.embed_dim = embed_dim
|
| 126 |
+
self.audio_size = audio_size
|
| 127 |
+
self.patch_size = patch_size
|
| 128 |
+
self.patch_embed = PatchEmbed(
|
| 129 |
+
img_size=audio_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
|
| 130 |
+
num_patches = self.patch_embed.num_patches
|
| 131 |
+
|
| 132 |
+
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
|
| 133 |
+
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
|
| 134 |
+
self.pos_drop = nn.Dropout(p=drop_rate)
|
| 135 |
+
|
| 136 |
+
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
|
| 137 |
+
self.blocks = nn.ModuleList([
|
| 138 |
+
Block(
|
| 139 |
+
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
|
| 140 |
+
drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
|
| 141 |
+
for i in range(depth)])
|
| 142 |
+
self.norm = norm_layer(embed_dim)
|
| 143 |
+
|
| 144 |
+
# Classifier head
|
| 145 |
+
self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
|
| 146 |
+
|
| 147 |
+
trunc_normal_(self.pos_embed, std=.02)
|
| 148 |
+
trunc_normal_(self.cls_token, std=.02)
|
| 149 |
+
self.apply(self._init_weights)
|
| 150 |
+
|
| 151 |
+
def _init_weights(self, m):
|
| 152 |
+
if isinstance(m, nn.Linear):
|
| 153 |
+
trunc_normal_(m.weight, std=.02)
|
| 154 |
+
if isinstance(m, nn.Linear) and m.bias is not None:
|
| 155 |
+
nn.init.constant_(m.bias, 0)
|
| 156 |
+
elif isinstance(m, nn.LayerNorm):
|
| 157 |
+
nn.init.constant_(m.bias, 0)
|
| 158 |
+
nn.init.constant_(m.weight, 1.0)
|
| 159 |
+
|
| 160 |
+
def interpolate_pos_encoding(self, x, w, h):
|
| 161 |
+
npatch = (w / 16) * (h / 16)
|
| 162 |
+
N = self.pos_embed.shape[1] - 1
|
| 163 |
+
if npatch == N:
|
| 164 |
+
return self.pos_embed
|
| 165 |
+
|
| 166 |
+
class_pos_embed = self.pos_embed[:, 0]
|
| 167 |
+
patch_pos_embed = self.pos_embed[:, 1:]
|
| 168 |
+
|
| 169 |
+
sz1 = w // self.patch_size[0]
|
| 170 |
+
sz2 = h // self.patch_size[0]
|
| 171 |
+
|
| 172 |
+
prev_sz1 = self.audio_size[0] // self.patch_size[0]
|
| 173 |
+
prev_sz2 = self.audio_size[1] // self.patch_size[1]
|
| 174 |
+
patch_pos_embed = torch.nn.functional.interpolate(
|
| 175 |
+
patch_pos_embed.transpose(1, 2).reshape(1, self.embed_dim, prev_sz1, prev_sz2), size=(sz1, sz2),
|
| 176 |
+
mode='bicubic', align_corners=False)
|
| 177 |
+
|
| 178 |
+
patch_pos_embed = patch_pos_embed.reshape(1, self.embed_dim, sz1 * sz2).transpose(1, 2)
|
| 179 |
+
|
| 180 |
+
return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1)
|
| 181 |
+
|
| 182 |
+
def prepare_tokens(self, x):
|
| 183 |
+
B, nc, w, h = x.shape
|
| 184 |
+
x = self.patch_embed(x) # patch linear embedding
|
| 185 |
+
|
| 186 |
+
# add the [CLS] token to the embed patch tokens
|
| 187 |
+
cls_tokens = self.cls_token.expand(B, -1, -1)
|
| 188 |
+
x = torch.cat((cls_tokens, x), dim=1)
|
| 189 |
+
|
| 190 |
+
# add positional encoding to each token
|
| 191 |
+
x = x + self.interpolate_pos_encoding(x, w, h)
|
| 192 |
+
# x = x + self.pos_embed
|
| 193 |
+
return self.pos_drop(x)
|
| 194 |
+
|
| 195 |
+
def forward(self, x, classify=False):
|
| 196 |
+
x = x.permute(0, 1, 3, 2)
|
| 197 |
+
x = self.prepare_tokens(x)
|
| 198 |
+
for blk in self.blocks:
|
| 199 |
+
x = blk(x)
|
| 200 |
+
x = self.norm(x)
|
| 201 |
+
if classify:
|
| 202 |
+
return self.head(x[:, 0])
|
| 203 |
+
return x
|
| 204 |
+
|
| 205 |
+
def get_last_selfattention(self, x):
|
| 206 |
+
x = self.prepare_tokens(x)
|
| 207 |
+
for i, blk in enumerate(self.blocks):
|
| 208 |
+
if i < len(self.blocks) - 1:
|
| 209 |
+
x = blk(x)
|
| 210 |
+
else:
|
| 211 |
+
# return attention of the last block
|
| 212 |
+
return blk(x, return_attention=True)
|
| 213 |
+
|
| 214 |
+
def get_intermediate_layers(self, x, n=1):
|
| 215 |
+
x = self.prepare_tokens(x)
|
| 216 |
+
# we return the output tokens from the `n` last blocks
|
| 217 |
+
output = []
|
| 218 |
+
for i, blk in enumerate(self.blocks):
|
| 219 |
+
x = blk(x)
|
| 220 |
+
if len(self.blocks) - i <= n:
|
| 221 |
+
output.append(self.norm(x))
|
| 222 |
+
return output
|
| 223 |
+
|
| 224 |
+
|
| 225 |
+
def vit_tiny(patch_size=16, **kwargs):
|
| 226 |
+
model = VisionTransformer(
|
| 227 |
+
patch_size=patch_size, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4,
|
| 228 |
+
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
|
| 229 |
+
return model
|
| 230 |
+
|
| 231 |
+
|
| 232 |
+
def vit_small(patch_size=[16, 16], audio_size=[1024, 128], stride=[16, 16], **kwargs):
|
| 233 |
+
model = VisionTransformer(
|
| 234 |
+
patch_size=patch_size, audio_size=audio_size, stride=stride, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4,
|
| 235 |
+
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
|
| 236 |
+
return model
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
def vit_base(patch_size=[16, 16], audio_size=[1024, 128], stride=[16, 16], **kwargs):
|
| 240 |
+
model = VisionTransformer(
|
| 241 |
+
patch_size=patch_size, audio_size=audio_size, stride=stride, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4,
|
| 242 |
+
qkv_bias=True, norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
|
| 243 |
+
return model
|
| 244 |
+
|
| 245 |
+
|
| 246 |
+
class CLSHead(nn.Module):
|
| 247 |
+
def __init__(self, in_dim, out_dim, use_bn=False, norm_last_layer=True, nlayers=3, hidden_dim=2048,
|
| 248 |
+
bottleneck_dim=256):
|
| 249 |
+
super().__init__()
|
| 250 |
+
nlayers = max(nlayers, 1)
|
| 251 |
+
if nlayers == 1:
|
| 252 |
+
self.mlp = nn.Linear(in_dim, bottleneck_dim)
|
| 253 |
+
else:
|
| 254 |
+
layers = [nn.Linear(in_dim, hidden_dim)]
|
| 255 |
+
if use_bn:
|
| 256 |
+
layers.append(nn.BatchNorm1d(hidden_dim))
|
| 257 |
+
layers.append(nn.GELU())
|
| 258 |
+
for _ in range(nlayers - 2):
|
| 259 |
+
layers.append(nn.Linear(hidden_dim, hidden_dim))
|
| 260 |
+
if use_bn:
|
| 261 |
+
layers.append(nn.BatchNorm1d(hidden_dim))
|
| 262 |
+
layers.append(nn.GELU())
|
| 263 |
+
layers.append(nn.Linear(hidden_dim, bottleneck_dim))
|
| 264 |
+
self.mlp = nn.Sequential(*layers)
|
| 265 |
+
self.apply(self._init_weights)
|
| 266 |
+
self.last_layer = nn.Linear(bottleneck_dim, out_dim, bias=False)
|
| 267 |
+
self.last_layer = nn.utils.weight_norm(nn.Linear(bottleneck_dim, out_dim, bias=False))
|
| 268 |
+
self.last_layer.weight_g.data.fill_(1)
|
| 269 |
+
|
| 270 |
+
def _init_weights(self, m):
|
| 271 |
+
if isinstance(m, nn.Linear):
|
| 272 |
+
trunc_normal_(m.weight, std=.02)
|
| 273 |
+
if isinstance(m, nn.Linear) and m.bias is not None:
|
| 274 |
+
nn.init.constant_(m.bias, 0)
|
| 275 |
+
|
| 276 |
+
def forward(self, x):
|
| 277 |
+
x = self.mlp(x)
|
| 278 |
+
x = nn.functional.normalize(x, dim=-1, p=2)
|
| 279 |
+
return self.last_layer(x)
|
| 280 |
+
|
| 281 |
+
|
| 282 |
+
class RECHead(nn.Module):
|
| 283 |
+
def __init__(self, in_dim, audio_size, in_chans=3, patch_size=16):
|
| 284 |
+
super().__init__()
|
| 285 |
+
|
| 286 |
+
self.audio_size = audio_size
|
| 287 |
+
self.patch_size = patch_size
|
| 288 |
+
|
| 289 |
+
layers = [nn.Linear(in_dim, in_dim)]
|
| 290 |
+
layers.append(nn.GELU())
|
| 291 |
+
layers.append(nn.Linear(in_dim, in_dim))
|
| 292 |
+
layers.append(nn.GELU())
|
| 293 |
+
layers.append(nn.Linear(in_dim, in_dim))
|
| 294 |
+
layers.append(nn.GELU())
|
| 295 |
+
|
| 296 |
+
self.mlp = nn.Sequential(*layers)
|
| 297 |
+
self.apply(self._init_weights)
|
| 298 |
+
|
| 299 |
+
self.convTrans = nn.ConvTranspose2d(in_dim, in_chans, kernel_size=(patch_size, patch_size),
|
| 300 |
+
stride=(patch_size, patch_size))
|
| 301 |
+
|
| 302 |
+
def _init_weights(self, m):
|
| 303 |
+
if isinstance(m, nn.Linear):
|
| 304 |
+
trunc_normal_(m.weight, std=.02)
|
| 305 |
+
if isinstance(m, nn.Linear) and m.bias is not None:
|
| 306 |
+
nn.init.constant_(m.bias, 0)
|
| 307 |
+
|
| 308 |
+
def forward(self, x):
|
| 309 |
+
x = self.mlp(x)
|
| 310 |
+
|
| 311 |
+
x_rec = x.transpose(1, 2)
|
| 312 |
+
out_sz = (self.audio_size[0] // self.patch_size, self.audio_size[
|
| 313 |
+
1] // self.patch_size) # tuple( ( int(math.sqrt(x_rec.size()[2])) , int(math.sqrt(x_rec.size()[2])) ) )
|
| 314 |
+
x_rec = self.convTrans(x_rec.unflatten(2, out_sz))
|
| 315 |
+
|
| 316 |
+
return x_rec
|
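
A shape check for `vit_base` with the configuration used by `ASiTWrapper` above; the input tensor is random and only the shapes matter (run inside this repository so the module is importable):

```python
# vit_base configured for 128-mel spectrograms of 592 frames, as in ASiTWrapper.
import torch

model = vit_base(patch_size=[16, 16], audio_size=[128, 592], stride=[16, 16],
                 in_chans=1, num_classes=0)
spec = torch.randn(2, 1, 128, 592)             # (batch, channel, n_mels, frames)
tokens = model(spec)
assert tokens.shape == (2, 1 + (592 // 16) * (128 // 16), 768)   # CLS token + patch tokens
```
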
models/atstframe/ATSTF_wrapper.py
ADDED
|
@@ -0,0 +1,105 @@
| 1 |
+
import torch
|
| 2 |
+
from torchaudio.transforms import AmplitudeToDB, MelSpectrogram
|
| 3 |
+
|
| 4 |
+
from models.atstframe.audio_transformer import FrameASTModel
|
| 5 |
+
from models.transformer_wrapper import BaseModelWrapper
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
class ATSTWrapper(BaseModelWrapper):
|
| 9 |
+
def __init__(self, atst_dropout=0.0) -> None:
|
| 10 |
+
super().__init__()
|
| 11 |
+
self.atst_mel = ATSTMel()
|
| 12 |
+
self.atst = FrameASTModel(atst_dropout=atst_dropout)
|
| 13 |
+
self.fake_length = torch.tensor([1001])
|
| 14 |
+
self.cls_embed = None
|
| 15 |
+
|
| 16 |
+
def mel_forward(self, x):
|
| 17 |
+
return self.atst_mel(x)
|
| 18 |
+
|
| 19 |
+
def forward(self, spec):
|
| 20 |
+
atst_x = self.atst.get_intermediate_layers(
|
| 21 |
+
spec,
|
| 22 |
+
self.fake_length.to(spec).repeat(len(spec)),
|
| 23 |
+
1,
|
| 24 |
+
scene=False
|
| 25 |
+
)
|
| 26 |
+
return atst_x
|
| 27 |
+
|
| 28 |
+
def separate_params(self):
|
| 29 |
+
pt_params = [[], [], [], [], [], [], [], [], [], [], [], []]
|
| 30 |
+
for k, p in self.named_parameters():
|
| 31 |
+
if k in ['atst.mask_embed', 'atst.pos_embed', 'atst.patch_embed.patch_embed.weight',
|
| 32 |
+
'atst.patch_embed.patch_embed.bias'] or "blocks.0." in k:
|
| 33 |
+
pt_params[0].append(p)
|
| 34 |
+
elif "blocks.1." in k:
|
| 35 |
+
pt_params[1].append(p)
|
| 36 |
+
elif "blocks.2." in k:
|
| 37 |
+
pt_params[2].append(p)
|
| 38 |
+
elif "blocks.3." in k:
|
| 39 |
+
pt_params[3].append(p)
|
| 40 |
+
elif "blocks.4." in k:
|
| 41 |
+
pt_params[4].append(p)
|
| 42 |
+
elif "blocks.5." in k:
|
| 43 |
+
pt_params[5].append(p)
|
| 44 |
+
elif "blocks.6." in k:
|
| 45 |
+
pt_params[6].append(p)
|
| 46 |
+
elif "blocks.7." in k:
|
| 47 |
+
pt_params[7].append(p)
|
| 48 |
+
elif "blocks.8" in k:
|
| 49 |
+
pt_params[8].append(p)
|
| 50 |
+
elif "blocks.9." in k:
|
| 51 |
+
pt_params[9].append(p)
|
| 52 |
+
elif "blocks.10." in k:
|
| 53 |
+
pt_params[10].append(p)
|
| 54 |
+
elif "blocks.11." in k or ".norm_frame." in k:
|
| 55 |
+
pt_params[11].append(p)
|
| 56 |
+
else:
|
| 57 |
+
raise ValueError(f"Check separate params for ATST! Unknown key: {k}")
|
| 58 |
+
return list(reversed(pt_params))
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
class ATSTMel(torch.nn.Module):
|
| 62 |
+
def __init__(self) -> None:
|
| 63 |
+
super().__init__()
|
| 64 |
+
self.mel_transform = MelSpectrogram(
|
| 65 |
+
16000,
|
| 66 |
+
f_min=60,
|
| 67 |
+
f_max=7800,
|
| 68 |
+
hop_length=160,
|
| 69 |
+
win_length=1024,
|
| 70 |
+
n_fft=1024,
|
| 71 |
+
n_mels=64
|
| 72 |
+
)
|
| 73 |
+
self.amp_to_db = AmplitudeToDB(stype="power", top_db=80)
|
| 74 |
+
self.scaler = MinMax(min=-79.6482, max=50.6842)
|
| 75 |
+
|
| 76 |
+
def amp2db(self, spec):
|
| 77 |
+
return self.amp_to_db(spec).clamp(min=-50, max=80)
|
| 78 |
+
|
| 79 |
+
def forward(self, audio):
|
| 80 |
+
with torch.autocast(device_type="cuda", enabled=False):
|
| 81 |
+
spec = self.mel_transform(audio)
|
| 82 |
+
spec = self.scaler(self.amp2db(spec))
|
| 83 |
+
spec = spec.unsqueeze(1)
|
| 84 |
+
return spec
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
class CustomAudioTransform:
|
| 88 |
+
def __repr__(self):
|
| 89 |
+
return self.__class__.__name__ + '()'
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
class MinMax(CustomAudioTransform):
|
| 93 |
+
def __init__(self, min, max):
|
| 94 |
+
self.min = min
|
| 95 |
+
self.max = max
|
| 96 |
+
|
| 97 |
+
def __call__(self, input):
|
| 98 |
+
if self.min is None:
|
| 99 |
+
min_ = torch.min(input)
|
| 100 |
+
max_ = torch.max(input)
|
| 101 |
+
else:
|
| 102 |
+
min_ = self.min
|
| 103 |
+
max_ = self.max
|
| 104 |
+
input = (input - min_) / (max_ - min_) * 2. - 1.
|
| 105 |
+
return input
|
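
The `MinMax` transform at the end of this file maps dB values linearly into [-1, 1]; a quick check with the dataset-wide extremes used in `ATSTMel`:

```python
# MinMax scaling: the configured minimum maps to -1 and the maximum to +1.
import torch

scaler = MinMax(min=-79.6482, max=50.6842)
x = torch.tensor([-79.6482, 50.6842])
print(scaler(x))    # tensor([-1., 1.])
```
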
models/atstframe/audio_transformer.py
ADDED
|
@@ -0,0 +1,253 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import math
|
| 2 |
+
import warnings
|
| 3 |
+
from functools import partial
|
| 4 |
+
import torch
|
| 5 |
+
from torch import nn
|
| 6 |
+
|
| 7 |
+
from .transformer import Block
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
def _no_grad_trunc_normal_(tensor, mean, std, a, b):
|
| 11 |
+
# Cut & paste from PyTorch official master until it's in a few official releases - RW
|
| 12 |
+
# Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
|
| 13 |
+
def norm_cdf(x):
|
| 14 |
+
# Computes standard normal cumulative distribution function
|
| 15 |
+
return (1. + math.erf(x / math.sqrt(2.))) / 2.
|
| 16 |
+
|
| 17 |
+
if (mean < a - 2 * std) or (mean > b + 2 * std):
|
| 18 |
+
warnings.warn("mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
|
| 19 |
+
"The distribution of values may be incorrect.",
|
| 20 |
+
stacklevel=2)
|
| 21 |
+
|
| 22 |
+
with torch.no_grad():
|
| 23 |
+
# Values are generated by using a truncated uniform distribution and
|
| 24 |
+
# then using the inverse CDF for the normal distribution.
|
| 25 |
+
# Get upper and lower cdf values
|
| 26 |
+
l = norm_cdf((a - mean) / std)
|
| 27 |
+
u = norm_cdf((b - mean) / std)
|
| 28 |
+
|
| 29 |
+
# Uniformly fill tensor with values from [l, u], then translate to
|
| 30 |
+
# [2l-1, 2u-1].
|
| 31 |
+
tensor.uniform_(2 * l - 1, 2 * u - 1)
|
| 32 |
+
|
| 33 |
+
# Use inverse cdf transform for normal distribution to get truncated
|
| 34 |
+
# standard normal
|
| 35 |
+
tensor.erfinv_()
|
| 36 |
+
|
| 37 |
+
# Transform to proper mean, std
|
| 38 |
+
tensor.mul_(std * math.sqrt(2.))
|
| 39 |
+
tensor.add_(mean)
|
| 40 |
+
|
| 41 |
+
# Clamp to ensure it's in the proper range
|
| 42 |
+
tensor.clamp_(min=a, max=b)
|
| 43 |
+
return tensor
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
|
| 47 |
+
return _no_grad_trunc_normal_(tensor, mean, std, a, b)
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def get_num_patches(height=64, width=1001, patch_height=16, patch_width=16):
|
| 51 |
+
return (height // patch_height) * (width // patch_width)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
from einops.layers.torch import Rearrange
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
class PatchEmbed_v2(nn.Module):
|
| 58 |
+
def __init__(self, patch_height=64, patch_width=4, embed_dim=768, input_dim=1):
|
| 59 |
+
super().__init__()
|
| 60 |
+
self.patch_height = patch_height
|
| 61 |
+
self.patch_width = patch_width
|
| 62 |
+
self.patch_maker = Rearrange('b c (h p1) (w p2) -> b (w h) (p1 p2 c)', p1=patch_height, p2=patch_width)
|
| 63 |
+
self.patch_embed = nn.Linear(patch_height * patch_width * input_dim, embed_dim)
|
| 64 |
+
|
| 65 |
+
def forward(self, melspec, length=None):
|
| 66 |
+
height = melspec.shape[2] - melspec.shape[2] % self.patch_height
|
| 67 |
+
width = melspec.shape[3] - melspec.shape[3] % self.patch_width
|
| 68 |
+
patch = self.patch_maker(melspec[:, :, :height, :width])
|
| 69 |
+
patch_embed = self.patch_embed(patch)
|
| 70 |
+
|
| 71 |
+
if length is not None:
|
| 72 |
+
patch_length = (torch.div(height, self.patch_height, rounding_mode='trunc')) * torch.div(
|
| 73 |
+
(length - length % self.patch_width), self.patch_width, rounding_mode='trunc')
|
| 74 |
+
else:
|
| 75 |
+
patch_length = None
|
| 76 |
+
|
| 77 |
+
return patch, patch_embed, patch_length
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
class FrameAST(nn.Module):
|
| 81 |
+
""" Vision Transformer """
|
| 82 |
+
|
| 83 |
+
def __init__(self, nprompt=0, spec_h=64, spec_w=1001, patch_w=16, patch_h=16, pos_type="cut", in_chans=1,
|
| 84 |
+
num_classes=0, embed_dim=768, depth=12,
|
| 85 |
+
num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0.0, attn_drop_rate=0.,
|
| 86 |
+
drop_path_rate=0.0, norm_layer=nn.LayerNorm, **kwargs):
|
| 87 |
+
super().__init__()
|
| 88 |
+
self.num_features = self.embed_dim = embed_dim
|
| 89 |
+
self.spec_w = spec_w
|
| 90 |
+
self.spec_h = spec_h
|
| 91 |
+
self.embed_dim = embed_dim
|
| 92 |
+
self.patch_w = patch_w
|
| 93 |
+
self.patch_h = patch_h
|
| 94 |
+
|
| 95 |
+
self.pos_type = pos_type
|
| 96 |
+
|
| 97 |
+
self.patch_embed = PatchEmbed_v2(patch_h, patch_w, embed_dim)
|
| 98 |
+
self.mask_embed = nn.Parameter(torch.zeros(1, 1, self.embed_dim))
|
| 99 |
+
|
| 100 |
+
# hack
|
| 101 |
+
self.nprompt = nprompt
|
| 102 |
+
if self.nprompt > 0:
|
| 103 |
+
self.prompt_embed = nn.Parameter(torch.zeros(1, self.nprompt, self.embed_dim))
|
| 104 |
+
trunc_normal_(self.prompt_embed, std=.02)
|
| 105 |
+
|
| 106 |
+
num_patches = get_num_patches(spec_h, spec_w, patch_h, patch_w)
|
| 107 |
+
self.num_patches = num_patches
|
| 108 |
+
|
| 109 |
+
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
|
| 110 |
+
self.pos_drop = nn.Dropout(p=drop_rate)
|
| 111 |
+
|
| 112 |
+
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
|
| 113 |
+
self.blocks = nn.ModuleList([
|
| 114 |
+
Block(
|
| 115 |
+
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
|
| 116 |
+
drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
|
| 117 |
+
for i in range(depth)])
|
| 118 |
+
self.norm_frame = norm_layer(embed_dim)
|
| 119 |
+
|
| 120 |
+
trunc_normal_(self.pos_embed, std=.02)
|
| 121 |
+
trunc_normal_(self.mask_embed, std=.02)
|
| 122 |
+
self.apply(self._init_weights)
|
| 123 |
+
|
| 124 |
+
def _init_weights(self, m):
|
| 125 |
+
if isinstance(m, nn.Linear):
|
| 126 |
+
trunc_normal_(m.weight, std=.02)
|
| 127 |
+
if isinstance(m, nn.Linear) and m.bias is not None:
|
| 128 |
+
nn.init.constant_(m.bias, 0)
|
| 129 |
+
elif isinstance(m, nn.LayerNorm):
|
| 130 |
+
nn.init.constant_(m.bias, 0)
|
| 131 |
+
nn.init.constant_(m.weight, 1.0)
|
| 132 |
+
|
| 133 |
+
def prepare_tokens(self, x, mask_index, length, mask=True):
|
| 134 |
+
B, nc, h, w = x.shape
|
| 135 |
+
mel_patches, x, patch_length = self.patch_embed(x, length) # patch linear embedding
|
| 136 |
+
B, T, C = x.shape
|
| 137 |
+
|
| 138 |
+
if (mask_index is not None) and mask:
|
| 139 |
+
mask_index_expand = mask_index.unsqueeze(2).expand(B, T, self.embed_dim).float()
|
| 140 |
+
x = (1 - mask_index_expand) * x + mask_index_expand * self.mask_embed.expand(B, T, C)
|
| 141 |
+
|
| 142 |
+
# add positional encoding to each token
|
| 143 |
+
if self.pos_type == "cut":
|
| 144 |
+
pos = self.pos_embed[:, 1:T + 1, :].expand(B, -1, -1)
|
| 145 |
+
x = x + pos
|
| 146 |
+
else:
|
| 147 |
+
pos = self.interpolate_pos_encoding(x, h, w)
|
| 148 |
+
x = x + pos[:, 1:]
|
| 149 |
+
|
| 150 |
+
# pos = self.pos_embed[:,1:T+1,:].expand(B,-1,-1)
|
| 151 |
+
# x = x + pos
|
| 152 |
+
|
| 153 |
+
return self.pos_drop(x), pos, mel_patches, h, w, patch_length
|
| 154 |
+
|
| 155 |
+
def forward(self, x, mask_index=None, mask_input=True, length=None):
|
| 156 |
+
x, pos, mel_patches, h, w, patch_length = self.prepare_tokens(x, mask_index, length, mask_input)
|
| 157 |
+
|
| 158 |
+
length_mask = torch.arange(mel_patches.shape[1]).to(x.device) < patch_length.unsqueeze(1)
|
| 159 |
+
length_mask = length_mask.to(x.device)
|
| 160 |
+
mask_index = mask_index & length_mask
|
| 161 |
+
|
| 162 |
+
if self.nprompt > 0:
|
| 163 |
+
x = torch.cat([self.prompt_embed.expand(x.shape[0], -1, -1), x], dim=1)
|
| 164 |
+
|
| 165 |
+
for i, blk in enumerate(self.blocks):
|
| 166 |
+
x = blk(x, patch_length + self.nprompt)
|
| 167 |
+
|
| 168 |
+
frame_repr = self.norm_frame(x)
|
| 169 |
+
|
| 170 |
+
return frame_repr[:, self.nprompt:][mask_index]
|
| 171 |
+
|
| 172 |
+
def interpolate_pos_encoding(self, x, h, w):
|
| 173 |
+
npatch = x.shape[1] - 1
|
| 174 |
+
N = self.pos_embed.shape[1] - 1
|
| 175 |
+
if npatch == N and w == self.spec_w and h == self.spec_h:
|
| 176 |
+
return self.pos_embed
|
| 177 |
+
class_pos_embed = self.pos_embed[:, 0]
|
| 178 |
+
patch_pos_embed = self.pos_embed[:, 1:]
|
| 179 |
+
dim = x.shape[-1]
|
| 180 |
+
w0 = w // self.patch_embed.patch_width
|
| 181 |
+
h0 = h // self.patch_embed.patch_height
|
| 182 |
+
# we add a small number to avoid floating point error in the interpolation
|
| 183 |
+
# see discussion at https://github.com/facebookresearch/dino/issues/8
|
| 184 |
+
w0, h0 = w0 + 0.1, h0 + 0.1
|
| 185 |
+
patch_pos_embed = nn.functional.interpolate(
|
| 186 |
+
patch_pos_embed.reshape(1, self.spec_h // self.patch_h, self.spec_w // self.patch_w, dim).permute(0, 3, 1,
|
| 187 |
+
2),
|
| 188 |
+
scale_factor=(h0 / (self.spec_h // self.patch_h), w0 / (self.spec_w // self.patch_w)),
|
| 189 |
+
mode='bicubic',
|
| 190 |
+
)
|
| 191 |
+
assert int(h0) == patch_pos_embed.shape[-2] and int(w0) == patch_pos_embed.shape[-1]
|
| 192 |
+
patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
|
| 193 |
+
return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1)
|
| 194 |
+
|
| 195 |
+
def get_last_selfattention(self, x):
|
| 196 |
+
x, _, _, _, _, _ = self.prepare_tokens(x, mask_index=None, length=None, mask=False)
|
| 197 |
+
atts = []
|
| 198 |
+
for i, blk in enumerate(self.blocks):
|
| 199 |
+
if i < len(self.blocks) - 1:
|
| 200 |
+
x, att = blk(x, return_attention=True)
|
| 201 |
+
atts.append(att)
|
| 202 |
+
else:
|
| 203 |
+
x, att = blk(x, return_attention=True)
|
| 204 |
+
atts.append(att)
|
| 205 |
+
return atts
|
| 206 |
+
# return attention of the last block
|
| 207 |
+
|
| 208 |
+
def get_intermediate_layers(self, x, length, n=1, scene=True, other_emb=None):
|
| 209 |
+
x, _, _, _, _, patch_length = self.prepare_tokens(x, mask_index=None, length=length, mask=False)
|
| 210 |
+
# we return the output tokens from the `n` last blocks
|
| 211 |
+
if other_emb is not None:
|
| 212 |
+
x = torch.cat([other_emb, x], dim=1)
|
| 213 |
+
output = []
|
| 214 |
+
if self.nprompt > 0:
|
| 215 |
+
x = torch.cat([self.prompt_embed.expand(x.shape[0], -1, -1), x], dim=1)
|
| 216 |
+
for i, blk in enumerate(self.blocks):
|
| 217 |
+
x = blk(x, patch_length + self.nprompt)
|
| 218 |
+
if len(self.blocks) - i <= n:
|
| 219 |
+
norm_x = self.norm_frame(x)
|
| 220 |
+
if scene:
|
| 221 |
+
length_mask = torch.arange(x.shape[1] - self.nprompt).to(x.device) < patch_length.unsqueeze(1)
|
| 222 |
+
avg = torch.sum(norm_x[:, self.nprompt:] * length_mask.unsqueeze(-1), dim=1) / (
|
| 223 |
+
patch_length.unsqueeze(-1) + 1e-6)
|
| 224 |
+
negative = (~length_mask) * -1e10
|
| 225 |
+
# max = torch.max(norm_x[:,self.nprompt:]+negative.unsqueeze(-1),1).values
|
| 226 |
+
output.append(avg)
|
| 227 |
+
if self.nprompt > 0:
|
| 228 |
+
output.append(torch.mean(norm_x[:, :self.nprompt], dim=1))
|
| 229 |
+
else:
|
| 230 |
+
output.append(norm_x[:, self.nprompt:])
|
| 231 |
+
|
| 232 |
+
return torch.cat(output, dim=-1)
|
| 233 |
+
|
| 234 |
+
|
| 235 |
+
def get_cls_avg(output_i, cur_len, use_cls):
|
| 236 |
+
length_mask = torch.arange(output_i[0].shape[1]).to(output_i[0].device) < cur_len.unsqueeze(1)
|
| 237 |
+
cls = [torch.zeros_like(x[:, 0]) for x in output_i]
|
| 238 |
+
avg = [torch.sum(x * length_mask.unsqueeze(-1), dim=1) / (cur_len.unsqueeze(1) + 1e-6) for x in output_i]
|
| 239 |
+
return cls, avg
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
def FrameASTModel(patch_h=64, patch_w=4, atst_dropout=0.1, **kwargs):
|
| 243 |
+
return FrameAST(
|
| 244 |
+
patch_h=patch_h,
|
| 245 |
+
patch_w=patch_w,
|
| 246 |
+
embed_dim=768,
|
| 247 |
+
depth=12,
|
| 248 |
+
num_heads=12,
|
| 249 |
+
qkv_bias=False,
|
| 250 |
+
norm_layer=partial(nn.LayerNorm, eps=1e-6),
|
| 251 |
+
drop_path_rate=atst_dropout,
|
| 252 |
+
drop_rate=atst_dropout,
|
| 253 |
+
**kwargs)
|
models/atstframe/transformer.py
ADDED
|
@@ -0,0 +1,112 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import torch
|
| 2 |
+
import torch.nn as nn
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
def drop_path(x, drop_prob: float = 0., training: bool = False):
|
| 6 |
+
if drop_prob == 0. or not training:
|
| 7 |
+
return x
|
| 8 |
+
keep_prob = 1 - drop_prob
|
| 9 |
+
shape = (x.shape[0],) + (1,) * (x.ndim - 1) # work with diff dim tensors, not just 2D ConvNets
|
| 10 |
+
random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
|
| 11 |
+
random_tensor.floor_() # binarize
|
| 12 |
+
output = x.div(keep_prob) * random_tensor
|
| 13 |
+
return output
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
class DropPath(nn.Module):
|
| 17 |
+
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
def __init__(self, drop_prob=None):
|
| 21 |
+
super(DropPath, self).__init__()
|
| 22 |
+
self.drop_prob = drop_prob
|
| 23 |
+
|
| 24 |
+
def forward(self, x):
|
| 25 |
+
return drop_path(x, self.drop_prob, self.training)
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
class Mlp(nn.Module):
|
| 29 |
+
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
|
| 30 |
+
super().__init__()
|
| 31 |
+
out_features = out_features or in_features
|
| 32 |
+
hidden_features = hidden_features or in_features
|
| 33 |
+
self.fc1 = nn.Linear(in_features, hidden_features)
|
| 34 |
+
self.act = act_layer()
|
| 35 |
+
self.fc2 = nn.Linear(hidden_features, out_features)
|
| 36 |
+
self.drop = nn.Dropout(drop)
|
| 37 |
+
|
| 38 |
+
def forward(self, x):
|
| 39 |
+
x = self.fc1(x)
|
| 40 |
+
x = self.act(x)
|
| 41 |
+
x = self.drop(x)
|
| 42 |
+
x = self.fc2(x)
|
| 43 |
+
x = self.drop(x)
|
| 44 |
+
return x
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
class Attention(nn.Module):
|
| 48 |
+
def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
|
| 49 |
+
super().__init__()
|
| 50 |
+
self.num_heads = num_heads
|
| 51 |
+
head_dim = dim // num_heads
|
| 52 |
+
self.scale = qk_scale or head_dim ** -0.5
|
| 53 |
+
|
| 54 |
+
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
|
| 55 |
+
self.attn_drop = nn.Dropout(attn_drop)
|
| 56 |
+
self.proj = nn.Linear(dim, dim)
|
| 57 |
+
self.proj_drop = nn.Dropout(proj_drop)
|
| 58 |
+
|
| 59 |
+
def forward(self, x, mask):
|
| 60 |
+
B, N, C = x.shape
|
| 61 |
+
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
|
| 62 |
+
q, k, v = qkv[0], qkv[1], qkv[2]
|
| 63 |
+
attn = (q @ k.transpose(-2, -1)) * self.scale
|
| 64 |
+
if mask is not None:
|
| 65 |
+
attn += mask
|
| 66 |
+
|
| 67 |
+
attn = attn.softmax(dim=-1)
|
| 68 |
+
attn = self.attn_drop(attn)
|
| 69 |
+
|
| 70 |
+
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
|
| 71 |
+
x = self.proj(x)
|
| 72 |
+
x = self.proj_drop(x)
|
| 73 |
+
return x, attn
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
class Block(nn.Module):
|
| 77 |
+
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
|
| 78 |
+
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
|
| 79 |
+
super().__init__()
|
| 80 |
+
self.norm1 = norm_layer(dim)
|
| 81 |
+
self.attn = Attention(
|
| 82 |
+
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
|
| 83 |
+
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
|
| 84 |
+
self.norm2 = norm_layer(dim)
|
| 85 |
+
mlp_hidden_dim = int(dim * mlp_ratio)
|
| 86 |
+
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
|
| 87 |
+
|
| 88 |
+
def forward(self, x, length=None, return_attention=False):
|
| 89 |
+
|
| 90 |
+
# if length is not None:
|
| 91 |
+
# print(length)
|
| 92 |
+
# mask_att = get_attention_mask(x,length)
|
| 93 |
+
# else:
|
| 94 |
+
mask_att = None
|
| 95 |
+
|
| 96 |
+
y, attn = self.attn(self.norm1(x), mask_att)
|
| 97 |
+
x = x + self.drop_path(y)
|
| 98 |
+
x = x + self.drop_path(self.mlp(self.norm2(x)))
|
| 99 |
+
if return_attention:
|
| 100 |
+
return x, attn
|
| 101 |
+
else:
|
| 102 |
+
return x
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def get_attention_mask(x, length):
|
| 106 |
+
batch_size, max_len, _ = x.shape
|
| 107 |
+
# create mask for padded elements and zero-out them
|
| 108 |
+
mask = torch.arange(max_len, device=length.device).expand(batch_size, max_len) >= length[:, None]
|
| 109 |
+
# extend the mask to attention shape and set weight
|
| 110 |
+
mask = -10000.0 * mask[:, None, None, :]
|
| 111 |
+
mask = mask.expand(batch_size, 1, max_len, max_len).to(x.device)
|
| 112 |
+
return mask
|
models/beats/BEATs.py
ADDED
|
@@ -0,0 +1,183 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# --------------------------------------------------------
|
| 2 |
+
# BEATs: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)
|
| 3 |
+
# Github source: https://github.com/microsoft/unilm/tree/master/beats
|
| 4 |
+
# Copyright (c) 2022 Microsoft
|
| 5 |
+
# Licensed under The MIT License [see LICENSE for details]
|
| 6 |
+
# Based on fairseq code bases
|
| 7 |
+
# https://github.com/pytorch/fairseq
|
| 8 |
+
# --------------------------------------------------------
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
import torch
|
| 12 |
+
import torch.nn as nn
|
| 13 |
+
from torch.nn import LayerNorm
|
| 14 |
+
import torchaudio.compliance.kaldi as ta_kaldi
|
| 15 |
+
|
| 16 |
+
from models.beats.backbone import (
|
| 17 |
+
TransformerEncoder,
|
| 18 |
+
)
|
| 19 |
+
|
| 20 |
+
import logging
|
| 21 |
+
from typing import Optional
|
| 22 |
+
|
| 23 |
+
logger = logging.getLogger(__name__)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
class BEATsConfig:
|
| 27 |
+
def __init__(self, cfg=None):
|
| 28 |
+
self.input_patch_size: int = 16 # path size of patch embedding
|
| 29 |
+
self.embed_dim: int = 512 # patch embedding dimension
|
| 30 |
+
self.conv_bias: bool = False # include bias in conv encoder
|
| 31 |
+
|
| 32 |
+
self.encoder_layers: int = 12 # num encoder layers in the transformer
|
| 33 |
+
self.encoder_embed_dim: int = 768 # encoder embedding dimension
|
| 34 |
+
self.encoder_ffn_embed_dim: int = 3072 # encoder embedding dimension for FFN
|
| 35 |
+
self.encoder_attention_heads: int = 12 # num encoder attention heads
|
| 36 |
+
self.activation_fn: str = "gelu" # activation function to use
|
| 37 |
+
|
| 38 |
+
self.layer_wise_gradient_decay_ratio: float = 1.0 # ratio for layer-wise gradient decay
|
| 39 |
+
self.layer_norm_first: bool = False # apply layernorm first in the transformer
|
| 40 |
+
self.deep_norm: bool = True # apply deep_norm first in the transformer
|
| 41 |
+
|
| 42 |
+
# dropouts
|
| 43 |
+
self.dropout: float = 0.1 # dropout probability for the transformer
|
| 44 |
+
self.attention_dropout: float = 0.1 # dropout probability for attention weights
|
| 45 |
+
self.activation_dropout: float = 0.0 # dropout probability after activation in FFN
|
| 46 |
+
self.encoder_layerdrop: float = 0.05 # probability of dropping a tarnsformer layer
|
| 47 |
+
self.dropout_input: float = 0.1 # dropout to apply to the input (after feat extr)
|
| 48 |
+
|
| 49 |
+
# positional embeddings
|
| 50 |
+
self.conv_pos: int = 128 # number of filters for convolutional positional embeddings
|
| 51 |
+
self.conv_pos_groups: int = 16 # number of groups for convolutional positional embedding
|
| 52 |
+
|
| 53 |
+
# relative position embedding
|
| 54 |
+
self.relative_position_embedding: bool = True # apply relative position embedding
|
| 55 |
+
self.num_buckets: int = 320 # number of buckets for relative position embedding
|
| 56 |
+
self.max_distance: int = 800 # maximum distance for relative position embedding
|
| 57 |
+
self.gru_rel_pos: bool = True # apply gated relative position embedding
|
| 58 |
+
|
| 59 |
+
# label predictor
|
| 60 |
+
self.finetuned_model: bool = False # whether the model is a fine-tuned model.
|
| 61 |
+
self.predictor_dropout: float = 0.1 # dropout probability for the predictor
|
| 62 |
+
self.predictor_class: int = 527 # target class number for the predictor
|
| 63 |
+
|
| 64 |
+
if cfg is not None:
|
| 65 |
+
self.update(cfg)
|
| 66 |
+
|
| 67 |
+
def update(self, cfg: dict):
|
| 68 |
+
self.__dict__.update(cfg)
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
class BEATs(nn.Module):
|
| 72 |
+
def __init__(
|
| 73 |
+
self,
|
| 74 |
+
cfg: BEATsConfig,
|
| 75 |
+
) -> None:
|
| 76 |
+
super().__init__()
|
| 77 |
+
logger.info(f"BEATs Config: {cfg.__dict__}")
|
| 78 |
+
|
| 79 |
+
self.cfg = cfg
|
| 80 |
+
|
| 81 |
+
self.embed = cfg.embed_dim
|
| 82 |
+
self.post_extract_proj = (
|
| 83 |
+
nn.Linear(self.embed, cfg.encoder_embed_dim)
|
| 84 |
+
if self.embed != cfg.encoder_embed_dim
|
| 85 |
+
else None
|
| 86 |
+
)
|
| 87 |
+
|
| 88 |
+
self.input_patch_size = cfg.input_patch_size
|
| 89 |
+
self.patch_embedding = nn.Conv2d(1, self.embed, kernel_size=self.input_patch_size, stride=self.input_patch_size,
|
| 90 |
+
bias=cfg.conv_bias)
|
| 91 |
+
|
| 92 |
+
self.dropout_input = nn.Dropout(cfg.dropout_input)
|
| 93 |
+
|
| 94 |
+
assert not cfg.deep_norm or not cfg.layer_norm_first
|
| 95 |
+
self.encoder = TransformerEncoder(cfg)
|
| 96 |
+
self.layer_norm = LayerNorm(self.embed)
|
| 97 |
+
|
| 98 |
+
if cfg.finetuned_model:
|
| 99 |
+
self.predictor_dropout = nn.Dropout(cfg.predictor_dropout)
|
| 100 |
+
self.predictor = nn.Linear(cfg.encoder_embed_dim, cfg.predictor_class)
|
| 101 |
+
else:
|
| 102 |
+
self.predictor = None
|
| 103 |
+
|
| 104 |
+
def forward_padding_mask(
|
| 105 |
+
self,
|
| 106 |
+
features: torch.Tensor,
|
| 107 |
+
padding_mask: torch.Tensor,
|
| 108 |
+
) -> torch.Tensor:
|
| 109 |
+
extra = padding_mask.size(1) % features.size(1)
|
| 110 |
+
if extra > 0:
|
| 111 |
+
padding_mask = padding_mask[:, :-extra]
|
| 112 |
+
padding_mask = padding_mask.view(
|
| 113 |
+
padding_mask.size(0), features.size(1), -1
|
| 114 |
+
)
|
| 115 |
+
padding_mask = padding_mask.all(-1)
|
| 116 |
+
return padding_mask
|
| 117 |
+
|
| 118 |
+
def preprocess(
|
| 119 |
+
self,
|
| 120 |
+
source: torch.Tensor,
|
| 121 |
+
fbank_mean: float = 15.41663,
|
| 122 |
+
fbank_std: float = 6.55582,
|
| 123 |
+
) -> torch.Tensor:
|
| 124 |
+
fbanks = []
|
| 125 |
+
for waveform in source:
|
| 126 |
+
waveform = waveform.unsqueeze(0) * 2 ** 15
|
| 127 |
+
fbank = ta_kaldi.fbank(waveform, num_mel_bins=128, sample_frequency=16000, frame_length=25, frame_shift=10)
|
| 128 |
+
fbanks.append(fbank)
|
| 129 |
+
fbank = torch.stack(fbanks, dim=0)
|
| 130 |
+
fbank = (fbank - fbank_mean) / (2 * fbank_std)
|
| 131 |
+
return fbank
|
| 132 |
+
|
| 133 |
+
def extract_features(
|
| 134 |
+
self,
|
| 135 |
+
source: torch.Tensor,
|
| 136 |
+
padding_mask: Optional[torch.Tensor] = None,
|
| 137 |
+
fbank_mean: float = 15.41663,
|
| 138 |
+
fbank_std: float = 6.55582,
|
| 139 |
+
do_preprocess: bool = True,
|
| 140 |
+
):
|
| 141 |
+
if do_preprocess:
|
| 142 |
+
fbank = self.preprocess(source, fbank_mean=fbank_mean, fbank_std=fbank_std)
|
| 143 |
+
|
| 144 |
+
if padding_mask is not None:
|
| 145 |
+
padding_mask = self.forward_padding_mask(fbank, padding_mask)
|
| 146 |
+
|
| 147 |
+
fbank = fbank.unsqueeze(1)
|
| 148 |
+
else:
|
| 149 |
+
fbank = source
|
| 150 |
+
features = self.patch_embedding(fbank)
|
| 151 |
+
features = features.reshape(features.shape[0], features.shape[1], -1)
|
| 152 |
+
features = features.transpose(1, 2)
|
| 153 |
+
features = self.layer_norm(features)
|
| 154 |
+
|
| 155 |
+
if padding_mask is not None:
|
| 156 |
+
padding_mask = self.forward_padding_mask(features, padding_mask)
|
| 157 |
+
|
| 158 |
+
if self.post_extract_proj is not None:
|
| 159 |
+
features = self.post_extract_proj(features)
|
| 160 |
+
|
| 161 |
+
x = self.dropout_input(features)
|
| 162 |
+
|
| 163 |
+
x, layer_results = self.encoder(
|
| 164 |
+
x,
|
| 165 |
+
padding_mask=padding_mask,
|
| 166 |
+
)
|
| 167 |
+
|
| 168 |
+
if self.predictor is not None:
|
| 169 |
+
x = self.predictor_dropout(x)
|
| 170 |
+
logits = self.predictor(x)
|
| 171 |
+
|
| 172 |
+
if padding_mask is not None and padding_mask.any():
|
| 173 |
+
logits[padding_mask] = 0
|
| 174 |
+
logits = logits.sum(dim=1)
|
| 175 |
+
logits = logits / (~padding_mask).sum(dim=1).unsqueeze(-1).expand_as(logits)
|
| 176 |
+
else:
|
| 177 |
+
logits = logits.mean(dim=1)
|
| 178 |
+
|
| 179 |
+
lprobs = torch.sigmoid(logits)
|
| 180 |
+
|
| 181 |
+
return lprobs, padding_mask
|
| 182 |
+
else:
|
| 183 |
+
return x, padding_mask
|
models/beats/BEATs_wrapper.py
ADDED
|
@@ -0,0 +1,56 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import torch
|
| 2 |
+
|
| 3 |
+
from models.beats.BEATs import BEATsConfig, BEATs
|
| 4 |
+
from models.transformer_wrapper import BaseModelWrapper
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
class BEATsWrapper(BaseModelWrapper):
|
| 8 |
+
def __init__(self):
|
| 9 |
+
super().__init__()
|
| 10 |
+
cfg = BEATsConfig()
|
| 11 |
+
self.beats = BEATs(cfg)
|
| 12 |
+
|
| 13 |
+
def mel_forward(self, x):
|
| 14 |
+
with torch.autocast(device_type="cuda", enabled=False):
|
| 15 |
+
mel = self.beats.preprocess(x)
|
| 16 |
+
mel = mel.unsqueeze(1).transpose(2, 3)
|
| 17 |
+
return mel
|
| 18 |
+
|
| 19 |
+
def forward(self, x):
|
| 20 |
+
x = x.transpose(2, 3)
|
| 21 |
+
features = self.beats.extract_features(x, do_preprocess=False)[0]
|
| 22 |
+
return features
|
| 23 |
+
|
| 24 |
+
def separate_params(self):
|
| 25 |
+
pt_params = [[], [], [], [], [], [], [], [], [], [], [], []]
|
| 26 |
+
for k, p in self.named_parameters():
|
| 27 |
+
if ".layers.0." in k:
|
| 28 |
+
pt_params[0].append(p)
|
| 29 |
+
elif ".layers.1." in k:
|
| 30 |
+
pt_params[1].append(p)
|
| 31 |
+
elif ".layers.2." in k:
|
| 32 |
+
pt_params[2].append(p)
|
| 33 |
+
elif ".layers.3." in k:
|
| 34 |
+
pt_params[3].append(p)
|
| 35 |
+
elif ".layers.4." in k:
|
| 36 |
+
pt_params[4].append(p)
|
| 37 |
+
elif ".layers.5." in k:
|
| 38 |
+
pt_params[5].append(p)
|
| 39 |
+
elif ".layers.6." in k:
|
| 40 |
+
pt_params[6].append(p)
|
| 41 |
+
elif ".layers.7." in k:
|
| 42 |
+
pt_params[7].append(p)
|
| 43 |
+
elif ".layers.8." in k:
|
| 44 |
+
pt_params[8].append(p)
|
| 45 |
+
elif ".layers.9." in k:
|
| 46 |
+
pt_params[9].append(p)
|
| 47 |
+
elif ".layers.10." in k:
|
| 48 |
+
pt_params[10].append(p)
|
| 49 |
+
elif ".layers.11." in k:
|
| 50 |
+
pt_params[11].append(p)
|
| 51 |
+
elif (".post_extract_proj." in k or ".patch_embedding." in k or '.pos_conv.' in k
|
| 52 |
+
or 'beats.layer_norm.' in k or "beats.encoder.layer_norm." in k):
|
| 53 |
+
pt_params[0].append(p)
|
| 54 |
+
else:
|
| 55 |
+
raise ValueError(f"Check separate params for BEATs! Unknown key: {k}")
|
| 56 |
+
return list(reversed(pt_params))
|
models/beats/Tokenizers.py
ADDED
|
@@ -0,0 +1,172 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# --------------------------------------------------------
|
| 2 |
+
# BEATs: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)
|
| 3 |
+
# Github source: https://github.com/microsoft/unilm/tree/master/beats
|
| 4 |
+
# Copyright (c) 2022 Microsoft
|
| 5 |
+
# Licensed under The MIT License [see LICENSE for details]
|
| 6 |
+
# Based on fairseq code bases
|
| 7 |
+
# https://github.com/pytorch/fairseq
|
| 8 |
+
# --------------------------------------------------------
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
import torch
|
| 12 |
+
import torch.nn as nn
|
| 13 |
+
from torch.nn import LayerNorm
|
| 14 |
+
import torchaudio.compliance.kaldi as ta_kaldi
|
| 15 |
+
|
| 16 |
+
from backbone import (
|
| 17 |
+
TransformerEncoder,
|
| 18 |
+
)
|
| 19 |
+
from quantizer import (
|
| 20 |
+
NormEMAVectorQuantizer,
|
| 21 |
+
)
|
| 22 |
+
|
| 23 |
+
import logging
|
| 24 |
+
from typing import Optional
|
| 25 |
+
|
| 26 |
+
logger = logging.getLogger(__name__)
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
class TokenizersConfig:
|
| 30 |
+
def __init__(self, cfg=None):
|
| 31 |
+
self.input_patch_size: int = -1 # path size of patch embedding
|
| 32 |
+
self.embed_dim: int = 512 # patch embedding dimension
|
| 33 |
+
self.conv_bias: bool = False # include bias in conv encoder
|
| 34 |
+
|
| 35 |
+
self.encoder_layers: int = 12 # num encoder layers in the transformer
|
| 36 |
+
self.encoder_embed_dim: int = 768 # encoder embedding dimension
|
| 37 |
+
self.encoder_ffn_embed_dim: int = 3072 # encoder embedding dimension for FFN
|
| 38 |
+
self.encoder_attention_heads: int = 12 # num encoder attention heads
|
| 39 |
+
self.activation_fn: str = "gelu" # activation function to use
|
| 40 |
+
|
| 41 |
+
self.layer_norm_first: bool = False # apply layernorm first in the transformer
|
| 42 |
+
self.deep_norm: bool = False # apply deep_norm first in the transformer
|
| 43 |
+
|
| 44 |
+
# dropouts
|
| 45 |
+
self.dropout: float = 0.1 # dropout probability for the transformer
|
| 46 |
+
self.attention_dropout: float = 0.1 # dropout probability for attention weights
|
| 47 |
+
self.activation_dropout: float = 0.0 # dropout probability after activation in FFN
|
| 48 |
+
self.encoder_layerdrop: float = 0.0 # probability of dropping a tarnsformer layer
|
| 49 |
+
self.dropout_input: float = 0.0 # dropout to apply to the input (after feat extr)
|
| 50 |
+
|
| 51 |
+
# positional embeddings
|
| 52 |
+
self.conv_pos: int = 128 # number of filters for convolutional positional embeddings
|
| 53 |
+
self.conv_pos_groups: int = 16 # number of groups for convolutional positional embedding
|
| 54 |
+
|
| 55 |
+
# relative position embedding
|
| 56 |
+
self.relative_position_embedding: bool = False # apply relative position embedding
|
| 57 |
+
self.num_buckets: int = 320 # number of buckets for relative position embedding
|
| 58 |
+
self.max_distance: int = 1280 # maximum distance for relative position embedding
|
| 59 |
+
self.gru_rel_pos: bool = False # apply gated relative position embedding
|
| 60 |
+
|
| 61 |
+
# quantizer
|
| 62 |
+
self.quant_n: int = 1024 # codebook number in quantizer
|
| 63 |
+
self.quant_dim: int = 256 # codebook dimension in quantizer
|
| 64 |
+
|
| 65 |
+
if cfg is not None:
|
| 66 |
+
self.update(cfg)
|
| 67 |
+
|
| 68 |
+
def update(self, cfg: dict):
|
| 69 |
+
self.__dict__.update(cfg)
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
class Tokenizers(nn.Module):
|
| 73 |
+
def __init__(
|
| 74 |
+
self,
|
| 75 |
+
cfg: TokenizersConfig,
|
| 76 |
+
) -> None:
|
| 77 |
+
super().__init__()
|
| 78 |
+
logger.info(f"Tokenizers Config: {cfg.__dict__}")
|
| 79 |
+
|
| 80 |
+
self.cfg = cfg
|
| 81 |
+
|
| 82 |
+
self.embed = cfg.embed_dim
|
| 83 |
+
self.post_extract_proj = (
|
| 84 |
+
nn.Linear(self.embed, cfg.encoder_embed_dim)
|
| 85 |
+
if self.embed != cfg.encoder_embed_dim
|
| 86 |
+
else None
|
| 87 |
+
)
|
| 88 |
+
|
| 89 |
+
self.input_patch_size = cfg.input_patch_size
|
| 90 |
+
self.patch_embedding = nn.Conv2d(1, self.embed, kernel_size=self.input_patch_size, stride=self.input_patch_size,
|
| 91 |
+
bias=cfg.conv_bias)
|
| 92 |
+
|
| 93 |
+
self.dropout_input = nn.Dropout(cfg.dropout_input)
|
| 94 |
+
|
| 95 |
+
assert not cfg.deep_norm or not cfg.layer_norm_first
|
| 96 |
+
self.encoder = TransformerEncoder(cfg)
|
| 97 |
+
self.layer_norm = LayerNorm(self.embed)
|
| 98 |
+
|
| 99 |
+
self.quantize = NormEMAVectorQuantizer(
|
| 100 |
+
n_embed=cfg.quant_n, embedding_dim=cfg.quant_dim, beta=1.0, kmeans_init=True, decay=0.99,
|
| 101 |
+
)
|
| 102 |
+
self.quant_n = cfg.quant_n
|
| 103 |
+
self.quantize_layer = nn.Sequential(
|
| 104 |
+
nn.Linear(cfg.encoder_embed_dim, cfg.encoder_embed_dim),
|
| 105 |
+
nn.Tanh(),
|
| 106 |
+
nn.Linear(cfg.encoder_embed_dim, cfg.quant_dim) # for quantize
|
| 107 |
+
)
|
| 108 |
+
|
| 109 |
+
def forward_padding_mask(
|
| 110 |
+
self,
|
| 111 |
+
features: torch.Tensor,
|
| 112 |
+
padding_mask: torch.Tensor,
|
| 113 |
+
) -> torch.Tensor:
|
| 114 |
+
extra = padding_mask.size(1) % features.size(1)
|
| 115 |
+
if extra > 0:
|
| 116 |
+
padding_mask = padding_mask[:, :-extra]
|
| 117 |
+
padding_mask = padding_mask.view(
|
| 118 |
+
padding_mask.size(0), features.size(1), -1
|
| 119 |
+
)
|
| 120 |
+
padding_mask = padding_mask.all(-1)
|
| 121 |
+
return padding_mask
|
| 122 |
+
|
| 123 |
+
def preprocess(
|
| 124 |
+
self,
|
| 125 |
+
source: torch.Tensor,
|
| 126 |
+
fbank_mean: float = 15.41663,
|
| 127 |
+
fbank_std: float = 6.55582,
|
| 128 |
+
) -> torch.Tensor:
|
| 129 |
+
fbanks = []
|
| 130 |
+
for waveform in source:
|
| 131 |
+
waveform = waveform.unsqueeze(0) * 2 ** 15
|
| 132 |
+
fbank = ta_kaldi.fbank(waveform, num_mel_bins=128, sample_frequency=16000, frame_length=25, frame_shift=10)
|
| 133 |
+
fbanks.append(fbank)
|
| 134 |
+
fbank = torch.stack(fbanks, dim=0)
|
| 135 |
+
fbank = (fbank - fbank_mean) / (2 * fbank_std)
|
| 136 |
+
return fbank
|
| 137 |
+
|
| 138 |
+
def extract_labels(
|
| 139 |
+
self,
|
| 140 |
+
source: torch.Tensor,
|
| 141 |
+
padding_mask: Optional[torch.Tensor] = None,
|
| 142 |
+
fbank_mean: float = 15.41663,
|
| 143 |
+
fbank_std: float = 6.55582,
|
| 144 |
+
):
|
| 145 |
+
fbank = self.preprocess(source, fbank_mean=fbank_mean, fbank_std=fbank_std)
|
| 146 |
+
|
| 147 |
+
if padding_mask is not None:
|
| 148 |
+
padding_mask = self.forward_padding_mask(fbank, padding_mask)
|
| 149 |
+
|
| 150 |
+
fbank = fbank.unsqueeze(1)
|
| 151 |
+
features = self.patch_embedding(fbank)
|
| 152 |
+
features = features.reshape(features.shape[0], features.shape[1], -1)
|
| 153 |
+
features = features.transpose(1, 2)
|
| 154 |
+
features = self.layer_norm(features)
|
| 155 |
+
|
| 156 |
+
if padding_mask is not None:
|
| 157 |
+
padding_mask = self.forward_padding_mask(features, padding_mask)
|
| 158 |
+
|
| 159 |
+
if self.post_extract_proj is not None:
|
| 160 |
+
features = self.post_extract_proj(features)
|
| 161 |
+
|
| 162 |
+
x = self.dropout_input(features)
|
| 163 |
+
|
| 164 |
+
x, layer_results = self.encoder(
|
| 165 |
+
x,
|
| 166 |
+
padding_mask=padding_mask,
|
| 167 |
+
)
|
| 168 |
+
|
| 169 |
+
quantize_input = self.quantize_layer(x)
|
| 170 |
+
quantize_feature, embed_loss, embed_ind = self.quantize(quantize_input)
|
| 171 |
+
|
| 172 |
+
return embed_ind
|
models/beats/backbone.py
ADDED
|
@@ -0,0 +1,783 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# --------------------------------------------------------
|
| 2 |
+
# BEATs: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)
|
| 3 |
+
# Github source: https://github.com/microsoft/unilm/tree/master/beats
|
| 4 |
+
# Copyright (c) 2022 Microsoft
|
| 5 |
+
# Licensed under The MIT License [see LICENSE for details]
|
| 6 |
+
# Based on fairseq code bases
|
| 7 |
+
# https://github.com/pytorch/fairseq
|
| 8 |
+
# --------------------------------------------------------
|
| 9 |
+
|
| 10 |
+
import math
|
| 11 |
+
import numpy as np
|
| 12 |
+
from typing import Dict, Optional, Tuple
|
| 13 |
+
import torch
|
| 14 |
+
from torch import Tensor, nn
|
| 15 |
+
import torch.nn.functional as F
|
| 16 |
+
from torch.nn import LayerNorm, Parameter
|
| 17 |
+
from models.beats.modules import (
|
| 18 |
+
GradMultiply,
|
| 19 |
+
SamePad,
|
| 20 |
+
get_activation_fn,
|
| 21 |
+
GLU_Linear,
|
| 22 |
+
quant_noise,
|
| 23 |
+
)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
class TransformerEncoder(nn.Module):
|
| 27 |
+
def __init__(self, args):
|
| 28 |
+
super().__init__()
|
| 29 |
+
|
| 30 |
+
self.dropout = args.dropout
|
| 31 |
+
self.embedding_dim = args.encoder_embed_dim
|
| 32 |
+
|
| 33 |
+
self.pos_conv = nn.Conv1d(
|
| 34 |
+
self.embedding_dim,
|
| 35 |
+
self.embedding_dim,
|
| 36 |
+
kernel_size=args.conv_pos,
|
| 37 |
+
padding=args.conv_pos // 2,
|
| 38 |
+
groups=args.conv_pos_groups,
|
| 39 |
+
)
|
| 40 |
+
dropout = 0
|
| 41 |
+
std = math.sqrt((4 * (1.0 - dropout)) / (args.conv_pos * self.embedding_dim))
|
| 42 |
+
nn.init.normal_(self.pos_conv.weight, mean=0, std=std)
|
| 43 |
+
nn.init.constant_(self.pos_conv.bias, 0)
|
| 44 |
+
|
| 45 |
+
self.pos_conv = torch.nn.utils.parametrizations.weight_norm(self.pos_conv, name="weight", dim=2)
|
| 46 |
+
self.pos_conv = nn.Sequential(self.pos_conv, SamePad(args.conv_pos), nn.GELU())
|
| 47 |
+
|
| 48 |
+
if hasattr(args, "relative_position_embedding"):
|
| 49 |
+
self.relative_position_embedding = args.relative_position_embedding
|
| 50 |
+
self.num_buckets = args.num_buckets
|
| 51 |
+
self.max_distance = args.max_distance
|
| 52 |
+
else:
|
| 53 |
+
self.relative_position_embedding = False
|
| 54 |
+
self.num_buckets = 0
|
| 55 |
+
self.max_distance = 0
|
| 56 |
+
|
| 57 |
+
self.layers = nn.ModuleList(
|
| 58 |
+
[
|
| 59 |
+
TransformerSentenceEncoderLayer(
|
| 60 |
+
embedding_dim=self.embedding_dim,
|
| 61 |
+
ffn_embedding_dim=args.encoder_ffn_embed_dim,
|
| 62 |
+
num_attention_heads=args.encoder_attention_heads,
|
| 63 |
+
dropout=self.dropout,
|
| 64 |
+
attention_dropout=args.attention_dropout,
|
| 65 |
+
activation_dropout=args.activation_dropout,
|
| 66 |
+
activation_fn=args.activation_fn,
|
| 67 |
+
layer_norm_first=args.layer_norm_first,
|
| 68 |
+
deep_norm=args.deep_norm,
|
| 69 |
+
has_relative_attention_bias=self.relative_position_embedding,
|
| 70 |
+
num_buckets=self.num_buckets,
|
| 71 |
+
max_distance=self.max_distance,
|
| 72 |
+
gru_rel_pos=args.gru_rel_pos,
|
| 73 |
+
encoder_layers=args.encoder_layers,
|
| 74 |
+
)
|
| 75 |
+
for i in range(args.encoder_layers)
|
| 76 |
+
]
|
| 77 |
+
)
|
| 78 |
+
if self.relative_position_embedding:
|
| 79 |
+
for i in range(1, args.encoder_layers):
|
| 80 |
+
del self.layers[i].self_attn.relative_attention_bias
|
| 81 |
+
self.layers[i].self_attn.relative_attention_bias = self.layers[0].self_attn.relative_attention_bias
|
| 82 |
+
|
| 83 |
+
self.layer_norm_first = args.layer_norm_first
|
| 84 |
+
self.layer_norm = LayerNorm(self.embedding_dim)
|
| 85 |
+
self.layerdrop = args.encoder_layerdrop
|
| 86 |
+
|
| 87 |
+
self.apply(init_bert_params)
|
| 88 |
+
|
| 89 |
+
if args.deep_norm:
|
| 90 |
+
deep_norm_beta = math.pow(8 * args.encoder_layers, -1 / 4)
|
| 91 |
+
for i in range(args.encoder_layers):
|
| 92 |
+
nn.init.xavier_normal_(self.layers[i].self_attn.k_proj.weight, gain=1)
|
| 93 |
+
nn.init.xavier_normal_(self.layers[i].self_attn.v_proj.weight, gain=deep_norm_beta)
|
| 94 |
+
nn.init.xavier_normal_(self.layers[i].self_attn.q_proj.weight, gain=1)
|
| 95 |
+
nn.init.xavier_normal_(self.layers[i].self_attn.out_proj.weight, gain=deep_norm_beta)
|
| 96 |
+
nn.init.xavier_normal_(self.layers[i].fc1.weight, gain=deep_norm_beta)
|
| 97 |
+
nn.init.xavier_normal_(self.layers[i].fc2.weight, gain=deep_norm_beta)
|
| 98 |
+
|
| 99 |
+
self.layer_wise_gradient_decay_ratio = getattr(args, "layer_wise_gradient_decay_ratio", 1)
|
| 100 |
+
|
| 101 |
+
def forward(self, x, padding_mask=None, layer=None):
|
| 102 |
+
x, layer_results = self.extract_features(x, padding_mask, layer)
|
| 103 |
+
|
| 104 |
+
if self.layer_norm_first and layer is None:
|
| 105 |
+
x = self.layer_norm(x)
|
| 106 |
+
|
| 107 |
+
return x, layer_results
|
| 108 |
+
|
| 109 |
+
def extract_features(self, x, padding_mask=None, tgt_layer=None):
|
| 110 |
+
|
| 111 |
+
if padding_mask is not None:
|
| 112 |
+
x[padding_mask] = 0
|
| 113 |
+
|
| 114 |
+
x_conv = self.pos_conv(x.transpose(1, 2))
|
| 115 |
+
x_conv = x_conv.transpose(1, 2)
|
| 116 |
+
x = x + x_conv
|
| 117 |
+
|
| 118 |
+
if not self.layer_norm_first:
|
| 119 |
+
x = self.layer_norm(x)
|
| 120 |
+
|
| 121 |
+
x = F.dropout(x, p=self.dropout, training=self.training)
|
| 122 |
+
|
| 123 |
+
# B x T x C -> T x B x C
|
| 124 |
+
x = x.transpose(0, 1)
|
| 125 |
+
|
| 126 |
+
layer_results = []
|
| 127 |
+
z = None
|
| 128 |
+
if tgt_layer is not None:
|
| 129 |
+
layer_results.append((x, z))
|
| 130 |
+
r = None
|
| 131 |
+
pos_bias = None
|
| 132 |
+
for i, layer in enumerate(self.layers):
|
| 133 |
+
if self.layer_wise_gradient_decay_ratio != 1.0:
|
| 134 |
+
x = GradMultiply.apply(x, self.layer_wise_gradient_decay_ratio)
|
| 135 |
+
dropout_probability = np.random.random()
|
| 136 |
+
if not self.training or (dropout_probability > self.layerdrop):
|
| 137 |
+
x, z, pos_bias = layer(x, self_attn_padding_mask=padding_mask, need_weights=False, pos_bias=pos_bias)
|
| 138 |
+
if tgt_layer is not None:
|
| 139 |
+
layer_results.append((x, z))
|
| 140 |
+
if i == tgt_layer:
|
| 141 |
+
r = x
|
| 142 |
+
break
|
| 143 |
+
|
| 144 |
+
if r is not None:
|
| 145 |
+
x = r
|
| 146 |
+
|
| 147 |
+
# T x B x C -> B x T x C
|
| 148 |
+
x = x.transpose(0, 1)
|
| 149 |
+
|
| 150 |
+
return x, layer_results
|
| 151 |
+
|
| 152 |
+
|
| 153 |
+
class TransformerSentenceEncoderLayer(nn.Module):
|
| 154 |
+
def __init__(
|
| 155 |
+
self,
|
| 156 |
+
embedding_dim: float = 768,
|
| 157 |
+
ffn_embedding_dim: float = 3072,
|
| 158 |
+
num_attention_heads: float = 8,
|
| 159 |
+
dropout: float = 0.1,
|
| 160 |
+
attention_dropout: float = 0.1,
|
| 161 |
+
activation_dropout: float = 0.1,
|
| 162 |
+
activation_fn: str = "relu",
|
| 163 |
+
layer_norm_first: bool = False,
|
| 164 |
+
deep_norm: bool = False,
|
| 165 |
+
has_relative_attention_bias: bool = False,
|
| 166 |
+
num_buckets: int = 0,
|
| 167 |
+
max_distance: int = 0,
|
| 168 |
+
rescale_init: bool = False,
|
| 169 |
+
gru_rel_pos: bool = False,
|
| 170 |
+
encoder_layers: int = 0,
|
| 171 |
+
) -> None:
|
| 172 |
+
|
| 173 |
+
super().__init__()
|
| 174 |
+
self.embedding_dim = embedding_dim
|
| 175 |
+
self.dropout = dropout
|
| 176 |
+
self.activation_dropout = activation_dropout
|
| 177 |
+
|
| 178 |
+
self.activation_name = activation_fn
|
| 179 |
+
self.activation_fn = get_activation_fn(activation_fn)
|
| 180 |
+
self.self_attn = MultiheadAttention(
|
| 181 |
+
self.embedding_dim,
|
| 182 |
+
num_attention_heads,
|
| 183 |
+
dropout=attention_dropout,
|
| 184 |
+
self_attention=True,
|
| 185 |
+
has_relative_attention_bias=has_relative_attention_bias,
|
| 186 |
+
num_buckets=num_buckets,
|
| 187 |
+
max_distance=max_distance,
|
| 188 |
+
rescale_init=rescale_init,
|
| 189 |
+
gru_rel_pos=gru_rel_pos,
|
| 190 |
+
)
|
| 191 |
+
|
| 192 |
+
self.dropout1 = nn.Dropout(dropout)
|
| 193 |
+
self.dropout2 = nn.Dropout(self.activation_dropout)
|
| 194 |
+
self.dropout3 = nn.Dropout(dropout)
|
| 195 |
+
|
| 196 |
+
self.layer_norm_first = layer_norm_first
|
| 197 |
+
|
| 198 |
+
self.self_attn_layer_norm = LayerNorm(self.embedding_dim)
|
| 199 |
+
|
| 200 |
+
if self.activation_name == "glu":
|
| 201 |
+
self.fc1 = GLU_Linear(self.embedding_dim, ffn_embedding_dim, "swish")
|
| 202 |
+
else:
|
| 203 |
+
self.fc1 = nn.Linear(self.embedding_dim, ffn_embedding_dim)
|
| 204 |
+
self.fc2 = nn.Linear(ffn_embedding_dim, self.embedding_dim)
|
| 205 |
+
|
| 206 |
+
self.final_layer_norm = LayerNorm(self.embedding_dim)
|
| 207 |
+
|
| 208 |
+
self.deep_norm = deep_norm
|
| 209 |
+
if self.deep_norm:
|
| 210 |
+
self.deep_norm_alpha = math.pow(2 * encoder_layers, 1 / 4)
|
| 211 |
+
else:
|
| 212 |
+
self.deep_norm_alpha = 1
|
| 213 |
+
|
| 214 |
+
def forward(
|
| 215 |
+
self,
|
| 216 |
+
x: torch.Tensor,
|
| 217 |
+
self_attn_mask: torch.Tensor = None,
|
| 218 |
+
self_attn_padding_mask: torch.Tensor = None,
|
| 219 |
+
need_weights: bool = False,
|
| 220 |
+
pos_bias=None
|
| 221 |
+
):
|
| 222 |
+
residual = x
|
| 223 |
+
|
| 224 |
+
if self.layer_norm_first:
|
| 225 |
+
x = self.self_attn_layer_norm(x)
|
| 226 |
+
x, attn, pos_bias = self.self_attn(
|
| 227 |
+
query=x,
|
| 228 |
+
key=x,
|
| 229 |
+
value=x,
|
| 230 |
+
key_padding_mask=self_attn_padding_mask,
|
| 231 |
+
need_weights=False,
|
| 232 |
+
attn_mask=self_attn_mask,
|
| 233 |
+
position_bias=pos_bias
|
| 234 |
+
)
|
| 235 |
+
x = self.dropout1(x)
|
| 236 |
+
x = residual + x
|
| 237 |
+
|
| 238 |
+
residual = x
|
| 239 |
+
x = self.final_layer_norm(x)
|
| 240 |
+
if self.activation_name == "glu":
|
| 241 |
+
x = self.fc1(x)
|
| 242 |
+
else:
|
| 243 |
+
x = self.activation_fn(self.fc1(x))
|
| 244 |
+
x = self.dropout2(x)
|
| 245 |
+
x = self.fc2(x)
|
| 246 |
+
x = self.dropout3(x)
|
| 247 |
+
x = residual + x
|
| 248 |
+
else:
|
| 249 |
+
x, attn, pos_bias = self.self_attn(
|
| 250 |
+
query=x,
|
| 251 |
+
key=x,
|
| 252 |
+
value=x,
|
| 253 |
+
key_padding_mask=self_attn_padding_mask,
|
| 254 |
+
need_weights=need_weights,
|
| 255 |
+
attn_mask=self_attn_mask,
|
| 256 |
+
position_bias=pos_bias
|
| 257 |
+
)
|
| 258 |
+
|
| 259 |
+
x = self.dropout1(x)
|
| 260 |
+
x = residual * self.deep_norm_alpha + x
|
| 261 |
+
|
| 262 |
+
x = self.self_attn_layer_norm(x)
|
| 263 |
+
|
| 264 |
+
residual = x
|
| 265 |
+
if self.activation_name == "glu":
|
| 266 |
+
x = self.fc1(x)
|
| 267 |
+
else:
|
| 268 |
+
x = self.activation_fn(self.fc1(x))
|
| 269 |
+
x = self.dropout2(x)
|
| 270 |
+
x = self.fc2(x)
|
| 271 |
+
x = self.dropout3(x)
|
| 272 |
+
x = residual * self.deep_norm_alpha + x
|
| 273 |
+
x = self.final_layer_norm(x)
|
| 274 |
+
|
| 275 |
+
return x, attn, pos_bias
|
| 276 |
+
|
| 277 |
+
|
| 278 |
+
class MultiheadAttention(nn.Module):
|
| 279 |
+
"""Multi-headed attention.
|
| 280 |
+
|
| 281 |
+
See "Attention Is All You Need" for more details.
|
| 282 |
+
"""
|
| 283 |
+
|
| 284 |
+
def __init__(
|
| 285 |
+
self,
|
| 286 |
+
embed_dim,
|
| 287 |
+
num_heads,
|
| 288 |
+
kdim=None,
|
| 289 |
+
vdim=None,
|
| 290 |
+
dropout=0.0,
|
| 291 |
+
bias=True,
|
| 292 |
+
add_bias_kv=False,
|
| 293 |
+
add_zero_attn=False,
|
| 294 |
+
self_attention=False,
|
| 295 |
+
encoder_decoder_attention=False,
|
| 296 |
+
q_noise=0.0,
|
| 297 |
+
qn_block_size=8,
|
| 298 |
+
has_relative_attention_bias=False,
|
| 299 |
+
num_buckets=32,
|
| 300 |
+
max_distance=128,
|
| 301 |
+
gru_rel_pos=False,
|
| 302 |
+
rescale_init=False,
|
| 303 |
+
):
|
| 304 |
+
super().__init__()
|
| 305 |
+
self.embed_dim = embed_dim
|
| 306 |
+
self.kdim = kdim if kdim is not None else embed_dim
|
| 307 |
+
self.vdim = vdim if vdim is not None else embed_dim
|
| 308 |
+
self.qkv_same_dim = self.kdim == embed_dim and self.vdim == embed_dim
|
| 309 |
+
|
| 310 |
+
self.num_heads = num_heads
|
| 311 |
+
self.dropout_module = nn.Dropout(dropout)
|
| 312 |
+
|
| 313 |
+
self.has_relative_attention_bias = has_relative_attention_bias
|
| 314 |
+
self.num_buckets = num_buckets
|
| 315 |
+
self.max_distance = max_distance
|
| 316 |
+
if self.has_relative_attention_bias:
|
| 317 |
+
self.relative_attention_bias = nn.Embedding(num_buckets, num_heads)
|
| 318 |
+
|
| 319 |
+
self.head_dim = embed_dim // num_heads
|
| 320 |
+
self.q_head_dim = self.head_dim
|
| 321 |
+
self.k_head_dim = self.head_dim
|
| 322 |
+
assert (
|
| 323 |
+
self.head_dim * num_heads == self.embed_dim
|
| 324 |
+
), "embed_dim must be divisible by num_heads"
|
| 325 |
+
self.scaling = self.head_dim ** -0.5
|
| 326 |
+
|
| 327 |
+
self.self_attention = self_attention
|
| 328 |
+
self.encoder_decoder_attention = encoder_decoder_attention
|
| 329 |
+
|
| 330 |
+
assert not self.self_attention or self.qkv_same_dim, (
|
| 331 |
+
"Self-attention requires query, key and " "value to be of the same size"
|
| 332 |
+
)
|
| 333 |
+
|
| 334 |
+
k_bias = True
|
| 335 |
+
if rescale_init:
|
| 336 |
+
k_bias = False
|
| 337 |
+
|
| 338 |
+
k_embed_dim = embed_dim
|
| 339 |
+
q_embed_dim = embed_dim
|
| 340 |
+
|
| 341 |
+
self.k_proj = quant_noise(
|
| 342 |
+
nn.Linear(self.kdim, k_embed_dim, bias=k_bias), q_noise, qn_block_size
|
| 343 |
+
)
|
| 344 |
+
self.v_proj = quant_noise(
|
| 345 |
+
nn.Linear(self.vdim, embed_dim, bias=bias), q_noise, qn_block_size
|
| 346 |
+
)
|
| 347 |
+
self.q_proj = quant_noise(
|
| 348 |
+
nn.Linear(embed_dim, q_embed_dim, bias=bias), q_noise, qn_block_size
|
| 349 |
+
)
|
| 350 |
+
|
| 351 |
+
self.out_proj = quant_noise(
|
| 352 |
+
nn.Linear(embed_dim, embed_dim, bias=bias), q_noise, qn_block_size
|
| 353 |
+
)
|
| 354 |
+
|
| 355 |
+
if add_bias_kv:
|
| 356 |
+
self.bias_k = Parameter(torch.Tensor(1, 1, embed_dim))
|
| 357 |
+
self.bias_v = Parameter(torch.Tensor(1, 1, embed_dim))
|
| 358 |
+
else:
|
| 359 |
+
self.bias_k = self.bias_v = None
|
| 360 |
+
|
| 361 |
+
self.add_zero_attn = add_zero_attn
|
| 362 |
+
|
| 363 |
+
self.gru_rel_pos = gru_rel_pos
|
| 364 |
+
if self.gru_rel_pos:
|
| 365 |
+
self.grep_linear = nn.Linear(self.q_head_dim, 8)
|
| 366 |
+
self.grep_a = nn.Parameter(torch.ones(1, num_heads, 1, 1))
|
| 367 |
+
|
| 368 |
+
self.reset_parameters()
|
| 369 |
+
|
| 370 |
+
def reset_parameters(self):
|
| 371 |
+
if self.qkv_same_dim:
|
| 372 |
+
# Empirically observed the convergence to be much better with
|
| 373 |
+
# the scaled initialization
|
| 374 |
+
nn.init.xavier_uniform_(self.k_proj.weight, gain=1 / math.sqrt(2))
|
| 375 |
+
nn.init.xavier_uniform_(self.v_proj.weight, gain=1 / math.sqrt(2))
|
| 376 |
+
nn.init.xavier_uniform_(self.q_proj.weight, gain=1 / math.sqrt(2))
|
| 377 |
+
else:
|
| 378 |
+
nn.init.xavier_uniform_(self.k_proj.weight)
|
| 379 |
+
nn.init.xavier_uniform_(self.v_proj.weight)
|
| 380 |
+
nn.init.xavier_uniform_(self.q_proj.weight)
|
| 381 |
+
|
| 382 |
+
nn.init.xavier_uniform_(self.out_proj.weight)
|
| 383 |
+
if self.out_proj.bias is not None:
|
| 384 |
+
nn.init.constant_(self.out_proj.bias, 0.0)
|
| 385 |
+
if self.bias_k is not None:
|
| 386 |
+
nn.init.xavier_normal_(self.bias_k)
|
| 387 |
+
if self.bias_v is not None:
|
| 388 |
+
nn.init.xavier_normal_(self.bias_v)
|
| 389 |
+
if self.has_relative_attention_bias:
|
| 390 |
+
nn.init.xavier_normal_(self.relative_attention_bias.weight)
|
| 391 |
+
|
| 392 |
+
def _relative_positions_bucket(self, relative_positions, bidirectional=True):
|
| 393 |
+
num_buckets = self.num_buckets
|
| 394 |
+
max_distance = self.max_distance
|
| 395 |
+
relative_buckets = 0
|
| 396 |
+
|
| 397 |
+
if bidirectional:
|
| 398 |
+
num_buckets = num_buckets // 2
|
| 399 |
+
relative_buckets += (relative_positions > 0).to(torch.long) * num_buckets
|
| 400 |
+
relative_positions = torch.abs(relative_positions)
|
| 401 |
+
else:
|
| 402 |
+
relative_positions = -torch.min(relative_positions, torch.zeros_like(relative_positions))
|
| 403 |
+
|
| 404 |
+
max_exact = num_buckets // 2
|
| 405 |
+
is_small = relative_positions < max_exact
|
| 406 |
+
|
| 407 |
+
relative_postion_if_large = max_exact + (
|
| 408 |
+
torch.log(relative_positions.float() / max_exact)
|
| 409 |
+
/ math.log(max_distance / max_exact)
|
| 410 |
+
* (num_buckets - max_exact)
|
| 411 |
+
).to(torch.long)
|
| 412 |
+
relative_postion_if_large = torch.min(
|
| 413 |
+
relative_postion_if_large, torch.full_like(relative_postion_if_large, num_buckets - 1)
|
| 414 |
+
)
|
| 415 |
+
|
| 416 |
+
relative_buckets += torch.where(is_small, relative_positions, relative_postion_if_large)
|
| 417 |
+
return relative_buckets
|
| 418 |
+
|
| 419 |
+
def compute_bias(self, query_length, key_length):
|
| 420 |
+
context_position = torch.arange(query_length, dtype=torch.long)[:, None]
|
| 421 |
+
memory_position = torch.arange(key_length, dtype=torch.long)[None, :]
|
| 422 |
+
relative_position = memory_position - context_position
|
| 423 |
+
relative_position_bucket = self._relative_positions_bucket(
|
| 424 |
+
relative_position,
|
| 425 |
+
bidirectional=True
|
| 426 |
+
)
|
| 427 |
+
relative_position_bucket = relative_position_bucket.to(self.relative_attention_bias.weight.device)
|
| 428 |
+
values = self.relative_attention_bias(relative_position_bucket)
|
| 429 |
+
values = values.permute([2, 0, 1])
|
| 430 |
+
return values
|
| 431 |
+
|
| 432 |
+
def forward(
|
| 433 |
+
self,
|
| 434 |
+
query,
|
| 435 |
+
key: Optional[Tensor],
|
| 436 |
+
value: Optional[Tensor],
|
| 437 |
+
key_padding_mask: Optional[Tensor] = None,
|
| 438 |
+
incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]] = None,
|
| 439 |
+
need_weights: bool = True,
|
| 440 |
+
static_kv: bool = False,
|
| 441 |
+
attn_mask: Optional[Tensor] = None,
|
| 442 |
+
before_softmax: bool = False,
|
| 443 |
+
need_head_weights: bool = False,
|
| 444 |
+
position_bias: Optional[Tensor] = None
|
| 445 |
+
) -> Tuple[Tensor, Optional[Tensor], Optional[Tensor]]:
|
| 446 |
+
"""Input shape: Time x Batch x Channel
|
| 447 |
+
|
| 448 |
+
Args:
|
| 449 |
+
key_padding_mask (ByteTensor, optional): mask to exclude
|
| 450 |
+
keys that are pads, of shape `(batch, src_len)`, where
|
| 451 |
+
padding elements are indicated by 1s.
|
| 452 |
+
need_weights (bool, optional): return the attention weights,
|
| 453 |
+
averaged over heads (default: False).
|
| 454 |
+
attn_mask (ByteTensor, optional): typically used to
|
| 455 |
+
implement causal attention, where the mask prevents the
|
| 456 |
+
attention from looking forward in time (default: None).
|
| 457 |
+
before_softmax (bool, optional): return the raw attention
|
| 458 |
+
weights and values before the attention softmax.
|
| 459 |
+
need_head_weights (bool, optional): return the attention
|
| 460 |
+
weights for each head. Implies *need_weights*. Default:
|
| 461 |
+
return the average attention weights over all heads.
|
| 462 |
+
"""
|
| 463 |
+
if need_head_weights:
|
| 464 |
+
need_weights = True
|
| 465 |
+
|
| 466 |
+
is_tpu = query.device.type == "xla"
|
| 467 |
+
|
| 468 |
+
tgt_len, bsz, embed_dim = query.size()
|
| 469 |
+
src_len = tgt_len
|
| 470 |
+
assert embed_dim == self.embed_dim
|
| 471 |
+
assert list(query.size()) == [tgt_len, bsz, embed_dim]
|
| 472 |
+
if key is not None:
|
| 473 |
+
src_len, key_bsz, _ = key.size()
|
| 474 |
+
if not torch.jit.is_scripting():
|
| 475 |
+
assert key_bsz == bsz
|
| 476 |
+
assert value is not None
|
| 477 |
+
assert src_len, bsz == value.shape[:2]
|
| 478 |
+
|
| 479 |
+
if self.has_relative_attention_bias and position_bias is None:
|
| 480 |
+
position_bias = self.compute_bias(tgt_len, src_len)
|
| 481 |
+
position_bias = position_bias.unsqueeze(0).repeat(bsz, 1, 1, 1).view(bsz * self.num_heads, tgt_len, src_len)
|
| 482 |
+
|
| 483 |
+
if incremental_state is not None:
|
| 484 |
+
saved_state = self._get_input_buffer(incremental_state)
|
| 485 |
+
if saved_state is not None and "prev_key" in saved_state:
|
| 486 |
+
# previous time steps are cached - no need to recompute
|
| 487 |
+
# key and value if they are static
|
| 488 |
+
if static_kv:
                    assert self.encoder_decoder_attention and not self.self_attention
                    key = value = None
        else:
            saved_state = None

        if self.self_attention:
            q = self.q_proj(query)
            k = self.k_proj(query)
            v = self.v_proj(query)
        elif self.encoder_decoder_attention:
            # encoder-decoder attention
            q = self.q_proj(query)
            if key is None:
                assert value is None
                k = v = None
            else:
                k = self.k_proj(key)
                v = self.v_proj(key)

        else:
            assert key is not None and value is not None
            q = self.q_proj(query)
            k = self.k_proj(key)
            v = self.v_proj(value)
        q *= self.scaling
        alpha = 32
        q *= 1 / alpha

        if self.bias_k is not None:
            assert self.bias_v is not None
            k = torch.cat([k, self.bias_k.repeat(1, bsz, 1)])
            v = torch.cat([v, self.bias_v.repeat(1, bsz, 1)])
            if attn_mask is not None:
                attn_mask = torch.cat(
                    [attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1
                )
            if key_padding_mask is not None:
                key_padding_mask = torch.cat(
                    [
                        key_padding_mask,
                        key_padding_mask.new_zeros(key_padding_mask.size(0), 1),
                    ],
                    dim=1,
                )

        q = (
            q.contiguous()
            .view(tgt_len, bsz * self.num_heads, self.q_head_dim)
            .transpose(0, 1)
        )
        if k is not None:
            k = (
                k.contiguous()
                .view(-1, bsz * self.num_heads, self.k_head_dim)
                .transpose(0, 1)
            )
        if v is not None:
            v = (
                v.contiguous()
                .view(-1, bsz * self.num_heads, self.head_dim)
                .transpose(0, 1)
            )

        if saved_state is not None:
            # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)
            if "prev_key" in saved_state:
                _prev_key = saved_state["prev_key"]
                assert _prev_key is not None
                prev_key = _prev_key.view(bsz * self.num_heads, -1, self.head_dim)
                if static_kv:
                    k = prev_key
                else:
                    assert k is not None
                    k = torch.cat([prev_key, k], dim=1)
                src_len = k.size(1)
            if "prev_value" in saved_state:
                _prev_value = saved_state["prev_value"]
                assert _prev_value is not None
                prev_value = _prev_value.view(bsz * self.num_heads, -1, self.head_dim)
                if static_kv:
                    v = prev_value
                else:
                    assert v is not None
                    v = torch.cat([prev_value, v], dim=1)
            prev_key_padding_mask: Optional[Tensor] = None
            if "prev_key_padding_mask" in saved_state:
                prev_key_padding_mask = saved_state["prev_key_padding_mask"]
            assert k is not None and v is not None
            key_padding_mask = MultiheadAttention._append_prev_key_padding_mask(
                key_padding_mask=key_padding_mask,
                prev_key_padding_mask=prev_key_padding_mask,
                batch_size=bsz,
                src_len=k.size(1),
                static_kv=static_kv,
            )

            saved_state["prev_key"] = k.view(bsz, self.num_heads, -1, self.head_dim)
            saved_state["prev_value"] = v.view(bsz, self.num_heads, -1, self.head_dim)
            saved_state["prev_key_padding_mask"] = key_padding_mask
            # In this branch incremental_state is never None
            assert incremental_state is not None
            incremental_state = self._set_input_buffer(incremental_state, saved_state)
        assert k is not None
        assert k.size(1) == src_len

        # This is part of a workaround to get around fork/join parallelism
        # not supporting Optional types.
        if key_padding_mask is not None and key_padding_mask.dim() == 0:
            key_padding_mask = None

        if key_padding_mask is not None:
            assert key_padding_mask.size(0) == bsz
            assert key_padding_mask.size(1) == src_len

        if self.add_zero_attn:
            assert v is not None
            src_len += 1
            k = torch.cat([k, k.new_zeros((k.size(0), 1) + k.size()[2:])], dim=1)
            v = torch.cat([v, v.new_zeros((v.size(0), 1) + v.size()[2:])], dim=1)
            if attn_mask is not None:
                attn_mask = torch.cat(
                    [attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1
                )
            if key_padding_mask is not None:
                key_padding_mask = torch.cat(
                    [
                        key_padding_mask,
                        torch.zeros(key_padding_mask.size(0), 1).type_as(
                            key_padding_mask
                        ),
                    ],
                    dim=1,
                )

        attn_weights = torch.bmm(q, k.transpose(1, 2))
        attn_weights = (attn_weights - attn_weights.max(dim=-1, keepdim=True)[0]) * alpha
        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)

        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, src_len]

        if attn_mask is not None:
            attn_mask = attn_mask.unsqueeze(0)
            attn_weights += attn_mask

        if key_padding_mask is not None:
            # don't attend to padding symbols
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            if not is_tpu:
                attn_weights = attn_weights.masked_fill(
                    key_padding_mask.unsqueeze(1).unsqueeze(2).to(torch.bool),
                    float("-inf"),
                )
            else:
                attn_weights = attn_weights.transpose(0, 2)
                attn_weights = attn_weights.masked_fill(key_padding_mask, float("-inf"))
                attn_weights = attn_weights.transpose(0, 2)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        if before_softmax:
            return attn_weights, v, position_bias

        if position_bias is not None:
            attn_mask_rel_pos = position_bias
            if self.gru_rel_pos == 1:
                query_layer = q.view(bsz, self.num_heads, tgt_len, self.q_head_dim) * alpha / self.scaling
                _B, _H, _L, __ = query_layer.size()
                gate_a, gate_b = torch.sigmoid(self.grep_linear(query_layer).view(
                    _B, _H, _L, 2, 4).sum(-1, keepdim=False)).chunk(2, dim=-1)
                gate_a_1 = gate_a * (gate_b * self.grep_a - 1.0) + 2.0
                attn_mask_rel_pos = gate_a_1.view(bsz * self.num_heads, tgt_len, 1) * position_bias

            attn_mask_rel_pos = attn_mask_rel_pos.view(attn_weights.size())

            attn_weights = attn_weights + attn_mask_rel_pos

        attn_weights_float = F.softmax(
            attn_weights, dim=-1
        )
        attn_weights = attn_weights_float.type_as(attn_weights)
        attn_probs = self.dropout_module(attn_weights)

        assert v is not None
        attn = torch.bmm(attn_probs, v)
        assert list(attn.size()) == [bsz * self.num_heads, tgt_len, self.head_dim]
        attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
        attn = self.out_proj(attn)
        attn_weights: Optional[Tensor] = None
        if need_weights:
            attn_weights = attn_weights_float.view(
                bsz, self.num_heads, tgt_len, src_len
            ).transpose(1, 0)
            if not need_head_weights:
                # average attention weights over heads
                attn_weights = attn_weights.mean(dim=0)

        return attn, attn_weights, position_bias

    @staticmethod
    def _append_prev_key_padding_mask(
        key_padding_mask: Optional[Tensor],
        prev_key_padding_mask: Optional[Tensor],
        batch_size: int,
        src_len: int,
        static_kv: bool,
    ) -> Optional[Tensor]:
        # saved key padding masks have shape (bsz, seq_len)
        if prev_key_padding_mask is not None and static_kv:
            new_key_padding_mask = prev_key_padding_mask
        elif prev_key_padding_mask is not None and key_padding_mask is not None:
            new_key_padding_mask = torch.cat(
                [prev_key_padding_mask.float(), key_padding_mask.float()], dim=1
            )
        # During incremental decoding, as the padding token enters and
        # leaves the frame, there will be a time when prev or current
        # is None
        elif prev_key_padding_mask is not None:
            if src_len > prev_key_padding_mask.size(1):
                filler = torch.zeros(
                    (batch_size, src_len - prev_key_padding_mask.size(1)),
                    device=prev_key_padding_mask.device,
                )
                new_key_padding_mask = torch.cat(
                    [prev_key_padding_mask.float(), filler.float()], dim=1
                )
            else:
                new_key_padding_mask = prev_key_padding_mask.float()
        elif key_padding_mask is not None:
            if src_len > key_padding_mask.size(1):
                filler = torch.zeros(
                    (batch_size, src_len - key_padding_mask.size(1)),
                    device=key_padding_mask.device,
                )
                new_key_padding_mask = torch.cat(
                    [filler.float(), key_padding_mask.float()], dim=1
                )
            else:
                new_key_padding_mask = key_padding_mask.float()
        else:
            new_key_padding_mask = prev_key_padding_mask
        return new_key_padding_mask

    def _get_input_buffer(
        self, incremental_state: Optional[Dict[str, Dict[str, Optional[Tensor]]]]
    ) -> Dict[str, Optional[Tensor]]:
        result = self.get_incremental_state(incremental_state, "attn_state")
        if result is not None:
            return result
        else:
            empty_result: Dict[str, Optional[Tensor]] = {}
            return empty_result

    def _set_input_buffer(
        self,
        incremental_state: Dict[str, Dict[str, Optional[Tensor]]],
        buffer: Dict[str, Optional[Tensor]],
    ):
        return self.set_incremental_state(incremental_state, "attn_state", buffer)

    def apply_sparse_mask(self, attn_weights, tgt_len: int, src_len: int, bsz: int):
        return attn_weights


def init_bert_params(module):
    """
    Initialize the weights specific to the BERT Model.
    This overrides the default initializations depending on the specified arguments.
    1. If normal_init_linear_weights is set then weights of linear
       layer will be initialized using the normal distribution and
       bias will be set to the specified value.
    2. If normal_init_embed_weights is set then weights of embedding
       layer will be initialized using the normal distribution.
    3. If normal_init_proj_weights is set then weights of
       in_project_weight for MultiHeadAttention initialized using
       the normal distribution (to be validated).
    """

    def normal_(data):
        # with FSDP, module params will be on CUDA, so we cast them back to CPU
        # so that the RNG is consistent with and without FSDP
        data.copy_(
            data.cpu().normal_(mean=0.0, std=0.02).to(data.device)
        )

    if isinstance(module, nn.Linear):
        normal_(module.weight.data)
        if module.bias is not None:
            module.bias.data.zero_()
    if isinstance(module, nn.Embedding):
        normal_(module.weight.data)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    if isinstance(module, MultiheadAttention):
        normal_(module.q_proj.weight.data)
        normal_(module.k_proj.weight.data)
        normal_(module.v_proj.weight.data)
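The forward pass above divides the query by `alpha` (a constant of 32) and multiplies the attention scores back by `alpha` after subtracting the per-row maximum, which keeps intermediate half-precision activations in range without changing the softmax result. A minimal self-contained sketch of that trick in plain PyTorch (not part of the repository's code):

```python
import torch

def stable_scaled_softmax(q, k, alpha=32.0):
    # q, k: (batch, seq, dim); q is assumed to already carry the usual 1/sqrt(dim) scaling
    q = q / alpha                                      # keep fp16 activations small
    logits = torch.bmm(q, k.transpose(1, 2))           # small-magnitude scores
    logits = (logits - logits.max(dim=-1, keepdim=True)[0]) * alpha  # undo the scaling after max-subtraction
    return torch.softmax(logits, dim=-1)

q, k = torch.randn(2, 4, 8), torch.randn(2, 4, 8)
reference = torch.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)
assert torch.allclose(stable_scaled_softmax(q, k), reference, atol=1e-6)
```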
models/beats/modules.py
ADDED
|
@@ -0,0 +1,218 @@
# --------------------------------------------------------
# BEATs: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)
# Github source: https://github.com/microsoft/unilm/tree/master/beats
# Copyright (c) 2022 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Based on fairseq code bases
# https://github.com/pytorch/fairseq
# --------------------------------------------------------

import math
import warnings
import torch
from torch import Tensor, nn
import torch.nn.functional as F


class GradMultiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        res = x.new(x)
        return res

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None


class SamePad(nn.Module):
    def __init__(self, kernel_size, causal=False):
        super().__init__()
        if causal:
            self.remove = kernel_size - 1
        else:
            self.remove = 1 if kernel_size % 2 == 0 else 0

    def forward(self, x):
        if self.remove > 0:
            x = x[:, :, : -self.remove]
        return x


class Swish(nn.Module):
    def __init__(self):
        super(Swish, self).__init__()
        self.act = torch.nn.Sigmoid()

    def forward(self, x):
        return x * self.act(x)


class GLU_Linear(nn.Module):
    def __init__(self, input_dim, output_dim, glu_type="sigmoid", bias_in_glu=True):
        super(GLU_Linear, self).__init__()

        self.glu_type = glu_type
        self.output_dim = output_dim

        if glu_type == "sigmoid":
            self.glu_act = torch.nn.Sigmoid()
        elif glu_type == "swish":
            self.glu_act = Swish()
        elif glu_type == "relu":
            self.glu_act = torch.nn.ReLU()
        elif glu_type == "gelu":
            self.glu_act = torch.nn.GELU()

        if bias_in_glu:
            self.linear = nn.Linear(input_dim, output_dim * 2, True)
        else:
            self.linear = nn.Linear(input_dim, output_dim * 2, False)

    def forward(self, x):
        # to be consistent with GLU_Linear, we assume the input always has the #channel (#dim) in the last dimension of the tensor, so need to switch the dimension first for 1D-Conv case
        x = self.linear(x)

        if self.glu_type == "bilinear":
            x = (x[:, :, 0:self.output_dim] * x[:, :, self.output_dim:self.output_dim * 2])
        else:
            x = (x[:, :, 0:self.output_dim] * self.glu_act(x[:, :, self.output_dim:self.output_dim * 2]))

        return x


def gelu_accurate(x):
    if not hasattr(gelu_accurate, "_a"):
        gelu_accurate._a = math.sqrt(2 / math.pi)
    return (
        0.5 * x * (1 + torch.tanh(gelu_accurate._a * (x + 0.044715 * torch.pow(x, 3))))
    )


def gelu(x: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.gelu(x.float()).type_as(x)


def get_activation_fn(activation: str):
    """Returns the activation function corresponding to `activation`"""

    if activation == "relu":
        return F.relu
    elif activation == "gelu":
        return gelu
    elif activation == "gelu_fast":
        warnings.warn(
            "--activation-fn=gelu_fast has been renamed to gelu_accurate"
        )
        return gelu_accurate
    elif activation == "gelu_accurate":
        return gelu_accurate
    elif activation == "tanh":
        return torch.tanh
    elif activation == "linear":
        return lambda x: x
    elif activation == "glu":
        return lambda x: x
    else:
        raise RuntimeError("--activation-fn {} not supported".format(activation))


def quant_noise(module, p, block_size):
    """
    Wraps modules and applies quantization noise to the weights for
    subsequent quantization with Iterative Product Quantization as
    described in "Training with Quantization Noise for Extreme Model Compression"

    Args:
        - module: nn.Module
        - p: amount of Quantization Noise
        - block_size: size of the blocks for subsequent quantization with iPQ

    Remarks:
        - Module weights must have the right sizes wrt the block size
        - Only Linear, Embedding and Conv2d modules are supported for the moment
        - For more detail on how to quantize by blocks with convolutional weights,
          see "And the Bit Goes Down: Revisiting the Quantization of Neural Networks"
        - We implement the simplest form of noise here as stated in the paper
          which consists in randomly dropping blocks
    """

    # if no quantization noise, don't register hook
    if p <= 0:
        return module

    # supported modules
    assert isinstance(module, (nn.Linear, nn.Embedding, nn.Conv2d))

    # test whether module.weight has the right sizes wrt block_size
    is_conv = module.weight.ndim == 4

    # 2D matrix
    if not is_conv:
        assert (
            module.weight.size(1) % block_size == 0
        ), "Input features must be a multiple of block sizes"

    # 4D matrix
    else:
        # 1x1 convolutions
        if module.kernel_size == (1, 1):
            assert (
                module.in_channels % block_size == 0
            ), "Input channels must be a multiple of block sizes"
        # regular convolutions
        else:
            k = module.kernel_size[0] * module.kernel_size[1]
            assert k % block_size == 0, "Kernel size must be a multiple of block size"

    def _forward_pre_hook(mod, input):
        # no noise for evaluation
        if mod.training:
            if not is_conv:
                # gather weight and sizes
                weight = mod.weight
                in_features = weight.size(1)
                out_features = weight.size(0)

                # split weight matrix into blocks and randomly drop selected blocks
                mask = torch.zeros(
                    in_features // block_size * out_features, device=weight.device
                )
                mask.bernoulli_(p)
                mask = mask.repeat_interleave(block_size, -1).view(-1, in_features)

            else:
                # gather weight and sizes
                weight = mod.weight
                in_channels = mod.in_channels
                out_channels = mod.out_channels

                # split weight matrix into blocks and randomly drop selected blocks
                if mod.kernel_size == (1, 1):
                    mask = torch.zeros(
                        int(in_channels // block_size * out_channels),
                        device=weight.device,
                    )
                    mask.bernoulli_(p)
                    mask = mask.repeat_interleave(block_size, -1).view(-1, in_channels)
                else:
                    mask = torch.zeros(
                        weight.size(0), weight.size(1), device=weight.device
                    )
                    mask.bernoulli_(p)
                    mask = (
                        mask.unsqueeze(2)
                        .unsqueeze(3)
                        .repeat(1, 1, mod.kernel_size[0], mod.kernel_size[1])
                    )

            # scale weights and apply mask
            mask = mask.to(
                torch.bool
            )  # x.bool() is not currently supported in TorchScript
            s = 1 / (1 - p)
            mod.weight.data = s * weight.masked_fill(mask, 0)

    module.register_forward_pre_hook(_forward_pre_hook)
    return module
models/beats/quantizer.py
ADDED
|
@@ -0,0 +1,215 @@
# --------------------------------------------------------
# BEATs: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)
# Github source: https://github.com/microsoft/unilm/tree/master/beats
# Copyright (c) 2022 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Based on VQGAN code bases
# https://github.com/CompVis/taming-transformers
# --------------------------------------------------------'

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as distributed

try:
    from einops import rearrange, repeat
except ImportError:
    pass


def l2norm(t):
    return F.normalize(t, p=2, dim=-1)


def ema_inplace(moving_avg, new, decay):
    moving_avg.data.mul_(decay).add_(new, alpha=(1 - decay))


def sample_vectors(samples, num):
    num_samples, device = samples.shape[0], samples.device

    if num_samples >= num:
        indices = torch.randperm(num_samples, device=device)[:num]
    else:
        indices = torch.randint(0, num_samples, (num,), device=device)

    return samples[indices]


def kmeans(samples, num_clusters, num_iters=10, use_cosine_sim=False):
    dim, dtype, device = samples.shape[-1], samples.dtype, samples.device

    means = sample_vectors(samples, num_clusters)

    for _ in range(num_iters):
        if use_cosine_sim:
            dists = samples @ means.t()
        else:
            diffs = rearrange(samples, 'n d -> n () d') \
                    - rearrange(means, 'c d -> () c d')
            dists = -(diffs ** 2).sum(dim=-1)

        buckets = dists.max(dim=-1).indices
        bins = torch.bincount(buckets, minlength=num_clusters)
        zero_mask = bins == 0
        bins_min_clamped = bins.masked_fill(zero_mask, 1)

        new_means = buckets.new_zeros(num_clusters, dim, dtype=dtype)
        new_means.scatter_add_(0, repeat(buckets, 'n -> n d', d=dim), samples)
        new_means = new_means / bins_min_clamped[..., None]

        if use_cosine_sim:
            new_means = l2norm(new_means)

        means = torch.where(zero_mask[..., None], means, new_means)

    return means, bins


class EmbeddingEMA(nn.Module):
    def __init__(self, num_tokens, codebook_dim, decay=0.99, eps=1e-5, kmeans_init=True, codebook_init_path=''):
        super().__init__()
        self.num_tokens = num_tokens
        self.codebook_dim = codebook_dim
        self.decay = decay
        self.eps = eps
        if codebook_init_path == '':
            if not kmeans_init:
                weight = torch.randn(num_tokens, codebook_dim)
                weight = l2norm(weight)
            else:
                weight = torch.zeros(num_tokens, codebook_dim)
            self.register_buffer('initted', torch.Tensor([not kmeans_init]))
        else:
            print(f"load init codebook weight from {codebook_init_path}")
            codebook_ckpt_weight = torch.load(codebook_init_path, map_location='cpu')
            weight = codebook_ckpt_weight.clone()
            self.register_buffer('initted', torch.Tensor([True]))

        self.weight = nn.Parameter(weight, requires_grad=False)
        self.cluster_size = nn.Parameter(torch.zeros(num_tokens), requires_grad=False)
        self.embed_avg = nn.Parameter(weight.clone(), requires_grad=False)
        # self.register_buffer('initted', torch.Tensor([not kmeans_init]))
        self.update = True

    @torch.jit.ignore
    def init_embed_(self, data):
        if self.initted:
            return
        print("Performing Kemans init for codebook")
        embed, cluster_size = kmeans(data, self.num_tokens, 10, use_cosine_sim=True)
        self.weight.data.copy_(embed)
        self.cluster_size.data.copy_(cluster_size)
        self.initted.data.copy_(torch.Tensor([True]))

    def forward(self, embed_id):
        return F.embedding(embed_id, self.weight)

    def cluster_size_ema_update(self, new_cluster_size):
        self.cluster_size.data.mul_(self.decay).add_(new_cluster_size, alpha=1 - self.decay)

    def embed_avg_ema_update(self, new_embed_avg):
        self.embed_avg.data.mul_(self.decay).add_(new_embed_avg, alpha=1 - self.decay)

    def weight_update(self, num_tokens):
        n = self.cluster_size.sum()
        smoothed_cluster_size = (
            (self.cluster_size + self.eps) / (n + num_tokens * self.eps) * n
        )
        # normalize embedding average with smoothed cluster size
        embed_normalized = self.embed_avg / smoothed_cluster_size.unsqueeze(1)
        # embed_normalized = l2norm(self.embed_avg / smoothed_cluster_size.unsqueeze(1))
        self.weight.data.copy_(embed_normalized)


def norm_ema_inplace(moving_avg, new, decay):
    moving_avg.data.mul_(decay).add_(new, alpha=(1 - decay))
    moving_avg.data.copy_(l2norm(moving_avg.data))


class NormEMAVectorQuantizer(nn.Module):
    def __init__(self, n_embed, embedding_dim, beta, decay=0.99, eps=1e-5,
                 statistic_code_usage=True, kmeans_init=False, codebook_init_path=''):
        super().__init__()
        self.codebook_dim = embedding_dim
        self.num_tokens = n_embed
        self.beta = beta
        self.decay = decay

        # learnable = True if orthogonal_reg_weight > 0 else False
        self.embedding = EmbeddingEMA(self.num_tokens, self.codebook_dim, decay, eps, kmeans_init, codebook_init_path)

        self.statistic_code_usage = statistic_code_usage
        if statistic_code_usage:
            self.register_buffer('cluster_size', torch.zeros(n_embed))
        if distributed.is_available() and distributed.is_initialized():
            print("ddp is enable, so use ddp_reduce to sync the statistic_code_usage for each gpu!")
            self.all_reduce_fn = distributed.all_reduce
        else:
            self.all_reduce_fn = nn.Identity()

    def reset_cluster_size(self, device):
        if self.statistic_code_usage:
            self.register_buffer('cluster_size', torch.zeros(self.num_tokens))
            self.cluster_size = self.cluster_size.to(device)

    def forward(self, z):
        # reshape z -> (batch, height, width, channel) and flatten
        # z, 'b c h w -> b h w c'
        # z = rearrange(z, 'b c h w -> b h w c')
        # z = z.transpose(1, 2)
        z = l2norm(z)
        z_flattened = z.reshape(-1, self.codebook_dim)

        self.embedding.init_embed_(z_flattened)

        d = z_flattened.pow(2).sum(dim=1, keepdim=True) + \
            self.embedding.weight.pow(2).sum(dim=1) - 2 * \
            torch.einsum('bd,nd->bn', z_flattened, self.embedding.weight)  # 'n d -> d n'

        encoding_indices = torch.argmin(d, dim=1)

        z_q = self.embedding(encoding_indices).view(z.shape)

        encodings = F.one_hot(encoding_indices, self.num_tokens).type(z.dtype)

        if not self.training:
            with torch.no_grad():
                cluster_size = encodings.sum(0)
                self.all_reduce_fn(cluster_size)
                ema_inplace(self.cluster_size, cluster_size, self.decay)

        if self.training and self.embedding.update:
            # EMA cluster size

            bins = encodings.sum(0)
            self.all_reduce_fn(bins)

            # self.embedding.cluster_size_ema_update(bins)
            ema_inplace(self.cluster_size, bins, self.decay)

            zero_mask = (bins == 0)
            bins = bins.masked_fill(zero_mask, 1.)

            embed_sum = z_flattened.t() @ encodings
            self.all_reduce_fn(embed_sum)

            embed_normalized = (embed_sum / bins.unsqueeze(0)).t()
            embed_normalized = l2norm(embed_normalized)

            embed_normalized = torch.where(zero_mask[..., None], self.embedding.weight,
                                           embed_normalized)
            norm_ema_inplace(self.embedding.weight, embed_normalized, self.decay)

        # compute loss for embedding
        loss = self.beta * F.mse_loss(z_q.detach(), z)

        # preserve gradients
        z_q = z + (z_q - z).detach()

        # reshape back to match original input shape
        # z_q, 'b h w c -> b c h w'
        # z_q = rearrange(z_q, 'b h w c -> b c h w')
        # z_q = z_q.transpose(1, 2)
        return z_q, loss, encoding_indices
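`NormEMAVectorQuantizer` maps l2-normalized features to their nearest codebook entries, returns a straight-through quantized tensor, a commitment loss, and the code indices, and updates the codebook by exponential moving averages during training. A toy-sized usage sketch; codebook size and feature dimension here are chosen only for illustration and do not match the BEATs tokenizer configuration:

```python
import torch
from models.beats.quantizer import NormEMAVectorQuantizer  # path as added in this commit

vq = NormEMAVectorQuantizer(n_embed=32, embedding_dim=16, beta=1.0, kmeans_init=False)
vq.train()

z = torch.randn(8, 10, 16)        # (batch, time, feature)
z_q, commit_loss, codes = vq(z)    # quantized features, commitment loss, flat codebook indices
print(z_q.shape, commit_loss.item(), codes.shape)
```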
models/frame_mn/Frame_MN_wrapper.py
ADDED
|
@@ -0,0 +1,75 @@
from models.frame_passt.preprocess import AugmentMelSTFT
from models.transformer_wrapper import BaseModelWrapper
from models.frame_mn.model import get_model


class FrameMNWrapper(BaseModelWrapper):
    def __init__(self, width_mult=1.0) -> None:
        super().__init__()
        self.mel = AugmentMelSTFT(
            n_mels=128,
            sr=16_000,
            win_length=400,
            hopsize=160,
            n_fft=512,
            freqm=0,
            timem=0,
            htk=False,
            fmin=0.0,
            fmax=None,
            norm=1,
            fmin_aug_range=10,
            fmax_aug_range=2000,
            fast_norm=True,
            preamp=True,
            padding="center",
            periodic_window=False,
        )

        self.frame_mn = get_model(
            width_mult=width_mult
        )

    def mel_forward(self, x):
        return self.mel(x)

    def forward(self, x):
        return self.frame_mn(x)

    def separate_params(self):
        pt_params = [[], [], [], [], [], [], [], [], [], [], [], []]
        for k, p in self.named_parameters():
            if any(['cls_token' in k,
                    'pos_embed' in k,
                    'norm_stats' in k,
                    'patch_embed' in k]):
                pt_params[0].append(p)
            elif 'blocks.0.' in k:
                pt_params[0].append(p)
            elif 'blocks.1.' in k:
                pt_params[1].append(p)
            elif 'blocks.2.' in k:
                pt_params[2].append(p)
            elif 'blocks.3.' in k:
                pt_params[3].append(p)
            elif 'blocks.4.' in k:
                pt_params[4].append(p)
            elif 'blocks.5.' in k:
                pt_params[5].append(p)
            elif 'blocks.6.' in k:
                pt_params[6].append(p)
            elif 'blocks.7.' in k:
                pt_params[7].append(p)
            elif 'blocks.8.' in k:
                pt_params[8].append(p)
            elif 'blocks.9.' in k:
                pt_params[9].append(p)
            elif 'blocks.10.' in k:
                pt_params[10].append(p)
            elif 'blocks.11.' in k:
                pt_params[11].append(p)
            elif 'asit.norm.weight' in k or 'asit.norm.bias' in k:
                pt_params[11].append(p)
            else:
                raise ValueError(f"Check separate params for ASiT! Unknown key: {k}")
        return list(reversed(pt_params))
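`FrameMNWrapper` first turns a 16 kHz waveform into log-mel features via `mel_forward` and then runs the MobileNet to obtain a frame-level embedding sequence. A rough usage sketch; the exact mel output shape depends on `AugmentMelSTFT`, so the unsqueeze to a single input channel below is an assumption rather than documented behavior:

```python
import torch
from models.frame_mn.Frame_MN_wrapper import FrameMNWrapper  # path as added in this commit

wrapper = FrameMNWrapper(width_mult=1.0)
wrapper.eval()

waveform = torch.randn(1, 16000 * 10)         # 10 s of 16 kHz audio (random, only to show shapes)
with torch.no_grad():
    mel = wrapper.mel_forward(waveform)        # assumed (batch, mels, frames) log-mel spectrogram
    seq = wrapper(mel.unsqueeze(1))            # (batch, frames', channels) frame-level embeddings
print(seq.shape)
```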
models/frame_mn/block_types.py
ADDED
|
@@ -0,0 +1,189 @@
from typing import Dict, Callable, List
import torch
import torch.nn as nn
from torch import Tensor
from torchvision.ops.misc import ConvNormActivation

from models.frame_mn.utils import make_divisible, cnn_out_size


class ConcurrentSEBlock(torch.nn.Module):
    def __init__(
        self,
        c_dim: int,
        f_dim: int,
        t_dim: int,
        se_cnf: Dict
    ) -> None:
        super().__init__()
        dims = [c_dim, f_dim, t_dim]
        self.conc_se_layers = nn.ModuleList()
        for d in se_cnf['se_dims']:
            input_dim = dims[d-1]
            squeeze_dim = make_divisible(input_dim // se_cnf['se_r'], 8)
            self.conc_se_layers.append(SqueezeExcitation(input_dim, squeeze_dim, d))
        if se_cnf['se_agg'] == "max":
            self.agg_op = lambda x: torch.max(x, dim=0)[0]
        elif se_cnf['se_agg'] == "avg":
            self.agg_op = lambda x: torch.mean(x, dim=0)
        elif se_cnf['se_agg'] == "add":
            self.agg_op = lambda x: torch.sum(x, dim=0)
        elif se_cnf['se_agg'] == "min":
            self.agg_op = lambda x: torch.min(x, dim=0)[0]
        else:
            raise NotImplementedError(f"SE aggregation operation '{self.agg_op}' not implemented")

    def forward(self, input: Tensor) -> Tensor:
        # apply all concurrent se layers
        se_outs = []
        for se_layer in self.conc_se_layers:
            se_outs.append(se_layer(input))
        out = self.agg_op(torch.stack(se_outs, dim=0))
        return out


class SqueezeExcitation(torch.nn.Module):
    """
    This block implements the Squeeze-and-Excitation block from https://arxiv.org/abs/1709.01507.
    Args:
        input_dim (int): Input dimension
        squeeze_dim (int): Size of Bottleneck
        activation (Callable): activation applied to bottleneck
        scale_activation (Callable): activation applied to the output
    """

    def __init__(
        self,
        input_dim: int,
        squeeze_dim: int,
        se_dim: int,
        activation: Callable[..., torch.nn.Module] = torch.nn.ReLU,
        scale_activation: Callable[..., torch.nn.Module] = torch.nn.Sigmoid,
    ) -> None:
        super().__init__()
        self.fc1 = torch.nn.Linear(input_dim, squeeze_dim)
        self.fc2 = torch.nn.Linear(squeeze_dim, input_dim)
        assert se_dim in [1, 2, 3]
        self.se_dim = [1, 2, 3]
        self.se_dim.remove(se_dim)
        self.activation = activation()
        self.scale_activation = scale_activation()

    def _scale(self, input: Tensor) -> Tensor:
        scale = torch.mean(input, self.se_dim, keepdim=True)
        shape = scale.size()
        scale = self.fc1(scale.squeeze(2).squeeze(2))
        scale = self.activation(scale)
        scale = self.fc2(scale)
        scale = scale
        return self.scale_activation(scale).view(shape)

    def forward(self, input: Tensor) -> Tensor:
        scale = self._scale(input)
        return scale * input


class InvertedResidualConfig:
    # Stores information listed at Tables 1 and 2 of the MobileNetV3 paper
    def __init__(
        self,
        input_channels: int,
        kernel: int,
        expanded_channels: int,
        out_channels: int,
        use_se: bool,
        activation: str,
        stride: tuple[int],
        dilation: tuple[int],
        width_mult: float,
    ):
        self.input_channels = self.adjust_channels(input_channels, width_mult)
        self.kernel = kernel
        self.expanded_channels = self.adjust_channels(expanded_channels, width_mult)
        self.out_channels = self.adjust_channels(out_channels, width_mult)
        self.use_se = use_se
        self.use_hs = activation == "HS"
        self.stride = stride
        self.dilation = dilation
        self.f_dim = None
        self.t_dim = None

    @staticmethod
    def adjust_channels(channels: int, width_mult: float):
        return make_divisible(channels * width_mult, 8)

    def out_size(self, in_size, idx=None):
        dilation = self.dilation if idx is None else self.dilation[idx]
        padding = (self.kernel - 1) // 2 * dilation
        stride = self.stride if idx is None else self.stride[idx]
        return cnn_out_size(in_size, padding, dilation, self.kernel, stride)


class InvertedResidual(nn.Module):
    def __init__(
        self,
        cnf: InvertedResidualConfig,
        se_cnf: Dict,
        norm_layer: Callable[..., nn.Module],
        depthwise_norm_layer: Callable[..., nn.Module]
    ):
        super().__init__()

        if not (1 <= cnf.stride[0] <= 2 or 1 <= cnf.stride[1] <= 2):
            raise ValueError("illegal stride value")

        self.use_res_connect = cnf.stride[0] == 1 and cnf.stride[1] == 1 and cnf.input_channels == cnf.out_channels

        layers: List[nn.Module] = []
        activation_layer = nn.Hardswish if cnf.use_hs else nn.ReLU

        # expand
        if cnf.expanded_channels != cnf.input_channels:
            layers.append(
                ConvNormActivation(
                    cnf.input_channels,
                    cnf.expanded_channels,
                    kernel_size=1,
                    norm_layer=norm_layer,
                    activation_layer=activation_layer,
                )
            )

        # depthwise
        d = cnf.dilation > 1 if isinstance(cnf.dilation, int) else cnf.dilation[1] > 1
        stride = [cnf.stride, cnf.stride] if isinstance(cnf.stride, int) else list(cnf.stride)

        if d:
            stride[1] = 1

        layers.append(
            ConvNormActivation(
                cnf.expanded_channels,
                cnf.expanded_channels,
                kernel_size=cnf.kernel,
                stride=tuple(stride),
                dilation=cnf.dilation,
                groups=cnf.expanded_channels,
                norm_layer=depthwise_norm_layer,
                activation_layer=activation_layer,
            )
        )
        if cnf.use_se and se_cnf['se_dims'] is not None:
            layers.append(ConcurrentSEBlock(cnf.expanded_channels, cnf.f_dim, cnf.t_dim, se_cnf))

        # project
        layers.append(
            ConvNormActivation(
                cnf.expanded_channels, cnf.out_channels, kernel_size=1, norm_layer=norm_layer, activation_layer=None
            )
        )

        self.block = nn.Sequential(*layers)
        self.out_channels = cnf.out_channels
        # self._is_cn = cnf.stride[0] > 1 and cnf.stride[1] > 1

    def forward(self, inp: Tensor) -> Tensor:
        result = self.block(inp)
        if self.use_res_connect:
            result += inp
        return result
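The `SqueezeExcitation` variant above gates one chosen dimension (channels, frequency, or time) by averaging over the other two non-batch dimensions. A small sketch of channel-wise gating with `se_dim=1`, assuming `models/frame_mn` is importable from the Space's root:

```python
import torch
from models.frame_mn.block_types import SqueezeExcitation  # path as added in this commit

# channel-wise SE: squeeze over frequency (dim 2) and time (dim 3)
se = SqueezeExcitation(input_dim=32, squeeze_dim=8, se_dim=1)
x = torch.randn(4, 32, 16, 250)   # (batch, channels, frequency, time)
y = se(x)                          # same shape, channels re-weighted by sigmoid gates
assert y.shape == x.shape
```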
models/frame_mn/model.py
ADDED
|
@@ -0,0 +1,356 @@
| 1 |
+
import os
|
| 2 |
+
import urllib.parse
|
| 3 |
+
from functools import partial
|
| 4 |
+
from typing import Any, Callable, List, Optional, Sequence, Tuple
|
| 5 |
+
|
| 6 |
+
import torch
|
| 7 |
+
from torch import nn, Tensor
|
| 8 |
+
from torch.hub import load_state_dict_from_url
|
| 9 |
+
from torchvision.ops.misc import ConvNormActivation
|
| 10 |
+
|
| 11 |
+
from models.frame_mn.block_types import InvertedResidualConfig, InvertedResidual
|
| 12 |
+
from models.frame_mn.utils import cnn_out_size
|
| 13 |
+
|
| 14 |
+
# Adapted version of MobileNetV3 pytorch implementation
|
| 15 |
+
# https://github.com/pytorch/vision/blob/main/torchvision/models/mobilenetv3.py
|
| 16 |
+
|
| 17 |
+
# points to github releases
|
| 18 |
+
model_url = "https://github.com/fschmid56/EfficientAT/releases/download/v0.0.1/"
|
| 19 |
+
# folder to store downloaded models to
|
| 20 |
+
model_dir = "resources"
|
| 21 |
+
|
| 22 |
+
pretrained_models = {
|
| 23 |
+
# pytorch ImageNet pre-trained model
|
| 24 |
+
# own ImageNet pre-trained models will follow
|
| 25 |
+
# NOTE: for easy loading we provide the adapted state dict ready for AudioSet training (1 input channel,
|
| 26 |
+
# 527 output classes)
|
| 27 |
+
# NOTE: the classifier is just a random initialization, feature extractor (conv layers) is pre-trained
|
| 28 |
+
"mn10_im_pytorch": urllib.parse.urljoin(model_url, "mn10_im_pytorch.pt"),
|
| 29 |
+
# self-trained models on ImageNet
|
| 30 |
+
"mn01_im": urllib.parse.urljoin(model_url, "mn01_im.pt"),
|
| 31 |
+
"mn02_im": urllib.parse.urljoin(model_url, "mn02_im.pt"),
|
| 32 |
+
"mn04_im": urllib.parse.urljoin(model_url, "mn04_im.pt"),
|
| 33 |
+
"mn05_im": urllib.parse.urljoin(model_url, "mn05_im.pt"),
|
| 34 |
+
"mn06_im": urllib.parse.urljoin(model_url, "mn06_im.pt"),
|
| 35 |
+
"mn10_im": urllib.parse.urljoin(model_url, "mn10_im.pt"),
|
| 36 |
+
"mn20_im": urllib.parse.urljoin(model_url, "mn20_im.pt"),
|
| 37 |
+
"mn30_im": urllib.parse.urljoin(model_url, "mn30_im.pt"),
|
| 38 |
+
"mn40_im": urllib.parse.urljoin(model_url, "mn40_im.pt"),
|
| 39 |
+
# Models trained on AudioSet
|
| 40 |
+
"mn01_as": urllib.parse.urljoin(model_url, "mn01_as_mAP_298.pt"),
|
| 41 |
+
"mn02_as": urllib.parse.urljoin(model_url, "mn02_as_mAP_378.pt"),
|
| 42 |
+
"mn04_as": urllib.parse.urljoin(model_url, "mn04_as_mAP_432.pt"),
|
| 43 |
+
"mn05_as": urllib.parse.urljoin(model_url, "mn05_as_mAP_443.pt"),
|
| 44 |
+
"mn10_as": urllib.parse.urljoin(model_url, "mn10_as_mAP_471.pt"),
|
| 45 |
+
"mn20_as": urllib.parse.urljoin(model_url, "mn20_as_mAP_478.pt"),
|
| 46 |
+
"mn30_as": urllib.parse.urljoin(model_url, "mn30_as_mAP_482.pt"),
|
| 47 |
+
"mn40_as": urllib.parse.urljoin(model_url, "mn40_as_mAP_484.pt"),
|
| 48 |
+
"mn40_as(2)": urllib.parse.urljoin(model_url, "mn40_as_mAP_483.pt"),
|
| 49 |
+
"mn40_as(3)": urllib.parse.urljoin(model_url, "mn40_as_mAP_483(2).pt"),
|
| 50 |
+
"mn40_as_no_im_pre": urllib.parse.urljoin(model_url, "mn40_as_no_im_pre_mAP_483.pt"),
|
| 51 |
+
"mn40_as_no_im_pre(2)": urllib.parse.urljoin(model_url, "mn40_as_no_im_pre_mAP_483(2).pt"),
|
| 52 |
+
"mn40_as_no_im_pre(3)": urllib.parse.urljoin(model_url, "mn40_as_no_im_pre_mAP_482.pt"),
|
| 53 |
+
"mn40_as_ext": urllib.parse.urljoin(model_url, "mn40_as_ext_mAP_487.pt"),
|
| 54 |
+
"mn40_as_ext(2)": urllib.parse.urljoin(model_url, "mn40_as_ext_mAP_486.pt"),
|
| 55 |
+
"mn40_as_ext(3)": urllib.parse.urljoin(model_url, "mn40_as_ext_mAP_485.pt"),
|
| 56 |
+
# varying hop size (time resolution)
|
| 57 |
+
"mn10_as_hop_5": urllib.parse.urljoin(model_url, "mn10_as_hop_5_mAP_475.pt"),
|
| 58 |
+
"mn10_as_hop_15": urllib.parse.urljoin(model_url, "mn10_as_hop_15_mAP_463.pt"),
|
| 59 |
+
"mn10_as_hop_20": urllib.parse.urljoin(model_url, "mn10_as_hop_20_mAP_456.pt"),
|
| 60 |
+
"mn10_as_hop_25": urllib.parse.urljoin(model_url, "mn10_as_hop_25_mAP_447.pt"),
|
| 61 |
+
# varying n_mels (frequency resolution)
|
| 62 |
+
"mn10_as_mels_40": urllib.parse.urljoin(model_url, "mn10_as_mels_40_mAP_453.pt"),
|
| 63 |
+
"mn10_as_mels_64": urllib.parse.urljoin(model_url, "mn10_as_mels_64_mAP_461.pt"),
|
| 64 |
+
"mn10_as_mels_256": urllib.parse.urljoin(model_url, "mn10_as_mels_256_mAP_474.pt"),
|
| 65 |
+
# fully-convolutional head
|
| 66 |
+
"mn10_as_fc": urllib.parse.urljoin(model_url, "mn10_as_fc_mAP_465.pt"),
|
| 67 |
+
"mn10_as_fc_s2221": urllib.parse.urljoin(model_url, "mn10_as_fc_s2221_mAP_466.pt"),
|
| 68 |
+
"mn10_as_fc_s2211": urllib.parse.urljoin(model_url, "mn10_as_fc_s2211_mAP_466.pt"),
|
| 69 |
+
}
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
class MN(nn.Module):
|
| 73 |
+
def __init__(
|
| 74 |
+
self,
|
| 75 |
+
inverted_residual_setting: List[InvertedResidualConfig],
|
| 76 |
+
block: Optional[Callable[..., nn.Module]] = None,
|
| 77 |
+
norm_layer: Optional[Callable[..., nn.Module]] = None,
|
| 78 |
+
in_conv_kernel: int = 3,
|
| 79 |
+
in_conv_stride: int = 2,
|
| 80 |
+
in_channels: int = 1,
|
| 81 |
+
**kwargs: Any,
|
| 82 |
+
) -> None:
|
| 83 |
+
"""
|
| 84 |
+
MobileNet V3 main class
|
| 85 |
+
|
| 86 |
+
Args:
|
| 87 |
+
inverted_residual_setting (List[InvertedResidualConfig]): Network structure
|
| 88 |
+
block (Optional[Callable[..., nn.Module]]): Module specifying inverted residual building block for models
|
| 89 |
+
norm_layer (Optional[Callable[..., nn.Module]]): Module specifying the normalization layer to use
|
| 90 |
+
in_conv_kernel (int): Size of kernel for first convolution
|
| 91 |
+
in_conv_stride (int): Size of stride for first convolution
|
| 92 |
+
in_channels (int): Number of input channels
|
| 93 |
+
"""
|
| 94 |
+
super(MN, self).__init__()
|
| 95 |
+
|
| 96 |
+
if not inverted_residual_setting:
|
| 97 |
+
raise ValueError("The inverted_residual_setting should not be empty")
|
| 98 |
+
elif not (
|
| 99 |
+
isinstance(inverted_residual_setting, Sequence)
|
| 100 |
+
and all([isinstance(s, InvertedResidualConfig) for s in inverted_residual_setting])
|
| 101 |
+
):
|
| 102 |
+
raise TypeError("The inverted_residual_setting should be List[InvertedResidualConfig]")
|
| 103 |
+
|
| 104 |
+
if block is None:
|
| 105 |
+
block = InvertedResidual
|
| 106 |
+
|
| 107 |
+
depthwise_norm_layer = norm_layer = \
|
| 108 |
+
norm_layer if norm_layer is not None else partial(nn.BatchNorm2d, eps=0.001, momentum=0.01)
|
| 109 |
+
|
| 110 |
+
layers: List[nn.Module] = []
|
| 111 |
+
|
| 112 |
+
kernel_sizes = [in_conv_kernel]
|
| 113 |
+
strides = [in_conv_stride]
|
| 114 |
+
|
| 115 |
+
# building first layer
|
| 116 |
+
firstconv_output_channels = inverted_residual_setting[0].input_channels
|
| 117 |
+
layers.append(
|
| 118 |
+
ConvNormActivation(
|
| 119 |
+
in_channels,
|
| 120 |
+
firstconv_output_channels,
|
| 121 |
+
kernel_size=in_conv_kernel,
|
| 122 |
+
stride=in_conv_stride,
|
| 123 |
+
norm_layer=norm_layer,
|
| 124 |
+
activation_layer=nn.Hardswish,
|
| 125 |
+
)
|
| 126 |
+
)
|
| 127 |
+
|
| 128 |
+
# get squeeze excitation config
|
| 129 |
+
se_cnf = kwargs.get('se_conf', None)
|
| 130 |
+
|
| 131 |
+
# building inverted residual blocks
|
| 132 |
+
# - keep track of size of frequency and time dimensions for possible application of Squeeze-and-Excitation
|
| 133 |
+
# on the frequency/time dimension
|
| 134 |
+
# - applying Squeeze-and-Excitation on the time dimension is not recommended as this constrains the network to
|
| 135 |
+
# a particular length of the audio clip, whereas Squeeze-and-Excitation on the frequency bands is fine,
|
| 136 |
+
# as the number of frequency bands is usually not changing
|
| 137 |
+
f_dim, t_dim = kwargs.get('input_dims', (128, 1000))
|
| 138 |
+
# take into account first conv layer
|
| 139 |
+
f_dim = cnn_out_size(f_dim, 1, 1, 3, 2)
|
| 140 |
+
t_dim = cnn_out_size(t_dim, 1, 1, 3, 2)
|
| 141 |
+
for cnf in inverted_residual_setting:
|
| 142 |
+
f_dim = cnf.out_size(f_dim, idx=0)
|
| 143 |
+
t_dim = cnf.out_size(t_dim, idx=1)
|
| 144 |
+
cnf.f_dim, cnf.t_dim = f_dim, t_dim # update dimensions in block config
|
| 145 |
+
layers.append(block(cnf, se_cnf, norm_layer, depthwise_norm_layer))
|
| 146 |
+
kernel_sizes.append(cnf.kernel)
|
| 147 |
+
strides.append(cnf.stride)
|
| 148 |
+
|
| 149 |
+
# building last several layers
|
| 150 |
+
lastconv_input_channels = inverted_residual_setting[-1].out_channels
|
| 151 |
+
lastconv_output_channels = 6 * lastconv_input_channels
|
| 152 |
+
self.lastconv_output_channels = lastconv_output_channels
|
| 153 |
+
layers.append(
|
| 154 |
+
ConvNormActivation(
|
| 155 |
+
lastconv_input_channels,
|
| 156 |
+
lastconv_output_channels,
|
| 157 |
+
kernel_size=1,
|
| 158 |
+
norm_layer=norm_layer,
|
| 159 |
+
activation_layer=nn.Hardswish,
|
| 160 |
+
)
|
| 161 |
+
)
|
| 162 |
+
|
| 163 |
+
self.features = nn.Sequential(*layers)
|
| 164 |
+
|
| 165 |
+
# no prediction head needed - we want to use Frame-MobileNet to extract a 3D sequence
|
| 166 |
+
# i.e.: batch size x sequence length x channel dimension
|
| 167 |
+
|
| 168 |
+
for m in self.modules():
|
| 169 |
+
if isinstance(m, nn.Conv2d):
|
| 170 |
+
nn.init.kaiming_normal_(m.weight, mode="fan_out")
|
| 171 |
+
if m.bias is not None:
|
| 172 |
+
nn.init.zeros_(m.bias)
|
| 173 |
+
elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm, nn.LayerNorm)):
|
| 174 |
+
nn.init.ones_(m.weight)
|
| 175 |
+
nn.init.zeros_(m.bias)
|
| 176 |
+
elif isinstance(m, nn.Linear):
|
| 177 |
+
nn.init.normal_(m.weight, 0, 0.01)
|
| 178 |
+
if m.bias is not None:
|
| 179 |
+
nn.init.zeros_(m.bias)
|
| 180 |
+
|
| 181 |
+
def _forward_impl(self, x: Tensor, return_fmaps: bool = False) -> Tensor:
|
| 182 |
+
fmaps = []
|
| 183 |
+
|
| 184 |
+
for i, layer in enumerate(self.features):
|
| 185 |
+
x = layer(x)
|
| 186 |
+
if return_fmaps:
|
| 187 |
+
fmaps.append(x)
|
| 188 |
+
|
| 189 |
+
# reshape: batch size x channels x frequency bands x time -> batch size x time x channels
|
| 190 |
+
# works, because frequency dimension is exactly 1
|
| 191 |
+
x = x.squeeze(2).permute(0, 2, 1)
|
| 192 |
+
return x
|
| 193 |
+
|
| 194 |
+
def forward(self, x: Tensor) -> Tensor:
|
| 195 |
+
return self._forward_impl(x)
|
| 196 |
+
|
| 197 |
+
def load_model(self, path, wandb_id):
|
| 198 |
+
ckpt_path = os.path.join(path, wandb_id + ".ckpt")
|
| 199 |
+
|
| 200 |
+
pretrained_weights = torch.load(ckpt_path, map_location="cpu")["state_dict"]
|
| 201 |
+
pretrained_weights = {k[10:]: v for k, v in pretrained_weights.items() if k[:10] == "net.model."}
|
| 202 |
+
self.load_state_dict(pretrained_weights)
|
| 203 |
+
|
| 204 |
+
print("Loaded model successfully. Wandb_id:", wandb_id)
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def _mobilenet_v3_conf(
|
| 208 |
+
width_mult: float = 1.0,
|
| 209 |
+
reduced_tail: bool = False,
|
| 210 |
+
dilated: bool = False,
|
| 211 |
+
strides: Tuple[int] = None,
|
| 212 |
+
dilation_list_t_dim: Optional[List[int]] = None,
|
| 213 |
+
**kwargs
|
| 214 |
+
):
|
| 215 |
+
reduce_divider = 2 if reduced_tail else 1
|
| 216 |
+
if dilation_list_t_dim is None:
|
| 217 |
+
dilation_list_t_dim = [1] * 15
|
| 218 |
+
if dilated:
|
| 219 |
+
dilation_list_t_dim[-3:] = [2] * 3
|
| 220 |
+
|
| 221 |
+
print("dilation_list_t_dim: ")
|
| 222 |
+
print(dilation_list_t_dim)
|
| 223 |
+
|
| 224 |
+
bneck_conf = partial(InvertedResidualConfig, width_mult=width_mult)
|
| 225 |
+
adjust_channels = partial(InvertedResidualConfig.adjust_channels, width_mult=width_mult)
|
| 226 |
+
|
| 227 |
+
if strides is None:
|
| 228 |
+
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
|
| 229 |
+
f_strides = (1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2)
|
| 230 |
+
t_strides = (1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
|
| 231 |
+
|
| 232 |
+
strides = tuple(zip(f_strides, t_strides))
|
| 233 |
+
|
| 234 |
+
# InvertedResidualConfig:
|
| 235 |
+
# input_channels, kernel, expanded_channels, out_channels, use_se, activation, stride, dilation
|
| 236 |
+
inverted_residual_setting = [
|
| 237 |
+
bneck_conf(16, 3, 16, 16, False, "RE", strides[0], (1, dilation_list_t_dim[0])), # 0
|
| 238 |
+
bneck_conf(16, 3, 64, 24, False, "RE", strides[1], (1, dilation_list_t_dim[1])), # 1 - C1
|
| 239 |
+
bneck_conf(24, 3, 72, 24, False, "RE", strides[2], (1, dilation_list_t_dim[2])), # 2
|
| 240 |
+
bneck_conf(24, 5, 72, 40, True, "RE", strides[3], (1, dilation_list_t_dim[3])), # 3 - C2
|
| 241 |
+
bneck_conf(40, 5, 120, 40, True, "RE", strides[4], (1, dilation_list_t_dim[4])), # 4
|
| 242 |
+
bneck_conf(40, 5, 120, 40, True, "RE", strides[5], (1, dilation_list_t_dim[5])), # 5
|
| 243 |
+
bneck_conf(40, 3, 240, 80, False, "HS", strides[6], (1, dilation_list_t_dim[6])), # 6 - C3
|
| 244 |
+
bneck_conf(80, 3, 200, 80, False, "HS", strides[7], (1, dilation_list_t_dim[7])), # 7
|
| 245 |
+
bneck_conf(80, 3, 184, 80, False, "HS", strides[8], (1, dilation_list_t_dim[8])), # 8
|
| 246 |
+
bneck_conf(80, 3, 184, 80, False, "HS", strides[9], (1, dilation_list_t_dim[9])), # 9
|
| 247 |
+
bneck_conf(80, 3, 480, 112, True, "HS", strides[10], (1, dilation_list_t_dim[10])), # 10
|
| 248 |
+
bneck_conf(112, 3, 672, 112, True, "HS", strides[11], (1, dilation_list_t_dim[11])), # 11
|
| 249 |
+
bneck_conf(112, 5, 672, 160 // reduce_divider, True, "HS", strides[12], (1, dilation_list_t_dim[12])),
|
| 250 |
+
# 12 - C4 # dilation
|
| 251 |
+
bneck_conf(160 // reduce_divider, 5, 960 // reduce_divider, 160 // reduce_divider, True, "HS", strides[13],
|
| 252 |
+
(1, dilation_list_t_dim[13])), # 13 # dilation
|
| 253 |
+
bneck_conf(160 // reduce_divider, 5, 960 // reduce_divider, 160 // reduce_divider, True, "HS", strides[14],
|
| 254 |
+
(1, dilation_list_t_dim[14])), # 14 # dilation
|
| 255 |
+
]
|
| 256 |
+
last_channel = adjust_channels(1280 // reduce_divider)
|
| 257 |
+
|
| 258 |
+
return inverted_residual_setting, last_channel
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
def _mobilenet_v3(
|
| 262 |
+
inverted_residual_setting: List[InvertedResidualConfig],
|
| 263 |
+
pretrained_name: str,
|
| 264 |
+
**kwargs: Any,
|
| 265 |
+
):
|
| 266 |
+
model = MN(inverted_residual_setting, **kwargs)
|
| 267 |
+
|
| 268 |
+
if pretrained_name in pretrained_models:
|
| 269 |
+
model_url = pretrained_models.get(pretrained_name)
|
| 270 |
+
state_dict = load_state_dict_from_url(model_url, model_dir=model_dir, map_location="cpu")
|
| 271 |
+
if kwargs['head_type'] == "mlp":
|
| 272 |
+
num_classes = state_dict['classifier.5.bias'].size(0)
|
| 273 |
+
elif kwargs['head_type'] == "fully_convolutional":
|
| 274 |
+
num_classes = state_dict['classifier.1.bias'].size(0)
|
| 275 |
+
else:
|
| 276 |
+
print("Loading weights for classifier only implemented for head types 'mlp' and 'fully_convolutional'")
|
| 277 |
+
num_classes = -1
|
| 278 |
+
if kwargs['num_classes'] != num_classes:
|
| 279 |
+
# if the number of logits is not matching the state dict,
|
| 280 |
+
# drop the corresponding pre-trained part
|
| 281 |
+
pretrain_logits = state_dict['classifier.5.bias'].size(0) if kwargs['head_type'] == "mlp" \
|
| 282 |
+
else state_dict['classifier.1.bias'].size(0)
|
| 283 |
+
print(f"Number of classes defined: {kwargs['num_classes']}, "
|
| 284 |
+
f"but try to load pre-trained layer with logits: {pretrain_logits}\n"
|
| 285 |
+
"Dropping last layer.")
|
| 286 |
+
if kwargs['head_type'] == "mlp":
|
| 287 |
+
del state_dict['classifier.5.weight']
|
| 288 |
+
del state_dict['classifier.5.bias']
|
| 289 |
+
else:
|
| 290 |
+
state_dict = {k: v for k, v in state_dict.items() if not k.startswith('classifier')}
|
| 291 |
+
try:
|
| 292 |
+
model.load_state_dict(state_dict)
|
| 293 |
+
except RuntimeError as e:
|
| 294 |
+
print(str(e))
|
| 295 |
+
print("Loading weights pre-trained weights in a non-strict manner.")
|
| 296 |
+
model.load_state_dict(state_dict, strict=False)
|
| 297 |
+
elif pretrained_name:
|
| 298 |
+
raise NotImplementedError(f"Model name '{pretrained_name}' unknown.")
|
| 299 |
+
return model
|
| 300 |
+
|
| 301 |
+
|
| 302 |
+
def mobilenet_v3(pretrained_name: str = None, **kwargs: Any) \
|
| 303 |
+
-> MN:
|
| 304 |
+
"""
|
| 305 |
+
Constructs a MobileNetV3 architecture from
|
| 306 |
+
"Searching for MobileNetV3" <https://arxiv.org/abs/1905.02244>".
|
| 307 |
+
"""
|
| 308 |
+
inverted_residual_setting, last_channel = _mobilenet_v3_conf(**kwargs)
|
| 309 |
+
return _mobilenet_v3(inverted_residual_setting, pretrained_name, **kwargs)
|
| 310 |
+
|
| 311 |
+
|
| 312 |
+
def get_model(pretrained_name: str = None, width_mult: float = 1.0,
|
| 313 |
+
reduced_tail: bool = False, dilated: bool = False, dilation_list_t_dim=None,
|
| 314 |
+
strides: Tuple[int, int, int, int] = None,
|
| 315 |
+
head_type: str = "mlp", multihead_attention_heads: int = 4, input_dim_f: int = 128,
|
| 316 |
+
input_dim_t: int = 1000, se_dims: str = 'c', se_agg: str = "max", se_r: int = 4):
|
| 317 |
+
"""
|
| 318 |
+
Arguments to modify the instantiation of a MobileNetv3
|
| 319 |
+
|
| 320 |
+
Args:
|
| 321 |
+
pretrained_name (str): Specifies name of pre-trained model to load
|
| 322 |
+
width_mult (float): Scales width of network
|
| 323 |
+
reduced_tail (bool): Scales down network tail
|
| 324 |
+
dilated (bool): Applies dilated convolution to network tail
|
| 325 |
+
dilation_list_t_dim (List): List of dilation factors to apply to network tail
|
| 326 |
+
strides (Tuple): Strides that are set to '2' in original implementation;
|
| 327 |
+
might be changed to modify the size of receptive field and the downsampling factor in
|
| 328 |
+
time and frequency dimension
|
| 329 |
+
head_type (str): decides which classification head to use
|
| 330 |
+
multihead_attention_heads (int): number of heads in case 'multihead_attention_heads' is used
|
| 331 |
+
input_dim_f (int): number of frequency bands
|
| 332 |
+
input_dim_t (int): number of time frames
|
| 333 |
+
se_dims (Tuple): choose dimension to apply squeeze-excitation on, if multiple dimensions are chosen, then
|
| 334 |
+
squeeze-excitation is applied concurrently and se layer outputs are fused by se_agg operation
|
| 335 |
+
se_agg (str): operation to fuse output of concurrent se layers
|
| 336 |
+
se_r (int): squeeze excitation bottleneck size
|
| 337 |
+
se_dims (str): contains letters corresponding to dimensions 'c' - channel, 'f' - frequency, 't' - time
|
| 338 |
+
"""
|
| 339 |
+
|
| 340 |
+
dim_map = {'c': 1, 'f': 2, 't': 3}
|
| 341 |
+
assert len(se_dims) <= 3 and all([s in dim_map.keys() for s in se_dims]) or se_dims == 'none'
|
| 342 |
+
input_dims = (input_dim_f, input_dim_t)
|
| 343 |
+
if se_dims == 'none':
|
| 344 |
+
se_dims = None
|
| 345 |
+
else:
|
| 346 |
+
se_dims = [dim_map[s] for s in se_dims]
|
| 347 |
+
se_conf = dict(se_dims=se_dims, se_agg=se_agg, se_r=se_r)
|
| 348 |
+
m = mobilenet_v3(pretrained_name=pretrained_name,
|
| 349 |
+
width_mult=width_mult, reduced_tail=reduced_tail, dilated=dilated,
|
| 350 |
+
dilation_list_t_dim=dilation_list_t_dim,
|
| 351 |
+
strides=strides,
|
| 352 |
+
head_type=head_type, multihead_attention_heads=multihead_attention_heads,
|
| 353 |
+
input_dims=input_dims, se_conf=se_conf
|
| 354 |
+
)
|
| 355 |
+
print(m)
|
| 356 |
+
return m
|
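A minimal usage sketch for the factory above; the import path and argument values are illustrative assumptions, not settings taken from this repository:

    from models.frame_mn.model import get_model  # assumed module path for this file

    # frame-level MobileNet, width 1.0, MLP head, squeeze-excitation on channels only
    model = get_model(pretrained_name=None, width_mult=1.0, head_type="mlp",
                      input_dim_f=128, input_dim_t=1000, se_dims='c')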
models/frame_mn/utils.py
ADDED
|
@@ -0,0 +1,93 @@
|
| 1 |
+
import math
|
| 2 |
+
from typing import Optional, Callable
|
| 3 |
+
import torch
|
| 4 |
+
import torch.nn as nn
|
| 5 |
+
from torch import Tensor
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def NAME_TO_WIDTH(name):
|
| 9 |
+
frame_mn_map = {
|
| 10 |
+
'frame_mn01': 0.1,
|
| 11 |
+
'frame_mn02': 0.2,
|
| 12 |
+
'frame_mn04': 0.4,
|
| 13 |
+
'frame_mn05': 0.5,
|
| 14 |
+
'frame_mn06': 0.6,
|
| 15 |
+
'frame_mn08': 0.8,
|
| 16 |
+
'frame_mn10': 1.0,
|
| 17 |
+
'frame_mn12': 1.2,
|
| 18 |
+
'frame_mn14': 1.4,
|
| 19 |
+
'frame_mn16': 1.6,
|
| 20 |
+
'frame_mn20': 2.0,
|
| 21 |
+
'frame_mn30': 3.0,
|
| 22 |
+
'frame_mn40': 4.0,
|
| 23 |
+
}
|
| 24 |
+
|
| 25 |
+
frame_dymn_map = {
|
| 26 |
+
'frame_dymn04': 0.4,
|
| 27 |
+
'frame_dymn10': 1.0,
|
| 28 |
+
'frame_dymn20': 2.0,
|
| 29 |
+
}
|
| 30 |
+
|
| 31 |
+
try:
|
| 32 |
+
if name.startswith('frame_dymn'):
|
| 33 |
+
w = frame_dymn_map[name[:len('frame_dymnxx')]]
|
| 34 |
+
else:
|
| 35 |
+
w = frame_mn_map[name[:len('frame_mnxx')]]
|
| 36 |
+
except KeyError:
|
| 37 |
+
w = 1.0
|
| 38 |
+
|
| 39 |
+
return w
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def make_divisible(v: float, divisor: int, min_value: Optional[int] = None) -> int:
|
| 43 |
+
"""
|
| 44 |
+
This function is taken from the original tf repo.
|
| 45 |
+
It ensures that all layers have a channel number that is divisible by 8
|
| 46 |
+
It can be seen here:
|
| 47 |
+
https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
|
| 48 |
+
"""
|
| 49 |
+
if min_value is None:
|
| 50 |
+
min_value = divisor
|
| 51 |
+
new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
|
| 52 |
+
# Make sure that round down does not go down by more than 10%.
|
| 53 |
+
if new_v < 0.9 * v:
|
| 54 |
+
new_v += divisor
|
| 55 |
+
return new_v
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def cnn_out_size(in_size, padding, dilation, kernel, stride):
|
| 59 |
+
s = in_size + 2 * padding - dilation * (kernel - 1) - 1
|
| 60 |
+
return math.floor(s / stride + 1)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def collapse_dim(x: Tensor, dim: int, mode: str = "pool", pool_fn: Callable[[Tensor, int], Tensor] = torch.mean,
|
| 64 |
+
combine_dim: int = None):
|
| 65 |
+
"""
|
| 66 |
+
Collapses dimension of multi-dimensional tensor by pooling or combining dimensions
|
| 67 |
+
:param x: input Tensor
|
| 68 |
+
:param dim: dimension to collapse
|
| 69 |
+
:param mode: 'pool' or 'combine'
|
| 70 |
+
:param pool_fn: function to be applied in case of pooling
|
| 71 |
+
:param combine_dim: dimension to join 'dim' to
|
| 72 |
+
:return: collapsed tensor
|
| 73 |
+
"""
|
| 74 |
+
if mode == "pool":
|
| 75 |
+
return pool_fn(x, dim)
|
| 76 |
+
elif mode == "combine":
|
| 77 |
+
s = list(x.size())
|
| 78 |
+
s[combine_dim] *= dim
|
| 79 |
+
s[dim] //= dim
|
| 80 |
+
return x.view(s)
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
class CollapseDim(nn.Module):
|
| 84 |
+
def __init__(self, dim: int, mode: str = "pool", pool_fn: Callable[[Tensor, int], Tensor] = torch.mean,
|
| 85 |
+
combine_dim: int = None):
|
| 86 |
+
super(CollapseDim, self).__init__()
|
| 87 |
+
self.dim = dim
|
| 88 |
+
self.mode = mode
|
| 89 |
+
self.pool_fn = pool_fn
|
| 90 |
+
self.combine_dim = combine_dim
|
| 91 |
+
|
| 92 |
+
def forward(self, x):
|
| 93 |
+
return collapse_dim(x, dim=self.dim, mode=self.mode, pool_fn=self.pool_fn, combine_dim=self.combine_dim)
|
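Quick illustrations of the helpers above (example values only, not taken from any config in this repository; the import path is assumed):

    import torch
    from models.frame_mn.utils import NAME_TO_WIDTH, make_divisible, cnn_out_size, collapse_dim

    NAME_TO_WIDTH('frame_mn10')      # -> 1.0
    NAME_TO_WIDTH('frame_dymn04')    # -> 0.4
    NAME_TO_WIDTH('something_else')  # -> 1.0 (fallback width)
    make_divisible(37.5, 8)          # -> 40 (nearest multiple of 8, never less than 90% of the input)
    cnn_out_size(1000, padding=1, dilation=1, kernel=3, stride=2)   # -> 500
    collapse_dim(torch.randn(2, 64, 4, 250), dim=2).shape           # -> torch.Size([2, 64, 250]) (mean-pooled)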
models/frame_passt/fpasst.py
ADDED
|
@@ -0,0 +1,963 @@
|
| 1 |
+
"""
|
| 2 |
+
Most of this code comes from the timm library.
|
| 3 |
+
We tried to disentangle it from the timm library version.
|
| 4 |
+
|
| 5 |
+
Adapted from https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
|
| 6 |
+
|
| 7 |
+
"""
|
| 8 |
+
import collections
|
| 9 |
+
import logging
|
| 10 |
+
import math
|
| 11 |
+
import os
|
| 12 |
+
import warnings
|
| 13 |
+
from collections import OrderedDict
|
| 14 |
+
from functools import partial
|
| 15 |
+
from itertools import repeat
|
| 16 |
+
import torch
|
| 17 |
+
import torch.nn as nn
|
| 18 |
+
import torch.nn.functional as F
|
| 19 |
+
|
| 20 |
+
from models.frame_passt.vit_helpers import (DropPath, trunc_normal_,
|
| 21 |
+
build_model_with_cfg, adapt_input_conv)
|
| 22 |
+
|
| 23 |
+
_logger = logging.getLogger()
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
# From PyTorch internals
|
| 27 |
+
def _ntuple(n):
|
| 28 |
+
def parse(x):
|
| 29 |
+
if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
|
| 30 |
+
return tuple(x)
|
| 31 |
+
return tuple(repeat(x, n))
|
| 32 |
+
|
| 33 |
+
return parse
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
to_2tuple = _ntuple(2)
|
| 37 |
+
|
| 38 |
+
IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
|
| 39 |
+
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
|
| 40 |
+
IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
|
| 41 |
+
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def _cfg(url='', **kwargs):
|
| 45 |
+
return {
|
| 46 |
+
'url': url,
|
| 47 |
+
'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
|
| 48 |
+
'crop_pct': .9, 'interpolation': 'bicubic', 'fixed_input_size': True,
|
| 49 |
+
'mean': IMAGENET_INCEPTION_MEAN, 'std': IMAGENET_INCEPTION_STD,
|
| 50 |
+
'first_conv': 'patch_embed.proj', 'classifier': 'head',
|
| 51 |
+
**kwargs
|
| 52 |
+
}
|
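# Added note (not in the original file): _cfg merges the ImageNet-style defaults above
# with any per-model overrides, later keys winning. For example (hypothetical values):
#   cfg = _cfg(url='https://example.org/weights.pt', num_classes=527, input_size=(1, 128, 998))
#   cfg['num_classes']  # -> 527, while cfg['mean'], cfg['std'], cfg['crop_pct'] keep the defaults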
| 53 |
+
|
| 54 |
+
|
| 55 |
+
default_cfgs = {
|
| 56 |
+
# patch models (weights from official Google JAX impl)
|
| 57 |
+
'vit_tiny_patch16_224': _cfg(
|
| 58 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 59 |
+
'Ti_16-i21k-300ep-lr_0.001-aug_none-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_224.npz'),
|
| 60 |
+
'vit_tiny_patch16_384': _cfg(
|
| 61 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 62 |
+
'Ti_16-i21k-300ep-lr_0.001-aug_none-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz',
|
| 63 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 64 |
+
'vit_small_patch32_224': _cfg(
|
| 65 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 66 |
+
'S_32-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_224.npz'),
|
| 67 |
+
'vit_small_patch32_384': _cfg(
|
| 68 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 69 |
+
'S_32-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz',
|
| 70 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 71 |
+
'vit_small_patch16_224': _cfg(
|
| 72 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 73 |
+
'S_16-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_224.npz'),
|
| 74 |
+
'vit_small_patch16_384': _cfg(
|
| 75 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 76 |
+
'S_16-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz',
|
| 77 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 78 |
+
'vit_base_patch32_224': _cfg(
|
| 79 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 80 |
+
'B_32-i21k-300ep-lr_0.001-aug_medium1-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_224.npz'),
|
| 81 |
+
'vit_base_patch32_384': _cfg(
|
| 82 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 83 |
+
'B_32-i21k-300ep-lr_0.001-aug_light1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz',
|
| 84 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 85 |
+
'vit_base_patch16_224': _cfg(
|
| 86 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 87 |
+
'B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_224.npz'),
|
| 88 |
+
'vit_base_patch16_384': _cfg(
|
| 89 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 90 |
+
'B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz',
|
| 91 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 92 |
+
'vit_large_patch32_224': _cfg(
|
| 93 |
+
url='', # no official model weights for this combo, only for in21k
|
| 94 |
+
),
|
| 95 |
+
'vit_large_patch32_384': _cfg(
|
| 96 |
+
url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_p32_384-9b920ba8.pth',
|
| 97 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 98 |
+
'vit_large_patch16_224': _cfg(
|
| 99 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 100 |
+
'L_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.1-sd_0.1--imagenet2012-steps_20k-lr_0.01-res_224.npz'),
|
| 101 |
+
'vit_large_patch16_384': _cfg(
|
| 102 |
+
url='https://storage.googleapis.com/vit_models/augreg/'
|
| 103 |
+
'L_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.1-sd_0.1--imagenet2012-steps_20k-lr_0.01-res_384.npz',
|
| 104 |
+
input_size=(3, 384, 384), crop_pct=1.0),
|
| 105 |
+
|
| 106 |
+
# patch models, imagenet21k (weights from official Google JAX impl)
|
| 107 |
+
'vit_tiny_patch16_224_in21k': _cfg(
|
| 108 |
+
url='https://storage.googleapis.com/vit_models/augreg/Ti_16-i21k-300ep-lr_0.001-aug_none-wd_0.03-do_0.0-sd_0.0.npz',
|
| 109 |
+
num_classes=21843),
|
| 110 |
+
'vit_small_patch32_224_in21k': _cfg(
|
| 111 |
+
url='https://storage.googleapis.com/vit_models/augreg/S_32-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0.npz',
|
| 112 |
+
num_classes=21843),
|
| 113 |
+
'vit_small_patch16_224_in21k': _cfg(
|
| 114 |
+
url='https://storage.googleapis.com/vit_models/augreg/S_16-i21k-300ep-lr_0.001-aug_light1-wd_0.03-do_0.0-sd_0.0.npz',
|
| 115 |
+
num_classes=21843),
|
| 116 |
+
'vit_base_patch32_224_in21k': _cfg(
|
| 117 |
+
url='https://storage.googleapis.com/vit_models/augreg/B_32-i21k-300ep-lr_0.001-aug_medium1-wd_0.03-do_0.0-sd_0.0.npz',
|
| 118 |
+
num_classes=21843),
|
| 119 |
+
'vit_base_patch16_224_in21k': _cfg(
|
| 120 |
+
url='https://storage.googleapis.com/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0.npz',
|
| 121 |
+
num_classes=21843),
|
| 122 |
+
'vit_large_patch32_224_in21k': _cfg(
|
| 123 |
+
url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_large_patch32_224_in21k-9046d2e7.pth',
|
| 124 |
+
num_classes=21843),
|
| 125 |
+
'vit_large_patch16_224_in21k': _cfg(
|
| 126 |
+
url='https://storage.googleapis.com/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.1-sd_0.1.npz',
|
| 127 |
+
num_classes=21843),
|
| 128 |
+
'vit_huge_patch14_224_in21k': _cfg(
|
| 129 |
+
url='https://storage.googleapis.com/vit_models/imagenet21k/ViT-H_14.npz',
|
| 130 |
+
hf_hub='timm/vit_huge_patch14_224_in21k',
|
| 131 |
+
num_classes=21843),
|
| 132 |
+
|
| 133 |
+
# SAM trained models (https://arxiv.org/abs/2106.01548)
|
| 134 |
+
'vit_base_patch32_sam_224': _cfg(
|
| 135 |
+
url='https://storage.googleapis.com/vit_models/sam/ViT-B_32.npz'),
|
| 136 |
+
'vit_base_patch16_sam_224': _cfg(
|
| 137 |
+
url='https://storage.googleapis.com/vit_models/sam/ViT-B_16.npz'),
|
| 138 |
+
|
| 139 |
+
# deit models (FB weights)
|
| 140 |
+
'deit_tiny_patch16_224': _cfg(
|
| 141 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth',
|
| 142 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
|
| 143 |
+
'deit_small_patch16_224': _cfg(
|
| 144 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth',
|
| 145 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
|
| 146 |
+
'deit_base_patch16_224': _cfg(
|
| 147 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth',
|
| 148 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
|
| 149 |
+
'deit_base_patch16_384': _cfg(
|
| 150 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_base_patch16_384-8de9b5d1.pth',
|
| 151 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(3, 384, 384), crop_pct=1.0),
|
| 152 |
+
'deit_tiny_distilled_patch16_224': _cfg(
|
| 153 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_tiny_distilled_patch16_224-b40b3cf7.pth',
|
| 154 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, classifier=('head', 'head_dist')),
|
| 155 |
+
'deit_small_distilled_patch16_224': _cfg(
|
| 156 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_small_distilled_patch16_224-649709d9.pth',
|
| 157 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, classifier=('head', 'head_dist')),
|
| 158 |
+
'deit_base_distilled_patch16_224': _cfg(
|
| 159 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_224-df68dfff.pth',
|
| 160 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, classifier=('head', 'head_dist')),
|
| 161 |
+
'deit_base_distilled_patch16_384': _cfg(
|
| 162 |
+
url='https://dl.fbaipublicfiles.com/deit/deit_base_distilled_patch16_384-d0272ac0.pth',
|
| 163 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(3, 384, 384), crop_pct=1.0,
|
| 164 |
+
classifier=('head', 'head_dist')),
|
| 165 |
+
|
| 166 |
+
# ViT ImageNet-21K-P pretraining by MILL
|
| 167 |
+
'vit_base_patch16_224_miil_in21k': _cfg(
|
| 168 |
+
url='https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/ImageNet_21K_P/models/timm/vit_base_patch16_224_in21k_miil.pth',
|
| 169 |
+
mean=(0, 0, 0), std=(1, 1, 1), crop_pct=0.875, interpolation='bilinear', num_classes=11221,
|
| 170 |
+
),
|
| 171 |
+
'vit_base_patch16_224_miil': _cfg(
|
| 172 |
+
url='https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/ImageNet_21K_P/models/timm'
|
| 173 |
+
'/vit_base_patch16_224_1k_miil_84_4.pth',
|
| 174 |
+
mean=(0, 0, 0), std=(1, 1, 1), crop_pct=0.875, interpolation='bilinear',
|
| 175 |
+
),
|
| 176 |
+
# PaSST
|
| 177 |
+
'passt_s_swa_p16_128_ap476': _cfg(
|
| 178 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.1-audioset/passt-s-f128-p16-s10-ap.476-swa.pt',
|
| 179 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 180 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 181 |
+
'passt_s_kd_p16_128_ap486': _cfg(
|
| 182 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v.0.0.9/passt-s-kd-ap.486.pt',
|
| 183 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 184 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 185 |
+
'passt_l_kd_p16_128_ap47': _cfg(
|
| 186 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v.0.0.10/passt-l-kd-ap.47.pt',
|
| 187 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 188 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 189 |
+
'passt_s_swa_p16_128_ap4761': _cfg(
|
| 190 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s10-ap.4761-swa.pt',
|
| 191 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 192 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 193 |
+
'passt_s_p16_128_ap472': _cfg(
|
| 194 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s10-ap.472.pt',
|
| 195 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 196 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 197 |
+
'passt_s_p16_s16_128_ap468': _cfg(
|
| 198 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s16-ap.468.pt',
|
| 199 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 200 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 201 |
+
'passt_s_swa_p16_s16_128_ap473': _cfg(
|
| 202 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s16-ap.473-swa.pt',
|
| 203 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 204 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 205 |
+
'passt_s_swa_p16_s14_128_ap471': _cfg(
|
| 206 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s14-ap.471-swa.pt',
|
| 207 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 208 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 209 |
+
'passt_s_p16_s14_128_ap469': _cfg(
|
| 210 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s14-ap.469.pt',
|
| 211 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 212 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 213 |
+
'passt_s_swa_p16_s12_128_ap473': _cfg(
|
| 214 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s12-ap.473-swa.pt',
|
| 215 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 216 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 217 |
+
'passt_s_p16_s12_128_ap470': _cfg(
|
| 218 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.2-audioset/passt-s-f128-p16-s12-ap.470.pt',
|
| 219 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 998), crop_pct=1.0,
|
| 220 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 221 |
+
'passt_s_swa_f128_stfthop100_p16_s10_ap473': _cfg(
|
| 222 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.3-audioset/passt-s-f128-stfthop100-p16-s10-ap.473-swa.pt',
|
| 223 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 3200), crop_pct=1.0,
|
| 224 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 225 |
+
'passt_s_swa_f128_stfthop160_p16_s10_ap473': _cfg(
|
| 226 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.3-audioset/passt-s-f128-stfthop160-p16-s10-ap.473-swa.pt',
|
| 227 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 2000), crop_pct=1.0,
|
| 228 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 229 |
+
'passt-s-f128-20sec-p16-s10-ap474-swa': _cfg(
|
| 230 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.5/passt-s-f128-20sec-p16-s10-ap.474-swa.pt',
|
| 231 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 2000), crop_pct=1.0,
|
| 232 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 233 |
+
'passt-s-f128-30sec-p16-s10-ap473-swa': _cfg(
|
| 234 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.5/passt-s-f128-30sec-p16-s10-ap.473-swa.pt',
|
| 235 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 3000), crop_pct=1.0,
|
| 236 |
+
classifier=('head.1', 'head_dist'), num_classes=527),
|
| 237 |
+
'openmic2008_passt_u_f128_p16_s10_ap85_swa': _cfg(
|
| 238 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.4-openmic/openmic2008.passt-u-f128-p16-s10-ap.85-swa.pt',
|
| 239 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 3200), crop_pct=1.0,
|
| 240 |
+
classifier=('head.1', 'head_dist'), num_classes=20),
|
| 241 |
+
'openmic2008_passt_u_f128_p16_s10_ap85 ': _cfg(
|
| 242 |
+
url='https://github.com/kkoutini/PaSST/releases/download/v0.0.4-openmic/openmic2008.passt-u-f128-p16-s10-ap.85.pt',
|
| 243 |
+
mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, input_size=(1, 128, 2000), crop_pct=1.0,
|
| 244 |
+
classifier=('head.1', 'head_dist'), num_classes=20),
|
| 245 |
+
}
|
| 246 |
+
|
| 247 |
+
|
| 248 |
+
class Mlp(nn.Module):
|
| 249 |
+
""" MLP as used in Vision Transformer, MLP-Mixer and related networks
|
| 250 |
+
"""
|
| 251 |
+
|
| 252 |
+
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
|
| 253 |
+
super().__init__()
|
| 254 |
+
out_features = out_features or in_features
|
| 255 |
+
hidden_features = hidden_features or in_features
|
| 256 |
+
self.fc1 = nn.Linear(in_features, hidden_features)
|
| 257 |
+
self.act = act_layer()
|
| 258 |
+
self.fc2 = nn.Linear(hidden_features, out_features)
|
| 259 |
+
self.drop = nn.Dropout(drop)
|
| 260 |
+
|
| 261 |
+
def forward(self, x):
|
| 262 |
+
x = self.fc1(x)
|
| 263 |
+
x = self.act(x)
|
| 264 |
+
x = self.drop(x)
|
| 265 |
+
x = self.fc2(x)
|
| 266 |
+
x = self.drop(x)
|
| 267 |
+
return x
|
| 268 |
+
|
| 269 |
+
|
| 270 |
+
first_RUN = True
|
| 271 |
+
|
| 272 |
+
PLUS1_TRICK = False
|
| 273 |
+
|
| 274 |
+
|
| 275 |
+
class PatchEmbed(nn.Module):
|
| 276 |
+
""" 2D Image to Patch Embedding
|
| 277 |
+
"""
|
| 278 |
+
|
| 279 |
+
def __init__(self, img_size=224, in_chans=1, frame_nr=1, stride=1, overlap=1, embed_dim=768, norm_layer=None):
|
| 280 |
+
super().__init__()
|
| 281 |
+
img_size = to_2tuple(img_size)
|
| 282 |
+
frame_nr = frame_nr
|
| 283 |
+
stride = stride
|
| 284 |
+
self.img_size = img_size
|
| 285 |
+
self.frame_nr = frame_nr
|
| 286 |
+
self.stride = stride
|
| 287 |
+
self.seq_len = int(img_size[1]) // frame_nr
|
| 288 |
+
self.num_patches = self.seq_len // stride
|
| 289 |
+
self.embed_dim = embed_dim
|
| 290 |
+
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=(int(img_size[0]), stride + overlap),
|
| 291 |
+
stride=stride, padding=(0, 1)) # 128 x 2 kernel
|
| 292 |
+
self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
|
| 293 |
+
|
| 294 |
+
def forward(self, x):
|
| 295 |
+
B, C, F, T = x.shape
|
| 296 |
+
if not (F == self.img_size[0] and abs(T - self.img_size[1]) <= 1): # allows for a difference of 1
|
| 297 |
+
warnings.warn(f"Input image size ({F}*{T}) doesn't match model ({self.img_size[0]}*{self.img_size[1]}).")
|
| 298 |
+
x = self.proj(x)[:, :, :, 1:] # B embed_dim 1 T (F=1)
|
| 299 |
+
x = self.norm(x)
|
| 300 |
+
if first_RUN: print("self.norm(x)", x.size())
|
| 301 |
+
return x
|
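# Added illustration (not in the original file): with the defaults, the projection kernel
# spans the whole frequency axis and `stride + overlap` time frames, so the output has
# frequency size 1 and one embedding per time frame. A rough shape sketch, assuming
# img_size=(128, 1000), stride=1, overlap=1, embed_dim=768:
#   input  x: [B, 1, 128, 1000]
#   proj(x):  [B, 768, 1, 1001]  (padding=(0, 1) pads the time axis, giving one extra frame)
#   output:   [B, 768, 1, 1000]  (the leading frame is dropped by [:, :, :, 1:])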
| 302 |
+
|
| 303 |
+
|
| 304 |
+
class Attention(nn.Module):
|
| 305 |
+
def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.):
|
| 306 |
+
super().__init__()
|
| 307 |
+
self.num_heads = num_heads
|
| 308 |
+
head_dim = dim // num_heads
|
| 309 |
+
self.scale = head_dim ** -0.5
|
| 310 |
+
|
| 311 |
+
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
|
| 312 |
+
self.attn_drop = attn_drop
|
| 313 |
+
self.proj = nn.Linear(dim, dim)
|
| 314 |
+
self.proj_drop = nn.Dropout(proj_drop)
|
| 315 |
+
|
| 316 |
+
def forward(self, x):
|
| 317 |
+
B, N, C = x.shape
|
| 318 |
+
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
|
| 319 |
+
q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
|
| 320 |
+
|
| 321 |
+
x = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.attn_drop if self.training else 0.0,
|
| 322 |
+
is_causal=False, scale=self.scale)
|
| 323 |
+
|
| 324 |
+
x = x.transpose(1, 2).reshape(B, N, C)
|
| 325 |
+
x = self.proj(x)
|
| 326 |
+
x = self.proj_drop(x)
|
| 327 |
+
return x
|
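# Shape walk-through (added comment, not in the original file), assuming B=2, N=250 tokens,
# C=768, num_heads=12 (head_dim=64):
#   self.qkv(x)               -> [2, 250, 2304]
#   .reshape(...).permute(...) -> [3, 2, 12, 250, 64]   (qkv, batch, heads, tokens, head_dim)
#   scaled_dot_product_attention(q, k, v) -> [2, 12, 250, 64]
#   .transpose(1, 2).reshape   -> [2, 250, 768], and the output projection keeps that shape.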
| 328 |
+
|
| 329 |
+
|
| 330 |
+
class Block(nn.Module):
|
| 331 |
+
|
| 332 |
+
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0.,
|
| 333 |
+
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
|
| 334 |
+
super().__init__()
|
| 335 |
+
self.norm1 = norm_layer(dim)
|
| 336 |
+
self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop)
|
| 337 |
+
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
|
| 338 |
+
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
|
| 339 |
+
self.norm2 = norm_layer(dim)
|
| 340 |
+
mlp_hidden_dim = int(dim * mlp_ratio)
|
| 341 |
+
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
|
| 342 |
+
|
| 343 |
+
def forward(self, x):
|
| 344 |
+
x = x + self.drop_path(self.attn(self.norm1(x)))
|
| 345 |
+
x = x + self.drop_path(self.mlp(self.norm2(x)))
|
| 346 |
+
return x
|
| 347 |
+
|
| 348 |
+
|
| 349 |
+
class PaSST(nn.Module):
|
| 350 |
+
"""
|
| 351 |
+
|
| 352 |
+
Based on the implementation of Vision Transformer in timm library.
|
| 353 |
+
Take a look at the get_model function, which adapts the weights of pretrained ImageNet models.
|
| 354 |
+
|
| 355 |
+
"""
|
| 356 |
+
|
| 357 |
+
def __init__(self, img_size=(128, 998),
|
| 358 |
+
in_chans=1, num_classes=527, embed_dim=768, depth=12,
|
| 359 |
+
num_heads=12, mlp_ratio=4., qkv_bias=True, representation_size=None, distilled=False,
|
| 360 |
+
drop_rate=0., attn_drop_rate=0., drop_path_rate=0., embed_layer=PatchEmbed, norm_layer=None,
|
| 361 |
+
act_layer=None, weight_init='',
|
| 362 |
+
frame_patchout=300, frame_nr=1, pos_embed_length=1000):
|
| 363 |
+
"""
|
| 364 |
+
Args:
|
| 365 |
+
img_size (int, tuple): input image size
|
| 366 |
+
in_chans (int): number of input channels
|
| 367 |
+
num_classes (int): number of classes for classification head
|
| 368 |
+
embed_dim (int): embedding dimension
|
| 369 |
+
depth (int): depth of transformer
|
| 370 |
+
num_heads (int): number of attention heads
|
| 371 |
+
mlp_ratio (int): ratio of mlp hidden dim to embedding dim
|
| 372 |
+
qkv_bias (bool): enable bias for qkv if True
|
| 373 |
+
representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
|
| 374 |
+
distilled (bool): model includes a distillation token and head as in DeiT models
|
| 375 |
+
drop_rate (float): dropout rate
|
| 376 |
+
attn_drop_rate (float): attention dropout rate
|
| 377 |
+
drop_path_rate (float): stochastic depth rate
|
| 378 |
+
embed_layer (nn.Module): patch embedding layer
|
| 379 |
+
norm_layer: (nn.Module): normalization layer
|
| 380 |
+
act_layer: (nn.Module): activation layer
|
| 381 |
+
weight_init: (str): weight init scheme
|
| 382 |
+
frame_patchout (int): number of frames to patch out
|
| 383 |
+
frame_nr (int): the second dimension of the proj-convolution kernel
|
| 384 |
+
pos_embed_length (int): length of the positional embedding
|
| 385 |
+
"""
|
| 386 |
+
super().__init__()
|
| 387 |
+
self.num_classes = num_classes
|
| 388 |
+
self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
|
| 389 |
+
self.num_tokens = 2 if distilled else 1
|
| 390 |
+
norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
|
| 391 |
+
act_layer = act_layer or nn.GELU
|
| 392 |
+
self.act_layer = act_layer()
|
| 393 |
+
self.in_chans = in_chans
|
| 394 |
+
self.frame_patchout = frame_patchout
|
| 395 |
+
self.pos_embed_len = pos_embed_length
|
| 396 |
+
|
| 397 |
+
# these three convolutions are different compared to the vanilla PaSST
|
| 398 |
+
self.conv_in_1 = nn.Conv2d(1, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
|
| 399 |
+
self.conv_in_2 = nn.Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
|
| 400 |
+
self.conv_in_3 = nn.Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)) # 64 instead of 4
|
| 401 |
+
img_size = (img_size[0], pos_embed_length) # 128, 250
|
| 402 |
+
|
| 403 |
+
self.patch_embed = embed_layer(
|
| 404 |
+
img_size=img_size, in_chans=in_chans, frame_nr=frame_nr, stride=frame_nr, embed_dim=embed_dim)
|
| 405 |
+
num_patches = self.patch_embed.num_patches
|
| 406 |
+
|
| 407 |
+
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
|
| 408 |
+
self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
|
| 409 |
+
# PaSST
|
| 410 |
+
# refer to https://arxiv.org/abs/2110.05069 Section 2
|
| 411 |
+
self.new_pos_embed = nn.Parameter(torch.zeros(1, self.num_tokens, embed_dim)) # for C and D tokens
|
| 412 |
+
self.freq_new_pos_embed = nn.Parameter(torch.zeros(1, embed_dim, 1, 1)) # | f
|
| 413 |
+
self.time_new_pos_embed = nn.Parameter(torch.zeros(1, embed_dim, 1, self.pos_embed_len)) # __ t
|
| 414 |
+
####
|
| 415 |
+
self.pos_drop = nn.Dropout(p=drop_rate)
|
| 416 |
+
|
| 417 |
+
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
|
| 418 |
+
self.blocks = nn.Sequential(*[
|
| 419 |
+
Block(
|
| 420 |
+
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, drop=drop_rate,
|
| 421 |
+
attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer, act_layer=act_layer)
|
| 422 |
+
for i in range(depth)])
|
| 423 |
+
self.norm = norm_layer(embed_dim)
|
| 424 |
+
|
| 425 |
+
# Representation layer
|
| 426 |
+
if representation_size and not distilled:
|
| 427 |
+
self.num_features = representation_size
|
| 428 |
+
self.pre_logits = nn.Sequential(OrderedDict([
|
| 429 |
+
('fc', nn.Linear(embed_dim, representation_size)),
|
| 430 |
+
('act', nn.Tanh())
|
| 431 |
+
]))
|
| 432 |
+
else:
|
| 433 |
+
self.pre_logits = nn.Identity()
|
| 434 |
+
|
| 435 |
+
self.init_weights(weight_init)
|
| 436 |
+
|
| 437 |
+
def init_weights(self, mode=''):
|
| 438 |
+
assert mode in ('jax', 'jax_nlhb', 'nlhb', ''), f"mode: {mode}"
|
| 439 |
+
head_bias = -math.log(self.num_classes) if 'nlhb' in mode else 0.
|
| 440 |
+
trunc_normal_(self.new_pos_embed, std=.02)
|
| 441 |
+
trunc_normal_(self.freq_new_pos_embed, std=.02)
|
| 442 |
+
trunc_normal_(self.time_new_pos_embed, std=.02)
|
| 443 |
+
if self.dist_token is not None:
|
| 444 |
+
trunc_normal_(self.dist_token, std=.02)
|
| 445 |
+
if mode.startswith('jax'):
|
| 446 |
+
# leave cls token as zeros to match jax impl
|
| 447 |
+
raise RuntimeError("Not supported yet")
|
| 448 |
+
else:
|
| 449 |
+
trunc_normal_(self.cls_token, std=.02)
|
| 450 |
+
self.apply(_init_vit_weights)
|
| 451 |
+
|
| 452 |
+
def _init_weights(self, m):
|
| 453 |
+
# this fn left here for compat with downstream users
|
| 454 |
+
_init_vit_weights(m)
|
| 455 |
+
|
| 456 |
+
@torch.jit.ignore
|
| 457 |
+
def no_weight_decay(self):
|
| 458 |
+
return {'new_pos_embed', 'freq_new_pos_embed', 'time_new_pos_embed', 'cls_token', 'dist_token'}
|
| 459 |
+
|
| 460 |
+
def forward_features(self, x):
|
| 461 |
+
global first_RUN # not jit friendly? use trace instead
|
| 462 |
+
|
| 463 |
+
# some 2D convolutions
|
| 464 |
+
f_dim = x.size(2) # 128
|
| 465 |
+
x = self.act_layer(self.conv_in_1(x))
|
| 466 |
+
x = self.act_layer(self.conv_in_2(x))
|
| 467 |
+
x = self.act_layer(self.conv_in_3(x))
|
| 468 |
+
if first_RUN: print("after convs", x.size())
|
| 469 |
+
x = x.reshape(x.shape[0], (x.shape[1] * x.shape[2]) // f_dim, f_dim, x.shape[3])
|
| 470 |
+
if first_RUN: print("after reshape", x.size())
|
| 471 |
+
|
| 472 |
+
x = self.patch_embed(x) # [b, e, f, t]
|
| 473 |
+
B_dim, E_dim, F_dim, T_dim = x.shape # slow
|
| 474 |
+
if first_RUN: print(" patch_embed : ", x.shape)
|
| 475 |
+
# Adding Time/Freq information
|
| 476 |
+
if first_RUN: print(" self.time_new_pos_embed.shape", self.time_new_pos_embed.shape)
|
| 477 |
+
time_new_pos_embed = self.time_new_pos_embed
|
| 478 |
+
if x.shape[-1] < time_new_pos_embed.shape[-1]:
|
| 479 |
+
if self.training:
|
| 480 |
+
toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
|
| 481 |
+
if first_RUN: print(f" CUT with randomoffset={toffset} time_new_pos_embed.shape",
|
| 482 |
+
time_new_pos_embed.shape)
|
| 483 |
+
time_new_pos_embed = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]
|
| 484 |
+
else:
|
| 485 |
+
time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
|
| 486 |
+
if first_RUN: print(" CUT time_new_pos_embed.shape", time_new_pos_embed.shape)
|
| 487 |
+
else:
|
| 488 |
+
# warnings.warn(
|
| 489 |
+
# f"the patches shape:{x.shape} are larger than the expected time encodings {time_new_pos_embed.shape}, x will be cut")
|
| 490 |
+
x = x[:, :, :, :time_new_pos_embed.shape[-1]]
|
| 491 |
+
x = x + time_new_pos_embed
|
| 492 |
+
if first_RUN: print(" self.freq_new_pos_embed.shape", self.freq_new_pos_embed.shape)
|
| 493 |
+
x = x + self.freq_new_pos_embed
|
| 494 |
+
|
| 495 |
+
# Structured Patchout https://arxiv.org/abs/2110.05069 Section 2.2
|
| 496 |
+
if self.training and self.frame_patchout:
|
| 497 |
+
if first_RUN: print(f"X Before frame Patchout of {self.frame_patchout} ", x.size())
|
| 498 |
+
# ([1, 768, 1, 82])
|
| 499 |
+
random_indices = torch.randperm(T_dim)[:T_dim - self.frame_patchout].sort().values
|
| 500 |
+
x = x[:, :, :, random_indices]
|
| 501 |
+
if first_RUN: print("X after frame Patchout", x.size())
|
| 502 |
+
|
| 503 |
+
x = x.flatten(2).transpose(1, 2)
|
| 504 |
+
|
| 505 |
+
# Add the C/D tokens
|
| 506 |
+
if first_RUN: print(" self.new_pos_embed.shape", self.new_pos_embed.shape)
|
| 507 |
+
cls_tokens = self.cls_token.expand(B_dim, -1, -1) + self.new_pos_embed[:, :1, :]
|
| 508 |
+
if first_RUN: print(" self.cls_tokens.shape", cls_tokens.shape)
|
| 509 |
+
if self.dist_token is None:
|
| 510 |
+
x = torch.cat((cls_tokens, x), dim=1)
|
| 511 |
+
else:
|
| 512 |
+
dist_token = self.dist_token.expand(B_dim, -1, -1) + self.new_pos_embed[:, 1:, :]
|
| 513 |
+
if first_RUN: print(" self.dist_token.shape", dist_token.shape)
|
| 514 |
+
x = torch.cat((cls_tokens, dist_token, x), dim=1)
|
| 515 |
+
|
| 516 |
+
if first_RUN: print(" final sequence x", x.shape)
|
| 517 |
+
x = self.pos_drop(x)
|
| 518 |
+
x = self.blocks(x)
|
| 519 |
+
if first_RUN: print(f" after {len(self.blocks)} atten blocks x", x.shape)
|
| 520 |
+
x = self.norm(x)
|
| 521 |
+
return x
|
| 522 |
+
|
| 523 |
+
def forward(self, x):
|
| 524 |
+
global first_RUN
|
| 525 |
+
if first_RUN: print("x", x.size())
|
| 526 |
+
x = self.forward_features(x)
|
| 527 |
+
c, x = x[:, :2].mean(1), x[:, 2:]
|
| 528 |
+
if first_RUN: print("x after forward_features", x.size())
|
| 529 |
+
first_RUN = False
|
| 530 |
+
return x
|
| 531 |
+
|
| 532 |
+
def load_model(self, path, wandb_id):
|
| 533 |
+
ckpt_path = os.path.join(path, wandb_id + ".ckpt")
|
| 534 |
+
|
| 535 |
+
pretrained_weights = torch.load(ckpt_path, map_location="cpu")["state_dict"]
|
| 536 |
+
pretrained_weights = {k[10:]: v for k, v in pretrained_weights.items() if k[:10] == "net.model."}
|
| 537 |
+
self.load_state_dict(pretrained_weights)
|
| 538 |
+
|
| 539 |
+
print("Loaded model successfully. Wandb_id:", wandb_id)
|
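# Added note (not in the original file): the dict comprehension above strips the
# "net.model." prefix that a Lightning-style training wrapper typically adds to parameter
# names, so the checkpoint can be loaded into this bare module. A minimal standalone
# sketch of the same idea, with hypothetical key names:
#   ckpt = {"net.model.blocks.0.norm1.weight": 1, "optimizer.lr": 2}
#   weights = {k[len("net.model."):]: v for k, v in ckpt.items() if k.startswith("net.model.")}
#   # -> {"blocks.0.norm1.weight": 1}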
| 540 |
+
|
| 541 |
+
|
| 542 |
+
def _init_vit_weights(module: nn.Module, name: str = '', head_bias: float = 0., jax_impl: bool = False):
|
| 543 |
+
""" ViT weight initialization
|
| 544 |
+
* When called without n, head_bias, jax_impl args it will behave exactly the same
|
| 545 |
+
as my original init for compatibility with prev hparam / downstream use cases (ie DeiT).
|
| 546 |
+
* When called w/ valid n (module name) and jax_impl=True, will (hopefully) match JAX impl
|
| 547 |
+
"""
|
| 548 |
+
if isinstance(module, nn.Linear):
|
| 549 |
+
if name.startswith('head'):
|
| 550 |
+
nn.init.zeros_(module.weight)
|
| 551 |
+
nn.init.constant_(module.bias, head_bias)
|
| 552 |
+
elif name.startswith('pre_logits'):
|
| 553 |
+
lecun_normal_(module.weight)
|
| 554 |
+
nn.init.zeros_(module.bias)
|
| 555 |
+
else:
|
| 556 |
+
if jax_impl:
|
| 557 |
+
nn.init.xavier_uniform_(module.weight)
|
| 558 |
+
if module.bias is not None:
|
| 559 |
+
if 'mlp' in name:
|
| 560 |
+
nn.init.normal_(module.bias, std=1e-6)
|
| 561 |
+
else:
|
| 562 |
+
nn.init.zeros_(module.bias)
|
| 563 |
+
else:
|
| 564 |
+
trunc_normal_(module.weight, std=.02)
|
| 565 |
+
if module.bias is not None:
|
| 566 |
+
nn.init.zeros_(module.bias)
|
| 567 |
+
elif jax_impl and isinstance(module, nn.Conv2d):
|
| 568 |
+
# NOTE conv was left to pytorch default in my original init
|
| 569 |
+
lecun_normal_(module.weight)
|
| 570 |
+
if module.bias is not None:
|
| 571 |
+
nn.init.zeros_(module.bias)
|
| 572 |
+
elif isinstance(module, (nn.LayerNorm, nn.GroupNorm, nn.BatchNorm2d)):
|
| 573 |
+
nn.init.zeros_(module.bias)
|
| 574 |
+
nn.init.ones_(module.weight)
|
| 575 |
+
|
| 576 |
+
|
| 577 |
+
def resize_pos_embed(posemb, posemb_new, num_tokens=1, gs_new=(), mode='bicubic'):
|
| 578 |
+
# Rescale the grid of position embeddings when loading from state_dict. Adapted from
|
| 579 |
+
# https://github.com/google-research/vision_transformer/blob/00883dd691c63a6830751563748663526e811cee/vit_jax/checkpoint.py#L224
|
| 580 |
+
_logger.info('Resized position embedding: %s to %s with %s cls/dis tokens', posemb.shape, posemb_new.shape,
|
| 581 |
+
num_tokens)
|
| 582 |
+
ntok_new = posemb_new.shape[1]
|
| 583 |
+
if num_tokens:
|
| 584 |
+
posemb_tok, posemb_grid = posemb[:, :num_tokens], posemb[0, num_tokens:]
|
| 585 |
+
ntok_new -= num_tokens
|
| 586 |
+
else:
|
| 587 |
+
posemb_tok, posemb_grid = posemb[:, :0], posemb[0]
|
| 588 |
+
gs_old = int(math.sqrt(len(posemb_grid)))
|
| 589 |
+
if not len(gs_new): # backwards compatibility
|
| 590 |
+
gs_new = [int(math.sqrt(ntok_new))] * 2
|
| 591 |
+
assert len(gs_new) >= 2
|
| 592 |
+
_logger.info('Position embedding grid-size from %s to %s', [gs_old, gs_old], gs_new)
|
| 593 |
+
posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
|
| 594 |
+
posemb_grid = F.interpolate(posemb_grid, size=gs_new, mode=mode, align_corners=False)
|
| 595 |
+
posemb_grid = posemb_grid.permute(0, 2, 3, 1).reshape(1, gs_new[0] * gs_new[1], -1)
|
| 596 |
+
posemb = torch.cat([posemb_tok, posemb_grid], dim=1)
|
| 597 |
+
return posemb
|
| 598 |
+
|
| 599 |
+
|
| 600 |
+
def adapt_image_pos_embed_to_passt(posemb, num_tokens=1, posemb_len=1000, mode='bicubic'):
|
| 601 |
+
# Rescale the grid of position embeddings when loading from state_dict. Adapted from
|
| 602 |
+
# https://github.com/google-research/vision_transformer/blob/00883dd691c63a6830751563748663526e811cee/vit_jax/checkpoint.py#L224
|
| 603 |
+
if num_tokens:
|
| 604 |
+
posemb_tok, posemb_grid = posemb[:, :num_tokens], posemb[0, num_tokens:]
|
| 605 |
+
else:
|
| 606 |
+
posemb_tok, posemb_grid = posemb[:, :0], posemb[0]
|
| 607 |
+
gs_old = int(math.sqrt(len(posemb_grid)))
|
| 608 |
+
posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
|
| 609 |
+
posemb_grid = F.interpolate(posemb_grid, size=(1, posemb_len), mode=mode, align_corners=False)
|
| 610 |
+
|
| 611 |
+
freq_new_pos_embed = posemb_grid.mean(dim=3, keepdim=True)
|
| 612 |
+
time_new_pos_embed = posemb_grid.mean(dim=2, keepdim=True)
|
| 613 |
+
_logger.info('New Position cls/dstl embedding %s', posemb_tok.shape)
|
| 614 |
+
_logger.info('New FREQ Position embedding %s', freq_new_pos_embed.shape)
|
| 615 |
+
_logger.info('New TIME Position embedding %s', time_new_pos_embed.shape)
|
| 616 |
+
return posemb_tok, freq_new_pos_embed, time_new_pos_embed
|
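# Shape sketch (added comment, not in the original file), assuming a ViT positional
# embedding posemb of shape [1, 1 + 24*24, 768] with one cls token and posemb_len=1000:
#   posemb_grid reshaped -> [1, 768, 24, 24], interpolated to [1, 768, 1, 1000]
#   freq_new_pos_embed = mean over the time axis -> [1, 768, 1, 1]
#   time_new_pos_embed = mean over the freq axis -> [1, 768, 1, 1000]
# i.e. the square image grid is converted into separate, broadcastable frequency and
# time positional embeddings matching the parameters of the PaSST model above.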
| 617 |
+
|
| 618 |
+
|
| 619 |
+
def checkpoint_filter_fn(state_dict, model):
|
| 620 |
+
""" convert patch embedding weight from manual patchify + linear proj to conv"""
|
| 621 |
+
out_dict = {}
|
| 622 |
+
if 'model' in state_dict:
|
| 623 |
+
# For deit models
|
| 624 |
+
state_dict = state_dict['model']
|
| 625 |
+
state_dict = {k: v for k, v in state_dict.items()}
|
| 626 |
+
if "time_new_pos_embed" not in state_dict:
|
| 627 |
+
# we are working with ImageNet model
|
| 628 |
+
_logger.info("Adapting pos embedding from ImageNet pretrained model to PaSST.")
|
| 629 |
+
v = state_dict.pop("pos_embed")
|
| 630 |
+
new_pos_embed, freq_new_pos_embed, time_new_pos_embed = adapt_image_pos_embed_to_passt(
|
| 631 |
+
v, getattr(model, 'num_tokens', 1), model.pos_embed_len)
|
| 632 |
+
state_dict["new_pos_embed"] = new_pos_embed
|
| 633 |
+
state_dict["freq_new_pos_embed"] = freq_new_pos_embed
|
| 634 |
+
state_dict["time_new_pos_embed"] = time_new_pos_embed
|
| 635 |
+
|
| 636 |
+
for k, v in state_dict.items():
|
| 637 |
+
if 'patch_embed.proj.weight' in k:
|
| 638 |
+
embed_dim, C, H, W = v.shape
|
| 639 |
+
v = adapt_input_conv(model.in_chans, v, input_conv_name=k)
|
| 640 |
+
k1, k2 = model.patch_embed.proj.kernel_size # 128, 2
|
| 641 |
+
|
| 642 |
+
# clever reshape
|
| 643 |
+
assert H * W == k1 * k2, "Error in the kernel size of the patch embedding"
|
| 644 |
+
|
| 645 |
+
v = v.reshape(embed_dim, model.in_chans, k1, k2) # [embed_dim, 1, k1, k2]
|
| 646 |
+
|
| 647 |
+
out_dict[k] = v
|
| 648 |
+
return out_dict
|
| 649 |
+
|
| 650 |
+
|
| 651 |
+
def _create_vision_transformer(variant, pretrained=False, default_cfg=None, **kwargs):
|
| 652 |
+
default_cfg = default_cfg or default_cfgs[variant]
|
| 653 |
+
if kwargs.get('features_only', None):
|
| 654 |
+
raise RuntimeError('features_only not implemented for Vision Transformer models.')
|
| 655 |
+
|
| 656 |
+
# NOTE this extra code to support handling of repr size for in21k pretrained models
|
| 657 |
+
default_num_classes = default_cfg['num_classes']
|
| 658 |
+
num_classes = kwargs.get('num_classes', default_num_classes)
|
| 659 |
+
repr_size = kwargs.pop('representation_size', None)
|
| 660 |
+
if repr_size is not None and num_classes != default_num_classes:
|
| 661 |
+
# Remove representation layer if fine-tuning. This may not always be the desired action,
|
| 662 |
+
# but I feel better than doing nothing by default for fine-tuning. Perhaps a better interface?
|
| 663 |
+
_logger.warning("Removing representation layer for fine-tuning.")
|
| 664 |
+
repr_size = None
|
| 665 |
+
|
| 666 |
+
model = build_model_with_cfg(
|
| 667 |
+
PaSST, variant, pretrained,
|
| 668 |
+
default_cfg=default_cfg,
|
| 669 |
+
representation_size=repr_size,
|
| 670 |
+
pretrained_filter_fn=checkpoint_filter_fn,
|
| 671 |
+
pretrained_custom_load='npz' in default_cfg['url'],
|
| 672 |
+
**kwargs)
|
| 673 |
+
return model
|
| 674 |
+
|
| 675 |
+
|
| 676 |
+
def vit_huge_patch14_224_in21k(pretrained=False, **kwargs):
|
| 677 |
+
""" ViT-Huge model (ViT-H/14) from original paper (https://arxiv.org/abs/2010.11929).
|
| 678 |
+
ImageNet-21k weights @ 224x224, source https://github.com/google-research/vision_transformer.
|
| 679 |
+
NOTE: this model has a representation layer but the 21k classifier head is zero'd out in original weights
|
| 680 |
+
"""
|
| 681 |
+
model_kwargs = dict(
|
| 682 |
+
patch_size=14, embed_dim=1280, depth=32, num_heads=16, representation_size=1280, **kwargs)
|
| 683 |
+
model = _create_vision_transformer('vit_huge_patch14_224_in21k', pretrained=pretrained, **model_kwargs)
|
| 684 |
+
return model
|
| 685 |
+
|
| 686 |
+
|
| 687 |
+
def deit_base_distilled_patch16_384(pretrained=False, **kwargs):
|
| 688 |
+
""" DeiT-base distilled model @ 384x384 from paper (https://arxiv.org/abs/2012.12877).
|
| 689 |
+
ImageNet-1k weights from https://github.com/facebookresearch/deit.
|
| 690 |
+
"""
|
| 691 |
+
|
| 692 |
+
model_kwargs = dict(embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 693 |
+
model = _create_vision_transformer(
|
| 694 |
+
'deit_base_distilled_patch16_384', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 695 |
+
return model
|
| 696 |
+
|
| 697 |
+
|
| 698 |
+
def passt_s_swa_p16_128_ap476(pretrained=False, **kwargs):
|
| 699 |
+
""" PaSST pre-trained on AudioSet
|
| 700 |
+
"""
|
| 701 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 10 structured patchout mAP=476 SWA \n\n")
|
| 702 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 703 |
+
if model_kwargs.get("stride") != (10, 10):
|
| 704 |
+
warnings.warn(
|
| 705 |
+
f"This model was pre-trained with strides {(10, 10)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 706 |
+
model = _create_vision_transformer(
|
| 707 |
+
'passt_s_swa_p16_128_ap476', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 708 |
+
return model
|
| 709 |
+
|
| 710 |
+
|
| 711 |
+
def passt_s_kd_p16_128_ap486(pretrained=False, **kwargs):
|
| 712 |
+
""" PaSST pre-trained on AudioSet
|
| 713 |
+
"""
|
| 714 |
+
print("\n\n Loading PaSST pre-trained on AudioSet (with KD) Patch 16 stride 10 structured patchout mAP=486 \n\n")
|
| 715 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 716 |
+
if model_kwargs.get("stride") != (10, 10):
|
| 717 |
+
warnings.warn(
|
| 718 |
+
f"This model was pre-trained with strides {(10, 10)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 719 |
+
model = _create_vision_transformer(
|
| 720 |
+
'passt_s_kd_p16_128_ap486', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 721 |
+
return model
|
| 722 |
+
|
| 723 |
+
|
| 724 |
+
def passt_l_kd_p16_128_ap47(pretrained=False, **kwargs):
|
| 725 |
+
""" PaSST pre-trained on AudioSet
|
| 726 |
+
"""
|
| 727 |
+
print(
|
| 728 |
+
"\n\n Loading PaSST-L (light, reduced depth=7) pre-trained on AudioSet (with KD) Patch 16 stride 10 structured patchout mAP=4708 \n\n")
|
| 729 |
+
model_kwargs = dict(patch_size=16, embed_dim=768,
|
| 730 |
+
depth=7, num_heads=12, **kwargs)
|
| 731 |
+
if model_kwargs.get("stride") != (10, 10):
|
| 732 |
+
warnings.warn(
|
| 733 |
+
f"This model was pre-trained with strides {(10, 10)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 734 |
+
model = _create_vision_transformer(
|
| 735 |
+
'passt_l_kd_p16_128_ap47', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 736 |
+
return model
|
| 737 |
+
|
| 738 |
+
|
| 739 |
+
def passt_s_swa_p16_128_ap4761(pretrained=False, **kwargs):
|
| 740 |
+
""" PaSST pre-trained on AudioSet
|
| 741 |
+
"""
|
| 742 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 10 structured patchout mAP=4763 SWA \n\n")
|
| 743 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 744 |
+
if model_kwargs.get("stride") != (10, 10):
|
| 745 |
+
warnings.warn(
|
| 746 |
+
f"This model was pre-trained with strides {(10, 10)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 747 |
+
model = _create_vision_transformer(
|
| 748 |
+
'passt_s_swa_p16_128_ap4761', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 749 |
+
return model
|
| 750 |
+
|
| 751 |
+
|
| 752 |
+
def passt_s_p16_128_ap472(pretrained=False, **kwargs):
|
| 753 |
+
""" PaSST pre-trained on AudioSet
|
| 754 |
+
"""
|
| 755 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 10 structured patchout mAP=472 \n\n")
|
| 756 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 757 |
+
if model_kwargs.get("stride") != (10, 10):
|
| 758 |
+
warnings.warn(
|
| 759 |
+
f"This model was pre-trained with strides {(10, 10)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 760 |
+
model = _create_vision_transformer(
|
| 761 |
+
'passt_s_p16_128_ap472', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 762 |
+
return model
|
| 763 |
+
|
| 764 |
+
|
| 765 |
+
def passt_s_p16_s12_128_ap470(pretrained=False, **kwargs):
|
| 766 |
+
""" PaSST pre-trained on AudioSet
|
| 767 |
+
"""
|
| 768 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 12 structured patchout mAP=472 \n\n")
|
| 769 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 770 |
+
if model_kwargs.get("stride") != (12, 12):
|
| 771 |
+
warnings.warn(
|
| 772 |
+
f"This model was pre-trained with strides {(12, 12)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 773 |
+
model = _create_vision_transformer(
|
| 774 |
+
'passt_s_p16_s12_128_ap470', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 775 |
+
return model
|
| 776 |
+
|
| 777 |
+
|
| 778 |
+
def passt_s_f128_20sec_p16_s10_ap474_swa(pretrained=False, **kwargs):
|
| 779 |
+
print("\n\n Loading PASST TRAINED ON AUDISET with 20 Second time encodings, with STFT hop of 160 \n\n")
|
| 780 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 781 |
+
model = _create_vision_transformer(
|
| 782 |
+
'passt-s-f128-20sec-p16-s10-ap474-swa', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 783 |
+
return model
|
| 784 |
+
|
| 785 |
+
|
| 786 |
+
def passt_s_f128_30sec_p16_s10_ap473_swa(pretrained=False, **kwargs):
|
| 787 |
+
print("\n\n Loading PASST TRAINED ON AUDISET with 30 Second time encodings, with STFT hop of 160 \n\n")
|
| 788 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 789 |
+
model = _create_vision_transformer(
|
| 790 |
+
'passt-s-f128-30sec-p16-s10-ap473-swa', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 791 |
+
return model
|
| 792 |
+
|
| 793 |
+
|
| 794 |
+
def passt_s_swa_p16_s12_128_ap473(pretrained=False, **kwargs):
|
| 795 |
+
""" PaSST pre-trained on AudioSet
|
| 796 |
+
"""
|
| 797 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 12 structured patchout mAP=472 \n\n")
|
| 798 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 799 |
+
if model_kwargs.get("stride") != (12, 12):
|
| 800 |
+
warnings.warn(
|
| 801 |
+
f"This model was pre-trained with strides {(12, 12)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 802 |
+
model = _create_vision_transformer(
|
| 803 |
+
'passt_s_swa_p16_s12_128_ap473', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 804 |
+
return model
|
| 805 |
+
|
| 806 |
+
|
| 807 |
+
def passt_s_p16_s14_128_ap469(pretrained=False, **kwargs):
|
| 808 |
+
""" PaSST pre-trained on AudioSet
|
| 809 |
+
"""
|
| 810 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 14 structured patchout mAP=472 \n\n")
|
| 811 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 812 |
+
if model_kwargs.get("stride") != (14, 14):
|
| 813 |
+
warnings.warn(
|
| 814 |
+
f"This model was pre-trained with strides {(14, 14)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 815 |
+
model = _create_vision_transformer(
|
| 816 |
+
'passt_s_p16_s14_128_ap469', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 817 |
+
return model
|
| 818 |
+
|
| 819 |
+
|
| 820 |
+
def passt_s_swa_p16_s14_128_ap471(pretrained=False, **kwargs):
|
| 821 |
+
""" PaSST pre-trained on AudioSet
|
| 822 |
+
"""
|
| 823 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 14 structured patchout mAP=472 \n\n")
|
| 824 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 825 |
+
if model_kwargs.get("stride") != (14, 14):
|
| 826 |
+
warnings.warn(
|
| 827 |
+
f"This model was pre-trained with strides {(14, 14)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 828 |
+
model = _create_vision_transformer(
|
| 829 |
+
'passt_s_swa_p16_s14_128_ap471', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 830 |
+
return model
|
| 831 |
+
|
| 832 |
+
|
| 833 |
+
def passt_s_swa_p16_s16_128_ap473(pretrained=False, **kwargs):
|
| 834 |
+
""" PaSST pre-trained on AudioSet
|
| 835 |
+
"""
|
| 836 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 16 structured patchout mAP=472 \n\n")
|
| 837 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 838 |
+
if model_kwargs.get("stride") != (16, 16):
|
| 839 |
+
warnings.warn(
|
| 840 |
+
f"This model was pre-trained with strides {(16, 16)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 841 |
+
model = _create_vision_transformer(
|
| 842 |
+
'passt_s_swa_p16_s16_128_ap473', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 843 |
+
return model
|
| 844 |
+
|
| 845 |
+
|
| 846 |
+
def passt_s_p16_s16_128_ap468(pretrained=False, **kwargs):
|
| 847 |
+
""" PaSST pre-trained on AudioSet
|
| 848 |
+
"""
|
| 849 |
+
print("\n\n Loading PaSST pre-trained on AudioSet Patch 16 stride 16 structured patchout mAP=472 \n\n")
|
| 850 |
+
model_kwargs = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, **kwargs)
|
| 851 |
+
if model_kwargs.get("stride") != (16, 16):
|
| 852 |
+
warnings.warn(
|
| 853 |
+
f"This model was pre-trained with strides {(16, 16)}, but now you set (fstride,tstride) to {model_kwargs.get('stride')}.")
|
| 854 |
+
model = _create_vision_transformer(
|
| 855 |
+
'passt_s_p16_s16_128_ap468', pretrained=pretrained, distilled=True, **model_kwargs)
|
| 856 |
+
return model
|
| 857 |
+
|
| 858 |
+
|
| 859 |
+
def fix_embedding_layer(model, embed="default"):
|
| 860 |
+
if embed == "default":
|
| 861 |
+
return model
|
| 862 |
+
if embed == "overlap":
|
| 863 |
+
model.patch_embed = PatchEmbedAdaptiveMean(replace=model.patch_embed)
|
| 864 |
+
if embed == "am_keepconv":
|
| 865 |
+
model.patch_embed = PatchEmbedAdaptiveMeanKeepConv(replace=model.patch_embed)
|
| 866 |
+
return model
|
| 867 |
+
|
| 868 |
+
|
| 869 |
+
def lighten_model(model, cut_depth=0):
|
| 870 |
+
if cut_depth == 0:
|
| 871 |
+
return model
|
| 872 |
+
if cut_depth:
|
| 873 |
+
if cut_depth < 0:
|
| 874 |
+
print(f"\n Reducing model depth by removing every {-cut_depth} layer \n\n")
|
| 875 |
+
else:
|
| 876 |
+
print(f"\n Reducing model depth by {cut_depth} \n\n")
|
| 877 |
+
if len(model.blocks) < cut_depth + 2:
|
| 878 |
+
raise ValueError(f"Cut depth a VIT with {len(model.blocks)} "
|
| 879 |
+
f"layers should be between 1 and {len(model.blocks) - 2}")
|
| 880 |
+
print(f"\n Before Cutting it was {len(model.blocks)} \n\n")
|
| 881 |
+
|
| 882 |
+
old_blocks = list(model.blocks.children())
|
| 883 |
+
if cut_depth < 0:
|
| 884 |
+
print(f"cut_depth={cut_depth}")
|
| 885 |
+
old_blocks = [old_blocks[0]] + old_blocks[1:-1:-cut_depth] + [old_blocks[-1]]
|
| 886 |
+
else:
|
| 887 |
+
old_blocks = [old_blocks[0]] + old_blocks[cut_depth + 1:]
|
| 888 |
+
model.blocks = nn.Sequential(*old_blocks)
|
| 889 |
+
print(f"\n Atfer Cutting it is {len(model.blocks)} \n\n")
|
| 890 |
+
return model
|
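To make the two `cut_depth` modes above concrete, here is an illustrative sketch (the 12-block ViT is an assumption): a positive value removes that many blocks right after the first one, while a negative value keeps only every `-cut_depth`-th of the middle blocks.

```python
# Illustrative only: which block indices survive lighten_model for a 12-block ViT
blocks = list(range(12))

# cut_depth = 3 -> keep block 0, then skip blocks 1-3
kept_positive = [blocks[0]] + blocks[3 + 1:]                  # [0, 4, 5, 6, 7, 8, 9, 10, 11]

# cut_depth = -2 -> keep block 0, every 2nd middle block, and the last block
kept_negative = [blocks[0]] + blocks[1:-1:2] + [blocks[-1]]   # [0, 1, 3, 5, 7, 9, 11]
```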
| 891 |
+
|
| 892 |
+
|
| 893 |
+
def get_model(arch="passt_s_kd_p16_128_ap486", pretrained=True, n_classes=527, in_channels=1,
|
| 894 |
+
input_fdim=128, input_tdim=998, frame_patchout=300, pos_embed_length=1000
|
| 895 |
+
):
|
| 896 |
+
"""
|
| 897 |
+
:param arch: Base ViT or Deit architecture
|
| 898 |
+
:param pretrained: use pretrained model on imagenet
|
| 899 |
+
:param n_classes: number of classes
|
| 900 |
+
:param in_channels: number of input channels: 1 for mono
|
| 901 |
+
:param input_fdim: the expected input frequency bins.
|
| 902 |
+
:param input_tdim: the expected input time bins.
|
| 903 |
+
:param frame_patchout: the number of frames to be removed from the input
|
| 904 |
+
:param pos_embed_length: length of the positional embedding along the time axis
|
| 905 |
+
:return:
|
| 906 |
+
|
| 907 |
+
"""
|
| 908 |
+
model_func = None
|
| 909 |
+
input_size = (input_fdim, input_tdim)
|
| 910 |
+
if arch == "passt_deit_bd_p16_384": # base deit
|
| 911 |
+
model_func = deit_base_distilled_patch16_384
|
| 912 |
+
elif arch == "passt_s_kd_p16_128_ap486": # pretrained
|
| 913 |
+
model_func = passt_s_kd_p16_128_ap486
|
| 914 |
+
elif arch == "passt_l_kd_p16_128_ap47": # pretrained passt-L
|
| 915 |
+
model_func = passt_l_kd_p16_128_ap47
|
| 916 |
+
elif arch == "passt_s_swa_p16_128_ap476": # pretrained
|
| 917 |
+
model_func = passt_s_swa_p16_128_ap476
|
| 918 |
+
elif arch == "passt_s_swa_p16_128_ap4761":
|
| 919 |
+
model_func = passt_s_swa_p16_128_ap4761
|
| 920 |
+
elif arch == "passt_s_p16_128_ap472":
|
| 921 |
+
model_func = passt_s_p16_128_ap472
|
| 922 |
+
elif arch == "passt_s_p16_s16_128_ap468":
|
| 923 |
+
model_func = passt_s_p16_s16_128_ap468
|
| 924 |
+
elif arch == "passt_s_swa_p16_s16_128_ap473":
|
| 925 |
+
model_func = passt_s_swa_p16_s16_128_ap473
|
| 926 |
+
elif arch == "passt_s_swa_p16_s14_128_ap471":
|
| 927 |
+
model_func = passt_s_swa_p16_s14_128_ap471
|
| 928 |
+
elif arch == "passt_s_p16_s14_128_ap469":
|
| 929 |
+
model_func = passt_s_p16_s14_128_ap469
|
| 930 |
+
elif arch == "passt_s_swa_p16_s12_128_ap473":
|
| 931 |
+
model_func = passt_s_swa_p16_s12_128_ap473
|
| 932 |
+
elif arch == "passt_s_p16_s12_128_ap470":
|
| 933 |
+
model_func = passt_s_p16_s12_128_ap470
|
| 934 |
+
elif arch == "passt_s_f128_20sec_p16_s10_ap474":
|
| 935 |
+
model_func = passt_s_f128_20sec_p16_s10_ap474_swa
|
| 936 |
+
elif arch == "passt_s_f128_30sec_p16_s10_ap473":
|
| 937 |
+
model_func = passt_s_f128_30sec_p16_s10_ap473_swa
|
| 938 |
+
|
| 939 |
+
if model_func is None:
|
| 940 |
+
raise RuntimeError(f"Unknown model {arch}")
|
| 941 |
+
model = model_func(pretrained=pretrained, num_classes=n_classes, in_chans=in_channels,
|
| 942 |
+
img_size=input_size, frame_patchout=frame_patchout, pos_embed_length=pos_embed_length)
|
| 943 |
+
model = fix_embedding_layer(model)
|
| 944 |
+
model = lighten_model(model)
|
| 945 |
+
return model
|
| 946 |
+
|
| 947 |
+
|
| 948 |
+
class EnsembelerModel(nn.Module):
|
| 949 |
+
def __init__(self, models):
|
| 950 |
+
super(EnsembelerModel, self).__init__()
|
| 951 |
+
self.models = nn.ModuleList(models)
|
| 952 |
+
|
| 953 |
+
def forward(self, x):
|
| 954 |
+
# ModuleList can act as an iterable, or be indexed using ints
|
| 955 |
+
all_out = None
|
| 956 |
+
for i, m in enumerate(self.models):
|
| 957 |
+
out, _ = m(x)
|
| 958 |
+
if all_out is None:
|
| 959 |
+
all_out = out
|
| 960 |
+
else:
|
| 961 |
+
all_out = out + all_out
|
| 962 |
+
all_out = all_out / len(self.models)
|
| 963 |
+
return all_out, all_out
|
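For orientation, a minimal usage sketch of the factory defined above; the argument values are illustrative defaults rather than settings mandated by this repository.

```python
from models.frame_passt.fpasst import get_model, lighten_model

# Frame-level PaSST with AudioSet pretraining (527 classes), 128 mel bins, ~10 s of frames (assumed).
model = get_model(arch="passt_s_kd_p16_128_ap486", pretrained=True,
                  n_classes=527, in_channels=1, input_fdim=128, input_tdim=998)

# Optionally drop the two transformer blocks after the first one for a lighter model.
model = lighten_model(model, cut_depth=2)
```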
models/frame_passt/fpasst_wrapper.py
ADDED
|
@@ -0,0 +1,86 @@
|
| 1 |
+
from models.frame_passt.fpasst import get_model
|
| 2 |
+
from models.frame_passt.preprocess import AugmentMelSTFT
|
| 3 |
+
from models.transformer_wrapper import BaseModelWrapper
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
class FPaSSTWrapper(BaseModelWrapper):
|
| 7 |
+
def __init__(self):
|
| 8 |
+
super().__init__()
|
| 9 |
+
self.mel = AugmentMelSTFT(
|
| 10 |
+
n_mels=128,
|
| 11 |
+
sr=16_000,
|
| 12 |
+
win_length=400,
|
| 13 |
+
hopsize=160,
|
| 14 |
+
n_fft=512,
|
| 15 |
+
freqm=0,
|
| 16 |
+
timem=0,
|
| 17 |
+
htk=False,
|
| 18 |
+
fmin=0.0,
|
| 19 |
+
fmax=None,
|
| 20 |
+
norm=1,
|
| 21 |
+
fmin_aug_range=10,
|
| 22 |
+
fmax_aug_range=2000,
|
| 23 |
+
fast_norm=True,
|
| 24 |
+
preamp=True,
|
| 25 |
+
)
|
| 26 |
+
self.fpasst = get_model(
|
| 27 |
+
arch="passt_deit_bd_p16_384",
|
| 28 |
+
n_classes=527,
|
| 29 |
+
pos_embed_length=250,
|
| 30 |
+
frame_patchout=0,
|
| 31 |
+
in_channels=16
|
| 32 |
+
)
|
| 33 |
+
|
| 34 |
+
def mel_forward(self, x):
|
| 35 |
+
return self.mel(x)
|
| 36 |
+
|
| 37 |
+
def forward(self, x):
|
| 38 |
+
return self.fpasst(x)
|
| 39 |
+
|
| 40 |
+
def separate_params(self):
|
| 41 |
+
pt_params = [[], [], [], [], [], [], [], [], [], [], [], []]
|
| 42 |
+
for k, p in self.fpasst.named_parameters():
|
| 43 |
+
if k in ['cls_token',
|
| 44 |
+
'dist_token',
|
| 45 |
+
'new_pos_embed',
|
| 46 |
+
'freq_new_pos_embed',
|
| 47 |
+
'time_new_pos_embed',
|
| 48 |
+
'conv_in_1.weight',
|
| 49 |
+
'conv_in_1.bias',
|
| 50 |
+
'conv_in_2.weight',
|
| 51 |
+
'conv_in_2.bias',
|
| 52 |
+
'conv_in_3.weight',
|
| 53 |
+
'conv_in_3.bias',
|
| 54 |
+
'patch_embed.proj.weight',
|
| 55 |
+
'patch_embed.proj.bias',
|
| 56 |
+
]:
|
| 57 |
+
pt_params[0].append(p)
|
| 58 |
+
elif 'blocks.0.' in k:
|
| 59 |
+
pt_params[0].append(p)
|
| 60 |
+
elif 'blocks.1.' in k:
|
| 61 |
+
pt_params[1].append(p)
|
| 62 |
+
elif 'blocks.2.' in k:
|
| 63 |
+
pt_params[2].append(p)
|
| 64 |
+
elif 'blocks.3.' in k:
|
| 65 |
+
pt_params[3].append(p)
|
| 66 |
+
elif 'blocks.4.' in k:
|
| 67 |
+
pt_params[4].append(p)
|
| 68 |
+
elif 'blocks.5.' in k:
|
| 69 |
+
pt_params[5].append(p)
|
| 70 |
+
elif 'blocks.6.' in k:
|
| 71 |
+
pt_params[6].append(p)
|
| 72 |
+
elif 'blocks.7.' in k:
|
| 73 |
+
pt_params[7].append(p)
|
| 74 |
+
elif 'blocks.8.' in k:
|
| 75 |
+
pt_params[8].append(p)
|
| 76 |
+
elif 'blocks.9.' in k:
|
| 77 |
+
pt_params[9].append(p)
|
| 78 |
+
elif 'blocks.10.' in k:
|
| 79 |
+
pt_params[10].append(p)
|
| 80 |
+
elif 'blocks.11.' in k:
|
| 81 |
+
pt_params[11].append(p)
|
| 82 |
+
elif k in ['norm.weight', 'norm.bias']:
|
| 83 |
+
pt_params[11].append(p)
|
| 84 |
+
else:
|
| 85 |
+
raise ValueError(f"Check separate params for frame-passt! Unexpected key: {k}")
|
| 86 |
+
return list(reversed(pt_params))
|
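`separate_params` above returns twelve parameter groups ordered from the last transformer block (plus final norm) down to the patch embedding and first block, the layout commonly used for layer-wise learning-rate decay. A possible way to consume it; the base learning rate and decay factor are assumed values, not settings taken from this repository.

```python
import torch
from models.frame_passt.fpasst_wrapper import FPaSSTWrapper

wrapper = FPaSSTWrapper()
groups = wrapper.separate_params()   # groups[0] = last block + norm, groups[-1] = embeddings + block 0

base_lr, layer_decay = 1e-4, 0.75    # assumed values
optimizer = torch.optim.AdamW(
    [{"params": p, "lr": base_lr * (layer_decay ** i)} for i, p in enumerate(groups)],
    weight_decay=1e-3,
)
```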
models/frame_passt/preprocess.py
ADDED
|
@@ -0,0 +1,147 @@
|
| 1 |
+
import torch
|
| 2 |
+
import torch.nn as nn
|
| 3 |
+
import torchaudio
|
| 4 |
+
|
| 5 |
+
sz_float = 4 # size of a float
|
| 6 |
+
epsilon = 10e-8 # fudge factor for normalization
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
class AugmentMelSTFT(nn.Module):
|
| 10 |
+
def __init__(
|
| 11 |
+
self,
|
| 12 |
+
n_mels=128,
|
| 13 |
+
sr=32000,
|
| 14 |
+
win_length=None,
|
| 15 |
+
hopsize=320,
|
| 16 |
+
n_fft=1024,
|
| 17 |
+
freqm=0,
|
| 18 |
+
timem=0,
|
| 19 |
+
htk=False,
|
| 20 |
+
fmin=0.0,
|
| 21 |
+
fmax=None,
|
| 22 |
+
norm=1,
|
| 23 |
+
fmin_aug_range=1,
|
| 24 |
+
fmax_aug_range=1,
|
| 25 |
+
fast_norm=False,
|
| 26 |
+
preamp=True,
|
| 27 |
+
padding="center",
|
| 28 |
+
periodic_window=True,
|
| 29 |
+
):
|
| 30 |
+
torch.nn.Module.__init__(self)
|
| 31 |
+
# adapted from: https://github.com/CPJKU/kagglebirds2020/commit/70f8308b39011b09d41eb0f4ace5aa7d2b0e806e
|
| 32 |
+
# Similar config to the spectrograms used in AST: https://github.com/YuanGongND/ast
|
| 33 |
+
|
| 34 |
+
if win_length is None:
|
| 35 |
+
win_length = n_fft
|
| 36 |
+
|
| 37 |
+
if isinstance(win_length, list) or isinstance(win_length, tuple):
|
| 38 |
+
assert isinstance(n_fft, list) or isinstance(n_fft, tuple)
|
| 39 |
+
assert len(win_length) == len(n_fft)
|
| 40 |
+
else:
|
| 41 |
+
win_length = [win_length]
|
| 42 |
+
n_fft = [n_fft]
|
| 43 |
+
|
| 44 |
+
self.win_length = win_length
|
| 45 |
+
self.n_mels = n_mels
|
| 46 |
+
self.n_fft = n_fft
|
| 47 |
+
self.sr = sr
|
| 48 |
+
self.htk = htk
|
| 49 |
+
self.fmin = fmin
|
| 50 |
+
if fmax is None:
|
| 51 |
+
fmax = sr // 2 - fmax_aug_range // 2
|
| 52 |
+
self.fmax = fmax
|
| 53 |
+
self.norm = norm
|
| 54 |
+
self.hopsize = hopsize
|
| 55 |
+
self.preamp = preamp
|
| 56 |
+
for win_l in self.win_length:
|
| 57 |
+
self.register_buffer(
|
| 58 |
+
f"window_{win_l}",
|
| 59 |
+
torch.hann_window(win_l, periodic=periodic_window),
|
| 60 |
+
persistent=False,
|
| 61 |
+
)
|
| 62 |
+
assert (
|
| 63 |
+
fmin_aug_range >= 1
|
| 64 |
+
), f"fmin_aug_range={fmin_aug_range} should be >=1; 1 means no augmentation"
|
| 65 |
+
assert (
|
| 66 |
+
fmax_aug_range >= 1
|
| 67 |
+
), f"fmax_aug_range={fmax_aug_range} should be >=1; 1 means no augmentation"
|
| 68 |
+
self.fmin_aug_range = fmin_aug_range
|
| 69 |
+
self.fmax_aug_range = fmax_aug_range
|
| 70 |
+
|
| 71 |
+
self.register_buffer(
|
| 72 |
+
"preemphasis_coefficient", torch.as_tensor([[[-0.97, 1]]]), persistent=False
|
| 73 |
+
)
|
| 74 |
+
if freqm == 0:
|
| 75 |
+
self.freqm = torch.nn.Identity()
|
| 76 |
+
else:
|
| 77 |
+
self.freqm = torchaudio.transforms.FrequencyMasking(freqm, iid_masks=False)
|
| 78 |
+
if timem == 0:
|
| 79 |
+
self.timem = torch.nn.Identity()
|
| 80 |
+
else:
|
| 81 |
+
self.timem = torchaudio.transforms.TimeMasking(timem, iid_masks=False)
|
| 82 |
+
self.fast_norm = fast_norm
|
| 83 |
+
self.padding = padding
|
| 84 |
+
if padding not in ["center", "same"]:
|
| 85 |
+
raise ValueError("Padding must be 'center' or 'same'.")
|
| 86 |
+
self.iden = nn.Identity()
|
| 87 |
+
|
| 88 |
+
def forward(self, x):
|
| 89 |
+
if self.preamp:
|
| 90 |
+
x = nn.functional.conv1d(x.unsqueeze(1), self.preemphasis_coefficient)
|
| 91 |
+
x = x.squeeze(1)
|
| 92 |
+
|
| 93 |
+
fmin = self.fmin + torch.randint(self.fmin_aug_range, (1,)).item()
|
| 94 |
+
fmax = self.fmax + self.fmax_aug_range // 2 - torch.randint(self.fmax_aug_range, (1,)).item()
|
| 95 |
+
|
| 96 |
+
# don't augment eval data
|
| 97 |
+
if not self.training:
|
| 98 |
+
fmin = self.fmin
|
| 99 |
+
fmax = self.fmax
|
| 100 |
+
|
| 101 |
+
mels = []
|
| 102 |
+
for n_fft, win_length in zip(self.n_fft, self.win_length):
|
| 103 |
+
x_temp = x
|
| 104 |
+
if self.padding == "same":
|
| 105 |
+
pad = win_length - self.hopsize
|
| 106 |
+
self.iden(x_temp) # printing
|
| 107 |
+
x_temp = torch.nn.functional.pad(x_temp, (pad // 2, pad // 2), mode="reflect")
|
| 108 |
+
self.iden(x_temp) # printing
|
| 109 |
+
|
| 110 |
+
x_temp = torch.stft(
|
| 111 |
+
x_temp,
|
| 112 |
+
n_fft,
|
| 113 |
+
hop_length=self.hopsize,
|
| 114 |
+
win_length=win_length,
|
| 115 |
+
center=self.padding == "center",
|
| 116 |
+
normalized=False,
|
| 117 |
+
window=getattr(self, f"window_{win_length}"),
|
| 118 |
+
return_complex=True
|
| 119 |
+
)
|
| 120 |
+
x_temp = torch.view_as_real(x_temp)
|
| 121 |
+
x_temp = (x_temp ** 2).sum(dim=-1) # power mag
|
| 122 |
+
|
| 123 |
+
mel_basis, _ = torchaudio.compliance.kaldi.get_mel_banks(self.n_mels, n_fft, self.sr,
|
| 124 |
+
fmin, fmax, vtln_low=100.0, vtln_high=-500.,
|
| 125 |
+
vtln_warp_factor=1.0)
|
| 126 |
+
mel_basis = torch.as_tensor(torch.nn.functional.pad(mel_basis, (0, 1), mode='constant', value=0),
|
| 127 |
+
device=x.device)
|
| 128 |
+
|
| 129 |
+
with torch.cuda.amp.autocast(enabled=False):
|
| 130 |
+
x_temp = torch.matmul(mel_basis, x_temp)
|
| 131 |
+
|
| 132 |
+
x_temp = torch.log(torch.clip(x_temp, min=1e-7))
|
| 133 |
+
|
| 134 |
+
mels.append(x_temp)
|
| 135 |
+
|
| 136 |
+
mels = torch.stack(mels, dim=1)
|
| 137 |
+
|
| 138 |
+
if self.training:
|
| 139 |
+
mels = self.freqm(mels)
|
| 140 |
+
mels = self.timem(mels)
|
| 141 |
+
if self.fast_norm:
|
| 142 |
+
mels = (mels + 4.5) / 5.0 # fast normalization
|
| 143 |
+
|
| 144 |
+
return mels
|
| 145 |
+
|
| 146 |
+
def extra_repr(self):
|
| 147 |
+
return "winsize={}, hopsize={}".format(self.win_length, self.hopsize)
|
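As a stand-alone sanity check, the front end above can be applied directly to a batch of waveforms; the 16 kHz settings mirror the configuration used in FPaSSTWrapper, and the random one-second batch is only for illustration.

```python
import torch
from models.frame_passt.preprocess import AugmentMelSTFT

mel = AugmentMelSTFT(n_mels=128, sr=16_000, win_length=400, hopsize=160, n_fft=512,
                     fast_norm=True, preamp=True)
mel.eval()  # eval mode: no SpecAugment masking, no fmin/fmax jitter

waveform = torch.randn(2, 16_000)        # (batch, samples): two one-second clips at 16 kHz (assumed)
with torch.no_grad():
    spec = mel(waveform)                 # (batch, 1, n_mels, frames)
print(spec.shape)
```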
models/frame_passt/vit_helpers.py
ADDED
|
@@ -0,0 +1,399 @@
|
| 1 |
+
"""
|
| 2 |
+
Adapted from https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
|
| 3 |
+
Credit to @leo19941227 for remove timm dependencies here : https://github.com/s3prl/passt_hear21/blob/48a0dc1b824641ca59884ced53f5b86053fed141/hear21passt/models/helpers/vit_helpers.py
|
| 4 |
+
|
| 5 |
+
"""
|
| 6 |
+
import math
|
| 7 |
+
import logging
|
| 8 |
+
import warnings
|
| 9 |
+
from copy import deepcopy
|
| 10 |
+
|
| 11 |
+
import torch
|
| 12 |
+
from torch import nn
|
| 13 |
+
|
| 14 |
+
from timm.models._hub import download_cached_file
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
# Global variables for rarely used pretrained checkpoint download progress and hash check.
|
| 18 |
+
# Use set_pretrained_download_progress / set_pretrained_check_hash functions to toggle.
|
| 19 |
+
_DOWNLOAD_PROGRESS = True
|
| 20 |
+
_CHECK_HASH = False
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
_logger = logging.getLogger(__name__)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def adapt_input_conv(in_chans, conv_weight, input_conv_name="(name not given)"):
|
| 27 |
+
conv_type = conv_weight.dtype
|
| 28 |
+
conv_weight = (
|
| 29 |
+
conv_weight.float()
|
| 30 |
+
) # Some weights are in torch.half, ensure it's float for sum on CPU
|
| 31 |
+
O, I, J, K = conv_weight.shape
|
| 32 |
+
if in_chans == 1:
|
| 33 |
+
print(f"adapt_input_conv: Converted from {I} to 1 channel")
|
| 34 |
+
if I > 3:
|
| 35 |
+
assert conv_weight.shape[1] % 3 == 0
|
| 36 |
+
# For models with space2depth stems
|
| 37 |
+
conv_weight = conv_weight.reshape(O, I // 3, 3, J, K)
|
| 38 |
+
conv_weight = conv_weight.sum(dim=2, keepdim=False)
|
| 39 |
+
else:
|
| 40 |
+
conv_weight = conv_weight.sum(dim=1, keepdim=True)
|
| 41 |
+
elif in_chans != 3:
|
| 42 |
+
if I != 3:
|
| 43 |
+
# loading a model pretrained on AudioSet for the downstream-task
|
| 44 |
+
if I == in_chans:
|
| 45 |
+
print(f"adapt_input_conv: Loading pretrained weights for {input_conv_name}, "
|
| 46 |
+
f"Assuming same input-conv and proj-conv configuration (1:1).")
|
| 47 |
+
pass
|
| 48 |
+
else:
|
| 49 |
+
print(f"adapt_input_conv: Converted input conv {input_conv_name} weights from 3 to {in_chans} channel(s)")
|
| 50 |
+
# NOTE this strategy should be better than random init, but there could be other combinations of
|
| 51 |
+
# the original RGB input layer weights that'd work better for specific cases.
|
| 52 |
+
repeat = int(math.ceil(in_chans / 3))
|
| 53 |
+
conv_weight = conv_weight.repeat(1, repeat, 1, 1)[:, :in_chans, :, :]
|
| 54 |
+
conv_weight *= 3 / float(in_chans)
|
| 55 |
+
conv_weight = conv_weight.to(conv_type)
|
| 56 |
+
return conv_weight
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def load_pretrained(
|
| 60 |
+
model,
|
| 61 |
+
default_cfg=None,
|
| 62 |
+
num_classes=1000,
|
| 63 |
+
in_chans=3,
|
| 64 |
+
filter_fn=None,
|
| 65 |
+
strict=True,
|
| 66 |
+
progress=False,
|
| 67 |
+
):
|
| 68 |
+
"""Load pretrained checkpoint
|
| 69 |
+
|
| 70 |
+
Args:
|
| 71 |
+
model (nn.Module) : PyTorch model module
|
| 72 |
+
default_cfg (Optional[Dict]): default configuration for pretrained weights / target dataset
|
| 73 |
+
num_classes (int): num_classes for model
|
| 74 |
+
in_chans (int): in_chans for model
|
| 75 |
+
filter_fn (Optional[Callable]): state_dict filter fn for load (takes state_dict, model as args)
|
| 76 |
+
strict (bool): strict load of checkpoint
|
| 77 |
+
progress (bool): enable progress bar for weight download
|
| 78 |
+
|
| 79 |
+
"""
|
| 80 |
+
default_cfg = default_cfg or getattr(model, "default_cfg", None) or {}
|
| 81 |
+
pretrained_url = default_cfg.get("url", None)
|
| 82 |
+
|
| 83 |
+
if not pretrained_url:
|
| 84 |
+
_logger.warning(
|
| 85 |
+
"No pretrained weights exist for this model. Using random initialization."
|
| 86 |
+
)
|
| 87 |
+
return
|
| 88 |
+
|
| 89 |
+
_logger.info(f"Loading pretrained weights from url ({pretrained_url})")
|
| 90 |
+
pretrained_loc = download_cached_file(
|
| 91 |
+
pretrained_url,
|
| 92 |
+
check_hash=_CHECK_HASH,
|
| 93 |
+
progress=_DOWNLOAD_PROGRESS,
|
| 94 |
+
)
|
| 95 |
+
|
| 96 |
+
state_dict = torch.load(pretrained_loc, map_location="cpu")
|
| 97 |
+
|
| 98 |
+
if filter_fn is not None:
|
| 99 |
+
# for backwards compat with filter fns that take one arg: try one arg first, then (state_dict, model)
|
| 100 |
+
try:
|
| 101 |
+
state_dict = filter_fn(state_dict)
|
| 102 |
+
except TypeError:
|
| 103 |
+
state_dict = filter_fn(state_dict, model)
|
| 104 |
+
|
| 105 |
+
input_convs = default_cfg.get("first_conv", None)
|
| 106 |
+
if input_convs is not None and in_chans != 3:
|
| 107 |
+
if isinstance(input_convs, str):
|
| 108 |
+
input_convs = (input_convs,)
|
| 109 |
+
for input_conv_name in input_convs:
|
| 110 |
+
weight_name = input_conv_name + ".weight"
|
| 111 |
+
try:
|
| 112 |
+
state_dict[weight_name] = adapt_input_conv(
|
| 113 |
+
in_chans, state_dict[weight_name], input_conv_name
|
| 114 |
+
)
|
| 115 |
+
# _logger.info(
|
| 116 |
+
# f"Converted input conv {input_conv_name} pretrained weights from 3 to {in_chans} channel(s)"
|
| 117 |
+
# )
|
| 118 |
+
except (NotImplementedError, KeyError) as e:
|
| 119 |
+
if weight_name in state_dict:
|
| 120 |
+
del state_dict[weight_name]
|
| 121 |
+
strict = False
|
| 122 |
+
_logger.warning(
|
| 123 |
+
f"Unable to convert pretrained {input_conv_name} weights, using random init for this layer."
|
| 124 |
+
)
|
| 125 |
+
|
| 126 |
+
classifiers = default_cfg.get("classifier", None)
|
| 127 |
+
label_offset = default_cfg.get("label_offset", 0)
|
| 128 |
+
if classifiers is not None:
|
| 129 |
+
if isinstance(classifiers, str):
|
| 130 |
+
classifiers = (classifiers,)
|
| 131 |
+
if num_classes != default_cfg["num_classes"]:
|
| 132 |
+
for classifier_name in classifiers:
|
| 133 |
+
# completely discard fully connected if model num_classes doesn't match pretrained weights
|
| 134 |
+
del state_dict[classifier_name + ".weight"]
|
| 135 |
+
del state_dict[classifier_name + ".bias"]
|
| 136 |
+
strict = False
|
| 137 |
+
elif label_offset > 0:
|
| 138 |
+
for classifier_name in classifiers:
|
| 139 |
+
# special case for pretrained weights with an extra background class in pretrained weights
|
| 140 |
+
classifier_weight = state_dict[classifier_name + ".weight"]
|
| 141 |
+
state_dict[classifier_name + ".weight"] = classifier_weight[
|
| 142 |
+
label_offset:
|
| 143 |
+
]
|
| 144 |
+
classifier_bias = state_dict[classifier_name + ".bias"]
|
| 145 |
+
state_dict[classifier_name + ".bias"] = classifier_bias[label_offset:]
|
| 146 |
+
|
| 147 |
+
model.load_state_dict(state_dict, strict=strict)
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
def overlay_external_default_cfg(default_cfg, kwargs):
|
| 151 |
+
"""Overlay 'external_default_cfg' in kwargs on top of default_cfg arg."""
|
| 152 |
+
external_default_cfg = kwargs.pop("external_default_cfg", None)
|
| 153 |
+
if external_default_cfg:
|
| 154 |
+
default_cfg.pop("url", None) # url should come from external cfg
|
| 155 |
+
default_cfg.pop("hf_hub", None) # hf hub id should come from external cfg
|
| 156 |
+
default_cfg.update(external_default_cfg)
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
def filter_kwargs(kwargs, names):
|
| 160 |
+
if not kwargs or not names:
|
| 161 |
+
return
|
| 162 |
+
for n in names:
|
| 163 |
+
kwargs.pop(n, None)
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def set_default_kwargs(kwargs, names, default_cfg):
|
| 167 |
+
for n in names:
|
| 168 |
+
# for legacy reasons, model __init__args uses img_size + in_chans as separate args while
|
| 169 |
+
# default_cfg has one input_size=(C, H ,W) entry
|
| 170 |
+
if n == "img_size":
|
| 171 |
+
input_size = default_cfg.get("input_size", None)
|
| 172 |
+
if input_size is not None:
|
| 173 |
+
assert len(input_size) == 3
|
| 174 |
+
kwargs.setdefault(n, input_size[-2:])
|
| 175 |
+
elif n == "in_chans":
|
| 176 |
+
input_size = default_cfg.get("input_size", None)
|
| 177 |
+
if input_size is not None:
|
| 178 |
+
assert len(input_size) == 3
|
| 179 |
+
kwargs.setdefault(n, input_size[0])
|
| 180 |
+
else:
|
| 181 |
+
default_val = default_cfg.get(n, None)
|
| 182 |
+
if default_val is not None:
|
| 183 |
+
kwargs.setdefault(n, default_cfg[n])
|
| 184 |
+
|
| 185 |
+
|
| 186 |
+
def update_default_cfg_and_kwargs(default_cfg, kwargs, kwargs_filter):
|
| 187 |
+
"""Update the default_cfg and kwargs before passing to model
|
| 188 |
+
|
| 189 |
+
FIXME this sequence of overlay default_cfg, set default kwargs, filter kwargs
|
| 190 |
+
could/should be replaced by an improved configuration mechanism
|
| 191 |
+
|
| 192 |
+
Args:
|
| 193 |
+
default_cfg: input default_cfg (updated in-place)
|
| 194 |
+
kwargs: keyword args passed to model build fn (updated in-place)
|
| 195 |
+
kwargs_filter: keyword arg keys that must be removed before model __init__
|
| 196 |
+
"""
|
| 197 |
+
# Overlay default cfg values from `external_default_cfg` if it exists in kwargs
|
| 198 |
+
overlay_external_default_cfg(default_cfg, kwargs)
|
| 199 |
+
# Set model __init__ args that can be determined by default_cfg (if not already passed as kwargs)
|
| 200 |
+
default_kwarg_names = ("num_classes", "global_pool", "in_chans")
|
| 201 |
+
if default_cfg.get("fixed_input_size", False):
|
| 202 |
+
# if fixed_input_size exists and is True, model takes an img_size arg that fixes its input size
|
| 203 |
+
default_kwarg_names += ("img_size",)
|
| 204 |
+
set_default_kwargs(kwargs, names=default_kwarg_names, default_cfg=default_cfg)
|
| 205 |
+
# Filter keyword args for task specific model variants (some 'features only' models, etc.)
|
| 206 |
+
filter_kwargs(kwargs, names=kwargs_filter)
|
| 207 |
+
|
| 208 |
+
|
| 209 |
+
def drop_path(x, drop_prob: float = 0.0, training: bool = False):
|
| 210 |
+
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
|
| 211 |
+
|
| 212 |
+
This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
|
| 213 |
+
the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
|
| 214 |
+
See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
|
| 215 |
+
changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
|
| 216 |
+
'survival rate' as the argument.
|
| 217 |
+
|
| 218 |
+
"""
|
| 219 |
+
if drop_prob == 0.0 or not training:
|
| 220 |
+
return x
|
| 221 |
+
keep_prob = 1 - drop_prob
|
| 222 |
+
shape = (x.shape[0],) + (1,) * (
|
| 223 |
+
x.ndim - 1
|
| 224 |
+
) # work with diff dim tensors, not just 2D ConvNets
|
| 225 |
+
random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
|
| 226 |
+
random_tensor.floor_() # binarize
|
| 227 |
+
output = x.div(keep_prob) * random_tensor
|
| 228 |
+
return output
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
class DropPath(nn.Module):
|
| 232 |
+
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
|
| 233 |
+
|
| 234 |
+
def __init__(self, drop_prob=None):
|
| 235 |
+
super(DropPath, self).__init__()
|
| 236 |
+
self.drop_prob = drop_prob
|
| 237 |
+
|
| 238 |
+
def forward(self, x):
|
| 239 |
+
return drop_path(x, self.drop_prob, self.training)
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
from torch.nn.init import _calculate_fan_in_and_fan_out
|
| 243 |
+
|
| 244 |
+
|
| 245 |
+
def _no_grad_trunc_normal_(tensor, mean, std, a, b):
|
| 246 |
+
# Cut & paste from PyTorch official master until it's in a few official releases - RW
|
| 247 |
+
# Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
|
| 248 |
+
def norm_cdf(x):
|
| 249 |
+
# Computes standard normal cumulative distribution function
|
| 250 |
+
return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
|
| 251 |
+
|
| 252 |
+
if (mean < a - 2 * std) or (mean > b + 2 * std):
|
| 253 |
+
warnings.warn(
|
| 254 |
+
"mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
|
| 255 |
+
"The distribution of values may be incorrect.",
|
| 256 |
+
stacklevel=2,
|
| 257 |
+
)
|
| 258 |
+
|
| 259 |
+
with torch.no_grad():
|
| 260 |
+
# Values are generated by using a truncated uniform distribution and
|
| 261 |
+
# then using the inverse CDF for the normal distribution.
|
| 262 |
+
# Get upper and lower cdf values
|
| 263 |
+
l = norm_cdf((a - mean) / std)
|
| 264 |
+
u = norm_cdf((b - mean) / std)
|
| 265 |
+
|
| 266 |
+
# Uniformly fill tensor with values from [l, u], then translate to
|
| 267 |
+
# [2l-1, 2u-1].
|
| 268 |
+
tensor.uniform_(2 * l - 1, 2 * u - 1)
|
| 269 |
+
|
| 270 |
+
# Use inverse cdf transform for normal distribution to get truncated
|
| 271 |
+
# standard normal
|
| 272 |
+
tensor.erfinv_()
|
| 273 |
+
|
| 274 |
+
# Transform to proper mean, std
|
| 275 |
+
tensor.mul_(std * math.sqrt(2.0))
|
| 276 |
+
tensor.add_(mean)
|
| 277 |
+
|
| 278 |
+
# Clamp to ensure it's in the proper range
|
| 279 |
+
tensor.clamp_(min=a, max=b)
|
| 280 |
+
return tensor
|
| 281 |
+
|
| 282 |
+
|
| 283 |
+
def trunc_normal_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0):
|
| 284 |
+
r"""Fills the input Tensor with values drawn from a truncated
|
| 285 |
+
normal distribution. The values are effectively drawn from the
|
| 286 |
+
normal distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`
|
| 287 |
+
with values outside :math:`[a, b]` redrawn until they are within
|
| 288 |
+
the bounds. The method used for generating the random values works
|
| 289 |
+
best when :math:`a \leq \text{mean} \leq b`.
|
| 290 |
+
Args:
|
| 291 |
+
tensor: an n-dimensional `torch.Tensor`
|
| 292 |
+
mean: the mean of the normal distribution
|
| 293 |
+
std: the standard deviation of the normal distribution
|
| 294 |
+
a: the minimum cutoff value
|
| 295 |
+
b: the maximum cutoff value
|
| 296 |
+
Examples:
|
| 297 |
+
>>> w = torch.empty(3, 5)
|
| 298 |
+
>>> nn.init.trunc_normal_(w)
|
| 299 |
+
"""
|
| 300 |
+
return _no_grad_trunc_normal_(tensor, mean, std, a, b)
|
| 301 |
+
|
| 302 |
+
|
| 303 |
+
def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="normal"):
|
| 304 |
+
fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
|
| 305 |
+
if mode == "fan_in":
|
| 306 |
+
denom = fan_in
|
| 307 |
+
elif mode == "fan_out":
|
| 308 |
+
denom = fan_out
|
| 309 |
+
elif mode == "fan_avg":
|
| 310 |
+
denom = (fan_in + fan_out) / 2
|
| 311 |
+
|
| 312 |
+
variance = scale / denom
|
| 313 |
+
|
| 314 |
+
if distribution == "truncated_normal":
|
| 315 |
+
# constant is stddev of standard normal truncated to (-2, 2)
|
| 316 |
+
trunc_normal_(tensor, std=math.sqrt(variance) / 0.87962566103423978)
|
| 317 |
+
elif distribution == "normal":
|
| 318 |
+
tensor.normal_(std=math.sqrt(variance))
|
| 319 |
+
elif distribution == "uniform":
|
| 320 |
+
bound = math.sqrt(3 * variance)
|
| 321 |
+
tensor.uniform_(-bound, bound)
|
| 322 |
+
else:
|
| 323 |
+
raise ValueError(f"invalid distribution {distribution}")
|
| 324 |
+
|
| 325 |
+
|
| 326 |
+
def lecun_normal_(tensor):
|
| 327 |
+
variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
|
| 328 |
+
|
| 329 |
+
|
| 330 |
+
def build_model_with_cfg(
|
| 331 |
+
model_cls,
|
| 332 |
+
variant: str,
|
| 333 |
+
pretrained: bool,
|
| 334 |
+
default_cfg: dict,
|
| 335 |
+
model_cfg=None,
|
| 336 |
+
feature_cfg=None,
|
| 337 |
+
pretrained_strict: bool = True,
|
| 338 |
+
pretrained_filter_fn=None,
|
| 339 |
+
pretrained_custom_load=False,
|
| 340 |
+
kwargs_filter=None,
|
| 341 |
+
**kwargs,
|
| 342 |
+
):
|
| 343 |
+
"""Build model with specified default_cfg and optional model_cfg
|
| 344 |
+
|
| 345 |
+
This helper fn aids in the construction of a model including:
|
| 346 |
+
* handling default_cfg and associated pretrained weight loading
|
| 347 |
+
* passing through optional model_cfg for models with config based arch spec
|
| 348 |
+
* features_only model adaptation
|
| 349 |
+
* pruning config / model adaptation
|
| 350 |
+
|
| 351 |
+
Args:
|
| 352 |
+
model_cls (nn.Module): model class
|
| 353 |
+
variant (str): model variant name
|
| 354 |
+
pretrained (bool): load pretrained weights
|
| 355 |
+
default_cfg (dict): model's default pretrained/task config
|
| 356 |
+
model_cfg (Optional[Dict]): model's architecture config
|
| 357 |
+
feature_cfg (Optional[Dict]): feature extraction adapter config
|
| 358 |
+
pretrained_strict (bool): load pretrained weights strictly
|
| 359 |
+
pretrained_filter_fn (Optional[Callable]): filter callable for pretrained weights
|
| 360 |
+
pretrained_custom_load (bool): use custom load fn, to load numpy or other non PyTorch weights
|
| 361 |
+
kwargs_filter (Optional[Tuple]): kwargs to filter before passing to model
|
| 362 |
+
**kwargs: model args passed through to model __init__
|
| 363 |
+
"""
|
| 364 |
+
pruned = kwargs.pop("pruned", False)
|
| 365 |
+
features = False
|
| 366 |
+
feature_cfg = feature_cfg or {}
|
| 367 |
+
default_cfg = deepcopy(default_cfg) if default_cfg else {}
|
| 368 |
+
update_default_cfg_and_kwargs(default_cfg, kwargs, kwargs_filter)
|
| 369 |
+
default_cfg.setdefault("architecture", variant)
|
| 370 |
+
|
| 371 |
+
# Setup for feature extraction wrapper done at end of this fn
|
| 372 |
+
if kwargs.pop("features_only", False):
|
| 373 |
+
features = True
|
| 374 |
+
feature_cfg.setdefault("out_indices", (0, 1, 2, 3, 4))
|
| 375 |
+
if "out_indices" in kwargs:
|
| 376 |
+
feature_cfg["out_indices"] = kwargs.pop("out_indices")
|
| 377 |
+
|
| 378 |
+
# Build the model
|
| 379 |
+
model = (
|
| 380 |
+
model_cls(**kwargs) if model_cfg is None else model_cls(cfg=model_cfg, **kwargs)
|
| 381 |
+
)
|
| 382 |
+
model.default_cfg = default_cfg
|
| 383 |
+
|
| 384 |
+
# For classification models, check class attr, then kwargs, then default to 1k, otherwise 0 for feats
|
| 385 |
+
num_classes_pretrained = (
|
| 386 |
+
0
|
| 387 |
+
if features
|
| 388 |
+
else getattr(model, "num_classes", kwargs.get("num_classes", 1000))
|
| 389 |
+
)
|
| 390 |
+
if pretrained:
|
| 391 |
+
assert not pretrained_custom_load, "URL should not contain npz for PASST models"
|
| 392 |
+
load_pretrained(
|
| 393 |
+
model,
|
| 394 |
+
num_classes=num_classes_pretrained,
|
| 395 |
+
in_chans=kwargs.get("in_chans", 3),
|
| 396 |
+
filter_fn=pretrained_filter_fn,
|
| 397 |
+
strict=pretrained_strict,
|
| 398 |
+
)
|
| 399 |
+
return model
|
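To illustrate what `adapt_input_conv` above does when an ImageNet checkpoint meets a mono spectrogram input, the toy example below collapses a 3-channel patch-embedding kernel to a single channel by summing over the RGB dimension; the tensor sizes are made up for the example.

```python
import torch
from models.frame_passt.vit_helpers import adapt_input_conv

rgb_weight = torch.randn(768, 3, 16, 16)   # pretend pretrained (out, in=3, kH, kW) patch-embed weight
mono_weight = adapt_input_conv(1, rgb_weight, input_conv_name="patch_embed.proj")
print(mono_weight.shape)                   # torch.Size([768, 1, 16, 16])
```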
models/m2d/M2D_wrapper.py
ADDED
|
@@ -0,0 +1,52 @@
|
| 1 |
+
from models.m2d.portable_m2d import PortableM2D as M2D
|
| 2 |
+
from models.transformer_wrapper import BaseModelWrapper
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
class M2DWrapper(BaseModelWrapper):
|
| 6 |
+
def __init__(self) -> None:
|
| 7 |
+
super().__init__()
|
| 8 |
+
self.m2d = M2D()
|
| 9 |
+
|
| 10 |
+
def mel_forward(self, x):
|
| 11 |
+
return self.m2d.to_normalized_feature(x)
|
| 12 |
+
|
| 13 |
+
def forward(self, spec):
|
| 14 |
+
return self.m2d.forward_mel(spec)
|
| 15 |
+
|
| 16 |
+
def separate_params(self):
|
| 17 |
+
pt_params = [[], [], [], [], [], [], [], [], [], [], [], []]
|
| 18 |
+
for k, p in self.named_parameters():
|
| 19 |
+
if any(['cls_token' in k,
|
| 20 |
+
'pos_embed' in k,
|
| 21 |
+
'norm_stats' in k,
|
| 22 |
+
'patch_embed' in k]):
|
| 23 |
+
pt_params[0].append(p)
|
| 24 |
+
elif 'blocks.0.' in k:
|
| 25 |
+
pt_params[0].append(p)
|
| 26 |
+
elif 'blocks.1.' in k:
|
| 27 |
+
pt_params[1].append(p)
|
| 28 |
+
elif 'blocks.2.' in k:
|
| 29 |
+
pt_params[2].append(p)
|
| 30 |
+
elif 'blocks.3.' in k:
|
| 31 |
+
pt_params[3].append(p)
|
| 32 |
+
elif 'blocks.4.' in k:
|
| 33 |
+
pt_params[4].append(p)
|
| 34 |
+
elif 'blocks.5.' in k:
|
| 35 |
+
pt_params[5].append(p)
|
| 36 |
+
elif 'blocks.6.' in k:
|
| 37 |
+
pt_params[6].append(p)
|
| 38 |
+
elif 'blocks.7.' in k:
|
| 39 |
+
pt_params[7].append(p)
|
| 40 |
+
elif 'blocks.8.' in k:
|
| 41 |
+
pt_params[8].append(p)
|
| 42 |
+
elif 'blocks.9.' in k:
|
| 43 |
+
pt_params[9].append(p)
|
| 44 |
+
elif 'blocks.10.' in k:
|
| 45 |
+
pt_params[10].append(p)
|
| 46 |
+
elif 'blocks.11.' in k:
|
| 47 |
+
pt_params[11].append(p)
|
| 48 |
+
elif 'backbone.norm.weight' in k or 'backbone.norm.bias' in k:
|
| 49 |
+
pt_params[11].append(p)
|
| 50 |
+
else:
|
| 51 |
+
raise ValueError(f"Check separate params for M2D! Unknown key: {k}")
|
| 52 |
+
return list(reversed(pt_params))
|
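A minimal sketch of how this wrapper is intended to be driven; the ten-second 16 kHz batch is an assumption, and `forward_mel` comes from `portable_m2d.py` added below.

```python
import torch
from models.m2d.M2D_wrapper import M2DWrapper

wrapper = M2DWrapper().eval()

audio = torch.randn(1, 10 * 16_000)           # (batch, samples) at 16 kHz, assumed duration
with torch.no_grad():
    spec = wrapper.mel_forward(audio)         # normalized log-mel spectrogram
    embeddings = wrapper.forward(spec)        # frame-level transformer features
```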
models/m2d/portable_m2d.py
ADDED
|
@@ -0,0 +1,410 @@
|
| 1 |
+
"""Masked Modeling Duo (M2D) Portable Runtime.
|
| 2 |
+
|
| 3 |
+
All you need is:
|
| 4 |
+
pip install timm, einops, nnAudio
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import logging
|
| 8 |
+
from functools import partial
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
|
| 11 |
+
import nnAudio.features
|
| 12 |
+
import numpy as np
|
| 13 |
+
import timm
|
| 14 |
+
import torch
|
| 15 |
+
from einops import rearrange
|
| 16 |
+
from timm.models.layers import trunc_normal_
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
class Config:
|
| 20 |
+
weight_file = ''
|
| 21 |
+
feature_d = 768 * 5
|
| 22 |
+
norm_type = 'all'  # was the built-in "all"; a string flag is intended
|
| 23 |
+
pooling_type = 'mean'
|
| 24 |
+
model = ''
|
| 25 |
+
input_size = [80, 208]
|
| 26 |
+
patch_size = [16, 16]
|
| 27 |
+
sr = '16k'
|
| 28 |
+
flat_features = False
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def expand_size(sz):
|
| 32 |
+
if isinstance(sz, int):
|
| 33 |
+
return [sz, sz]
|
| 34 |
+
return sz
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class PatchEmbed(torch.nn.Module):
|
| 38 |
+
""" 2D Image to Patch Embedding -- borrowed from https://pypi.org/project/timm/0.4.12/"""
|
| 39 |
+
|
| 40 |
+
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
|
| 41 |
+
super().__init__()
|
| 42 |
+
img_size = expand_size(img_size)
|
| 43 |
+
patch_size = expand_size(patch_size)
|
| 44 |
+
self.img_size = img_size
|
| 45 |
+
self.patch_size = patch_size
|
| 46 |
+
self.grid_size = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
|
| 47 |
+
self.num_patches = self.grid_size[0] * self.grid_size[1]
|
| 48 |
+
self.flatten = flatten
|
| 49 |
+
|
| 50 |
+
self.proj = torch.nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
|
| 51 |
+
self.norm = norm_layer(embed_dim) if norm_layer else torch.nn.Identity()
|
| 52 |
+
|
| 53 |
+
def forward(self, x):
|
| 54 |
+
x = self.proj(x)
|
| 55 |
+
if self.flatten:
|
| 56 |
+
x = x.flatten(2).transpose(1, 2) # BCHW -> BNC
|
| 57 |
+
x = self.norm(x)
|
| 58 |
+
return x
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
class LocalViT(timm.models.vision_transformer.VisionTransformer):
|
| 62 |
+
""" Vision Transformer for M2D Audio"""
|
| 63 |
+
|
| 64 |
+
def __init__(self, **kwargs):
|
| 65 |
+
super().__init__(**kwargs)
|
| 66 |
+
# Workaround for PatchEmbed to avoid unintended assertion failure. ex) AssertionError: Input image width (102) doesn't match model (608).
|
| 67 |
+
self.patch_embed = PatchEmbed(self.patch_embed.img_size, self.patch_embed.patch_size,
|
| 68 |
+
self.patch_embed.proj.in_channels, self.patch_embed.proj.out_channels)
|
| 69 |
+
self.norm_stats = torch.nn.Parameter(torch.tensor([-7.1, 4.2]), requires_grad=False)
|
| 70 |
+
# We do not use the default head
|
| 71 |
+
del self.head
|
| 72 |
+
|
| 73 |
+
def patch_size(self):
|
| 74 |
+
return np.array(self.patch_embed.patch_size)
|
| 75 |
+
|
| 76 |
+
def grid_size(self):
|
| 77 |
+
# Workaround for compatibility issue (timm 0.4.5 fails with: return self.patch_embed.grid_size)
|
| 78 |
+
img_size = np.array(self.patch_embed.img_size)
|
| 79 |
+
patch_size = self.patch_size()
|
| 80 |
+
grid_size = img_size // patch_size
|
| 81 |
+
return grid_size
|
| 82 |
+
|
| 83 |
+
def forward_encoder(self, x):
|
| 84 |
+
x = self.patch_embed(x)
|
| 85 |
+
|
| 86 |
+
# add pos embed w/o cls token
|
| 87 |
+
pos_embed = self.pos_embed[:, 1:, :]
|
| 88 |
+
if x.shape[1] < pos_embed.shape[1]: # shorten pos_embed for a short input
|
| 89 |
+
dims = pos_embed.shape[-1]
|
| 90 |
+
fbins = self.grid_size()[0]
|
| 91 |
+
frames = x.shape[1] // fbins
|
| 92 |
+
pos_embed = pos_embed.reshape(1, fbins, -1, dims)[:, :, :frames, :].reshape(1, fbins * frames, dims)
|
| 93 |
+
x = x + pos_embed
|
| 94 |
+
|
| 95 |
+
# append cls token
|
| 96 |
+
cls_token = self.cls_token + self.pos_embed[:, :1, :]
|
| 97 |
+
cls_tokens = cls_token.expand(x.shape[0], -1, -1)
|
| 98 |
+
x = torch.cat((cls_tokens, x), dim=1)
|
| 99 |
+
|
| 100 |
+
# apply Transformer blocks
|
| 101 |
+
for blk in self.blocks:
|
| 102 |
+
x = blk(x)
|
| 103 |
+
x = self.norm(x)
|
| 104 |
+
|
| 105 |
+
return x
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
def parse_sizes_by_name(name):
|
| 109 |
+
# Parse parameters. "m2d_vit_base-80x1001p16x16p16k" -> input size: 80x1001, patch size: 16x16, sr: 16k
|
| 110 |
+
model_cls = name.split('-')[0]
|
| 111 |
+
params = name.split('-')[1]
|
| 112 |
+
params = params.split('p')[:3]
|
| 113 |
+
input_str, patch_str, sr = params[0], params[1], params[2] if len(params) > 2 else '16k'
|
| 114 |
+
input_size = [int(a) for a in input_str.split('x')]
|
| 115 |
+
patch_size = [int(a) for a in patch_str.split('x')]
|
| 116 |
+
return input_size, patch_size, sr, model_cls
|
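For reference, the naming convention decoded by `parse_sizes_by_name` above, using the weight name from the comment:

```python
from models.m2d.portable_m2d import parse_sizes_by_name

input_size, patch_size, sr, model_cls = parse_sizes_by_name("m2d_vit_base-80x1001p16x16p16k")
# input_size == [80, 1001], patch_size == [16, 16], sr == '16k', model_cls == 'm2d_vit_base'
```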
| 117 |
+
|
| 118 |
+
|
| 119 |
+
def drop_non_model_weights(model, checkpoint, filename):
|
| 120 |
+
model_keys = [n for n, p in model.named_parameters()]
|
| 121 |
+
new_ckpt, dropped = {}, []
|
| 122 |
+
for k in checkpoint:
|
| 123 |
+
if k not in model_keys:
|
| 124 |
+
dropped.append(k)
|
| 125 |
+
continue
|
| 126 |
+
new_ckpt[k] = checkpoint[k]
|
| 127 |
+
n_org = len(checkpoint.keys())
|
| 128 |
+
n_cur = len(new_ckpt.keys())
|
| 129 |
+
print(
|
| 130 |
+
f' using {n_cur} parameters, while dropped {n_org - n_cur} out of {n_org} parameters from {Path(filename).parent / Path(filename).name}'
|
| 131 |
+
if n_org > n_cur else f' using {n_cur} parameters from {Path(filename).parent / Path(filename).name}')
|
| 132 |
+
print(' (dropped:', dropped[:5], ')' if len(dropped) < 5 else '...)')
|
| 133 |
+
return new_ckpt
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def load_evar_head_parameters(checkpoint, head_norm, head):
|
| 137 |
+
# Load the weights of the task head trained in the EVAR fine-tuning.
|
| 138 |
+
if 'module.head.norm.running_mean' in checkpoint:
|
| 139 |
+
head_norm.load_state_dict({to_k: checkpoint[k] for to_k, k in {
|
| 140 |
+
'running_mean': 'module.head.norm.running_mean', 'running_var': 'module.head.norm.running_var'}.items()})
|
| 141 |
+
head.load_state_dict({to_k: checkpoint[k] for to_k, k in {
|
| 142 |
+
'weight': 'module.head.mlp.mlp.0.weight', 'bias': 'module.head.mlp.mlp.0.bias'}.items()})
|
| 143 |
+
else:
|
| 144 |
+
print(' Not an EVAR checkpoint for loading head weights.')
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
def reformat_ckpt_keys(checkpoint):
|
| 148 |
+
# In case: checkpoint['model']
|
| 149 |
+
checkpoint = checkpoint['model'] if 'model' in checkpoint else checkpoint
|
| 150 |
+
# The checkpoints saved in a EVAR fine-tuning has a prefix of "module.ar.runtime.backbone", the following removes it.
|
| 151 |
+
new_ckpt = {}
|
| 152 |
+
for k in checkpoint:
|
| 153 |
+
new_k = k.replace('module.ar.runtime.backbone.', '') # replace
|
| 154 |
+
        new_ckpt[new_k] = checkpoint[k]
    return new_ckpt


def make_it_CLAP(model, checkpoint):
    # Add projectors if needed
    if 'audio_proj.0.weight' in checkpoint.keys():
        proj_hidden_dim = embed_dim = checkpoint['audio_proj.0.weight'].shape[1]
        model.audio_proj = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, proj_hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(proj_hidden_dim, embed_dim),
        )
    if 'text_proj.weight' in checkpoint.keys():
        dim = checkpoint['text_proj.weight'].shape
        model.text_proj = torch.nn.Linear(dim[1], dim[0])
    else:
        model.text_proj = torch.nn.Identity()


def get_backbone(args, weight_file):
    name = Path(weight_file).parent.name if weight_file is not None \
        else "m2d_clap_vit_base-80x1001p16x16-240128_AS-FT_enconly"
    args.input_size, args.patch_size, args.sr, args.beats = parse_sizes_by_name(name)

    # Create a ViT.
    model = LocalViT(
        in_chans=1, img_size=args.input_size, patch_size=args.patch_size, embed_dim=768, depth=12, num_heads=12,
        mlp_ratio=4, norm_layer=partial(torch.nn.LayerNorm, eps=1e-6))

    if weight_file is None:
        args.mean, args.std = -7.1, 4.2
        model.eval()
        return model, None

    # Load checkpoint.
    checkpoint = torch.load(weight_file, map_location='cpu')
    checkpoint = reformat_ckpt_keys(checkpoint)
    # Set normalization statistics for backward compatibility. The values [-7.1, 4.2] are for 2022 models.
    if 'norm_stats' not in checkpoint:
        checkpoint['norm_stats'] = torch.tensor([-7.1, 4.2])
        print(' using default norm_stats:', checkpoint['norm_stats'])

    # Modify the model if it should be an M2D-CLAP.
    make_it_CLAP(model, checkpoint)

    # Load weights.
    dropped = drop_non_model_weights(model, checkpoint, weight_file)
    msg = model.load_state_dict(dropped)
    print(msg)
    logging.info(msg)

    # Make normalization statistics for the model easy to use in the downstream task.
    args.mean, args.std = model.state_dict()['norm_stats'].to('cpu').numpy()

    model.eval()
    return model, checkpoint


def get_to_melspec(cfg):
    if cfg.sr == '16k':
        cfg.sample_rate, cfg.n_fft, cfg.window_size, cfg.hop_size = 16000, 400, 400, 160
        cfg.n_mels, cfg.f_min, cfg.f_max = 80, 50, 8000
    elif cfg.sr == '32k':
        cfg.sample_rate, cfg.n_fft, cfg.window_size, cfg.hop_size = 32000, 800, 800, 320
        cfg.n_mels, cfg.f_min, cfg.f_max = 80, 50, 16000
    else:
        assert False, f'Unknown sample rate: {cfg.sr}'

    to_spec = nnAudio.features.MelSpectrogram(
        sr=cfg.sample_rate,
        n_fft=cfg.n_fft,
        win_length=cfg.window_size,
        hop_length=cfg.hop_size,
        n_mels=cfg.n_mels,
        fmin=cfg.f_min,
        fmax=cfg.f_max,
        center=True,
        power=2,
        verbose=False,
    )
    logging.info(f'Runtime MelSpectrogram({cfg.sample_rate}, {cfg.n_fft}, {cfg.window_size}, {cfg.hop_size}, '
                 + f'{cfg.n_mels}, {cfg.f_min}, {cfg.f_max}):')
    logging.info(to_spec)
    return to_spec


def get_timestamps(cfg, batch_audio, x):  # Returns timestamps in milliseconds.
    audio_len = len(batch_audio[0])
    sec = audio_len / cfg.sample_rate
    x_len = len(x[0])
    step = sec / x_len * 1000  # sec -> ms
    ts = torch.tensor([step * i for i in range(x_len)]).unsqueeze(0)
    ts = ts.repeat(len(batch_audio), 1)
    return ts


class PortableM2D(torch.nn.Module):
    def __init__(self, weight_file=None, num_classes=None, freeze_embed=False, flat_features=None):
        super().__init__()
        self.cfg = Config()
        self.cfg.weight_file = weight_file
        self.cfg.freeze_embed = freeze_embed
        self.cfg.flat_features = self.cfg.flat_features if flat_features is None else flat_features

        # Create backbone model.
        self.backbone, checkpoint = get_backbone(self.cfg, self.cfg.weight_file)
        # Finalize feature dimension.
        d = self.backbone.pos_embed.shape[-1]
        if num_classes is not None and 'module.head.mlp.mlp.0.weight' in checkpoint and \
                checkpoint['module.head.mlp.mlp.0.weight'].shape[-1] == d:
            self.cfg.flat_features = True
        n_stack_feature = 1 if self.cfg.flat_features else (self.cfg.input_size[0] // self.cfg.patch_size[0])
        self.cfg.feature_d = d * n_stack_feature  # 768 if flat_features else 768*5=3840
        # Create head.
        if num_classes is not None:
            self.head_norm = torch.nn.BatchNorm1d(self.cfg.feature_d, affine=False)
            self.head = torch.nn.Linear(self.cfg.feature_d, num_classes)
            trunc_normal_(self.head.weight, std=2e-5)
            load_evar_head_parameters(checkpoint, self.head_norm, self.head)
        # Option: freeze patch embedding ([2211.09359] How to Fine-Tune Vision Models with SGD)
        if self.cfg.freeze_embed:
            models_mae.set_requires_grad(self.backbone.patch_embed, False)
            logging.info(' ** Freeze patch_embed **')
            logging.info(self.backbone.patch_embed)

        logging.info(f'Model input size: {self.cfg.input_size}')
        logging.info(f'Using weights: {self.cfg.weight_file}')
        logging.info(f'Feature dimension: {self.cfg.feature_d}')
        logging.info(f'Norm stats: {self.cfg.mean}, {self.cfg.std}')

        self.to_spec = get_to_melspec(self.cfg)
        self.eval()

    def to_log_mel_spec(self, batch_audio):
        x = self.to_spec(batch_audio)
        x = (x + torch.finfo().eps).log()
        x = x.unsqueeze(1)
        return x

    def normalize_batch(self, x):
        x = (x - self.cfg.mean) / self.cfg.std
        return x

    def to_normalized_feature(self, batch_audio):
        x = self.to_log_mel_spec(batch_audio)
        x = self.normalize_batch(x)
        return x

    def encode_lms(self, x, average_per_time_frame=False):
        patch_fbins = self.backbone.grid_size()[0]
        unit_frames = self.cfg.input_size[1]
        patch_frames = self.backbone.patch_size()[1]
        embed_d = self.backbone.patch_embed.proj.out_channels
        n_chunk = (x.shape[-1] + unit_frames - 1) // unit_frames
        pad_frames = (patch_frames - (x.shape[-1] % unit_frames % patch_frames)) % patch_frames
        if pad_frames > 0:
            x = torch.nn.functional.pad(x, (0, pad_frames))

        embeddings = []
        if self.cfg.flat_features:
            # flatten all patch embeddings
            for i in range(n_chunk):
                emb = self.backbone.forward_encoder(x[..., i * unit_frames:(i + 1) * unit_frames])
                emb = emb[..., 1:, :]
                if average_per_time_frame:
                    emb = rearrange(emb, 'b (f t) d -> b t d f', f=patch_fbins, d=embed_d).mean(-1)
                embeddings.append(emb)
        else:
            # stack embeddings along time frame
            for i in range(n_chunk):
                emb = self.backbone.forward_encoder(x[..., i * unit_frames:(i + 1) * unit_frames])
                emb = emb[..., 1:, :]
                emb = rearrange(emb, 'b (f t) d -> b t (f d)', f=patch_fbins, d=embed_d)
                embeddings.append(emb)
        # concatenate embedding chunks in the time axis
        x = torch.cat(embeddings, dim=-2)
        return x

    def encode(self, batch_audio, average_per_time_frame=False):
        x = self.to_normalized_feature(batch_audio)
        return self.encode_lms(x, average_per_time_frame=average_per_time_frame)

    def forward(self, batch_audio, average_per_time_frame=False):
        x = self.encode(batch_audio, average_per_time_frame=average_per_time_frame)
        if hasattr(self, 'head'):
            x = x.mean(1)  # B, D
            x = self.head_norm(x.unsqueeze(-1)).squeeze(-1)
            x = self.head(x)
        return x

    def forward_mel(self, batch_mel, average_per_time_frame=False):
        x = self.encode_lms(batch_mel, average_per_time_frame=average_per_time_frame)
        if hasattr(self, 'head'):
            x = x.mean(1)  # B, D
            x = self.head_norm(x.unsqueeze(-1)).squeeze(-1)
            x = self.head(x)
        return x

    def get_scene_embeddings(self, batch_audio):
        x = self.encode(batch_audio)
        x = torch.mean(x, dim=1)
        return x

    def get_timestamp_embeddings(self, batch_audio):
        x = self.encode(batch_audio, average_per_time_frame=True)
        ts = get_timestamps(self.cfg, batch_audio, x)
        return x, ts

    def forward_frames(self, batch_audio):
        x, ts = self.get_timestamp_embeddings(batch_audio)
        if hasattr(self, 'head'):
            x = self.head_norm(x.transpose(-1, -2)).transpose(-2, -1)
            x = self.head(x)
        return x, ts

    def encode_clap_audio(self, batch_audio):
        audio_embeddings = self.forward(batch_audio)
        audio_embeddings = audio_embeddings.mean(dim=-2)
        audio_embeddings = self.backbone.audio_proj(audio_embeddings)
        return audio_embeddings

    def encode_clap_text(self, batch_text, truncate=False):
        if not hasattr(self, 'text_encoder'):
            self.text_encoder = GTETextEncoder()
        text_embeddings = self.text_encoder(batch_text, truncate=truncate)
        text_embeddings = self.backbone.text_proj(text_embeddings)
        text_embeddings = text_embeddings.detach().cpu().to(torch.float)
        return text_embeddings


# For the CLAP models

class GTETextEncoder:
    def __init__(self, clip_weight="thenlper/gte-base"):
        from transformers import AutoTokenizer, AutoModel
        import os
        os.environ["TOKENIZERS_PARALLELISM"] = "true"  # To suppress warnings.

        self.tokenizer = AutoTokenizer.from_pretrained(clip_weight)
        self.model = AutoModel.from_pretrained(clip_weight)

    def __call__(self, texts, truncate=True, max_length=512):
        def average_pool(last_hidden_states, attention_mask):
            last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
            return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

        with torch.no_grad():
            device = next(self.model.parameters()).device
            batch_dict = self.tokenizer(texts, max_length=max_length, padding=True, truncation=truncate,
                                        return_tensors='pt')
            batch_dict['input_ids'] = batch_dict['input_ids'].to(device)
            batch_dict['token_type_ids'] = batch_dict['token_type_ids'].to(device)
            batch_dict['attention_mask'] = batch_dict['attention_mask'].to(device)
            outputs = self.model.to(device)(**batch_dict)
            embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings
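A minimal usage sketch for the PortableM2D wrapper above (illustrative only; it assumes the m2d helpers imported earlier resolve the default M2D-CLAP configuration when no weight file is given, and the clip length is arbitrary):

import torch

model = PortableM2D(weight_file=None)                 # falls back to the default norm stats (-7.1, 4.2)
wav = torch.randn(2, 10 * model.cfg.sample_rate)      # two 10-second clips of random audio
frame_emb, ts = model.get_timestamp_embeddings(wav)   # frame-level embeddings plus timestamps in ms
clip_emb = model.get_scene_embeddings(wav)            # one embedding per clip (time-averaged)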
models/prediction_wrapper.py
ADDED
@@ -0,0 +1,213 @@
import os

import torch
import torch.nn as nn
from torch.hub import download_url_to_file

from config import RESOURCES_FOLDER, CHECKPOINT_URLS
from models.seq_models import BidirectionalLSTM, BidirectionalGRU


class PredictionsWrapper(nn.Module):
    """
    A wrapper module that adds an optional sequence model and classification heads on top of a transformer.
    It implements equations (1), (2), and (3) in the paper.

    Args:
        base_model (BaseModelWrapper): The base model (transformer) providing sequence embeddings.
        checkpoint (str, optional): Checkpoint name for loading pre-trained weights. Default is None.
        n_classes_strong (int): Number of classes for strong predictions. Default is 447.
        n_classes_weak (int, optional): Number of classes for weak predictions. Default is None,
            which sets it equal to n_classes_strong.
        embed_dim (int, optional): Embedding dimension of the base model output. Default is 768.
        seq_len (int, optional): Desired sequence length. Default is 250 (40 ms resolution).
        seq_model_type (str, optional): Type of sequence model to use.
            Default is None, which means no additional sequence model is used.
        head_type (str, optional): Type of classification head. Choices are ["linear", "attention", None].
            Default is "linear". None means that sequence embeddings are returned.
        rnn_layers (int, optional): Number of RNN layers if seq_model_type is "rnn". Default is 2.
        rnn_type (str, optional): Type of RNN to use. Choices are ["BiGRU", "BiLSTM"]. Default is "BiGRU".
        rnn_dim (int, optional): Dimension of the RNN hidden state if seq_model_type is "rnn". Default is 2048.
        rnn_dropout (float, optional): Dropout rate for RNN layers. Default is 0.0.
    """

    def __init__(self,
                 base_model,
                 checkpoint=None,
                 n_classes_strong=447,
                 n_classes_weak=None,
                 embed_dim=768,
                 seq_len=250,
                 seq_model_type=None,
                 head_type="linear",
                 rnn_layers=2,
                 rnn_type="BiGRU",
                 rnn_dim=2048,
                 rnn_dropout=0.0
                 ):
        super(PredictionsWrapper, self).__init__()
        self.model = base_model
        self.seq_len = seq_len
        self.embed_dim = embed_dim
        self.n_classes_strong = n_classes_strong
        self.n_classes_weak = n_classes_weak if n_classes_weak is not None else n_classes_strong
        self.seq_model_type = seq_model_type
        self.head_type = head_type

        if self.seq_model_type == "rnn":
            if rnn_type == "BiGRU":
                self.seq_model = BidirectionalGRU(
                    n_in=self.embed_dim,
                    n_hidden=rnn_dim,
                    dropout=rnn_dropout,
                    num_layers=rnn_layers
                )
            elif rnn_type == "BiLSTM":
                self.seq_model = BidirectionalLSTM(
                    nIn=self.embed_dim,
                    nHidden=rnn_dim,
                    nOut=rnn_dim * 2,
                    dropout=rnn_dropout,
                    num_layers=rnn_layers
                )
            num_features = rnn_dim * 2
        elif self.seq_model_type is None:
            # no additional sequence model
            self.seq_model = nn.Identity()
            num_features = self.embed_dim
        else:
            raise ValueError(f"Unknown seq_model_type: {self.seq_model_type}")

        if self.head_type == "attention":
            assert self.n_classes_strong == self.n_classes_weak, \
                "head_type=='attention' requires the number of strong and weak classes to be the same!"

        if self.head_type is not None:
            self.strong_head = nn.Linear(num_features, self.n_classes_strong)
            self.weak_head = nn.Linear(num_features, self.n_classes_weak)
        if checkpoint is not None:
            print("Loading pretrained checkpoint: ", checkpoint)
            self.load_checkpoint(checkpoint)

    def load_checkpoint(self, checkpoint):
        ckpt_file = os.path.join(RESOURCES_FOLDER, checkpoint + ".pt")
        if not os.path.exists(ckpt_file):
            download_url_to_file(CHECKPOINT_URLS[checkpoint], ckpt_file)
        state_dict = torch.load(ckpt_file, map_location="cpu", weights_only=True)

        # compatibility with the uniform wrapper structure we introduced for the public repo
        if 'fpasst' in checkpoint:
            state_dict = {("model.fpasst." + k[len("model."):] if k.startswith("model.")
                           else k): v for k, v in state_dict.items()}
        elif 'M2D' in checkpoint:
            state_dict = {("model.m2d." + k[len("model."):] if not k.startswith("model.m2d.") and k.startswith("model.")
                           else k): v for k, v in state_dict.items()}
        elif 'BEATs' in checkpoint:
            state_dict = {("model.beats." + k[len("model.model."):] if k.startswith("model.model")
                           else k): v for k, v in state_dict.items()}
        elif 'ASIT' in checkpoint:
            state_dict = {("model.asit." + k[len("model."):] if k.startswith("model.")
                           else k): v for k, v in state_dict.items()}

        n_classes_weak_in_sd = state_dict['weak_head.bias'].shape[0] if 'weak_head.bias' in state_dict else -1
        n_classes_strong_in_sd = state_dict['strong_head.bias'].shape[0] if 'strong_head.bias' in state_dict else -1
        seq_model_in_sd = any('seq_model.' in key for key in state_dict.keys())
        keys_to_remove = []
        strict = True
        expected_missing = 0
        if self.head_type is None:
            # remove all keys related to the heads
            keys_to_remove.append('weak_head.bias')
            keys_to_remove.append('weak_head.weight')
            keys_to_remove.append('strong_head.bias')
            keys_to_remove.append('strong_head.weight')
        elif self.seq_model_type is not None and not seq_model_in_sd:
            # we want to train a sequence model (e.g., an RNN) on top of a
            # pre-trained transformer (e.g., AudioSet weak pre-trained)
            keys_to_remove.append('weak_head.bias')
            keys_to_remove.append('weak_head.weight')
            keys_to_remove.append('strong_head.bias')
            keys_to_remove.append('strong_head.weight')
            num_seq_model_keys = len([key for key in self.seq_model.state_dict()])
            expected_missing = len(keys_to_remove) + num_seq_model_keys
            strict = False
        else:
            # head type is not None
            if n_classes_weak_in_sd != self.n_classes_weak:
                # remove the weak head from the state dict
                keys_to_remove.append('weak_head.bias')
                keys_to_remove.append('weak_head.weight')
                strict = False
            if n_classes_strong_in_sd != self.n_classes_strong:
                # remove the strong head from the state dict
                keys_to_remove.append('strong_head.bias')
                keys_to_remove.append('strong_head.weight')
                strict = False
            expected_missing = len(keys_to_remove)

        # allow missing mel parameters for compatibility
        num_mel_keys = len([key for key in self.state_dict() if 'mel_transform' in key])
        if num_mel_keys > 0:
            expected_missing += num_mel_keys
            strict = False

        state_dict = {k: v for k, v in state_dict.items() if k not in keys_to_remove}
        missing, unexpected = self.load_state_dict(state_dict, strict=strict)
        assert len(missing) == expected_missing
        assert len(unexpected) == 0

    def separate_params(self):
        if hasattr(self.model, "separate_params"):
            return self.model.separate_params()
        else:
            raise NotImplementedError("The base model has no 'separate_params' method!")

    def has_separate_params(self):
        return hasattr(self.model, "separate_params")

    def mel_forward(self, x):
        return self.model.mel_forward(x)

    def forward(self, x):
        # the base model is expected to output a sequence (see Eq. (1) in the paper)
        # (batch size x sequence length x embedding dimension)
        x = self.model(x)

        # ATST:  x.shape: batch size x 250 x 768
        # PaSST: x.shape: batch size x 250 x 768
        # ASiT:  x.shape: batch size x 497 x 768
        # M2D:   x.shape: batch size x 62 x 3840
        # BEATs: x.shape: batch size x 496 x 768

        assert len(x.shape) == 3

        if x.size(-2) > self.seq_len:
            x = torch.nn.functional.adaptive_avg_pool1d(x.transpose(1, 2), self.seq_len).transpose(1, 2)
        elif x.size(-2) < self.seq_len:
            x = torch.nn.functional.interpolate(x.transpose(1, 2), size=self.seq_len,
                                                mode='linear').transpose(1, 2)

        # Eq. (3) in the paper
        # for teachers this is an RNN, for students it is nn.Identity
        x = self.seq_model(x)

        if self.head_type == "attention":
            # attention head to obtain weak from strong predictions
            # this is typically used for the DESED task, which requires both
            # weak and strong predictions
            strong = torch.sigmoid(self.strong_head(x))
            sof = torch.softmax(self.weak_head(x), dim=-1)
            sof = torch.clamp(sof, min=1e-7, max=1)
            weak = (strong * sof).sum(1) / sof.sum(1)
            return strong.transpose(1, 2), weak
        elif self.head_type == "linear":
            # simple linear layers as head (see Eq. (3) in the paper)
            # on AudioSet strong, only the strong predictions are used
            # on AudioSet weak, only the weak predictions are used
            # why both? because we tried to train simultaneously on AudioSet weak and strong (less successful)
            strong = self.strong_head(x)
            weak = self.weak_head(x.mean(dim=1))
            return strong.transpose(1, 2), weak
        else:
            # no head means the sequence is returned instead of strong and weak predictions
            return x
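To make the expected tensor flow concrete, here is a small self-contained sketch (the dummy base model below is purely illustrative and merely stands in for one of the transformer wrappers; it is not part of the repository, and the sketch assumes the module's imports, e.g. config, resolve):

import torch
import torch.nn as nn

class DummyBase(nn.Module):
    # stands in for a base model that outputs (batch, sequence length, embed dim)
    def forward(self, x):
        return torch.randn(x.shape[0], 497, 768)

wrapper = PredictionsWrapper(DummyBase(), n_classes_strong=447, head_type="linear")
strong, weak = wrapper(torch.randn(4, 160000))
print(strong.shape)  # torch.Size([4, 447, 250]) -- frame-level (strong) logits at 250 steps
print(weak.shape)    # torch.Size([4, 447])      -- clip-level (weak) logits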
models/seq_models.py
ADDED
@@ -0,0 +1,40 @@
import torch.nn as nn


class BidirectionalGRU(nn.Module):
    def __init__(self, n_in, n_hidden, dropout=0, num_layers=1):
        super(BidirectionalGRU, self).__init__()
        self.rnn = nn.GRU(
            n_in,
            n_hidden,
            bidirectional=True,
            dropout=dropout,
            batch_first=True,
            num_layers=num_layers,
        )

    def forward(self, input_feat):
        recurrent, _ = self.rnn(input_feat)
        return recurrent


class BidirectionalLSTM(nn.Module):
    def __init__(self, nIn, nHidden, nOut, dropout=0, num_layers=1):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(
            nIn,
            nHidden,
            bidirectional=True,
            batch_first=True,
            dropout=dropout,
            num_layers=num_layers,
        )
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input_feat):
        recurrent, _ = self.rnn(input_feat)
        b, T, h = recurrent.size()
        t_rec = recurrent.contiguous().view(b * T, h)
        output = self.embedding(t_rec)
        output = output.view(b, T, -1)
        return output
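A quick, illustrative shape check for the GRU variant used as the sequence model in the wrapper above:

import torch
rnn = BidirectionalGRU(n_in=768, n_hidden=256, num_layers=2)
out = rnn(torch.randn(4, 250, 768))
print(out.shape)  # torch.Size([4, 250, 512]) -- 2 * n_hidden, since the GRU is bidirectional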
models/transformer_wrapper.py
ADDED
@@ -0,0 +1,19 @@
from abc import ABC, abstractmethod
import torch.nn as nn


class BaseModelWrapper(ABC, nn.Module):
    @abstractmethod
    def mel_forward(self, x):
        """Process input waveform to mel spectrogram."""
        pass

    @abstractmethod
    def forward(self, x):
        """Extract embedding sequence from mel spectrogram."""
        pass

    @abstractmethod
    def separate_params(self):
        """Separate model parameters into predefined groups for layer-wise learning rate decay."""
        pass
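For reference, a toy subclass that satisfies this interface might look like the following (hypothetical, not part of the repository; the real wrappers compute mel spectrograms and return several parameter groups):

import torch

class ToyWrapper(BaseModelWrapper):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(128, 768)

    def mel_forward(self, x):
        return x  # a real wrapper would turn the waveform into a mel spectrogram here

    def forward(self, x):
        return self.proj(x)  # (batch, time, 128) -> (batch, time, 768)

    def separate_params(self):
        return [list(self.parameters())]  # a single group; real wrappers return multiple groups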
requirements.txt
ADDED
@@ -0,0 +1,17 @@
numpy<2
librosa
pandas
timm
nnAudio
av>=10.0.0
h5py>=3.8.0
jsonpickle>=3.0.1
hf_transfer>=0.1.4
hf-fastup>=0.0.5
datasets>=2.15.0
pytorch-lightning>=2.0.0
wandb
transformers
sed_scores_eval==0.0.3
intervaltree
more-itertools
resources/README.md
ADDED
@@ -0,0 +1 @@
In this folder, we place all files that are automatically downloaded (such as model checkpoints).
resources/best_model_BEATs.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e610c0ce85b77d15cdba5d25e02618ae47eada299f0c3d77fd802e19316ed821
size 361619724
resources/eval_durations.csv
ADDED
The diff for this file is too large to render. See raw diff.
resources/labelvocabulary.csv
ADDED
@@ -0,0 +1,89 @@
idx,label
0,21
1,22
2,23
3,24
4,25
5,26
6,27
7,28
8,29
9,30
10,31
11,32
12,33
13,34
14,35
15,36
16,37
17,38
18,39
19,40
20,41
21,42
22,43
23,44
24,45
25,46
26,47
27,48
28,49
29,50
30,51
31,52
32,53
33,54
34,55
35,56
36,57
37,58
38,59
39,60
40,61
41,62
42,63
43,64
44,65
45,66
46,67
47,68
48,69
49,70
50,71
51,72
52,73
53,74
54,75
55,76
56,77
57,78
58,79
59,80
60,81
61,82
62,83
63,84
64,85
65,86
66,87
67,88
68,89
69,90
70,91
71,92
72,93
73,94
74,95
75,96
76,97
77,98
78,99
79,100
80,101
81,102
82,103
83,104
84,105
85,106
86,107
87,108
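The vocabulary simply maps the 88 row indices (0-87) to label ids 21 through 108. An illustrative way to load it:

import pandas as pd
vocab = pd.read_csv("resources/labelvocabulary.csv")    # columns: idx, label
idx_to_label = dict(zip(vocab["idx"], vocab["label"]))   # e.g. 0 -> 21, 87 -> 108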