Upload MILFER denoiser v1.0

Files changed (8)

.gitattributes +0 -34
README.md +159 -0
milfer.py +200 -0
run.sh +7 -0
weights/decoder_state_dict.pt +3 -0
weights/feature_predictor_config.json +81 -0
weights/feature_predictor_state_dict.pt +3 -0
weights/milfer_config.json +18 -0

.gitattributes CHANGED

@@ -1,35 +1 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,159 @@

+---
+language:
+- ru
+- en
+tags:
+- audio
+- speech
+- audio-restoration
+- pytorch
+- cuda
+pipeline_tag: audio-to-audio
+license: other
+---
+# MILFER
+MILFER is a standalone PyTorch audio-to-audio model for speech-preserving audio
+restoration. It takes an input audio file, extracts SSL speech features, and
+reconstructs a 48 kHz waveform with the bundled neural decoder.
+The bundled checkpoint is `milfer_lora100h_step001000`.
+## Highlights
+- Pure PyTorch inference, no TorchScript runtime required.
+- CUDA fp16 inference by default when a CUDA GPU is available.
+- Accepts common audio formats supported by `torchaudio`, including wav and mp3.
+- Emits a mono 48 kHz wav file.
+- Tuned to preserve more game/dialogue sound character than the base checkpoint.
+## Quick Start
+```bash
+python milfer.py input.wav output.wav
+```
+For CUDA fp16:
+```bash
+python milfer.py input.mp3 output.wav --device cuda --precision fp16
+```
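+For repeated inference in the same Python process, compile the feature model with
+`--compile-feature` and run one warm-up pass first. Compilation makes the first call
+slow and requires CUDA; a single CLI run pays that cost without reusing the compiled
+model: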
+```bash
+python milfer.py input.mp3 output.wav --device cuda --precision fp16 --compile-feature
+```
+The helper script runs `milfer.py` with the interpreter at `.venv/bin/python` next to
+the script, or with the one named by the `MILFER_PYTHON` environment variable:
+```bash
+./run.sh input.wav output.wav --device cuda --precision fp16
+```
+## Files
+```text
+milfer.py
+run.sh
+weights/
+  decoder_state_dict.pt
+  feature_predictor_config.json
+  feature_predictor_state_dict.pt
+  milfer_config.json
+```
+## Requirements
+Tested with:
+- Python 3.10
+- PyTorch 2.6.0 + CUDA 12.4
+- torchaudio 2.6.0
+- transformers
+- soundfile
+- descript-audio-codec
+## Clean Input Check
+The table below measures how much MILFER changes already-clean clips. It is a
+sanity check, not a denoising benchmark.
+Evaluation set: `prompts_5kh`, 250 mono wav clips, 44.1 kHz, 19.0 minutes total.
+Higher is better for STOI, eSTOI, PESQ-WB, and MOS predictors. Lower is better
+for LSD and clipped samples.
+| Subset | Files | STOI | eSTOI | PESQ-WB | LSD 16 kHz | Clipped Samples |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| all clips | 250 | 0.9241 | 0.8719 | 2.1653 | 11.825 dB | 0.0006% |
+| duration >= 1 s | 232 | 0.9288 | 0.8767 | 2.1917 | 11.683 dB | 0.0005% |
+No-reference MOS predictors on the original clips and the MILFER outputs:
+| Subset | Audio | UTMOS | DistillMOS | NISQA-TTS |
+| --- | --- | ---: | ---: | ---: |
+| all clips | original | 2.9998 | 3.9392 | 3.6311 |
+| all clips | MILFER | 2.9741 | 3.8080 | 3.7021 |
+| all clips | delta | -0.0258 | -0.1313 | +0.0710 |
+| duration >= 1 s | original | 3.0120 | 3.9829 | 3.6603 |
+| duration >= 1 s | MILFER | 2.9977 | 3.8554 | 3.7483 |
+| duration >= 1 s | delta | -0.0143 | -0.1275 | +0.0880 |
+Very short clips can make intelligibility metrics unstable, so the filtered row
+excludes clips shorter than one second.
+## Degraded-Input Evaluation
+For a denoising-style benchmark, the clean prompts were synthetically degraded and
+then processed with MILFER. Each metric compares either the degraded input or the
+MILFER output against the original clean prompt. The tables use the
+`duration >= 1 s` subset: 232 clips, 18.8 minutes total.
+Degradation profiles:
+- `noisy_room`: additive noise, room response, light band-limiting.
+- `radio_clip`: band-pass channel, saturation, quantization, hiss.
+- `mixed_hard`: noise, reverb, band-limiting, downsampling, saturation.
+Full-reference metrics:
+| Profile | STOI | eSTOI | PESQ-WB | LSD 16 kHz |
+| --- | ---: | ---: | ---: | ---: |
+| noisy_room degraded | 0.8830 | 0.7128 | 1.2104 | 18.121 dB |
+| noisy_room MILFER | 0.9020 | 0.8145 | 1.7757 | 13.174 dB |
+| noisy_room delta | +0.0190 | +0.1017 | +0.5653 | -4.947 dB |
+| mixed_hard degraded | 0.8617 | 0.7079 | 1.1851 | 23.143 dB |
+| mixed_hard MILFER | 0.8948 | 0.8068 | 1.7321 | 13.904 dB |
+| mixed_hard delta | +0.0331 | +0.0989 | +0.5470 | -9.239 dB |
+| radio_clip degraded | 0.9185 | 0.8528 | 1.8765 | 19.167 dB |
+| radio_clip MILFER | 0.9040 | 0.8397 | 1.9412 | 14.237 dB |
+| radio_clip delta | -0.0145 | -0.0131 | +0.0647 | -4.930 dB |
+No-reference MOS predictors:
+| Profile | UTMOS | DistillMOS | NISQA-TTS |
+| --- | ---: | ---: | ---: |
+| noisy_room degraded | 1.4220 | 2.7625 | 1.8844 |
+| noisy_room MILFER | 3.0324 | 3.7896 | 3.7557 |
+| noisy_room delta | +1.6104 | +1.0270 | +1.8713 |
+| mixed_hard degraded | 1.3709 | 2.4796 | 2.1738 |
+| mixed_hard MILFER | 2.9478 | 3.7044 | 3.7243 |
+| mixed_hard delta | +1.5769 | +1.2248 | +1.5505 |
+| radio_clip degraded | 1.4901 | 2.9555 | 2.5081 |
+| radio_clip MILFER | 2.8414 | 3.7817 | 3.5840 |
+| radio_clip delta | +1.3513 | +0.8262 | +1.0759 |
+## Notes
+- Input audio is mixed to mono and resampled to 16 kHz for feature extraction.
+- Output is written as mono 48 kHz PCM wav.
+- The whole file is processed in one pass, so peak memory grows with input duration.
+- This is an experimental checkpoint. It can still change ambience, effects,
+  music, and non-speech sounds.
+## License
+License is not specified in this package. Set the final license field before
+publishing if you need redistributable model weights.
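
The `delta` rows in the README tables are the MILFER value minus the degraded (or original) value for the same metric. A minimal sketch that recomputes the full-reference deltas for the `noisy_room` profile from the numbers in the table; the dictionary layout and the `delta` and `improved` helpers are illustrative and not part of the package:

```python
# Values copied from the README's full-reference table (noisy_room profile).
degraded = {"STOI": 0.8830, "eSTOI": 0.7128, "PESQ-WB": 1.2104, "LSD": 18.121}
restored = {"STOI": 0.9020, "eSTOI": 0.8145, "PESQ-WB": 1.7757, "LSD": 13.174}

# LSD is an error measure, so a negative delta is an improvement there.
LOWER_IS_BETTER = {"LSD"}


def delta(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return after - before for every metric, rounded to four decimals."""
    return {name: round(after[name] - before[name], 4) for name in before}


def improved(name: str, change: float) -> bool:
    """True when the change moves the metric in its 'better' direction."""
    return change < 0 if name in LOWER_IS_BETTER else change > 0


changes = delta(degraded, restored)
for name, change in changes.items():
    print(f"{name}: {change:+.4f} improved={improved(name, change)}")
```

The same check on the `radio_clip` rows would flag STOI and eSTOI as not improved, matching the negative deltas reported there.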

milfer.py ADDED

	@@ -0,0 +1,200 @@

+#!/usr/bin/env python3
+"""MILFER command line inference."""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+from typing import Sequence
+import soundfile as sf
+import torch
+import torchaudio
+from dac.model.dac import Decoder
+from transformers import Wav2Vec2BertConfig, Wav2Vec2BertModel
+ROOT = Path(__file__).resolve().parent
+WEIGHTS = ROOT / "weights"
+def build_argparser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Run MILFER audio processing.")
+    parser.add_argument("input", type=Path, help="Input audio path")
+    parser.add_argument("output", type=Path, help="Output wav path")
+    parser.add_argument("--weights", type=Path, default=WEIGHTS)
+    parser.add_argument("--device", default="auto", help="auto, cuda, or cpu")
+    parser.add_argument(
+        "--precision",
+        choices=("auto", "fp32", "fp16"),
+        default="auto",
+        help="auto uses fp16 on CUDA and fp32 on CPU.",
+    )
+    parser.add_argument(
+        "--compile-feature",
+        action="store_true",
+        help="Compile the feature predictor. Slow first call, faster repeated calls.",
+    )
+    return parser
+def resolve_device(name: str) -> torch.device:
+    if name == "auto":
+        name = "cuda" if torch.cuda.is_available() else "cpu"
+    return torch.device(name)
+def use_fp16(precision: str, device: torch.device) -> bool:
+    if precision == "fp16":
+        if device.type != "cuda":
+            raise ValueError("--precision fp16 requires CUDA")
+        return True
+    if precision == "fp32":
+        return False
+    return device.type == "cuda"
+def load_audio(path: Path, sample_rate: int) -> tuple[torch.Tensor, float]:
+    waveform, source_sr = torchaudio.load(str(path))
+    if waveform.ndim == 2 and waveform.shape[0] > 1:
+        waveform = waveform.mean(dim=0)
+    else:
+        waveform = waveform.view(-1)
+    waveform = waveform.to(torch.float32).contiguous()
+    duration = waveform.numel() / float(source_sr)
+    if source_sr != sample_rate:
+        waveform = torchaudio.functional.resample(
+            waveform,
+            source_sr,
+            sample_rate,
+        ).contiguous()
+    return waveform, duration
+def extract_features(
+    waveforms: Sequence[torch.Tensor],
+    device: torch.device,
+    sampling_rate: int = 16_000,
+    padding_value: float = 1.0,
+) -> torch.Tensor:
+    mel_features: list[torch.Tensor] = []
+    for waveform in waveforms:
+        waveform = waveform.to(device=device, dtype=torch.float32)
+        if waveform.ndim > 1:
+            waveform = waveform[0]
+        feature = torchaudio.compliance.kaldi.fbank(
+            waveform=waveform.unsqueeze(0),
+            sample_frequency=sampling_rate,
+            num_mel_bins=80,
+            frame_length=25,
+            frame_shift=10,
+            dither=0.0,
+            preemphasis_coefficient=0.97,
+            remove_dc_offset=True,
+            window_type="povey",
+            use_energy=False,
+            energy_floor=1.192092955078125e-07,
+        )
+        # Normalize each mel bin to zero mean and unit variance over time.
+        mean = feature.mean(0, keepdim=True)
+        var = feature.var(0, keepdim=True)
+        feature = (feature - mean) / torch.sqrt(var + 1e-5)
+        mel_features.append(feature)
+    # Pad every item to a shared, even frame count so frames can be paired.
+    target_frames = max(feature.shape[0] for feature in mel_features)
+    if target_frames % 2:
+        target_frames += 1
+    batch = torch.full(
+        (len(mel_features), target_frames, 80),
+        padding_value,
+        dtype=torch.float32,
+        device=device,
+    )
+    for index, feature in enumerate(mel_features):
+        batch[index, : feature.shape[0]] = feature
+    # Stack two consecutive 80-bin frames into one 160-dim input row.
+    return batch.reshape(len(mel_features), target_frames // 2, 160)
+def load_feature_predictor(weights: Path, device: torch.device) -> torch.nn.Module:
+    config = Wav2Vec2BertConfig.from_json_file(str(weights / "feature_predictor_config.json"))
+    model = Wav2Vec2BertModel(config)
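+    # weights_only=True restricts unpickling to tensors and plain containers.
+    state = torch.load(
+        weights / "feature_predictor_state_dict.pt",
+        map_location="cpu",
+        weights_only=True,
+    )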
+    model.load_state_dict(state)
+    model.eval()
+    return model.to(device)
+def load_decoder(weights: Path, device: torch.device) -> torch.nn.Module:
+    with (weights / "milfer_config.json").open("r", encoding="utf-8") as file:
+        config = json.load(file)
+    decoder_config = config["decoder"]
+    decoder = Decoder(
+        input_channel=decoder_config["input_channel"],
+        channels=decoder_config["channels"],
+        rates=decoder_config["rates"],
+    )
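+    # weights_only=True restricts unpickling to tensors and plain containers.
+    state = torch.load(
+        weights / "decoder_state_dict.pt",
+        map_location="cpu",
+        weights_only=True,
+    )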
+    decoder.load_state_dict(state)
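+    # Fold weight normalization into plain weights for inference.
+    # Modules without weight norm raise and are skipped.
+    for module in decoder.modules():
+        try:
+            torch.nn.utils.remove_weight_norm(module)
+        except (AttributeError, ValueError):
+            pass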
+    decoder.eval()
+    return decoder.to(device)
+@torch.inference_mode()
+def run(args: argparse.Namespace) -> None:
+    device = resolve_device(args.device)
+    # inference_mode already disables autograd, so no separate grad toggle is needed.
+    if device.type == "cuda":
+        torch.backends.cudnn.benchmark = True
+    half = use_fp16(args.precision, device)
+    waveform, duration = load_audio(args.input, sample_rate=16_000)
+    expected_samples = int(round(duration * 48_000))
+    feature_predictor = load_feature_predictor(args.weights, device)
+    decoder = load_decoder(args.weights, device)
+    if half:
+        feature_predictor = feature_predictor.half()
+        decoder = decoder.half()
+    if args.compile_feature:
+        if device.type != "cuda":
+            raise ValueError("--compile-feature requires CUDA")
+        feature_predictor = torch.compile(
+            feature_predictor,
+            mode="reduce-overhead",
+            fullgraph=False,
+        )
+    features = extract_features([waveform], device=device)
+    if half:
+        features = features.half()
+    hidden = feature_predictor(input_features=features).last_hidden_state
+    restored = decoder(hidden.transpose(1, 2))[0].view(-1).float().cpu()
+    # Pad with silence or trim so the output length matches the input duration at 48 kHz.
+    if restored.numel() < expected_samples:
+        restored = torch.nn.functional.pad(restored, (0, expected_samples - restored.numel()))
+    elif restored.numel() > expected_samples:
+        restored = restored[:expected_samples]
+    args.output.parent.mkdir(parents=True, exist_ok=True)
+    sf.write(
+        str(args.output),
+        restored.clamp(-1.0, 1.0).numpy(),
+        48_000,
+        subtype="PCM_16",
+    )
+    print(f"wrote={args.output}")
+def main(argv: Sequence[str] | None = None) -> None:
+    args = build_argparser().parse_args(argv)
+    run(args)
+if __name__ == "__main__":
+    main()
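
`extract_features` pads the 80-bin log-mel frames to an even count and reshapes them so that each model input row holds two consecutive frames (160 values). A minimal pure-Python sketch of that stacking step, using nested lists instead of tensors; `stack_pairs` is a hypothetical helper, not a function in `milfer.py`:

```python
def stack_pairs(frames: list[list[float]], padding_value: float = 1.0) -> list[list[float]]:
    """Concatenate each pair of consecutive frames into one row.

    An odd frame count is first padded with one frame filled with
    `padding_value`, mirroring the even-length padding in extract_features.
    """
    if not frames:
        return []
    width = len(frames[0])
    padded = list(frames)
    if len(padded) % 2:
        padded.append([padding_value] * width)
    # Row i holds frame 2*i followed by frame 2*i + 1.
    return [padded[i] + padded[i + 1] for i in range(0, len(padded), 2)]


# Three 4-bin frames stand in for the real 80-bin features.
frames = [[0.0] * 4, [0.1] * 4, [0.2] * 4]
stacked = stack_pairs(frames)
print(len(stacked), len(stacked[0]))  # 2 rows of 8 values
```

With the 10 ms frame shift used in `milfer.py`, this pairing halves the nominal frame rate from 100 to 50 rows per second.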

run.sh ADDED

	@@ -0,0 +1,7 @@

+#!/usr/bin/env bash
+set -euo pipefail
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PYTHON_BIN="${MILFER_PYTHON:-${ROOT}/.venv/bin/python}"
+exec "${PYTHON_BIN}" "${ROOT}/milfer.py" "$@"
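
`run.sh` picks its interpreter with the shell expansion `${MILFER_PYTHON:-...}`, which falls back to the virtual environment next to the script when the variable is unset or empty. The same rule sketched in Python, for callers that launch `milfer.py` from another script; `resolve_python` is a hypothetical helper and the install path is illustrative:

```python
from pathlib import Path
from typing import Mapping


def resolve_python(root: Path, env: Mapping[str, str]) -> Path:
    """Mimic `${MILFER_PYTHON:-${ROOT}/.venv/bin/python}` from run.sh.

    The `:-` form treats an empty variable like an unset one, so `or`
    (rather than a plain dict default) is the matching Python idiom.
    """
    return Path(env.get("MILFER_PYTHON") or root / ".venv" / "bin" / "python")


root = Path("/opt/milfer")  # illustrative install location
print(resolve_python(root, {}))
print(resolve_python(root, {"MILFER_PYTHON": "/usr/bin/python3"}))
```

In real use, `os.environ` can be passed as `env`.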

weights/decoder_state_dict.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f97bb316f4dfc4186463dfdb820cd6d9e31159b483ded78385632f41dc34cec8
+size 209835977
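
The three lines above are a Git LFS pointer: the repository stores only the object's SHA-256 digest and byte size, and the real checkpoint is fetched separately. A minimal sketch of how a downloaded file could be checked against such a pointer; `parse_pointer` and `matches_pointer` are hypothetical helpers, and the demo uses a small temporary file rather than the real weights:

```python
import hashlib
import tempfile
from pathlib import Path


def parse_pointer(text: str) -> dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer: dict[str, str]) -> bool:
    """Check the file's size and SHA-256 digest against the pointer."""
    algorithm, _, expected = pointer["oid"].partition(":")
    if algorithm != "sha256" or path.stat().st_size != int(pointer["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large checkpoints are not read into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Demo with a throwaway file; the pointer is built from the same bytes.
payload = b"not a real checkpoint"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pt"
    sample.write_bytes(payload)
    pointer = parse_pointer(
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{hashlib.sha256(payload).hexdigest()}\n"
        f"size {len(payload)}\n"
    )
    ok = matches_pointer(sample, pointer)
print(ok)
```

For the real checkpoint, the pointer text would be the three lines shown in this diff and `path` the downloaded `decoder_state_dict.pt`.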

weights/feature_predictor_config.json ADDED

	@@ -0,0 +1,81 @@

+{
+  "activation_dropout": 0.0,
+  "adapter_act": "relu",
+  "adapter_kernel_size": 3,
+  "adapter_stride": 2,
+  "add_adapter": false,
+  "apply_spec_augment": false,
+  "architectures": [
+    "Wav2Vec2BertModel"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "classifier_proj_size": 768,
+  "codevector_dim": 768,
+  "conformer_conv_dropout": 0.1,
+  "contrastive_logits_temperature": 0.1,
+  "conv_depthwise_kernel_size": 31,
+  "ctc_loss_reduction": "sum",
+  "ctc_zero_infinity": false,
+  "diversity_loss_weight": 0.1,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "feat_proj_dropout": 0.0,
+  "feat_quantizer_dropout": 0.0,
+  "feature_projection_input_dim": 160,
+  "final_dropout": 0.1,
+  "hidden_act": "swish",
+  "hidden_dropout": 0.0,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "layerdrop": 0.0,
+  "left_max_position_embeddings": 64,
+  "mask_feature_length": 10,
+  "mask_feature_min_masks": 0,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_masks": 2,
+  "mask_time_prob": 0.05,
+  "max_source_positions": 5000,
+  "model_type": "wav2vec2-bert",
+  "num_adapter_layers": 1,
+  "num_attention_heads": 16,
+  "num_codevector_groups": 2,
+  "num_codevectors_per_group": 320,
+  "num_hidden_layers": 8,
+  "num_negatives": 100,
+  "output_hidden_size": 1024,
+  "pad_token_id": 0,
+  "position_embeddings_type": "relative_key",
+  "proj_codevector_dim": 768,
+  "right_max_position_embeddings": 8,
+  "rotary_embedding_base": 10000,
+  "tdnn_dilation": [
+    1,
+    2,
+    3,
+    1,
+    1
+  ],
+  "tdnn_dim": [
+    512,
+    512,
+    512,
+    512,
+    1500
+  ],
+  "tdnn_kernel": [
+    5,
+    3,
+    3,
+    1,
+    1
+  ],
+  "transformers_version": "5.9.0",
+  "use_intermediate_ffn_before_adapter": false,
+  "use_weighted_layer_sum": false,
+  "vocab_size": null,
+  "xvector_output_dim": 512
+}

weights/feature_predictor_state_dict.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8837be29d1a70cb00c8b295ae520dab1ff411b7214ec411900183d550650915d
+size 774540736

weights/milfer_config.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "name": "milfer",
+  "feature_sample_rate": 16000,
+  "target_sample_rate": 48000,
+  "feature_dim": 1024,
+  "checkpoint": "milfer_lora100h_step001000",
+  "decoder": {
+    "input_channel": 1024,
+    "channels": 1536,
+    "rates": [
+      8,
+      5,
+      4,
+      3,
+      2
+    ]
+  }
+}
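
The decoder `rates` multiply to the total upsampling factor per feature row. Assuming the 10 ms frame shift and pairwise frame stacking used in `milfer.py`, the feature predictor emits a nominal 50 rows per second, which lines up with `target_sample_rate`. A short sketch of that arithmetic; the actual decoder output length may differ slightly from the nominal value, which is why `milfer.py` pads or trims to the expected length:

```python
import math

config = {
    "feature_sample_rate": 16000,
    "target_sample_rate": 48000,
    "decoder": {"rates": [8, 5, 4, 3, 2]},
}

FRAME_SHIFT_MS = 10   # frame_shift passed to the fbank call in milfer.py
FRAMES_PER_ROW = 2    # extract_features stacks two frames per model input row

upsampling = math.prod(config["decoder"]["rates"])            # samples per row
rows_per_second = 1000 // FRAME_SHIFT_MS // FRAMES_PER_ROW    # nominal row rate
nominal_rate = rows_per_second * upsampling

print(upsampling, rows_per_second, nominal_rate)  # 960 50 48000


def expected_samples(duration_seconds: float, sample_rate: int = 48_000) -> int:
    """Target output length, as computed in run() before padding or trimming."""
    return int(round(duration_seconds * sample_rate))


print(expected_samples(1.5))  # 72000
```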