nguyensu27 committed
Commit 1c3fc13 · verified · 1 Parent(s): e3407e2

Upload 2 files

Files changed (2)
  1. NeuCodec-Thumbnail.jpg +0 -0
  2. README.md +111 -0
NeuCodec-Thumbnail.jpg ADDED
README.md ADDED
@@ -0,0 +1,111 @@
+
+
+ ---
+ license: apache-2.0
+ tags:
+ - audio
+ - speech
+ - audio-to-audio
+ - speech-language-models
+ datasets:
+ - amphion/Emilia-Dataset
+ - facebook/multilingual_librispeech
+ - CSTR-Edinburgh/vctk
+ - google/fleurs
+ - mozilla-foundation/common_voice_13_0
+ - mythicinfinity/libritts_r
+ ---
+
+ # NeuCodec 🎧
+
+ [![NeuCodec Intro](NeuCodec-Thumbnail.jpg)](https://www.youtube.com/watch?v=O7XH1lGZyYY)
+
+ *Click the image above to see NeuCodec in action on YouTube!*
+
+ *Created by Neuphonic - building faster, smaller, on-device voice AI*
+
+ A lightweight neural codec that encodes audio at just 0.8 kbps - perfect for researchers and builders who need something that *just works* for training high-quality text-to-speech models.
+
+ # Key Features
+
+ * 🔊 Low bit-rate compression - a speech codec that compresses and reconstructs audio with near-inaudible reconstruction loss
+ <br>
+ * 🎼 Upsamples from 16kHz → 24kHz
+ <br>
+ * 🌍 Ready for real-world use - train your own SpeechLMs without needing to build your own codec
+ <br>
+ * 🏢 Commercial use permitted - use it in your own tools or products
+ <br>
+ * 📊 Released with large pre-encoded datasets - we’ve compressed Emilia-YODAS from 1.7TB to 41GB using NeuCodec, significantly reducing the compute required for training
+ <br>
+
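The storage figures in the list above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and counts only the raw code stream at 0.8 kbps, ignoring any metadata or container overhead in the released datasets. The function names are hypothetical and not part of the `neucodec` package.

```python
# Back-of-the-envelope storage arithmetic (illustrative only).
TB = 10**12  # assumption: decimal terabyte
GB = 10**9   # assumption: decimal gigabyte


def compression_ratio(original_bytes: float, encoded_bytes: float) -> float:
    """Ratio between the original and the pre-encoded dataset size."""
    return original_bytes / encoded_bytes


def code_bytes_per_hour(bitrate_bps: int = 800) -> float:
    """Raw bytes needed to store one hour of codes at the given bit-rate."""
    seconds_per_hour = 3600
    return bitrate_bps * seconds_per_hour / 8


ratio = compression_ratio(1.7 * TB, 41 * GB)
print(f"Emilia-YODAS size reduction: about {ratio:.0f}x")
print(f"Raw codes per hour of audio: {code_bytes_per_hour() / 1000:.0f} kB")
```

The actual on-disk size of a pre-encoded dataset will differ from the raw-code estimate, since it also stores text, metadata, and file-format overhead.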
+ # Model Details
+
+ NeuCodec is a Finite Scalar Quantisation (FSQ) based 0.8kbps audio codec for speech tokenization.
+ It has the following properties:
+
+ * FSQ quantisation produces a single codebook, making it ideal for downstream modeling with Speech Language Models.
+ * Trained on CC-licensed data, so there are no non-commercial data restrictions.
+ * At 50 tokens/sec and 16 bits per token, the overall bit-rate is 0.8kbps.
+ * The codec takes 16kHz input and outputs 24kHz audio using an upsampling decoder.
+ * The FSQ encoding scheme provides bit-level error resistance, making it suitable for unreliable and noisy channels.
+
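The bit-rate quoted in the list above follows directly from the token rate and the token width. A minimal sketch of that arithmetic is shown here; the constant and function names are hypothetical, and the vocabulary figure is only the upper bound implied by 16 bits per token, not a statement about how the FSQ levels are actually configured.

```python
# Deriving the 0.8 kbps figure from the numbers stated in the README.
TOKENS_PER_SECOND = 50  # from the README
BITS_PER_TOKEN = 16     # from the README


def bitrate_kbps(tokens_per_second: int = TOKENS_PER_SECOND,
                 bits_per_token: int = BITS_PER_TOKEN) -> float:
    """Bit-rate in kilobits per second (1 kbps = 1000 bits/s)."""
    return tokens_per_second * bits_per_token / 1000


def max_vocabulary_size(bits_per_token: int = BITS_PER_TOKEN) -> int:
    """Upper bound on distinct codes representable in the given bit width."""
    return 2 ** bits_per_token


print(f"Bit-rate: {bitrate_kbps()} kbps")
print(f"At most {max_vocabulary_size()} distinct codes in a single codebook")
```

A single codebook of this size is what lets a speech language model treat each audio frame as one token, rather than predicting several parallel codebook streams.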
+ NeuCodec largely builds on [X-Codec2.0](https://huggingface.co/HKUSTAudio/xcodec2).
+
+ - **Developed by:** Neuphonic
+ - **Model type:** Neural Audio Codec
+ - **License:** apache-2.0
+ - **Repository:** https://github.com/neuphonic/neucodec
+ - **Paper:** [arXiv](https://arxiv.org/abs/2509.09550)
+ - **Pre-encoded Datasets:**
+   - [Emilia-YODAS-EN](https://huggingface.co/datasets/neuphonic/emilia-yodas-english-neucodec)
+   - *More coming soon!*
+
+ # Get Started
+
+ Use the code below to get started with the model.
+
+ To install from PyPI in a dedicated environment, using Python 3.10 or above:
+
+ ```bash
+ conda create -n neucodec python=3.10
+ conda activate neucodec
+ pip install neucodec
+ ```
+ Then, to use in Python:
+
+ ```python
+ import librosa
+ import torch
+ import torchaudio
+ from torchaudio import transforms as T
+ from neucodec import NeuCodec
+
+ model = NeuCodec.from_pretrained("neuphonic/neucodec")
+ model.eval().cuda()
+
+ y, sr = torchaudio.load(librosa.ex("libri1"))
+ if sr != 16_000:
+     y = T.Resample(sr, 16_000)(y)  # resample to the 16 kHz input rate
+ y = y[None, ...]  # add the batch dimension: (B, 1, T_16)
+
+ with torch.no_grad():
+     fsq_codes = model.encode_code(y)
+     # fsq_codes = model.encode_code(librosa.ex("libri1")) # or directly pass your filepath!
+     print(f"Codes shape: {fsq_codes.shape}")
+     recon = model.decode_code(fsq_codes).cpu()  # (B, 1, T_24)
+
+ torchaudio.save("reconstructed.wav", recon[0, :, :], 24_000)
+ ```
+
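When checking the printed code shape and the length of the reconstruction, it can help to know roughly what to expect. The helper below is a hypothetical sketch based only on the rates stated in this README (16 kHz in, 50 tokens/sec, 24 kHz out). The real model may pad or frame the signal differently, so treat the results as approximate.

```python
# Rough expected sizes for a clip, derived from the stated rates only.
def expected_lengths(num_input_samples: int,
                     input_sr: int = 16_000,
                     output_sr: int = 24_000,
                     tokens_per_second: int = 50) -> tuple[int, int]:
    """Return (approx. number of tokens, approx. number of output samples)."""
    duration_s = num_input_samples / input_sr
    num_tokens = round(duration_s * tokens_per_second)
    num_output_samples = round(duration_s * output_sr)
    return num_tokens, num_output_samples


# A 3-second clip at 16 kHz has 48,000 input samples.
tokens, out_samples = expected_lengths(48_000)
print(f"~{tokens} tokens, ~{out_samples} samples at 24 kHz")
```

If the shapes you observe differ from these estimates by more than a few frames, the input is probably not at 16 kHz or is missing the batch dimension.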
+ # Training Details
+
+ The model was trained using the following data:
+ * Emilia-YODAS
+ * MLS
+ * LibriTTS
+ * Fleurs
+ * CommonVoice
+ * HUI
+ * Additional proprietary set
+
+ All publicly available data is licensed under either CC-BY-4.0 or CC0.