| # XY Tokenizer | |
| XY Tokenizer is a speech codec that jointly models the semantic and acoustic aspects of speech, converting audio into discrete tokens and decoding them back into high-quality audio. It represents speech at only 1kbps by using 8-layer residual vector quantization (RVQ8) at a 12.5Hz frame rate. | |
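The 1kbps figure can be checked with a little arithmetic. The sketch below takes the 12.5Hz frame rate and the 8 quantizers from the description above; the codebook size of 1024 entries (10 bits per code) is an assumption that is not stated in this README, chosen because it is the size that reproduces 1kbps exactly.

```python
import math


def rvq_bitrate_bps(frame_rate_hz: float, num_quantizers: int, codebook_size: int) -> float:
    """Bits per second of an RVQ codec: frames/s * codes/frame * bits/code."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_quantizers * bits_per_code


# 12.5 Hz and 8 quantizers come from the README text.
# ASSUMPTION: 1024-entry codebooks (10 bits each); not stated in the README.
bitrate = rvq_bitrate_bps(frame_rate_hz=12.5, num_quantizers=8, codebook_size=1024)
tokens_per_second = 12.5 * 8

print(bitrate)            # 1000.0 bits per second, i.e. 1kbps
print(tokens_per_second)  # 100.0 discrete tokens per second of audio
```

Under that assumption, one second of speech becomes 100 tokens, so a 30-second clip would be about 3,000 tokens.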
| ## Features | |
| - **Dual-channel modeling**: Simultaneously captures semantic meaning and acoustic details | |
| - **Efficient representation**: 1kbps bitrate with RVQ8 quantization at 12.5Hz | |
| - **High-quality audio tokenization**: Convert speech to discrete tokens and back with minimal quality loss | |
| - **Long audio support**: Process audio files longer than 30 seconds using chunking with overlap | |
| - **Batch processing**: Efficiently process multiple audio files in batches | |
| - **24kHz output**: Reconstructed audio is generated at a 24kHz sample rate | |
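The long-audio feature above mentions chunking with overlap but does not give the chunk length, the overlap, or how chunks are merged. As a minimal sketch of the general idea only, the hypothetical helper `chunk_spans` below computes overlapping sample ranges that cover a signal. The 16kHz input rate, 30-second chunk, and 2-second overlap are illustrative assumptions, not values taken from this project.

```python
def chunk_spans(num_samples: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks covering the signal."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if num_samples <= 0:
        return []
    hop = chunk_size - overlap  # distance between consecutive chunk starts
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, num_samples)
        spans.append((start, end))
        if end >= num_samples:  # the last chunk reached the end of the signal
            break
        start += hop
    return spans


# ASSUMPTIONS for illustration: 16 kHz input, 30 s chunks, 2 s overlap.
sample_rate = 16_000
spans = chunk_spans(
    num_samples=70 * sample_rate,
    chunk_size=30 * sample_rate,
    overlap=2 * sample_rate,
)
print([(s / sample_rate, e / sample_rate) for s, e in spans])
# [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```

Each chunk can then be tokenized independently. How the overlapping regions are reconciled (for example, by discarding or cross-fading the shared frames) is a design choice this sketch leaves open.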
| ## Installation | |
| ```bash | |
| # Create and activate conda environment | |
| conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| ``` | |
| ## Usage | |
| ### Basic Inference | |
| To tokenize audio files and reconstruct them: | |
| ```bash | |
| python inference.py \ | |
| --config_path ./config/xy_tokenizer_config.yaml \ | |
| --checkpoint_path ./weights/xy_tokenizer.ckpt \ | |
| --input_dir ./input_wavs/ \ | |
| --output_dir ./output_wavs/ | |
| ``` | |
| ### Parameters | |
| - `--config_path`: Path to the model configuration file | |
| - `--checkpoint_path`: Path to the pre-trained model checkpoint | |
| - `--input_dir`: Directory containing input WAV files | |
| - `--output_dir`: Directory to save reconstructed audio files | |
| - `--device`: Device to run inference on (default: "cuda") | |
| - `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default) | |
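For readers who want to wrap or script the command, the flags above can be mirrored with `argparse`. This is a hypothetical reconstruction, not the contents of `inference.py`: the flag names and the `"cuda"` default come from the list above, while the required/optional split, the type of `--debug_port`, and treating `--debug` as a boolean switch are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags of inference.py."""
    parser = argparse.ArgumentParser(description="Tokenize and reconstruct WAV files")
    # ASSUMPTION: the four path arguments are required.
    parser.add_argument("--config_path", required=True, help="Model configuration file")
    parser.add_argument("--checkpoint_path", required=True, help="Pre-trained checkpoint")
    parser.add_argument("--input_dir", required=True, help="Directory of input WAV files")
    parser.add_argument("--output_dir", required=True, help="Directory for reconstructed audio")
    # Documented default.
    parser.add_argument("--device", default="cuda", help="Device to run inference on")
    # ASSUMPTION: --debug is an on/off switch; the README only says it is disabled by default.
    parser.add_argument("--debug", action="store_true", help="Enable debugging")
    parser.add_argument("--debug_ip", default=None, help="Debugger host (assumed string)")
    parser.add_argument("--debug_port", type=int, default=None, help="Debugger port (assumed int)")
    return parser


# Parse the same arguments used in the Basic Inference example, overriding the device.
args = build_parser().parse_args([
    "--config_path", "./config/xy_tokenizer_config.yaml",
    "--checkpoint_path", "./weights/xy_tokenizer.ckpt",
    "--input_dir", "./input_wavs/",
    "--output_dir", "./output_wavs/",
    "--device", "cpu",
])
print(args.device, args.debug)  # cpu False
```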
| ## Project Structure | |
| - `xy_tokenizer/`: Core model implementation | |
|   - `model.py`: Main XY_Tokenizer model class | |
|   - `nn/`: Neural network components | |
| - `config/`: Configuration files | |
| - `utils/`: Utility functions | |
| - `weights/`: Pre-trained model weights | |
| - `input_wavs/`: Directory for input audio files | |
| - `output_wavs/`: Directory for output audio files | |
| ## Model Architecture | |
| XY Tokenizer uses a dual-channel architecture that simultaneously models: | |
| 1. **Semantic Channel**: Captures high-level semantic information and linguistic content | |
| 2. **Acoustic Channel**: Preserves detailed acoustic features including speaker characteristics and prosody | |
| The model processes audio through several stages: | |
| 1. Feature extraction (mel-spectrogram) | |
| 2. Parallel semantic and acoustic encoding | |
| 3. Residual Vector Quantization with 8 quantizers (RVQ8) at a 12.5Hz frame rate (1kbps) | |
| 4. Decoding and waveform generation | |
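The quantization stage can be illustrated with a toy residual vector quantizer. This is a generic sketch of the RVQ technique, not the model's implementation: each stage picks the nearest codeword to what the previous stages failed to explain, and decoding sums the chosen codewords. The two small codebooks are invented for illustration, whereas the real model uses 8 learned codebooks.

```python
def nearest(codebook: list[list[float]], vector: list[float]) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to vector."""
    def sq_dist(code: list[float]) -> float:
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector: list[float], codebooks: list[list[list[float]]]) -> list[int]:
    """Quantize stage by stage, each stage coding the residual left by the previous ones."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices: list[int], codebooks: list[list[list[float]]]) -> list[float]:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Invented toy codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
indices = rvq_encode([1.2, 0.8], codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices)         # [1, 1]
print(reconstruction)  # [1.25, 0.75]
```

Each added stage shrinks the remaining error, which is why stacking 8 quantizers lets a low frame rate still carry fine acoustic detail.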
| ## License | |
| [Specify your license here] |