---
license: mit
language:
- zh
- en
tags:
- audio
- speech-alignment
- wav2vec2
- ctc
- forced-alignment
---
# 🌊 FlexAligner: Robust Speech-Text Alignment Framework

**FlexAligner** is a robust two-stage speech-text alignment framework designed specifically for "non-ideal" real-world speech data.

---
## 🌟 Key Features

- **Robustness to Mismatched Data**: Unlike traditional forced aligners such as MFA (Montreal Forced Aligner), FlexAligner automatically identifies and skips segments where the audio and the transcript do not match (e.g., laughter, prolonged background noise, or untranscribed words), so errors do not accumulate across the recording.
- **Two-Stage Architecture**:
    1. **Stage 1 (CTC Chunking)**: Macro-segmentation that locates reliable "speech islands."
    2. **Stage 2 (CE Alignment)**: Micro-alignment that uses dynamic hop calibration for sub-millisecond boundary regression.
- **Eliminating Temporal Drift**: A self-calibrating decoding algorithm for long recordings keeps phoneme boundaries at the end of the file strictly aligned with the audio samples.
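
The "speech islands" idea from Stage 1 can be illustrated with a small, self-contained sketch. This is not FlexAligner's implementation: the `WordSpan` type, the per-word `score` field, and the `min_score` and `max_gap` thresholds are all hypothetical names chosen for illustration. It only shows the general pattern of keeping runs of confidently aligned words and dropping everything between them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordSpan:
    """Hypothetical coarse alignment result for one transcript word."""
    word: str
    start: float   # seconds
    end: float     # seconds
    score: float   # assumed alignment confidence in [0, 1]


def find_islands(spans: List[WordSpan],
                 min_score: float = 0.5,
                 max_gap: float = 0.3) -> List[List[WordSpan]]:
    """Group consecutive reliable words into islands.

    A word whose score is below `min_score` is skipped and closes the
    current island. A silence longer than `max_gap` seconds between two
    reliable words also starts a new island.
    """
    islands: List[List[WordSpan]] = []
    current: List[WordSpan] = []
    for span in spans:
        if span.score < min_score:
            # Unreliable word (e.g. laughter or untranscribed speech): skip it.
            if current:
                islands.append(current)
                current = []
            continue
        if current and span.start - current[-1].end > max_gap:
            # Long gap between reliable words: close the island.
            islands.append(current)
            current = []
        current.append(span)
    if current:
        islands.append(current)
    return islands
```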
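
Why a calibrated hop matters can be seen from frame arithmetic alone. The sketch below is not FlexAligner's decoding algorithm. It assumes the convolutional encoder layout of the standard Wav2Vec 2.0 base model (nominal hop of 320 samples) and a 16 kHz sample rate, and it shows one simple calibration that is common in forced-alignment code: spread the frames over the true sample count instead of multiplying frame indices by the nominal hop. The function names are illustrative.

```python
# (kernel, stride) of each conv layer in the standard Wav2Vec 2.0 feature encoder.
# Assumption: the chunking model uses this unmodified layout.
W2V2_CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=W2V2_CONV_LAYERS):
    """Number of encoder output frames for a waveform of `num_samples` samples."""
    n = num_samples
    for kernel, stride in layers:
        n = (n - kernel) // stride + 1
    if n <= 0:
        raise ValueError("waveform is too short to produce a single frame")
    return n


def frame_boundaries_sec(num_samples, sample_rate=16000, dynamic=True):
    """Boundary times (seconds) for frames 0..N, where N is the frame count.

    dynamic=False: index * nominal hop (320 samples), which stops short of
                   the true end of the file.
    dynamic=True:  index * (num_samples / N), so the last boundary lands on
                   the final sample.
    """
    frames = num_frames(num_samples)
    if dynamic:
        return [i * num_samples / (frames * sample_rate) for i in range(frames + 1)]
    nominal_hop = 1
    for _, stride in W2V2_CONV_LAYERS:
        nominal_hop *= stride
    return [i * nominal_hop / sample_rate for i in range(frames + 1)]
```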
---
## 📦 Model Components

This repository contains the weights for the two core components of FlexAligner:

- `hf_phs/`: Wav2Vec 2.0-based CTC chunking model.
- `ce2/`: High-precision frame-level cross-entropy (CE) alignment model.
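
If you want to pre-fetch only these two folders (for example, for offline use), a minimal sketch with `huggingface_hub` might look like the following. It assumes the folders sit at the repository root under the names listed above and uses the repo ID from the Python API example; `component_patterns` and `fetch_weights` are illustrative helpers, not part of FlexAligner.

```python
REPO_ID = "USTCPhonetics/FlexAligner"  # repo ID as given in the Python API example
COMPONENTS = ("hf_phs", "ce2")         # folder names as listed above


def component_patterns(components=COMPONENTS):
    """Glob patterns that select every file under each component folder."""
    return [f"{name}/*" for name in components]


def fetch_weights(repo_id=REPO_ID, components=COMPONENTS):
    """Download the selected folders and return the local snapshot path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=component_patterns(components),
    )
```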
---
## 🚀 Quick Start

### CLI Usage

After installation, the CLI can load the hosted models from this repository directly:

```bash
flex-align input.wav transcript.txt --dynamic -o output.TextGrid
```
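
To align a whole folder, the command above can be wrapped in a short script. This is a sketch under assumptions: it uses only the arguments shown in the example (`--dynamic` and `-o`), expects each `name.wav` to have a matching `name.txt`, and the helper names are illustrative.

```python
import subprocess
from pathlib import Path


def build_command(wav, transcript, out):
    """Build the flex-align call shown above for one audio/transcript pair."""
    return ["flex-align", str(wav), str(transcript), "--dynamic", "-o", str(out)]


def align_folder(folder):
    """Run flex-align on every .wav in `folder` that has a matching .txt file."""
    outputs = []
    for wav in sorted(Path(folder).glob("*.wav")):
        transcript = wav.with_suffix(".txt")
        if not transcript.exists():
            continue  # no transcript for this recording, skip it
        out = wav.with_suffix(".TextGrid")
        subprocess.run(build_command(wav, transcript, out), check=True)
        outputs.append(out)
    return outputs
```

With `check=True`, the script stops at the first file for which `flex-align` exits with an error, which is usually what you want in a batch run.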
### Python API

Or integrate it into your Python pipeline:

```python
from flexaligner import FlexAligner

# Use the Hugging Face repo ID to automatically download and load the weights
aligner = FlexAligner(config={
    "chunk_model_path": "USTCPhonetics/FlexAligner",
    "use_dynamic_hop": True
})

# Arguments: input audio, transcript, output TextGrid
aligner.align("test.wav", "test.txt", "result.TextGrid")
```
---
## 📜 License

This project is licensed under the MIT License. Feel free to use it in academic research or commercial projects.

---