.. _parallelisms:

Parallelisms
============

NeMo uses native PyTorch parallelism primitives for distributed training, enabling efficient multi-GPU and multi-node
model training for Speech AI workloads.

DDP (all collections)
---------------------

Distributed Data Parallelism (DDP) is the default strategy for all NeMo collections (ASR, TTS, Audio, SpeechLM2).
It replicates the entire model on every GPU, runs each GPU on a different data shard, and synchronizes
parameter gradients via all-reduce after each backward pass.

**When to use:** DDP works well when the full model fits in a single GPU's memory.
This covers the vast majority of ASR, TTS, and Audio training workloads.
DDP is enabled by default in NeMo. You can configure it explicitly in YAML:

.. code-block:: yaml

    trainer:
      strategy:
        _target_: lightning.pytorch.strategies.DDPStrategy
        gradient_as_bucket_view: true
        find_unused_parameters: true

Or in Python:

.. code-block:: python

    import lightning.pytorch as pl
    from lightning.pytorch.strategies import DDPStrategy

    trainer = pl.Trainer(
        strategy=DDPStrategy(gradient_as_bucket_view=True, find_unused_parameters=True),
        devices=8,
        accelerator="gpu",
    )
ModelParallelStrategy (SpeechLM2)
---------------------------------

For SpeechLM2 models (e.g. SALM / Canary-Qwen), the backbone LLM can be too large for a single GPU.
PyTorch Lightning's ``ModelParallelStrategy`` enables FSDP2, Tensor Parallelism (TP), and
Sequence Parallelism (SP) using PyTorch-native DTensor.

**When to use:** When training or fine-tuning SpeechLM2 models whose LLM backbone does not fit
in a single GPU's memory, or when you want to scale training to many GPUs more efficiently
than DDP allows.

**Requirements:** Each model must implement a ``configure_model()`` method that defines how its
layers are sharded (FSDP2) and parallelized (TP / SP). The SpeechLM2 models (SALM, DuplexEARTTS)
already implement this. You cannot simply switch an arbitrary model from DDP to
``ModelParallelStrategy`` without providing this implementation.
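For orientation, here is a minimal sketch of what such a ``configure_model()`` implementation
can look like. The class and submodule names (``MySpeechLM``, ``self.llm``) are illustrative
placeholders, not the actual SpeechLM2 code:

.. code-block:: python

    import lightning.pytorch as pl
    # PyTorch >= 2.6 exposes fully_shard here; on older versions it lives in
    # torch.distributed._composable.fsdp.
    from torch.distributed.fsdp import fully_shard


    class MySpeechLM(pl.LightningModule):
        def configure_model(self):
            # ModelParallelStrategy exposes a DeviceMesh with named
            # "data_parallel" and "tensor_parallel" dimensions.
            dp_mesh = self.device_mesh["data_parallel"]
            tp_mesh = self.device_mesh["tensor_parallel"]

            if tp_mesh.size() > 1:
                # Apply a TP sharding plan first (see the TP sketch below).
                ...

            # FSDP2: shard each transformer block, then the root module,
            # across the data-parallel mesh dimension.
            for block in self.llm.layers:
                fully_shard(block, mesh=dp_mesh)
            fully_shard(self.llm, mesh=dp_mesh)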
Concepts
^^^^^^^^

**FSDP2 (Fully Sharded Data Parallelism):**
Shards model parameters, gradients, and optimizer states across GPUs in the data-parallel
dimension. Dramatically reduces per-GPU memory -- enabling training of models that would not
fit with DDP. Controlled via the ``data_parallel_size`` argument.

**Tensor Parallelism (TP):**
Splits individual weight matrices across GPUs. For example, a large linear layer's weight
is partitioned column-wise or row-wise so each GPU holds only a slice. Controlled via the
``tensor_parallel_size`` argument. The model must define a TP sharding plan (which layers
are split and how). SpeechLM2 models automatically use the HuggingFace TP plan for the
backbone LLM when available.

**Sequence Parallelism (SP):**
Distributes activation memory along the sequence dimension across the TP group.
SP is typically enabled alongside TP and reduces activation memory further.
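As a rough illustration of what a hand-written TP / SP plan looks like with PyTorch-native
DTensor (the submodule names below are hypothetical; in practice SpeechLM2 reuses the
HuggingFace plan for the backbone LLM):

.. code-block:: python

    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        SequenceParallel,
        parallelize_module,
    )


    def apply_tp_plan(block, tp_mesh):
        # Column-wise layers fan activations out across the TP group and
        # row-wise layers reduce them back in, so each GPU holds only a
        # slice of every large weight matrix. SequenceParallel shards the
        # norm's activations along the sequence dimension.
        plan = {
            "self_attn.q_proj": ColwiseParallel(),
            "self_attn.k_proj": ColwiseParallel(),
            "self_attn.v_proj": ColwiseParallel(),
            "self_attn.o_proj": RowwiseParallel(),
            "mlp.up_proj": ColwiseParallel(),
            "mlp.down_proj": RowwiseParallel(),
            "input_layernorm": SequenceParallel(),
        }
        parallelize_module(block, tp_mesh, plan)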
Configuration
^^^^^^^^^^^^^

To enable ``ModelParallelStrategy`` for SpeechLM2, replace the DDP strategy block in the
trainer config. The product of ``data_parallel_size`` and ``tensor_parallel_size`` must equal
the total number of GPUs (``devices * num_nodes``).
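For example, on a single node with 8 GPUs the valid (``data_parallel_size``,
``tensor_parallel_size``) pairs are (8, 1), (4, 2), (2, 4), and (1, 8). A quick sanity check,
using the same values as the config below:

.. code-block:: python

    devices, num_nodes = 8, 1
    data_parallel_size, tensor_parallel_size = 4, 2
    # DP x TP must exactly cover the world size.
    assert data_parallel_size * tensor_parallel_size == devices * num_nodes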
In YAML (with Hydra):

.. code-block:: yaml

    trainer:
      devices: 8
      num_nodes: 1
      accelerator: gpu
      precision: bf16-true
      strategy:
        _target_: lightning.pytorch.strategies.ModelParallelStrategy
        data_parallel_size: 4    # FSDP2: shard across 4 GPUs
        tensor_parallel_size: 2  # TP: split layers across 2 GPUs
In Python:

.. code-block:: python

    import lightning.pytorch as pl
    from lightning.pytorch.strategies import ModelParallelStrategy

    trainer = pl.Trainer(
        strategy=ModelParallelStrategy(
            data_parallel_size=4,
            tensor_parallel_size=2,
        ),
        devices=8,
        accelerator="gpu",
        precision="bf16-true",
        use_distributed_sampler=False,
    )
.. note::

   When using ``ModelParallelStrategy``, set ``use_distributed_sampler=False`` in the trainer.
   NeMo's data modules handle distributed sampling internally.
Example: SALM with FSDP2 only (no TP)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest ``ModelParallelStrategy`` setup uses FSDP2 alone. This requires no TP plan
and works when individual layers fit in GPU memory:

.. code-block:: yaml

    trainer:
      devices: 8
      strategy:
        _target_: lightning.pytorch.strategies.ModelParallelStrategy
        data_parallel_size: 8
        tensor_parallel_size: 1
Example: SALM with TP + FSDP2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For larger LLM backbones, combine TP with FSDP2. Here, 2-way TP splits each layer across
2 GPUs within a node, and 4-way FSDP2 shards the model across 4 such groups:

.. code-block:: yaml

    trainer:
      devices: 8
      strategy:
        _target_: lightning.pytorch.strategies.ModelParallelStrategy
        data_parallel_size: 4
        tensor_parallel_size: 2

See the SpeechLM2 example configs in ``examples/speechlm2/conf/`` for complete training
configurations including data and optimizer settings.