How to run NeMo Forced Aligner on the IndicConformer ASR model?

by StephennFernandes - opened Sep 4, 2025

Sep 4, 2025

Hi, the BhasaAnuvaadh paper specifically mentions using NeMo Forced Aligner (NFA) for speech alignment.

I tried to recreate that setup precisely, using the AI4Bharat NeMo fork on GitHub and passing the IndicConformer NeMo model to the align.py script.

model used: indicconformer_stt_multi_hybrid_rnnt_600m.nemo

# aligner script. 

python NeMo/tools/nemo_forced_aligner/align.py \
    pretrained_name=null \
    model_path="./indicconformer_stt_multi_hybrid_rnnt_600m.nemo" \
    manifest_filepath="hindi_manifest.json" \
    output_dir="alignment_outputs" \
    language_id=hi \
    save_output_file_formats=["ctm"] \
    ctm_file_config.remove_blank_tokens=True
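
Since the manifest is the only input besides the model, it may be worth sanity-checking it before running the command above. The following is a minimal sketch, assuming the usual NeMo manifest layout of one JSON object per line with `audio_filepath` and `text` keys. The helper name `check_manifest` is hypothetical and not part of NeMo.

```python
import json

# Assumption: NFA reads a JSON-lines manifest where each entry has these keys.
REQUIRED_KEYS = ("audio_filepath", "text")


def check_manifest(path):
    """Return a list of (line_number, problem) tuples for a JSON-lines manifest."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(entry, dict):
                problems.append((lineno, "entry is not a JSON object"))
                continue
            for key in REQUIRED_KEYS:
                # Treat a missing key and an empty or whitespace-only value alike.
                if not str(entry.get(key, "")).strip():
                    problems.append((lineno, f"missing or empty '{key}'"))
    return problems
```

Calling `check_manifest("hindi_manifest.json")` returns an empty list when every line parses and carries both fields. It does not check that the audio files exist or that the reference text matches the audio.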

The script runs to completion without crashing, but it prints the following warnings near the end:

Transcribing:   0%|                                                                                     | 0/1 [00:00<?, ?it/s]
[NeMo W 2025-09-04 05:49:30 nemo_logging:349] /mnt/raid_drive/drive_data/drive_2/timestamp_decoding/ai4bharat-Nemo/NeMo/nemo/collections/asr/parts/preprocessing/features.py:417: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
      with torch.cuda.amp.autocast(enabled=False):
Transcribing: 100%|█████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.35it/s]
[NeMo W 2025-09-04 05:49:31 data_prep:287] CTC decoder vocabulary size (5632) doesn't match CTC output size (257). Using character-based tokenization instead.
[NeMo W 2025-09-04 05:49:31 data_prep:725] Model doesn't support character-based tokenization. Using word-based alignment with safe tokenization.
[NeMo I 2025-09-04 05:49:31 data_prep:1064] Calculated that the model downsample factor is 8 and therefore the ASR model output timestep duration is 0.08 -- will use this for all batches
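
The numbers in these log lines can be checked with a little arithmetic. This is only a sketch: the 10 ms feature stride is inferred from the logged values (0.08 s / 8), and the reading of 5632 and 257 as "several 256-token vocabularies plus one blank" is an assumption based purely on the division coming out even, not something the log or the model card confirms here.

```python
# Values copied from the log lines above.
DOWNSAMPLE_FACTOR = 8
TIMESTEP_S = 0.08
DECODER_VOCAB_SIZE = 5632
CTC_OUTPUT_SIZE = 257

# Implied feature stride: 0.08 s per output frame / 8 input frames per output frame.
feature_stride_s = TIMESTEP_S / DOWNSAMPLE_FACTOR


def frame_to_seconds(frame_index, timestep_s=TIMESTEP_S):
    """Convert an ASR output frame index to a start time in seconds."""
    return frame_index * timestep_s


# Assumption: one CTC output is the blank symbol, the rest are real tokens.
tokens_per_output = CTC_OUTPUT_SIZE - 1
groups, remainder = divmod(DECODER_VOCAB_SIZE, tokens_per_output)

print(f"feature stride: {feature_stride_s:.3f} s")
print(f"frame 25 starts at {frame_to_seconds(25):.2f} s")
print(f"{DECODER_VOCAB_SIZE} = {groups} x {tokens_per_output} + {remainder}")
```

If the vocabulary really is a concatenation of per-language vocabularies, the aligner would need to select the slice for the requested language rather than fall back to another tokenization, which would fit the misalignment described below. That remains a guess until someone checks the model's tokenizer configuration.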

Investigating further, I found that the problem lies with the tokenizer, and it results in very poor alignments. When I rendered the CTM files as subtitles on a video and inspected them manually, entire transcriptions were badly misaligned.
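
For anyone who wants to repeat the subtitle check, here is a minimal sketch that converts CTM text to SRT cues. It assumes each CTM line starts with the fields `<utt_id> <channel> <start_s> <duration_s> <text>` and ignores anything after the fifth field; the function names are hypothetical and this is not the script used above.

```python
def parse_ctm_line(line):
    """Parse one CTM line into (utt_id, start_s, duration_s, text), or None."""
    parts = line.split()
    if len(parts) < 5:
        return None  # skip blank or malformed lines
    utt_id, _channel, start, duration, text = parts[:5]
    return utt_id, float(start), float(duration), text


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def ctm_to_srt(ctm_text):
    """Turn CTM text into SRT text with one cue per CTM line."""
    cues = []
    for line in ctm_text.splitlines():
        parsed = parse_ctm_line(line)
        if parsed is None:
            continue
        _utt_id, start, duration, text = parsed
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(start + duration)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

Reading a CTM file with `open(path, encoding="utf-8").read()`, passing it to `ctm_to_srt`, and saving the result as `.srt` gives a file most video players can load next to the audio, which makes timing drift easy to spot.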

guide-toGalaxy

Oct 3, 2025

Can you share the .json you passed?
