How to run NeMo Forced Aligner on the IndicConformer ASR model?

#5
by StephennFernandes - opened

Hi, I'm following the BhasaAnuvaad paper, where you specifically mention using NeMo Forced Aligner (NFA) for speech alignment.

I tried to reproduce that setup exactly, using the AI4Bharat version of NeMo on GitHub and passing the IndicConformer .nemo model to the align.py script.

Model used: indicconformer_stt_multi_hybrid_rnnt_600m.nemo

# aligner script. 

python NeMo/tools/nemo_forced_aligner/align.py \
    pretrained_name=null \
    model_path="./indicconformer_stt_multi_hybrid_rnnt_600m.nemo" \
    manifest_filepath="hindi_manifest.json" \
    output_dir="alignment_outputs" \
    language_id=hi \
    save_output_file_formats=["ctm"] \
    ctm_file_config.remove_blank_tokens=True
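
For reference, here is a minimal sketch of how a manifest such as hindi_manifest.json could be produced. It assumes the usual NeMo JSON-lines layout with one object per line carrying "audio_filepath" and "text" keys; whether the AI4Bharat fork expects additional fields is not confirmed here. The helper name write_manifest, the audio path, and the transcript are hypothetical placeholders.

```python
import json


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSON-lines), one utterance per object."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, text in entries:
            record = {"audio_filepath": str(audio_filepath), "text": text}
            # ensure_ascii=False keeps Devanagari readable instead of \u escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical entry: replace the path and reference transcript with real data.
entries = [("audio/sample_0001.wav", "नमस्ते दुनिया")]
write_manifest(entries, "hindi_manifest.json")
```

The reference text should match what is spoken in the audio, since forced alignment aligns the given text and does not correct it.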

The script runs to completion without crashing, but it prints these warnings near the end:

Transcribing:   0%|                                                                                     | 0/1 [00:00<?, ?it/s][NeMo W 2025-09-04 05:49:30 nemo_logging:349] /mnt/raid_drive/drive_data/drive_2/timestamp_decoding/ai4bharat-Nemo/NeMo/nemo/collections/asr/parts/preprocessing/features.py:417: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
      with torch.cuda.amp.autocast(enabled=False):
    
Transcribing: 100%|█████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.35it/s]
[NeMo W 2025-09-04 05:49:31 data_prep:287] CTC decoder vocabulary size (5632) doesn't match CTC output size (257). Using character-based tokenization instead.
[NeMo W 2025-09-04 05:49:31 data_prep:725] Model doesn't support character-based tokenization. Using word-based alignment with safe tokenization.
[NeMo I 2025-09-04 05:49:31 data_prep:1064] Calculated that the model downsample factor is 8 and therefore the ASR model output timestep duration is 0.08 -- will use this for all batches
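
The numbers in these log lines can be sanity-checked with a little arithmetic. The sketch below assumes a 10 ms feature hop (window stride), which is a common NeMo preprocessor setting but is not stated in the log. It also notes that 5632 = 22 x 256 and 257 = 256 + 1, which would be consistent with an aggregate tokenizer made of 22 per-language vocabularies of 256 tokens each, and a CTC head that emits one language's 256 tokens plus a blank. That reading is an interpretation of the warning, not something the log confirms.

```python
# Values taken from the log above.
DOWNSAMPLE_FACTOR = 8
CTC_VOCAB_SIZE = 5632
CTC_OUTPUT_SIZE = 257

# Assumption: 10 ms hop between feature frames (not shown in the log).
WINDOW_STRIDE_S = 0.01

# Duration covered by one model output frame.
timestep_s = WINDOW_STRIDE_S * DOWNSAMPLE_FACTOR


def frame_to_seconds(frame_index, step=timestep_s):
    """Convert a model output frame index to a start time in seconds."""
    return round(frame_index * step, 2)


# Hypothetical per-language vocabulary size implied by the output size (minus blank).
per_language_vocab = CTC_OUTPUT_SIZE - 1
# Number of equal-sized sub-vocabularies that would add up to the full vocabulary.
implied_languages, remainder = divmod(CTC_VOCAB_SIZE, per_language_vocab)

print(f"timestep: {timestep_s:.2f} s")
print(f"frame 25 starts at {frame_to_seconds(25)} s")
print(f"{CTC_VOCAB_SIZE} = {implied_languages} x {per_language_vocab} (remainder {remainder})")
```

If that interpretation holds, the alignment text would need to be tokenized with the per-language sub-vocabulary that matches the 257 outputs, instead of falling back to word-based tokenization.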

On investigating further, I noticed issues with the tokenizer, which result in very badly aligned transcriptions. I checked this manually by rendering the CTM files as subtitles on a video: entire transcriptions are severely misaligned.
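
The manual check described above can be sketched as a small CTM-to-SRT conversion, so the alignments can be loaded as subtitles in a video player. This assumes each CTM line starts with the fields utterance id, channel, start time, duration, and token, and ignores any extra columns. The function names and the sample CTM lines are hypothetical.

```python
def parse_ctm_line(line):
    """Parse the first five whitespace-separated CTM fields into a cue."""
    utt_id, _channel, start, duration, token = line.split()[:5]
    start_s = float(start)
    return {
        "utt_id": utt_id,
        "start": start_s,
        "end": start_s + float(duration),
        "text": token,
    }


def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def ctm_to_srt(ctm_text):
    """Turn CTM text into SRT text, one numbered cue per CTM line."""
    cues = [parse_ctm_line(l) for l in ctm_text.splitlines() if l.strip()]
    blocks = []
    for index, cue in enumerate(cues, start=1):
        span = f"{srt_timestamp(cue['start'])} --> {srt_timestamp(cue['end'])}"
        blocks.append(f"{index}\n{span}\n{cue['text']}\n")
    return "\n".join(blocks)


# Hypothetical CTM content; real files come from the alignment_outputs directory.
sample_ctm = "utt1 1 0.48 0.32 नमस्ते\nutt1 1 0.88 0.40 दुनिया\n"
print(ctm_to_srt(sample_ctm))
```

Comparing word-level cues this way makes it easier to tell a constant offset from alignments that drift or collapse.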

Can you share the manifest .json you passed?
