How to run NeMo Forced Aligner on the IndicConformer ASR model?
Hi, the BhasaAnuvaadh paper specifically mentions using NeMo Forced Aligner for speech alignment.
I tried to recreate that setup precisely, using the AI4Bharat NeMo version on GitHub and passing the IndicConformer NeMo model to the align.py script.
Model used: indicconformer_stt_multi_hybrid_rnnt_600m.nemo
# aligner script.
python NeMo/tools/nemo_forced_aligner/align.py \
pretrained_name=null \
model_path="./indicconformer_stt_multi_hybrid_rnnt_600m.nemo" \
manifest_filepath="hindi_manifest.json" \
output_dir="alignment_outputs" \
language_id=hi \
save_output_file_formats=["ctm"] \
ctm_file_config.remove_blank_tokens=True
The script runs to completion without crashing, but it logs the following warnings near the end:
Transcribing: 0%| | 0/1 [00:00<?, ?it/s][NeMo W 2025-09-04 05:49:30 nemo_logging:349] /mnt/raid_drive/drive_data/drive_2/timestamp_decoding/ai4bharat-Nemo/NeMo/nemo/collections/asr/parts/preprocessing/features.py:417: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
Transcribing: 100%|██████████| 1/1 [00:00<00:00, 2.35it/s]
[NeMo W 2025-09-04 05:49:31 data_prep:287] CTC decoder vocabulary size (5632) doesn't match CTC output size (257). Using character-based tokenization instead.
[NeMo W 2025-09-04 05:49:31 data_prep:725] Model doesn't support character-based tokenization. Using word-based alignment with safe tokenization.
[NeMo I 2025-09-04 05:49:31 data_prep:1064] Calculated that the model downsample factor is 8 and therefore the ASR model output timestep duration is 0.08 -- will use this for all batches
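The numbers in these log lines can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions that the log does not confirm: that the feature window stride is 0.01 s (inferred from 0.08 / 8), and, as one possible reading of the vocabulary warning, that the multilingual tokenizer is an aggregate of equally sized per-language vocabularies while the CTC output covers a single language plus one blank token.

```python
# Values copied from the log output above.
tokenizer_vocab_size = 5632   # "CTC decoder vocabulary size"
ctc_output_size = 257         # "CTC output size"
downsample_factor = 8
timestep_duration = 0.08      # seconds per ASR output frame

# Assumption (not confirmed by the log): the CTC output is one
# language's tokens plus a single blank token.
per_language_vocab = ctc_output_size - 1

# If the tokenizer is an aggregate of equally sized vocabularies,
# the total should divide evenly by the per-language size.
num_languages, remainder = divmod(tokenizer_vocab_size, per_language_vocab)

# Assumption: timestep duration = window stride * downsample factor.
window_stride = round(timestep_duration / downsample_factor, 6)

print(f"per-language vocab: {per_language_vocab}")
print(f"implied vocab count: {num_languages} (remainder {remainder})")
print(f"implied window stride: {window_stride} s")
```

The division comes out even (22 with no remainder), so the size mismatch may come from comparing the full aggregate vocabulary against a single language's output slice rather than from a corrupted tokenizer. This is only a hypothesis and would need to be checked against the model config.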
On investigating further, I noticed issues with the tokenizer, which result in very poorly aligned transcriptions. When I manually inspected the output by rendering the CTM files as subtitles on a video, the entire transcription was badly misaligned.
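To make the misalignment easier to quantify than by eye, a small script can flag words with implausible durations in a CTM file. This is a minimal sketch, assuming the usual CTM column order (utterance id, channel, start, duration, token); the names `CtmWord`, `parse_ctm`, and `find_suspicious`, the thresholds, and the sample lines are all hypothetical and not taken from the actual output.

```python
from dataclasses import dataclass


@dataclass
class CtmWord:
    utt_id: str
    start: float      # seconds
    duration: float   # seconds
    text: str


def parse_ctm(lines):
    """Parse CTM lines, assuming: utt_id channel start duration token."""
    words = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        utt_id, _channel, start, duration, text = parts[:5]
        words.append(CtmWord(utt_id, float(start), float(duration), text))
    return words


def find_suspicious(words, min_duration=0.02, max_duration=3.0):
    """Return words whose duration falls outside the given bounds."""
    return [w for w in words
            if w.duration < min_duration or w.duration > max_duration]


# Made-up sample lines with romanized placeholder words.
sample = [
    "utt1 1 0.00 0.40 namaste",
    "utt1 1 0.40 0.00 aap",
    "utt1 1 0.40 7.52 kaise",
]

words = parse_ctm(sample)
flagged = find_suspicious(words)
for w in flagged:
    print(f"{w.utt_id}: '{w.text}' start={w.start:.2f}s dur={w.duration:.2f}s")
```

For a real file, the lines would come from something like `open(path, encoding="utf-8")`. Many zero-length or multi-second words in one utterance would be consistent with the tokenizer fallback described in the warnings.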
Can you share the .json you passed?
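For context, NeMo Forced Aligner reads a manifest with one JSON object per line, and each entry normally needs an `audio_filepath` and a `text` field (unless alignment is run on predicted text). Below is a minimal sketch for checking that before sharing the file; the function name `check_manifest_lines` and the sample entries are hypothetical.

```python
import json

REQUIRED_KEYS = ("audio_filepath", "text")


def check_manifest_lines(lines):
    """Return a list of (line_number, problem) tuples for a JSON-lines manifest."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((number, "entry is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


# Made-up sample entries for illustration only.
sample_manifest = [
    '{"audio_filepath": "audio/sample_1.wav", "text": "placeholder transcript"}',
    '{"audio_filepath": "audio/sample_2.wav"}',
    '{"audio_filepath": "audio/sample_3.wav", "text": ',
]

problems = check_manifest_lines(sample_manifest)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To run it on the real file, pass the lines of `hindi_manifest.json` read with `encoding="utf-8"`. It only checks structure, not whether the transcripts match the audio.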