CANARY-0.7B-ko

CANARY-0.7B-ko is a 0.7B-parameter FastConformer sequence-to-sequence model packaged as a single .nemo checkpoint. It delivers large-vocabulary Korean ASR with optional inverse text normalization (ITN) and punctuation & casing (PnC) control via prompts.

Model Highlights

  • FastConformer encoder (24 layers) with Transformer decoder (8 layers).
  • Prompt-driven toggles for ITN and punctuation & casing (PnC) to adapt results at inference time.

Model Details

Item               Value
-----------------  -------------------------------------------------------
Architecture       FastConformer encoder (x24) + Transformer decoder (x8)
Parameters         ~0.7B
Languages          Korean (ko-KR)
Domain             General speech, conversational data, lectures
Checkpoint format  NVIDIA NeMo .nemo archive
File               canary-0.7b-ko.nemo

  • Input: 16 kHz mono PCM audio (wav/flac); a conversion sketch follows this list.
  • Output: UTF-8 Korean transcripts optionally normalized according to the prompt controls.
  • Dependencies: nemo_toolkit[asr], torch, plus huggingface_hub if downloading from the Hub.
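
If your recordings are not already 16 kHz mono, here is a minimal conversion sketch using torchaudio (installed alongside torch in the Usage section below); the file names are placeholders:

import torchaudio
import torchaudio.functional as F

# Load the original recording (any sample rate, possibly multi-channel).
waveform, sr = torchaudio.load("input.wav")

# Downmix to mono by averaging channels, then resample to 16 kHz.
mono = waveform.mean(dim=0, keepdim=True)
resampled = F.resample(mono, orig_freq=sr, new_freq=16000)

# Save as 16 kHz mono PCM WAV, ready for the manifest.
torchaudio.save("audio1.wav", resampled, sample_rate=16000)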

Promptable Controls (ITN & PnC)

  • pnc: set to "True" to enable punctuation and casing restoration, or "False" to keep minimal formatting.
  • itn: set to "True" to enable ITN for dates, numbers, and currency, or "False" for verbatim text.
  • lang, source_lang, target_lang: keep as "ko" to enforce Korean input/output, or adjust when integrating into multilingual workflows.

The controls are carried via the manifest (JSONL) used for transcription, making it easy to switch behavior per utterance without reloading the checkpoint.
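
A minimal sketch for generating such a manifest programmatically; the audio paths and flag values below are placeholders:

import json

# Hypothetical utterance list: (audio path, pnc flag, itn flag).
utterances = [
    ("audio1.wav", "True", "True"),
    ("audio2.wav", "True", "False"),
]

# Write one JSON record per line (JSONL), UTF-8 encoded.
with open("sample.json", "w", encoding="utf-8") as f:
    for path, pnc, itn in utterances:
        record = {
            "audio_filepath": path,
            "pnc": pnc,
            "itn": itn,
            "lang": "ko",
            "source_lang": "ko",
            "target_lang": "ko",
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")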

Usage

Install NeMo and dependencies (CUDA-enabled PyTorch recommended):

pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "nemo_toolkit[asr]" huggingface_hub
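
To sanity-check the installation before loading the model (optional):

import torch
import nemo.collections.asr as nemo_asr  # verifies the ASR collection imports cleanly

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())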

Download and run inference with a manifest of audio samples:

import nemo.collections.asr as nemo_asr

# Download the .nemo checkpoint from the Hugging Face Hub and load it.
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/canary-0.7b-ko")

Configure transcription parameters in sample.json

Create a UTF-8, newline-delimited JSONL manifest (sample.json) with one record per utterance. The model reads these fields to toggle inverse text normalization (itn) and punctuation & casing (pnc) at inference time:

  • audio_filepath: path to the audio file
  • pnc: "True" or "False" (enable/disable punctuation & casing)
  • itn: "True" or "False" (enable/disable inverse text normalization)
  • lang / source_lang / target_lang: use "ko" for Korean

Example records covering all four PnC/ITN combinations are listed under "Sample manifest entries" below. Save the manifest and pass its path to asr_model.transcribe:

transcriptions = asr_model.transcribe("sample.json")
print(transcriptions)
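
transcribe returns one result per manifest record. Recent NeMo releases may return hypothesis objects with a .text field rather than plain strings, so the sketch below hedges for both return types while pairing each result with its manifest entry:

import json

# Re-read the manifest so results can be matched to their source files.
with open("sample.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for record, hyp in zip(records, transcriptions):
    # getattr covers both hypothesis objects (.text) and plain strings.
    text = getattr(hyp, "text", hyp)
    print(record["audio_filepath"], "->", text)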

Sample manifest entries (one line per PnC/ITN combination)

{"audio_filepath": "path/to/audio.wav", "pnc": "True", "itn": "True", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}
{"audio_filepath": "path/to/audio.wav", "pnc": "True", "itn": "False", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}
{"audio_filepath": "path/to/audio.wav", "pnc": "False", "itn": "True", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}
{"audio_filepath": "path/to/audio.wav", "pnc": "False", "itn": "False", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}

Training Data

(Metadata fields intentionally left blank for future updates.)

Evaluation

Dataset             Metric  Value  Notes
------------------  ------  -----  ------------------------------
Zeroth Korean test  WER     TBD    Evaluate with default decoding
KsponSpeech test    WER     TBD    Pending internal validation

Community benchmarks are welcome; please open a pull request with your scripts and manifest details if you evaluate on new corpora.
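
For contributed numbers, a minimal word error rate (WER) sketch in pure Python, computed as Levenshtein distance over whitespace-separated words (for Korean, keeping the reference and hypothesis tokenization consistent matters more than the particular scheme):

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via edit distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[j] holds the edit distance between the first i ref words
    # and the first j hyp words; updated row by row.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                dist[j] + 1,                       # deletion
                dist[j - 1] + 1,                   # insertion
                prev_diag + (0 if r == h else 1),  # substitution
            )
            prev_diag, dist[j] = dist[j], cur
    return dist[len(hyp)] / max(len(ref), 1)

print(wer("안녕하세요 만나서 반갑습니다", "안녕하세요 만나서 반갑습니다"))  # 0.0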

Intended Use & Limitations

  • Optimized for dictation, lecture transcription, and conversational Korean speech.
  • Not tuned for heavily noisy far-field capture or non-Korean utterances.
  • May reflect biases present in the listed datasets; review outputs before deploying in high-stakes settings.

Ethical Considerations

  • Avoid uploading personally identifiable audio without consent.
  • Use human-in-the-loop review for safety-critical or sensitive deployments.
  • Ensure compliance with the individual dataset licenses when redistributing transcripts.

Citation

Please cite CANARY and NeMo if you use this checkpoint:

@software{nvidia2024nemo,
  author    = {NVIDIA NeMo Team},
  title     = {NeMo: a toolkit for building state-of-the-art conversational AI models},
  year      = {2024},
  url       = {https://github.com/NVIDIA/NeMo}
}
@misc{canary2025,
  title        = {CANARY 0.7B Korean FastConformer ASR},
  author       = {lee1jun},
  year         = {2025},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/lee1jun/canary-0.7b-ko}
}

License

This repository is distributed under the Apache License 2.0. See LICENSE for details.

Contact & Support

For issues or feature requests, please open a GitHub issue in this repository or reach out via the contact information on the model card page.
