CANARY-0.7B-ko

CANARY-0.7B-ko is a 0.7B-parameter FastConformer sequence-to-sequence model packaged as a single .nemo checkpoint. It delivers large-vocabulary Korean ASR with optional inverse text normalization (ITN) and punctuation & casing (PnC) control via prompts.

Model Highlights

  • FastConformer encoder (24 layers) with Transformer decoder (8 layers).
  • Prompt-driven toggles for ITN and punctuation & casing (PnC) to adapt results at inference time.

Model Details

Item               Value
-----------------  -------------------------------------------------------
Architecture       FastConformer encoder (x24) + Transformer decoder (x8)
Parameters         ~0.7B
Languages          Korean (ko-KR)
Domain             General speech, conversational data, lectures
Checkpoint format  NVIDIA NeMo .nemo archive
File               canary-0.7b-ko.nemo

  • Input: 16 kHz mono PCM audio (wav/flac); a conversion sketch follows this list.
  • Output: UTF-8 Korean transcripts optionally normalized according to the prompt controls.
  • Dependencies: nemo_toolkit[asr], torch, plus huggingface_hub if downloading from the Hub.
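
If your recordings are not already 16 kHz mono, here is a minimal conversion sketch using torchaudio (installed alongside torch in the Usage section below); the file names are placeholders:

import torchaudio
import torchaudio.functional as F

# Load the original recording (any sample rate, possibly multi-channel).
waveform, sr = torchaudio.load("input.wav")

# Downmix to mono by averaging channels, then resample to 16 kHz.
mono = waveform.mean(dim=0, keepdim=True)
resampled = F.resample(mono, orig_freq=sr, new_freq=16000)

# Save as 16 kHz mono PCM WAV, ready for the manifest.
torchaudio.save("audio1.wav", resampled, sample_rate=16000)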

Promptable Controls (ITN & PnC)

  • pnc: set to "True" to enable punctuation and casing restoration, or "False" to keep minimal formatting.
  • itn: set to "True" to enable ITN for dates, numbers, and currency, or "False" for verbatim text.
  • lang, source_lang, target_lang: keep as "ko" to enforce Korean input/output, or adjust when integrating into multilingual workflows.

The controls are carried via the manifest (JSONL) used for transcription, making it easy to switch behavior per utterance without reloading the checkpoint.
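
A minimal sketch for generating such a manifest programmatically; the audio paths and flag values below are placeholders:

import json

# Hypothetical utterance list: (audio path, pnc flag, itn flag).
utterances = [
    ("audio1.wav", "True", "True"),
    ("audio2.wav", "True", "False"),
]

# Write one JSON record per line (JSONL), UTF-8 encoded.
with open("sample.json", "w", encoding="utf-8") as f:
    for path, pnc, itn in utterances:
        record = {
            "audio_filepath": path,
            "pnc": pnc,
            "itn": itn,
            "lang": "ko",
            "source_lang": "ko",
            "target_lang": "ko",
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")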

Usage

Install NeMo and dependencies (CUDA-enabled PyTorch recommended):

pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "nemo_toolkit[asr]" huggingface_hub
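
To sanity-check the installation before loading the model (optional):

import torch
import nemo.collections.asr as nemo_asr  # verifies the ASR collection imports cleanly

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())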

Download and run inference with a manifest of audio samples:

import nemo.collections.asr as nemo_asr

# Download the .nemo checkpoint from the Hugging Face Hub and load it.
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/canary-0.7b-ko")

Configure transcription parameters in sample.json

Create a UTF-8, newline-delimited JSONL manifest (sample.json) with one record per utterance. The model reads these fields to toggle inverse text normalization (itn) and punctuation & casing (pnc) at inference time:

  • audio_filepath: path to the audio file
  • pnc: "True" or "False" (enable/disable punctuation & casing)
  • itn: "True" or "False" (enable/disable inverse text normalization)
  • lang / source_lang / target_lang: use "ko" for Korean

Example records covering all four PnC/ITN combinations are listed under "Sample manifest entries" below. Save the manifest and pass its path to asr_model.transcribe:

transcriptions = asr_model.transcribe("sample.json")
print(transcriptions)
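
transcribe returns one result per manifest record. Recent NeMo releases may return hypothesis objects with a .text field rather than plain strings, so the sketch below hedges for both return types while pairing each result with its manifest entry:

import json

# Re-read the manifest so results can be matched to their source files.
with open("sample.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for record, hyp in zip(records, transcriptions):
    # getattr covers both hypothesis objects (.text) and plain strings.
    text = getattr(hyp, "text", hyp)
    print(record["audio_filepath"], "->", text)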

Sample manifest entries (one line per PnC/ITN combination)

{"audio_filepath": "path/to/audio.wav", "pnc": "True", "itn": "True", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}
{"audio_filepath": "path/to/audio.wav", "pnc": "True", "itn": "False", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}
{"audio_filepath": "path/to/audio.wav", "pnc": "False", "itn": "True", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}
{"audio_filepath": "path/to/audio.wav", "pnc": "False", "itn": "False", "lang": "ko", "source_lang": "ko", "target_lang": "ko"}

Training Data

(Metadata fields intentionally left blank for future updates.)

Evaluation

Dataset             Metric  Value  Notes
------------------  ------  -----  ------------------------------
Zeroth Korean test  WER     TBD    Evaluate with default decoding
KsponSpeech test    WER     TBD    Pending internal validation

Community benchmarks are welcome; please open a pull request with your scripts and manifest details if you evaluate on new corpora.
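
For contributed numbers, a minimal word error rate (WER) sketch in pure Python, computed as Levenshtein distance over whitespace-separated words (for Korean, keeping the reference and hypothesis tokenization consistent matters more than the particular scheme):

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via edit distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[j] holds the edit distance between the first i ref words
    # and the first j hyp words; updated row by row.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                dist[j] + 1,                       # deletion
                dist[j - 1] + 1,                   # insertion
                prev_diag + (0 if r == h else 1),  # substitution
            )
            prev_diag, dist[j] = dist[j], cur
    return dist[len(hyp)] / max(len(ref), 1)

print(wer("안녕하세요 만나서 반갑습니다", "안녕하세요 만나서 반갑습니다"))  # 0.0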

Intended Use & Limitations

  • Optimized for dictation, lecture transcription, and conversational Korean speech.
  • Not tuned for heavily noisy far-field capture or non-Korean utterances.
  • May reflect biases present in the listed datasets; review outputs before deploying in high-stakes settings.

Ethical Considerations

  • Avoid uploading personally identifiable audio without consent.
  • Use human-in-the-loop review for safety-critical or sensitive deployments.
  • Ensure compliance with the individual dataset licenses when redistributing transcripts.

Citation

Please cite CANARY and NeMo if you use this checkpoint:

@software{nvidia2024nemo,
  author    = {NVIDIA NeMo Team},
  title     = {NeMo: a toolkit for building state-of-the-art conversational AI models},
  year      = {2024},
  url       = {https://github.com/NVIDIA/NeMo}
}
@misc{canary2025,
  title        = {CANARY 0.7B Korean FastConformer ASR},
  author       = {lee1jun},
  year         = {2025},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/lee1jun/canary-0.7b-ko}
}

License

This repository is distributed under the Apache License 2.0. See LICENSE for details.

Contact & Support

For issues or feature requests, please open a GitHub issue in this repository or reach out via the contact information on the model card page.
