Kanary 1B 🇰🇷

Kanary (Korean Canary) 1B is a ~0.95B-parameter NeMo autoregressive encoder-decoder ASR checkpoint for Korean speech recognition, built on NVIDIA's canary-1b-v2. It keeps the 32 FastConformer encoder blocks and 8 Transformer decoder layers, and adds a KanaryPrompt formatter tuned for Korean text normalization and foreign-word rendering. Prompt controls for punctuation (pnc), inverse text normalization (itn), and foreign-word style (foreign) are passed via prompt tokens. This repo ships the .nemo checkpoint plus minimal inference scripts.

What's included

  • kanary-1b-20260119.nemo: NeMo checkpoint.
  • kanary_prompt: Python package that extends the Canary prompt format.
  • infer.py: Example that builds a temporary manifest and prints transcriptions.

About kanary_prompt

KanaryPromptFormatter extends the original Canary2PromptFormatter with a new |foreign| prompt token that controls how foreign words are written. Kanary was also fine-tuned to support extended |itn| tokens (itn, noitn, itn:undefined), which are not available in the original Canary.

kanary prompt template: "{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}|emotion||source_lang||target_lang||pnc||itn||timestamp||diarize||foreign|"
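To make the slot order in the template easier to see, here is a minimal sketch that extracts the |slot| names from the template string quoted above. It only parses the text; it does not call kanary_prompt or NeMo, and the helper name slot_names is hypothetical.

```python
import re

# Template string copied from the text above; the {...} placeholders stay unexpanded.
KANARY_TEMPLATE = (
    "{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}"
    "|emotion||source_lang||target_lang||pnc||itn||timestamp||diarize||foreign|"
)


def slot_names(template: str) -> list[str]:
    """Return the |slot| names in the order they appear in the template."""
    return re.findall(r"\|(\w+)\|", template)


slots = slot_names(KANARY_TEMPLATE)
print(slots)
# The final slot is the one Kanary adds on top of the Canary format.
print("added slot:", slots[-1])
```

Of these slots, the Quickstart below sets source_lang, target_lang, pnc, itn, and foreign through manifest fields.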

Quickstart (local inference)

requirements.txt

# Install torch separately to match your CUDA version if needed.
nemo_toolkit[asr]
soundfile
huggingface_hub>=0.22.0
git+https://github.com/21jun/kanary_prompt.git

Inference code

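import json
import tempfile

import nemo.collections.asr as nemo_asr
import soundfile as sf
# Not referenced below; kept from the original example, presumably so the Kanary prompt format is registered.
from kanary_prompt import kanary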

# Initialize ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/kanary-1b-20260119")

# Prepare your audio file paths

audio_paths = ["path/to/your/audio1.wav", "path/to/your/audio2.wav", "path/to/your/audio3.wav"]


def _get_duration_seconds(audio_path: str) -> float:
    info = sf.info(audio_path)
    if info.samplerate == 0:
        raise ValueError(f"Sample rate is zero for {audio_path}")
    return float(info.frames / info.samplerate)

# Since the current NeMo EncDecMultiTaskModel (Canary) does not support prompt configuration directly in the `transcribe` method,
# we create a wrapper function to handle the prompt configuration via a temporary JSON manifest file.
def transcribe(audio_paths, show_manifest=False, **kwargs):
    # Build one manifest entry per audio file; kwargs (source_lang, itn, ...) become prompt fields.
    entries = []
    for audio_path in audio_paths:
        entry = {"audio_filepath": audio_path, **kwargs}
        if "duration" not in entry:
            try:
                entry["duration"] = _get_duration_seconds(audio_path)
            except Exception as exc:
                print(f"Warning: unable to compute duration for {audio_path}: {exc}; defaulting to 0.0")
                entry["duration"] = 0.0
        entries.append(entry)

    # The context manager removes the temporary manifest even if transcription raises.
    with tempfile.NamedTemporaryFile(mode="w+", suffix=".json") as tmp_json:
        for entry in entries:
            tmp_json.write(json.dumps(entry) + "\n")
        tmp_json.flush()
        if show_manifest:
            tmp_json.seek(0)
            print(tmp_json.read())
        return asr_model.transcribe(tmp_json.name)


# Example usage:

transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko", itn="itn", pnc="True", foreign="foreign:en")
for transcription in transcriptions:
    print(transcription.text)


transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko", itn="noitn", pnc="False", foreign="foreign:ko")
for transcription in transcriptions:
    print(transcription.text)
  • source_lang / target_lang: Language codes in the manifest; currently only ko is supported.
  • foreign: Foreign-word style, e.g., foreign:en, foreign:ko, or foreign:undefined if unused.
  • itn: Inverse text normalization, e.g., itn, noitn, or itn:undefined if unused.
  • pnc: Punctuation; pass the strings "True" or "False" (not Python booleans), as expected by the model.
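The option list above can be turned into a small pre-flight check. This is a sketch, not part of kanary_prompt: the helper check_prompt_options is hypothetical, and treating the example values listed above as the complete allowed set is an assumption.

```python
# Values taken from the option list above; the source gives them as examples,
# so treating them as the full allowed set is an assumption.
ALLOWED = {
    "source_lang": {"ko"},
    "target_lang": {"ko"},
    "itn": {"itn", "noitn", "itn:undefined"},
    "pnc": {"True", "False"},
    "foreign": {"foreign:en", "foreign:ko", "foreign:undefined"},
}


def check_prompt_options(**options):
    """Hypothetical helper: reject unknown keys or values before building a manifest."""
    for key, value in options.items():
        if key not in ALLOWED:
            raise KeyError(f"unknown prompt option: {key}")
        if value not in ALLOWED[key]:
            raise ValueError(f"{key}={value!r} is not one of {sorted(ALLOWED[key])}")
    return options


checked = check_prompt_options(source_lang="ko", target_lang="ko", itn="itn", pnc="True", foreign="foreign:en")
print(checked)
```

A check like this catches the easy mistake of passing pnc=True (a boolean) instead of the string "True".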

You can also supply your own JSON/JSONL manifest with required fields: audio_filepath, duration, source_lang, target_lang, foreign, itn, pnc.
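A manifest of that shape can be written with the standard library alone. The sketch below assumes one JSON object per line (JSONL), as the Quickstart wrapper produces; write_manifest is a hypothetical helper, and the example path and duration are placeholders rather than real data.

```python
import json
import tempfile
from pathlib import Path

# Field names as listed in the text above.
REQUIRED_FIELDS = ("audio_filepath", "duration", "source_lang", "target_lang", "foreign", "itn", "pnc")


def write_manifest(entries, manifest_path):
    """Write entries as JSONL, failing early if an entry misses a required field."""
    lines = []
    for i, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing fields: {missing}")
        lines.append(json.dumps(entry, ensure_ascii=False))
    path = Path(manifest_path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Illustrative entry: the path and duration are placeholders.
example = {
    "audio_filepath": "path/to/your/audio1.wav",
    "duration": 3.2,
    "source_lang": "ko",
    "target_lang": "ko",
    "foreign": "foreign:en",
    "itn": "itn",
    "pnc": "True",
}
manifest = write_manifest([example], Path(tempfile.gettempdir()) / "kanary_manifest.json")
print(manifest.read_text(encoding="utf-8"))
```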

import nemo.collections.asr as nemo_asr
# Not referenced below; kept from the original example, presumably so the Kanary prompt format is registered.
from kanary_prompt import kanary

# Initialize ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/kanary-1b-20260119")

# Transcribe every entry listed in your manifest
transcriptions = asr_model.transcribe("path/to/your/manifest.json")
for transcription in transcriptions:
    print(transcription.text)

Characteristics of Kanary

Transcription Style

Kanary enables prompt-controlled transcription styles for ITN, punctuation, and foreign-word rendering, allowing flexible yet robust generation of multiple text formats from a single speech input.

MODEL ITN PNC FOREIGN Transcription
whisper-large-v3-turbo - - - ๊ทธ๊ฒŒ ์ฒ˜์Œ์— ๋งˆ์ดํŽ˜์ด์ง€๋ฅผ ๋“ค์–ด๊ฐ€์„œ ์บ์‹œ๋ฅผ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๊ณ  ๋˜ ์บ์‹œ ํ• ์ธ๊ถŒ ๋ฉ”๋‰ด์–ด ๋ฉ”๋‰ด์—์„œ ์‰์–ด๋ง ์บ์‹œ ๊ทธ๋ฆฌ๊ณ  ์ถฉ์ „ํ•˜๊ธฐ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด
kanary itn True foreign:en ๊ทธ๊ฒŒ ์ฒ˜์Œ์— my page๋ฅผ ๋“ค์–ด๊ฐ€์„œ cash๋ฅผ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๊ณ  ๋˜ cash ํ• ์ธ๊ถŒ manu manu์—์„œ sharing cash ๊ทธ๋ฆฌ๊ณ  ์ถฉ์ „ํ•˜๊ธฐ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด.
kanary noitn True foreign:ko ๊ทธ๊ฒŒ ์ฒ˜์Œ์— ๋งˆ์ด ํŽ˜์ด์ง€๋ฅผ ๋“ค์–ด๊ฐ€์„œ ์บ์‹œ๋ฅผ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๊ณ  ๋˜ ์บ์‹œ ํ• ์ธ๊ถŒ ๋ฉ”๋‰ด์—ฌ ๋ฉ”๋‰ด์—์„œ ์…ฐ์–ด๋ง ์บ์‹œ ๊ทธ๋ฆฌ๊ณ  ์ถฉ์ „ํ•˜๊ธฐ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด.
kanary itn:undefined True foreign:undefined ๊ทธ๊ฒŒ ์ฒ˜์Œ์— ๋งˆ์ดํŽ˜์ด์ง€๋ฅผ ๋“ค์–ด๊ฐ€์„œ ์บ์‹œ๋ฅผ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๊ณ  ๋˜ ์บ์‹œ ํ• ์ธ๊ถŒ ๋ฉ”๋‰ด์—ฌ ๋ฉ”๋‰ด์—์„œ ์‰์–ด๋ง ์บ์‹œ ๊ทธ๋ฆฌ๊ณ  ์ถฉ์ „ํ•˜๊ธฐ ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด.
MODEL ITN PNC FOREIGN Transcription
whisper-large-v3-turbo - - - ์œ ํšจ๊ธฐ๊ฐ„ 0622๊ณ ์š”. CVC ๋ฒˆํ˜ธ๊ฐ€ 333์ด๋„ค์š”.
kanary noitn True foreign:en ์œ ํšจ๊ธฐ๊ฐ„ ๊ณต ์œก ์ด ์ด๊ตฌ์š”. cvc๋ฒˆํ˜ธ๊ฐ€ ์‚ผ ์‚ผ ์‚ผ์ด๋„ค์š”.
kanary itn True foreign:en ์œ ํšจ๊ธฐ๊ฐ„ 0622๊ณ ์š”. cvc๋ฒˆํ˜ธ๊ฐ€ 333์ด๋„ค์š”.
kanary noitn True foreign:ko ์œ ํšจ๊ธฐ๊ฐ„ ๊ณต ์œก ์ด ์ด๊ตฌ์š”. ์”จ๋ธŒ์ด์”จ๋ฒˆํ˜ธ๊ฐ€ ์‚ผ ์‚ผ ์‚ผ์ด๋„ค์š”.
kanary itn True foreign:ko ์œ ํšจ๊ธฐ๊ฐ„ 0622๊ณ ์š”. ์”จ๋ธŒ์ด์”จ๋ฒˆํ˜ธ๊ฐ€ 333์ด๋„ค์š”.

Try your own Korean audio with different itn, pnc, and foreign prompts.
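One way to compare styles side by side is to enumerate the prompt combinations first. The sketch below only builds the keyword dictionaries; prompt_grid is a hypothetical helper, and the option values are the ones shown in the Quickstart (the undefined variants are left out for brevity).

```python
from itertools import product

# Option values taken from the Quickstart examples.
ITN_OPTIONS = ("itn", "noitn")
PNC_OPTIONS = ("True", "False")
FOREIGN_OPTIONS = ("foreign:en", "foreign:ko")


def prompt_grid():
    """Yield one kwargs dict per combination of itn, pnc, and foreign."""
    for itn, pnc, foreign in product(ITN_OPTIONS, PNC_OPTIONS, FOREIGN_OPTIONS):
        yield {"source_lang": "ko", "target_lang": "ko", "itn": itn, "pnc": pnc, "foreign": foreign}


grid = list(prompt_grid())
for kwargs in grid:
    # Each dict can be passed to the Quickstart wrapper, e.g. transcribe(audio_paths, **kwargs).
    print(kwargs)
```

Running all eight combinations on the same audio gives a quick view of how each prompt token changes the output.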

Architecture

Uses the original Canary backbone: 32 FastConformer encoder blocks and 8 Transformer decoder layers, totaling 948.74M parameters. Full hyperparameters are in model_config.yaml.

Aggregated Tokenizer

Employs a concatenated tokenizer that stitches together per-language SentencePiece models, making it straightforward to add more languages.

  • 1,152 special tokens from nvidia/canary-1b-flash, plus 1 itn:undefined token and 3 new foreign-handling tokens.
  • Custom Korean BPE tokenizer (2,142 vocab) trained on the Kanary training corpus.

model_config.yaml

tokenizer:
  dir: null
  type: agg
  langs:
    spl_tokens:
      dir: null
      type: bpe
      model_path: nemo:7f3ebf9af63d49ca88c0929ccacd3bd9_tokenizer.model
      vocab_path: nemo:e910c2757df4403b82d3f2e421ea5f0b_vocab.txt
      spe_tokenizer_vocab: nemo:548dc4893ea64b92886055409f7e8713_tokenizer.vocab
    ko:
      dir: null
      type: bpe
      model_path: nemo:f29eefa5554e40848746cf1f81ee9421_tokenizer.model
      vocab_path: nemo:01f79140103246f892ea17db8aed2985_vocab.txt
      spe_tokenizer_vocab: nemo:34ed5c3872dc42cda8125930bae15efa_tokenizer.vocab
  custom_tokenizer:
    _target_: nemo.collections.common.tokenizers.canary_tokenizer.CanaryTokenizer
    tokenizers: null

You can add other language tokenizers and continue training to adapt the model to new languages.
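The idea behind a concatenated tokenizer can be illustrated with a toy ID layout in which each sub-tokenizer owns a contiguous block of global IDs. This is a sketch under assumptions: the block order and offset scheme are illustrative and may differ from what NeMo actually does, and SubTokenizer, build_offsets, and to_global_id are hypothetical names. The vocabulary sizes come from the bullet list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTokenizer:
    lang: str
    vocab_size: int


def build_offsets(tokenizers):
    """Map each lang to the first global ID of its block, in declaration order."""
    offsets, start = {}, 0
    for tok in tokenizers:
        offsets[tok.lang] = start
        start += tok.vocab_size
    return offsets, start


def to_global_id(offsets, lang, local_id):
    """Shift a sub-tokenizer's local ID into its block of the combined vocabulary."""
    return offsets[lang] + local_id


# Sizes from the text: 1,152 + 1 + 3 special tokens and 2,142 Korean BPE tokens.
TOKENIZERS = [SubTokenizer("spl_tokens", 1152 + 1 + 3), SubTokenizer("ko", 2142)]
offsets, total = build_offsets(TOKENIZERS)
print(offsets, total)
print(to_global_id(offsets, "ko", 0))
```

Under this layout, appending another language adds a new block at the end without renumbering existing IDs, which is why extending the language set is straightforward.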

Training Details

Training Dataset

TBD

Prompt-Tagged Data Curation

TBD

Evaluation

KsponSpeech, KtelSpeech, Zeroth-Korean, MCV

TBD

Contact

Author: lee1jun@postech.ac.kr

homepage

License

CC-BY-4.0
