Kanary 1B 🇰🇷
Kanary (Korean Canary) 1B is a ~0.95B-parameter NeMo autoregressive encoder-decoder ASR checkpoint for Korean speech recognition, built on NVIDIA's canary-1b-v2. It keeps the 32 FastConformer encoder blocks and 8 Transformer decoder layers, and adds a KanaryPrompt formatter tuned for Korean text normalization and foreign-word rendering. Prompt controls for punctuation (pnc), inverse text normalization (itn), and foreign-word style (foreign) are passed via prompt tokens. This repo ships the .nemo checkpoint plus minimal inference scripts.
What's included
- `kanary-1b-20260119.nemo`: NeMo checkpoint.
- `kanary_prompt`: Python package that extends the Canary prompt format.
- `infer.py`: Example that builds a temporary manifest and prints transcriptions.
About kanary_prompt
- https://github.com/21jun/kanary_prompt: Kanary prompt token formatting dependency.
Extending the original Canary2PromptFormatter, KanaryPromptFormatter adds a new |foreign| prompt token to control foreign-word writing styles. We also fine-tuned Kanary to support extended |itn| tokens (itn, noitn, itn:undefined), which are not available in the original Canary.
kanary prompt template: "{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}|emotion||source_lang||target_lang||pnc||itn||timestamp||diarize||foreign|"
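As an illustrative sketch only (not the actual `KanaryPromptFormatter` API, which lives in the `kanary_prompt` package and handles tokenization internally), the template above can be read as slot substitution: each `|slot|` is filled from the corresponding manifest field:

```python
# Hypothetical illustration of how manifest fields fill the Kanary prompt
# template; unfilled slots (emotion, timestamp, diarize) are left as-is here.
TEMPLATE = (
    "{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}|emotion||source_lang|"
    "|target_lang||pnc||itn||timestamp||diarize||foreign|"
)

def render_prompt(entry: dict) -> str:
    """Replace each |slot| with the matching manifest value."""
    prompt = TEMPLATE
    for slot in ("source_lang", "target_lang", "pnc", "itn", "foreign"):
        prompt = prompt.replace(f"|{slot}|", f"|{entry.get(slot, 'undefined')}|")
    return prompt

print(render_prompt({"source_lang": "ko", "target_lang": "ko",
                     "pnc": "True", "itn": "itn", "foreign": "foreign:en"}))
```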
Quickstart (local inference)
requirements.txt
```text
# Install torch separately to match your CUDA version if needed.
nemo_toolkit[asr]
soundfile
huggingface_hub>=0.22.0
git+https://github.com/21jun/kanary_prompt.git
```
Inference code
```python
import json
import tempfile

import soundfile as sf
import nemo.collections.asr as nemo_asr
from kanary_prompt import kanary  # registers the KanaryPromptFormatter

# Initialize ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/kanary-1b-20260119")

# Prepare your audio file paths
audio_paths = ["path/to/your/audio1.wav", "path/to/your/audio2.wav", "path/to/your/audio3.wav"]

def _get_duration_seconds(audio_path: str) -> float:
    info = sf.info(audio_path)
    if info.samplerate == 0:
        raise ValueError(f"Sample rate is zero for {audio_path}")
    return float(info.frames / info.samplerate)

# Since the current NeMo EncDecMultiTaskModel (Canary) does not support prompt
# configuration directly in the `transcribe` method, we create a wrapper that
# passes the prompt configuration via a temporary JSON manifest file.
def transcribe(audio_paths, show_manifest=False, **kwargs):
    # Write one manifest entry per audio file, then call asr_model.transcribe on it.
    tmp_json = tempfile.NamedTemporaryFile(mode="w+", delete=True, suffix=".json")
    data = []
    for audio_path in audio_paths:
        entry = {"audio_filepath": audio_path}
        entry.update(kwargs)
        if "duration" not in entry:
            try:
                entry["duration"] = _get_duration_seconds(audio_path)
            except Exception as exc:
                print(f"Warning: unable to compute duration for {audio_path}: {exc}; defaulting to 0.0")
                entry["duration"] = 0.0
        data.append(entry)
    for item in data:
        tmp_json.write(json.dumps(item) + "\n")
    tmp_json.flush()
    if show_manifest:
        with open(tmp_json.name, "r") as f:
            print(f.read())
    transcriptions = asr_model.transcribe(tmp_json.name)
    tmp_json.close()
    return transcriptions

# Example usage:
transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko",
                            itn="itn", pnc="True", foreign="foreign:en")
for transcription in transcriptions:
    print(transcription.text)

transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko",
                            itn="noitn", pnc="False", foreign="foreign:ko")
for transcription in transcriptions:
    print(transcription.text)
```
- `source_lang`/`target_lang`: Language codes in the manifest; currently only `ko` is supported.
- `foreign`: Foreign-word style, e.g., `foreign:en`, `foreign:ko`, or `foreign:undefined` if unused.
- `itn`: Inverse text normalization, e.g., `itn`, `noitn`, or `itn:undefined` if unused.
- `pnc`: Use `"True"` or `"False"` strings, as expected by the model.
You can also supply your own JSON/JSONL manifest with required fields: audio_filepath, duration, source_lang, target_lang, foreign, itn, pnc.
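For reference, a single manifest line can be produced like this (the path and duration below are placeholders; one JSON object per line, as NeMo manifests expect):

```python
import json

# Hypothetical manifest entry; field names follow the list above.
entry = {
    "audio_filepath": "path/to/your/audio1.wav",
    "duration": 3.2,          # seconds; compute from the actual file in practice
    "source_lang": "ko",
    "target_lang": "ko",
    "pnc": "True",
    "itn": "itn",
    "foreign": "foreign:en",
}

# Append one JSON object per line (JSONL).
with open("manifest.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```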
```python
import nemo.collections.asr as nemo_asr
from kanary_prompt import kanary  # registers the KanaryPromptFormatter

# Initialize ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/kanary-1b-20260119")

# Example usage:
transcriptions = asr_model.transcribe("path/to/your/manifest.json")
```
Characteristics of Kanary
Transcription Style
Kanary enables prompt-controlled transcription styles for ITN, punctuation, and foreign-word rendering, allowing flexible yet robust generation of multiple text formats from a single speech input.
| MODEL | ITN | PNC | FOREIGN | Transcription |
|---|---|---|---|---|
| whisper-large-v3-turbo | - | - | - | ๊ทธ๊ฒ ์ฒ์์ ๋ง์ดํ์ด์ง๋ฅผ ๋ค์ด๊ฐ์ ์บ์๋ฅผ ๋ฒํผ์ ๋๋ฅด๊ณ ๋ ์บ์ ํ ์ธ๊ถ ๋ฉ๋ด์ด ๋ฉ๋ด์์ ์์ด๋ง ์บ์ ๊ทธ๋ฆฌ๊ณ ์ถฉ์ ํ๊ธฐ ๋ฒํผ์ ๋๋ฅด๋ฉด |
| kanary | itn | True | foreign:en | ๊ทธ๊ฒ ์ฒ์์ my page๋ฅผ ๋ค์ด๊ฐ์ cash๋ฅผ ๋ฒํผ์ ๋๋ฅด๊ณ ๋ cash ํ ์ธ๊ถ manu manu์์ sharing cash ๊ทธ๋ฆฌ๊ณ ์ถฉ์ ํ๊ธฐ ๋ฒํผ์ ๋๋ฅด๋ฉด. |
| kanary | noitn | True | foreign:ko | ๊ทธ๊ฒ ์ฒ์์ ๋ง์ด ํ์ด์ง๋ฅผ ๋ค์ด๊ฐ์ ์บ์๋ฅผ ๋ฒํผ์ ๋๋ฅด๊ณ ๋ ์บ์ ํ ์ธ๊ถ ๋ฉ๋ด์ฌ ๋ฉ๋ด์์ ์ ฐ์ด๋ง ์บ์ ๊ทธ๋ฆฌ๊ณ ์ถฉ์ ํ๊ธฐ ๋ฒํผ์ ๋๋ฅด๋ฉด. |
| kanary | itn:undefined | True | foreign:undefined | ๊ทธ๊ฒ ์ฒ์์ ๋ง์ดํ์ด์ง๋ฅผ ๋ค์ด๊ฐ์ ์บ์๋ฅผ ๋ฒํผ์ ๋๋ฅด๊ณ ๋ ์บ์ ํ ์ธ๊ถ ๋ฉ๋ด์ฌ ๋ฉ๋ด์์ ์์ด๋ง ์บ์ ๊ทธ๋ฆฌ๊ณ ์ถฉ์ ํ๊ธฐ ๋ฒํผ์ ๋๋ฅด๋ฉด. |
| MODEL | ITN | PNC | FOREIGN | Transcription |
|---|---|---|---|---|
| whisper-large-v3-turbo | - | - | - | ์ ํจ๊ธฐ๊ฐ 0622๊ณ ์. CVC ๋ฒํธ๊ฐ 333์ด๋ค์. |
| kanary | noitn | True | foreign:en | ์ ํจ๊ธฐ๊ฐ ๊ณต ์ก ์ด ์ด๊ตฌ์. cvc๋ฒํธ๊ฐ ์ผ ์ผ ์ผ์ด๋ค์. |
| kanary | itn | True | foreign:en | ์ ํจ๊ธฐ๊ฐ 0622๊ณ ์. cvc๋ฒํธ๊ฐ 333์ด๋ค์. |
| kanary | noitn | True | foreign:ko | ์ ํจ๊ธฐ๊ฐ ๊ณต ์ก ์ด ์ด๊ตฌ์. ์จ๋ธ์ด์จ๋ฒํธ๊ฐ ์ผ ์ผ ์ผ์ด๋ค์. |
| kanary | itn | True | foreign:ko | ์ ํจ๊ธฐ๊ฐ 0622๊ณ ์. ์จ๋ธ์ด์จ๋ฒํธ๊ฐ 333์ด๋ค์. |
Try your own Korean audio with different itn, pnc, and foreign prompts.
Architecture
Uses the original Canary backbone: 32 FastConformer encoder blocks and 8 Transformer decoder layers, totaling 948.74M parameters. Full hyperparameters are in model_config.yaml.
Aggregated Tokenizer
Employs a concatenated tokenizer that stitches together per-language SentencePiece models, making it straightforward to add more languages.
- 1,152 special tokens from nvidia/canary-1b-flash, plus 1 new `itn:undefined` token and 3 new `foreign`-handling tokens.
- Custom Korean BPE tokenizer (2,142-token vocabulary) trained on the Kanary training corpus.
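Conceptually, an aggregated tokenizer gives each per-language vocabulary its own contiguous ID range by offsetting. A minimal sketch of that idea (not NeMo's actual `CanaryTokenizer`/aggregate implementation; the toy vocabularies are made up for illustration):

```python
# Toy sketch of ID-range offsetting in an aggregated ("agg") tokenizer.
class ToyAggTokenizer:
    def __init__(self, lang_vocabs: dict):
        self.offsets, self.vocabs = {}, lang_vocabs
        offset = 0
        for lang, vocab in lang_vocabs.items():
            self.offsets[lang] = offset
            offset += len(vocab)  # each language owns a contiguous ID block

    def token_to_id(self, lang: str, token: str) -> int:
        # Global ID = language block offset + index within that language's vocab.
        return self.offsets[lang] + self.vocabs[lang].index(token)

agg = ToyAggTokenizer({
    "spl_tokens": ["<pad>", "|itn|", "|foreign:en|"],  # special tokens first
    "ko": ["▁안", "녕"],
})
print(agg.token_to_id("spl_tokens", "|itn|"))  # 1
print(agg.token_to_id("ko", "▁안"))            # 3 (offset 3 + index 0)
```

Because each language's IDs are disjoint, adding a language only appends a new block at the end, which is why extending the tokenizer is straightforward.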
model_config.yaml
```yaml
tokenizer:
  dir: null
  type: agg
  langs:
    spl_tokens:
      dir: null
      type: bpe
      model_path: nemo:7f3ebf9af63d49ca88c0929ccacd3bd9_tokenizer.model
      vocab_path: nemo:e910c2757df4403b82d3f2e421ea5f0b_vocab.txt
      spe_tokenizer_vocab: nemo:548dc4893ea64b92886055409f7e8713_tokenizer.vocab
    ko:
      dir: null
      type: bpe
      model_path: nemo:f29eefa5554e40848746cf1f81ee9421_tokenizer.model
      vocab_path: nemo:01f79140103246f892ea17db8aed2985_vocab.txt
      spe_tokenizer_vocab: nemo:34ed5c3872dc42cda8125930bae15efa_tokenizer.vocab
  custom_tokenizer:
    _target_: nemo.collections.common.tokenizers.canary_tokenizer.CanaryTokenizer
    tokenizers: null
```
You can add other language tokenizers and continue training to adapt the model to new languages.
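As a hypothetical sketch of what such an extension could look like (the `en` entry and its paths are placeholders; a new SentencePiece model would have to be trained and packaged through NeMo first), a new language is registered as an additional block under `langs`:

```yaml
# Hypothetical addition of an English tokenizer; paths are placeholders.
tokenizer:
  type: agg
  langs:
    # spl_tokens and ko entries unchanged from above
    en:
      dir: null
      type: bpe
      model_path: /path/to/en_tokenizer.model
      vocab_path: /path/to/en_vocab.txt
```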
Training Details
Training Dataset
TBD
Prompt-Tagged Data Curation
TBD
Evaluation
Benchmarks: KsponSpeech, KtelSpeech, Zeroth-Korean, and MCV.
TBD
Contact
Author: lee1jun@postech.ac.kr
License
CC-BY-4.0
Model tree for lee1jun/kanary-1b-20260119
Base model
nvidia/canary-1b-v2