πŸ‡°πŸ‡· Kanary 1B (pre-release)

Kanary (Korean Canary) 1B is a pre-release ~0.95B NeMo checkpoint for Korean ASR, built on NVIDIA’s canary-1b-v2. It keeps the 32 FastConformer encoder blocks and 8 Transformer decoder layers, and adds a KanaryPrompt formatter tuned for Korean text normalization and foreign-word rendering. Prompt controls for punctuation (pnc), inverse text normalization (itn), and foreign-word style (foreign) are passed via the prompt tokens. This repo ships the .nemo checkpoint plus minimal inference scripts.

This checkpoint is a proof-of-concept; a public release will follow with finalized assets.

What is here

  • kanary-1b-pre-v0.1.nemo: NeMo checkpoint.
  • Total model parameters: 951.67M (approx. 0.95B).
  • infer.py: example that builds a temporary manifest and prints transcriptions.

About kanary_prompt

KanaryPromptFormatter extends the original Canary2PromptFormatter and adds a new prompt token, |foreign|, which controls how foreign words are rendered.

kanary prompt template: "{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}|emotion||source_lang||target_lang||pnc||itn||timestamp||diarize||foreign|"

Prompt Defaults

prompt_format: kanary
prompt_defaults:
- role: user
  slots:
    decodercontext: ''
    source_lang: <|ko|>
    target_lang: <|ko|>
    emotion: <|emo:undefined|>
    pnc: <|pnc|>
    itn: <|noitn|>
    diarize: <|nodiarize|>
    timestamp: <|notimestamp|>
    foreign: <|foreign_undefined|>
- role: user_partial
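As an illustration, the defaults above can be substituted into the template with plain string formatting. This is a simplified sketch of the idea only, not the actual NeMo formatter; the literal values for CANARY2_BOCTX and CANARY_BOS are assumptions for demonstration.

```python
# Simplified sketch of slot substitution in the kanary prompt template.
# The two constants below stand in for CANARY2_BOCTX / CANARY_BOS
# (assumed literals, for illustration only).
BOCTX = "<|startofcontext|>"
BOS = "<|startoftranscript|>"

TEMPLATE = (
    BOCTX + "{decodercontext}" + BOS +
    "{emotion}{source_lang}{target_lang}{pnc}{itn}"
    "{timestamp}{diarize}{foreign}"
)

DEFAULTS = {
    "decodercontext": "",
    "source_lang": "<|ko|>",
    "target_lang": "<|ko|>",
    "emotion": "<|emo:undefined|>",
    "pnc": "<|pnc|>",
    "itn": "<|noitn|>",
    "diarize": "<|nodiarize|>",
    "timestamp": "<|notimestamp|>",
    "foreign": "<|foreign_undefined|>",
}

def render_prompt(**overrides):
    """Fill the template with defaults, letting callers override any slot."""
    slots = {**DEFAULTS, **overrides}
    return TEMPLATE.format(**slots)

print(render_prompt(foreign="<|foreign_en|>"))
```

Overriding a single slot (here `foreign`) leaves every other default in place, which mirrors how the prompt_defaults block above is meant to be used.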

Quickstart (local inference)

requirements.txt

# Install torch separately to match your CUDA version if needed.
nemo_toolkit[asr]
soundfile
huggingface_hub>=0.22.0
git+https://github.com/21jun/kanary_prompt.git

Then install the dependencies:

pip install -r requirements.txt

infer.py

import nemo.collections.asr as nemo_asr
from kanary_prompt import kanary
import json
import tempfile
import soundfile as sf

# Initialize ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/kanary-1b-pre-v0.1")

# Prepare your audio file paths

audio_paths = ["path/to/wav1.wav",
               "path/to/wav2.wav",
               "path/to/wav3.wav",
               "path/to/wav4.wav",
               # ... add more paths as needed
               ]

def _get_duration_seconds(audio_path: str) -> float:
    info = sf.info(audio_path)
    if info.samplerate == 0:
        raise ValueError(f"Sample rate is zero for {audio_path}")
    return float(info.frames / info.samplerate)

# Since the current NeMo EncDecMultiTaskModel (Canary) does not support prompt configuration directly in the `transcribe` method,
# we create a wrapper function that passes the prompt configuration via a temporary JSON manifest file.
def transcribe(audio_paths, show_manifest=False, **kwargs):
    # Create tmp json file of audio paths and kwargs, then call asr_model.transcribe on it
    tmp_json = tempfile.NamedTemporaryFile(mode='w+', delete=True, suffix=".json")
    data = []
    for audio_path in audio_paths:
        entry = {"audio_filepath": audio_path}
        entry.update(kwargs)
        if "duration" not in entry:
            try:
                entry["duration"] = _get_duration_seconds(audio_path)
            except Exception as exc:
                print(f"Warning: unable to compute duration for {audio_path}: {exc}; defaulting to 0.0")
                entry["duration"] = 0.0
        data.append(entry)
    for item in data:
        tmp_json.write(json.dumps(item) + "\n")
    tmp_json.flush()
    if show_manifest:
        with open(tmp_json.name, 'r') as f:
            print(f.read())

    transcriptions = asr_model.transcribe(tmp_json.name)
    tmp_json.close()
    return transcriptions

# Example usage:
transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko", itn="True", pnc="True", foreign="foreign_en")
for transcription in transcriptions:
    print(transcription.text)
  • source_lang / target_lang: Language codes in the manifest; currently only ko is supported.
  • foreign: Foreign-word style, e.g., foreign_en, foreign_ko, or foreign_undefined if unused.
  • itn / pnc: Use "True" or "False" strings as expected by the model.

You can also supply your own JSON/JSONL manifest with required fields: audio_filepath, duration, source_lang, target_lang, foreign, itn, pnc.
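For instance, a one-entry JSONL manifest with all required fields could be written like this (the audio path and duration are placeholders to replace with real values):

```python
import json

# Hypothetical manifest entry; swap in your own audio path and duration.
entry = {
    "audio_filepath": "path/to/wav1.wav",
    "duration": 3.2,
    "source_lang": "ko",
    "target_lang": "ko",
    "foreign": "foreign_en",
    "itn": "True",
    "pnc": "True",
}

# One JSON object per line (JSONL); ensure_ascii=False keeps Korean readable.
with open("manifest.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```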

# Example usage:
transcriptions = asr_model.transcribe("path/to/your/manifest.json")

Example transcription variants:

transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko", itn="True", pnc="True", foreign="foreign_en")
| itn   | pnc   | foreign    | output |
|-------|-------|------------|--------|
| True  | True  | foreign_en | λ‚΄κ°€ μ•Œλ˜ μ‚¬λžŒμ€ 쑰립 pc ν•œ 이백만 원 총 μƒκ°ν–ˆλ˜ μ‚¬λžŒ μžˆμ—ˆκ±°λ“ . |
| True  | True  | foreign_ko | λ‚΄κ°€ μ•Œλ˜ μ‚¬λžŒμ€ 쑰립 피씨 ν•œ 이백만 원 총 μƒκ°ν–ˆλ˜ μ‚¬λžŒ μžˆμ—ˆκ±°λ“ . |
| True  | False | foreign_en | λ‚΄κ°€ μ•Œλ˜ μ‚¬λžŒμ€ 쑰립 pc ν•œ 이백만 원 총 μƒκ°ν–ˆλ˜ μ‚¬λžŒ μžˆμ—ˆκ±°λ“  |
| True  | False | foreign_ko | λ‚΄κ°€ μ•Œλ˜ μ‚¬λžŒμ€ 쑰립 피씨 ν•œ 이백만 원 총 μƒκ°ν–ˆλ˜ μ‚¬λžŒ μžˆμ—ˆκ±°λ“  |
| False | True  | foreign_en | λ‚΄κ°€ μ•Œλ˜ μ‚¬λžŒμ€ 쑰립 pc ν•œ 200λ§Œμ› 총 μƒκ°ν–ˆλ˜ μ‚¬λžŒ μžˆμ—ˆκ±°λ“ . |
| False | True  | foreign_ko | λ‚΄κ°€ μ•Œλ˜ μ‚¬λžŒμ€ 쑰립 피씨 ν•œ 200λ§Œμ› 총 μƒκ°ν–ˆλ˜ μ‚¬λžŒ μžˆμ—ˆκ±°λ“ . |
| itn   | pnc   | foreign    | output |
|-------|-------|------------|--------|
| True  | True  | foreign_en | enable이 1인데 reset이 1μ΄λ‹ˆκΉŒ 좜λ ₯은 0 enable 0λ©΄ μœ μ§€. |
| True  | True  | foreign_ko | 이넀이블이 1인데 리셋이 1μ΄λ‹ˆκΉŒ 좜λ ₯은 0 이넀이블 0λ©΄ μœ μ§€. |
| True  | False | foreign_en | enable이 1인데 reset이 1μ΄λ‹ˆκΉŒ 좜λ ₯은 0 enable 0λ©΄ μœ μ§€ |
| True  | False | foreign_ko | 이넀이블이 1인데 리셋이 1μ΄λ‹ˆκΉŒ 좜λ ₯은 0 이넀이블 0λ©΄ μœ μ§€ |
| False | True  | foreign_en | enable이 일인데 reset이 μΌμ΄λ‹ˆκΉŒ 좜λ ₯은 제둜 enable 제둜면 μœ μ§€. |
| False | True  | foreign_ko | 이넀이블이 일인데 리셋이 μΌμ΄λ‹ˆκΉŒ 좜λ ₯은 제둜 이넀이블 제둜면 μœ μ§€. |

Model Details

Architecture

Uses the original Canary backbone: 32 FastConformer encoder blocks and 8 Transformer decoder layers, totaling ~0.95B parameters. Full hyperparameters are in model_config.yaml.

Aggregated Tokenizer

Employs a concatenated tokenizer that stitches together per-language SentencePiece models, making it straightforward to add more languages.

  • 1,152 special tokens from nvidia/canary-1b-flash, plus 3 new foreign-handling tokens.
  • Custom Korean BPE tokenizer (5,000 vocab) trained on the Kanary training corpus.
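Conceptually, an aggregated tokenizer gives each sub-tokenizer an ID offset so all languages share one vocabulary space without collisions. A toy sketch of that idea (illustrative only, not NeMo's implementation; the token tables are made up):

```python
# Toy aggregated tokenizer: each language's IDs are shifted by the cumulative
# vocab size of the tokenizers registered before it, so IDs never collide.
class ToyAggTokenizer:
    def __init__(self, tokenizers):
        # tokenizers: {lang: {token: local_id}}; insertion order fixes offsets
        self.tables = tokenizers
        self.offsets = {}
        offset = 0
        for lang, table in tokenizers.items():
            self.offsets[lang] = offset
            offset += len(table)
        self.vocab_size = offset

    def token_to_id(self, lang, token):
        """Map a (language, token) pair into the shared ID space."""
        return self.offsets[lang] + self.tables[lang][token]

# Hypothetical tiny vocabularies, just to show the offset arithmetic.
spl = {"<|ko|>": 0, "<|pnc|>": 1, "<|foreign_en|>": 2}
ko = {"μ•ˆ": 0, "λ…•": 1}
agg = ToyAggTokenizer({"spl_tokens": spl, "ko": ko})
print(agg.token_to_id("ko", "μ•ˆ"))  # ko IDs start after the 3 special tokens
```

Adding another language is then just registering one more SentencePiece model, which shifts nothing that came before it; this is what makes the scheme easy to extend.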

model_config.yaml

tokenizer:
  dir: null
  type: agg
  langs:
    spl_tokens:
      dir: null
      type: bpe
      model_path: nemo:16c4aa8fc4b842cea64b44ccdb0cfde6_tokenizer.model
      vocab_path: nemo:1fb7608d4e704a349755b8bc74826b32_vocab.txt
      spe_tokenizer_vocab: nemo:7727e9891e724123b5acb70a70f7c139_tokenizer.vocab
    ko:
      dir: null
      type: bpe
      model_path: nemo:4ea4ae1adeae41e4868bf470b856cd37_tokenizer.model
      vocab_path: nemo:15141f641292474ebbd465f65707a637_vocab.txt
      spe_tokenizer_vocab: nemo:bf66d3c13f6649839c39ec07b72bd6da_tokenizer.vocab
  custom_tokenizer:
    _target_: nemo.collections.common.tokenizers.canary_tokenizer.CanaryTokenizer
    tokenizers: null

Training Details

Training Dataset

TBD

Prompt-Tagged Data Curation

TBD

Evaluation

TBD

Contact

author: lee1jun@postech.ac.kr

homepage

Licence

MIT
