# 🇰🇷 Kanary 1B (pre-release)
Kanary (Korean Canary) 1B is a pre-release ~0.95B-parameter NeMo checkpoint for Korean ASR, built on NVIDIA's canary-1b-v2. It keeps the base model's 32 FastConformer encoder blocks and 8 Transformer decoder layers, and adds a KanaryPrompt formatter tuned for Korean text normalization and foreign-word rendering. Prompt controls for punctuation (`pnc`), inverse text normalization (`itn`), and foreign-word style (`foreign`) are passed via prompt tokens. This repo ships the `.nemo` checkpoint plus minimal inference scripts.
This checkpoint is a proof-of-concept; a public release will follow with finalized assets.
## What is here
- `kanary-1b-pre-v0.1.nemo`: NeMo checkpoint. Total model parameters: 951.67M (approx. 0.95B).
- `infer.py`: example that builds a temporary manifest and prints transcriptions.
## About kanary_prompt
- https://github.com/21jun/kanary_prompt : Kanary prompt token formatting dependency

Extending the original Canary2PromptFormatter, KanaryPromptFormatter supports a new prompt token, `|foreign|`, that controls foreign-word writing styles.

Kanary prompt template:
```
{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}|emotion||source_lang||target_lang||pnc||itn||timestamp||diarize||foreign|
```
### Prompt Defaults
```yaml
prompt_format: kanary
prompt_defaults:
- role: user
  slots:
    decodercontext: ''
    source_lang: <|ko|>
    target_lang: <|ko|>
    emotion: <|emo:undefined|>
    pnc: <|pnc|>
    itn: <|noitn|>
    diarize: <|nodiarize|>
    timestamp: <|notimestamp|>
    foreign: <|foreign_undefined|>
- role: user_partial
```
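To make the slot mechanics concrete, here is a minimal sketch in plain Python (not NeMo's actual PromptFormatter API) of how the default slot values would be substituted into the template; the `{CANARY2_BOCTX}` and `{CANARY_BOS}` boundary tokens are left symbolic:

```python
# Toy illustration of slot filling; NeMo's prompt formatter does this internally.
defaults = {
    "decodercontext": "",
    "source_lang": "<|ko|>",
    "target_lang": "<|ko|>",
    "emotion": "<|emo:undefined|>",
    "pnc": "<|pnc|>",
    "itn": "<|noitn|>",
    "diarize": "<|nodiarize|>",
    "timestamp": "<|notimestamp|>",
    "foreign": "<|foreign_undefined|>",
}

template = ("{CANARY2_BOCTX}|decodercontext|{CANARY_BOS}|emotion||source_lang|"
            "|target_lang||pnc||itn||timestamp||diarize||foreign|")

def render(template: str, slots: dict) -> str:
    # Replace each |slot_name| placeholder with its configured token.
    out = template
    for name, value in slots.items():
        out = out.replace(f"|{name}|", value)
    return out

print(render(template, defaults))
# {CANARY2_BOCTX}{CANARY_BOS}<|emo:undefined|><|ko|><|ko|><|pnc|><|noitn|><|notimestamp|><|nodiarize|><|foreign_undefined|>
```

Passing `itn="True"` or `foreign="foreign_en"` at inference time amounts to swapping the corresponding default token before rendering.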
## Quickstart (local inference)
### requirements.txt
```
# Install torch separately to match your CUDA version if needed.
nemo_toolkit[asr]
soundfile
huggingface_hub>=0.22.0
git+https://github.com/21jun/kanary_prompt.git
```
```bash
pip install -r requirements.txt
```
### infer.py
```python
import json
import tempfile

import soundfile as sf
import nemo.collections.asr as nemo_asr
from kanary_prompt import kanary  # noqa: F401 — makes the Kanary prompt format available

# Initialize ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("lee1jun/kanary-1b-pre-v0.1")

# Prepare your audio file paths
audio_paths = [
    "path/to/wav1.wav",
    "path/to/wav2.wav",
    "path/to/wav3.wav",
    "path/to/wav4.wav",
    ...
]

def _get_duration_seconds(audio_path: str) -> float:
    info = sf.info(audio_path)
    if info.samplerate == 0:
        raise ValueError(f"Sample rate is zero for {audio_path}")
    return float(info.frames / info.samplerate)

# The current NeMo EncDecMultiTaskModel (Canary) does not support prompt configuration
# directly in the `transcribe` method, so we wrap it: the prompt slots are written into
# a temporary JSON manifest instead.
def transcribe(audio_paths, show_manifest=False, **kwargs):
    # Create a temporary JSONL manifest of audio paths plus prompt kwargs,
    # then call asr_model.transcribe on it.
    tmp_json = tempfile.NamedTemporaryFile(mode='w+', delete=True, suffix=".json")
    data = []
    for audio_path in audio_paths:
        entry = {"audio_filepath": audio_path}
        entry.update(kwargs)
        if "duration" not in entry:
            try:
                entry["duration"] = _get_duration_seconds(audio_path)
            except Exception as exc:
                print(f"Warning: unable to compute duration for {audio_path}: {exc}; defaulting to 0.0")
                entry["duration"] = 0.0
        data.append(entry)
    for item in data:
        tmp_json.write(json.dumps(item) + "\n")
    tmp_json.flush()
    if show_manifest:
        with open(tmp_json.name, 'r') as f:
            print(f.read())
    transcriptions = asr_model.transcribe(tmp_json.name)
    tmp_json.close()
    return transcriptions

# Example usage:
transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko",
                            itn="True", pnc="True", foreign="foreign_en")
for transcription in transcriptions:
    print(transcription.text)
```
- `source_lang`/`target_lang`: language codes in the manifest; currently only `ko` is supported.
- `foreign`: foreign-word style, e.g., `foreign_en`, `foreign_ko`, or `foreign_undefined` if unused.
- `itn`/`pnc`: use `"True"` or `"False"` strings, as expected by the model.

You can also supply your own JSON/JSONL manifest with the required fields: `audio_filepath`, `duration`, `source_lang`, `target_lang`, `foreign`, `itn`, `pnc`.
```python
# Example usage:
transcriptions = asr_model.transcribe("path/to/your/manifest.json")
```
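For reference, a manifest row with all required fields can be generated like this (a sketch; the audio path and duration are placeholders):

```python
import json

# Hypothetical manifest entry; audio_filepath and duration are placeholders.
entry = {
    "audio_filepath": "path/to/wav1.wav",
    "duration": 3.2,
    "source_lang": "ko",
    "target_lang": "ko",
    "foreign": "foreign_en",
    "itn": "True",
    "pnc": "True",
}

# NeMo manifests are JSON Lines: one JSON object per line.
line = json.dumps(entry, ensure_ascii=False)
print(line)
```

Write one such line per utterance into a `.json`/`.jsonl` file and pass its path to `transcribe`.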
Example transcription variants (toggling `itn`, `pnc`, and `foreign`):
```python
transcriptions = transcribe(audio_paths, source_lang="ko", target_lang="ko",
                            itn="True", pnc="True", foreign="foreign_en")
```
| itn | pnc | foreign | output |
|---|---|---|---|
| True | True | foreign_en | 내가 아는 사람은 조립 pc 한 이백만 원 이 생각하는 사람 있었거든. |
| True | True | foreign_ko | 내가 아는 사람은 조립 피씨 한 이백만 원 이 생각하는 사람 있었거든. |
| True | False | foreign_en | 내가 아는 사람은 조립 pc 한 이백만 원 이 생각하는 사람 있었거든 |
| True | False | foreign_ko | 내가 아는 사람은 조립 피씨 한 이백만 원 이 생각하는 사람 있었거든 |
| False | True | foreign_en | 내가 아는 사람은 조립 pc 한 200만원 이 생각하는 사람 있었거든. |
| False | True | foreign_ko | 내가 아는 사람은 조립 피씨 한 200만원 이 생각하는 사람 있었거든. |

(Roughly: "A person I know was thinking around two million won for a custom-built PC." Note that `foreign_en` keeps loanwords in Latin script, e.g. `pc`, while `foreign_ko` renders them in Hangul, e.g. `피씨`.)
| itn | pnc | foreign | output |
|---|---|---|---|
| True | True | foreign_en | enable이 1인데 reset이 1이니까 출력은 0 enable 0면 정지. |
| True | True | foreign_ko | 이네이블이 1인데 리셋이 1이니까 출력은 0 이네이블 0면 정지. |
| True | False | foreign_en | enable이 1인데 reset이 1이니까 출력은 0 enable 0면 정지 |
| True | False | foreign_ko | 이네이블이 1인데 리셋이 1이니까 출력은 0 이네이블 0면 정지 |
| False | True | foreign_en | enable이 일인데 reset이 일이니까 출력은 제로 enable 제로면 정지. |
| False | True | foreign_ko | 이네이블이 일인데 리셋이 일이니까 출력은 제로 이네이블 제로면 정지. |

(Roughly: "enable is 1, but since reset is 1 the output is 0; when enable is 0 it stops.")
## Model Details
### Architecture
Uses the original Canary backbone: 32 FastConformer encoder blocks and 8 Transformer decoder layers, totaling ~0.95B parameters. Full hyperparameters are in model_config.yaml.
### Aggregated Tokenizer
Employs a concatenated tokenizer that stitches together per-language SentencePiece models, making it straightforward to add more languages.
- 1,152 special tokens from nvidia/canary-1b-flash, plus 3 new foreign-handling tokens.
- Custom Korean BPE tokenizer (5,000-token vocabulary) trained on the Kanary training corpus.
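The aggregation idea can be illustrated with a toy sketch. This is plain Python, not NeMo's CanaryTokenizer: the point is only that each sub-tokenizer keeps its own vocabulary and the aggregate offsets its ids into a disjoint range of the combined id space.

```python
# Toy illustration of an aggregated (concatenated) tokenizer.
class ToyAggTokenizer:
    def __init__(self, tokenizers):
        # tokenizers: {"spl_tokens": [tokens...], "ko": [tokens...]}
        self.tokenizers = tokenizers
        self.offsets = {}
        offset = 0
        for lang, vocab in tokenizers.items():
            self.offsets[lang] = offset   # each language's ids start after the previous vocab
            offset += len(vocab)

    def token_to_id(self, lang, token):
        return self.offsets[lang] + self.tokenizers[lang].index(token)

    def id_to_token(self, idx):
        # Route a combined id back to the owning sub-tokenizer.
        for lang, vocab in self.tokenizers.items():
            off = self.offsets[lang]
            if off <= idx < off + len(vocab):
                return lang, vocab[idx - off]
        raise IndexError(idx)

# Tiny vocabularies standing in for the 1,155 special tokens and the 5,000-entry Korean BPE model.
agg = ToyAggTokenizer({
    "spl_tokens": ["<|ko|>", "<|pnc|>", "<|foreign_en|>"],
    "ko": ["▁안녕", "하세요"],
})
print(agg.token_to_id("ko", "▁안녕"))   # 3: Korean ids start after the special tokens
print(agg.id_to_token(4))               # ('ko', '하세요')
```

Adding another language is then just appending another sub-tokenizer, which is why the `agg` layout makes extension straightforward.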
`model_config.yaml` (tokenizer section):
```yaml
tokenizer:
  dir: null
  type: agg
  langs:
    spl_tokens:
      dir: null
      type: bpe
      model_path: nemo:16c4aa8fc4b842cea64b44ccdb0cfde6_tokenizer.model
      vocab_path: nemo:1fb7608d4e704a349755b8bc74826b32_vocab.txt
      spe_tokenizer_vocab: nemo:7727e9891e724123b5acb70a70f7c139_tokenizer.vocab
    ko:
      dir: null
      type: bpe
      model_path: nemo:4ea4ae1adeae41e4868bf470b856cd37_tokenizer.model
      vocab_path: nemo:15141f641292474ebbd465f65707a637_vocab.txt
      spe_tokenizer_vocab: nemo:bf66d3c13f6649839c39ec07b72bd6da_tokenizer.vocab
  custom_tokenizer:
    _target_: nemo.collections.common.tokenizers.canary_tokenizer.CanaryTokenizer
    tokenizers: null
```
## Training Details
### Training Dataset
TBD
### Prompt-Tagged Data Curation
TBD
## Evaluation
TBD
## Contact
Author: lee1jun@postech.ac.kr

## License
MIT