---
license: apache-2.0
language:
- de
tags:
- truecasing
- text-processing
- german
- nlp
- lstm
- crf
pipeline_tag: token-classification
---

# Truecaser Models

Truecasing models for restoring proper capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.

## Available Models

| File | Type | Language | Size | F1 | License | `--truecase-model` value |
|------|------|----------|------|-----|---------|------|
| `truecaser-lstm-de.bin` | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | `lstm` or `lstm-de` |
| `truecaser-lstm-en.bin` | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | `lstm-en` |
| `truecaser-lstm-es.bin` | BiLSTM char-level | Spanish | 3.2 MB | n/a | Apache-2.0 | `lstm-es` |
| `truecaser-lstm-ru.bin` | BiLSTM char-level | Russian | 4.1 MB | n/a | Apache-2.0 | `lstm-ru` |
| `truecaser-crf-de.bin` | CRF + context | German | 8.5 MB | ~95% | MIT | `crf` |
| `truecaser-de.bin` | Statistical freq | German | 1.7 MB | ~93% | MIT | `auto` |

## BiLSTM Truecaser (recommended)

Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).

- **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
- **Labels**: one label per character, L (lowercase) or U (uppercase)
- **Training data**: de: WMT monolingual text (2.6M tokens, 97.86% F1); en: Wikipedia (93.01% F1); es: WMT; ru: LORELEI
- **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
- **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0), file `wmt-truecaser-model-de.tar.gz`

### Example

```
Input:  die schnelle braune katze springt über den faulen hund
Output: Die schnelle braune Katze springt über den faulen Hund
```

Correctly handles:

- Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalized)
- Formal pronouns: "Ihnen" (capitalized)
- Compound words and proper nouns

## CRF Truecaser

Trained on 245K sentences of WMT News Crawl German using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).

- **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
- **Decode**: Viterbi over linear-chain CRF (3 labels: `lc`, `u1`, `uc`)
- **Training data**: WMT News Crawl 2023 German
- **Size**: 8.5 MB
- **License**: MIT

## Statistical Truecaser

Simple word-frequency lookup trained on WMT News Crawl 2023 German.

- **Entries**: 71,142 unique words
- **Size**: 1.7 MB
- **Approach**: for each word, pick the casing variant (lowercase/capitalized/uppercase) seen most often
- **Training data**: WMT News Crawl 2023 German (278K sentences)
- **License**: MIT

## Usage with CrispASR

```bash
# BiLSTM (recommended)
crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav

# CRF
crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav

# Statistical
crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav

# Combined with punctuation restoration
crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
```

## Conversion

```bash
# BiLSTM: download from mayhewsw, convert to binary
wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
tar xzf wmt-truecaser-model-de.tar.gz
python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin

# CRF: train a new model (the published one used WMT News Crawl 2023 German)
python models/train-truecaser-crf.py --output truecaser-crf-de.bin
```
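
## Sketch: Word-Frequency Lookup

The approach listed under Statistical Truecaser (pick the casing variant seen most often for each word) can be illustrated with a few lines of Python. This is a minimal sketch, not the code that produced `truecaser-de.bin`: the function names `build_table` and `truecase`, the in-memory dictionary, the three-sentence toy corpus, and the choices to skip sentence-initial tokens during counting and to capitalize the first output word are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_table(sentences):
    """Map each lowercased word to its most frequent surface form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Assumption: skip the first token, because sentence-initial
        # capitalization says little about a word's usual casing.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(text, table):
    """Restore casing by table lookup; unknown words stay lowercase."""
    words = [table.get(w, w) for w in text.lower().split()]
    if words:
        # Assumption: always capitalize the first word of the output.
        words[0] = words[0][:1].upper() + words[0][1:]
    return " ".join(words)


# Hypothetical toy corpus standing in for the real training data.
corpus = [
    "Die Katze schläft auf dem Sofa",
    "Der Hund jagt die Katze",
    "Heute schläft der Hund im Garten",
]
table = build_table(corpus)
print(truecase("die katze jagt den hund", table))  # prints: Die Katze jagt den Hund
```

Because the table stores whichever surface form was counted most often, it covers lowercase, capitalized, and all-uppercase variants without a separate label set. A lookup like this has no notion of context, which is consistent with the statistical model scoring below the CRF and BiLSTM models in the table above.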