---
license: apache-2.0
language:
- de
- en
- es
- ru
tags:
- truecasing
- text-processing
- german
- nlp
- lstm
- crf
pipeline_tag: token-classification
---
# Truecaser Models
Truecasing models for restoring proper capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.
## Available Models
| File | Type | Language | Size | F1 | License | Flag |
|------|------|----------|------|-----|---------|------|
| `truecaser-lstm-de.bin` | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | `lstm` or `lstm-de` |
| `truecaser-lstm-en.bin` | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | `lstm-en` |
| `truecaser-lstm-es.bin` | BiLSTM char-level | Spanish | 3.2 MB | — | Apache-2.0 | `lstm-es` |
| `truecaser-lstm-ru.bin` | BiLSTM char-level | Russian | 4.1 MB | — | Apache-2.0 | `lstm-ru` |
| `truecaser-crf-de.bin` | CRF + context | German | 8.5 MB | ~95% | MIT | `crf` |
| `truecaser-de.bin` | Statistical freq | German | 1.7 MB | ~93% | MIT | `auto` |
## BiLSTM Truecaser (recommended)
Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).
- **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
- **Labels**: L (lowercase), U (uppercase) — per character
- **Training**: per-language corpora (de: WMT monolingual text, 2.6M tokens, 97.86% F1; en: Wikipedia, 93.01% F1; es: WMT; ru: LORELEI)
- **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
- **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0) — `wmt-truecaser-model-de.tar.gz`
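As a rough sanity check on the architecture line above, the parameter count can be worked out by hand. The sketch below assumes a PyTorch-style LSTM layout (four gates, two bias vectors per direction) and float32 storage; neither detail is stated in this card, and the function names are illustrative only.

```python
# Rough parameter count for Embedding(202, 50) -> BiLSTM(50->150, 2 layers) -> Linear(300, 2).
# Assumptions (not confirmed by this card): PyTorch-style LSTM with two bias
# vectors per direction, and weights stored as 4-byte floats.

def lstm_direction_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM direction: input weights, recurrent weights, two biases."""
    gates = 4
    weights = gates * hidden_size * (input_size + hidden_size)
    biases = 2 * gates * hidden_size
    return weights + biases

def truecaser_params(vocab: int = 202, emb: int = 50, hidden: int = 150, labels: int = 2) -> int:
    """Total parameters of the embedding, the 2-layer BiLSTM, and the output layer."""
    embedding = vocab * emb
    layer1 = 2 * lstm_direction_params(emb, hidden)         # forward + backward
    layer2 = 2 * lstm_direction_params(2 * hidden, hidden)  # input is the 300-dim concatenation
    linear = 2 * hidden * labels + labels                   # weights + bias
    return embedding + layer1 + layer2 + linear

total = truecaser_params()
print(total, "parameters,", round(total * 4 / 1e6, 2), "MB as float32")
# prints: 795502 parameters, 3.18 MB as float32
```

Under these assumptions the estimate lands close to the 3.2 MB listed for the German, English, and Spanish files.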
### Example
```
Input: die schnelle braune katze springt über den faulen hund
Output: Die schnelle braune Katze springt über den faulen Hund
```
Correctly handles:
- Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
- Formal pronouns: "Ihnen" (capitalize)
- Compound words and proper nouns
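Because the model emits one L/U label per character, turning predictions back into text is a simple character-wise mapping. The following is a minimal sketch of that step; `apply_case_labels` is a hypothetical helper, not part of CrispASR or pytorch-truecaser, and the label string is written by hand rather than produced by a model.

```python
def apply_case_labels(text: str, labels: str) -> str:
    """Uppercase each character labeled 'U'; leave characters labeled 'L' unchanged."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(ch.upper() if lab == "U" else ch for ch, lab in zip(text, labels))

text = "die katze"
labels = "ULLLULLLL"  # hand-written labels for illustration (spaces are labeled L)
print(apply_case_labels(text, labels))
# prints: Die Katze
```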
## CRF Truecaser
Trained on 245K sentences of WMT News Crawl German using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).
- **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
- **Decoding**: Viterbi over a linear-chain CRF (3 labels: lc, u1, uc)
- **Training data**: WMT News Crawl 2023 German; the resulting model is 8.5 MB and MIT-licensed
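The feature list above could be realized roughly as follows. This is a sketch under stated assumptions: the feature names, the noun-suffix list, and the article list are illustrative and are not taken from the released model, and the label meanings (lc = lowercase, u1 = first letter uppercase, uc = all uppercase) are inferred from the label names.

```python
from typing import Dict, List

# Assumption: illustrative lists, not the ones used to train the released model.
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion")
ARTICLES = {"der", "die", "das", "den", "dem", "des", "ein", "eine", "einen", "einem", "einer"}

def word_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the lowercase token at position i."""
    word = tokens[i]
    return {
        "word": word,                                   # word identity
        "suffix3": word[-3:],                           # 3-char suffix
        "noun_suffix": word.endswith(NOUN_SUFFIXES),    # typical German noun ending
        "prev": tokens[i - 1] if i > 0 else "<BOS>",    # previous word
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",  # next word
        "after_article": i > 0 and tokens[i - 1] in ARTICLES,       # article context
    }

def apply_label(word: str, label: str) -> str:
    """Apply a predicted label, assuming lc/u1/uc mean lowercase/capitalized/all caps."""
    if label == "u1":
        return word[:1].upper() + word[1:]
    if label == "uc":
        return word.upper()
    return word

tokens = ["die", "zeitung", "liegt", "hier"]
print(word_features(tokens, 1))
print(apply_label("zeitung", "u1"))
# second print: Zeitung
```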
## Statistical Truecaser
Simple word-frequency lookup trained on WMT News Crawl 2023 German.
- **Entries**: 71,142 unique words
- **Size**: 1.7 MB
- **Approach**: for each word, pick the casing variant (lowercase, capitalized, or uppercase) seen most often in the training data
- **Training data**: WMT News Crawl 2023 German (278K sentences), MIT license
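The frequency-lookup idea can be sketched in a few lines. This is a toy illustration on a three-sentence corpus, not the training script or file format used for `truecaser-de.bin`; `train_table` and `truecase` are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable

def train_table(sentences: Iterable[str]) -> Dict[str, str]:
    """Map each lowercased word to the casing variant seen most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {key: variants.most_common(1)[0][0] for key, variants in counts.items()}

def truecase(text: str, table: Dict[str, str]) -> str:
    """Replace each known word with its most frequent casing; keep unknown words as-is."""
    return " ".join(table.get(token, token) for token in text.lower().split())

corpus = ["Die Katze schläft", "Der Hund jagt die Katze", "Wir sehen die NATO"]
table = train_table(corpus)
print(truecase("die katze sieht die nato", table))
# prints: die Katze sieht die NATO
```

A pure lookup has no notion of context, so words whose casing depends on their role in the sentence always receive their single most frequent form.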
## Usage with CrispASR
```bash
# BiLSTM (recommended)
crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav
# CRF
crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav
# Statistical
crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav
# Combined with punctuation restoration
crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
```
## Conversion
```bash
# BiLSTM: download from mayhewsw, convert to binary
wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
tar xzf wmt-truecaser-model-de.tar.gz
python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin
# CRF: train on WMT News Crawl German
python models/train-truecaser-crf.py --output truecaser-crf-de.bin
```