truecaser-de / README.md
cstr's picture
Upload README.md with huggingface_hub
34a7d0b verified
---
license: apache-2.0
language:
- de
tags:
- truecasing
- text-processing
- german
- nlp
- lstm
- crf
pipeline_tag: token-classification
---
# Truecaser Models
Truecasing models for restoring proper capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.
## Available Models
| File | Type | Language | Size | F1 | License | Flag |
|------|------|----------|------|-----|---------|------|
| `truecaser-lstm-de.bin` | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | `lstm` or `lstm-de` |
| `truecaser-lstm-en.bin` | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | `lstm-en` |
| `truecaser-lstm-es.bin` | BiLSTM char-level | Spanish | 3.2 MB | — | Apache-2.0 | `lstm-es` |
| `truecaser-lstm-ru.bin` | BiLSTM char-level | Russian | 4.1 MB | — | Apache-2.0 | `lstm-ru` |
| `truecaser-crf-de.bin` | CRF + context | German | 8.5 MB | ~95% | MIT | `crf` |
| `truecaser-de.bin` | Statistical freq | German | 1.7 MB | ~93% | MIT | `auto` |
## BiLSTM Truecaser (recommended)
Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).
- **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
- **Labels**: L (lowercase), U (uppercase) — per character
- **Training**: WMT monolingual text (de: 2.6M tokens, 97.86% F1; en: Wikipedia, 93.01% F1; es: WMT; ru: LORELEI)
- **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
- **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0) — `wmt-truecaser-model-de.tar.gz`
### Example
```
Input: die schnelle braune katze springt über den faulen hund
Output: Die schnelle braune Katze springt über den faulen Hund
```
Correctly handles:
- Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
- Formal pronouns: "Ihnen" (capitalize)
- Compound words and proper nouns
## CRF Truecaser
Trained on 245K sentences of WMT News Crawl German using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).
- **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
- **Decode**: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
- **Training data**: WMT News Crawl 2023 German (8.5 MB model, MIT license)
## Statistical Truecaser
Simple word-frequency lookup trained on WMT News Crawl 2023 German.
- **Entries**: 71,142 unique words
- **Size**: 1.7 MB
- **Approach**: for each word, pick the casing variant (lowercase/capitalize/uppercase) seen most often
- **Training data**: WMT News Crawl 2023 German (278K sentences), MIT license
## Usage with CrispASR
```bash
# BiLSTM (recommended)
crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav
# CRF
crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav
# Statistical
crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav
# Combined with punctuation restoration
crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
```
## Conversion
```bash
# BiLSTM: download from mayhewsw, convert to binary
wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
tar xzf wmt-truecaser-model-de.tar.gz
python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin
# CRF: train from Wikipedia
python models/train-truecaser-crf.py --output truecaser-crf-de.bin
```