---
license: apache-2.0
language:
- de
tags:
- truecasing
- text-processing
- german
- nlp
- lstm
- crf
pipeline_tag: token-classification
---

# Truecaser Models

Truecasing models for restoring proper capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.

## Available Models

| File | Type | Language | Size | F1 | License | Flag |
|------|------|----------|------|-----|---------|------|
| `truecaser-lstm-de.bin` | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | `lstm` or `lstm-de` |
| `truecaser-lstm-en.bin` | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | `lstm-en` |
| `truecaser-lstm-es.bin` | BiLSTM char-level | Spanish | 3.2 MB | — | Apache-2.0 | `lstm-es` |
| `truecaser-lstm-ru.bin` | BiLSTM char-level | Russian | 4.1 MB | — | Apache-2.0 | `lstm-ru` |
| `truecaser-crf-de.bin` | CRF + context | German | 8.5 MB | ~95% | MIT | `crf` |
| `truecaser-de.bin` | Statistical freq | German | 1.7 MB | ~93% | MIT | `auto` |

## BiLSTM Truecaser (recommended)

Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).

- **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
- **Labels**: L (lowercase), U (uppercase) — per character
- **Training**: corpus varies by language (de: WMT monolingual text, 2.6M tokens, 97.86% F1; en: Wikipedia, 93.01% F1; es: WMT; ru: LORELEI)
- **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
- **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0) — `wmt-truecaser-model-de.tar.gz`
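
As a rough sanity check on the architecture line above, the parameter count can be worked out by hand. This is a sketch under two assumptions that the model card does not state: that the LSTM follows PyTorch's `nn.LSTM` parameter layout (four gates, two bias vectors per gate), and that weights are stored as 32-bit floats. Under those assumptions the result is close to the 3.2 MB listed for the German, English, and Spanish files.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one LSTM direction, assuming PyTorch's nn.LSTM layout."""
    # 4 gates, each with input weights, recurrent weights, and two bias vectors.
    return 4 * hidden_size * (input_size + hidden_size + 2)


# Sizes taken from the architecture line: Embedding(202, 50),
# BiLSTM(50 -> 150, 2 layers), Linear(300, 2).
VOCAB, EMB, HIDDEN, LABELS = 202, 50, 150, 2

embedding = VOCAB * EMB
layer1 = 2 * lstm_params(EMB, HIDDEN)         # forward + backward direction
layer2 = 2 * lstm_params(2 * HIDDEN, HIDDEN)  # input is the concatenated 2 x 150
linear = 2 * HIDDEN * LABELS + LABELS         # weights + bias

total = embedding + layer1 + layer2 + linear
print(total)                        # 795502
print(round(total * 4 / 1e6, 2))    # 3.18 (MB, assuming 4 bytes per weight)
```

The Russian file is larger (4.1 MB), which would be consistent with a larger character vocabulary, but that is a guess rather than something the source confirms.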

### Example

```
Input:  die schnelle braune katze springt über den faulen hund
Output: Die schnelle braune Katze springt über den faulen Hund
```

Correctly handles:
- Adjective vs noun: "braune" (stays lowercase) vs "Katze" (capitalized)
- Formal pronouns: "Ihnen" (capitalized)
- Compound words and proper nouns
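
The per-character L/U labelling can be illustrated without the network itself. The sketch below only shows how a predicted label sequence would be applied to the lowercase input; the function name `apply_case_labels` and the hand-written labels are hypothetical and stand in for real model output.

```python
def apply_case_labels(text, labels):
    """Apply one 'L' or 'U' label per character to a lowercase string."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(
        ch.upper() if label == "U" else ch.lower()
        for ch, label in zip(text, labels)
    )


text = "die schnelle braune katze"

# Hand-written labels standing in for model predictions:
# uppercase the sentence start and the first letter of the noun.
labels = ["L"] * len(text)
labels[0] = "U"
labels[text.index("katze")] = "U"

print(apply_case_labels(text, labels))  # Die schnelle braune Katze
```

Because the decision is made per character rather than per word, the same mechanism can in principle produce mixed-case tokens as well as plain capitalization.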

## CRF Truecaser

Trained on 245K sentences of WMT News Crawl German using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).

- **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
- **Decode**: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
- **Training data**: WMT News Crawl 2023 German
- **Size / license**: 8.5 MB, MIT
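
The feature set listed above could be sketched as follows. This is not the training script's actual code: the article list, the noun-suffix list, the feature names, and the reading of the labels (`lc` = lowercase, `u1` = first letter uppercase, `uc` = all uppercase) are assumptions made for illustration.

```python
# Hypothetical lists; the real training script may use different ones.
ARTICLES = {"der", "die", "das", "den", "dem", "des",
            "ein", "eine", "einen", "einem", "einer", "eines"}
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion")


def token_features(tokens, i):
    """Build a feature dict for the lowercase token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
    return {
        "word": word,                                  # word identity
        "suffix3": word[-3:],                          # 3-char suffix
        "noun_suffix": word.endswith(NOUN_SUFFIXES),   # typical noun ending
        "prev": prev_word,                             # previous word
        "next": next_word,                             # next word
        "after_article": prev_word in ARTICLES,        # article context
    }


def apply_label(word, label):
    """Apply a word-level casing label (assumed meaning of lc / u1 / uc)."""
    if label == "u1":
        return word[:1].upper() + word[1:]
    if label == "uc":
        return word.upper()
    return word  # "lc": leave lowercase


tokens = ["die", "zeitung", "ist", "neu"]
print(token_features(tokens, 1)["after_article"])  # True
print(apply_label("zeitung", "u1"))                # Zeitung
```

Dicts of string, boolean, and numeric values like these are the kind of per-token input that python-crfsuite accepts for a sequence, so a sentence would be passed as a list of such dicts, one per token.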

## Statistical Truecaser

Simple word-frequency lookup trained on WMT News Crawl 2023 German.

- **Entries**: 71,142 unique words
- **Size**: 1.7 MB
- **Approach**: for each word, pick the casing variant (lowercase, capitalized, or all uppercase) seen most often in the training data
- **Training data**: WMT News Crawl 2023 German (278K sentences)
- **License**: MIT
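
The frequency-lookup approach can be sketched in a few lines. The function names and the toy corpus are hypothetical, and skipping the sentence-initial token during counting is an assumption (its capitalization reflects position rather than the word itself); the released model may have been built differently.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Map each lowercase word to its most frequently seen casing variant."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Assumption: ignore the first token, whose casing is positional.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {
        lower: variants.most_common(1)[0][0]
        for lower, variants in counts.items()
    }


def truecase(text, table):
    """Replace each known word with its stored variant; keep unknown words."""
    return " ".join(table.get(token, token) for token in text.split())


corpus = [
    "Der Hund schläft .",
    "Ein Hund bellt .",
    "Die Katze und der Hund .",
]
table = train(corpus)
print(truecase("der hund bellt", table))  # der Hund bellt
```

A pure lookup like this has no notion of context, so it cannot capitalize the start of a sentence or tell apart words whose casing depends on their role, which is where the CRF and BiLSTM models have an advantage.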

## Usage with CrispASR

```bash
# BiLSTM (recommended)
crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav

# CRF
crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav

# Statistical
crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav

# Combined with punctuation restoration
crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
```

## Conversion

```bash
# BiLSTM: download from mayhewsw, convert to binary
wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
tar xzf wmt-truecaser-model-de.tar.gz
python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin

# CRF: train (the released model was trained on WMT News Crawl German)
python models/train-truecaser-crf.py --output truecaser-crf-de.bin
```