cstr committed on
Commit b6517e7 · verified · 1 Parent(s): cea3dc2

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +62 -34
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- license: mit
3
  language:
4
  - de
5
  tags:
@@ -7,57 +7,85 @@ tags:
7
  - text-processing
8
  - german
9
  - nlp
10
- - statistical
 
11
  pipeline_tag: token-classification
12
  ---
13
 
14
- # German Statistical Truecaser
15
 
16
- Statistical truecasing model for German, trained on 2M lines of German Wikipedia (CC-BY-SA source, model released under MIT).
17
 
18
- Restores proper capitalization of German nouns, proper names, and acronyms in lowercase ASR output.
19
 
20
- ## How It Works
21
 
22
- For each word (lowercased), the model stores frequency counts of three casing variants:
23
- - **lc**: all lowercase (e.g. "die")
24
- - **u1**: first letter capitalized (e.g. "Katze")
25
- - **uc**: all uppercase (e.g. "NATO")
26
 
27
- At inference, the variant with the highest count is applied. Sentence-initial words are always capitalized.
28
 
29
- ## Stats
30
 
31
  - **Entries**: 375,283 unique words
32
- - **Training data**: German Wikipedia (3M lines, mid-sentence words only, min count 5)
33
- - **File size**: 9.2 MB
34
- - **Inference**: instant (hash table lookup, no neural network)
35
 
36
  ## Usage with CrispASR
37
 
38
  ```bash
39
- # Auto-download German truecaser
40
- crispasr --backend moonshine -m model.gguf --truecase-model auto -f audio.wav
41
-
42
- # Combined with punctuation restoration
43
- crispasr --backend wav2vec2-de -m model.gguf \
44
- --punc-model punctuate-all --truecase-model auto -f audio.wav
45
- ```
46
 
47
- ## Example
 
48
 
49
- | Stage | Output |
50
- |-------|--------|
51
- | Raw ASR | `die schnelle braune katze springt über den faulen hund` |
52
- | + punctuation | `die schnelle braune katze springt über den faulen hund.` |
53
- | + truecasing | `Die schnelle Braune Katze springt über den faulen Hund.` |
54
 
55
- ## Limitations
56
 
57
- - Statistical only — no context awareness (adjective "braune" vs surname "Braun" are ambiguous)
58
- - German-specific (separate models needed for other languages)
59
- - Does not handle mixed-case words like "mRNA" or "iPhone"
60
 
61
- ## License
62
 
63
- MIT

1
  ---
2
+ license: apache-2.0
3
  language:
4
  - de
5
  tags:
 
7
  - text-processing
8
  - german
9
  - nlp
10
+ - lstm
11
+ - crf
12
  pipeline_tag: token-classification
13
  ---
14
 
15
+ # German Truecaser Models
16
 
17
+ Three truecasing models for restoring proper German capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.
18
 
19
+ ## Available Models
20
 
21
+ | File | Type | Size | F1 | License | Recommended |
22
+ |------|------|------|-----|---------|-------------|
23
+ | `truecaser-lstm-de.bin` | BiLSTM char-level | 3.2 MB | 97.9% | Apache-2.0 | **Yes** |
24
+ | `truecaser-crf-de.bin` | CRF + context | 24 MB | ~95% | MIT | |
25
+ | `truecaser-de.bin` | Statistical freq | 9.2 MB | ~93% | MIT | |
26
 
27
+ ## BiLSTM Truecaser (recommended)
28
 
29
+ Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).
30
 
31
+ - **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
32
+ - **Labels**: L (lowercase), U (uppercase) — per character
33
+ - **Training**: 2.6M tokens of WMT German monolingual text, 97.86% F1
34
+ - **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
35
+ - **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0) — `wmt-truecaser-model-de.tar.gz`
36
+
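The per-character labelling described above can be illustrated with a minimal sketch. It only shows how a sequence of `L`/`U` labels would be applied to lowercase text; the labels below are written by hand for illustration and are not output of the model, and the helper name `apply_case_labels` is hypothetical rather than part of CrispASR or pytorch-truecaser.

```python
def apply_case_labels(text: str, labels: str) -> str:
    """Uppercase each character labelled 'U' and lowercase the rest.

    Hypothetical helper: it assumes exactly one label per input character,
    including spaces, which is one plausible reading of "per character".
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(
        ch.upper() if label == "U" else ch.lower()
        for ch, label in zip(text, labels)
    )


# Hand-written labels, standing in for what a tagger might predict.
text = "die katze"
labels = "ULLLULLLL"
print(apply_case_labels(text, labels))  # Die Katze
```

Because the decision is made per character, the same mechanism could in principle produce all-caps acronyms or mixed-case words, depending on what the tagger predicts.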
37
+ ### Example
38
+
39
+ ```
40
+ Input: die schnelle braune katze springt über den faulen hund
41
+ Output: Die schnelle braune Katze springt über den faulen Hund
42
+ ```
43
+
44
+ Correctly handles:
45
+ - Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
46
+ - Formal pronouns: "Ihnen" (capitalize)
47
+ - Compound words and proper nouns
48
+
49
+ ## CRF Truecaser
50
+
51
+ Trained on 860K German Wikipedia sentences using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).
52
+
53
+ - **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
54
+ - **Decode**: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
55
+ - **Training data**: German Wikipedia (CC-BY-SA), model released under MIT
56
+
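As a rough sketch of the feature types listed above, the following shows how per-token feature dicts might be built for a lowercase sentence. Everything here is an assumption for illustration: the feature names, the `NOUN_SUFFIXES` and `ARTICLES` lists, and the boundary markers are not taken from the actual training script.

```python
# Assumed, illustrative lists; the real training script may use different ones.
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion")
ARTICLES = {"der", "die", "das", "den", "dem", "des",
            "ein", "eine", "einen", "einem", "einer", "eines"}


def token_features(tokens, i):
    """Build a feature dict for the lowercase token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "word": word,                                   # word identity
        "suffix3": word[-3:],                           # 3-char suffix
        "noun_suffix": word.endswith(NOUN_SUFFIXES),    # typical noun ending
        "prev_word": prev_word,                         # left context
        "next_word": next_word,                         # right context
        "after_article": prev_word in ARTICLES,         # article context
    }


def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


features = sentence_features("die zeitung liegt dort".split())
print(features[1])
```

A linear-chain CRF would then score label sequences (lc, u1, uc) over such feature sequences, so a word following an article can be pushed toward `u1` even if it is rare in the training data.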
57
+ ## Statistical Truecaser
58
+
59
+ Simple word-frequency lookup trained on 3M lines of German Wikipedia.
60
 
61
  - **Entries**: 375,283 unique words
62
+ - **Approach**: for each word, pick the casing variant (lowercase/capitalize/uppercase) seen most often
63
+ - **Training data**: German Wikipedia (CC-BY-SA), model released under MIT
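The frequency-lookup approach can be sketched in a few lines. This is a simplified illustration, not the code that produced `truecaser-de.bin`: whitespace tokenization, the function names, and the `min_count` default are assumptions. Skipping sentence-initial words during counting and always capitalizing the first word follow the description in the earlier README text.

```python
from collections import Counter, defaultdict


def casing_variant(token):
    """Classify a token as 'lc', 'u1' or 'uc'; return None for anything else."""
    if token.islower():
        return "lc"
    if token[:1].isupper() and (len(token) == 1 or token[1:].islower()):
        return "u1"
    if token.isupper():
        return "uc"
    return None  # mixed case such as "iPhone", or no letters at all


def train(sentences, min_count=1):
    """Count casing variants of mid-sentence words and keep the most frequent."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:  # skip the sentence-initial word
            variant = casing_variant(token)
            if variant is not None:
                counts[token.lower()][variant] += 1
    return {
        word: variants.most_common(1)[0][0]
        for word, variants in counts.items()
        if sum(variants.values()) >= min_count
    }


def truecase(text, table):
    """Apply the stored variant to each lowercase token; capitalize the first."""
    out = []
    for i, token in enumerate(text.split()):
        variant = table.get(token.lower(), "lc")
        if variant == "uc":
            token = token.upper()
        elif variant == "u1" or i == 0:
            token = token[:1].upper() + token[1:]
        out.append(token)
    return " ".join(out)


table = train(["Dort sitzt die Katze", "Er mag die Katze", "Sie lobt die NATO"])
print(truecase("die katze mag die nato", table))  # Die Katze mag die NATO
```

Since each word is looked up in isolation, a lookup of this kind cannot use context to separate an adjective from an identically spelled name, which matches the limitation noted for the statistical model.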
 
64
 
65
  ## Usage with CrispASR
66
 
67
  ```bash
68
+ # BiLSTM (recommended)
69
+ crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav
70
 
71
+ # CRF
72
+ crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav
73
 
74
+ # Statistical
75
+ crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav
76
 
77
+ # Combined with punctuation restoration
78
+ crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
79
+ ```
80
 
81
+ ## Conversion
82
 
83
+ ```bash
84
+ # BiLSTM: download from mayhewsw, convert to binary
85
+ wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
86
+ tar xzf wmt-truecaser-model-de.tar.gz
87
+ python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin
88
 
89
+ # CRF: train from Wikipedia
90
+ python models/train-truecaser-crf.py --output truecaser-crf-de.bin
91
+ ```