Upload README.md with huggingface_hub
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-license:
 language:
 - de
 tags:
@@ -7,57 +7,85 @@ tags:
 - text-processing
 - german
 - nlp
--
 pipeline_tag: token-classification
 ---

-# German

-

-

-

-
-- **lc**: all lowercase (e.g. "die")
-- **u1**: first letter capitalized (e.g. "Katze")
-- **uc**: all uppercase (e.g. "NATO")

-

-

 - **Entries**: 375,283 unique words
-- **
-- **
-- **Inference**: instant (hash table lookup, no neural network)

 ## Usage with CrispASR

```bash
-#
-crispasr --backend
-
-# Combined with punctuation restoration
-crispasr --backend wav2vec2-de -m model.gguf \
-  --punc-model punctuate-all --truecase-model auto -f audio.wav
```

-#

-
-
-| Raw ASR | `die schnelle braune katze springt über den faulen hund` |
-| + punctuation | `die schnelle braune katze springt über den faulen hund.` |
-| + truecasing | `Die schnelle Braune Katze springt über den faulen Hund.` |

-#

-
-- German-specific (separate models needed for other languages)
-- Does not handle mixed-case words like "mRNA" or "iPhone"

-

-

@@ -1,5 +1,5 @@
 ---
+license: apache-2.0
 language:
 - de
 tags:
@@ -7,57 +7,85 @@ tags:
 - text-processing
 - german
 - nlp
+- lstm
+- crf
 pipeline_tag: token-classification
 ---

+# German Truecaser Models

+Three truecasing models for restoring proper German capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.

+## Available Models

+| File | Type | Size | F1 | License | Recommended |
+|------|------|------|-----|---------|-------------|
+| `truecaser-lstm-de.bin` | BiLSTM char-level | 3.2 MB | 97.9% | Apache-2.0 | **Yes** |
+| `truecaser-crf-de.bin` | CRF + context | 24 MB | ~95% | MIT | |
+| `truecaser-de.bin` | Statistical freq | 9.2 MB | ~93% | MIT | |

+## BiLSTM Truecaser (recommended)

+Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).

+- **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
+- **Labels**: L (lowercase), U (uppercase) — per character
+- **Training**: 2.6M tokens of WMT German monolingual text, 97.86% F1
+- **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
+- **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0) — `wmt-truecaser-model-de.tar.gz`
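
As a rough sanity check on the architecture line above, the parameter count can be worked out by hand. This is a sketch under two assumptions that the README does not state: the LSTM uses the standard PyTorch parameter layout (four gates, two bias vectors per direction), and weights are stored as float32. The function names are illustrative only.

```python
def lstm_layer_params(input_size: int, hidden_size: int, bidirectional: bool = True) -> int:
    """Parameters of one LSTM layer, assuming PyTorch's layout:
    4 gates, each with input weights, recurrent weights and two bias vectors."""
    per_direction = 4 * (
        input_size * hidden_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # two bias vectors (assumed)
    )
    return per_direction * (2 if bidirectional else 1)


def truecaser_params(vocab: int = 202, emb: int = 50, hidden: int = 150, labels: int = 2) -> int:
    """Total parameters for Embedding -> 2-layer BiLSTM -> Linear."""
    embedding = vocab * emb
    layer1 = lstm_layer_params(emb, hidden)
    layer2 = lstm_layer_params(2 * hidden, hidden)  # input is the concatenated directions
    linear = 2 * hidden * labels + labels           # weights plus bias
    return embedding + layer1 + layer2 + linear


total = truecaser_params()
size_mb = total * 4 / 1e6  # 4 bytes per float32 weight (assumed)
print(f"{total:,} parameters, about {size_mb:.2f} MB")
```

Under these assumptions the count is 795,502 parameters, about 3.18 MB as float32, which is in line with the 3.2 MB listed for `truecaser-lstm-de.bin`.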
+
+### Example
+
```
+Input: die schnelle braune katze springt über den faulen hund
+Output: Die schnelle braune Katze springt über den faulen Hund
```
+
+Correctly handles:
+- Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
+- Formal pronouns: "Ihnen" (capitalize)
+- Compound words and proper nouns
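
The per-character L/U labels described for the BiLSTM model can be turned back into text with a few lines of code. The sketch below only illustrates that final step; the function name and the hand-written label sequence are hypothetical, and the labels would normally come from the model rather than being typed in.

```python
from typing import Sequence


def apply_char_labels(text: str, labels: Sequence[str]) -> str:
    """Apply one label per character: "U" uppercases it, "L" lowercases it."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    # Note: str.upper() maps "ß" to "SS", so a "U" label on "ß" changes the length.
    return "".join(
        ch.upper() if label == "U" else ch.lower()
        for ch, label in zip(text, labels)
    )


text = "die katze"
labels = ["U", "L", "L", "L", "U", "L", "L", "L", "L"]  # hand-written for illustration
print(apply_char_labels(text, labels))
```

Because the decision is made per character rather than per word, this scheme can in principle express casings beyond the three word-level classes, although whether the model predicts them well is a separate question.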
+
+## CRF Truecaser
+
+Trained on 860K German Wikipedia sentences using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).
+
+- **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
+- **Decode**: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
+- **Training data**: German Wikipedia (CC-BY-SA), model released under MIT
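
To make the feature list above concrete, here is a minimal sketch of a per-token feature function. It is not the feature extractor from the training script, which is not shown here: the suffix list, the article list, the feature names and the `<BOS>`/`<EOS>` padding tokens are all assumptions chosen for illustration.

```python
# Assumed lists; the real training script may use different ones.
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion")
ARTICLES = {"der", "die", "das", "den", "dem", "des",
            "ein", "eine", "einen", "einem", "einer", "eines"}


def token_features(tokens, i):
    """Build a feature dict for the lowercase token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<BOS>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return {
        "word": word,                                   # word identity
        "suffix3": word[-3:],                           # 3-char suffix
        "noun_suffix": word.endswith(NOUN_SUFFIXES),    # typical noun ending
        "prev_word": prev_word,                         # left context
        "next_word": next_word,                         # right context
        "prev_is_article": prev_word in ARTICLES,       # article context
    }


tokens = "die schnelle braune katze".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
for token, feats in zip(tokens, features):
    print(token, feats)
```

A sequence of such dicts, paired with one of the labels lc, u1 or uc per token, is the kind of input a linear-chain CRF trainer consumes.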
|
| 56 |
+
|
| 57 |
+
## Statistical Truecaser
|
| 58 |
+
|
| 59 |
+
Simple word-frequency lookup trained on 3M lines of German Wikipedia.
|
| 60 |
|
| 61 |
- **Entries**: 375,283 unique words
|
| 62 |
+
- **Approach**: for each word, pick the casing variant (lowercase/capitalize/uppercase) seen most often
|
| 63 |
+
- **Training data**: German Wikipedia (CC-BY-SA), model released under MIT
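
The frequency-lookup approach described above can be sketched in a few lines. This is an illustration, not the code that produced `truecaser-de.bin`: the function names are hypothetical, the toy corpus is made up, and skipping the sentence-initial token during counting is an assumption (it avoids counting words that are capitalized only because they start a sentence).

```python
from collections import Counter, defaultdict


def build_table(lines):
    """Map each lowercased word to its most frequent casing variant."""
    counts = defaultdict(Counter)
    for line in lines:
        for token in line.split()[1:]:  # assumption: skip the sentence-initial token
            # Keep only the three variants named above; mixed case is ignored.
            if token in (token.lower(), token.capitalize(), token.upper()):
                counts[token.lower()][token] += 1
    return {word: variants.most_common(1)[0][0] for word, variants in counts.items()}


def truecase(text, table):
    """Replace each token by its most frequent variant; unknown words stay as they are."""
    return " ".join(table.get(token, token) for token in text.split())


corpus = [
    "Die Katze schläft",
    "Eine Katze und die NATO",
    "Er sah die Katze",
]
table = build_table(corpus)
print(truecase("die katze und die nato", table))
```

Because each word is looked up in isolation, a table like this cannot separate an adjective from a homographic noun, and it never capitalizes a sentence-initial word unless that word is usually capitalized anyway.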

 ## Usage with CrispASR

```bash
+# BiLSTM (recommended)
+crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav

+# CRF
+crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav

+# Statistical
+crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav

+# Combined with punctuation restoration
+crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
```

+## Conversion

```bash
+# BiLSTM: download from mayhewsw, convert to binary
+wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
+tar xzf wmt-truecaser-model-de.tar.gz
+python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin

+# CRF: train from Wikipedia
+python models/train-truecaser-crf.py --output truecaser-crf-de.bin
```