cstr committed · Commit 73c2687 · verified · 1 Parent(s): 08b20bc

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +11 -8
README.md CHANGED
@@ -12,15 +12,18 @@ tags:
12
  pipeline_tag: token-classification
13
  ---
14
 
15
- # German Truecaser Models
16
 
17
- Three truecasing models for restoring proper German capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.
18
 
19
  ## Available Models
20
 
21
- | File | Type | Size | F1 | License | Recommended |
22
- |------|------|------|-----|---------|-------------|
23
- | `truecaser-lstm-de.bin` | BiLSTM char-level | 3.2 MB | 97.9% | Apache-2.0 | **Yes** |
 
 
 
24
  | `truecaser-crf-de.bin` | CRF + context | 24 MB | ~95% | MIT | |
25
  | `truecaser-de.bin` | Statistical freq | 9.2 MB | ~93% | MIT | |
26
 
@@ -30,7 +33,7 @@ Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-
30
 
31
  - **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
32
  - **Labels**: L (lowercase), U (uppercase), per character
33
- - **Training**: 2.6M tokens of WMT German monolingual text, 97.86% F1
34
  - **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
35
  - **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0), file `wmt-truecaser-model-de.tar.gz`
36
 
@@ -48,11 +51,11 @@ Correctly handles:
48
 
49
  ## CRF Truecaser
50
 
51
- Trained on 860K German Wikipedia sentences using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).
52
 
53
  - **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
54
  - **Decode**: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
55
- - **Training data**: German Wikipedia (CC-BY-SA), model released under MIT
56
 
57
  ## Statistical Truecaser
58
 
 
12
  pipeline_tag: token-classification
13
  ---
14
 
15
+ # Truecaser Models
16
 
17
+ Truecasing models for restoring proper capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.
18
 
19
  ## Available Models
20
 
21
+ | File | Type | Language | Size | F1 | License | Flag |
22
+ |------|------|----------|------|-----|---------|------|
23
+ | `truecaser-lstm-de.bin` | BiLSTM char-level | German | 3.2 MB | 97.9% | Apache-2.0 | `lstm` or `lstm-de` |
24
+ | `truecaser-lstm-en.bin` | BiLSTM char-level | English | 3.2 MB | 93.0% | Apache-2.0 | `lstm-en` |
25
+ | `truecaser-lstm-es.bin` | BiLSTM char-level | Spanish | 3.2 MB | n/a | Apache-2.0 | `lstm-es` |
26
+ | `truecaser-lstm-ru.bin` | BiLSTM char-level | Russian | 4.1 MB | n/a | Apache-2.0 | `lstm-ru` |
27
  | `truecaser-crf-de.bin` | CRF + context | German | 24 MB | ~95% | MIT | |
28
  | `truecaser-de.bin` | Statistical freq | German | 9.2 MB | ~93% | MIT | |
29
 
 
33
 
34
  - **Architecture**: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
35
  - **Labels**: L (lowercase), U (uppercase), per character
36
+ - **Training**: de: WMT monolingual text (2.6M tokens, 97.86% F1); en: Wikipedia (93.01% F1); es: WMT; ru: LORELEI
37
  - **Original paper**: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
38
  - **Source**: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0), file `wmt-truecaser-model-de.tar.gz`
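As a sanity check on the architecture line, the parameter count can be worked out by hand. The sketch below assumes the PyTorch LSTM convention (four gates per direction, each with input weights, recurrent weights, and two bias vectors) and float32 storage. Both are assumptions; the README does not describe the `.bin` layout, and the function names are illustrative only.

```python
def lstm_direction_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM direction in one layer.

    Assumes the PyTorch layout: 4 gates, each with input weights,
    recurrent weights, and two bias vectors.
    """
    return 4 * (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


def truecaser_params(vocab: int = 202, emb: int = 50, hidden: int = 150,
                     layers: int = 2, labels: int = 2) -> int:
    """Total parameters for Embedding -> BiLSTM -> Linear."""
    total = vocab * emb                      # Embedding(202, 50)
    layer_input = emb
    for _ in range(layers):
        # forward and backward direction of this layer
        total += 2 * lstm_direction_params(layer_input, hidden)
        layer_input = 2 * hidden             # next layer sees both directions
    total += 2 * hidden * labels + labels    # Linear(300, 2) weights + bias
    return total


n = truecaser_params()
print(n, "parameters,", round(n * 4 / 1e6, 2), "MB as float32")
# prints: 795502 parameters, 3.18 MB as float32
```

Under these assumptions the result is close to the 3.2 MB listed for the German, English, and Spanish LSTM files; the larger Russian file would be consistent with a bigger character vocabulary, though the README does not say so.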
39
 
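The per-character L/U labels translate to output text in a straightforward way. A minimal sketch, assuming the model returns one label per input character (the function name `apply_case_labels` is hypothetical and not part of CrispASR or pytorch-truecaser):

```python
def apply_case_labels(text: str, labels: str) -> str:
    """Apply one 'L' or 'U' label to each character of text."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    out = []
    for ch, label in zip(text, labels):
        if label == "U":
            out.append(ch.upper())
        elif label == "L":
            out.append(ch.lower())
        else:
            raise ValueError(f"unknown label: {label!r}")
    return "".join(out)


print(apply_case_labels("das haus", "ULLLULLL"))  # prints: Das Haus
```

Non-letters such as spaces are unaffected by either label, so they can carry an `L` without changing the output.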
 
51
 
52
  ## CRF Truecaser
53
 
54
+ Trained on 245K sentences of WMT News Crawl German using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).
55
 
56
  - **Features**: word identity, 3-char suffix, noun suffixes, previous/next word, article context
57
  - **Decode**: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
58
+ - **Training data**: WMT News Crawl 2023 German (8.5 MB model, MIT license)
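The feature list above could be turned into per-token feature dictionaries along the following lines. This is a sketch only: the article set, the noun-suffix list, the feature names, and the reading of the labels (`lc` = lowercase, `u1` = first letter uppercase, `uc` = all uppercase) are assumptions, not taken from the actual training script.

```python
# Assumed word lists for illustration; the real model's lists are not documented.
ARTICLES = {"der", "die", "das", "den", "dem", "des", "ein", "eine", "einen"}
NOUN_SUFFIXES = ("ung", "heit", "keit", "schaft", "tion")


def token_features(tokens: list[str], i: int) -> dict:
    """Build a feature dict for the lowercase token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "word": word,                                  # word identity
        "suffix3": word[-3:],                          # 3-char suffix
        "noun_suffix": word.endswith(NOUN_SUFFIXES),   # typical noun ending
        "prev": prev_word,                             # previous word
        "next": next_word,                             # next word
        "after_article": prev_word in ARTICLES,        # article context
    }


def apply_label(word: str, label: str) -> str:
    """Apply a decoded label, assuming lc / u1 / uc semantics."""
    if label == "u1":
        return word[:1].upper() + word[1:]
    if label == "uc":
        return word.upper()
    return word  # "lc": leave lowercase


tokens = ["die", "zeitung", "ist", "neu"]
print(token_features(tokens, 1))
print(apply_label("zeitung", "u1"))  # prints: Zeitung
```

Dictionaries of this shape, one per token, are the kind of input python-crfsuite accepts for training and tagging; decoding then picks the best label sequence with Viterbi.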
59
 
60
  ## Statistical Truecaser
61