Commit c860dd7 (verified) by kaldan · 1 Parent(s): 08b65f0

Upload KenLM model

Files changed (2)
  1. BoKenlm-sp.arpa +2 -2
  2. README.md +17 -17
BoKenlm-sp.arpa CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ba82428afdbf0c87689cd241d521d57f7431200bb8e2d318a7ed9e854a28e1ef
- size 1318866790
+ oid sha256:afae57135f995f6e5817bbb9d77bb9bf8879ad40e3b1353713660cbd7d938612
+ size 1522379816
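As a quick sanity check on the pointer change above, the following sketch parses the old and new Git LFS pointer contents and reports how much the ARPA file grew. The helper names `parse_lfs_pointer` and `size_change` are illustrative and not part of the repository; the pointer values are copied from the diff.

```python
# Git LFS pointer contents before and after this commit (copied from the diff).
OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:ba82428afdbf0c87689cd241d521d57f7431200bb8e2d318a7ed9e854a28e1ef
size 1318866790
"""

NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:afae57135f995f6e5817bbb9d77bb9bf8879ad40e3b1353713660cbd7d938612
size 1522379816
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


def size_change(old_text: str, new_text: str) -> tuple:
    """Return (delta in bytes, delta as a percentage of the old size)."""
    old_size = int(parse_lfs_pointer(old_text)["size"])
    new_size = int(parse_lfs_pointer(new_text)["size"])
    delta = new_size - old_size
    return delta, 100.0 * delta / old_size


if __name__ == "__main__":
    delta, pct = size_change(OLD_POINTER, NEW_POINTER)
    print(f"ARPA size change: {delta:+,} bytes ({pct:+.1f}%)")
```

The `size` field is the byte length of the actual file stored in LFS, so the delta reflects the growth of the ARPA model itself rather than of the pointer.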
README.md CHANGED
@@ -10,39 +10,39 @@ A KenLM n-gram language model trained on Tibetan text, tokenized with sentencepi
  | **Tokenizer** | [openpecha/BoSentencePiece](https://huggingface.co/openpecha/BoSentencePiece) (Unigram, 20k vocab) |
  | **Training Corpus** | `bo_corpus.txt` |
  | **Pruning** | 0 0 1 |
- | **Tokens** | 38,532,313 |
- | **Vocabulary Size** | 19,974 |
+ | **Tokens** | 42,010,347 |
+ | **Vocabulary Size** | 20,003 |
 
  ## N-gram Statistics
 
  | Order | Count | D1 | D2 | D3+ |
  | --- | --- | --- | --- | --- |
- | 1 | 19,974 | 0.4286 | 0.4732 | 1.6466 |
- | 2 | 6,644,290 | 0.6716 | 1.1474 | 1.5430 |
- | 3 | 4,300,626 | 0.8465 | 1.2657 | 1.4802 |
- | 4 | 3,485,091 | 0.9175 | 1.3852 | 1.5176 |
- | 5 | 2,597,780 | 0.8773 | 1.4487 | 1.5846 |
+ | 1 | 20,003 | 0.4921 | 0.3393 | 1.0317 |
+ | 2 | 6,945,893 | 0.6676 | 1.1495 | 1.5504 |
+ | 3 | 4,960,553 | 0.8443 | 1.2638 | 1.4835 |
+ | 4 | 4,211,842 | 0.9154 | 1.3888 | 1.5332 |
+ | 5 | 3,276,583 | 0.8525 | 1.5142 | 1.6453 |
 
  ## Memory Estimates
 
  | Type | MB | Details |
  | --- | --- | --- |
- | probing | 375 | assuming -p 1.5 |
- | probing | 458 | assuming -r models -p 1.5 |
- | trie | 187 | without quantization |
- | trie | 99 | assuming -q 8 -b 8 quantization |
- | trie | 159 | assuming -a 22 array pointer compression |
- | trie | 71 | assuming -a 22 -q 8 -b 8 array pointer compression and quantization |
+ | probing | 425 | assuming -p 1.5 |
+ | probing | 517 | assuming -r models -p 1.5 |
+ | trie | 211 | without quantization |
+ | trie | 112 | assuming -q 8 -b 8 quantization |
+ | trie | 180 | assuming -a 22 array pointer compression |
+ | trie | 81 | assuming -a 22 -q 8 -b 8 array pointer compression and quantization |
 
  ## Training Resources
 
  | Metric | Value |
  | --- | --- |
  | **Peak Virtual Memory** | 12,333 MB |
- | **Peak RSS** | 2,976 MB |
- | **Wall Time** | 33.1s |
- | **User Time** | 37.0s |
- | **System Time** | 16.6s |
+ | **Peak RSS** | 3,578 MB |
+ | **Wall Time** | 42.9s |
+ | **User Time** | 48.5s |
+ | **System Time** | 19.7s |
 
  ## Usage
 
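To make the change in the N-gram Statistics table easier to read, the sketch below compares the per-order counts before and after this commit. The counts are copied from the README diff; the function names `total_ngrams` and `growth_by_order` are illustrative and not part of the repository.

```python
# N-gram counts per order, copied from the old and new README tables.
OLD_COUNTS = {1: 19_974, 2: 6_644_290, 3: 4_300_626, 4: 3_485_091, 5: 2_597_780}
NEW_COUNTS = {1: 20_003, 2: 6_945_893, 3: 4_960_553, 4: 4_211_842, 5: 3_276_583}


def total_ngrams(counts: dict) -> int:
    """Sum the n-gram counts across all orders."""
    return sum(counts.values())


def growth_by_order(old: dict, new: dict) -> dict:
    """Return the percentage growth of the n-gram count for each order."""
    return {
        order: 100.0 * (new[order] - old[order]) / old[order]
        for order in sorted(old)
    }


if __name__ == "__main__":
    for order, pct in growth_by_order(OLD_COUNTS, NEW_COUNTS).items():
        print(f"order {order}: {OLD_COUNTS[order]:>9,} -> {NEW_COUNTS[order]:>9,} ({pct:+.2f}%)")
    print(f"total:   {total_ngrams(OLD_COUNTS):,} -> {total_ngrams(NEW_COUNTS):,}")
```

Relative growth increases with the order, with the 5-gram count growing the most, while the unigram count barely moves because it is bounded by the tokenizer's 20k vocabulary.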