openpecha/BoKenlm-sp


kaldan committed on Mar 6

Commit c860dd7 · verified · 1 Parent(s): 08b65f0

Upload KenLM model

Files changed (2)

BoKenlm-sp.arpa +2 -2
README.md +17 -17

BoKenlm-sp.arpa CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ba82428afdbf0c87689cd241d521d57f7431200bb8e2d318a7ed9e854a28e1ef
-size 1318866790
+oid sha256:afae57135f995f6e5817bbb9d77bb9bf8879ad40e3b1353713660cbd7d938612
+size 1522379816
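
The Git LFS pointer above records the SHA-256 digest and byte size of the uploaded model file, so a downloaded copy can be checked against it. The following is a minimal sketch of such a check, not part of this commit; the function names and the local file path in the usage note are assumptions made for illustration, while the digest and size are copied from the new pointer.

```python
import hashlib
import os

# Values copied from the new LFS pointer in this commit.
EXPECTED_OID = "afae57135f995f6e5817bbb9d77bb9bf8879ad40e3b1353713660cbd7d938612"
EXPECTED_SIZE = 1522379816


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large model is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path, expected_oid=EXPECTED_OID, expected_size=EXPECTED_SIZE):
    """Return True when both the byte size and the SHA-256 digest match."""
    # Hypothetical helper: the cheap size check runs first, the hash second.
    if os.path.getsize(path) != expected_size:
        return False
    return sha256_of_file(path) == expected_oid
```

Assuming the model was saved locally as `BoKenlm-sp.arpa`, calling `matches_pointer("BoKenlm-sp.arpa")` would report whether the download corresponds to this commit's version of the file.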

README.md CHANGED

@@ -10,39 +10,39 @@ A KenLM n-gram language model trained on Tibetan text, tokenized with sentencepi
 | **Tokenizer** | [openpecha/BoSentencePiece](https://huggingface.co/openpecha/BoSentencePiece) (Unigram, 20k vocab) |
 | **Training Corpus** | `bo_corpus.txt` |
 | **Pruning** | 0 0 1 |
-| **Tokens** | 38,532,313 |
-| **Vocabulary Size** | 19,974 |
+| **Tokens** | 42,010,347 |
+| **Vocabulary Size** | 20,003 |
 ## N-gram Statistics
 | Order | Count | D1 | D2 | D3+ |
 | --- | --- | --- | --- | --- |
-| 1 | 19,974 | 0.4286 | 0.4732 | 1.6466 |
-| 2 | 6,644,290 | 0.6716 | 1.1474 | 1.5430 |
-| 3 | 4,300,626 | 0.8465 | 1.2657 | 1.4802 |
-| 4 | 3,485,091 | 0.9175 | 1.3852 | 1.5176 |
-| 5 | 2,597,780 | 0.8773 | 1.4487 | 1.5846 |
+| 1 | 20,003 | 0.4921 | 0.3393 | 1.0317 |
+| 2 | 6,945,893 | 0.6676 | 1.1495 | 1.5504 |
+| 3 | 4,960,553 | 0.8443 | 1.2638 | 1.4835 |
+| 4 | 4,211,842 | 0.9154 | 1.3888 | 1.5332 |
+| 5 | 3,276,583 | 0.8525 | 1.5142 | 1.6453 |
 ## Memory Estimates
 | Type | MB | Details |
 | --- | --- | --- |
-| probing | 375 | assuming -p 1.5 |
-| probing | 458 | assuming -r models -p 1.5 |
-| trie | 187 | without quantization |
-| trie | 99 | assuming -q 8 -b 8 quantization |
-| trie | 159 | assuming -a 22 array pointer compression |
-| trie | 71 | assuming -a 22 -q 8 -b 8 array pointer compression and quantization |
+| probing | 425 | assuming -p 1.5 |
+| probing | 517 | assuming -r models -p 1.5 |
+| trie | 211 | without quantization |
+| trie | 112 | assuming -q 8 -b 8 quantization |
+| trie | 180 | assuming -a 22 array pointer compression |
+| trie | 81 | assuming -a 22 -q 8 -b 8 array pointer compression and quantization |
 ## Training Resources
 | Metric | Value |
 | --- | --- |
 | **Peak Virtual Memory** | 12,333 MB |
-| **Peak RSS** | 2,976 MB |
-| **Wall Time** | 33.1s |
-| **User Time** | 37.0s |
-| **System Time** | 16.6s |
+| **Peak RSS** | 3,578 MB |
+| **Wall Time** | 42.9s |
+| **User Time** | 48.5s |
+| **System Time** | 19.7s |
 ## Usage