omarkamali committed on
Commit 1815cdf · verified · Parent: 1e6f4ab

Upload all models and assets for bbc (20251001)

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full list.
Files changed (50)
  1. README.md +313 -135
  2. models/embeddings/monolingual/bbc_128d.bin +2 -2
  3. models/embeddings/monolingual/bbc_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/bbc_32d.bin +2 -2
  5. models/embeddings/monolingual/bbc_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/bbc_64d.bin +2 -2
  7. models/embeddings/monolingual/bbc_64d_metadata.json +5 -3
  8. models/subword_markov/bbc_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/bbc_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/bbc_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/bbc_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/bbc_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/bbc_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/bbc_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/bbc_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/bbc_2gram_subword.parquet +2 -2
  17. models/subword_ngram/bbc_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/bbc_3gram_subword.parquet +2 -2
  19. models/subword_ngram/bbc_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/bbc_4gram_subword.parquet +2 -2
  21. models/subword_ngram/bbc_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/bbc_tokenizer_16k.model +2 -2
  23. models/tokenizer/bbc_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/bbc_tokenizer_32k.model +2 -2
  25. models/tokenizer/bbc_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/bbc_tokenizer_8k.model +2 -2
  27. models/tokenizer/bbc_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/bbc_vocabulary.parquet +2 -2
  29. models/vocabulary/bbc_vocabulary_metadata.json +10 -9
  30. models/word_markov/bbc_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/bbc_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/bbc_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/bbc_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/bbc_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/bbc_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/bbc_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/bbc_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/bbc_2gram_word.parquet +2 -2
  39. models/word_ngram/bbc_2gram_word_metadata.json +2 -2
  40. models/word_ngram/bbc_3gram_word.parquet +2 -2
  41. models/word_ngram/bbc_3gram_word_metadata.json +2 -2
  42. models/word_ngram/bbc_4gram_word.parquet +2 -2
  43. models/word_ngram/bbc_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 4.433
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.8253
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 24711
33
- generated: 2025-12-28
34
  ---
35
 
36
  # BBC - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
 
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
 
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,53 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70
 
71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.867x | 3.83 | 0.1235% | 1,466,873 |
76
- | **16k** | 4.118x | 4.08 | 0.1315% | 1,377,727 |
77
- | **32k** | 4.304x | 4.27 | 0.1375% | 1,318,061 |
78
- | **64k** | 4.433x 🏆 | 4.39 | 0.1416% | 1,279,829 |
79
 
80
  ### Tokenization Examples
81
 
82
  Below are sample sentences tokenized with each vocabulary size:
83
 
84
- **Sample 1:** `Panjunan i ma sada huta na adong di Kecamatan Petarukan, Kabupaten Pemalang, Pr...`
85
 
86
  | Vocab | Tokens | Count |
87
  |-------|--------|-------|
88
- | 8k | `▁panj unan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
89
- | 16k | `▁panj unan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
90
- | 32k | `▁panj unan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
91
- | 64k | `▁panjunan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁petarukan ... (+10 more)` | 20 |
92
-
93
- **Sample 2:** `Ampapaga (Surat Batak:ᯀᯔ᯲ᯇᯇᯎ) i ma sada suansuanan na tubu di gadu ni hauma.
94
 
95
- P...`
96
 
97
  | Vocab | Tokens | Count |
98
  |-------|--------|-------|
99
- | 8k | `▁amp ap aga( suratbatak : ᯔ᯲ ... (+20 more)` | 30 |
100
- | 16k | `▁amp ap aga( suratbatak : ᯔ᯲ᯇ ... (+18 more)` | 28 |
101
- | 32k | `▁amp apaga( suratbatak : ᯔ᯲ᯇᯇᯎ )i ... (+15 more)` | 25 |
102
- | 64k | `▁ampapaga ▁( surat ▁batak : ᯀᯔ᯲ᯇᯇᯎ ) ▁i ▁ma ▁sada ... (+13 more)` | 23 |
103
 
104
- **Sample 3:** `Sungapan i ma sada huta na adong di Kecamatan Pemalang, Kabupaten Pemalang, Pro...`
105
 
106
  | Vocab | Tokens | Count |
107
  |-------|--------|-------|
108
- | 8k | `▁sung apan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
109
- | 16k | `▁sung apan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
110
- | 32k | `▁sung apan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
111
- | 64k | `▁sungapan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁pemalang ... (+10 more)` | 20 |
112
 
113
 
114
  ### Key Findings
115
 
116
- - **Best Compression:** 64k achieves 4.433x compression
117
- - **Lowest UNK Rate:** 8k with 0.1235% unknown tokens
118
  - **Trade-off:** Larger vocabularies improve compression but increase model size
119
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
120
 
@@ -123,57 +125,89 @@ Below are sample sentences tokenized with each vocabulary size:
123
 
124
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
125
 
 
 
126
  ![N-gram Coverage](visualizations/ngram_coverage.png)
127
 
128
  ### Results
129
 
130
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
131
- |--------|------------|---------|----------------|------------------|-------------------|
132
- | **2-gram** | 7,191 🏆 | 12.81 | 30,496 | 17.8% | 49.6% |
133
- | **2-gram** | 209 🏆 | 7.70 | 2,868 | 75.2% | 99.1% |
134
- | **3-gram** | 25,989 | 14.67 | 62,993 | 8.7% | 26.3% |
135
- | **3-gram** | 1,399 | 10.45 | 19,851 | 36.8% | 80.6% |
136
- | **4-gram** | 63,619 | 15.96 | 114,847 | 4.7% | 15.6% |
137
- | **4-gram** | 6,375 | 12.64 | 81,359 | 18.9% | 52.8% |
138
 
139
  ### Top 5 N-grams by Size
140
 
141
- **2-grams:**
142
 
143
  | Rank | N-gram | Count |
144
  |------|--------|-------|
145
- | 1 | `, jala` | 8,780 |
146
- | 2 | `i ,` | 6,997 |
147
- | 3 | `ᯬ ᯲` | 4,573 |
148
- | 4 | `angka na` | 4,409 |
149
- | 5 | `dung i` | 4,328 |
150
 
151
- **3-grams:**
152
 
153
  | Rank | N-gram | Count |
154
  |------|--------|-------|
155
- | 1 | `anak ni si` | 1,611 |
156
- | 2 | `, angka na` | 1,528 |
157
- | 3 | `. 2 :` | 1,079 |
158
- | 4 | `, anak ni` | 1,069 |
159
- | 5 | `ᯰ ᯦` | 1,063 |
160
 
161
- **4-grams:**
162
 
163
  | Rank | N-gram | Count |
164
  |------|--------|-------|
165
- | 1 | `, anak ni si` | 906 |
166
- | 2 | `ᯀ ᯦` | 686 |
167
- | 3 | `ᯘ ᯀᯉ ᯲` | 457 |
168
- | 4 | `ᯬ ᯂᯖ ᯲` | 432 |
169
- | 5 | `on do hata ni` | 421 |
170
 
171
 
172
  ### Key Findings
173
 
174
- - **Best Perplexity:** 2-gram with 209
175
  - **Entropy Trend:** Increases with larger n-grams (the n-gram distributions grow sparser)
176
- - **Coverage:** Top-1000 patterns cover ~53% of corpus
177
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
178
 
179
  ---
@@ -181,55 +215,86 @@ Below are sample sentences tokenized with each vocabulary size:
181
 
182
  ![Markov Entropy](visualizations/markov_entropy.png)
183
 
 
 
184
  ![Markov Branching](visualizations/markov_branching.png)
185
 
186
  ### Results
187
 
188
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
189
- |---------|-------------|------------|------------------|-----------------|----------------|
190
- | **1** | 0.8281 | 1.775 | 6.09 | 50,531 | 17.2% |
191
- | **1** | 1.0960 | 2.138 | 7.35 | 1,187 | 0.0% |
192
- | **2** | 0.3983 | 1.318 | 2.17 | 307,642 | 60.2% |
193
- | **2** | 0.8091 | 1.752 | 4.70 | 8,725 | 19.1% |
194
- | **3** | 0.1982 | 1.147 | 1.41 | 667,901 | 80.2% |
195
- | **3** | 0.7878 | 1.726 | 3.57 | 41,004 | 21.2% |
196
- | **4** | 0.0988 🏆 | 1.071 | 1.16 | 943,940 | 90.1% |
197
- | **4** | 0.5526 🏆 | 1.467 | 2.40 | 146,535 | 44.7% |
198
 
199
- ### Generated Text Samples
200
 
201
- Below are text samples generated from each Markov chain model:
202
 
203
  **Context Size 1:**
204
 
205
- 1. `, jala tu si wasti , jala dipadeakdeak hamu sian i jala peakkononna tu palaspalas pamurunan`
206
- 2. `. 12 : songon i hatangku tu bagasan saluhut angka ari puasa bintang di jolo i`
207
- 3. `: ida ma angka na margoar milo dohot paredangedangan , 2 : 35 : 25 pelean`
208
 
209
  **Context Size 2:**
210
 
211
- 1. `, jala tudoshon gumora angka anak ni si joab soara ni sarune i laho ma ho ,`
212
- 2. `i , naung pinauli ni tangan ni halak pangarupa umpogo upa ni pambahenannasida . 99 : 7`
213
- 3. `ᯬ ᯘᯞ ᯂᯖ ᯇᯔ ᯅᯂ ᯉᯉ ᯉ`
214
 
215
  **Context Size 3:**
216
 
217
- 1. `anak ni si ahilud do panuturi . 18 : 13 tung sura ahu parohon begu masa tu luat`
218
- 2. `, angka na so bangso hian ; marhite sian bangso na asing , jala marbarita goarmu di betlehem`
219
- 3. `. 2 : 15 alai tarrimas situtu ma si abner dohot di halak na marroha pangansion , ndang`
220
 
221
  **Context Size 4:**
222
 
223
- 1. `, anak ni si ammiel . 3 : 6 ndada tu torop bangso , angka parhata bobang , manang`
224
- 2. `ᯀ ᯂᯖ ᯉᯘ ᯪ`
225
- 3. `ᯘ ᯀᯉ ᯂᯔ ᯩ ᯅᯖ 2 :`
226
 
227
 
228
  ### Key Findings
229
 
230
- - **Best Predictability:** Context-4 with 90.1% predictability
231
  - **Branching Factor:** Decreases with context size (more deterministic)
232
- - **Memory Trade-off:** Larger contexts require more storage (146,535 contexts)
233
  - **Recommendation:** Context-3 or Context-4 for text generation
234
 
235
  ---
@@ -245,64 +310,64 @@ Below are text samples generated from each Markov chain model:
245
 
246
  | Metric | Value |
247
  |--------|-------|
248
- | Vocabulary Size | 24,711 |
249
- | Total Tokens | 1,019,541 |
250
- | Mean Frequency | 41.26 |
251
  | Median Frequency | 4 |
252
- | Frequency Std Dev | 565.94 |
253
 
254
  ### Most Common Words
255
 
256
  | Rank | Word | Frequency |
257
  |------|------|-----------|
258
- | 1 | ni | 35,101 |
259
- | 2 | na | 34,088 |
260
- | 3 | i | 33,056 |
261
- | 4 | ma | 26,769 |
262
- | 5 | di | 26,053 |
263
- | 6 | tu | 20,450 |
264
- | 7 | do | 19,163 |
265
- | 8 | angka | 17,428 |
266
- | 9 | jala | 14,598 |
267
- | 10 | dohot | 13,609 |
268
 
269
  ### Least Common Words (from vocabulary)
270
 
271
  | Rank | Word | Frequency |
272
  |------|------|-----------|
273
- | 1 | ᯝᯇᯞ | 2 |
274
- | 2 | kayo | 2 |
275
- | 3 | uttar | 2 |
276
- | 4 | ltr | 2 |
277
- | 5 | ebrima | 2 |
278
- | 6 | 290px | 2 |
279
- | 7 | td | 2 |
280
- | 8 | height | 2 |
281
- | 9 | 260px | 2 |
282
- | 10 | 22251 | 2 |
283
 
284
  ### Zipf's Law Analysis
285
 
286
  | Metric | Value |
287
  |--------|-------|
288
- | Zipf Coefficient | 1.1956 |
289
- | R² (Goodness of Fit) | 0.996705 |
290
  | Adherence Quality | **excellent** |
291
 
292
  ### Coverage Analysis
293
 
294
  | Top N Words | Coverage |
295
  |-------------|----------|
296
- | Top 100 | 52.8% |
297
- | Top 1,000 | 78.6% |
298
- | Top 5,000 | 91.7% |
299
- | Top 10,000 | 95.9% |
300
 
301
  ### Key Findings
302
 
303
- - **Zipf Compliance:** R²=0.9967 indicates excellent adherence to Zipf's law
304
- - **High Frequency Dominance:** Top 100 words cover 52.8% of corpus
305
- - **Long Tail:** 14,711 words needed for remaining 4.1% coverage
306
 
307
  ---
308
  ## 5. Word Embeddings Evaluation
@@ -315,24 +380,134 @@ Below are text samples generated from each Markov chain model:
315
 
316
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
317
 
318
- ### Model Comparison
319
 
320
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
321
- |-------|------------|-----------|----------|----------|----------|
322
- | **mono_32d** | 15,079 | 32 | 3.458 | 0.818 | 0.8253 🏆 |
323
- | **mono_64d** | 15,079 | 64 | 3.886 | 0.738 | 0.7641 |
324
- | **mono_128d** | 15,079 | 128 | 4.143 | 0.691 | 0.4668 |
325
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
326
 
327
  ### Key Findings
328
 
329
- - **Best Isotropy:** mono_32d with 0.8253 (more uniform distribution)
330
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
331
- - **Vocabulary Coverage:** All models cover 15,079 words
332
- - **Recommendation:** 100d for balanced semantic capture and efficiency
333
 
334
  ---
335
- ## 6. Summary & Recommendations
 
336
 
337
  ![Performance Dashboard](visualizations/performance_dashboard.png)
338
 
@@ -340,11 +515,12 @@ Below are text samples generated from each Markov chain model:
340
 
341
  | Component | Recommended | Rationale |
342
  |-----------|-------------|-----------|
343
- | Tokenizer | **32k BPE** | Best compression (4.43x) with low UNK rate |
344
- | N-gram | **5-gram** | Lowest perplexity (209) |
345
- | Markov | **Context-4** | Highest predictability (90.1%) |
346
  | Embeddings | **128d** | Balanced semantic capture and isotropy |
347
 
 
348
  ---
349
  ## Appendix: Metrics Glossary & Interpretation Guide
350
 
@@ -534,7 +710,8 @@ If you use these models in your research, please cite:
534
  author = {Kamali, Omar},
535
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
536
  year = {2025},
537
- publisher = {HuggingFace},
 
538
  url = {https://huggingface.co/wikilangs},
539
  institution = {Omneity Labs}
540
  }
@@ -550,7 +727,8 @@ MIT License - Free for academic and commercial use.
550
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
551
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
552
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 
553
  ---
554
  *Generated by Wikilangs Models Pipeline*
555
 
556
- *Report Date: 2025-12-28 00:12:38*
 
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 3.663
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.8223
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 0
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # BBC - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis-experimental)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 3.308x | 3.31 | 0.2131% | 1,665,136 |
84
+ | **16k** | 3.527x | 3.53 | 0.2273% | 1,561,615 |
85
+ | **32k** | 3.663x 🏆 | 3.66 | 0.2360% | 1,503,727 |
 
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
+ **Sample 1:** `Pedurungan i ma sada huta na adong di Kecamatan Taman, Kabupaten Pemalang, Propi...`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
+ | 8k | `▁ped ur ungan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ... (+12 more)` | 22 |
96
+ | 16k | `▁pedurungan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁taman ... (+10 more)` | 20 |
97
+ | 32k | `▁pedurungan ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁taman ... (+10 more)` | 20 |
 
 
 
98
 
99
+ **Sample 2:** `Mulyoharjo i ma sada Kelurahan na adong di Kecamatan Pemalang, Kabupaten Pemalan...`
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁mul y oharjo ▁i ▁ma ▁sada ▁kelurahan ▁na ▁adong ▁di ... (+12 more)` | 22 |
104
+ | 16k | `▁mulyoharjo ▁i ▁ma ▁sada ▁kelurahan ▁na ▁adong ▁di ▁kecamatan ▁pemalang ... (+10 more)` | 20 |
105
+ | 32k | `▁mulyoharjo ▁i ▁ma ▁sada ▁kelurahan ▁na ▁adong ▁di ▁kecamatan ▁pemalang ... (+10 more)` | 20 |
 
106
 
107
+ **Sample 3:** `Klegen i ma sada huta na adong di Kecamatan Comal, Kabupaten Pemalang, Propinsi ...`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
+ | 8k | `▁kl egen ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ... (+11 more)` | 21 |
112
+ | 16k | `▁klegen ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁comal ... (+10 more)` | 20 |
113
+ | 32k | `▁klegen ▁i ▁ma ▁sada ▁huta ▁na ▁adong ▁di ▁kecamatan ▁comal ... (+10 more)` | 20 |
 
114
 
115
 
116
  ### Key Findings
117
 
118
+ - **Best Compression:** 32k achieves 3.663x compression
119
+ - **Lowest UNK Rate:** 8k with 0.2131% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
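
The tokenizers are standard SentencePiece models, so the figures above are straightforward to reproduce. A minimal sketch, assuming the `sentencepiece` package and the `models/` layout listed in this commit:

```python
import sentencepiece as spm

# Path follows the repository layout shown in the file list above.
sp = spm.SentencePieceProcessor(model_file="models/tokenizer/bbc_tokenizer_32k.model")

text = "Klegen i ma sada huta na adong di Kecamatan Comal"
pieces = sp.encode(text, out_type=str)
print(pieces)       # subword pieces such as '▁klegen', '▁i', '▁ma', ...
print(len(pieces))  # token count, comparable to the Count column above
```
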
122
 
 
125
 
126
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
127
 
128
+ ![N-gram Unique](visualizations/ngram_unique.png)
129
+
130
  ![N-gram Coverage](visualizations/ngram_coverage.png)
131
 
132
  ### Results
133
 
134
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
136
+ | **2-gram** | Word | 8,528 | 13.06 | 26,426 | 17.5% | 42.8% |
137
+ | **2-gram** | Subword | 185 🏆 | 7.53 | 3,491 | 77.6% | 99.2% |
138
+ | **3-gram** | Word | 22,579 | 14.46 | 43,165 | 8.3% | 25.2% |
139
+ | **3-gram** | Subword | 1,219 | 10.25 | 18,183 | 38.1% | 83.2% |
140
+ | **4-gram** | Word | 44,749 | 15.45 | 67,595 | 5.7% | 16.0% |
141
+ | **4-gram** | Subword | 5,604 | 12.45 | 70,417 | 19.7% | 54.7% |
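
For reference, the Perplexity and Entropy columns are two views of the same quantity, perplexity = 2^entropy: for example, 2^7.53 ≈ 185 for the subword 2-gram and 2^12.45 ≈ 5,600 for the subword 4-gram.
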
142
 
143
  ### Top 5 N-grams by Size
144
 
145
+ **2-grams (Word):**
146
 
147
  | Rank | N-gram | Count |
148
  |------|--------|-------|
149
+ | 1 | `angka na` | 4,424 |
150
+ | 2 | `dung i` | 4,327 |
151
+ | 3 | `ni si` | 4,061 |
152
+ | 4 | `i ma` | 3,622 |
153
+ | 5 | `ni jahowa` | 2,892 |
154
 
155
+ **3-grams (Word):**
156
 
157
  | Rank | N-gram | Count |
158
  |------|--------|-------|
159
+ | 1 | `anak ni si` | 1,613 |
160
+ | 2 | `dung i ninna` | 735 |
161
+ | 3 | `i ma sada` | 728 |
162
+ | 4 | `hata ni jahowa` | 703 |
163
+ | 5 | `na adong di` | 690 |
164
 
165
+ **4-grams (Word):**
166
 
167
  | Rank | N-gram | Count |
168
  |------|--------|-------|
169
+ | 1 | `on do hata ni` | 423 |
170
+ | 2 | `songon on do hata` | 408 |
171
+ | 3 | `i ma sada huta` | 363 |
172
+ | 4 | `angka anak ni si` | 336 |
173
+ | 5 | `na adong di kecamatan` | 297 |
174
+
175
+ **2-grams (Subword):**
176
+
177
+ | Rank | N-gram | Count |
178
+ |------|--------|-------|
179
+ | 1 | `a _` | 206,904 |
180
+ | 2 | `a n` | 205,541 |
181
+ | 3 | `n g` | 154,122 |
182
+ | 4 | `i _` | 143,001 |
183
+ | 5 | `n a` | 122,611 |
184
+
185
+ **3-grams (Subword):**
186
+
187
+ | Rank | N-gram | Count |
188
+ |------|--------|-------|
189
+ | 1 | `a n g` | 81,985 |
190
+ | 2 | `_ m a` | 76,327 |
191
+ | 3 | `n a _` | 58,974 |
192
+ | 4 | `_ n a` | 53,547 |
193
+ | 5 | `a n _` | 51,343 |
194
+
195
+ **4-grams (Subword):**
196
+
197
+ | Rank | N-gram | Count |
198
+ |------|--------|-------|
199
+ | 1 | `_ n i _` | 34,971 |
200
+ | 2 | `_ n a _` | 33,600 |
201
+ | 3 | `_ d i _` | 25,982 |
202
+ | 4 | `a n g k` | 24,957 |
203
+ | 5 | `_ m a _` | 23,771 |
204
 
205
 
206
  ### Key Findings
207
 
208
+ - **Best Perplexity:** 2-gram (subword) with 185
209
  - **Entropy Trend:** Increases with larger n-grams (the n-gram distributions grow sparser)
210
+ - **Coverage:** Top-1000 patterns cover ~55% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
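
The coverage figures can be recomputed from the released parquet tables. A sketch, assuming `pandas` and hypothetical column names `ngram` and `count` (check the accompanying `*_metadata.json` files for the actual schema):

```python
import pandas as pd

# Column names are assumptions -- verify against the metadata files.
df = pd.read_parquet("models/word_ngram/bbc_2gram_word.parquet")
df = df.sort_values("count", ascending=False)

total = df["count"].sum()
for top_n in (100, 1000):
    coverage = df["count"].head(top_n).sum() / total
    print(f"Top-{top_n} coverage: {coverage:.1%}")  # table above: 17.5% / 42.8%
```
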
212
 
213
  ---
 
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.9188 | 1.890 | 6.44 | 50,697 | 8.1% |
227
+ | **1** | Subword | 0.9378 | 1.916 | 7.17 | 1,435 | 6.2% |
228
+ | **2** | Word | 0.3742 | 1.296 | 2.02 | 325,909 | 62.6% |
229
+ | **2** | Subword | 0.7095 | 1.635 | 4.04 | 10,290 | 29.0% |
230
+ | **3** | Word | 0.1535 | 1.112 | 1.28 | 658,447 | 84.6% |
231
+ | **3** | Subword | 0.6445 | 1.563 | 3.15 | 41,529 | 35.5% |
232
+ | **4** | Word | 0.0590 🏆 | 1.042 | 1.09 | 840,046 | 94.1% |
233
+ | **4** | Subword | 0.5191 | 1.433 | 2.39 | 130,734 | 48.1% |
234
 
235
+ ### Generated Text Samples (Word-based)
236
 
237
+ Below are text samples generated from each word-based Markov chain model:
238
 
239
  **Context Size 1:**
240
 
241
+ 1. `ni harbangan asa dioloi si ferdinand lumban gaol boi manampung 1 11 3 naung disunat 2`
242
+ 2. `na marumur sataon na humaliang 5 000 m2 hira hira songon sondang ni raja iii diida`
243
+ 3. `i gok daupa sian si daud sian pangasammu do sarita pardapot ni na sampuludua i 22`
244
 
245
  **Context Size 2:**
246
 
247
+ 1. `angka na niuhir dohot na tarulang angka bagasnasida jala ndang marnalemba 10 38 ingkon mago roham ba...`
248
+ 2. `dung i ninna jesus ma siseanna i ninna parompuan na mabalu disi marmudu ho 17 2 dung`
249
+ 3. `ni si beor pangarunding i dibunu halak daniel 6 6 1 hamu pe ditanda sada pangituai parheheon`
250
 
251
  **Context Size 3:**
252
 
253
+ 1. `anak ni si jaasia si beno 24 27 ia angka anak ni si aron ma mudar i jala`
254
+ 2. `dung i ninna ibana tu ahu hombar tu bagabagam 119 171 sai marbulakkon pujipujian ma angka bibirhu al...`
255
+ 3. `i ma sada kecamatan na adong di kecamatan petarukan kabupaten pemalang propinsi jawa tonga indonesia...`
256
 
257
  **Context Size 4:**
258
 
259
+ 1. `on do hata ni tuhan jahowa ida ma ahu sandiri pahehehon tu nasida hahisaron dohot hamalumon jala pam...`
260
+ 2. `songon on do hata ni jahowa zebaot tu hamu malim angka na palea goarhu hape lam didok hamu do`
261
+ 3. `i ma sada huta na maringanan di kecamatan tarutung kabupaten tapanuli utara propinsi sumatera utara ...`
262
+
263
+
264
+ ### Generated Text Samples (Subword-based)
265
+
266
+ Below are text samples generated from each subword-based Markov chain model:
267
+
268
+ **Context Size 1:**
269
+
270
+ 1. `_nobedi_anoa?_hi`
271
+ 2. `akoni_jarin_a_ng`
272
+ 3. `nabomaholalowap_`
273
+
274
+ **Context Size 2:**
275
+
276
+ 1. `a_sia_ahaan_rohot`
277
+ 2. `anahu:_tana_raela`
278
+ 3. `ngon_nak_i._10:2_`
279
+
280
+ **Context Size 3:**
281
+
282
+ 1. `anggo_ia_ingka_jor`
283
+ 2. `_marik_marhabus;_a`
284
+ 3. `na_5_menjangkup_he`
285
+
286
+ **Context Size 4:**
287
+
288
+ 1. `_ni_angka_halak_juj`
289
+ 2. `_na_hian_gabe_manan`
290
+ 3. `_di_bagaska_indones`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 94.1% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (130,734 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
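
Generation like the samples above needs nothing more than weighted sampling over the stored transitions. A toy sketch, assuming a hypothetical parquet schema with `context`, `next`, and `count` columns:

```python
import random
import pandas as pd

# Schema is an assumption -- verify against the markov *_metadata.json files.
df = pd.read_parquet("models/word_markov/bbc_markov_ctx2_word.parquet")

def sample_next(context):
    """Draw the next word in proportion to its transition count."""
    rows = df[df["context"] == context]
    if rows.empty:
        return None
    return random.choices(rows["next"].tolist(), weights=rows["count"].tolist())[0]

tokens = ["dung", "i"]  # seed bigram taken from the corpus samples above
for _ in range(15):
    nxt = sample_next(" ".join(tokens[-2:]))
    if nxt is None:
        break
    tokens.append(nxt)
print(" ".join(tokens))
```
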
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 24,970 |
314
+ | Total Tokens | 972,166 |
315
+ | Mean Frequency | 38.93 |
316
  | Median Frequency | 4 |
317
+ | Frequency Std Dev | 557.36 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | ni | 35,042 |
324
+ | 2 | na | 33,939 |
325
+ | 3 | i | 32,856 |
326
+ | 4 | ma | 26,602 |
327
+ | 5 | di | 26,003 |
328
+ | 6 | tu | 20,420 |
329
+ | 7 | do | 19,118 |
330
+ | 8 | angka | 17,417 |
331
+ | 9 | jala | 14,585 |
332
+ | 10 | dohot | 13,546 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
336
  | Rank | Word | Frequency |
337
  |------|------|-----------|
338
+ | 1 | continua | 2 |
339
+ | 2 | giuseppe | 2 |
340
+ | 3 | mamutuskan | 2 |
341
+ | 4 | disidang | 2 |
342
+ | 5 | disuspensi | 2 |
343
+ | 6 | formula | 2 |
344
+ | 7 | dibenarkan | 2 |
345
+ | 8 | pidana | 2 |
346
+ | 9 | piazza | 2 |
347
+ | 10 | fontana | 2 |
348
 
349
  ### Zipf's Law Analysis
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 1.1798 |
354
+ | R² (Goodness of Fit) | 0.997075 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 53.7% |
362
+ | Top 1,000 | 78.4% |
363
+ | Top 5,000 | 91.4% |
364
+ | Top 10,000 | 95.7% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9971 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 53.7% of corpus
370
+ - **Long Tail:** 14,970 words needed for remaining 4.3% coverage
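
The Zipf coefficient and R² above follow from an ordinary least-squares fit on log-log rank-frequency data. A sketch, assuming a hypothetical `frequency` column in the vocabulary parquet:

```python
import numpy as np
import pandas as pd

vocab = pd.read_parquet("models/vocabulary/bbc_vocabulary.parquet")
freq = np.sort(vocab["frequency"].to_numpy())[::-1]  # column name is an assumption
rank = np.arange(1, len(freq) + 1)

# Fit log f = -s * log r + c; the Zipf coefficient is s.
slope, intercept = np.polyfit(np.log(rank), np.log(freq), 1)
pred = slope * np.log(rank) + intercept
resid = np.log(freq) - pred
r2 = 1 - (resid**2).sum() / ((np.log(freq) - np.log(freq).mean()) ** 2).sum()
print(f"Zipf coefficient: {-slope:.4f}, R²: {r2:.6f}")  # README: 1.1798 / 0.997075
```
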
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382
 
 
383
 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.8223 🏆 | 0.3239 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.7605 | 0.2710 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.4652 | 0.2352 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.8223 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.2767. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d aligned for best cross-lingual performance
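
The binary sizes (~260 MB for 32d) are consistent with fastText-style models carrying subword buckets; if that is indeed the format (not confirmed by this diff alone), gensim can read the files, and the semantic-density figure can be approximated as the mean pairwise cosine over a word sample:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Assumes fastText .bin format -- an inference from file sizes, not verified here.
wv = load_facebook_vectors("models/embeddings/monolingual/bbc_32d.bin")

rng = np.random.default_rng(0)
words = list(rng.choice(wv.index_to_key, size=500, replace=False))
vecs = wv[words]
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
sims = vecs @ vecs.T
print(sims[np.triu_indices_from(sims, k=1)].mean())  # table above: 0.3239 for mono_32d
```
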
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+ | `pa-` | paunsathon, paraguay, panimbukbuk |
426
+ | `ma-` | matan, marsogotna, marimbang |
427
+ | `di-` | dihalungunhon, dipajonok, dikunjungi |
428
+ | `mar-` | marsogotna, marimbang, marsuhat |
429
+ | `si-` | sintuana, simalolongna, siotihotik |
430
+ | `man-` | manongoshon, maneat, manongos |
431
+ | `par-` | paraguay, paransis, parmaraan |
432
+ | `ha-` | haro, hadoboon, hakristenon |
433
+
434
+ #### Productive Suffixes
435
+ | Suffix | Examples |
436
+ |--------|----------|
437
+ | `-n` | matan, kanan, paunsathon |
438
+ | `-a` | bulanda, tanomanmuna, musikna |
439
+ | `-on` | paunsathon, manongoshon, dihalungunhon |
440
+ | `-na` | tanomanmuna, musikna, marsogotna |
441
+ | `-an` | matan, kanan, tibetan |
442
+ | `-ng` | marimbang, palding, lumeleng |
443
+ | `-hon` | paunsathon, manongoshon, dihalungunhon |
444
+ | `-nna` | anginna, binoanna, parumaenna |
445
+
446
+ ### 6.3 Bound Stems (Lexical Roots)
447
+
448
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
449
+
450
+ | Stem | Cohesion | Substitutability | Examples |
451
+ |------|----------|------------------|----------|
452
+ | `anga` | 1.65x | 126 contexts | angan, sanga, langa |
453
+ | `angk` | 1.46x | 153 contexts | angka, rangka, dangka |
454
+ | `mang` | 1.72x | 61 contexts | amang, damang, mangae |
455
+ | `ngka` | 1.53x | 87 contexts | angka, rangka, dangka |
456
+ | `ngko` | 1.76x | 41 contexts | ingkon, angkot, tingko |
457
+ | `onga` | 1.75x | 36 contexts | longa, tonga, dongan |
458
+ | `angg` | 1.39x | 75 contexts | anggo, anggi, anggia |
459
+ | `anna` | 1.73x | 31 contexts | hanna, manna, annai |
460
+ | `bahe` | 1.78x | 26 contexts | bahen, dibahe, ibahen |
461
+ | `ingk` | 1.43x | 59 contexts | ingkon, tingki, lingka |
462
+ | `ngan` | 1.37x | 65 contexts | ingan, angan, dongan |
463
+ | `ndan` | 1.68x | 25 contexts | ndang, pandan, undang |
464
+
465
+ ### 6.4 Affix Compatibility (Co-occurrence)
466
+
467
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
468
+
469
+ | Prefix | Suffix | Frequency | Examples |
470
+ |--------|--------|-----------|----------|
471
+ | `pa-` | `-n` | 372 words | paboaon, pangalelaon |
472
+ | `ma-` | `-n` | 227 words | malungun, mambahen |
473
+ | `pa-` | `-on` | 208 words | paboaon, pangalelaon |
474
+ | `pa-` | `-a` | 196 words | pangkeannasida, padanna |
475
+ | `pa-` | `-an` | 162 words | pamalian, parmiahan |
476
+ | `di-` | `-n` | 135 words | diparsahitan, dison |
477
+ | `ma-` | `-on` | 127 words | mangkalungunhon, mangkasogohon |
478
+ | `pa-` | `-na` | 124 words | padanna, pandokna |
479
+ | `ha-` | `-n` | 124 words | harun, hamulian |
480
+ | `di-` | `-on` | 113 words | dison, ditahbishon |
481
+
482
+ ### 6.5 Recursive Morpheme Segmentation
483
+
484
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
485
+
486
+ | Word | Suggested Split | Confidence | Stem |
487
+ |------|-----------------|------------|------|
488
+ | dipahatahata | **`di-pa-ha-ta-hata`** | 9.0 | `hata` |
489
+ | marhatomanon | **`mar-ha-toman-on`** | 7.5 | `toman` |
490
+ | panimbangan | **`pan-imba-ng-an`** | 7.5 | `imba` |
491
+ | patongonhon | **`pa-tong-on-hon`** | 7.5 | `tong` |
492
+ | hatigoranku | **`ha-tigor-an-ku`** | 7.5 | `tigor` |
493
+ | pargogoanku | **`par-gogo-an-ku`** | 7.5 | `gogo` |
494
+ | hagaleonku | **`ha-gale-on-ku`** | 7.5 | `gale` |
495
+ | dipatongon | **`di-pa-tong-on`** | 7.5 | `tong` |
496
+ | taparrohahon | **`ta-par-roha-hon`** | 7.5 | `roha` |
497
+ | parhaporseaon | **`par-ha-porsea-on`** | 7.5 | `porsea` |
498
+ | hamuliaonku | **`ha-mulia-on-ku`** | 7.5 | `mulia` |
499
+ | paluhutonku | **`pa-luhut-on-ku`** | 7.5 | `luhut` |
500
+ | hamateanna | **`ha-ma-tean-na`** | 7.5 | `tean` |
501
+ | silehononku | **`si-lehon-on-ku`** | 7.5 | `lehon` |
502
+ | patoltolonku | **`pa-toltol-on-ku`** | 7.5 | `toltol` |
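
The substitutability test described in 6.2 and underlying these splits can be illustrated compactly. A toy sketch (not the pipeline's actual implementation), counting how often stripping a candidate prefix leaves a stem attested elsewhere in the vocabulary:

```python
from collections import Counter

def productive_prefixes(vocab, max_len=3, min_support=2):
    """Count, per candidate prefix, how many words leave an attested stem."""
    support = Counter()
    for word in vocab:
        for k in range(1, max_len + 1):
            prefix, stem = word[:k], word[k:]
            if len(stem) >= 3 and stem in vocab:
                support[prefix] += 1
    return [p for p, n in support.most_common() if n >= min_support]

toy = {"matan", "tan", "marimbang", "imbang", "dison", "dibahen", "bahen", "son"}
print(productive_prefixes(toy))  # ['di'] -- 'ma'/'mar' lack support in this toy set
```
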
503
+
504
+ ### 6.6 Linguistic Interpretation
505
+
506
+ > **Automated Insight:**
507
+ The language BBC appears to be more isolating or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
508
+
509
+ ---
510
+ ## 7. Summary & Recommendations
511
 
512
  ![Performance Dashboard](visualizations/performance_dashboard.png)
513
 
 
515
 
516
  | Component | Recommended | Rationale |
517
  |-----------|-------------|-----------|
518
+ | Tokenizer | **32k BPE** | Best compression (3.66x) |
519
+ | N-gram | **2-gram** | Lowest perplexity (185) |
520
+ | Markov | **Context-4** | Highest predictability (94.1%) |
521
  | Embeddings | **128d** | Balanced semantic capture and isotropy |
522
 
523
+
524
  ---
525
  ## Appendix: Metrics Glossary & Interpretation Guide
526
 
 
710
  author = {Kamali, Omar},
711
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
712
  year = {2025},
713
+ doi = {10.5281/zenodo.18073153},
714
+ publisher = {Zenodo},
715
  url = {https://huggingface.co/wikilangs},
716
  institution = {Omneity Labs}
717
  }
 
727
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
728
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
729
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
730
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
731
  ---
732
  *Generated by Wikilangs Models Pipeline*
733
 
734
+ *Report Date: 2026-01-03 06:19:26*
models/embeddings/monolingual/bbc_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0b0ca75d42a296503b6510e66321ba63fde51730d05e8ae06bf9ee333ebf3764
3
- size 1039704692
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:299d624499ec82c937aa1769fa092d3bd9d976c6d08c2af9b22f767740074a59
3
+ size 1039420296
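
Every binary asset in this commit is a Git LFS pointer like the one above; only the oid and size change, while the content lives in LFS storage. With `huggingface_hub` the resolved files can be fetched directly. A sketch, with a hypothetical repo id:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="wikilangs/bbc",  # hypothetical -- substitute the actual repo id
    filename="models/embeddings/monolingual/bbc_128d.bin",
    repo_type="dataset",      # the README carries dataset_info metadata
)
print(path)  # local cache path; the LFS content is resolved transparently
```
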
models/embeddings/monolingual/bbc_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 128,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 15079
13
  }
 
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 128
13
  },
14
+ "vocab_size": 14806
15
  }
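
The updated metadata pins down the training configuration: skip-gram with window 5, 5 negatives, 5 epochs, plus a pipeline-specific `rope` encoding step. For orientation only, a roughly comparable vanilla skip-gram setup in gensim (without the `rope` step):

```python
from gensim.models import Word2Vec

# Sketch of a comparable configuration; the corpus file is hypothetical.
model = Word2Vec(
    corpus_file="bbc_tokenized.txt",  # one whitespace-tokenized sentence per line
    vector_size=128,                  # matches "dim": 128
    sg=1,                             # "algorithm": "skipgram"
    min_count=5, window=5, negative=5, epochs=5,
)
print(len(model.wv))                  # metadata reports vocab_size 14806
```
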
models/embeddings/monolingual/bbc_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1e806c00c5217f8c0bc9c0c7c7f674b5091cf2b888f0f09d5617c5947fdf7b85
3
- size 260124020
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:981df92afd62ee12261b51c6b8c02f4b1e76c4f2d6c299c0dd0e78da12e9ac2b
3
+ size 260049288
models/embeddings/monolingual/bbc_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 15079
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 14806
15
  }
models/embeddings/monolingual/bbc_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b313b52ddbda268ad3cbd1f0e06c19c9d337a7809b8be819492b088a18d3ea52
3
- size 519984244
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e7fddeff473650e7228b6c4a71a4299b4d9ee94908cca09fc7492ca327d34b12
3
+ size 519839624
models/embeddings/monolingual/bbc_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 15079
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 14806
15
  }
models/subword_markov/bbc_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:31f366e845fdd50b876fa026287e3adc40435cfcd969f38d0a43a5e37677bf6f
3
- size 67861
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5aefbfb1171b4b00a844e78d07315efbcf62f55a2b30f5bd6d940b5ba83804d7
3
+ size 81484
models/subword_markov/bbc_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 1187,
6
- "total_transitions": 5965783
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 1435,
6
+ "total_transitions": 5702376
7
  }
models/subword_markov/bbc_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b7007b2743717f219549dc35da4e22ef904f954017d220c8e61b57e8f201260f
3
- size 331323
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3f38ba522fe1bac3abb75948ca7e8ea124307971bfee224a267114a3c9466873
3
+ size 358445
models/subword_markov/bbc_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 8725,
6
- "total_transitions": 5964472
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 10290,
6
+ "total_transitions": 5701163
7
  }
models/subword_markov/bbc_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c82b65a7ab369bc14ca7c650a4752310ee46abd1500b7705572d13c633829e72
3
- size 1156768
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0a3da300cf244b2ab02f8cc4190a829feb85819df2984a6b1a78f37ab7daec14
3
+ size 1190319
models/subword_markov/bbc_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 41004,
6
- "total_transitions": 5963161
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 41529,
6
+ "total_transitions": 5699950
7
  }
models/subword_markov/bbc_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7f8adaacc604505a5ecc9a9df0ee1360e764fe6a44233577ea34de523534262e
3
- size 2961725
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32b0388a33d7717a0deb5945264197ca362aded134c2c728e20da7a5ddee82f4
3
+ size 2842522
models/subword_markov/bbc_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_contexts": 146535,
6
- "total_transitions": 5961850
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_contexts": 130734,
6
+ "total_transitions": 5698737
7
  }
models/subword_ngram/bbc_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9e60fc5bcc5851afa5f2e2cc336875cafe4bf25d188866759e33cdfd81d92110
3
- size 36155
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d706d757ef98f7e2eaf523f46f0a602b8f06b25cb1d00579847f22b28eece7f1
3
+ size 46126
models/subword_ngram/bbc_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_ngrams": 2868,
6
- "total_ngrams": 5965783
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_ngrams": 3491,
6
+ "total_ngrams": 5702376
7
  }
models/subword_ngram/bbc_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1a241f736f6533859dd3ceeb25dc4c326e50cb9bdae44188ae81918ec9b1e09d
3
- size 236441
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:578bacf93c61827c89b4f8427603b775cd6b7891f9ceecc9c90545263f0f5768
3
+ size 237045
models/subword_ngram/bbc_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_ngrams": 19851,
6
- "total_ngrams": 5964472
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_ngrams": 18183,
6
+ "total_ngrams": 5701163
7
  }
models/subword_ngram/bbc_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:de307e6c44b633e8d607ebdbd329107fa1f51230fa199d5272b932992601e309
3
- size 929802
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66675bb6688469ac3db798effd708b7b3e5f75e0f119c579a8f76b3e2b98fa07
3
+ size 867430
models/subword_ngram/bbc_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
- "unique_ngrams": 81359,
6
- "total_ngrams": 5963161
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bbc",
5
+ "unique_ngrams": 70417,
6
+ "total_ngrams": 5699950
7
  }
models/tokenizer/bbc_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:136cf9b48df47bd0325da4d7c88b57560631e30b6199a6208f4260be8e9552ee
3
- size 526133
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a176fe5ca0bdc5334c45eb11b39e17d285bd6a1469ac9c9df2ca41a79f9a60c
3
+ size 510189
models/tokenizer/bbc_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bbc_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2529fe3a9c19885eb39e8ccf47c668a7b0d05f23fec6e510f14e88006b6b7bfd
3
- size 829199
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:92d486b2cdc4197f788488f7923c1a959c468622b7cc6f4cc652213c29688a42
3
+ size 804933
models/tokenizer/bbc_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bbc_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:013a29e4795c8a56295410d77b42d5f8e382048830cd0cb9713d5e5e7d4843e0
3
- size 380151
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cd7423a518aa21b134b1db88b66fe95833b16223e79c31fab654c72db01f19bd
3
+ size 370418
models/tokenizer/bbc_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/bbc_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:98151fd7e6ed05510325054834af69754a9818fbdb10484a7067639a5bdf9974
3
- size 407231
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:063a599013021fa7eb8f83c98669f631f73439cf9f45048819ae9c1bfe1070e2
3
+ size 419766
models/vocabulary/bbc_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
1
  {
2
  "language": "bbc",
3
- "vocabulary_size": 24711,
 
4
  "statistics": {
5
- "type_token_ratio": 0.0481935549300518,
6
  "coverage": {
7
- "top_100": 0.5151721868117359,
8
- "top_1000": 0.7666824211970509,
9
- "top_5000": 0.894505559690854,
10
- "top_10000": 0.9355091168979777
11
  },
12
- "hapax_count": 25661,
13
- "hapax_ratio": 0.5094298419757007,
14
- "total_documents": 1311
15
  }
16
  }
 
1
  {
2
  "language": "bbc",
3
+ "vocabulary_size": 24970,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.05084399284522539,
7
  "coverage": {
8
+ "top_100": 0.5228987859930757,
9
+ "top_1000": 0.7641910545275995,
10
+ "top_5000": 0.8904758325943073,
11
+ "top_10000": 0.9321639184916853
12
  },
13
+ "hapax_count": 25769,
14
+ "hapax_ratio": 0.507873627781391,
15
+ "total_documents": 1213
16
  }
17
  }
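
A consistency check on the new numbers: `hapax_ratio` is the hapax count taken over the full type inventory, i.e. the retained vocabulary plus the hapaxes that the table excludes (its minimum frequency is 2, per the least-common-words list above). The committed values reproduce it exactly:

```python
hapax_count = 25769
vocabulary_size = 24970  # retained types; hapaxes stored separately (inference)
print(hapax_count / (vocabulary_size + hapax_count))  # 0.507873627781391
```
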
models/word_markov/bbc_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7e98885de06212fe4d458334ea6fca807e1bc10f3eeb95db70f713244d6ae9a7
3
- size 2077733
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:71dc86808478f99a7145bcc0a4ff6b725b6972f69dc381f3ae944224fbe4c16e
3
+ size 2205032
models/word_markov/bbc_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 50531,
6
- "total_transitions": 1302268
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 50697,
6
+ "total_transitions": 996722
7
  }
models/word_markov/bbc_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c2b1dee42bcb68e6a9fef3aa20e018a082f3dc777f1972d72933e2d6eea72d27
3
- size 6027713
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:728d851895842f58fe81fb7ab5ed1ae78d0307c5849828fe282e48bfdcd66b91
3
+ size 6594255
models/word_markov/bbc_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 307642,
6
- "total_transitions": 1300957
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 325909,
6
+ "total_transitions": 995509
7
  }
models/word_markov/bbc_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8acf1995ad2ef8973fc20a189c2be31adc36869291155c07279e2d2a5a7df71a
3
- size 10931319
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8f6ad77c32e92404bdcd431db829da9cd2b7fba3d776d4add45047f520d777bd
3
+ size 10832852
models/word_markov/bbc_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 667901,
6
- "total_transitions": 1299646
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 658447,
6
+ "total_transitions": 994305
7
  }
models/word_markov/bbc_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c2cca265740b80274df07e8a920e3c873c44dd9f14f647cf7c26d83cbc3ac68f
3
- size 14909288
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3f9fed8b38c168b6cb6bd140019ba39475f2256af74958a5be3b86590c35465
3
+ size 13542247
models/word_markov/bbc_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_contexts": 943940,
6
- "total_transitions": 1298335
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_contexts": 840046,
6
+ "total_transitions": 993102
7
  }
models/word_ngram/bbc_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d8d6402f7a1a57dd4dd1c3fd635fd7023724f3f67f2a4c61d61a5f224906cd25
3
- size 404690
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:35e7179149d81c363c63014a32d763e82ed4c68c09f8bf11666d2a411687d048
3
+ size 356865
models/word_ngram/bbc_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_ngrams": 30496,
6
- "total_ngrams": 1302268
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_ngrams": 26426,
6
+ "total_ngrams": 996722
7
  }
models/word_ngram/bbc_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f0e5d6a6cac168c51ea635254aca5bfd8c4702475845dac965c1b0991023487f
3
- size 849819
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:82fe52ca4876909aaae283e65cee6b4f353c5ee4954be84a71334c4d597b03dd
3
+ size 632278
models/word_ngram/bbc_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_ngrams": 62993,
6
- "total_ngrams": 1300957
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_ngrams": 43165,
6
+ "total_ngrams": 995509
7
  }
models/word_ngram/bbc_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ab915e5a6d57480fa0c0c0261b9d209b8f5d2872e033c606183a407608cfeff9
3
- size 1582171
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4f1f7eaa38f291087c7b01870d85c8eafea562d6c79b4b0dc82e882fa6bf5219
3
+ size 1017276
models/word_ngram/bbc_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
- "unique_ngrams": 114847,
6
- "total_ngrams": 1299646
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bbc",
5
+ "unique_ngrams": 67595,
6
+ "total_ngrams": 994305
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (before)

  • SHA256: e4252613ca102d71cfbbaf34b94e6af5d9651cbd57d0c5081ede58946e00c6f5
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB

Git LFS Details (after)

  • SHA256: 3fca8cacd33d00e4f6242b983e7210268def536fdfac878cbb2940260a259593
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED