omarkamali committed on
Commit 3171718 · verified · 1 Parent(s): b9f325b

Upload all models and assets for dag (20251001)

Files changed (50)
  1. README.md +302 -135
  2. models/embeddings/monolingual/dag_128d.bin +2 -2
  3. models/embeddings/monolingual/dag_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/dag_32d.bin +2 -2
  5. models/embeddings/monolingual/dag_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/dag_64d.bin +2 -2
  7. models/embeddings/monolingual/dag_64d_metadata.json +5 -3
  8. models/subword_markov/dag_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/dag_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/dag_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/dag_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/dag_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/dag_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/dag_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/dag_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/dag_2gram_subword.parquet +2 -2
  17. models/subword_ngram/dag_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/dag_3gram_subword.parquet +2 -2
  19. models/subword_ngram/dag_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/dag_4gram_subword.parquet +2 -2
  21. models/subword_ngram/dag_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/dag_tokenizer_16k.model +2 -2
  23. models/tokenizer/dag_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/dag_tokenizer_32k.model +2 -2
  25. models/tokenizer/dag_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/dag_tokenizer_64k.model +2 -2
  27. models/tokenizer/dag_tokenizer_64k.vocab +0 -0
  28. models/tokenizer/dag_tokenizer_8k.model +2 -2
  29. models/tokenizer/dag_tokenizer_8k.vocab +0 -0
  30. models/vocabulary/dag_vocabulary.parquet +2 -2
  31. models/vocabulary/dag_vocabulary_metadata.json +10 -9
  32. models/word_markov/dag_markov_ctx1_word.parquet +2 -2
  33. models/word_markov/dag_markov_ctx1_word_metadata.json +2 -2
  34. models/word_markov/dag_markov_ctx2_word.parquet +2 -2
  35. models/word_markov/dag_markov_ctx2_word_metadata.json +2 -2
  36. models/word_markov/dag_markov_ctx3_word.parquet +2 -2
  37. models/word_markov/dag_markov_ctx3_word_metadata.json +2 -2
  38. models/word_markov/dag_markov_ctx4_word.parquet +2 -2
  39. models/word_markov/dag_markov_ctx4_word_metadata.json +2 -2
  40. models/word_ngram/dag_2gram_word.parquet +2 -2
  41. models/word_ngram/dag_2gram_word_metadata.json +2 -2
  42. models/word_ngram/dag_3gram_word.parquet +2 -2
  43. models/word_ngram/dag_3gram_word_metadata.json +2 -2
  44. models/word_ngram/dag_4gram_word.parquet +2 -2
  45. models/word_ngram/dag_4gram_word_metadata.json +2 -2
  46. visualizations/embedding_isotropy.png +0 -0
  47. visualizations/embedding_norms.png +0 -0
  48. visualizations/embedding_similarity.png +2 -2
  49. visualizations/markov_branching.png +0 -0
  50. visualizations/markov_contexts.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
    type: compression
-   value: 3.798
  - name: best_isotropy
    type: isotropy
-   value: 0.7945
  - name: vocabulary_size
    type: vocab
-   value: 144073
- generated: 2025-12-29
  ---

  # DAG - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
- - N-gram models (2, 3, 4-gram)
- - Markov chains (context of 1, 2, 3 and 4)
  - Subword N-gram and Markov chains
- - Embeddings in various sizes and dimensions
  - Language Vocabulary
  - Language Statistics

  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- - [6. Summary & Recommendations](#6-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)
@@ -68,53 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and

  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
- | **8k** | 3.288x | 3.27 | 0.0331% | 869,625 |
- | **16k** | 3.508x | 3.49 | 0.0353% | 815,031 |
- | **32k** | 3.685x | 3.66 | 0.0371% | 775,924 |
- | **64k** | 3.798x 🏆 | 3.78 | 0.0383% | 752,740 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

- **Sample 1:** `Soyauxia velutina nyɛla Soyauxia zuliya la puuni zaɣa yini.`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁soya u xiav el ut ina ▁nyɛla ▁soya u ... (+7 more)` | 17 |
- | 16k | `▁soya u xiavel ut ina ▁nyɛla ▁soya u xia ... (+6 more)` | 16 |
- | 32k | `▁soya u xiavel ut ina ▁nyɛla ▁soya u xia ... (+6 more)` | 16 |
- | 64k | `▁soya u xiavel utina ▁nyɛla ▁soya u xia zuliya ... (+5 more)` | 15 |

- **Sample 2:** `Dagbanli Wikipedia nyɛla Wikipedia yaɣishɛli din sabbu nyɛla Dagbanli ka di tiri...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁dagbanliwikipedianyɛlawikipediayaɣi shɛli dinsabbunyɛladagbanli ... (+23 more)` | 33 |
- | 16k | `▁dagbanliwikipedia ▁nyɛla ▁wikipediayaɣishɛlidinsabbunyɛladagbanlika ... (+19 more)` | 29 |
- | 32k | `▁dagbanliwikipedia ▁nyɛla ▁wikipediayaɣishɛlidinsabbunyɛladagbanlika ... (+18 more)` | 28 |
- | 64k | `▁dagbanliwikipedia ▁nyɛla ▁wikipediayaɣishɛlidinsabbunyɛladagbanlika ... (+18 more)` | 28 |

- **Sample 3:** `Kaʒiya nyɛla Yaa Naa paɣa ŋun pahiri ayi yuli.
- Taarihi
- Kundivihira`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁ka ʒi ya ▁nyɛla ▁yaanaa ▁paɣa ▁ŋun pahiriayi ... (+4 more)` | 14 |
- | 16k | `▁ka ʒi ya ▁nyɛla ▁yaanaa ▁paɣa ▁ŋun pahiriayi ... (+4 more)` | 14 |
- | 32k | `▁ka ʒiya ▁nyɛla ▁yaa ▁naapaɣa ▁ŋunpahiriayi ▁yuli ... (+3 more)` | 13 |
- | 64k | `▁ka ʒiya ▁nyɛla ▁yaanaa ▁paɣa ▁ŋun pahiriayiyuli ... (+3 more)` | 13 |

  ### Key Findings

- - **Best Compression:** 64k achieves 3.798x compression
- - **Lowest UNK Rate:** 8k with 0.0331% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -123,57 +129,89 @@ Kundivihira`

  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
- |--------|------------|---------|----------------|------------------|-------------------|
- | **2-gram** | 24,966 🏆 | 14.61 | 188,164 | 17.0% | 36.0% |
- | **2-gram** | 423 🏆 | 8.72 | 7,859 | 55.8% | 97.7% |
- | **3-gram** | 60,310 | 15.88 | 362,741 | 14.0% | 28.1% |
- | **3-gram** | 4,210 | 12.04 | 70,197 | 17.9% | 59.2% |
- | **4-gram** | 110,926 | 16.76 | 665,471 | 13.6% | 24.8% |
- | **4-gram** | 26,280 | 14.68 | 408,931 | 8.8% | 29.5% |

  ### Top 5 N-grams by Size

- **2-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `| |` | 134,746 |
- | 2 | `align =` | 45,935 |
- | 3 | `| align` | 45,817 |
- | 4 | `= "` | 42,517 |
- | 5 | `" |` | 41,324 |

- **3-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `| align =` | 45,817 |
- | 2 | `align = "` | 38,202 |
- | 3 | `| | align` | 38,026 |
- | 4 | `= " center` | 23,000 |
- | 5 | `" center "` | 23,000 |

- **4-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `| align = "` | 38,119 |
- | 2 | `| | align =` | 38,026 |
- | 3 | `align = " center` | 23,000 |
- | 4 | `= " center "` | 22,999 |
- | 5 | `" center " |` | 22,948 |

  ### Key Findings

- - **Best Perplexity:** 2-gram with 423
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
- - **Coverage:** Top-1000 patterns cover ~29% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance

  ---
@@ -181,55 +219,86 @@ Kundivihira`

  ![Markov Entropy](visualizations/markov_entropy.png)

  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
- |---------|-------------|------------|------------------|-----------------|----------------|
- | **1** | 0.5033 | 1.417 | 4.95 | 440,521 | 49.7% |
- | **1** | 1.2280 | 2.342 | 7.57 | 3,894 | 0.0% |
- | **2** | 0.3109 | 1.241 | 1.99 | 2,180,092 | 68.9% |
- | **2** | 0.7059 | 1.631 | 5.02 | 29,465 | 29.4% |
- | **3** | 0.1506 | 1.110 | 1.33 | 4,328,117 | 84.9% |
- | **3** | 0.8636 | 1.820 | 4.71 | 148,003 | 13.6% |
- | **4** | 0.0788 🏆 | 1.056 | 1.15 | 5,753,766 | 92.1% |
- | **4** | 0.7332 🏆 | 1.662 | 3.29 | 697,702 | 26.7% |

- ### Generated Text Samples

- Below are text samples generated from each Markov chain model:

  **Context Size 1:**

- 1. `, trevor jones and circulation . " | 2019 . taylor sheridan , bruxelles ’ shεli`
- 2. `| g | jim sturgess , emma thompson , kabɛ zani mi binyahiri maalibu mini queens`
- 3. `. > ŋɔ ka tuunvɛla tumbu bee yahibu zuɣu o ŋmai jia , domin bɛ yεli`

  **Context Size 2:**

- 1. `| | 2 . 2the price is rightvirtuosthe price is right : the voyage of sinbad ted`
- 2. `align = center | | align = " top " | north carolina arts coalition 1979 :`
- 3. `| align = " center " | | - | align = " center " | -`

  **Context Size 3:**

- 1. `| align = " left " | kentucky | | align = center | | - | align`
- 2. `align = " left " | real madrid | align = " left " | grambling state |`
- 3. `| | align = " center " | | | align = center | | - | align`

  **Context Size 4:**

- 1. `| align = " left " | tulsa | | align = " left " | liu brooklyn |`
- 2. `| | align = " center " | 1 | | align = " center " | f |`
- 3. `align = " center " | 1 | | align = " left " | utah | | align`

  ### Key Findings

- - **Best Predictability:** Context-4 with 92.1% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
- - **Memory Trade-off:** Larger contexts require more storage (697,702 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation

  ---
@@ -245,64 +314,64 @@ Below are text samples generated from each Markov chain model:

  | Metric | Value |
  |--------|-------|
- | Vocabulary Size | 144,073 |
- | Total Tokens | 6,561,803 |
- | Mean Frequency | 45.54 |
  | Median Frequency | 4 |
- | Frequency Std Dev | 780.34 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | ni | 105,049 |
- | 2 | the | 93,758 |
- | 3 | of | 89,633 |
- | 4 | daa | 75,893 |
- | 5 | o | 71,379 |
- | 6 | ka | 70,354 |
- | 7 | n | 52,468 |
- | 8 | nyɛla | 50,022 |
- | 9 | din | 48,371 |
- | 10 | align | 46,328 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | hadja | 2 |
- | 2 | labcitoyen | 2 |
- | 3 | yikonim | 2 |
- | 4 | fiqhi | 2 |
- | 5 | sapuhi | 2 |
- | 6 | hoti | 2 |
- | 7 | xai | 2 |
- | 8 | coloboma | 2 |
- | 9 | ziɛ | 2 |
- | 10 | bɔɔlɔ | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
- | Zipf Coefficient | 1.0748 |
- | R² (Goodness of Fit) | 0.994305 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
- | Top 100 | 31.7% |
- | Top 1,000 | 59.4% |
- | Top 5,000 | 78.0% |
- | Top 10,000 | 84.7% |

  ### Key Findings

- - **Zipf Compliance:** R²=0.9943 indicates excellent adherence to Zipf's law
- - **High Frequency Dominance:** Top 100 words cover 31.7% of corpus
- - **Long Tail:** 134,073 words needed for remaining 15.3% coverage

  ---
  ## 5. Word Embeddings Evaluation
  ## 5. Word Embeddings Evaluation
@@ -315,24 +384,119 @@ Below are text samples generated from each Markov chain model:
315
 
316
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
317
 
318
- ### Model Comparison
319
 
320
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
321
- |-------|------------|-----------|----------|----------|----------|
322
- | **mono_32d** | 82,599 | 32 | 3.850 | 1.140 | 0.7771 |
323
- | **mono_64d** | 82,599 | 64 | 4.436 | 1.061 | 0.7860 |
324
- | **mono_128d** | 82,599 | 128 | 5.151 | 0.960 | 0.7945 🏆 |
325
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 
 
 
 
 
 
326
 
327
  ### Key Findings
328
 
329
- - **Best Isotropy:** mono_128d with 0.7945 (more uniform distribution)
330
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
331
- - **Vocabulary Coverage:** All models cover 82,599 words
332
- - **Recommendation:** 100d for balanced semantic capture and efficiency
333
 
334
  ---
335
- ## 6. Summary & Recommendations
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

  ![Performance Dashboard](visualizations/performance_dashboard.png)

@@ -340,11 +504,12 @@ Below are text samples generated from each Markov chain model:

  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
- | Tokenizer | **32k BPE** | Best compression (3.80x) with low UNK rate |
- | N-gram | **5-gram** | Lowest perplexity (423) |
- | Markov | **Context-4** | Highest predictability (92.1%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

  ---
  ## Appendix: Metrics Glossary & Interpretation Guide
@@ -534,7 +699,8 @@ If you use these models in your research, please cite:
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
- publisher = {HuggingFace},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
  }
@@ -550,7 +716,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)

  ---
  *Generated by Wikilangs Models Pipeline*

- *Report Date: 2025-12-29 09:14:13*
  metrics:
  - name: best_compression_ratio
    type: compression
+   value: 3.797
  - name: best_isotropy
    type: isotropy
+   value: 0.8190
  - name: vocabulary_size
    type: vocab
+   value: 0
+ generated: 2026-01-03
  ---

  # DAG - Wikilangs Models
 
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
+ - N-gram models (2, 3, 4, 5-gram)
+ - Markov chains (context of 1, 2, 3, 4 and 5)
  - Subword N-gram and Markov chains
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
  - Language Vocabulary
  - Language Statistics
+
  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation
 
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+ - [7. Summary & Recommendations](#7-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)
 
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
+ | **8k** | 3.299x | 3.30 | 0.0715% | 902,227 |
+ | **16k** | 3.519x | 3.52 | 0.0763% | 845,892 |
+ | **32k** | 3.683x | 3.68 | 0.0798% | 808,030 |
+ | **64k** | 3.797x 🏆 | 3.80 | 0.0823% | 783,801 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

+ **Sample 1:** `Tamale International School (TIS) nyɛla kariŋ zuŋ ti talli m bɛ Jisonayili, Sagn...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁tamale ▁international ▁school( tis ) ▁nyɛla ▁kariŋ ▁zu ŋ ... (+14 more)` | 24 |
+ | 16k | `▁tamale ▁international ▁school( tis ) ▁nyɛla ▁kariŋ ▁zu ŋ ... (+11 more)` | 21 |
+ | 32k | `▁tamale ▁international ▁school( tis ) ▁nyɛla ▁kariŋ ▁zuŋ ▁ti ... (+10 more)` | 20 |
+ | 64k | `▁tamale ▁international ▁school( tis ) ▁nyɛla ▁kariŋ ▁zuŋti ... (+10 more)` | 20 |

+ **Sample 2:** ` nyɛla ti gbansabila paɣiba ban nyɛ toondanim bee tiŋgbani zuɣulanima`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁nyɛlatigbansabilapaɣibabannyɛtoond anim bee ... (+3 more)` | 13 |
+ | 16k | `▁ ▁nyɛla ▁tigbansabilapaɣibabannyɛtoond anim bee ... (+3 more)` | 13 |
+ | 32k | `▁ ▁nyɛla ▁tigbansabilapaɣibabannyɛtoond anim bee ... (+3 more)` | 13 |
+ | 64k | `▁ ▁nyɛla ▁tigbansabilapaɣibabannyɛtoond anim bee ... (+3 more)` | 13 |

+ **Sample 3:** `GoondaaNaden, Tony. Dagbani dictionary. Webonary. Kundivihira`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁go on da anaden ,tony .dagbanidictionary . ... (+3 more)` | 13 |
+ | 16k | `▁go on da anaden ,tony .dagbanidictionary . ... (+3 more)` | 13 |
+ | 32k | `▁go on da anaden ,tony .dagbanidictionary . ... (+3 more)` | 13 |
+ | 64k | `▁go onda anaden ,tony .dagbanidictionary . webonary ... (+2 more)` | 12 |

  ### Key Findings

+ - **Best Compression:** 64k achieves 3.797x compression
+ - **Lowest UNK Rate:** 8k with 0.0715% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
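
The `▁` word-boundary pieces in the samples above and the paired `.model`/`.vocab` files suggest SentencePiece tokenizers. A minimal sketch of loading one and estimating a compression ratio; the file format and the exact compression definition (characters per token) are assumptions, not confirmed by the report:

```python
import sentencepiece as spm

# Load one tokenizer; the .model files are assumed to be standard
# SentencePiece models (suggested by the `▁` pieces, not confirmed).
sp = spm.SentencePieceProcessor(
    model_file="models/tokenizer/dag_tokenizer_32k.model"
)

text = "Dagbanli Wikipedia nyɛla Wikipedia yaɣishɛli"
pieces = sp.encode(text, out_type=str)

# Compression taken here as characters per token; whether the report
# counts characters or bytes is an assumption.
print(pieces)
print(f"{len(text) / len(pieces):.3f}x")
```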

  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

+ ![N-gram Unique](visualizations/ngram_unique.png)
+
  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
+ | **2-gram** | Word | 32,119 | 14.97 | 135,454 | 12.8% | 30.2% |
+ | **2-gram** | Subword | 338 🏆 | 8.40 | 6,662 | 61.1% | 98.8% |
+ | **3-gram** | Word | 61,294 | 15.90 | 205,054 | 9.7% | 22.3% |
+ | **3-gram** | Subword | 3,287 | 11.68 | 48,860 | 19.7% | 63.9% |
+ | **4-gram** | Word | 122,956 | 16.91 | 377,494 | 8.8% | 17.3% |
+ | **4-gram** | Subword | 20,734 | 14.34 | 281,639 | 9.1% | 31.1% |

  ### Top 5 N-grams by Size

+ **2-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `of the` | 21,384 |
+ | 2 | `n ti` | 15,953 |
+ | 3 | `o daa` | 10,685 |
+ | 4 | `din be` | 10,124 |
+ | 5 | `ni daa` | 9,962 |
+
+ **3-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `of the year` | 4,890 |
+ | 2 | `n ti pahi` | 4,503 |
+ | 3 | `zaŋ n ti` | 3,966 |
+ | 4 | `nyɛla bɛ ni` | 3,607 |
+ | 5 | `bɛ ni daa` | 3,248 |
+
+ **4-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `ninsali biɛlim kalibu baŋsim` | 2,948 |
+ | 2 | `biɛlim kalibu baŋsim bɔhimbu` | 2,948 |
+ | 3 | `zalikpana mini gɔmnanti tali` | 2,947 |
+ | 4 | `ni nyamma soya economy` | 2,945 |
+ | 5 | `demographics ninsali biɛlim kalibu` | 2,944 |
+
+ **2-grams (Subword):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `a _` | 739,697 |
+ | 2 | `i _` | 724,304 |
+ | 3 | `n _` | 498,067 |
+ | 4 | `a n` | 496,882 |
+ | 5 | `, _` | 495,235 |

+ **3-grams (Subword):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `n i _` | 221,639 |
+ | 2 | `_ n i` | 165,629 |
+ | 3 | `_ m a` | 130,342 |
+ | 4 | `l i _` | 130,046 |
+ | 5 | `_ d a` | 129,510 |

+ **4-grams (Subword):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `t h e _` | 98,150 |
+ | 2 | `_ t h e` | 92,918 |
+ | 3 | `_ n i _` | 91,122 |
+ | 4 | `_ o f _` | 87,857 |
+ | 5 | `_ d a a` | 76,848 |

  ### Key Findings

+ - **Best Perplexity:** 2-gram (subword) with 338
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+ - **Coverage:** Top-1000 patterns cover ~31% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance

  ---
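
The Perplexity and Entropy columns above are consistent with PP = 2^H, where H is the Shannon entropy in bits (e.g. 2^8.40 ≈ 338 for the subword 2-gram). A sketch of recomputing both from one of the released n-gram tables; the `count` column name is an assumption about the parquet schema:

```python
import numpy as np
import pandas as pd

# Assumed schema: one row per n-gram with a "count" column.
df = pd.read_parquet("models/subword_ngram/dag_2gram_subword.parquet")
p = df["count"].to_numpy() / df["count"].sum()

entropy = -(p * np.log2(p)).sum()   # Shannon entropy H, in bits
perplexity = 2.0 ** entropy          # 2^8.40 ≈ 338 for this table
print(f"H = {entropy:.2f} bits, PP = {perplexity:.0f}")
```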
 

  ![Markov Entropy](visualizations/markov_entropy.png)

+ ![Markov Contexts](visualizations/markov_contexts.png)
+
  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
+ | **1** | Word | 0.7248 | 1.653 | 6.35 | 344,988 | 27.5% |
+ | **1** | Subword | 1.1283 | 2.186 | 6.69 | 4,037 | 0.0% |
+ | **2** | Word | 0.2745 | 1.210 | 1.73 | 2,189,455 | 72.6% |
+ | **2** | Subword | 0.6262 | 1.543 | 4.19 | 27,009 | 37.4% |
+ | **3** | Word | 0.1110 | 1.080 | 1.21 | 3,779,471 | 88.9% |
+ | **3** | Subword | 0.7294 | 1.658 | 4.22 | 113,279 | 27.1% |
+ | **4** | Word | 0.0538 🏆 | 1.038 | 1.09 | 4,582,569 | 94.6% |
+ | **4** | Subword | 0.7212 | 1.649 | 3.38 | 478,359 | 27.9% |
+
+ ### Generated Text Samples (Word-based)
+
+ Below are text samples generated from each word-based Markov chain model:

+ **Context Size 1:**
+
+ 1. `ni nyamma soya economy zalikpana mini polish o nyɛla bɛ tooi lahi sabiri yɛltɔɣa 23 47`
+ 2. `the break media binkpɛra transportation kundivihira pubu yaɣali tum yuuni fifa confederations cup s ...`
+ 3. `of the kurds of a african american lens nyɛla dolodolo mabiligu zaa tinsi salima di ni`
+
+ **Context Size 2:**
+
+ 1. `of the visual arts general science karimba ni climatologist o piligu mini o tumo tarsi tɔ taali`
+ 2. `n ti wɔbigi paati jintɔra justice baah mathuselah daa nyɛla nigeria sasabira niriba bela n daa tɔ`
+ 3. `o daa lahi sôå kpaåsi kaya ni taɣada culture lahabali churi media binkpɛra transportation kundivihir...`
+
+ **Context Size 3:**
+
+ 1. `of the year featuring farruko la familia urban album of the year lo siento bb himself best male`
+ 2. `n ti pahi metropolitan museum of art contemporary black artists july 1 31 counterpoints 23 march 16 ...`
+ 3. `zaŋ n ti daily graphic graphic communications group limited nima n daa ti o photographic curatorship...`
+
+ **Context Size 4:**

+ 1. `biɛlim kalibu baŋsim bɔhimbu bomma ni nyamma soya economy zalikpana mini gɔmnanti tali law and gover...`
+ 2. `ninsali biɛlim kalibu baŋsim bɔhimbu bomma ni nyamma soya economy zalikpana mini gɔmnanti tali law a...`
+ 3. `zalikpana mini gɔmnanti tali law and government baŋsim bɔbu education kaya ni taada lahabali churi m...`
+
+ ### Generated Text Samples (Subword-based)
+
+ Below are text samples generated from each subword-based Markov chain model:

  **Context Size 1:**

+ 1. `_tamprecstessia_`
+ 2. `abrae_devineri_f`
+ 3. `ir_imaa_munghica`

  **Context Size 2:**

+ 1. `a_noadoma_pause_a`
+ 2. `i_smi_bortion_ght`
+ 3. `n_sh_ana_/_mankss`

  **Context Size 3:**

+ 1. `ni_sologic_schardk`
+ 2. `_ni_bɛ_tumahaba_pv`
+ 3. `_may_les_populi_ma`

  **Context Size 4:**

+ 1. `the_cissued_tieth_c`
+ 2. `_the_sunships,_larr`
+ 3. `_ni_lebowalestory_c`

  ### Key Findings

+ - **Best Predictability:** Context-4 (word) with 94.6% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
+ - **Memory Trade-off:** Larger contexts require more storage (478,359 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation

  ---
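
Samples like those above come from weighted sampling over the stored transition tables. A minimal sketch of the idea; the parquet schema (`context`, `next`, `count` columns) is an assumption, not the pipeline's documented format:

```python
import random
import pandas as pd

# Assumed schema: one row per (context, next-word) transition with a count.
df = pd.read_parquet("models/word_markov/dag_markov_ctx2_word.parquet")

def step(context: str) -> str:
    """Sample the next word for a context string, weighted by count."""
    rows = df[df["context"] == context]
    return random.choices(rows["next"].tolist(),
                          weights=rows["count"].tolist())[0]

words = ["o", "daa"]  # seed length must match the context size (2 here)
for _ in range(15):
    words.append(step(" ".join(words[-2:])))
print(" ".join(words))
```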
 

  | Metric | Value |
  |--------|-------|
+ | Vocabulary Size | 131,668 |
+ | Total Tokens | 5,761,123 |
+ | Mean Frequency | 43.75 |
  | Median Frequency | 4 |
+ | Frequency Std Dev | 757.65 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
+ | 1 | ni | 104,103 |
+ | 2 | the | 91,175 |
+ | 3 | of | 87,976 |
+ | 4 | daa | 75,182 |
+ | 5 | o | 70,845 |
+ | 6 | ka | 69,699 |
+ | 7 | n | 51,684 |
+ | 8 | nyɛla | 49,641 |
+ | 9 | din | 47,965 |
+ | 10 | di | 44,711 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
+ | 1 | menteith | 2 |
+ | 2 | marischal | 2 |
+ | 3 | dupplin | 2 |
+ | 4 | malakula | 2 |
+ | 5 | ambrym | 2 |
+ | 6 | malekula | 2 |
+ | 7 | biili | 2 |
+ | 8 | chaira | 2 |
+ | 9 | juŋ | 2 |
+ | 10 | surim | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
+ | Zipf Coefficient | 1.0503 |
+ | R² (Goodness of Fit) | 0.994826 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
+ | Top 100 | 31.5% |
+ | Top 1,000 | 58.6% |
+ | Top 5,000 | 77.5% |
+ | Top 10,000 | 84.5% |

  ### Key Findings

+ - **Zipf Compliance:** R²=0.9948 indicates excellent adherence to Zipf's law
+ - **High Frequency Dominance:** Top 100 words cover 31.5% of corpus
+ - **Long Tail:** 121,668 words needed for remaining 15.5% coverage

  ---
  ## 5. Word Embeddings Evaluation
 
384
 
385
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
386
 
 
387
 
388
+ ### 5.1 Cross-Lingual Alignment
389
+
390
+ > *Note: Multilingual alignment visualization not available for this language.*
391
+
392
+
393
+ ### 5.2 Model Comparison
394
+
395
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
396
+ |-------|-----------|----------|------------------|---------------|----------------|
397
+ | **mono_32d** | 32 | 0.7977 | 0.3405 | N/A | N/A |
398
+ | **mono_64d** | 64 | 0.8086 | 0.2759 | N/A | N/A |
399
+ | **mono_128d** | 128 | 0.8190 🏆 | 0.2136 | N/A | N/A |
400
 
401
  ### Key Findings
402
 
403
+ - **Best Isotropy:** mono_128d with 0.8190 (more uniform distribution)
404
+ - **Semantic Density:** Average pairwise similarity of 0.2767. Lower values indicate better semantic separation.
405
+ - **Alignment Quality:** No aligned models evaluated in this run.
406
+ - **Recommendation:** 128d aligned for best cross-lingual performance
407
 
408
  ---
409
+ ## 6. Morphological Analysis (Experimental)
410
+
411
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
412
+
413
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
414
+
415
+ ### 6.1 Productivity & Complexity
416
+
417
+ | Metric | Value | Interpretation | Recommendation |
418
+ |--------|-------|----------------|----------------|
419
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
420
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
421
+
422
+ ### 6.2 Affix Inventory (Productive Units)
423
+
424
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
425
+
426
+ #### Productive Prefixes
427
+ | Prefix | Examples |
428
+ |--------|----------|
429
+ | `-ma` | maresca, malaquais, maehara |
430
+
431
+ #### Productive Suffixes
432
+ | Suffix | Examples |
433
+ |--------|----------|
434
+ | `-er` | abaranger, bridgwater, alencier |
435
+ | `-an` | seyitan, weitman, eghan |
436
+ | `-ed` | crowned, programmed, loosed |
437
+ | `-ng` | rongguang, invading, watling |
438
+ | `-on` | ferguson, kongaction, turgeon |
439
+
440
+ ### 6.3 Bound Stems (Lexical Roots)
441
+
442
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
443
+
444
+ | Stem | Cohesion | Substitutability | Examples |
445
+ |------|----------|------------------|----------|
446
+ | `ihir` | 2.44x | 42 contexts | vihir, vihiri, lihira |
447
+ | `ison` | 2.20x | 60 contexts | sison, bison, isong |
448
+ | `uuni` | 2.39x | 37 contexts | tuuni, nuuni, guuni |
449
+ | `nter` | 1.87x | 69 contexts | unter, enter, inter |
450
+ | `ctor` | 1.94x | 43 contexts | actor, actors, actora |
451
+ | `riso` | 2.31x | 23 contexts | prison, bɔriso, arison |
452
+ | `reen` | 1.99x | 37 contexts | green, breen, reena |
453
+ | `atio` | 1.84x | 46 contexts | patio, ation, ratio |
454
+ | `tern` | 1.80x | 48 contexts | terna, stern, terns |
455
+ | `ture` | 1.74x | 54 contexts | cuture, mature, nature |
456
+ | `rect` | 2.18x | 23 contexts | recta, recto, direct |
457
+ | `awar` | 1.86x | 40 contexts | aware, pawar, yawar |
458
+
459
+ ### 6.4 Affix Compatibility (Co-occurrence)
460
+
461
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
462
+
463
+ | Prefix | Suffix | Frequency | Examples |
464
+ |--------|--------|-----------|----------|
465
+ | `-ma` | `-ng` | 4 words | managing, mating |
466
+ | `-ma` | `-ed` | 3 words | maherunited, manhandled |
467
+ | `-ma` | `-on` | 2 words | manon, mathison |
468
+ | `-ma` | `-an` | 2 words | magpakailanman, marjan |
469
+ | `-ma` | `-er` | 1 words | manger, mater |
470
+
471
+ ### 6.5 Recursive Morpheme Segmentation
472
+
473
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
474
+
475
+ | Word | Suggested Split | Confidence | Stem |
476
+ |------|-----------------|------------|------|
477
+ | kambangan | **`kamba-ng-an`** | 6.0 | `kamba` |
478
+ | illumination | **`illuminati-on`** | 4.5 | `illuminati` |
479
+ | parenting | **`parenti-ng`** | 4.5 | `parenti` |
480
+ | gregorian | **`gregori-an`** | 4.5 | `gregori` |
481
+ | transkeian | **`transkei-an`** | 4.5 | `transkei` |
482
+ | sheltered | **`shelt-er-ed`** | 3.0 | `shelt` |
483
+ | abandoned | **`aband-on-ed`** | 3.0 | `aband` |
484
+ | mannheimer | **`ma-nnheim-er`** | 3.0 | `nnheim` |
485
+ | malnutrition | **`ma-lnutriti-on`** | 3.0 | `lnutriti` |
486
+ | homemaker | **`homemak-er`** | 1.5 | `homemak` |
487
+ | swintonunited | **`swintonunit-ed`** | 1.5 | `swintonunit` |
488
+ | xiaoxiang | **`xiaoxia-ng`** | 1.5 | `xiaoxia` |
489
+ | venneraunited | **`venneraunit-ed`** | 1.5 | `venneraunit` |
490
+ | grantunited | **`grantunit-ed`** | 1.5 | `grantunit` |
491
+ | substation | **`substati-on`** | 1.5 | `substati` |
492
+
493
+ ### 6.6 Linguistic Interpretation
494
+
495
+ > **Automated Insight:**
496
+ The language DAG appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
497
+
498
+ ---
499
+ ## 7. Summary & Recommendations
500
 
501
  ![Performance Dashboard](visualizations/performance_dashboard.png)
502
 
 
504
 
505
  | Component | Recommended | Rationale |
506
  |-----------|-------------|-----------|
507
+ | Tokenizer | **64k BPE** | Best compression (3.80x) |
508
+ | N-gram | **2-gram** | Lowest perplexity (338) |
509
+ | Markov | **Context-4** | Highest predictability (94.6%) |
510
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
511
 
512
+
513
  ---
514
  ## Appendix: Metrics Glossary & Interpretation Guide
515
 
 
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
  }

  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

+ *Report Date: 2026-01-03 11:48:18*
models/embeddings/monolingual/dag_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f945153d1b68b1edb75044321cd587df1d48f52d17c013af1cf4c9c49e22894a
- size 1109989760
+ oid sha256:4d00f16d9be6613f95ad245b565c957bc4b77c2ffe3c6c97ed55fc35874cbb02
+ size 1103962437
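
All binary assets in this commit are stored with Git LFS, so each diff touches only the three-line pointer file tracked in Git; the payload itself lives in LFS storage, keyed by the `oid`. The complete new pointer for this file reads:

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d00f16d9be6613f95ad245b565c957bc4b77c2ffe3c6c97ed55fc35874cbb02
size 1103962437
```

Running `git lfs pull` after cloning fetches the actual binaries.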
models/embeddings/monolingual/dag_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
    "dimension": 128,
    "version": "monolingual",
    "training_params": {
-     "dim": 128,
+     "algorithm": "skipgram",
      "min_count": 5,
      "window": 5,
      "negative": 5,
-     "epochs": 5
+     "epochs": 5,
+     "encoding_method": "rope",
+     "dim": 128
    },
-   "vocab_size": 82599
+   "vocab_size": 76820
  }
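
The `training_params` above map onto standard skip-gram training with negative sampling. For illustration only, here is how the same settings look in gensim's `Word2Vec`; the pipeline's actual trainer, and its `"encoding_method": "rope"` field, have no counterpart in this sketch:

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="dag_corpus.txt",  # hypothetical path: one sentence per line
    vector_size=128,               # "dim": 128
    sg=1,                          # "algorithm": "skipgram"
    min_count=5,                   # "min_count": 5
    window=5,                      # "window": 5
    negative=5,                    # "negative": 5
    epochs=5,                      # "epochs": 5
)
print(len(model.wv))               # metadata reports vocab_size 76,820
```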
models/embeddings/monolingual/dag_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:18d0cd46a86dd79214da91f2a3df6ef9c65e6668b1587cc98559a3965acaea04
- size 278553728
+ oid sha256:7a71b8db5dcabf646b63006ac127c9431a0329bc1b838c8056ba69701f137080
+ size 276964677

models/embeddings/monolingual/dag_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
    "dimension": 32,
    "version": "monolingual",
    "training_params": {
-     "dim": 32,
+     "algorithm": "skipgram",
      "min_count": 5,
      "window": 5,
      "negative": 5,
-     "epochs": 5
+     "epochs": 5,
+     "encoding_method": "rope",
+     "dim": 32
    },
-   "vocab_size": 82599
+   "vocab_size": 76820
  }
models/embeddings/monolingual/dag_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4c1cd35ff33daa04fd2ea99f66368e3d72df9a82f6064c7f3608a3c63502d1a0
- size 555699072
+ oid sha256:b0cbe4ba5bff3c754108b7e2cdf941847dabea53417274bd516d6558b4d8fe15
+ size 552630597

models/embeddings/monolingual/dag_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
    "dimension": 64,
    "version": "monolingual",
    "training_params": {
-     "dim": 64,
+     "algorithm": "skipgram",
      "min_count": 5,
      "window": 5,
      "negative": 5,
-     "epochs": 5
+     "epochs": 5,
+     "encoding_method": "rope",
+     "dim": 64
    },
-   "vocab_size": 82599
+   "vocab_size": 76820
  }
models/subword_markov/dag_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:623eda077da228d68a78166343d09134ce1dcb888ad8cd2f1464875c5e0e46d3
- size 211790
+ oid sha256:4dcab6aeb23a2a118f72e7f0aeae2b8870148e09f24543081b5214fa182f8de7
+ size 199146

models/subword_markov/dag_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 1,
    "variant": "subword",
    "language": "dag",
-   "unique_contexts": 3894,
-   "total_transitions": 44529804
+   "unique_contexts": 4037,
+   "total_transitions": 37349732
  }

models/subword_markov/dag_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4e60104fd123c8a08af274dda2b93e9b7ec3ac0aaf6d4b2cf60bbe31242b5c3e
- size 1206647
+ oid sha256:08f53c1fdb7f4c76ae169e25b12f81666d01dcfab8b202b8a44bfcdd2d782375
+ size 968006

models/subword_markov/dag_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 2,
    "variant": "subword",
    "language": "dag",
-   "unique_contexts": 29465,
-   "total_transitions": 44514709
+   "unique_contexts": 27009,
+   "total_transitions": 37334959
  }

models/subword_markov/dag_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:36f8027ca4fce4bf3dadd51e460200e88d4d0c095c79b05cb7e29c0433dbbc00
- size 5152998
+ oid sha256:c51b7acbe6f43a51b3c918db657fb4e7e81da4d78e8c3084a0a87dad548b73a0
+ size 3766626

models/subword_markov/dag_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 3,
    "variant": "subword",
    "language": "dag",
-   "unique_contexts": 148003,
-   "total_transitions": 44499614
+   "unique_contexts": 113279,
+   "total_transitions": 37320186
  }

models/subword_markov/dag_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8cc197777a522334fc267d067bc251625a1e1aec095e9c28aeb8aebc0427f549
- size 17447977
+ oid sha256:282669cd7b9e381bac5380875c673efece93e4c4cd175997c0d421789b5b7d2b
+ size 12359006

models/subword_markov/dag_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 4,
    "variant": "subword",
    "language": "dag",
-   "unique_contexts": 697702,
-   "total_transitions": 44484519
+   "unique_contexts": 478359,
+   "total_transitions": 37305413
  }
models/subword_ngram/dag_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:cb0c9ee865f5f6be5e336b948ef653abe84f4b00a044b3f849814d785b4c56ff
- size 103107
+ oid sha256:424eb991c814c6bded4d9b5b7ae5b9c29b14dbb12e16ca36447fc358ba63b9d4
+ size 88328

models/subword_ngram/dag_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 2,
    "variant": "subword",
    "language": "dag",
-   "unique_ngrams": 7859,
-   "total_ngrams": 44529804
+   "unique_ngrams": 6662,
+   "total_ngrams": 37349732
  }

models/subword_ngram/dag_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2b3ae8520f2bc9d1853d7afb79b2d8606e3a534e55ee2e6c2ecd1a3f0d4a46d3
- size 838705
+ oid sha256:1710397b15aa7f5e18eafea8221a048c648e298ec1ab262bbdb99a61b7eed2fb
+ size 627540

models/subword_ngram/dag_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 3,
    "variant": "subword",
    "language": "dag",
-   "unique_ngrams": 70197,
-   "total_ngrams": 44514709
+   "unique_ngrams": 48860,
+   "total_ngrams": 37334959
  }

models/subword_ngram/dag_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:450f1defa9f633d782a23a168973c56bd4a2fefbe9636636351d8921fef62725
- size 4510438
+ oid sha256:d741ca12b56a31b6cec08e7dd10051326c0bffcc6e6f714ba6dbb0a748fc2af4
+ size 3149822

models/subword_ngram/dag_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 4,
    "variant": "subword",
    "language": "dag",
-   "unique_ngrams": 408931,
-   "total_ngrams": 44499614
+   "unique_ngrams": 281639,
+   "total_ngrams": 37320186
  }
models/tokenizer/dag_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e9ecc269bcbaa87293a060665c9d55554adbe20cf6dad26e8a96452182e839c1
- size 501542
+ oid sha256:5ec34742853f7f09d3c027bd4c59d25d4f51a22ea4ec1a56a84487038d0a3c7c
+ size 501847

models/tokenizer/dag_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff.

models/tokenizer/dag_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:32bf5d9ccbeca9630e8a4ee3eda874ad86b4af05fbc5ceb2f64cd5008c99a529
- size 769222
+ oid sha256:c741184d829058c5b71ec3219c290dd8af79ec0d95c96d7e85772dbcb15368e6
+ size 767801

models/tokenizer/dag_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff.

models/tokenizer/dag_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:db01cc19b1f2a063f377218ef2aa4b890dfe31d0b042d85e25b0671d520b399e
- size 1315381
+ oid sha256:816c70ef2066989f637164f982242822307293ed74ca5704893e8d0e90d112c5
+ size 1308307

models/tokenizer/dag_tokenizer_64k.vocab CHANGED
The diff for this file is too large to render. See raw diff.

models/tokenizer/dag_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:699db67fe4c4771bf34490bf448d9f572320a09c154f290db947c051b0a39145
- size 368785
+ oid sha256:4eb6d54a65133acd4ad994c022d72028d652288dbb0a9cdbb3aaf9bbb3d462aa
+ size 370487

models/tokenizer/dag_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff.
models/vocabulary/dag_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8d3b25fc3f84ce57bfb425cde12b6f3a252bf57664a8f82c09a765343f6e31bd
- size 2388042
+ oid sha256:68dddcb0559f6c5317590e5b88cf3a16779cc57890a35aaac09a1125ef4e9d2f
+ size 2215170

models/vocabulary/dag_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
    "language": "dag",
-   "vocabulary_size": 144073,
+   "vocabulary_size": 131668,
+   "variant": "full",
    "statistics": {
-     "type_token_ratio": 0.06420470311600282,
+     "type_token_ratio": 0.05775705051748038,
      "coverage": {
-       "top_100": 0.30289941060496944,
-       "top_1000": 0.5684920011507646,
-       "top_5000": 0.7466727053131291,
-       "top_10000": 0.81046446299815
+       "top_100": 0.3041086439325898,
+       "top_1000": 0.564692496107641,
+       "top_5000": 0.7471635406725152,
+       "top_10000": 0.8147734230297098
      },
-     "hapax_count": 296246,
-     "hapax_ratio": 0.6727985846624833,
-     "total_documents": 15095
+     "hapax_count": 213403,
+     "hapax_ratio": 0.6184321487462,
+     "total_documents": 14773
    }
  }
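
A sketch of recomputing the summary statistics from the vocabulary table, with the `word`/`frequency` column names as assumptions. Note that `hapax_count` (213,403) exceeds `vocabulary_size` (131,668), which suggests the stored table is filtered to frequency ≥ 2 while `hapax_ratio` is taken against all types: 213,403 / (131,668 + 213,403) ≈ 0.6184, matching the metadata:

```python
import pandas as pd

# Assumed schema: "word" and "frequency" columns.
vocab = pd.read_parquet("models/vocabulary/dag_vocabulary.parquet")
total_tokens = vocab["frequency"].sum()

print("stored types:", len(vocab))  # 131,668 expected
print("top-100 coverage:",
      vocab["frequency"].nlargest(100).sum() / total_tokens)  # ~0.304
```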
models/word_markov/dag_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2f8098ae8deaca8d79c06abc15a50a7517d52588c5c5f08a1f6a662c91e1ed8c
- size 21241437
+ oid sha256:f0e0f6a3232922b668c546d7e25e65d33a440e4451d4066dfab8999e9c34c44e
+ size 19122863

models/word_markov/dag_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 1,
    "variant": "word",
    "language": "dag",
-   "unique_contexts": 440521,
-   "total_transitions": 9137164
+   "unique_contexts": 344988,
+   "total_transitions": 5959753
  }

models/word_markov/dag_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0b9ae3f14904e83c67214891ce4a5fc0e08e02185ba34e98c7c699e62759917d
- size 50283891
+ oid sha256:57303e729f982b046fc9eb4582b38507579236a925f3ed865a2128ee0d272a33
+ size 44774701

models/word_markov/dag_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 2,
    "variant": "word",
    "language": "dag",
-   "unique_contexts": 2180092,
-   "total_transitions": 9122070
+   "unique_contexts": 2189455,
+   "total_transitions": 5944980
  }

models/word_markov/dag_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c33e13f6ecdcab9ecdfdb177aae0bcfa10561d97c94703f794569223fdf05503
- size 78972243
+ oid sha256:d22d602e9027a1d31cd1922128d6d422e4c5c0e54de81e6993efd6a05ad22bd3
+ size 67340899

models/word_markov/dag_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 3,
    "variant": "word",
    "language": "dag",
-   "unique_contexts": 4328117,
-   "total_transitions": 9106976
+   "unique_contexts": 3779471,
+   "total_transitions": 5930207
  }

models/word_markov/dag_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2f9d66bf01d7d81cbb020388e788098cddbd001073b91bfa7c3174907bff6809
- size 101189795
+ oid sha256:bcbd9360cb6e46014cadbbe6342e0b8005a4dc62771e553bd840366f8338c1b2
+ size 81870321

models/word_markov/dag_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "context_size": 4,
    "variant": "word",
    "language": "dag",
-   "unique_contexts": 5753766,
-   "total_transitions": 9091884
+   "unique_contexts": 4582569,
+   "total_transitions": 5915434
  }
models/word_ngram/dag_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2f90b4781fcb918be75d01d6c48f05d91a2ab2648ea71dcc044a1abc74abb9b5
- size 2425808
+ oid sha256:b04b4c6ef97df3395fe513ce28fd2e87ccbe15b7f8b340de5981ee3aa29f89a2
+ size 1834714

models/word_ngram/dag_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 2,
    "variant": "word",
    "language": "dag",
-   "unique_ngrams": 188164,
-   "total_ngrams": 9137164
+   "unique_ngrams": 135454,
+   "total_ngrams": 5959753
  }

models/word_ngram/dag_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c9294c1403b148eba8a2f3113a3bc3d04e557dedd3d271356f88c8b21eae2bc4
- size 4921395
+ oid sha256:53ba8887357e49d3f7d6d5f4d361ceb8f336b1b16453f3b3da35aacc14dec364
+ size 3008918

models/word_ngram/dag_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 3,
    "variant": "word",
    "language": "dag",
-   "unique_ngrams": 362741,
-   "total_ngrams": 9122070
+   "unique_ngrams": 205054,
+   "total_ngrams": 5944980
  }

models/word_ngram/dag_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:49868e930c1db710141544ca286252fd76d98cf8e8f5b2b36a174c9d3558ebc4
- size 9639784
+ oid sha256:0e06f732ceadc11ee51492683813edd3a47794739ba95332e12cf5762ffab692
+ size 5901184

models/word_ngram/dag_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
    "n": 4,
    "variant": "word",
    "language": "dag",
-   "unique_ngrams": 665471,
-   "total_ngrams": 9106976
+   "unique_ngrams": 377494,
+   "total_ngrams": 5930207
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (old)
  • SHA256: 86f12df5044d57a3b4b42582cf2879655ffe9ecdab7032833e376b1da1e84846
  • Pointer size: 131 Bytes
  • Size of remote file: 146 kB

Git LFS Details (new)
  • SHA256: f4e14e16fd8481de8ab62cfcae092c8606f5b93b08887d40b919d20af5cb75ee
  • Pointer size: 131 Bytes
  • Size of remote file: 149 kB

visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED