omarkamali committed
Commit 77a2691 · verified · Parent: bae74c6

Upload all models and assets for bm (20251001)

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. README.md +290 -132
  2. models/embeddings/monolingual/bm_128d.bin +2 -2
  3. models/embeddings/monolingual/bm_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/bm_32d.bin +2 -2
  5. models/embeddings/monolingual/bm_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/bm_64d.bin +2 -2
  7. models/embeddings/monolingual/bm_64d_metadata.json +5 -3
  8. models/subword_markov/bm_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/bm_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/bm_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/bm_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/bm_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/bm_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/bm_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/bm_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/bm_2gram_subword.parquet +2 -2
  17. models/subword_ngram/bm_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/bm_3gram_subword.parquet +2 -2
  19. models/subword_ngram/bm_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/bm_4gram_subword.parquet +2 -2
  21. models/subword_ngram/bm_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/bm_tokenizer_16k.model +2 -2
  23. models/tokenizer/bm_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/bm_tokenizer_32k.model +2 -2
  25. models/tokenizer/bm_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/bm_tokenizer_8k.model +2 -2
  27. models/tokenizer/bm_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/bm_vocabulary.parquet +2 -2
  29. models/vocabulary/bm_vocabulary_metadata.json +10 -9
  30. models/word_markov/bm_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/bm_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/bm_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/bm_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/bm_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/bm_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/bm_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/bm_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/bm_2gram_word.parquet +2 -2
  39. models/word_ngram/bm_2gram_word_metadata.json +2 -2
  40. models/word_ngram/bm_3gram_word.parquet +2 -2
  41. models/word_ngram/bm_3gram_word_metadata.json +2 -2
  42. models/word_ngram/bm_4gram_word.parquet +2 -2
  43. models/word_ngram/bm_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
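For orientation before the diffs below: any of the 50 assets listed above can be pulled individually with the standard `huggingface_hub` client. A minimal sketch; the repo id is a hypothetical placeholder inferred from the project links, not something this page states:

```python
# Minimal sketch: fetch one asset from the list above.
# "wikilangs/bm" is a hypothetical repo id (not confirmed on this page);
# replace it with the actual repository before running.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="wikilangs/bm",  # assumption, adjust as needed
    filename="models/tokenizer/bm_tokenizer_32k.model",
)
print(local_path)  # cached local path of the LFS-backed file
```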
README.md CHANGED
(Old version pane: unchanged context lines plus removed lines, prefixed with -)
@@ -23,14 +23,14 @@ dataset_info:
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
- value: 4.070
27
  - name: best_isotropy
28
  type: isotropy
29
- value: 0.2244
30
  - name: vocabulary_size
31
  type: vocab
32
- value: 7195
33
- generated: 2025-12-28
34
  ---
35
 
36
  # BM - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
- - N-gram models (2, 3, 4-gram)
48
- - Markov chains (context of 1, 2, 3 and 4)
49
  - Subword N-gram and Markov chains
50
- - Embeddings in various sizes and dimensions
51
  - Language Vocabulary
52
  - Language Statistics
53
  ![Performance Dashboard](visualizations/performance_dashboard.png)
54
 
55
  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
59
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
60
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
61
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
62
- - [6. Summary & Recommendations](#6-summary--recommendations)
63
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
64
  - [Visualizations Index](#visualizations-index)
65
 
@@ -68,52 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
68
 
69
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
70

71
  ### Results
72
 
73
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
74
  |------------|-------------|---------------|----------|--------------|
75
- | **8k** | 3.433x | 3.36 | 0.0887% | 117,303 |
76
- | **16k** | 3.752x | 3.68 | 0.0969% | 107,301 |
77
- | **32k** | 4.070x 🏆 | 3.99 | 0.1051% | 98,934 |
78
 
79
  ### Tokenization Examples
80
 
81
  Below are sample sentences tokenized with each vocabulary size:
82
 
83
- **Sample 1:** `Los Angeles ye Amerika ka Kelenyalen Jamanaw ka dugu ye.
84
-
85
-
86
-
87
- Catégorie:Amerika ka...`
88
 
89
  | Vocab | Tokens | Count |
90
  |-------|--------|-------|
91
- | 8k | `▁los ▁angeles ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+9 more)` | 19 |
92
- | 16k | `▁los ▁angeles ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+9 more)` | 19 |
93
- | 32k | `▁losangeles ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+9 more)` | 19 |
94
 
95
- **Sample 2:** `Kunkolosɛmɛ baganw ani mogow ka kunkolo kɔnɔ yɛ. Mogow miri kunkolo kɔnɔ.
96
- ...`
97
 
98
  | Vocab | Tokens | Count |
99
  |-------|--------|-------|
100
- | 8k | `▁kunkolo sɛ mɛ baganwanimogowkakunkolokɔnɔ ... (+14 more)` | 24 |
101
- | 16k | `▁kunkolo sɛmɛ baganwanimogowkakunkolokɔnɔ ... (+11 more)` | 21 |
102
- | 32k | `▁kunkolosɛmɛbaganwanimogowkakunkolokɔnɔ . ... (+9 more)` | 19 |
103
 
104
- **Sample 3:** `TaliBailleul, Charles. 2008. Dictionnaire français-bambara. Bamako: Éditions Do...`
105
 
106
  | Vocab | Tokens | Count |
107
  |-------|--------|-------|
108
- | 8k | `▁tali bailleul , ▁charles . ▁ 2 0 0 8 ... (+31 more)` | 41 |
109
- | 16k | `▁tali bailleul , ▁charles . ▁ 2 0 0 8 ... (+31 more)` | 41 |
110
- | 32k | `▁talibailleul , ▁charles . ▁ 2 0 0 8 . ... (+30 more)` | 40 |
111
 
112
 
113
  ### Key Findings
114
 
115
- - **Best Compression:** 32k achieves 4.070x compression
116
- - **Lowest UNK Rate:** 8k with 0.0887% unknown tokens
117
  - **Trade-off:** Larger vocabularies improve compression but increase model size
118
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
119
 
@@ -122,55 +125,87 @@ Catégorie:Amerika ka...`
122
 
123
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
124

125
  ![N-gram Coverage](visualizations/ngram_coverage.png)
126
 
127
  ### Results
128
 
129
- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
130
- |--------|------------|---------|----------------|------------------|-------------------|
131
- | **2-gram** | 950 🏆 | 9.89 | 2,854 | 45.2% | 80.3% |
132
- | **2-gram** | 322 🏆 | 8.33 | 2,127 | 63.8% | 97.9% |
133
- | **3-gram** | 976 | 9.93 | 3,554 | 47.1% | 75.8% |
134
- | **3-gram** | 2,159 | 11.08 | 11,428 | 27.6% | 72.5% |
135
- | **4-gram** | 1,930 | 10.91 | 8,337 | 41.6% | 58.5% |
136
- | **4-gram** | 8,659 | 13.08 | 39,905 | 14.4% | 46.6% |
137
 
138
  ### Top 5 N-grams by Size
139
 
140
- **2-grams:**
141
 
142
  | Rank | N-gram | Count |
143
  |------|--------|-------|
144
- | 1 | `catégorie :` | 1,068 |
145
- | 2 | `ye .` | 822 |
146
- | 3 | `’ a` | 627 |
147
- | 4 | `ka dugu` | 577 |
148
- | 5 | `. sababou` | 571 |
149
 
150
- **3-grams:**
151
 
152
  | Rank | N-gram | Count |
153
  |------|--------|-------|
154
- | 1 | `français - bambara` | 419 |
155
- | 2 | `dictionnaire français -` | 419 |
156
- | 3 | `. dictionnaire français` | 419 |
157
- | 4 | `2008 . dictionnaire` | 419 |
158
- | 5 | `. 2008 .` | 419 |
159
 
160
- **4-grams:**
161
 
162
  | Rank | N-gram | Count |
163
  |------|--------|-------|
164
- | 1 | `. dictionnaire français -` | 419 |
165
- | 2 | `- 04 - 8` | 419 |
166
- | 3 | `français - bambara .` | 419 |
167
- | 4 | `- bambara . bamako` | 419 |
168
- | 5 | `bambara . bamako :` | 419 |
169
 
170
 
171
  ### Key Findings
172
 
173
- - **Best Perplexity:** 2-gram with 322
174
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
175
  - **Coverage:** Top-1000 patterns cover ~47% of corpus
176
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
@@ -180,55 +215,86 @@ Catégorie:Amerika ka...`
180
 
181
  ![Markov Entropy](visualizations/markov_entropy.png)
182

183
  ![Markov Branching](visualizations/markov_branching.png)
184
 
185
  ### Results
186
 
187
- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
188
- |---------|-------------|------------|------------------|-----------------|----------------|
189
- | **1** | 0.5830 | 1.498 | 3.54 | 18,421 | 41.7% |
190
- | **1** | 1.2181 | 2.326 | 9.19 | 508 | 0.0% |
191
- | **2** | 0.2211 | 1.166 | 1.48 | 65,151 | 77.9% |
192
- | **2** | 1.0235 | 2.033 | 5.17 | 4,659 | 0.0% |
193
- | **3** | 0.0800 | 1.057 | 1.13 | 96,171 | 92.0% |
194
- | **3** | 0.7286 | 1.657 | 3.07 | 24,092 | 27.1% |
195
- | **4** | 0.0309 🏆 | 1.022 | 1.04 | 108,554 | 96.9% |
196
- | **4** | 0.4796 🏆 | 1.394 | 2.01 | 73,789 | 52.0% |
197
 
198
- ### Generated Text Samples
199
 
200
- Below are text samples generated from each Markov chain model:
201
 
202
  **Context Size 1:**
203
 
204
- 1. `. farrell , kɔɔnɔ du riz africain , o wati san 1648 - wong vong ye`
205
- 2. `, i ka tɔgɔ tun te deli ka dugu tɔw kan , ni swedi , na`
206
- 3. `ye ko an bolo suguya kɛmɛ tan kɔnɔ bamakɔ dugu ye ukrainekaw ka kɛnɛya ye jamana`
207
 
208
  **Context Size 2:**
209
 
210
- 1. `catégorie : afrika catégorie : cema amerika . onu kuntilenna fɔlɔ ye mi ma yiriwa kosɛbɛ .`
211
- 2. `ye . catégorie : faransi ka dugu ye . dugumogo be taa jon yooro . gallery sababou`
212
- 3. `’ a kulɛriw kan . hadamadenw ka josariyaw ni politikitɔnw ka bɛnkansɛbɛn ye min sigira senkan c`
213
 
214
  **Context Size 3:**
215
 
216
- 1. `éditions donniya . isbn 2 - 911741 - 04 - 8 . sababou catégorie : jɛgɛ`
217
- 2. `bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8 . sababou catégorie`
218
- 3. `français - bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8 .`
219
 
220
  **Context Size 4:**
221
 
222
- 1. `, charles . 2008 . dictionnaire français - bambara . bamako : éditions donniya . isbn 2 - 911741`
223
- 2. `. dictionnaire français - bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8`
224
- 3. `français - bambara . bamako : éditions donniya . isbn 2 - 911741 - 04 - 8 . sababou`
225
 
226
 
227
  ### Key Findings
228
 
229
- - **Best Predictability:** Context-4 with 96.9% predictability
230
  - **Branching Factor:** Decreases with context size (more deterministic)
231
- - **Memory Trade-off:** Larger contexts require more storage (73,789 contexts)
232
  - **Recommendation:** Context-3 or Context-4 for text generation
233
 
234
  ---
@@ -244,64 +310,64 @@ Below are text samples generated from each Markov chain model:
244
 
245
  | Metric | Value |
246
  |--------|-------|
247
- | Vocabulary Size | 7,195 |
248
- | Total Tokens | 102,263 |
249
- | Mean Frequency | 14.21 |
250
  | Median Frequency | 3 |
251
- | Frequency Std Dev | 106.29 |
252
 
253
  ### Most Common Words
254
 
255
  | Rank | Word | Frequency |
256
  |------|------|-----------|
257
- | 1 | ye | 4,480 |
258
- | 2 | ka | 4,400 |
259
- | 3 | a | 3,281 |
260
- | 4 | la | 1,931 |
261
- | 5 | ni | 1,900 |
262
- | 6 | bɛ | 1,844 |
263
- | 7 | na | 1,626 |
264
- | 8 | min | 1,192 |
265
- | 9 | o | 1,154 |
266
- | 10 | ani | 1,079 |
267
 
268
  ### Least Common Words (from vocabulary)
269
 
270
  | Rank | Word | Frequency |
271
  |------|------|-----------|
272
- | 1 | dakon | 2 |
273
- | 2 | taamaɲogonw | 2 |
274
- | 3 | abubakari | 2 |
275
- | 4 | ameniras | 2 |
276
- | 5 | kandasi | 2 |
277
- | 6 | qore | 2 |
278
- | 7 | amɔn | 2 |
279
- | 8 | bajiw | 2 |
280
- | 9 | dunbagaw | 2 |
281
- | 10 | mouvement | 2 |
282
 
283
  ### Zipf's Law Analysis
284
 
285
  | Metric | Value |
286
  |--------|-------|
287
- | Zipf Coefficient | 1.0134 |
288
- | R² (Goodness of Fit) | 0.984519 |
289
  | Adherence Quality | **excellent** |
290
 
291
  ### Coverage Analysis
292
 
293
  | Top N Words | Coverage |
294
  |-------------|----------|
295
- | Top 100 | 52.0% |
296
- | Top 1,000 | 79.0% |
297
- | Top 5,000 | 95.7% |
298
  | Top 10,000 | 100.0% |
299
 
300
  ### Key Findings
301
 
302
- - **Zipf Compliance:** R²=0.9845 indicates excellent adherence to Zipf's law
303
- - **High Frequency Dominance:** Top 100 words cover 52.0% of corpus
304
- - **Long Tail:** -2,805 words needed for remaining 100.0% coverage
305
 
306
  ---
307
  ## 5. Word Embeddings Evaluation
@@ -314,24 +380,113 @@ Below are text samples generated from each Markov chain model:
314
 
315
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
316
 
317
- ### Model Comparison
318
 
319
- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
320
- |-------|------------|-----------|----------|----------|----------|
321
- | **mono_32d** | 2,309 | 32 | 2.998 | 0.730 | 0.2244 🏆 |
322
- | **mono_64d** | 2,309 | 64 | 2.963 | 0.687 | 0.0668 |
323
- | **mono_128d** | 2,309 | 128 | 2.969 | 0.695 | 0.0115 |
324
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
325
 
326
  ### Key Findings
327
 
328
- - **Best Isotropy:** mono_32d with 0.2244 (more uniform distribution)
329
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
330
- - **Vocabulary Coverage:** All models cover 2,309 words
331
- - **Recommendation:** 100d for balanced semantic capture and efficiency
332
 
333
  ---
334
- ## 6. Summary & Recommendations
335
 
336
  ![Performance Dashboard](visualizations/performance_dashboard.png)
337
 
@@ -339,11 +494,12 @@ Below are text samples generated from each Markov chain model:
339
 
340
  | Component | Recommended | Rationale |
341
  |-----------|-------------|-----------|
342
- | Tokenizer | **32k BPE** | Best compression (4.07x) with low UNK rate |
343
- | N-gram | **5-gram** | Lowest perplexity (322) |
344
- | Markov | **Context-4** | Highest predictability (96.9%) |
345
  | Embeddings | **100d** | Balanced semantic capture and isotropy |
346
 
 
347
  ---
348
  ## Appendix: Metrics Glossary & Interpretation Guide
349
 
@@ -533,7 +689,8 @@ If you use these models in your research, please cite:
533
  author = {Kamali, Omar},
534
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
535
  year = {2025},
536
- publisher = {HuggingFace},
537
  url = {https://huggingface.co/wikilangs}
538
  institution = {Omneity Labs}
539
  }
@@ -549,7 +706,8 @@ MIT License - Free for academic and commercial use.
549
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
550
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
551
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
552
  ---
553
  *Generated by Wikilangs Models Pipeline*
554
 
555
- *Report Date: 2025-12-28 05:28:29*

(New version pane: unchanged context lines plus added lines, prefixed with +)
23
  metrics:
24
  - name: best_compression_ratio
25
  type: compression
26
+ value: 4.016
27
  - name: best_isotropy
28
  type: isotropy
29
+ value: 0.2668
30
  - name: vocabulary_size
31
  type: vocab
32
+ value: 6895
33
+ generated: 2026-01-03
34
  ---
35
 
36
  # BM - Wikilangs Models
 
44
  ### Models & Assets
45
 
46
  - Tokenizers (8k, 16k, 32k, 64k)
47
+ - N-gram models (2, 3, 4, 5-gram)
48
+ - Markov chains (context of 1, 2, 3, 4 and 5)
49
  - Subword N-gram and Markov chains
50
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
51
  - Language Vocabulary
52
  - Language Statistics
53
+
54
  ![Performance Dashboard](visualizations/performance_dashboard.png)
55
 
56
  ### Analysis and Evaluation
 
60
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
61
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
62
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
63
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis-experimental)
64
+ - [7. Summary & Recommendations](#7-summary--recommendations)
65
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
66
  - [Visualizations Index](#visualizations-index)
67
 
 
70
 
71
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)
72
 
73
+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
74
+
75
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
76
+
77
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
78
+
79
  ### Results
80
 
81
  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
82
  |------------|-------------|---------------|----------|--------------|
83
+ | **8k** | 3.547x | 3.56 | 1.4000% | 104,860 |
84
+ | **16k** | 3.831x | 3.84 | 1.5119% | 97,096 |
85
+ | **32k** | 4.016x 🏆 | 4.03 | 1.5853% | 92,603 |
86
 
87
  ### Tokenization Examples
88
 
89
  Below are sample sentences tokenized with each vocabulary size:
90
 
91
+ **Sample 1:** `Denver ye Amerika ka Kelenyalen Jamanaw ka dugu ye. ka Kelenyalen Jamanaw ka dug...`
92
 
93
  | Vocab | Tokens | Count |
94
  |-------|--------|-------|
95
+ | 8k | `▁den ver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+6 more)` | 16 |
96
+ | 16k | `▁den ver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+6 more)` | 16 |
97
+ | 32k | `▁denver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye . ... (+5 more)` | 15 |
98
 
99
+ **Sample 2:** `Dakar ye Senegali faamadugu ye. A be Atlantiki kɔgɔji da la. thumb|Dakar-Indépen...`
 
100
 
101
  | Vocab | Tokens | Count |
102
  |-------|--------|-------|
103
+ | 8k | `▁dakaryesenegalifaama dugu ye . abeatlantiki ... (+19 more)` | 29 |
104
+ | 16k | `▁dakaryesenegalifaamaduguye . abeatlantikikɔgɔji ... (+12 more)` | 22 |
105
+ | 32k | `▁dakaryesenegalifaamaduguye . abeatlantikikɔgɔji ... (+11 more)` | 21 |
106
 
107
+ **Sample 3:** `MugukɔnkɔnBailleul, Charles. Dictionnaire français-bambara. Bamako: Éditions Don...`
108
 
109
  | Vocab | Tokens | Count |
110
  |-------|--------|-------|
111
+ | 8k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
112
+ | 16k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
113
+ | 32k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
114
 
115
 
116
  ### Key Findings
117
 
118
+ - **Best Compression:** 32k achieves 4.016x compression
119
+ - **Lowest UNK Rate:** 8k with 1.4000% unknown tokens
120
  - **Trade-off:** Larger vocabularies improve compression but increase model size
121
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
122
 
 
125
 
126
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)
127
 
128
+ ![N-gram Unique](visualizations/ngram_unique.png)
129
+
130
  ![N-gram Coverage](visualizations/ngram_coverage.png)
131
 
132
  ### Results
133
 
134
+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
135
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
136
+ | **2-gram** | Word | 923 | 9.85 | 2,067 | 40.5% | 82.4% |
137
+ | **2-gram** | Subword | 272 🏆 | 8.09 | 1,826 | 67.7% | 98.7% |
138
+ | **3-gram** | Word | 775 | 9.60 | 2,207 | 44.0% | 78.6% |
139
+ | **3-gram** | Subword | 1,884 | 10.88 | 9,873 | 29.9% | 74.7% |
140
+ | **4-gram** | Word | 2,048 | 11.00 | 5,635 | 33.1% | 51.1% |
141
+ | **4-gram** | Subword | 8,105 | 12.98 | 35,658 | 14.6% | 47.0% |
142
 
143
  ### Top 5 N-grams by Size
144
 
145
+ **2-grams (Word):**
146
+
147
+ | Rank | N-gram | Count |
148
+ |------|--------|-------|
149
+ | 1 | `ka dugu` | 526 |
150
+ | 2 | `charles dictionnaire` | 419 |
151
+ | 3 | `dictionnaire français` | 419 |
152
+ | 4 | `français bambara` | 419 |
153
+ | 5 | `bamako éditions` | 419 |
154
+
155
+ **3-grams (Word):**
156
+
157
+ | Rank | N-gram | Count |
158
+ |------|--------|-------|
159
+ | 1 | `bamako éditions donniya` | 419 |
160
+ | 2 | `éditions donniya isbn` | 419 |
161
+ | 3 | `dictionnaire français bambara` | 419 |
162
+ | 4 | `bambara bamako éditions` | 419 |
163
+ | 5 | `charles dictionnaire français` | 419 |
164
+
165
+ **4-grams (Word):**
166
 
167
  | Rank | N-gram | Count |
168
  |------|--------|-------|
169
+ | 1 | `dictionnaire français bambara bamako` | 419 |
170
+ | 2 | `bamako éditions donniya isbn` | 419 |
171
+ | 3 | `bambara bamako éditions donniya` | 419 |
172
+ | 4 | `français bambara bamako éditions` | 419 |
173
+ | 5 | `charles dictionnaire français bambara` | 419 |
174
 
175
+ **2-grams (Subword):**
176
 
177
  | Rank | N-gram | Count |
178
  |------|--------|-------|
179
+ | 1 | `a _` | 23,595 |
180
+ | 2 | `_ k` | 13,733 |
181
+ | 3 | `a n` | 13,570 |
182
+ | 4 | `n _` | 12,447 |
183
+ | 5 | `i _` | 9,856 |
184
 
185
+ **3-grams (Subword):**
186
 
187
  | Rank | N-gram | Count |
188
  |------|--------|-------|
189
+ | 1 | `_ k a` | 6,371 |
190
+ | 2 | `k a _` | 4,967 |
191
+ | 3 | `_ y e` | 4,581 |
192
+ | 4 | `a n _` | 4,011 |
193
+ | 5 | `n i _` | 3,940 |
194
+
195
+ **4-grams (Subword):**
196
+
197
+ | Rank | N-gram | Count |
198
+ |------|--------|-------|
199
+ | 1 | `_ k a _` | 4,307 |
200
+ | 2 | `_ y e _` | 3,197 |
201
+ | 3 | `_ b ɛ _` | 1,818 |
202
+ | 4 | `_ n i _` | 1,809 |
203
+ | 5 | `_ m i n` | 1,781 |
204
 
205
 
206
  ### Key Findings
207
 
208
+ - **Best Perplexity:** 2-gram (subword) with 272
209
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
210
  - **Coverage:** Top-1000 patterns cover ~47% of corpus
211
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
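To make the perplexity and entropy columns concrete, a toy sketch that recomputes them from raw counts; the parquet column names (`ngram`, `count`) are assumptions, since the schema is not shown on this page:

```python
# Toy sketch: entropy (bits) and perplexity (2^entropy) over the
# empirical n-gram distribution, plus top-k coverage as in the table
# above. Column names "ngram" and "count" are assumed, not documented.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/word_ngram/bm_2gram_word.parquet")
p = (df["count"] / df["count"].sum()).to_numpy()

entropy = -(p * np.log2(p)).sum()
print(f"entropy={entropy:.2f} bits  perplexity={2 ** entropy:,.0f}")

top = np.sort(p)[::-1]                       # most frequent patterns first
print(f"top-100: {top[:100].sum():.1%}  top-1000: {top[:1000].sum():.1%}")
```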
 
215
 
216
  ![Markov Entropy](visualizations/markov_entropy.png)
217
 
218
+ ![Markov Contexts](visualizations/markov_contexts.png)
219
+
220
  ![Markov Branching](visualizations/markov_branching.png)
221
 
222
  ### Results
223
 
224
+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
225
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
226
+ | **1** | Word | 0.5956 | 1.511 | 3.32 | 17,657 | 40.4% |
227
+ | **1** | Subword | 1.1635 | 2.240 | 8.41 | 480 | 0.0% |
228
+ | **2** | Word | 0.2005 | 1.149 | 1.41 | 58,338 | 80.0% |
229
+ | **2** | Subword | 0.9884 | 1.984 | 5.02 | 4,032 | 1.2% |
230
+ | **3** | Word | 0.0636 | 1.045 | 1.10 | 81,761 | 93.6% |
231
+ | **3** | Subword | 0.7361 | 1.666 | 3.15 | 20,227 | 26.4% |
232
+ | **4** | Word | 0.0198 🏆 | 1.014 | 1.03 | 89,097 | 98.0% |
233
+ | **4** | Subword | 0.5022 | 1.416 | 2.09 | 63,561 | 49.8% |
234
+
235
+ ### Generated Text Samples (Word-based)
236
+
237
+ Below are text samples generated from each word-based Markov chain model:
238
+
239
+ **Context Size 1:**
240
+
241
+ 1. `ka fasojamana ye yɛrɛmahɔrɔnya jamanaw ka bi cɛ bawo wariko gɛlɛya wɛrɛ fɛ iko ala dɔnbali`
242
+ 2. `ye kumajago senw tigɛli ninakili dɔnni don kɔsa in municipality of south africa art solo exhibition`
243
+ 3. `a bonya ye masala jagodon kalanbolo kɔnɔ k ɲ 26 ma peninsula mara la amadu ni`
244
+
245
+ **Context Size 2:**
246
+
247
+ 1. `éditions donniya isbn sababou kɔkan sirilanw lutrinae`
248
+ 2. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw donkey`
249
+ 3. `bamako éditions donniya isbn sababou kɔkan sirilanw lepus`
250
+
251
+ **Context Size 3:**
252
+
253
+ 1. `dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
254
+ 2. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw hyaenidae link wikiquote en hye...`
255
+ 3. `éditions donniya isbn sababou kɔkan sirilanw hippotragus equinus`
256
+
257
+ **Context Size 4:**
258
+
259
+ 1. `bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
260
+ 2. `bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
261
+ 3. `charles dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw herpestes ...`
262
 
 
263
 
264
+ ### Generated Text Samples (Subword-based)
265
+
266
+ Below are text samples generated from each subword-based Markov chain model:
267
 
268
  **Context Size 1:**
269
 
270
+ 1. `_beu_yin_samesi_`
271
+ 2. `akoni_swan'bɛn'u`
272
+ 3. `n_kɔn_a-s_koon,_`
273
 
274
  **Context Size 2:**
275
 
276
+ 1. `a_ba_ya._bɛ,_marr`
277
+ 2. `_k'a_ye_ka_sɔra_y`
278
+ 3. `ana-as_duguru,_mi`
279
 
280
  **Context Size 3:**
281
 
282
+ 1. `_ka_min_sababou_kɛ`
283
+ 2. `ka_dumuniorussin_t`
284
+ 3. `_ye_jamand_reviese`
285
 
286
  **Context Size 4:**
287
 
288
+ 1. `_ka_so_kɔnɔ_milleul`
289
+ 2. `_ye_siby_sidenw_ka_`
290
+ 3. `_bɛ_lajɛ_kilɛ_mali_`
291
 
292
 
293
  ### Key Findings
294
 
295
+ - **Best Predictability:** Context-4 (word) with 98.0% predictability
296
  - **Branching Factor:** Decreases with context size (more deterministic)
297
+ - **Memory Trade-off:** Larger contexts require more storage (63,561 contexts)
298
  - **Recommendation:** Context-3 or Context-4 for text generation
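The generated samples above come from weighted sampling over stored transitions; a toy sketch, assuming a schema with `context`, `next`, and `count` columns (not documented on this page):

```python
# Toy sketch: generate text from the context-2 word Markov model by
# weighted sampling of stored transitions. The columns "context",
# "next" and "count" are assumed; adapt to the real schema.
import pandas as pd

df = pd.read_parquet("models/word_markov/bm_markov_ctx2_word.parquet")

def generate(seed, steps=15):
    out = list(seed)
    for _ in range(steps):
        cand = df[df["context"] == " ".join(out[-2:])]
        if cand.empty:                        # unseen context, no smoothing
            break
        out.append(cand.sample(n=1, weights=cand["count"])["next"].iloc[0])
    return " ".join(out)

print(generate(["ka", "dugu"]))               # seed from the 2-gram table
```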
299
 
300
  ---
 
310
 
311
  | Metric | Value |
312
  |--------|-------|
313
+ | Vocabulary Size | 6,895 |
314
+ | Total Tokens | 95,713 |
315
+ | Mean Frequency | 13.88 |
316
  | Median Frequency | 3 |
317
+ | Frequency Std Dev | 106.18 |
318
 
319
  ### Most Common Words
320
 
321
  | Rank | Word | Frequency |
322
  |------|------|-----------|
323
+ | 1 | ye | 4,391 |
324
+ | 2 | ka | 4,364 |
325
+ | 3 | a | 3,308 |
326
+ | 4 | la | 1,918 |
327
+ | 5 | ni | 1,903 |
328
+ | 6 | bɛ | 1,828 |
329
+ | 7 | na | 1,625 |
330
+ | 8 | min | 1,195 |
331
+ | 9 | o | 1,160 |
332
+ | 10 | ani | 1,074 |
333
 
334
  ### Least Common Words (from vocabulary)
335
 
336
  | Rank | Word | Frequency |
337
  |------|------|-----------|
338
+ | 1 | diverse | 2 |
339
+ | 2 | cryptography | 2 |
340
+ | 3 | career | 2 |
341
+ | 4 | this | 2 |
342
+ | 5 | corp | 2 |
343
+ | 6 | strathspey | 2 |
344
+ | 7 | holdings | 2 |
345
+ | 8 | firm | 2 |
346
+ | 9 | allergan | 2 |
347
+ | 10 | hybe | 2 |
348
 
349
  ### Zipf's Law Analysis
350
 
351
  | Metric | Value |
352
  |--------|-------|
353
+ | Zipf Coefficient | 1.0043 |
354
+ | R² (Goodness of Fit) | 0.984602 |
355
  | Adherence Quality | **excellent** |
356
 
357
  ### Coverage Analysis
358
 
359
  | Top N Words | Coverage |
360
  |-------------|----------|
361
+ | Top 100 | 52.1% |
362
+ | Top 1,000 | 79.1% |
363
+ | Top 5,000 | 96.0% |
364
  | Top 10,000 | 0.0% |
365
 
366
  ### Key Findings
367
 
368
+ - **Zipf Compliance:** R²=0.9846 indicates excellent adherence to Zipf's law
369
+ - **High Frequency Dominance:** Top 100 words cover 52.1% of corpus
370
+ - **Long Tail:** -3,105 words needed for remaining 100.0% coverage
371
 
372
  ---
373
  ## 5. Word Embeddings Evaluation
 
380
 
381
  ![t-SNE Sentences](visualizations/tsne_sentences.png)
382

 
384
+ ### 5.1 Cross-Lingual Alignment
385
+
386
+ > *Note: Multilingual alignment visualization not available for this language.*
387
+
388
+
389
+ ### 5.2 Model Comparison
390
+
391
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
392
+ |-------|-----------|----------|------------------|---------------|----------------|
393
+ | **mono_32d** | 32 | 0.2668 🏆 | 0.5000 | N/A | N/A |
394
+ | **mono_64d** | 64 | 0.0657 | 0.5219 | N/A | N/A |
395
+ | **mono_128d** | 128 | 0.0127 | 0.4839 | N/A | N/A |
396
 
397
  ### Key Findings
398
 
399
+ - **Best Isotropy:** mono_32d with 0.2668 (more uniform distribution)
400
+ - **Semantic Density:** Average pairwise similarity of 0.5020. Lower values indicate better semantic separation.
401
+ - **Alignment Quality:** No aligned models evaluated in this run.
402
+ - **Recommendation:** 128d for best semantic capture (no aligned models were evaluated in this run)
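The page does not state which isotropy formula the pipeline uses; one common choice is the partition-function ratio of Mu & Viswanath (2018), sketched below on a random stand-in matrix together with mean pairwise cosine similarity for the semantic-density figure:

```python
# Illustrative sketch on a random stand-in matrix (loading the .bin
# embeddings is not covered on this page). Isotropy here is the
# partition-function ratio of Mu & Viswanath (2018); whether the report
# uses this exact formula is an assumption.
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((2232, 32))          # stand-in for bm_32d vectors

# isotropy ~ min_c Z(c) / max_c Z(c) over principal directions c
_, _, Vt = np.linalg.svd(E, full_matrices=False)
Z = np.exp(E @ Vt.T).sum(axis=0)
print("isotropy:", Z.min() / Z.max())

# semantic density: mean pairwise cosine similarity
En = E / np.linalg.norm(E, axis=1, keepdims=True)
sims = En @ En.T
print("density:", sims[np.triu_indices_from(sims, k=1)].mean())
```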
403
 
404
  ---
405
+ ## 6. Morphological Analysis (Experimental)
406
+
407
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
408
+
409
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
410
+
411
+ ### 6.1 Productivity & Complexity
412
+
413
+ | Metric | Value | Interpretation | Recommendation |
414
+ |--------|-------|----------------|----------------|
415
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
416
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
417
+
418
+ ### 6.2 Affix Inventory (Productive Units)
419
+
420
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
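A toy sketch of that substitutability test, scoring candidate suffixes on a tiny word sample drawn from the stem tables later in this section (prefixes would be handled symmetrically); the heuristic is illustrative, not the pipeline's actual algorithm:

```python
# Toy sketch: a suffix scores a point for every stem that takes it
# while also occurring standalone or with a different ending
# (substitutability). Illustrative heuristic only.
from collections import defaultdict

vocab = {"sɛbɛn", "sɛbɛnw", "sɛbɛnni", "kalan", "kalanw", "dugu", "duguw"}

endings = defaultdict(set)      # candidate stem -> endings ("" = standalone)
for w in vocab:
    endings[w].add("")
    for cut in range(2, len(w)):
        endings[w[:cut]].add(w[cut:])

score = defaultdict(int)
for stem, ends in endings.items():
    if len(ends) > 1:           # stem appears in more than one context
        for e in ends:
            if e:
                score[e] += 1

print(sorted(score.items(), key=lambda kv: -kv[1])[:5])
```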
421
+
422
+ #### Productive Prefixes
423
+ | Prefix | Examples |
424
+ |--------|----------|
425
+ | `ma-` | maracogo, maraka, macron |
426
+
427
+ #### Productive Suffixes
428
+ | Suffix | Examples |
429
+ |--------|----------|
430
+ | `-a` | dɔnnikɛla, zanbia, miriya |
431
+ | `-an` | balansan, abubuwan, 15nan |
432
+
433
+ ### 6.3 Bound Stems (Lexical Roots)
434
+
435
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
436
+
437
+ | Stem | Cohesion | Substitutability | Examples |
438
+ |------|----------|------------------|----------|
439
+ | `alan` | 1.65x | 24 contexts | kalan, palan, balan |
440
+ | `riya` | 1.81x | 11 contexts | suriya, miriya, sariya |
441
+ | `alen` | 1.45x | 20 contexts | falen, dalen, jalen |
442
+ | `aara` | 1.72x | 12 contexts | maara, taara, yaara |
443
+ | `aman` | 1.31x | 25 contexts | daman, saman, faman |
444
+ | `elen` | 1.65x | 12 contexts | yelen, kelen, selen |
445
+ | `ɔgɔn` | 1.75x | 9 contexts | nɔgɔn, ɲɔgɔn, nyɔgɔn |
446
+ | `ɛbɛn` | 1.82x | 8 contexts | sɛbɛn, sɛbɛnw, sɛbɛnni |
447
+ | `anka` | 1.51x | 13 contexts | mankan, yankan, dankan |
448
+ | `amin` | 1.51x | 13 contexts | damina, daminè, damine |
449
+ | `nkan` | 1.35x | 14 contexts | bɛnkan, benkan, mankan |
450
+ | `kili` | 1.49x | 10 contexts | hakili, kilisi, nkiliki |
451
+
452
+ ### 6.4 Affix Compatibility (Co-occurrence)
453
+
454
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
455
+
456
+ | Prefix | Suffix | Frequency | Examples |
457
+ |--------|--------|-----------|----------|
458
+ | `ma-` | `-a` | 22 words | maa, maara |
458
+ | `ma-` | `-an` | 8 words | masasigilan, mankaan |
460
+
461
+ ### 6.5 Recursive Morpheme Segmentation
462
+
463
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
464
+
465
+ | Word | Suggested Split | Confidence | Stem |
466
+ |------|-----------------|------------|------|
467
+ | maninkakan | **`ma-ninkak-an`** | 3.0 | `ninkak` |
468
+ | tlasabanan | **`tlasab-an-an`** | 3.0 | `tlasab` |
469
+ | machecoul | **`ma-checoul`** | 1.5 | `checoul` |
470
+ | marabolow | **`ma-rabolow`** | 1.5 | `rabolow` |
471
+ | woroduguyanfan | **`woroduguyanf-an`** | 1.5 | `woroduguyanf` |
472
+ | binkannikɛlan | **`binkannikɛl-an`** | 1.5 | `binkannikɛl` |
473
+ | masakɛmuso | **`ma-sakɛmuso`** | 1.5 | `sakɛmuso` |
474
+ | sɛnɛfɔkan | **`sɛnɛfɔk-an`** | 1.5 | `sɛnɛfɔk` |
475
+ | ispanyikan | **`ispanyik-an`** | 1.5 | `ispanyik` |
476
+ | tubabukan | **`tubabuk-an`** | 1.5 | `tubabuk` |
477
+ | maramafen | **`ma-ramafen`** | 1.5 | `ramafen` |
478
+ | balikukalan | **`balikukal-an`** | 1.5 | `balikukal` |
479
+ | ukrayinakan | **`ukrayinak-an`** | 1.5 | `ukrayinak` |
480
+ | marisikalo | **`ma-risikalo`** | 1.5 | `risikalo` |
481
+ | matarafali | **`ma-tarafali`** | 1.5 | `tarafali` |
482
+
483
+ ### 6.6 Linguistic Interpretation
484
+
485
+ > **Automated Insight:**
486
+ The language BM appears to be relatively isolating, or to have a highly fixed vocabulary: word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
487
+
488
+ ---
489
+ ## 7. Summary & Recommendations
490
 
491
  ![Performance Dashboard](visualizations/performance_dashboard.png)
492
 
 
494
 
495
  | Component | Recommended | Rationale |
496
  |-----------|-------------|-----------|
497
+ | Tokenizer | **32k BPE** | Best compression (4.02x) |
498
+ | N-gram | **2-gram** | Lowest perplexity (272) |
499
+ | Markov | **Context-4** | Highest predictability (98.0%) |
500
  | Embeddings | **128d** | Best semantic capture of the available sizes (32d has the best isotropy) |
501
 
502
+
503
  ---
504
  ## Appendix: Metrics Glossary & Interpretation Guide
505
 
 
689
  author = {Kamali, Omar},
690
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
691
  year = {2025},
692
+ doi = {10.5281/zenodo.18073153},
693
+ publisher = {Zenodo},
694
  url = {https://huggingface.co/wikilangs}
695
  institution = {Omneity Labs}
696
  }
 
706
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
707
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
708
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
709
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
710
  ---
711
  *Generated by Wikilangs Models Pipeline*
712
 
713
+ *Report Date: 2026-01-03 07:27:32*
models/embeddings/monolingual/bm_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7f0d7c8930252db4f75ccff26903504087a5598f8e89d5877efdc3838fa71e4e
3
- size 1026401129
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:60554973b7e758378671bf2709b7707495d4d05f0f6d5aa74b560379597f5337
3
+ size 1026320797
models/embeddings/monolingual/bm_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 128,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 2309
13
  }
 
3
  "dimension": 128,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 128
13
  },
14
+ "vocab_size": 2232
15
  }
models/embeddings/monolingual/bm_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ecbf3cf75d36d5786a480590916fad5d6bc30b12637de0b8e671f92799273b4e
3
- size 256627817
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d397f946ef2ae3aef026fefc210ae2f287c5d21b70e2355dd7bf1e9ad36afa1c
3
+ size 256606621
models/embeddings/monolingual/bm_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 32,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 2309
13
  }
 
3
  "dimension": 32,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 32
13
  },
14
+ "vocab_size": 2232
15
  }
models/embeddings/monolingual/bm_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:03ddc93065212141e650c80ef4273e04a2ec47432a07bd4dc33ebf690234049c
3
- size 513218921
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59a556264dfd6bffcee58e216ac1d731f3135622dfe1812992c391a2191779d2
3
+ size 513178013
models/embeddings/monolingual/bm_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
- "dim": 64,
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
- "epochs": 5
 
 
11
  },
12
- "vocab_size": 2309
13
  }
 
3
  "dimension": 64,
4
  "version": "monolingual",
5
  "training_params": {
6
+ "algorithm": "skipgram",
7
  "min_count": 5,
8
  "window": 5,
9
  "negative": 5,
10
+ "epochs": 5,
11
+ "encoding_method": "rope",
12
+ "dim": 64
13
  },
14
+ "vocab_size": 2232
15
  }
models/subword_markov/bm_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0819a99ba610dc7d0095d059a8cd0fbf0e93d96340fab94bfe408f6e78c431e4
3
- size 40112
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:75665f8001ac1c98685e9eac5bd7af0ed39fc626141b1fd4cef1b034c92d438d
3
+ size 35393
models/subword_markov/bm_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 508,
6
- "total_transitions": 649338
7
  }
 
2
  "context_size": 1,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 480,
6
+ "total_transitions": 596087
7
  }
models/subword_markov/bm_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6bb80b6a46cd1c2d4387947a64a191c8c3a0e17cb30ee46dd9853eb3f4e28c6a
3
- size 184560
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:89493aa503c873924d3cc26fe8b12683b9ea51449a0ff9dd9a1825aa759bcfbf
3
+ size 160110
models/subword_markov/bm_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 4659,
6
- "total_transitions": 648039
7
  }
 
2
  "context_size": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 4032,
6
+ "total_transitions": 594884
7
  }
models/subword_markov/bm_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2346a42c76aff72fd2b3fe02f559333cd9dda578b8ced3ba260d8cf6e2171a4e
3
- size 546787
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c678a21c70cd7e4a8724cc5c8781d9a005a7b1153d5cfab9672b1e53bfdb3c44
3
+ size 480692
models/subword_markov/bm_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 24092,
6
- "total_transitions": 646740
7
  }
 
2
  "context_size": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 20227,
6
+ "total_transitions": 593681
7
  }
models/subword_markov/bm_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a1d7257d4abefb8c32f0420d2d9fc0d913bcdb5e4772236891d7490a5264704a
3
- size 1226126
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d36df5ddcd8f68677e8aeeb781950fca4c726d9b697ac0720d4fb6f5eb94a5a2
3
+ size 1088116
models/subword_markov/bm_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_contexts": 73789,
6
- "total_transitions": 645441
7
  }
 
2
  "context_size": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_contexts": 63561,
6
+ "total_transitions": 592478
7
  }
models/subword_ngram/bm_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:06ce57f95742b61708e17e229d33f174f2dabe7ae11cfc1872a7a6182573cfcc
3
- size 26169
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fe3125074954dbce7e5c347fb5b53e5f30cec7775dfd663157c67fbdc9e255bc
3
+ size 23022
models/subword_ngram/bm_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_ngrams": 2127,
6
- "total_ngrams": 649338
7
  }
 
2
  "n": 2,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_ngrams": 1826,
6
+ "total_ngrams": 596087
7
  }
models/subword_ngram/bm_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2f6c3f360cbad86d31de6197b81525e30f7a976dc638b3ca0ae073b2a98553a6
3
- size 131173
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c97c794d987cb73cd6c3301a1516b8e4889869836d3d9c5369016ba8fe9c0fb7
3
+ size 113452
models/subword_ngram/bm_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_ngrams": 11428,
6
- "total_ngrams": 648039
7
  }
 
2
  "n": 3,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_ngrams": 9873,
6
+ "total_ngrams": 594884
7
  }
models/subword_ngram/bm_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:66a769c12a2d8a97eff8c9d92dea9868be0c798775b864d08da7c4cc28cc988f
3
- size 486277
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dbf91bd19db052de26308667adfa16bf3463044ec8463971cda61775082b87c4
3
+ size 431891
models/subword_ngram/bm_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
- "unique_ngrams": 39905,
6
- "total_ngrams": 646740
7
  }
 
2
  "n": 4,
3
  "variant": "subword",
4
  "language": "bm",
5
+ "unique_ngrams": 35658,
6
+ "total_ngrams": 593681
7
  }
models/tokenizer/bm_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:9296d04641b8a3567ec23dd58f2570e88576b52be897baa8365b0b75d62817db
3
- size 512912
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:55b76e9844ea83477a53ca0eacddd2c822efad2b886d04d9e9ba8ce2715dfacf
3
+ size 514654
models/tokenizer/bm_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bm_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5c3728738bd953c0b0980dea17440614b9fc5c1944fd6a606c0819bd83da8f70
3
- size 801851
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6a3490f95bc0ae3183c66cee9b8b34a693f115b059216203609f99d8409129f6
3
+ size 764320
models/tokenizer/bm_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bm_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b7ebf2d24f769b1286b48c8e35b7389a0e0e1303d97a28ec9d43ae1701d59f89
3
- size 371453
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9361cd8391a5d749408299b263a632521ad4a28af1108df637b4399ef06a96bd
3
+ size 372567
models/tokenizer/bm_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/bm_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3d54a1977879161f36d59485af9d41f9f5140c9cf6590d94c12addac73e06114
3
- size 116321
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95cadef0ad07f1fcd065af0e35aa2cf7c22454a9e8a9d71f454288dae5ddf109
3
+ size 110665
models/vocabulary/bm_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
1
  {
2
  "language": "bm",
3
- "vocabulary_size": 7195,
 
4
  "statistics": {
5
- "type_token_ratio": 0.16176133457950517,
6
  "coverage": {
7
- "top_100": 0.46858412541661526,
8
- "top_1000": 0.7123194667325021,
9
- "top_5000": 0.8629710617736788,
10
- "top_10000": 0.9264112014389758
11
  },
12
- "hapax_count": 11151,
13
- "hapax_ratio": 0.607816417747738,
14
- "total_documents": 1299
15
  }
16
  }
 
1
  {
2
  "language": "bm",
3
+ "vocabulary_size": 6895,
4
+ "variant": "full",
5
  "statistics": {
6
+ "type_token_ratio": 0.16638822668143338,
7
  "coverage": {
8
+ "top_100": 0.4682859985358437,
9
+ "top_1000": 0.7107728117432845,
10
+ "top_5000": 0.8627541155932649,
11
+ "top_10000": 0.9274679481163066
12
  },
13
+ "hapax_count": 10833,
14
+ "hapax_ratio": 0.611067238267148,
15
+ "total_documents": 1203
16
  }
17
  }
models/word_markov/bm_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c5771841e705a9dd2e47d794be576e0f304212cdfd165b9c86e4a3bbf40bc589
3
- size 575996
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:789e3b5dbec8a4180554670cf6c8140274009e31544768738d599464c4f97eaf
3
+ size 539344
models/word_markov/bm_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 18421,
6
- "total_transitions": 140738
7
  }
 
2
  "context_size": 1,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 17657,
6
+ "total_transitions": 105343
7
  }
models/word_markov/bm_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:75d31f3bd92ad9f4e4d1c4e921aeabc69c0202e90bc20dd9e3b9f2822ae86f7f
3
- size 1171294
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:887a80e826ca473a6ab2e880642feffde94dc8b35a379b2e1de27ffb35ee82e6
3
+ size 1068878
models/word_markov/bm_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 65151,
6
- "total_transitions": 139439
7
  }
 
2
  "context_size": 2,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 58338,
6
+ "total_transitions": 104140
7
  }
models/word_markov/bm_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3fa392bd3a195161417c5b8dfa8792082e634a6a83b57348f2a245ae4a04efba
3
- size 1584409
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4ac6a4a0d3b0dfb35d2dc573351f32014393c9e3b47f7a32c3ea6caa6ee45484
3
+ size 1381560
models/word_markov/bm_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 96171,
6
- "total_transitions": 138140
7
  }
 
2
  "context_size": 3,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 81761,
6
+ "total_transitions": 102937
7
  }
models/word_markov/bm_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fa556a96029c6bafa0afbbae6795406371cd660317dcbb46cb396c604196f52b
3
- size 1827580
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0e2957d52fc463087fedd8a81ef2b6b897a11b8fddb0b3792bd16af7db14a60c
3
+ size 1567499
models/word_markov/bm_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_contexts": 108554,
6
- "total_transitions": 136845
7
  }
 
2
  "context_size": 4,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_contexts": 89097,
6
+ "total_transitions": 101734
7
  }
models/word_ngram/bm_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7b5007949f6bbf42d027ff51fc5b8f9e4e55c91f7a05c383cb4a9d9747400c3c
3
- size 37545
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:899da362fc834616e68542ec185ca203675f5d217635520a21cbc90524ff1dae
3
+ size 28426
models/word_ngram/bm_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_ngrams": 2854,
6
- "total_ngrams": 140738
7
  }
 
2
  "n": 2,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_ngrams": 2067,
6
+ "total_ngrams": 105343
7
  }
models/word_ngram/bm_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f60674a1667559b06bf2d6d1c02809b85928a179d949721898f660c388b27feb
3
- size 52891
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:01e363967c6c38f500df23393a86d4d9f2542632f1ab459b68620240a7675f48
3
+ size 34597
models/word_ngram/bm_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_ngrams": 3554,
6
- "total_ngrams": 139439
7
  }
 
2
  "n": 3,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_ngrams": 2207,
6
+ "total_ngrams": 104140
7
  }
models/word_ngram/bm_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1e75ace9be8417983bb1c726ad403d1c0882df8c7f950705aedd826360a33ad7
3
- size 136064
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:02bbc45d87f3c4629eeb12cf6577258882d57257531b040d948037fc59c08083
3
+ size 97246
models/word_ngram/bm_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bm",
5
- "unique_ngrams": 8337,
6
- "total_ngrams": 138140
7
  }
 
2
  "n": 4,
3
  "variant": "word",
4
  "language": "bm",
5
+ "unique_ngrams": 5635,
6
+ "total_ngrams": 102937
7
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 865d3f432225d6e9fa27c3b399fbbf0f61c7c1ea14a4f190ea27238bc566bf7b
  • Pointer size: 131 Bytes
  • Size of remote file: 152 kB

Git LFS Details

  • SHA256: b8f4299d1b66858edaf5e2d0b404260fffd816d51764301a8fe36d23161bfdf2
  • Pointer size: 131 Bytes
  • Size of remote file: 152 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED