omarkamali committed on Jan 3

Commit b0be5c4 (verified) · 1 Parent(s): 77a2691

Upload all models and assets for bm (latest)

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +1 -0
README.md +208 -169
models/embeddings/aligned/bm_128d.bin +3 -0
models/embeddings/aligned/bm_128d.meta.json +1 -0
models/embeddings/aligned/bm_128d.projection.npy +3 -0
models/embeddings/aligned/bm_128d_metadata.json +8 -0
models/embeddings/aligned/bm_32d.bin +3 -0
models/embeddings/aligned/bm_32d.meta.json +1 -0
models/embeddings/aligned/bm_32d.projection.npy +3 -0
models/embeddings/aligned/bm_32d_metadata.json +8 -0
models/embeddings/aligned/bm_64d.bin +3 -0
models/embeddings/aligned/bm_64d.meta.json +1 -0
models/embeddings/aligned/bm_64d.projection.npy +3 -0
models/embeddings/aligned/bm_64d_metadata.json +8 -0
models/embeddings/monolingual/bm_128d.bin +2 -2
models/embeddings/monolingual/bm_128d_metadata.json +1 -1
models/embeddings/monolingual/bm_32d.bin +2 -2
models/embeddings/monolingual/bm_32d_metadata.json +1 -1
models/embeddings/monolingual/bm_64d.bin +2 -2
models/embeddings/monolingual/bm_64d_metadata.json +1 -1
models/subword_markov/bm_markov_ctx1_subword.parquet +2 -2
models/subword_markov/bm_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/bm_markov_ctx2_subword.parquet +2 -2
models/subword_markov/bm_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/bm_markov_ctx3_subword.parquet +2 -2
models/subword_markov/bm_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/bm_markov_ctx4_subword.parquet +2 -2
models/subword_markov/bm_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/bm_2gram_subword.parquet +2 -2
models/subword_ngram/bm_2gram_subword_metadata.json +2 -2
models/subword_ngram/bm_3gram_subword.parquet +2 -2
models/subword_ngram/bm_3gram_subword_metadata.json +2 -2
models/subword_ngram/bm_4gram_subword.parquet +2 -2
models/subword_ngram/bm_4gram_subword_metadata.json +2 -2
models/subword_ngram/bm_5gram_subword.parquet +3 -0
models/subword_ngram/bm_5gram_subword_metadata.json +7 -0
models/tokenizer/bm_tokenizer_16k.model +2 -2
models/tokenizer/bm_tokenizer_16k.vocab +0 -0
models/tokenizer/bm_tokenizer_32k.model +2 -2
models/tokenizer/bm_tokenizer_32k.vocab +0 -0
models/tokenizer/bm_tokenizer_8k.model +2 -2
models/tokenizer/bm_tokenizer_8k.vocab +0 -0
models/vocabulary/bm_vocabulary.parquet +2 -2
models/vocabulary/bm_vocabulary_metadata.json +9 -9
models/word_markov/bm_markov_ctx1_word.parquet +2 -2
models/word_markov/bm_markov_ctx1_word_metadata.json +2 -2
models/word_markov/bm_markov_ctx2_word.parquet +2 -2
models/word_markov/bm_markov_ctx2_word_metadata.json +2 -2
models/word_markov/bm_markov_ctx3_word.parquet +2 -2
models/word_markov/bm_markov_ctx3_word_metadata.json +2 -2

.gitattributes CHANGED

@@ -39,3 +39,4 @@ visualizations/position_encoding_comparison.png filter=lfs diff=lfs merge=lfs -t
 visualizations/tsne_sentences.png filter=lfs diff=lfs merge=lfs -text
 visualizations/tsne_words.png filter=lfs diff=lfs merge=lfs -text
 visualizations/zipf_law.png filter=lfs diff=lfs merge=lfs -text
+visualizations/embedding_tsne_multilingual.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 language: bm
-language_name: BM
 language_family: atlantic_other
 tags:
   - wikilangs
@@ -10,11 +10,21 @@ tags:
   - n-gram
   - markov
   - wikipedia
   - monolingual
   - family-atlantic_other
 license: mit
 library_name: wikilangs
-pipeline_tag: feature-extraction
 datasets:
   - omarkamali/wikipedia-monthly
 dataset_info:
@@ -23,20 +33,20 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 4.016
   - name: best_isotropy
     type: isotropy
-    value: 0.2668
   - name: vocabulary_size
     type: vocab
     value: 0
 generated: 2026-01-03
 ---
-# BM - Wikilangs Models
 ## Comprehensive Research Report & Full Ablation Study
-This repository contains NLP models trained and evaluated by Wikilangs, specifically on **BM** Wikipedia data.
 We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
 ## 📋 Repository Contents
@@ -60,7 +70,7 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
 - [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -80,43 +90,43 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.547x | 3.56 | 1.4000% | 104,860 |
-| **16k** | 3.831x | 3.84 | 1.5119% | 97,096 |
-| **32k** | 4.016x 🏆 | 4.03 | 1.5853% | 92,603 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `Denver ye Amerika ka Kelenyalen Jamanaw ka dugu ye. ka Kelenyalen Jamanaw ka dug...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁den ver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+6 more)` | 16 |
-| 16k | `▁den ver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye ... (+6 more)` | 16 |
-| 32k | `▁denver ▁ye ▁amerika ▁ka ▁kelenyalen ▁jamanaw ▁ka ▁dugu ▁ye . ... (+5 more)` | 15 |
-**Sample 2:** `Dakar ye Senegali faamadugu ye. A be Atlantiki kɔgɔji da la. thumb|Dakar-Indépen...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁dakar ▁ye ▁senegali ▁faama dugu ▁ye . ▁a ▁be ▁atlantiki ... (+19 more)` | 29 |
-| 16k | `▁dakar ▁ye ▁senegali ▁faamadugu ▁ye . ▁a ▁be ▁atlantiki ▁kɔgɔji ... (+12 more)` | 22 |
-| 32k | `▁dakar ▁ye ▁senegali ▁faamadugu ▁ye . ▁a ▁be ▁atlantiki ▁kɔgɔji ... (+11 more)` | 21 |
-**Sample 3:** `MugukɔnkɔnBailleul, Charles. Dictionnaire français-bambara. Bamako: Éditions Don...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
-| 16k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
-| 32k | `▁mugu kɔnkɔnbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
 ### Key Findings
-- **Best Compression:** 32k achieves 4.016x compression
-- **Lowest UNK Rate:** 8k with 1.4000% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -133,12 +143,14 @@ Below are sample sentences tokenized with each vocabulary size:
 | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
 |--------|---------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | Word | 923 | 9.85 | 2,067 | 40.5% | 82.4% |
-| **2-gram** | Subword | 272 🏆 | 8.09 | 1,826 | 67.7% | 98.7% |
-| **3-gram** | Word | 775 | 9.60 | 2,207 | 44.0% | 78.6% |
-| **3-gram** | Subword | 1,884 | 10.88 | 9,873 | 29.9% | 74.7% |
-| **4-gram** | Word | 2,048 | 11.00 | 5,635 | 33.1% | 51.1% |
-| **4-gram** | Subword | 8,105 | 12.98 | 35,658 | 14.6% | 47.0% |
 ### Top 5 N-grams by Size
@@ -146,68 +158,88 @@ Below are sample sentences tokenized with each vocabulary size:
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `ka dugu` | 526 |
-| 2 | `charles dictionnaire` | 419 |
-| 3 | `dictionnaire français` | 419 |
-| 4 | `français bambara` | 419 |
-| 5 | `bamako éditions` | 419 |
 **3-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `bamako éditions donniya` | 419 |
-| 2 | `éditions donniya isbn` | 419 |
-| 3 | `dictionnaire français bambara` | 419 |
 | 4 | `bambara bamako éditions` | 419 |
-| 5 | `charles dictionnaire français` | 419 |
 **4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `dictionnaire français bambara bamako` | 419 |
-| 2 | `bamako éditions donniya isbn` | 419 |
-| 3 | `bambara bamako éditions donniya` | 419 |
-| 4 | `français bambara bamako éditions` | 419 |
 | 5 | `charles dictionnaire français bambara` | 419 |
 **2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `a _` | 23,595 |
-| 2 | `_ k` | 13,733 |
-| 3 | `a n` | 13,570 |
-| 4 | `n _` | 12,447 |
-| 5 | `i _` | 9,856 |
 **3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `_ k a` | 6,371 |
-| 2 | `k a _` | 4,967 |
-| 3 | `_ y e` | 4,581 |
-| 4 | `a n _` | 4,011 |
-| 5 | `n i _` | 3,940 |
 **4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `_ k a _` | 4,307 |
-| 2 | `_ y e _` | 3,197 |
-| 3 | `_ b ɛ _` | 1,818 |
-| 4 | `_ n i _` | 1,809 |
-| 5 | `_ m i n` | 1,781 |
 ### Key Findings
-- **Best Perplexity:** 2-gram (subword) with 272
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
-- **Coverage:** Top-1000 patterns cover ~47% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ---
@@ -223,14 +255,14 @@ Below are sample sentences tokenized with each vocabulary size:
 | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
 |---------|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | Word | 0.5956 | 1.511 | 3.32 | 17,657 | 40.4% |
-| **1** | Subword | 1.1635 | 2.240 | 8.41 | 480 | 0.0% |
-| **2** | Word | 0.2005 | 1.149 | 1.41 | 58,338 | 80.0% |
-| **2** | Subword | 0.9884 | 1.984 | 5.02 | 4,032 | 1.2% |
-| **3** | Word | 0.0636 | 1.045 | 1.10 | 81,761 | 93.6% |
-| **3** | Subword | 0.7361 | 1.666 | 3.15 | 20,227 | 26.4% |
-| **4** | Word | 0.0198 🏆 | 1.014 | 1.03 | 89,097 | 98.0% |
-| **4** | Subword | 0.5022 | 1.416 | 2.09 | 63,561 | 49.8% |
 ### Generated Text Samples (Word-based)
@@ -238,27 +270,27 @@ Below are text samples generated from each word-based Markov chain model:
 **Context Size 1:**
-1. `ka fasojamana ye yɛrɛmahɔrɔnya jamanaw ka bi cɛ bawo wariko gɛlɛya wɛrɛ fɛ iko ala dɔnbali`
-2. `ye kumajago senw tigɛli ninakili dɔnni don kɔsa in municipality of south africa art solo exhibition`
-3. `a bonya ye masala jagodon kalanbolo kɔnɔ k ɲ 26 ma peninsula mara la amadu ni`
 **Context Size 2:**
-1. `éditions donniya isbn sababou kɔkan sirilanw lutrinae`
-2. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw donkey`
-3. `bamako éditions donniya isbn sababou kɔkan sirilanw lepus`
 **Context Size 3:**
-1. `dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
-2. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw hyaenidae link wikiquote en hye...`
-3. `éditions donniya isbn sababou kɔkan sirilanw hippotragus equinus`
 **Context Size 4:**
 1. `bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
-2. `bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
-3. `charles dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw herpestes ...`
 ### Generated Text Samples (Subword-based)
@@ -267,34 +299,34 @@ Below are text samples generated from each subword-based Markov chain model:
 **Context Size 1:**
-1. `_beu_yin_samesi_`
-2. `akoni_swan'bɛn'u`
-3. `n_kɔn_a-s_koon,_`
 **Context Size 2:**
-1. `a_ba_ya._bɛ,_marr`
-2. `_k'a_ye_ka_sɔra_y`
-3. `ana-as_duguru,_mi`
 **Context Size 3:**
-1. `_ka_min_sababou_kɛ`
-2. `ka_dumuniorussin_t`
-3. `_ye_jamand_reviese`
 **Context Size 4:**
-1. `_ka_so_kɔnɔ_milleul`
-2. `_ye_siby_sidenw_ka_`
-3. `_bɛ_lajɛ_kilɛ_mali_`
 ### Key Findings
 - **Best Predictability:** Context-4 (word) with 98.0% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (63,561 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -310,64 +342,64 @@ Below are text samples generated from each subword-based Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 6,895 |
-| Total Tokens | 95,713 |
-| Mean Frequency | 13.88 |
 | Median Frequency | 3 |
-| Frequency Std Dev | 106.18 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | ye | 4,391 |
-| 2 | ka | 4,364 |
-| 3 | a | 3,308 |
-| 4 | la | 1,918 |
-| 5 | ni | 1,903 |
-| 6 | bɛ | 1,828 |
-| 7 | na | 1,625 |
-| 8 | min | 1,195 |
-| 9 | o | 1,160 |
-| 10 | ani | 1,074 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | diverse | 2 |
-| 2 | cryptography | 2 |
-| 3 | career | 2 |
-| 4 | this | 2 |
-| 5 | corp | 2 |
-| 6 | strathspey | 2 |
-| 7 | holdings | 2 |
-| 8 | firm | 2 |
-| 9 | allergan | 2 |
-| 10 | hybe | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 1.0043 |
-| R² (Goodness of Fit) | 0.984602 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 52.1% |
-| Top 1,000 | 79.1% |
-| Top 5,000 | 96.0% |
 | Top 10,000 | 0.0% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9846 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 52.1% of corpus
-- **Long Tail:** -3,105 words needed for remaining 100.0% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -383,37 +415,40 @@ Below are text samples generated from each subword-based Markov chain model:
 ### 5.1 Cross-Lingual Alignment
-> *Note: Multilingual alignment visualization not available for this language.*
 ### 5.2 Model Comparison
 | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
 |-------|-----------|----------|------------------|---------------|----------------|
-| **mono_32d** | 32 | 0.2668 🏆 | 0.5000 | N/A | N/A |
-| **mono_64d** | 64 | 0.0657 | 0.5219 | N/A | N/A |
-| **mono_128d** | 128 | 0.0127 | 0.4839 | N/A | N/A |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.2668 (more uniform distribution)
-- **Semantic Density:** Average pairwise similarity of 0.5020. Lower values indicate better semantic separation.
-- **Alignment Quality:** No aligned models evaluated in this run.
 - **Recommendation:** 128d aligned for best cross-lingual performance
 ---
 ## 6.  Morphological Analysis (Experimental)
-> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
 This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
 ### 6.1 Productivity & Complexity
 | Metric | Value | Interpretation | Recommendation |
 |--------|-------|----------------|----------------|
-| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
-| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
 ### 6.2 Affix Inventory (Productive Units)
@@ -422,13 +457,14 @@ These are the most productive prefixes and suffixes identified by sampling the v
 #### Productive Prefixes
 | Prefix | Examples |
 |--------|----------|
-| `-ma` | maracogo, maraka, macron |
 #### Productive Suffixes
 | Suffix | Examples |
 |--------|----------|
-| `-a` | dɔnnikɛla, zanbia, miriya |
-| `-an` | balansan, abubuwan, 15nan |
 ### 6.3 Bound Stems (Lexical Roots)
@@ -436,18 +472,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
 | Stem | Cohesion | Substitutability | Examples |
 |------|----------|------------------|----------|
-| `alan` | 1.65x | 24 contexts | kalan, palan, balan |
-| `riya` | 1.81x | 11 contexts | suriya, miriya, sariya |
-| `alen` | 1.45x | 20 contexts | falen, dalen, jalen |
-| `aara` | 1.72x | 12 contexts | maara, taara, yaara |
-| `aman` | 1.31x | 25 contexts | daman, saman, faman |
-| `elen` | 1.65x | 12 contexts | yelen, kelen, selen |
-| `ɔgɔn` | 1.75x | 9 contexts | nɔgɔn, ɲɔgɔn, nyɔgɔn |
-| `ɛbɛn` | 1.82x | 8 contexts | sɛbɛn, sɛbɛnw, sɛbɛnni |
-| `anka` | 1.51x | 13 contexts | mankan, yankan, dankan |
-| `amin` | 1.51x | 13 contexts | damina, daminè, damine |
-| `nkan` | 1.35x | 14 contexts | bɛnkan, benkan, mankan |
-| `kili` | 1.49x | 10 contexts | hakili, kilisi, nkiliki |
 ### 6.4 Affix Compatibility (Co-occurrence)
@@ -455,8 +491,9 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
 | Prefix | Suffix | Frequency | Examples |
 |--------|--------|-----------|----------|
-| `-ma` | `-a` | 22 words | maa, maara |
-| `-ma` | `-an` | 8 words | masasigilan, mankaan |
 ### 6.5 Recursive Morpheme Segmentation
@@ -464,26 +501,28 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
 | Word | Suggested Split | Confidence | Stem |
 |------|-----------------|------------|------|
 | maninkakan | **`ma-ninkak-an`** | 3.0 | `ninkak` |
-| tlasabanan | **`tlasab-an-an`** | 3.0 | `tlasab` |
-| machecoul | **`ma-checoul`** | 1.5 | `checoul` |
-| marabolow | **`ma-rabolow`** | 1.5 | `rabolow` |
-| woroduguyanfan | **`woroduguyanf-an`** | 1.5 | `woroduguyanf` |
-| binkannikɛlan | **`binkannikɛl-an`** | 1.5 | `binkannikɛl` |
-| masakɛmuso | **`ma-sakɛmuso`** | 1.5 | `sakɛmuso` |
-| sɛnɛfɔkan | **`sɛnɛfɔk-an`** | 1.5 | `sɛnɛfɔk` |
-| ispanyikan | **`ispanyik-an`** | 1.5 | `ispanyik` |
-| tubabukan | **`tubabuk-an`** | 1.5 | `tubabuk` |
-| maramafen | **`ma-ramafen`** | 1.5 | `ramafen` |
-| balikukalan | **`balikukal-an`** | 1.5 | `balikukal` |
-| ukrayinakan | **`ukrayinak-an`** | 1.5 | `ukrayinak` |
-| marisikalo | **`ma-risikalo`** | 1.5 | `risikalo` |
-| matarafali | **`ma-tarafali`** | 1.5 | `tarafali` |
 ### 6.6 Linguistic Interpretation
 > **Automated Insight:**
-The language BM appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
 ---
 ## 7. Summary & Recommendations
@@ -495,7 +534,7 @@ The language BM appears to be more isolating or has a highly fixed vocabulary. W
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
 | Tokenizer | **32k BPE** | Best compression (4.02x) |
-| N-gram | **2-gram** | Lowest perplexity (272) |
 | Markov | **Context-4** | Highest predictability (98.0%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
@@ -710,4 +749,4 @@ MIT License - Free for academic and commercial use.
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2026-01-03 07:27:32*

 ---
 language: bm
+language_name: Bambara
 language_family: atlantic_other
 tags:
   - wikilangs
   - n-gram
   - markov
   - wikipedia
+  - feature-extraction
+  - sentence-similarity
+  - tokenization
+  - n-grams
+  - markov-chain
+  - text-mining
+  - fasttext
+  - babelvec
+  - vocabulous
+  - vocabulary
   - monolingual
   - family-atlantic_other
 license: mit
 library_name: wikilangs
+pipeline_tag: text-generation
 datasets:
   - omarkamali/wikipedia-monthly
 dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 4.018
   - name: best_isotropy
     type: isotropy
+    value: 0.3203
   - name: vocabulary_size
     type: vocab
     value: 0
 generated: 2026-01-03
 ---
+# Bambara - Wikilangs Models
 ## Comprehensive Research Report & Full Ablation Study
+This repository contains NLP models trained and evaluated by Wikilangs, specifically on **Bambara** Wikipedia data.
 We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and word embeddings.
 ## 📋 Repository Contents
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6--morphological-analysis-experimental)
 - [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 3.554x | 3.56 | 1.4079% | 103,986 |
+| **16k** | 3.839x | 3.85 | 1.5205% | 96,281 |
+| **32k** | 4.018x 🏆 | 4.03 | 1.5915% | 91,989 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `TusyɛninBailleul, Charles. Dictionnaire français-bambara. Bamako: Éditions Donni...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁tu syɛn inbailleul , ▁charles . ▁dictionnaire ▁français - bambara ... (+8 more)` | 18 |
+| 16k | `▁tusyɛn inbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
+| 32k | `▁tusyɛn inbailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
+**Sample 2:** `Brains ye Faransi ka dugu ye. Dugumogo be taa jon yooro Sababou Kɔfɛ sira Brains...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁brains ▁ye ▁faransi ▁ka ▁dugu ▁ye . ▁dugumogo ▁be ▁taa ... (+10 more)` | 20 |
+| 16k | `▁brains ▁ye ▁faransi ▁ka ▁dugu ▁ye . ▁dugumogo ▁be ▁taa ... (+10 more)` | 20 |
+| 32k | `▁brains ▁ye ▁faransi ▁ka ▁dugu ▁ye . ▁dugumogo ▁be ▁taa ... (+10 more)` | 20 |
+**Sample 3:** `KolanfuBailleul, Charles. Dictionnaire français-bambara. Bamako: Éditions Donniy...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁kolan fu bailleul , ▁charles . ▁dictionnaire ▁français - bambara ... (+8 more)` | 18 |
+| 16k | `▁kolan fubailleul , ▁charles . ▁dictionnaire ▁français - bambara . ... (+7 more)` | 17 |
+| 32k | `▁kolanfubailleul , ▁charles . ▁dictionnaire ▁français - bambara . ▁bamako ... (+6 more)` | 16 |
 ### Key Findings
+- **Best Compression:** 32k achieves 4.018x compression
+- **Lowest UNK Rate:** 8k with 1.4079% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
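
The compression figures in the updated tokenizer table look consistent with a "text length per token" reading: multiplying each ratio by its token count gives nearly the same corpus size for all three vocabularies. The sketch below checks that arithmetic. It assumes, without confirmation from the report, that compression is defined as evaluation-text length divided by token count; the variable names are illustrative only.

```python
# Rows from the updated tokenizer table: vocab -> (compression ratio, total tokens).
rows = {
    "8k": (3.554, 103_986),
    "16k": (3.839, 96_281),
    "32k": (4.018, 91_989),
}

# Assumption (not stated in the report): compression = text length / token count,
# so ratio * tokens should recover the same evaluation-text length for every row.
implied_length = {vocab: ratio * tokens for vocab, (ratio, tokens) in rows.items()}

for vocab, length in implied_length.items():
    print(f"{vocab}: ~{length:,.0f} units of text")

# A small spread means the three rows describe the same underlying text.
spread = max(implied_length.values()) - min(implied_length.values())
print(f"spread across vocabularies: {spread:,.0f}")
```

If the assumption holds, the three tokenizers were evaluated on the same text of roughly 370k units, and the ratio differences come entirely from the token counts.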
 | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
 |--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 917 | 9.84 | 2,056 | 40.6% | 82.5% |
+| **2-gram** | Subword | 271 🏆 | 8.08 | 1,816 | 67.8% | 98.7% |
+| **3-gram** | Word | 757 | 9.56 | 2,167 | 44.4% | 79.2% |
+| **3-gram** | Subword | 1,867 | 10.87 | 9,795 | 30.1% | 75.0% |
+| **4-gram** | Word | 1,888 | 10.88 | 5,346 | 34.2% | 52.7% |
+| **4-gram** | Subword | 7,991 | 12.96 | 35,277 | 14.7% | 47.2% |
+| **5-gram** | Word | 1,411 | 10.46 | 4,196 | 36.6% | 54.4% |
+| **5-gram** | Subword | 17,676 | 14.11 | 58,257 | 10.4% | 34.3% |
 ### Top 5 N-grams by Size
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `ka dugu` | 524 |
+| 2 | `éditions donniya` | 419 |
+| 3 | `bambara bamako` | 419 |
+| 4 | `charles dictionnaire` | 419 |
+| 5 | `français bambara` | 419 |
 **3-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `dictionnaire français bambara` | 419 |
+| 2 | `charles dictionnaire français` | 419 |
+| 3 | `français bambara bamako` | 419 |
 | 4 | `bambara bamako éditions` | 419 |
+| 5 | `éditions donniya isbn` | 419 |
 **4-grams (Word):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `bamako éditions donniya isbn` | 419 |
+| 2 | `bambara bamako éditions donniya` | 419 |
+| 3 | `français bambara bamako éditions` | 419 |
+| 4 | `dictionnaire français bambara bamako` | 419 |
 | 5 | `charles dictionnaire français bambara` | 419 |
+**5-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `bambara bamako éditions donniya isbn` | 419 |
+| 2 | `charles dictionnaire français bambara bamako` | 419 |
+| 3 | `dictionnaire français bambara bamako éditions` | 419 |
+| 4 | `français bambara bamako éditions donniya` | 419 |
+| 5 | `bamako éditions donniya isbn sababou` | 415 |
 **2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `a _` | 23,457 |
+| 2 | `_ k` | 13,682 |
+| 3 | `a n` | 13,488 |
+| 4 | `n _` | 12,358 |
+| 5 | `i _` | 9,793 |
 **3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `_ k a` | 6,339 |
+| 2 | `k a _` | 4,941 |
+| 3 | `_ y e` | 4,556 |
+| 4 | `a n _` | 3,990 |
+| 5 | `n i _` | 3,929 |
 **4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `_ k a _` | 4,284 |
+| 2 | `_ y e _` | 3,187 |
+| 3 | `_ b ɛ _` | 1,824 |
+| 4 | `_ n i _` | 1,804 |
+| 5 | `_ m i n` | 1,782 |
+**5-grams (Subword):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `a m a n a` | 1,291 |
+| 2 | `_ d u g u` | 1,271 |
+| 3 | `_ m i n _` | 1,168 |
+| 4 | `j a m a n` | 1,146 |
+| 5 | `a _ k a _` | 1,065 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 271
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+- **Coverage:** Top-1000 patterns cover ~34% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
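
The Perplexity and Entropy columns in the updated n-gram table appear to be linked by the usual identity, perplexity = 2^entropy with entropy in bits. The following sketch verifies this against the reported rows. Reading the entropy unit as bits is an assumption; small mismatches are expected because the table rounds entropy to two decimals.

```python
# (entropy, reported perplexity) pairs copied from the updated n-gram table.
reported = {
    ("2-gram", "word"): (9.84, 917),
    ("2-gram", "subword"): (8.08, 271),
    ("3-gram", "word"): (9.56, 757),
    ("3-gram", "subword"): (10.87, 1867),
    ("4-gram", "word"): (10.88, 1888),
    ("4-gram", "subword"): (12.96, 7991),
    ("5-gram", "word"): (10.46, 1411),
    ("5-gram", "subword"): (14.11, 17676),
}

# Assumption: entropy is measured in bits, so perplexity = 2 ** entropy.
relative_errors = {}
for key, (entropy_bits, perplexity) in reported.items():
    implied = 2 ** entropy_bits
    relative_errors[key] = abs(implied - perplexity) / perplexity
    print(f"{key[0]:7s} {key[1]:8s} implied={implied:10.1f} reported={perplexity}")

print(f"largest relative error: {max(relative_errors.values()):.2%}")
```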
 ---
 | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
 |---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.5962 | 1.512 | 3.33 | 17,463 | 40.4% |
+| **1** | Subword | 1.1592 | 2.233 | 8.34 | 482 | 0.0% |
+| **2** | Word | 0.2012 | 1.150 | 1.41 | 57,826 | 79.9% |
+| **2** | Subword | 0.9871 | 1.982 | 5.02 | 4,012 | 1.3% |
+| **3** | Word | 0.0638 | 1.045 | 1.10 | 81,186 | 93.6% |
+| **3** | Subword | 0.7347 | 1.664 | 3.14 | 20,106 | 26.5% |
+| **4** | Word | 0.0198 🏆 | 1.014 | 1.03 | 88,526 | 98.0% |
+| **4** | Subword | 0.5000 | 1.414 | 2.08 | 63,024 | 50.0% |
 ### Generated Text Samples (Word-based)
 **Context Size 1:**
+1. `ka dugu ye ɲ ŋ ɔ ɲ ka k u la litwanie duchy belebele naninan ye`
+2. `ye kan kaan kankan mali duo dɔnkilidalaw ye balikukalan ni faransi ka bɔ pretoria tɔgɔ ta`
+3. `a ka kɛ mɔgɔ nɛrɛmaw ye nga u ko majigilenya majigin kɔrɔtalenba ala kelenpe ani san`
 **Context Size 2:**
+1. `charles dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw basshunter...`
+2. `dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw michael jackson ka...`
+3. `donniya isbn sababou kɔkan sirilanw ourebia ourebi nkolonin thryonomys swinderianus kɔɲinɛ nkansole ...`
 **Context Size 3:**
+1. `bambara bamako éditions donniya isbn sababou kɔkan sirilanw herpestes ichneumon`
+2. `éditions donniya isbn sababou kɔkan sirilanw leptailurus serval`
+3. `bamako éditions donniya isbn sababou dutafilm`
 **Context Size 4:**
 1. `bambara bamako éditions donniya isbn sababou kɔkan sirilanw tragelaphus spekii`
+2. `dictionnaire français bambara bamako éditions donniya isbn sababou kɔkan sirilanw mungos mungo`
+3. `français bambara bamako éditions donniya isbn sababou kɔkan sirilanw papio anubis`
 ### Generated Text Samples (Subword-based)
 **Context Size 1:**
+1. `_t_edo_ba_faainɛ`
+2. `afoghmanọ_ne,_ji`
+3. `nyerayedambòrɔnk`
 **Context Size 2:**
+1. `a_aniyala:_zara._`
+2. `_kara_baridalatɔn`
+3. `anginkun_walf-c._`
 **Context Size 3:**
+1. `_kan_fila-jɔnjɛ_ye`
+2. `ka_san_na_ka_kɔrɔl`
+3. `_ye_dugu._virgia,_`
 **Context Size 4:**
+1. `_ka_ɲa._shiya_gossy`
+2. `_ye_danmasen_baara_`
+3. `_bɛ_daɲε_minnu_bɛ_a`
 ### Key Findings
 - **Best Predictability:** Context-4 (word) with 98.0% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (63,024 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
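
To make the branching-factor and memory trade-off findings concrete, here is a minimal word-level Markov chain sketch. It is not the Wikilangs pipeline's implementation; `build_chain` and `branching_factor` are hypothetical helpers, the token sequence is a toy example rather than a real sentence, and defining branching factor as the mean number of distinct successors per context is an assumption.

```python
from collections import Counter, defaultdict


def build_chain(tokens, context_size):
    """Map each context (tuple of `context_size` tokens) to a Counter of next tokens."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i : i + context_size])
        chain[context][tokens[i + context_size]] += 1
    return chain


def branching_factor(chain):
    """Assumed definition: mean number of distinct successors per context."""
    return sum(len(successors) for successors in chain.values()) / len(chain)


# Toy token sequence built from frequent words in the report (illustrative only).
tokens = "ka dugu ye ka dugu la ka so ye".split()

chain1 = build_chain(tokens, 1)
chain2 = build_chain(tokens, 2)

print(len(chain1), branching_factor(chain1))
print(len(chain2), branching_factor(chain2))
```

Even on this toy input, the longer context stores more distinct contexts while each context has fewer possible successors, which is the same pattern the table shows between context sizes 1 and 4.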
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 6,824 |
+| Total Tokens | 94,926 |
+| Mean Frequency | 13.91 |
 | Median Frequency | 3 |
+| Frequency Std Dev | 106.26 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | ye | 4,371 |
+| 2 | ka | 4,340 |
+| 3 | a | 3,278 |
+| 4 | la | 1,926 |
+| 5 | ni | 1,899 |
+| 6 | bɛ | 1,834 |
+| 7 | na | 1,623 |
+| 8 | min | 1,189 |
+| 9 | o | 1,149 |
+| 10 | ani | 1,076 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | abubakari | 2 |
+| 2 | candaces | 2 |
+| 3 | ameniras | 2 |
+| 4 | kandasi | 2 |
+| 5 | qore | 2 |
+| 6 | candace | 2 |
+| 7 | amɔn | 2 |
+| 8 | bajiw | 2 |
+| 9 | dunbagaw | 2 |
+| 10 | mouvement | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 1.0058 |
+| R² (Goodness of Fit) | 0.984137 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 52.4% |
+| Top 1,000 | 79.3% |
+| Top 5,000 | 96.2% |
 | Top 10,000 | 0.0% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9841 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 52.4% of corpus
+- **Long Tail:** -3,176 words needed for remaining 100.0% coverage
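
Two of the vocabulary figures can be reproduced from the table with simple arithmetic, which also suggests why the "Long Tail" value is negative. The sketch below is an inference from the reported numbers, not a description of the pipeline: it assumes the long-tail figure is the vocabulary size minus the largest coverage cutoff (10,000), which exceeds the vocabulary itself.

```python
# Values copied from the updated vocabulary table.
vocabulary_size = 6_824
total_tokens = 94_926

# Mean frequency is total tokens divided by the number of distinct words.
mean_frequency = total_tokens / vocabulary_size
print(f"mean frequency: {mean_frequency:.2f}")

# Assumption: "Long Tail" = vocabulary size - largest coverage cutoff.
# The cutoff (10,000) is larger than the vocabulary, so the result is negative.
largest_cutoff = 10_000
long_tail = vocabulary_size - largest_cutoff
print(f"long tail: {long_tail:,}")
```

Under this reading, the `Top 10,000 | 0.0%` row and the negative long-tail count both stem from a cutoff larger than the 6,824-word vocabulary rather than from a property of the corpus.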
 ---
 ## 5. Word Embeddings Evaluation
 ### 5.1 Cross-Lingual Alignment
+![Alignment Quality](visualizations/embedding_alignment_quality.png)
+![Multilingual t-SNE](visualizations/embedding_tsne_multilingual.png)
 ### 5.2 Model Comparison
 | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
 |-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.3203 🏆 | 0.5260 | N/A | N/A |
+| **mono_64d** | 64 | 0.0572 | 0.5107 | N/A | N/A |
+| **mono_128d** | 128 | 0.0109 | 0.5108 | N/A | N/A |
+| **aligned_32d** | 32 | 0.3203 | 0.5505 | 0.0040 | 0.0600 |
+| **aligned_64d** | 64 | 0.0572 | 0.5015 | 0.0300 | 0.1740 |
+| **aligned_128d** | 128 | 0.0109 | 0.5061 | 0.0400 | 0.1700 |
 ### Key Findings
+- **Best Isotropy:** mono_32d with 0.3203 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.5176. Lower values indicate better semantic separation.
+- **Alignment Quality:** Aligned models achieve up to 4.0% R@1 in cross-lingual retrieval.
 - **Recommendation:** 128d aligned for best cross-lingual performance
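
The summary figures in these findings can be recomputed from the model comparison table. The sketch below assumes the reported "average pairwise similarity" is the plain mean of the Semantic Density column across all six models, and that "up to 4.0% R@1" is the maximum over the aligned rows; both readings are inferred from the numbers rather than stated in the report.

```python
# Rows copied from the updated model comparison table:
# model -> (semantic density, alignment R@1, alignment R@10); None means N/A.
models = {
    "mono_32d": (0.5260, None, None),
    "mono_64d": (0.5107, None, None),
    "mono_128d": (0.5108, None, None),
    "aligned_32d": (0.5505, 0.0040, 0.0600),
    "aligned_64d": (0.5015, 0.0300, 0.1740),
    "aligned_128d": (0.5061, 0.0400, 0.1700),
}

# Assumption: the headline density is the unweighted mean over all models.
mean_density = sum(density for density, _, _ in models.values()) / len(models)
print(f"mean semantic density: {mean_density:.4f}")

# Only aligned models have retrieval scores.
aligned = {name: row for name, row in models.items() if row[1] is not None}
best_r1 = max(aligned, key=lambda name: aligned[name][1])
best_r10 = max(aligned, key=lambda name: aligned[name][2])
print(f"best R@1:  {best_r1} ({aligned[best_r1][1]:.1%})")
print(f"best R@10: {best_r10} ({aligned[best_r10][2]:.1%})")
```

Printing both retrieval metrics is deliberate: the best model by R@1 and the best by R@10 are not the same row in this table, which is worth keeping in mind when reading the 128d recommendation.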
 ---
 ## 6.  Morphological Analysis (Experimental)
 This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
 ### 6.1 Productivity & Complexity
 | Metric | Value | Interpretation | Recommendation |
 |--------|-------|----------------|----------------|
+| Productivity Index | **5.000** | High morphological productivity | Reliable analysis |
+| Idiomaticity Gap | **0.589** | High formulaic/idiomatic content | - |
 ### 6.2 Affix Inventory (Productive Units)
 #### Productive Prefixes
 | Prefix | Examples |
 |--------|----------|
+| `-ma` | masurunyala, mansaya, magana |
 #### Productive Suffixes
 | Suffix | Examples |
 |--------|----------|
+| `-a` | cɛnimusoya, fa, masurunyala |
+| `-an` | jigilan, dilan, irisikan |
+| `-en` | pen, tobilen, maliden |
 ### 6.3 Bound Stems (Lexical Roots)
 | Stem | Cohesion | Substitutability | Examples |
 |------|----------|------------------|----------|
+| `alan` | 1.63x | 24 contexts | balan, kalan, jalan |
+| `aman` | 1.32x | 25 contexts | daman, baman, saman |
+| `riya` | 1.72x | 11 contexts | miriya, sariya, suriya |
+| `aara` | 1.66x | 12 contexts | naara, yaara, taara |
+| `alen` | 1.36x | 20 contexts | salen, nalen, dalen |
+| `ɔgɔn` | 1.72x | 10 contexts | ɲɔgɔn, nɔgɔn, dɔgɔn |
+| `anka` | 1.52x | 13 contexts | yankan, kankan, dankan |
+| `elen` | 1.56x | 12 contexts | selen, kelen, yelen |
+| `amin` | 1.42x | 15 contexts | lamini, damina, daminè |
+| `ɛbɛn` | 1.74x | 8 contexts | sɛbɛn, sɛbɛnw, sɛbɛnni |
+| `nkan` | 1.37x | 14 contexts | yankan, kankan, benkan |
+| `ilan` | 1.33x | 13 contexts | tilan, dilan, filan |
 ### 6.4 Affix Compatibility (Co-occurrence)
 | Prefix | Suffix | Frequency | Examples |
 |--------|--------|-----------|----------|
+| `ma-` | `-a` | 20 words | mansamara, masa |
+| `ma-` | `-an` | 8 words | manyan, man |
+| `ma-` | `-en` | 5 words | maralen, madonnen |
 ### 6.5 Recursive Morpheme Segmentation
 | Word | Suggested Split | Confidence | Stem |
 |------|-----------------|------------|------|
+| datugunen | **`datugun-en`** | 4.5 | `datugun` |
+| masurunya | **`ma-surunya`** | 4.5 | `surunya` |
 | maninkakan | **`ma-ninkak-an`** | 3.0 | `ninkak` |
+| masafugulan | **`ma-safugul-an`** | 3.0 | `safugul` |
+| mandenkan | **`ma-ndenk-an`** | 3.0 | `ndenk` |
+| wolonwulanan | **`wolonwul-an-an`** | 3.0 | `wolonwul` |
+| maramafen | **`ma-ramaf-en`** | 3.0 | `ramaf` |
+| kɔrɔnyanfan | **`kɔrɔnyanf-an`** | 1.5 | `kɔrɔnyanf` |
+| tamashiyen | **`tamashiy-en`** | 1.5 | `tamashiy` |
+| quotidien | **`quotidi-en`** | 1.5 | `quotidi` |
+| bolofaran | **`bolofar-an`** | 1.5 | `bolofar` |
+| marcusenius | **`ma-rcusenius`** | 1.5 | `rcusenius` |
+| manuskrip | **`ma-nuskrip`** | 1.5 | `nuskrip` |
+| sεbεnnisen | **`sεbεnnis-en`** | 1.5 | `sεbεnnis` |
+| kɔnɔntɔnnan | **`kɔnɔntɔnn-an`** | 1.5 | `kɔnɔntɔnn` |
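
To make the idea of recursive segmentation concrete, here is a small sketch that strips one prefix and then repeatedly strips suffixes from the affix inventory in section 6.2. It is an illustration only: the pipeline's actual algorithm and confidence scoring are not shown in this report, and the function name `segment` and the `min_stem` threshold are assumptions. It will not reproduce every row of the table (for example, it would also strip `-a` from `masurunya`).

```python
def segment(word, prefixes=("ma",), suffixes=("an", "en", "a"), min_stem=3):
    """Greedy affix stripping; returns a list of morpheme candidates.

    Assumptions (hypothetical): at most one prefix is removed, suffixes are
    removed longest-first until none applies, and the stem must keep at
    least `min_stem` characters.
    """
    before, after = [], []
    stem = word

    # Strip at most one prefix, trying longer prefixes first.
    for p in sorted(prefixes, key=len, reverse=True):
        if stem.startswith(p) and len(stem) - len(p) >= min_stem:
            before.append(p)
            stem = stem[len(p):]
            break

    # Repeatedly strip suffixes until no suffix matches.
    stripped = True
    while stripped:
        stripped = False
        for s in sorted(suffixes, key=len, reverse=True):
            if stem.endswith(s) and len(stem) - len(s) >= min_stem:
                after.insert(0, s)
                stem = stem[:-len(s)]
                stripped = True
                break

    return before + [stem] + after


print("-".join(segment("maninkakan")))
print("-".join(segment("wolonwulanan")))
print("-".join(segment("datugunen")))
```

A purely greedy rule like this over-segments loanwords and proper names, which is consistent with the low-confidence rows in the table.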
 ### 6.6 Linguistic Interpretation
 > **Automated Insight:**
+The language Bambara shows high morphological productivity. The subword models are significantly more efficient than word models, suggesting a rich system of affixation or compounding.
+> **Note on Idiomaticity:** The high Idiomaticity Gap suggests a large number of frequent multi-word expressions or formulaic sequences that are statistically distinct from their component parts.
 ---
 ## 7. Summary & Recommendations
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
 | Tokenizer | **32k BPE** | Best compression (4.02x) |
+| N-gram | **2-gram** | Lowest perplexity (271) |
 | Markov | **Context-4** | Highest predictability (98.0%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 19:12:39*

models/embeddings/aligned/bm_128d.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:58448d51ab382ca0ebfbcd3f220a49d26c7b9fa1b85588af518145e83499917d
+size 1026298973
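
The hunks for binary assets in this commit are Git LFS pointer files: three `key value` lines giving the spec version, the object hash, and the size in bytes. A minimal sketch for reading one follows; the helper name `parse_lfs_pointer` and the returned dictionary keys are hypothetical.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a small dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:58448d51ab382ca0ebfbcd3f220a49d26c7b9fa1b85588af518145e83499917d
size 1026298973
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

Two pointers with the same `oid` refer to byte-identical objects, so comparing digests is a cheap way to see whether two paths in a commit share one stored file.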

models/embeddings/aligned/bm_128d.meta.json ADDED

	@@ -0,0 +1 @@


+{"lang": "bm", "dim": 128, "max_seq_len": 512, "is_aligned": true}

models/embeddings/aligned/bm_128d.projection.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:537a082df27db263c23ab55ffa9562d3deb5c372f9ca3c0d3b1c83df7b696158
+size 65664
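
The sizes of the three `.projection.npy` files in this commit (4224, 16512 and 65664 bytes) are consistent with a square `dim × dim` matrix of 4-byte floats preceded by a 128-byte `.npy` header. This is an inference from the byte counts alone, not something the commit states; the function name `expected_npy_size` and its defaults are assumptions.

```python
def expected_npy_size(dim, itemsize=4, header_bytes=128):
    """Size in bytes of a dim x dim array saved as .npy.

    Assumptions: 4-byte elements (e.g. float32) and a 128-byte header,
    which is what small arrays typically get from numpy.save.
    """
    return header_bytes + dim * dim * itemsize


# Observed sizes, copied from the LFS pointers in this commit.
observed = {32: 4224, 64: 16512, 128: 65664}
matches = {dim: expected_npy_size(dim) == size for dim, size in observed.items()}
print(matches)
```

If the assumption holds, each projection is a single square linear map applied to the monolingual vectors of the same dimension.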

models/embeddings/aligned/bm_128d_metadata.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "language": "bm",
+  "dimension": 128,
+  "version": "aligned",
+  "hub_language": "en",
+  "seed_vocab_size": 739,
+  "vocab_size": 2211
+}

models/embeddings/aligned/bm_32d.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:925a13bc3b0473844ae316fcc3342a46c1627719eb32c56d1a0abc795eb01cb2
+size 256600925

models/embeddings/aligned/bm_32d.meta.json ADDED

	@@ -0,0 +1 @@


+{"lang": "bm", "dim": 32, "max_seq_len": 512, "is_aligned": true}

models/embeddings/aligned/bm_32d.projection.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8b9f7117865e9ac73a569578ea9dfa480830002f849875fc069bea2d15a22bd3
+size 4224

models/embeddings/aligned/bm_32d_metadata.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "language": "bm",
+  "dimension": 32,
+  "version": "aligned",
+  "hub_language": "en",
+  "seed_vocab_size": 739,
+  "vocab_size": 2211
+}

models/embeddings/aligned/bm_64d.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ab5cb58b897509925ccf3e5cf677b1bb64f5568e61f34882b9d6a34e63cdd934
+size 513166941

models/embeddings/aligned/bm_64d.meta.json ADDED

	@@ -0,0 +1 @@


+{"lang": "bm", "dim": 64, "max_seq_len": 512, "is_aligned": true}

models/embeddings/aligned/bm_64d.projection.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f3cdb010affcbcd5046ec29ecf090f17b30da405062ded6bf50b324efcc5e904
+size 16512

models/embeddings/aligned/bm_64d_metadata.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "language": "bm",
+  "dimension": 64,
+  "version": "aligned",
+  "hub_language": "en",
+  "seed_vocab_size": 739,
+  "vocab_size": 2211
+}

models/embeddings/monolingual/bm_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:60554973b7e758378671bf2709b7707495d4d05f0f6d5aa74b560379597f5337
-size 1026320797

 version https://git-lfs.github.com/spec/v1
+oid sha256:58448d51ab382ca0ebfbcd3f220a49d26c7b9fa1b85588af518145e83499917d
+size 1026298973

models/embeddings/monolingual/bm_128d_metadata.json CHANGED

@@ -11,5 +11,5 @@
     "encoding_method": "rope",
     "dim": 128
   },
-  "vocab_size": 2232
 }

     "encoding_method": "rope",
     "dim": 128
   },
+  "vocab_size": 2211
 }

models/embeddings/monolingual/bm_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d397f946ef2ae3aef026fefc210ae2f287c5d21b70e2355dd7bf1e9ad36afa1c
-size 256606621

 version https://git-lfs.github.com/spec/v1
+oid sha256:925a13bc3b0473844ae316fcc3342a46c1627719eb32c56d1a0abc795eb01cb2
+size 256600925

models/embeddings/monolingual/bm_32d_metadata.json CHANGED

@@ -11,5 +11,5 @@
     "encoding_method": "rope",
     "dim": 32
   },
-  "vocab_size": 2232
 }

     "encoding_method": "rope",
     "dim": 32
   },
+  "vocab_size": 2211
 }

models/embeddings/monolingual/bm_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:59a556264dfd6bffcee58e216ac1d731f3135622dfe1812992c391a2191779d2
-size 513178013

 version https://git-lfs.github.com/spec/v1
+oid sha256:ab5cb58b897509925ccf3e5cf677b1bb64f5568e61f34882b9d6a34e63cdd934
+size 513166941

models/embeddings/monolingual/bm_64d_metadata.json CHANGED

@@ -11,5 +11,5 @@
     "encoding_method": "rope",
     "dim": 64
   },
-  "vocab_size": 2232
 }

     "encoding_method": "rope",
     "dim": 64
   },
+  "vocab_size": 2211
 }

models/subword_markov/bm_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:75665f8001ac1c98685e9eac5bd7af0ed39fc626141b1fd4cef1b034c92d438d
-size 35393

 version https://git-lfs.github.com/spec/v1
+oid sha256:08c3d1c0cfb6c2bb97f6837020f3259399e530a88ff856728681fbb8d3ff56d9
+size 35318

models/subword_markov/bm_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "bm",
-  "unique_contexts": 480,
-  "total_transitions": 596087
 }

   "context_size": 1,
   "variant": "subword",
   "language": "bm",
+  "unique_contexts": 482,
+  "total_transitions": 590687
 }

models/subword_markov/bm_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:89493aa503c873924d3cc26fe8b12683b9ea51449a0ff9dd9a1825aa759bcfbf
-size 160110

 version https://git-lfs.github.com/spec/v1
+oid sha256:ca9a7abe9f07de6208eb552e0ba1894c76235f0ef70fd5b0ed45db3ec3700cb9
+size 150844

models/subword_markov/bm_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "bm",
-  "unique_contexts": 4032,
-  "total_transitions": 594884
 }

   "context_size": 2,
   "variant": "subword",
   "language": "bm",
+  "unique_contexts": 4012,
+  "total_transitions": 589489
 }

models/subword_markov/bm_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c678a21c70cd7e4a8724cc5c8781d9a005a7b1153d5cfab9672b1e53bfdb3c44
-size 480692

 version https://git-lfs.github.com/spec/v1
+oid sha256:4aa64b21db2df0542dc829b94de6eb9d164b3023e89b748f35c4999b7b1098bc
+size 470965

models/subword_markov/bm_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "bm",
-  "unique_contexts": 20227,
-  "total_transitions": 593681
 }

   "context_size": 3,
   "variant": "subword",
   "language": "bm",
+  "unique_contexts": 20106,
+  "total_transitions": 588291
 }

models/subword_markov/bm_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d36df5ddcd8f68677e8aeeb781950fca4c726d9b697ac0720d4fb6f5eb94a5a2
-size 1088116

 version https://git-lfs.github.com/spec/v1
+oid sha256:34978b61bed9c8dc1c236aa67e5f082032f0c17ee63bdf03658889b150fb8e88
+size 1087996

models/subword_markov/bm_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "bm",
-  "unique_contexts": 63561,
-  "total_transitions": 592478
 }

   "context_size": 4,
   "variant": "subword",
   "language": "bm",
+  "unique_contexts": 63024,
+  "total_transitions": 587093
 }

models/subword_ngram/bm_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fe3125074954dbce7e5c347fb5b53e5f30cec7775dfd663157c67fbdc9e255bc
-size 23022

 version https://git-lfs.github.com/spec/v1
+oid sha256:3e2d0e5aee6af2f4c269db63b8789197db63a50d5631832b13905c01b4e0e4a9
+size 22904

models/subword_ngram/bm_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "bm",
-  "unique_ngrams": 1826,
-  "total_ngrams": 596087
 }

   "n": 2,
   "variant": "subword",
   "language": "bm",
+  "unique_ngrams": 1816,
+  "total_ngrams": 590687
 }

models/subword_ngram/bm_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c97c794d987cb73cd6c3301a1516b8e4889869836d3d9c5369016ba8fe9c0fb7
-size 113452

 version https://git-lfs.github.com/spec/v1
+oid sha256:e6f254928fe1d62371dc1b327938b02cf68794d7dd646557753a78e01db6561a
+size 112599

models/subword_ngram/bm_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "bm",
-  "unique_ngrams": 9873,
-  "total_ngrams": 594884
 }

   "n": 3,
   "variant": "subword",
   "language": "bm",
+  "unique_ngrams": 9795,
+  "total_ngrams": 589489
 }

models/subword_ngram/bm_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dbf91bd19db052de26308667adfa16bf3463044ec8463971cda61775082b87c4
-size 431891

 version https://git-lfs.github.com/spec/v1
+oid sha256:64fd74137d8669a3e57b4ef2eb14f145de682467bbca2fdb589aaf02a651dbcb
+size 426205

models/subword_ngram/bm_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "bm",
-  "unique_ngrams": 35658,
-  "total_ngrams": 593681
 }

   "n": 4,
   "variant": "subword",
   "language": "bm",
+  "unique_ngrams": 35277,
+  "total_ngrams": 588291
 }

models/subword_ngram/bm_5gram_subword.parquet ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3ed53b4ae74500b388d7e13bbd9d41554d0b5614df69bc09b83535548c718895
+size 687482

models/subword_ngram/bm_5gram_subword_metadata.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "n": 5,
+  "variant": "subword",
+  "language": "bm",
+  "unique_ngrams": 58257,
+  "total_ngrams": 587093
+}
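
The updated `total_ngrams` values for n = 2 to 5 fall by the same amount at each step. A document of length L contributes L − n + 1 n-grams, so raising n by one removes one n-gram per document, provided every document has at least n tokens. The sketch below checks that reading against the numbers in this commit; treating the step as the document count is an inference, and `ngram_count` is a hypothetical helper.

```python
def ngram_count(doc_len, n):
    """Number of n-grams in a sequence of doc_len tokens (no padding)."""
    return max(doc_len - n + 1, 0)


# total_ngrams from the updated subword n-gram metadata in this commit.
totals = {2: 590687, 3: 589489, 4: 588291, 5: 587093}

# total_documents from the updated vocabulary metadata in this commit.
total_documents = 1198

# Drop in total n-grams each time n grows by one.
steps = [totals[n] - totals[n + 1] for n in (2, 3, 4)]
print(steps)
print(all(step == total_documents for step in steps))
```

The same constant step appears in the `total_transitions` values of the Markov metadata, which suggests those models count transitions over the same document set.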

models/tokenizer/bm_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:55b76e9844ea83477a53ca0eacddd2c822efad2b886d04d9e9ba8ce2715dfacf
-size 514654

 version https://git-lfs.github.com/spec/v1
+oid sha256:8f8ee08e80fd3a5c0da33dbe7268421131519af5a28bf9bf0a5fed058d143671
+size 514750

models/tokenizer/bm_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/bm_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6a3490f95bc0ae3183c66cee9b8b34a693f115b059216203609f99d8409129f6
-size 764320

 version https://git-lfs.github.com/spec/v1
+oid sha256:332bf29cfc8743794ff50aad54a372d93620e608eaf0214c5bb45ad6695f6799
+size 763411

models/tokenizer/bm_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/bm_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9361cd8391a5d749408299b263a632521ad4a28af1108df637b4399ef06a96bd
-size 372567

 version https://git-lfs.github.com/spec/v1
+oid sha256:1819b9d6793890963233509c462bce03f0d87db38af045546bd5d76145bd9e47
+size 373134

models/tokenizer/bm_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/bm_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:95cadef0ad07f1fcd065af0e35aa2cf7c22454a9e8a9d71f454288dae5ddf109
-size 110665

 version https://git-lfs.github.com/spec/v1
+oid sha256:a681c60c585b99f1d1a43c9995f3716c7e58e7ef5364debdf52569f2bcae0815
+size 110142

models/vocabulary/bm_vocabulary_metadata.json CHANGED

@@ -1,17 +1,17 @@
 {
   "language": "bm",
-  "vocabulary_size": 6895,
   "variant": "full",
   "statistics": {
-    "type_token_ratio": 0.16638822668143338,
     "coverage": {
-      "top_100": 0.4682859985358437,
-      "top_1000": 0.7107728117432845,
-      "top_5000": 0.8627541155932649,
-      "top_10000": 0.9274679481163066
     },
-    "hapax_count": 10833,
-    "hapax_ratio": 0.611067238267148,
-    "total_documents": 1203
   }
 }

 {
   "language": "bm",
+  "vocabulary_size": 6824,
   "variant": "full",
   "statistics": {
+    "type_token_ratio": 0.1659850808436518,
     "coverage": {
+      "top_100": 0.47053087962437046,
+      "top_1000": 0.7129482373433299,
+      "top_5000": 0.8640804271271157,
+      "top_10000": 0.9286796167973039
     },
+    "hapax_count": 10710,
+    "hapax_ratio": 0.6108132770617086,
+    "total_documents": 1198
   }
 }
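
The updated vocabulary statistics can be cross-checked with simple arithmetic. The sketch below assumes, without confirmation from the pipeline, that `vocabulary_size` excludes hapax legomena, that `hapax_ratio` is hapaxes divided by all word types, and that the token count equals the context-1 word `total_transitions` plus one per document. All variable names are illustrative.

```python
# Values copied from the updated metadata in this commit.
vocabulary_size = 6824
hapax_count = 10710
total_documents = 1198
ctx1_word_transitions = 104438

# Assumption: all word types = retained vocabulary + hapaxes.
total_types = vocabulary_size + hapax_count

# Assumption: each document adds one token beyond its transitions.
total_tokens = ctx1_word_transitions + total_documents

hapax_ratio = hapax_count / total_types
type_token_ratio = total_types / total_tokens

print(total_types, total_tokens)
print(round(hapax_ratio, 6), round(type_token_ratio, 6))
```

Both ratios agree with the reported `hapax_ratio` and `type_token_ratio` to well within rounding, which supports these readings of the fields without proving them.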

models/word_markov/bm_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:789e3b5dbec8a4180554670cf6c8140274009e31544768738d599464c4f97eaf
-size 539344

 version https://git-lfs.github.com/spec/v1
+oid sha256:f54fb3270ccd0272f96abf6151adee510f418c74ad313a7b5d594ab2bf5c3c68
+size 533279

models/word_markov/bm_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "bm",
-  "unique_contexts": 17657,
-  "total_transitions": 105343
 }

   "context_size": 1,
   "variant": "word",
   "language": "bm",
+  "unique_contexts": 17463,
+  "total_transitions": 104438
 }

models/word_markov/bm_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:887a80e826ca473a6ab2e880642feffde94dc8b35a379b2e1de27ffb35ee82e6
-size 1068878

 version https://git-lfs.github.com/spec/v1
+oid sha256:b9644d765e52f69eb371f62cfb3a575404b2f90900de92bd38ac48c82cb5615c
+size 1058528

models/word_markov/bm_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "bm",
-  "unique_contexts": 58338,
-  "total_transitions": 104140
 }

   "context_size": 2,
   "variant": "word",
   "language": "bm",
+  "unique_contexts": 57826,
+  "total_transitions": 103240
 }

models/word_markov/bm_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4ac6a4a0d3b0dfb35d2dc573351f32014393c9e3b47f7a32c3ea6caa6ee45484
-size 1381560

 version https://git-lfs.github.com/spec/v1
+oid sha256:364f460bbdd6c1f094c6f5d7b15d04002609dde043b07349a4f97360352acf5d
+size 1370583

models/word_markov/bm_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "bm",
-  "unique_contexts": 81761,
-  "total_transitions": 102937
 }

   "context_size": 3,
   "variant": "word",
   "language": "bm",
+  "unique_contexts": 81186,
+  "total_transitions": 102042
 }