omarkamali committed (verified)
Commit 46e1abd · 1 Parent(s): eaf3f42

Upload all models and assets for bcl (20251001)

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full list.
Files changed (50):
  1. README.md +314 -139
  2. models/embeddings/monolingual/bcl_128d.bin +2 -2
  3. models/embeddings/monolingual/bcl_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/bcl_32d.bin +2 -2
  5. models/embeddings/monolingual/bcl_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/bcl_64d.bin +2 -2
  7. models/embeddings/monolingual/bcl_64d_metadata.json +5 -3
  8. models/subword_markov/bcl_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/bcl_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/bcl_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/bcl_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/bcl_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/bcl_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/bcl_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/bcl_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/bcl_2gram_subword.parquet +2 -2
  17. models/subword_ngram/bcl_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/bcl_3gram_subword.parquet +2 -2
  19. models/subword_ngram/bcl_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/bcl_4gram_subword.parquet +2 -2
  21. models/subword_ngram/bcl_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/bcl_tokenizer_16k.model +2 -2
  23. models/tokenizer/bcl_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/bcl_tokenizer_32k.model +2 -2
  25. models/tokenizer/bcl_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/bcl_tokenizer_64k.model +2 -2
  27. models/tokenizer/bcl_tokenizer_64k.vocab +0 -0
  28. models/tokenizer/bcl_tokenizer_8k.model +2 -2
  29. models/tokenizer/bcl_tokenizer_8k.vocab +0 -0
  30. models/vocabulary/bcl_vocabulary.parquet +2 -2
  31. models/vocabulary/bcl_vocabulary_metadata.json +10 -9
  32. models/word_markov/bcl_markov_ctx1_word.parquet +2 -2
  33. models/word_markov/bcl_markov_ctx1_word_metadata.json +2 -2
  34. models/word_markov/bcl_markov_ctx2_word.parquet +2 -2
  35. models/word_markov/bcl_markov_ctx2_word_metadata.json +2 -2
  36. models/word_markov/bcl_markov_ctx3_word.parquet +2 -2
  37. models/word_markov/bcl_markov_ctx3_word_metadata.json +2 -2
  38. models/word_markov/bcl_markov_ctx4_word.parquet +2 -2
  39. models/word_markov/bcl_markov_ctx4_word_metadata.json +2 -2
  40. models/word_ngram/bcl_2gram_word.parquet +2 -2
  41. models/word_ngram/bcl_2gram_word_metadata.json +2 -2
  42. models/word_ngram/bcl_3gram_word.parquet +2 -2
  43. models/word_ngram/bcl_3gram_word_metadata.json +2 -2
  44. models/word_ngram/bcl_4gram_word.parquet +2 -2
  45. models/word_ngram/bcl_4gram_word_metadata.json +2 -2
  46. visualizations/embedding_isotropy.png +0 -0
  47. visualizations/embedding_norms.png +0 -0
  48. visualizations/embedding_similarity.png +2 -2
  49. visualizations/markov_branching.png +0 -0
  50. visualizations/markov_contexts.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
  type: compression
- value: 4.640
  - name: best_isotropy
  type: isotropy
- value: 0.8200
  - name: vocabulary_size
  type: vocab
- value: 139464
- generated: 2025-12-28
  ---

  # BCL - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
- - N-gram models (2, 3, 4-gram)
- - Markov chains (context of 1, 2, 3 and 4)
  - Subword N-gram and Markov chains
- - Embeddings in various sizes and dimensions
  - Language Vocabulary
  - Language Statistics

  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- - [6. Summary & Recommendations](#6-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)

@@ -68,57 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and

  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
- | **8k** | 3.849x | 3.74 | 0.0148% | 391,873 |
- | **16k** | 4.154x | 4.04 | 0.0160% | 363,086 |
- | **32k** | 4.421x | 4.30 | 0.0170% | 341,132 |
- | **64k** | 4.640x 🏆 | 4.51 | 0.0178% | 325,066 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

- **Sample 1:** `REDIRECT An Sanduguan`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁re dire ctansand ug uan` | 7 |
- | 16k | `▁re dire ctansand ug uan` | 7 |
- | 32k | `▁re directansand uguan` | 5 |
- | 64k | `▁re directansand uguan` | 5 |

- **Sample 2:** `An sarong komyun asin banwaan sa Provincia nin Cosenza sa rehiyon Calabria kan ...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁an ▁sarong ▁komyun ▁asinbanwaansaprovincianincos enza ... (+6 more)` | 16 |
- | 16k | `▁an ▁sarong ▁komyunasinbanwaansaprovincianin ▁cosenzasa ... (+5 more)` | 15 |
- | 32k | `▁an ▁sarongkomyunasinbanwaansaprovincia ▁nincosenzasa ... (+5 more)` | 15 |
- | 64k | `▁an ▁sarongkomyunasinbanwaansaprovincia ▁nincosenzasa ... (+5 more)` | 15 |
-
- **Sample 3:** `An sarong taon sa Gregoryanong kalendaryo.

- Enero
- Pebrero
- Marso
- Abril
- Mayo...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁an ▁sarong ▁taonsagregoryanongkalendaryo . eneropebreromarso ... (+9 more)` | 19 |
- | 16k | `▁an ▁sarong ▁taonsagregoryanongkalendaryo .eneropebreromarso ... (+9 more)` | 19 |
- | 32k | `▁an ▁sarong ▁taonsagregoryanongkalendaryo .eneropebreromarso ... (+9 more)` | 19 |
- | 64k | `▁an ▁sarong ▁taonsagregoryanongkalendaryo .eneropebreromarso ... (+9 more)` | 19 |

  ### Key Findings

- - **Best Compression:** 64k achieves 4.640x compression
- - **Lowest UNK Rate:** 8k with 0.0148% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use

@@ -127,57 +129,89 @@ Abril

  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
- |--------|------------|---------|----------------|------------------|-------------------|
- | **2-gram** | 31,343 🏆 | 14.94 | 180,870 | 14.8% | 31.9% |
- | **2-gram** | 262 🏆 | 8.03 | 8,566 | 68.4% | 98.8% |
- | **3-gram** | 108,578 | 16.73 | 332,655 | 6.5% | 18.2% |
- | **3-gram** | 2,285 | 11.16 | 64,437 | 30.5% | 69.8% |
- | **4-gram** | 210,030 | 17.68 | 511,491 | 6.6% | 14.4% |
- | **4-gram** | 13,379 | 13.71 | 345,622 | 17.2% | 41.0% |

  ### Top 5 N-grams by Size

- **2-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `. an` | 41,934 |
- | 2 | `sa mga` | 30,441 |
- | 3 | `an mga` | 27,397 |
- | 4 | `, asin` | 26,685 |
- | 5 | `, an` | 24,473 |

- **3-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `kategorya : mga` | 16,293 |
- | 2 | `. an mga` | 6,827 |
- | 3 | `panluwas na takod` | 5,537 |
- | 4 | `mga panluwas na` | 4,931 |
- | 5 | `toltolan kategorya :` | 4,124 |

- **4-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `mga panluwas na takod` | 4,635 |
- | 2 | `toltolan kategorya : mga` | 2,861 |
- | 3 | `toltolan mga panluwas na` | 2,801 |
- | 4 | `— —` | 2,785 |
- | 5 | `. igwa ining sukol` | 2,225 |

  ### Key Findings

- - **Best Perplexity:** 2-gram with 262
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
- - **Coverage:** Top-1000 patterns cover ~41% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance

  ---
@@ -185,55 +219,86 @@ Abril

  ![Markov Entropy](visualizations/markov_entropy.png)

  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
- |---------|-------------|------------|------------------|-----------------|----------------|
- | **1** | 0.6497 | 1.569 | 5.59 | 379,065 | 35.0% |
- | **1** | 1.0949 | 2.136 | 6.69 | 6,611 | 0.0% |
- | **2** | 0.3654 | 1.288 | 2.19 | 2,116,590 | 63.5% |
- | **2** | 0.6035 | 1.519 | 3.87 | 44,194 | 39.6% |
- | **3** | 0.1662 | 1.122 | 1.36 | 4,629,958 | 83.4% |
- | **3** | 0.7134 | 1.640 | 3.84 | 171,168 | 28.7% |
- | **4** | 0.0685 🏆 | 1.049 | 1.12 | 6,293,312 | 93.1% |
- | **4** | 0.6518 🏆 | 1.571 | 2.96 | 656,881 | 34.8% |

- ### Generated Text Samples

- Below are text samples generated from each Markov chain model:

  **Context Size 1:**

- 1. `, sarong law jack white house of eastern europe award hale sa ' affaire jean nabiribid`
- 2. `sa mga padalian na english rosalía nagpirma sa banwaan kan ikasampulong kabilogan nin edukasyon si a...`
- 3. `na mga pagpreparar nin mayor na pigtuturing kan huring pararawitdawit , o tungkod " ) .`

  **Context Size 2:**

- 1. `. an designadong zip code kaini iyo . susog ki milagros perfecto sanchez sa halipot na usipon`
- 2. `sa mga osipon sa pilipino na may titulong paghinanyog man , siya nagpoon na mag - audition`
- 3. `an mga botelya , pakete nin kakanon asin an responsibilidad . sa ibaba sa kabtang kaini .`

  **Context Size 3:**

- 1. `kategorya : mga 2016 na kagadanan kategorya : mga tataramon na mansakan , iyo an pinagrekonstruhir n...`
- 2. `. an mga bitis nin manok sarong seryosong peligro nin pagkahilo sa susunod na taon huli sa iyo`
- 3. `panluwas na takod philatlas . com philippine standard geographic code local governance performance m...`

  **Context Size 4:**

- 1. `mga panluwas na takod inactive volcanoes page ( arkibo ) kategorya : mga unibersidad asin kolehiyo s...`
- 2. `toltolan kategorya : mga armadong sanga kan mga partido pulitika kategorya : mga organisasyon natugd...`
- 3. `toltolan mga panluwas na takod philatlas . com philippine standard geographic code local governance ...`

  ### Key Findings

- - **Best Predictability:** Context-4 with 93.1% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
- - **Memory Trade-off:** Larger contexts require more storage (656,881 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation

  ---
@@ -249,64 +314,64 @@ Below are text samples generated from each Markov chain model:

  | Metric | Value |
  |--------|-------|
- | Vocabulary Size | 139,464 |
- | Total Tokens | 6,306,562 |
- | Mean Frequency | 45.22 |
  | Median Frequency | 4 |
- | Frequency Std Dev | 1750.04 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | sa | 340,332 |
- | 2 | na | 337,956 |
- | 3 | an | 230,638 |
- | 4 | kan | 226,231 |
- | 5 | mga | 183,688 |
- | 6 | nin | 132,320 |
- | 7 | asin | 125,887 |
- | 8 | sarong | 62,639 |
- | 9 | si | 54,499 |
- | 10 | the | 44,508 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | zhaparova | 2 |
- | 2 | altynbekov | 2 |
- | 3 | wanatabe | 2 |
- | 4 | megapaniki | 2 |
- | 5 | kordon | 2 |
- | 6 | sobringaran | 2 |
- | 7 | khanid | 2 |
- | 8 | ganish | 2 |
- | 9 | archdioceseofcaceres | 2 |
- | 10 | niceno | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
- | Zipf Coefficient | 1.0291 |
- | R² (Goodness of Fit) | 0.993065 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
- | Top 100 | 41.8% |
- | Top 1,000 | 62.8% |
- | Top 5,000 | 79.1% |
- | Top 10,000 | 85.2% |

  ### Key Findings

- - **Zipf Compliance:** R²=0.9931 indicates excellent adherence to Zipf's law
- - **High Frequency Dominance:** Top 100 words cover 41.8% of corpus
- - **Long Tail:** 129,464 words needed for remaining 14.8% coverage

  ---
  ## 5. Word Embeddings Evaluation
@@ -319,24 +384,131 @@ Below are text samples generated from each Markov chain model:

  ![t-SNE Sentences](visualizations/tsne_sentences.png)

- ### Model Comparison

- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
- |-------|------------|-----------|----------|----------|----------|
- | **mono_32d** | 78,307 | 32 | 3.325 | 0.855 | 0.8200 🏆 |
- | **mono_64d** | 78,307 | 64 | 3.871 | 0.899 | 0.8194 |
- | **mono_128d** | 78,307 | 128 | 4.639 | 0.920 | 0.8065 |
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |

  ### Key Findings

- - **Best Isotropy:** mono_32d with 0.8200 (more uniform distribution)
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
- - **Vocabulary Coverage:** All models cover 78,307 words
- - **Recommendation:** 100d for balanced semantic capture and efficiency

  ---
- ## 6. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)

@@ -344,11 +516,12 @@ Below are text samples generated from each Markov chain model:

  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
- | Tokenizer | **32k BPE** | Best compression (4.64x) with low UNK rate |
- | N-gram | **5-gram** | Lowest perplexity (262) |
- | Markov | **Context-4** | Highest predictability (93.1%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

  ---
  ## Appendix: Metrics Glossary & Interpretation Guide

@@ -538,7 +711,8 @@ If you use these models in your research, please cite:
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
- publisher = {HuggingFace},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
  }
@@ -554,7 +728,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
  ---
  *Generated by Wikilangs Models Pipeline*

- *Report Date: 2025-12-28 00:25:48*
 
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
  type: compression
+ value: 4.812
  - name: best_isotropy
  type: isotropy
+ value: 0.8253
  - name: vocabulary_size
  type: vocab
+ value: 0
+ generated: 2026-01-03
  ---

  # BCL - Wikilangs Models

@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
+ - N-gram models (2, 3, 4, 5-gram)
+ - Markov chains (context of 1, 2, 3, 4 and 5)
  - Subword N-gram and Markov chains
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
  - Language Vocabulary
  - Language Statistics
+
  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation

@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+ - [7. Summary & Recommendations](#7-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)

@@ -68,57 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and

  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
+ | **8k** | 3.956x | 3.96 | 0.0173% | 358,080 |
+ | **16k** | 4.291x | 4.29 | 0.0188% | 330,176 |
+ | **32k** | 4.574x | 4.58 | 0.0200% | 309,738 |
+ | **64k** | 4.812x 🏆 | 4.82 | 0.0211% | 294,409 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

+ **Sample 1:** `Si Magno "Carlo" Jose Caparas (Marso 12, sa Pampanga - Mayo 25, sarong paragibon...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁si ▁mag no" car lo " jose ▁cap aras ... (+31 more)` | 41 |
+ | 16k | `▁si ▁mag no" car lo " jose ▁cap aras ... (+28 more)` | 38 |
+ | 32k | `▁si ▁magno" carlo " jose ▁caparas ▁( marso ▁ ... (+25 more)` | 35 |
+ | 64k | `▁si ▁magno" carlo " jose ▁caparas ▁( marso ▁ ... (+25 more)` | 35 |

+ **Sample 2:** `An Vermont sarong estado kan Estados Unidos. Kataytayan nin mga ladawan estado k...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁an ▁ver m ontsarongestadokanestadosunidos . ... (+8 more)` | 18 |
+ | 16k | `▁an ▁ver montsarongestadokanestadosunidos .kataytayan ... (+7 more)` | 17 |
+ | 32k | `▁an ▁vermontsarongestadokanestadosunidos .kataytayannin ... (+6 more)` | 16 |
+ | 64k | `▁an ▁vermontsarongestadokanestadosunidos .kataytayannin ... (+6 more)` | 16 |

+ **Sample 3:** `An sarong komyun asin banwaan sa Provincia nin Frosinone sa rehiyon Lazio kan It...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁an ▁sarong ▁komyunasinbanwaansaprovincianinf rosin ... (+7 more)` | 17 |
+ | 16k | `▁an ▁sarong ▁komyunasinbanwaansa ▁provincianinfrosinonesa ... (+5 more)` | 15 |
+ | 32k | `▁an ▁sarong ▁komyunasinbanwaansa ▁provincianinfrosinonesa ... (+5 more)` | 15 |
+ | 64k | `▁an ▁sarong ▁komyunasinbanwaansa ▁provincianinfrosinonesa ... (+5 more)` | 15 |

  ### Key Findings

+ - **Best Compression:** 64k achieves 4.812x compression
+ - **Lowest UNK Rate:** 8k with 0.0173% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
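
A quick way to sanity-check the numbers above is to load one of the tokenizers directly. A minimal sketch, assuming the `.model` files are standard SentencePiece models (the `▁` piece prefix in the samples suggests this) and that paths are relative to the repo root:

```python
# Minimal sketch: tokenize a sentence with the 32k tokenizer.
# Assumes the .model files are standard SentencePiece models.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/bcl_tokenizer_32k.model")
tokens = sp.encode("An sarong komyun asin banwaan.", out_type=str)
print(tokens)             # subword pieces, e.g. ['▁an', '▁sarong', ...]
print(sp.decode(tokens))  # pieces round-trip back to the input text
```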
 
 
@@ -127,57 +129,89 @@ Abril

  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

+ ![N-gram Unique](visualizations/ngram_unique.png)
+
  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
+ | **2-gram** | Word | 29,761 | 14.86 | 138,758 | 13.5% | 31.1% |
+ | **2-gram** | Subword | 216 🏆 | 7.75 | 6,792 | 72.6% | 99.3% |
+ | **3-gram** | Word | 80,221 | 16.29 | 216,640 | 7.6% | 19.4% |
+ | **3-gram** | Subword | 1,808 | 10.82 | 46,201 | 33.1% | 73.8% |
+ | **4-gram** | Word | 126,144 | 16.94 | 300,994 | 9.3% | 17.1% |
+ | **4-gram** | Subword | 10,403 | 13.34 | 248,296 | 18.8% | 43.7% |

  ### Top 5 N-grams by Size

+ **2-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `sa mga` | 29,819 |
+ | 2 | `an mga` | 26,719 |
+ | 3 | `kan mga` | 22,256 |
+ | 4 | `iyo an` | 17,168 |
+ | 5 | `nin mga` | 16,442 |

+ **3-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `panluwas na takod` | 5,464 |
+ | 2 | `mga panluwas na` | 4,866 |
+ | 3 | `toltolan mga panluwas` | 2,765 |
+ | 4 | `para sa mga` | 2,679 |
+ | 5 | `igwa ining sukol` | 2,227 |

+ **4-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `mga panluwas na takod` | 4,571 |
+ | 2 | `toltolan mga panluwas na` | 2,765 |
+ | 3 | `igwa ining sukol na` | 2,139 |
+ | 4 | `philippine standard geographic code` | 1,750 |
+ | 5 | `sa sensus kan igwa` | 1,728 |
+
+ **2-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `a n` | 1,344,298 |
+ | 2 | `a _` | 1,288,104 |
+ | 3 | `n _` | 1,218,286 |
+ | 4 | `_ s` | 827,447 |
+ | 5 | `n a` | 787,793 |
+
+ **3-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `a n _` | 694,503 |
+ | 2 | `_ n a` | 534,019 |
+ | 3 | `_ s a` | 519,575 |
+ | 4 | `n g _` | 461,251 |
+ | 5 | `_ k a` | 374,182 |
+
+ **4-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `_ s a _` | 333,664 |
+ | 2 | `_ n a _` | 329,330 |
+ | 3 | `k a n _` | 234,296 |
+ | 4 | `_ k a n` | 230,493 |
+ | 5 | `_ a n _` | 210,822 |

  ### Key Findings

+ - **Best Perplexity:** 2-gram (subword) with 216
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+ - **Coverage:** Top-1000 patterns cover ~44% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance
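
The Perplexity column is consistent with 2 raised to the Entropy column (for example, 2^7.75 ≈ 216 for subword 2-grams), so the table can be re-derived from the raw counts. A minimal sketch, assuming each n-gram parquet holds one n-gram per row with a raw `count` column (the column name is an assumption):

```python
# Minimal sketch: recompute entropy, perplexity, and top-1000 coverage
# from an n-gram count file. Perplexity here is 2**entropy, matching
# the relationship between the two columns in the table above.
import math
import pandas as pd

df = pd.read_parquet("models/word_ngram/bcl_2gram_word.parquet")
p = df["count"] / df["count"].sum()          # maximum-likelihood n-gram probabilities
entropy = -(p * p.map(math.log2)).sum()      # Shannon entropy in bits
perplexity = 2 ** entropy
top1000 = p.sort_values(ascending=False).head(1000).sum()
print(f"H={entropy:.2f} bits, PPL={perplexity:,.0f}, top-1000 coverage={top1000:.1%}")
```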
 
  ---
 
@@ -185,55 +219,86 @@ Abril

  ![Markov Entropy](visualizations/markov_entropy.png)

+ ![Markov Contexts](visualizations/markov_contexts.png)
+
  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
+ | **1** | Word | 0.7785 | 1.715 | 6.29 | 327,423 | 22.1% |
+ | **1** | Subword | 0.9154 | 1.886 | 5.39 | 7,079 | 8.5% |
+ | **2** | Word | 0.3185 | 1.247 | 1.98 | 2,054,215 | 68.1% |
+ | **2** | Subword | 0.5355 | 1.449 | 3.36 | 38,137 | 46.5% |
+ | **3** | Word | 0.1347 | 1.098 | 1.28 | 4,060,609 | 86.5% |
+ | **3** | Subword | 0.6397 | 1.558 | 3.61 | 128,219 | 36.0% |
+ | **4** | Word | 0.0494 🏆 | 1.035 | 1.08 | 5,171,638 | 95.1% |
+ | **4** | Subword | 0.6483 | 1.567 | 3.06 | 463,318 | 35.2% |

+ ### Generated Text Samples (Word-based)

+ Below are text samples generated from each word-based Markov chain model:

  **Context Size 1:**

+ 1. `sa sarong dating parakabayo na mitolohiya kan prepekturang hiroshima asin naglalaman nin estasyon pa...`
+ 2. `na binubuo an kapwa niya iyo an sityo sa filipinas pwesto kan mga panluwas na desenyo`
+ 3. `an elementong kimikal kaugalian na iran nag oogid nanggad nag aako sa vocals keyboards synths play`

  **Context Size 2:**

+ 1. `sa mga libreriya sa unibersidad kan klima permanenteng binabago an inskripsiyon na gapo iyo nahahama...`
+ 2. `an mga heswita na si bruce lee tanganing magtukdo sa saiyang komunidad sa online campaign kan gabnet`
+ 3. `kan mga cyclopes mayo nin neutron an kasarosarong istruktura sa salog patapsco durante kan panahon n...`

  **Context Size 3:**

+ 1. `panluwas na takod philatlas com philippine standard geographic code local governance performance man...`
+ 2. `mga panluwas na takod the incorporated owners of chungking mansions sha tsui`
+ 3. `toltolan mga panluwas na gubing na ini parateng ibinubuntog sa sipon ini tanganing masigurado na pag...`

  **Context Size 4:**

+ 1. `mga panluwas na takod philatlas com philippine standard geographic code philippine census informatio...`
+ 2. `toltolan mga panluwas na takod philatlas com philippine standard geographic code local governance pe...`
+ 3. `igwa ining sukol na kilometro kwadrado an designadong zip code kaini iyo sosog sa sensus kan igwa in...`
+
+ ### Generated Text Samples (Subword-based)
+
+ Below are text samples generated from each subword-based Markov chain model:
+
+ **Context Size 1:**
+
+ 1. `_ukikingama_ngam`
+ 2. `agrnan_ninin_n_i`
+ 3. `ntin_ag_teran_sw`
+
+ **Context Size 2:**
+
+ 1. `angurehirin_mgank`
+ 2. `a_tawantenedyan._`
+ 3. `n_sin_of_ippelinc`
+
+ **Context Size 3:**
+
+ 1. `an_anahi_mode_nin_`
+ 2. `_na_le_pula_04:35_`
+ 3. `_sanriquerto_paan_`
+
+ **Context Size 4:**
+
+ 1. `_sa_kaze_anggaro_sa`
+ 2. `_na_siness_(princia`
+ 3. `kan_cabulanguro_nin`

  ### Key Findings

+ - **Best Predictability:** Context-4 (word) with 95.1% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
+ - **Memory Trade-off:** Larger contexts require more storage (463,318 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation
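
Samples like the ones above come from weighted sampling over the stored transitions. A minimal sketch of that procedure, assuming the Markov parquet files map a context string to a next word with a count (the `context`, `next`, and `count` column names are assumptions):

```python
# Minimal sketch: sample text from the context-2 word-level Markov chain.
# Column names are assumptions about the parquet schema.
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/bcl_markov_ctx2_word.parquet")
chains = {ctx: (grp["next"].tolist(), grp["count"].tolist())
          for ctx, grp in df.groupby("context")}

def generate(seed: str, length: int = 15) -> str:
    words = seed.split()
    for _ in range(length):
        ctx = " ".join(words[-2:])              # context size 2
        if ctx not in chains:
            break                               # dead end: unseen context
        candidates, weights = chains[ctx]
        words.append(random.choices(candidates, weights=weights)[0])
    return " ".join(words)

print(generate("sa mga"))
```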
 
  ---
 
@@ -249,64 +314,64 @@ Below are text samples generated from each Markov chain model:

  | Metric | Value |
  |--------|-------|
+ | Vocabulary Size | 131,763 |
+ | Total Tokens | 5,884,976 |
+ | Mean Frequency | 44.66 |
  | Median Frequency | 4 |
+ | Frequency Std Dev | 1759.83 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
+ | 1 | sa | 336,085 |
+ | 2 | na | 332,599 |
+ | 3 | an | 226,864 |
+ | 4 | kan | 223,487 |
+ | 5 | mga | 165,146 |
+ | 6 | nin | 129,650 |
+ | 7 | asin | 123,857 |
+ | 8 | sarong | 61,956 |
+ | 9 | si | 54,132 |
+ | 10 | the | 42,788 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
+ | 1 | gorō | 2 |
+ | 2 | amaji | 2 |
+ | 3 | kasshi | 2 |
+ | 4 | shukufuku | 2 |
+ | 5 | teana | 2 |
+ | 6 | siony | 2 |
+ | 7 | keann | 2 |
+ | 8 | libertadores | 2 |
+ | 9 | rta | 2 |
+ | 10 | kontoor | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
+ | Zipf Coefficient | 1.0202 |
+ | R² (Goodness of Fit) | 0.994749 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
+ | Top 100 | 43.2% |
+ | Top 1,000 | 63.6% |
+ | Top 5,000 | 79.3% |
+ | Top 10,000 | 85.4% |

  ### Key Findings

+ - **Zipf Compliance:** R²=0.9947 indicates excellent adherence to Zipf's law
+ - **High Frequency Dominance:** Top 100 words cover 43.2% of corpus
+ - **Long Tail:** 121,763 words needed for remaining 14.6% coverage
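
The Zipf coefficient and R² above can be reproduced with a least-squares line through log-rank versus log-frequency. A minimal sketch, assuming the vocabulary parquet exposes a `frequency` column (the column name is an assumption):

```python
# Minimal sketch: fit Zipf's law freq(r) ∝ r^(-s) to the vocabulary.
import numpy as np
import pandas as pd

df = pd.read_parquet("models/vocabulary/bcl_vocabulary.parquet")
freq = np.sort(df["frequency"].to_numpy())[::-1]   # frequencies by descending rank
ranks = np.arange(1, len(freq) + 1)
x, y = np.log(ranks), np.log(freq)
slope, _ = np.polyfit(x, y, 1)                     # line through (log r, log f)
r2 = np.corrcoef(x, y)[0, 1] ** 2                  # goodness of fit
print(f"Zipf coefficient s = {-slope:.4f}, R² = {r2:.6f}")
```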
 
  ---
  ## 5. Word Embeddings Evaluation
 
@@ -319,24 +384,131 @@ Below are text samples generated from each Markov chain model:

  ![t-SNE Sentences](visualizations/tsne_sentences.png)

+ ### 5.1 Cross-Lingual Alignment
+
+ > *Note: Multilingual alignment visualization not available for this language.*
+
+ ### 5.2 Model Comparison
+
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+ |-------|-----------|----------|------------------|---------------|----------------|
+ | **mono_32d** | 32 | 0.8253 🏆 | 0.3513 | N/A | N/A |
+ | **mono_64d** | 64 | 0.8232 | 0.2638 | N/A | N/A |
+ | **mono_128d** | 128 | 0.8182 | 0.1917 | N/A | N/A |

  ### Key Findings

+ - **Best Isotropy:** mono_32d with 0.8253 (more uniform distribution)
+ - **Semantic Density:** Average pairwise similarity of 0.2689. Lower values indicate better semantic separation.
+ - **Alignment Quality:** No aligned models evaluated in this run.
+ - **Recommendation:** 128d aligned for best cross-lingual performance
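
The report does not spell out its exact isotropy and semantic-density formulas here, so the following is an illustrative sketch only: one common isotropy proxy (how evenly the centered embedding matrix spreads across directions) and mean pairwise cosine similarity as a density proxy, computed on a stand-in random matrix rather than the actual `.bin` files.

```python
# Illustrative sketch only: isotropy and density proxies on stand-in vectors.
# These are common definitions, not necessarily the pipeline's exact metrics.
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(10_000, 32))        # stand-in for loaded embedding vectors

# Isotropy proxy: ratio of smallest to largest singular-value energy of the
# mean-centered matrix (1.0 would mean perfectly uniform directions).
centered = vecs - vecs.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
isotropy = (s.min() / s.max()) ** 2

# Density proxy: mean pairwise cosine similarity over a random subsample.
sample = vecs[rng.choice(len(vecs), 1_000, replace=False)]
unit = sample / np.linalg.norm(sample, axis=1, keepdims=True)
sims = unit @ unit.T
density = sims[np.triu_indices_from(sims, k=1)].mean()

print(f"isotropy proxy ≈ {isotropy:.4f}, semantic density ≈ {density:.4f}")
```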
 
  ---
+ ## 6. Morphological Analysis (Experimental)
+
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+
+ ### 6.1 Productivity & Complexity
+
+ | Metric | Value | Interpretation | Recommendation |
+ |--------|-------|----------------|----------------|
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+
+ ### 6.2 Affix Inventory (Productive Units)
+
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts (see the sketch after the tables below).
+
+ #### Productive Prefixes
+ | Prefix | Examples |
+ |--------|----------|
+ | `pa-` | parliamentarians, panribay, pagpaharong |
+ | `na-` | nasipit, nagdesisyong, nakakalibog |
+ | `ma-` | magsolnop, maiko, magdebut |
+ | `pag-` | pagpaharong, pagkotkot, pagkasambit |
+ | `ka-` | karella, kantada, kaneko |
+ | `nag-` | nagdesisyong, nagkakampanyang, nagwawagayway |
+ | `pi-` | pilian, pinaatras, pinagmaigotan |
+
+ #### Productive Suffixes
+ | Suffix | Examples |
+ |--------|----------|
+ | `-n` | rubinstein, hizen, ballon |
+ | `-an` | sutan, tagiliran, pilian |
+ | `-ng` | chaeryeong, issuing, sinkretikong |
+ | `-on` | ballon, indemnipikasyon, monsoon |
+ | `-ong` | chaeryeong, sinkretikong, pagpaharong |
+ | `-ing` | issuing, isporting, nakakaheling |

+ ### 6.3 Bound Stems (Lexical Roots)
448
+
449
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
450
+
451
+ | Stem | Cohesion | Substitutability | Examples |
452
+ |------|----------|------------------|----------|
453
+ | `hili` | 2.57x | 38 contexts | chili, hilig, hilir |
454
+ | `inak` | 2.14x | 68 contexts | pinak, inako, inakò |
455
+ | `nter` | 1.96x | 91 contexts | inter, enter, antero |
456
+ | `agka` | 1.87x | 107 contexts | pagka, magka, nagka |
457
+ | `ista` | 1.82x | 115 contexts | pista, bista, lista |
458
+ | `agpa` | 1.93x | 87 contexts | ragpa, agpay, pagpa |
459
+ | `atio` | 2.05x | 51 contexts | patio, ratio, matios |
460
+ | `nagp` | 2.38x | 25 contexts | nagpe, nagpa, nagpur |
461
+ | `syon` | 1.80x | 72 contexts | bisyon, nasyon, posyon |
462
+ | `kula` | 2.01x | 37 contexts | kulam, kulas, kulan |
463
+ | `asyo` | 1.79x | 56 contexts | basyo, nasyo, hasyo |
464
+ | `agin` | 1.89x | 44 contexts | sagin, aging, nagin |
465
+
466
+ ### 6.4 Affix Compatibility (Co-occurrence)
467
+
468
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
469
+
470
+ | Prefix | Suffix | Frequency | Examples |
471
+ |--------|--------|-----------|----------|
472
+ | `-pa` | `-n` | 98 words | pagreparohon, patalingkason |
473
+ | `-na` | `-n` | 86 words | nakaptan, naman |
474
+ | `-ka` | `-n` | 81 words | kaaayon, kaenterohan |
475
+ | `-na` | `-an` | 75 words | nakaptan, naman |
476
+ | `-ka` | `-an` | 74 words | kaenterohan, kasilyasan |
477
+ | `-pi` | `-n` | 70 words | pian, pinaomayan |
478
+ | `-pi` | `-an` | 63 words | pian, pinaomayan |
479
+ | `-pa` | `-an` | 59 words | patotoohan, panlibangan |
480
+ | `-pa` | `-ng` | 55 words | pagsabing, paggurang |
481
+ | `-ma` | `-ng` | 52 words | magarang, matabang |
482
+
483
+ ### 6.5 Recursive Morpheme Segmentation
484
+
485
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
486
+
487
+ | Word | Suggested Split | Confidence | Stem |
488
+ |------|-----------------|------------|------|
489
+ | pagpapamahalang | **`pag-pa-pa-ma-hala-ng`** | 10.5 | `hala` |
490
+ | pinakagurangan | **`pi-na-ka-gura-ng-an`** | 10.5 | `gura` |
491
+ | pinakaprimerang | **`pi-na-ka-primera-ng`** | 9.0 | `primera` |
492
+ | nakakapaugma | **`na-ka-ka-pa-ugma`** | 9.0 | `ugma` |
493
+ | nakapagpalupad | **`na-ka-pag-pa-lupad`** | 9.0 | `lupad` |
494
+ | makatarungan | **`ma-ka-taru-ng-an`** | 9.0 | `taru` |
495
+ | nakakasumo | **`na-ka-ka-sumo`** | 7.5 | `sumo` |
496
+ | pagpapainit | **`pag-pa-pa-init`** | 7.5 | `init` |
497
+ | nagpapalihis | **`nag-pa-pa-lihis`** | 7.5 | `lihis` |
498
+ | pagkanamamanwaan | **`pag-ka-na-ma-ma-nwaan`** | 7.5 | `nwaan` |
499
+ | nagpabistong | **`nag-pa-bist-ong`** | 7.5 | `bist` |
500
+ | nakakalangkaw | **`na-ka-ka-langkaw`** | 7.5 | `langkaw` |
501
+ | nagpapalibot | **`nag-pa-pa-libot`** | 7.5 | `libot` |
502
+ | makakapugol | **`ma-ka-ka-pugol`** | 7.5 | `pugol` |
503
+ | pagkitabangan | **`pag-kitaba-ng-an`** | 7.5 | `kitaba` |
504
+
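A minimal sketch of the recursive idea, greedily peeling affixes from the tables above off both ends of a word. The actual pipeline's scoring and stopping rules are not specified here, so this is illustrative only:

```python
# Illustrative recursive segmentation: strip known prefixes, then suffixes,
# while the remainder stays long enough to be a plausible stem.
PREFIXES = ["pag", "nag", "pa", "na", "ma", "ka", "pi"]
SUFFIXES = ["ong", "ing", "an", "on", "ng", "n"]

def segment(word: str, min_stem: int = 3) -> list[str]:
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            return [p] + segment(word[len(p):], min_stem)
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return segment(word[: -len(s)], min_stem) + [s]
    return [word]

print("-".join(segment("nagpapalibot")))   # nag-pa-pa-libot, as in the table
```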
+ ### 6.6 Linguistic Interpretation
+
+ > **Automated Insight:**
+ The language BCL appears to be more isolating, or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
+
+ ---
+ ## 7. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)

 
@@ -344,11 +516,12 @@ Below are text samples generated from each Markov chain model:

  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
+ | Tokenizer | **64k BPE** | Best compression (4.81x) |
+ | N-gram | **2-gram** | Lowest perplexity (216) |
+ | Markov | **Context-4** | Highest predictability (95.1%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

  ---
  ## Appendix: Metrics Glossary & Interpretation Guide

 
@@ -538,7 +711,8 @@ If you use these models in your research, please cite:
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs}
  institution = {Omneity Labs}
  }
 
@@ -554,7 +728,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

+ *Report Date: 2026-01-03 06:41:18*
models/embeddings/monolingual/bcl_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:0b68fde739946da5ad381854c13e0c7e9f3ae925a20b45114dead92b2d7a5ec7
- size 1105550844
+ oid sha256:bf3cdb55fbf130648cf3ffee8edc3e5ecdf0fb3e434c5eb95017d070b6ed68d5
+ size 1101739857
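
Note that the `.bin`/`.parquet` entries in this commit are Git LFS pointers (a `version`/`oid`/`size` triple), not the payloads themselves. A minimal sketch for fetching an actual artifact, where the repo id is an assumption based on this commit page:

```python
# Fetch a real payload behind one of the LFS pointers in this commit.
# repo_id is an assumption; adjust to the actual Hub repository.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="wikilangs/bcl",
    filename="models/vocabulary/bcl_vocabulary.parquet",
)
print(path)  # local cache path of the downloaded file
```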
models/embeddings/monolingual/bcl_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
- "vocab_size": 78307
+ "vocab_size": 74648
  }
models/embeddings/monolingual/bcl_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2c5888b050fda02741996a649ef4e574917bf0c70c8bcd7b003fd8a04df805ab
- size 277411068
+ oid sha256:dc294724d9b9b5f9e9882106bbd52a196e0d3ff5d980d094bd1552d8c0d1f5e1
+ size 276410193
models/embeddings/monolingual/bcl_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
- "dim": 32,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
  },
- "vocab_size": 78307
+ "vocab_size": 74648
  }
models/embeddings/monolingual/bcl_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:73e0742024fc7fef32624a6cdeb09e3dc030122f49523813287cd56514f45ca0
- size 553457660
+ oid sha256:62e5f9c945bbfd63c2c974b8baef649c3879ac74b839bc579143094844889c04
+ size 551520081
models/embeddings/monolingual/bcl_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
- "dim": 64,
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
  },
- "vocab_size": 78307
+ "vocab_size": 74648
  }
models/subword_markov/bcl_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:739fa35135e9500445bce8bcb51a25f5c3997c1ee417f4e594aafd5b60d2ad0f
- size 286254
+ oid sha256:006988c9f7d26c290cee7f3efcec68f438194198095ef615a657d959d50bffe5
+ size 281190
models/subword_markov/bcl_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "subword",
  "language": "bcl",
- "unique_contexts": 6611,
- "total_transitions": 42024331
+ "unique_contexts": 7079,
+ "total_transitions": 38302882
  }
models/subword_markov/bcl_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f1fa5543c495a6dc7b592e6ed9db597421ed8a10f87195a57c3a5ae0adb04538
- size 1378883
+ oid sha256:ae7603efc10031eeb84e9f4ea835d551f816490109984087110f408b59dbe37e
+ size 1134736
models/subword_markov/bcl_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "subword",
  "language": "bcl",
- "unique_contexts": 44194,
- "total_transitions": 42002807
+ "unique_contexts": 38137,
+ "total_transitions": 38281621
  }
models/subword_markov/bcl_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:21e6e12757b0460aafcbbdf777b4947f1a68cd9d5a25f399807b67cd4a70bd88
- size 5038178
+ oid sha256:49ee3d03628a769fa0eaa6565a8c6cb451eeabe76d6105b458158c2fe59d0663
+ size 3801377
models/subword_markov/bcl_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "subword",
  "language": "bcl",
- "unique_contexts": 171168,
- "total_transitions": 41981283
+ "unique_contexts": 128219,
+ "total_transitions": 38260360
  }
models/subword_markov/bcl_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d8aaf71ea0ea25c963c31d64367f987031042ce65e722e9f5e8475e81791c7af
- size 15230088
+ oid sha256:ff2037012deaf809cbae28b5f83dbde59b2bd458ca4152e6a4cb0aacfe9c3446
+ size 11339203
models/subword_markov/bcl_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "subword",
  "language": "bcl",
- "unique_contexts": 656881,
- "total_transitions": 41959759
+ "unique_contexts": 463318,
+ "total_transitions": 38239099
  }
models/subword_ngram/bcl_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5afe288d387fa175c7c566a01acbddcc3e251edf63334aafe7f54ed4dee114e5
- size 112319
+ oid sha256:99ef93eead11f74c9add74db6169310c53797885fcb53f36d200d66ee5ef38b7
+ size 89797
models/subword_ngram/bcl_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "subword",
  "language": "bcl",
- "unique_ngrams": 8566,
- "total_ngrams": 42024331
+ "unique_ngrams": 6792,
+ "total_ngrams": 38302882
  }
models/subword_ngram/bcl_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6b01ba4b34f80ef57927352ca4ad9437096837f0d321a57fb99d039db1ecf0d5
- size 793658
+ oid sha256:60182df9a558595b603a59b59d1c3ec77041802aa43a6afa665a1b221f701b27
+ size 597753
models/subword_ngram/bcl_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "subword",
  "language": "bcl",
- "unique_ngrams": 64437,
- "total_ngrams": 42002807
+ "unique_ngrams": 46201,
+ "total_ngrams": 38281621
  }
models/subword_ngram/bcl_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:aa935e3bfc1d4549c3217dc3d673805015911677b09755efe6121a07a25deb22
- size 3810759
+ oid sha256:345f2173509b2b9bc8eed2c825e6591508bf37f060bb0ad7b1e85c6fc662db4c
+ size 2804511
models/subword_ngram/bcl_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "subword",
  "language": "bcl",
- "unique_ngrams": 345622,
- "total_ngrams": 41981283
+ "unique_ngrams": 248296,
+ "total_ngrams": 38260360
  }
models/tokenizer/bcl_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:eb0c1d6ec9b37aa48ab4580025133b7115a3a321b90a538df159371b6b346dc5
- size 504162
+ oid sha256:77a1b85ce56e78c7903b6f9dea869cefef456654bcf07353809aacdd461381f0
+ size 505737
models/tokenizer/bcl_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bcl_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:45ed3fdf3b5b2424886cfa7b6371ad864d24da591648b25007a712969fcd7bf9
- size 779719
+ oid sha256:a387469a476988f3e12c7fd375505ee9b078c2d19b0076337812e87e50d6f204
+ size 782150
models/tokenizer/bcl_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bcl_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1bca4456326070a24d1192d3df157782d08cdd45ec794b4e39075351072d4119
- size 1339713
+ oid sha256:0354558806559d124b6246dfc4d681e614984d19649abc6f6c225ad6e3692aae
+ size 1348303
models/tokenizer/bcl_tokenizer_64k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/tokenizer/bcl_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:001d9a2fb8dd93cbd5b03c53f63b6fda18503cff21a9fdcf488356c5ddaba705
- size 370216
+ oid sha256:94f252a579c2ba992311623e32fbb194c272d1a66be40c17c4ba4f8a8a434b14
+ size 370855
models/tokenizer/bcl_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render. See raw diff
 
models/vocabulary/bcl_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:377f83e39987b4da8cfa29b2470f276d727552a7628b8a387f78dd8541a70ab0
- size 2359026
+ oid sha256:261d873870e07e5f3af5f7a71946694964dd00bc89c0db0799bfdffcdcae911b
+ size 2250849
models/vocabulary/bcl_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
  "language": "bcl",
- "vocabulary_size": 139464,
+ "vocabulary_size": 131763,
+ "variant": "full",
  "statistics": {
- "type_token_ratio": 0.05786067345010461,
+ "type_token_ratio": 0.05388651103925395,
  "coverage": {
- "top_100": 0.4028147626471449,
- "top_1000": 0.6046892341630454,
- "top_5000": 0.7619812262587947,
- "top_10000": 0.8206949599325984
+ "top_100": 0.41810468235658227,
+ "top_1000": 0.6157275307187713,
+ "top_5000": 0.7676151406101507,
+ "top_10000": 0.8261024576825995
  },
- "hapax_count": 239283,
- "hapax_ratio": 0.6317753011905045,
- "total_documents": 21524
+ "hapax_count": 195915,
+ "hapax_ratio": 0.5978887810594548,
+ "total_documents": 21261
  }
  }
models/word_markov/bcl_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:aa0505ff75b4bb563a9c6248c9ad624724bc60f8e390de4c1480153c67bb974e
- size 19592785
+ oid sha256:6680aaaf3a96d97340daa50f62f0e696137d5ec1eb9abc2d5323a1ed5594af5e
+ size 18651232
models/word_markov/bcl_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "word",
  "language": "bcl",
- "unique_contexts": 379065,
- "total_transitions": 7959439
+ "unique_contexts": 327423,
+ "total_transitions": 6059630
  }
models/word_markov/bcl_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:017bca82c0d603a73ea84c30f277a13bf8138f46a8bd609c2ca20192bc0339c2
- size 51192571
+ oid sha256:3c5f462dad3387aa8f577ce6600d0c3d3137f0901270876d01c03c667bd229db
+ size 47432243
models/word_markov/bcl_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "word",
  "language": "bcl",
- "unique_contexts": 2116590,
- "total_transitions": 7937915
+ "unique_contexts": 2054215,
+ "total_transitions": 6038369
  }
models/word_markov/bcl_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2c15d097a7dfd722a8f8ff785ce89195257c0b0623ae7522c58fc7ad017fe2a1
- size 85270277
+ oid sha256:5f1aec535711d6b78be3bdf97f5be17854d10be0b6235c95a09d761d09f9c623
+ size 74460008
models/word_markov/bcl_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "word",
  "language": "bcl",
- "unique_contexts": 4629958,
- "total_transitions": 7916395
+ "unique_contexts": 4060609,
+ "total_transitions": 6017108
  }
models/word_markov/bcl_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:06def7271772bbb24e5855c6419398ebbd5ee42dac50cd2db0ae0e82b52681ec
- size 107623142
+ oid sha256:44044ee508a64aa6cf60b26a019c40c4eee8ce145b3ef6704b984ec3a21584e2
+ size 90731082
models/word_markov/bcl_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "word",
  "language": "bcl",
- "unique_contexts": 6293312,
- "total_transitions": 7894880
+ "unique_contexts": 5171638,
+ "total_transitions": 5995847
  }
models/word_ngram/bcl_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:804450949415c5c59b5cff49dc971557289a8e8c8df8fe79908549dace5f80d9
- size 2384371
+ oid sha256:08cf305cec6b2cf0ad98146fa9b9c75f750581fc4e8bec741b0701f0a9f6bb98
+ size 1902913
models/word_ngram/bcl_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "word",
  "language": "bcl",
- "unique_ngrams": 180870,
- "total_ngrams": 7959439
+ "unique_ngrams": 138758,
+ "total_ngrams": 6059630
  }
models/word_ngram/bcl_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bffb92dfe9eb20b87645783d0ec648103f9ae0fafa9ccd37d6e397d295243c2a
- size 4734062
+ oid sha256:520274c65061d647ce7f8254cb17dbd9f49c11af62fe089154ba74173a7ddca7
+ size 3252750
models/word_ngram/bcl_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "word",
  "language": "bcl",
- "unique_ngrams": 332655,
- "total_ngrams": 7937915
+ "unique_ngrams": 216640,
+ "total_ngrams": 6038369
  }
models/word_ngram/bcl_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bb52d0e8bc028c69de74ba58fb8dc5143ff79cc0d378ce50a1d7f95e483fa874
- size 7776301
+ oid sha256:269d3846b9ea536420d56644da8fc7560395037ebe501b724eef42e4e657cee2
+ size 4869234
models/word_ngram/bcl_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "word",
  "language": "bcl",
- "unique_ngrams": 511491,
- "total_ngrams": 7916395
+ "unique_ngrams": 300994,
+ "total_ngrams": 6017108
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (old):

  • SHA256: f5c2bcefd97fe8e423c14c71aa4d79914ad25e655cdfe58e1202251c27456d57
  • Pointer size: 131 Bytes
  • Size of remote file: 144 kB

Git LFS Details (new):

  • SHA256: b8a7ab7c5c882e266d2cf29dd44f6d107275907375db2fb0d48c9859c7e55a06
  • Pointer size: 131 Bytes
  • Size of remote file: 147 kB
visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED