omarkamali committed on Jan 3

Commit

c9da813

verified ·

1 Parent(s): b577f76

Upload all models and assets for csb (20251001)

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)

README.md +307 -132
models/embeddings/monolingual/csb_128d.bin +2 -2
models/embeddings/monolingual/csb_128d_metadata.json +5 -3
models/embeddings/monolingual/csb_32d.bin +2 -2
models/embeddings/monolingual/csb_32d_metadata.json +5 -3
models/embeddings/monolingual/csb_64d.bin +2 -2
models/embeddings/monolingual/csb_64d_metadata.json +5 -3
models/subword_markov/csb_markov_ctx1_subword.parquet +2 -2
models/subword_markov/csb_markov_ctx1_subword_metadata.json +2 -2
models/subword_markov/csb_markov_ctx2_subword.parquet +2 -2
models/subword_markov/csb_markov_ctx2_subword_metadata.json +2 -2
models/subword_markov/csb_markov_ctx3_subword.parquet +2 -2
models/subword_markov/csb_markov_ctx3_subword_metadata.json +2 -2
models/subword_markov/csb_markov_ctx4_subword.parquet +2 -2
models/subword_markov/csb_markov_ctx4_subword_metadata.json +2 -2
models/subword_ngram/csb_2gram_subword.parquet +2 -2
models/subword_ngram/csb_2gram_subword_metadata.json +2 -2
models/subword_ngram/csb_3gram_subword.parquet +2 -2
models/subword_ngram/csb_3gram_subword_metadata.json +2 -2
models/subword_ngram/csb_4gram_subword.parquet +2 -2
models/subword_ngram/csb_4gram_subword_metadata.json +2 -2
models/tokenizer/csb_tokenizer_16k.model +2 -2
models/tokenizer/csb_tokenizer_16k.vocab +0 -0
models/tokenizer/csb_tokenizer_32k.model +2 -2
models/tokenizer/csb_tokenizer_32k.vocab +0 -0
models/tokenizer/csb_tokenizer_64k.model +2 -2
models/tokenizer/csb_tokenizer_64k.vocab +0 -0
models/tokenizer/csb_tokenizer_8k.model +2 -2
models/tokenizer/csb_tokenizer_8k.vocab +0 -0
models/vocabulary/csb_vocabulary.parquet +2 -2
models/vocabulary/csb_vocabulary_metadata.json +10 -9
models/word_markov/csb_markov_ctx1_word.parquet +2 -2
models/word_markov/csb_markov_ctx1_word_metadata.json +2 -2
models/word_markov/csb_markov_ctx2_word.parquet +2 -2
models/word_markov/csb_markov_ctx2_word_metadata.json +2 -2
models/word_markov/csb_markov_ctx3_word.parquet +2 -2
models/word_markov/csb_markov_ctx3_word_metadata.json +2 -2
models/word_markov/csb_markov_ctx4_word.parquet +2 -2
models/word_markov/csb_markov_ctx4_word_metadata.json +2 -2
models/word_ngram/csb_2gram_word.parquet +2 -2
models/word_ngram/csb_2gram_word_metadata.json +2 -2
models/word_ngram/csb_3gram_word.parquet +2 -2
models/word_ngram/csb_3gram_word_metadata.json +2 -2
models/word_ngram/csb_4gram_word.parquet +2 -2
models/word_ngram/csb_4gram_word_metadata.json +2 -2
visualizations/embedding_isotropy.png +0 -0
visualizations/embedding_norms.png +0 -0
visualizations/embedding_similarity.png +2 -2
visualizations/markov_branching.png +0 -0
visualizations/markov_contexts.png +0 -0

README.md CHANGED

@@ -23,14 +23,14 @@ dataset_info:
 metrics:
   - name: best_compression_ratio
     type: compression
-    value: 3.993
   - name: best_isotropy
     type: isotropy
-    value: 0.7945
   - name: vocabulary_size
     type: vocab
-    value: 29805
-generated: 2025-12-29
 ---
 # CSB - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
-- N-gram models (2, 3, 4-gram)
-- Markov chains (context of 1, 2, 3 and 4)
 - Subword N-gram and Markov chains
-- Embeddings in various sizes and dimensions
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
-- [6. Summary & Recommendations](#6-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
@@ -68,51 +70,57 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
-| **8k** | 3.275x | 3.21 | 0.1515% | 221,765 |
-| **16k** | 3.531x | 3.46 | 0.1633% | 205,707 |
-| **32k** | 3.767x | 3.69 | 0.1743% | 192,804 |
-| **64k** | 3.993x 🏆 | 3.91 | 0.1847% | 181,902 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
-**Sample 1:** `Czôrnogłowòwô miewa (Ichthyaetus melanocephalus) - to je wiôldżi wòdny ptôch z r...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁czôrno g łowò wô ▁mie wa ▁( ich th ya ... (+32 more)` | 42 |
-| 16k | `▁czôrno g łowò wô ▁mie wa ▁( ich th ya ... (+29 more)` | 39 |
-| 32k | `▁czôrno g łowò wô ▁miewa ▁( ich th ya e ... (+26 more)` | 36 |
-| 64k | `▁czôrnog łowò wô ▁miewa ▁( ich th ya e tus ... (+23 more)` | 33 |
-**Sample 2:** `Czôłpino - to je kòlonia w pòmòrsczim wòjewództwie, w stołpsczim krézu, w gminie...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁cz ôł pino ▁- ▁to ▁je ▁kòlonia ▁w ▁pòmòrsczim ▁wòjewództwie ... (+18 more)` | 28 |
-| 16k | `▁czôł pino ▁- ▁to ▁je ▁kòlonia ▁w ▁pòmòrsczim ▁wòjewództwie , ... (+17 more)` | 27 |
-| 32k | `▁czôł pino ▁- ▁to ▁je ▁kòlonia ▁w ▁pòmòrsczim ▁wòjewództwie , ... (+17 more)` | 27 |
-| 64k | `▁czôłpino ▁- ▁to ▁je ▁kòlonia ▁w ▁pòmòrsczim ▁wòjewództwie , ▁w ... (+16 more)` | 26 |
-**Sample 3:** `Ágústa Eva Erlendsdóttir (ùr. 28 lëpinca 1982) je islandzkô spiéwôrka ë teatrown...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
-| 8k | `▁ á g ú sta ▁e va ▁er le nd ... (+36 more)` | 46 |
-| 16k | `▁á g ú sta ▁e va ▁er le nds dó ... (+31 more)` | 41 |
-| 32k | `▁á g ú sta ▁eva ▁er le nds dóttir ▁( ... (+28 more)` | 38 |
-| 64k | `▁ágústa ▁eva ▁er le nds dóttir ▁( ùr . ▁ ... (+24 more)` | 34 |
 ### Key Findings
-- **Best Compression:** 64k achieves 3.993x compression
-- **Lowest UNK Rate:** 8k with 0.1515% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -121,55 +129,87 @@ Below are sample sentences tokenized with each vocabulary size:
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
-| N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
-|--------|------------|---------|----------------|------------------|-------------------|
-| **2-gram** | 2,815 🏆 | 11.46 | 10,915 | 29.0% | 63.1% |
-| **2-gram** | 527 🏆 | 9.04 | 3,311 | 50.6% | 96.8% |
-| **3-gram** | 3,961 | 11.95 | 15,184 | 24.1% | 58.5% |
-| **3-gram** | 4,481 | 12.13 | 27,002 | 18.9% | 56.1% |
-| **4-gram** | 6,940 | 12.76 | 27,953 | 19.6% | 50.7% |
-| **4-gram** | 20,373 | 14.31 | 120,017 | 10.9% | 33.0% |
 ### Top 5 N-grams by Size
-**2-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `kategòrëjô :` | 6,721 |
-| 2 | `to je` | 2,527 |
-| 3 | `. w` | 2,269 |
-| 4 | `, w` | 2,124 |
-| 5 | `) ,` | 1,989 |
-**3-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `. kategòrëjô :` | 1,909 |
-| 2 | `- to je` | 1,487 |
-| 3 | `align = "` | 1,080 |
-| 4 | `. bùtnowé lënczi` | 1,077 |
-| 5 | `< p align` | 1,053 |
-**4-grams:**
 | Rank | N-gram | Count |
 |------|--------|-------|
-| 1 | `= " right "` | 1,053 |
-| 2 | `< p align =` | 1,053 |
-| 3 | `p align = "` | 1,053 |
-| 4 | `" right " >` | 1,053 |
-| 5 | `align = " right` | 1,053 |
 ### Key Findings
-- **Best Perplexity:** 2-gram with 527
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
 - **Coverage:** Top-1000 patterns cover ~33% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
@@ -179,55 +219,86 @@ Below are sample sentences tokenized with each vocabulary size:
 ![Markov Entropy](visualizations/markov_entropy.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
-| Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
-|---------|-------------|------------|------------------|-----------------|----------------|
-| **1** | 0.5086 | 1.423 | 3.23 | 84,682 | 49.1% |
-| **1** | 1.1315 | 2.191 | 8.49 | 997 | 0.0% |
-| **2** | 0.1827 | 1.135 | 1.41 | 273,056 | 81.7% |
-| **2** | 1.0415 | 2.058 | 6.33 | 8,455 | 0.0% |
-| **3** | 0.0666 | 1.047 | 1.12 | 383,141 | 93.3% |
-| **3** | 0.8899 | 1.853 | 4.06 | 53,427 | 11.0% |
-| **4** | 0.0297 🏆 | 1.021 | 1.05 | 428,917 | 97.0% |
-| **4** | 0.6225 🏆 | 1.539 | 2.47 | 216,692 | 37.8% |
-### Generated Text Samples
-Below are text samples generated from each Markov chain model:
 **Context Size 1:**
-1. `. rok pierszégò tësąclecégò , karna d - 89 dniów . com ) bëł trzëmóny przez`
-2. `, a béł nié gregòriańsczi kalãdôrz na hewòtny rok 1857 kategòrëjô : 48 . geografiô lewo`
-3. `w 1938 - rowni - wënalôzôrz . w pòlsce za zastrzélënego miemca w repùblice ( 2001`
 **Context Size 2:**
-1. `kategòrëjô : mùzycë gaga , lady`
-2. `to je jednorocznô roscëna z rodzëznë arónowatëch ( araceae ) . charakteristny wëzdrzatk : samc wëzdr...`
-3. `. w rozmienim rzeczenia kùrë rozgrzebałë wieselé , tj . wedle òbecnoscë związków pùrynowëch mùszą ji...`
 **Context Size 3:**
-1. `. kategòrëjô : szląsczé wòjewództwò kategòrëjô : gardë w pòlsce kategòrëjô : pòmòrsczé wsë kategòrëj...`
-2. `- to je wies w gminie tëchómié , w bëtowsczim krézu , w gminie dãbnica . tu je`
-3. `align = " right " > 10 < p align = " right " > 12 < p`
 **Context Size 4:**
-1. `" right " > 28 < p align = " right " > 26 < p align = "`
-2. `< p align = " right " > 14 < p align = " right " > 27 <`
-3. `p align = " right " > 17 < p align = " right " > 14 < p`
 ### Key Findings
-- **Best Predictability:** Context-4 with 97.0% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
-- **Memory Trade-off:** Larger contexts require more storage (216,692 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
@@ -243,64 +314,64 @@ Below are text samples generated from each Markov chain model:
 | Metric | Value |
 |--------|-------|
-| Vocabulary Size | 29,805 |
-| Total Tokens | 403,484 |
-| Mean Frequency | 13.54 |
 | Median Frequency | 3 |
-| Frequency Std Dev | 150.89 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | w | 17,332 |
-| 2 | je | 7,887 |
-| 3 | i | 6,871 |
-| 4 | kategòrëjô | 6,724 |
-| 5 | na | 6,682 |
-| 6 | z | 4,990 |
-| 7 | to | 4,758 |
-| 8 | sã | 3,710 |
-| 9 | do | 3,395 |
-| 10 | rok | 3,184 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
-| 1 | eliminowanié | 2 |
-| 2 | pòliticznich | 2 |
-| 3 | pôłna | 2 |
-| 4 | kòntrola | 2 |
-| 5 | ùmòwã | 2 |
-| 6 | stalinizm | 2 |
-| 7 | ssr1922 | 2 |
-| 8 | ssr1936 | 2 |
-| 9 | ssr1925 | 2 |
-| 10 | fssr | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
-| Zipf Coefficient | 0.9980 |
-| R² (Goodness of Fit) | 0.996291 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
-| Top 100 | 34.8% |
-| Top 1,000 | 62.7% |
-| Top 5,000 | 79.9% |
-| Top 10,000 | 87.6% |
 ### Key Findings
-- **Zipf Compliance:** R²=0.9963 indicates excellent adherence to Zipf's law
-- **High Frequency Dominance:** Top 100 words cover 34.8% of corpus
-- **Long Tail:** 19,805 words needed for remaining 12.4% coverage
 ---
 ## 5. Word Embeddings Evaluation
@@ -313,24 +384,125 @@ Below are text samples generated from each Markov chain model:
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
-### Model Comparison
-| Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
-|-------|------------|-----------|----------|----------|----------|
-| **mono_32d** | 9,374 | 32 | 4.120 | 1.060 | 0.7945 🏆 |
-| **mono_64d** | 9,374 | 64 | 4.324 | 0.986 | 0.5096 |
-| **mono_128d** | 9,374 | 128 | 4.393 | 0.976 | 0.1336 |
-| **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |
 ### Key Findings
-- **Best Isotropy:** mono_32d with 0.7945 (more uniform distribution)
-- **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
-- **Vocabulary Coverage:** All models cover 9,374 words
-- **Recommendation:** 100d for balanced semantic capture and efficiency
 ---
-## 6. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -338,11 +510,12 @@ Below are text samples generated from each Markov chain model:
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
-| Tokenizer | **32k BPE** | Best compression (3.99x) with low UNK rate |
-| N-gram | **5-gram** | Lowest perplexity (527) |
-| Markov | **Context-4** | Highest predictability (97.0%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
@@ -532,7 +705,8 @@ If you use these models in your research, please cite:
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
-  publisher = {HuggingFace},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
@@ -548,7 +722,8 @@ MIT License - Free for academic and commercial use.
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
 ---
 *Generated by Wikilangs Models Pipeline*
-*Report Date: 2025-12-29 05:39:24*

 metrics:
   - name: best_compression_ratio
     type: compression
+    value: 4.519
   - name: best_isotropy
     type: isotropy
+    value: 0.7759
   - name: vocabulary_size
     type: vocab
+    value: 0
+generated: 2026-01-03
 ---
 # CSB - Wikilangs Models
 ### Models & Assets
 - Tokenizers (8k, 16k, 32k, 64k)
+- N-gram models (2, 3, 4, 5-gram)
+- Markov chains (context of 1, 2, 3, 4 and 5)
 - Subword N-gram and Markov chains
+- Embeddings in various sizes and dimensions (aligned and unaligned)
 - Language Vocabulary
 - Language Statistics
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 ### Analysis and Evaluation
 - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
 - [4. Vocabulary Analysis](#4-vocabulary-analysis)
 - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+- [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+- [7. Summary & Recommendations](#7-summary--recommendations)
 - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
 - [Visualizations Index](#visualizations-index)
 ![Tokenizer Compression](visualizations/tokenizer_compression.png)
+![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+![Tokenizer OOV](visualizations/tokenizer_oov.png)
+![Total Tokens](visualizations/tokenizer_total_tokens.png)
 ### Results
 | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
 |------------|-------------|---------------|----------|--------------|
+| **8k** | 3.573x | 3.58 | 0.1681% | 180,853 |
+| **16k** | 3.908x | 3.91 | 0.1839% | 165,322 |
+| **32k** | 4.227x | 4.23 | 0.1989% | 152,876 |
+| **64k** | 4.519x 🏆 | 4.53 | 0.2126% | 142,981 |
 ### Tokenization Examples
 Below are sample sentences tokenized with each vocabulary size:
+**Sample 1:** `Nowô Zelandzkô - je państwã na òstrowach Spòkójnégò Òceanu. w Aùstralëji i Ocean...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁nowô ▁zel an dzkô ▁- ▁je ▁państwã ▁na ▁òst rowa ... (+14 more)` | 24 |
+| 16k | `▁nowô ▁zelan dzkô ▁- ▁je ▁państwã ▁na ▁òstrowach ▁spòkójnégò ▁òceanu ... (+6 more)` | 16 |
+| 32k | `▁nowô ▁zelan dzkô ▁- ▁je ▁państwã ▁na ▁òstrowach ▁spòkójnégò ▁òceanu ... (+5 more)` | 15 |
+| 64k | `▁nowô ▁zelandzkô ▁- ▁je ▁państwã ▁na ▁òstrowach ▁spòkójnégò ▁òceanu . ... (+4 more)` | 14 |
+**Sample 2:** `802 / DCCCII 800 « 801 « 802 » 803 » 804 Wëdarzenia Ùrodzëlë sã Ùmarlë`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁ 8 0 2 ▁/ ▁dccc ii ▁ 8 0 ... (+25 more)` | 35 |
+| 16k | `▁ 8 0 2 ▁/ ▁dccc ii ▁ 8 0 ... (+25 more)` | 35 |
+| 32k | `▁ 8 0 2 ▁/ ▁dccc ii ▁ 8 0 ... (+25 more)` | 35 |
+| 64k | `▁ 8 0 2 ▁/ ▁dccc ii ▁ 8 0 ... (+25 more)` | 35 |
+**Sample 3:** `Smierdzący bòcónk (Geranium robertianum L.) – to je jednorocznô abò dwalatnô ros...`
 | Vocab | Tokens | Count |
 |-------|--------|-------|
+| 8k | `▁smier dzą cy ▁bòc ónk ▁( ge ra nium ▁robert ... (+26 more)` | 36 |
+| 16k | `▁smier dzący ▁bòc ónk ▁( gera nium ▁robert ian um ... (+24 more)` | 34 |
+| 32k | `▁smier dzący ▁bòc ónk ▁( gera nium ▁robert ian um ... (+23 more)` | 33 |
+| 64k | `▁smier dzący ▁bòcónk ▁( geranium ▁robert ian um ▁l .) ... (+21 more)` | 31 |
 ### Key Findings
+- **Best Compression:** 64k achieves 4.519x compression
+- **Lowest UNK Rate:** 8k with 0.1681% unknown tokens
 - **Trade-off:** Larger vocabularies improve compression but increase model size
 - **Recommendation:** 32k vocabulary provides optimal balance for production use
 ![N-gram Perplexity](visualizations/ngram_perplexity.png)
+![N-gram Unique](visualizations/ngram_unique.png)
 ![N-gram Coverage](visualizations/ngram_coverage.png)
 ### Results
+| N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+|--------|---------|------------|---------|----------------|------------------|-------------------|
+| **2-gram** | Word | 1,973 | 10.95 | 6,252 | 31.3% | 68.4% |
+| **2-gram** | Subword | 459 🏆 | 8.84 | 2,759 | 53.4% | 98.1% |
+| **3-gram** | Word | 2,109 | 11.04 | 7,761 | 31.4% | 68.9% |
+| **3-gram** | Subword | 3,977 | 11.96 | 22,668 | 18.9% | 58.0% |
+| **4-gram** | Word | 3,756 | 11.88 | 15,387 | 27.9% | 59.4% |
+| **4-gram** | Subword | 19,041 | 14.22 | 103,678 | 9.9% | 32.9% |
 ### Top 5 N-grams by Size
+**2-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `to je` | 2,509 |
+| 2 | `bùtnowé lënczi` | 1,441 |
+| 3 | `ùrodzëlë sã` | 991 |
+| 4 | `w gminie` | 982 |
+| 5 | `m jin` | 873 |
+**3-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `wëdarzenia ùrodzëlë sã` | 849 |
+| 2 | `ùrodzëlë sã ùmarlë` | 814 |
+| 3 | `w pòmòrsczim wòjewództwie` | 642 |
+| 4 | `p p p` | 601 |
+| 5 | `pòmòrsczim wòjewództwie w` | 543 |
+**4-grams (Word):**
+| Rank | N-gram | Count |
+|------|--------|-------|
+| 1 | `wëdarzenia ùrodzëlë sã ùmarlë` | 753 |
+| 2 | `p p p p` | 566 |
+| 3 | `w pòmòrsczim wòjewództwie w` | 537 |
+| 4 | `królestwa i jinëch słowiańsczich` | 489 |
+| 5 | `i jinëch słowiańsczich krajów` | 489 |
+**2-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `c z` | 39,994 |
+| 2 | `a _` | 39,475 |
+| 3 | `_ w` | 38,361 |
+| 4 | `. _` | 33,310 |
+| 5 | `_ p` | 33,120 |
+**3-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `c z i` | 17,651 |
+| 2 | `_ w _` | 16,987 |
+| 3 | `s c z` | 14,602 |
+| 4 | `_ p ò` | 12,455 |
+| 5 | `n a _` | 11,117 |
+**4-grams (Subword):**
 | Rank | N-gram | Count |
 |------|--------|-------|
+| 1 | `s c z i` | 9,987 |
+| 2 | `c z i _` | 8,529 |
+| 3 | `_ j e _` | 7,782 |
+| 4 | `é g ò _` | 7,756 |
+| 5 | `_ n a _` | 6,415 |
 ### Key Findings
+- **Best Perplexity:** 2-gram (subword) with 459
 - **Entropy Trend:** Decreases with larger n-grams (more predictable)
 - **Coverage:** Top-1000 patterns cover ~33% of corpus
 - **Recommendation:** 4-gram or 5-gram for best predictive performance
 ![Markov Entropy](visualizations/markov_entropy.png)
+![Markov Contexts](visualizations/markov_contexts.png)
 ![Markov Branching](visualizations/markov_branching.png)
 ### Results
+| Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+|---------|---------|-------------|------------|------------------|-----------------|----------------|
+| **1** | Word | 0.5452 | 1.459 | 2.98 | 81,304 | 45.5% |
+| **1** | Subword | 1.0127 | 2.018 | 7.32 | 978 | 0.0% |
+| **2** | Word | 0.1324 | 1.096 | 1.26 | 240,607 | 86.8% |
+| **2** | Subword | 0.9831 | 1.977 | 6.04 | 7,148 | 1.7% |
+| **3** | Word | 0.0409 | 1.029 | 1.07 | 299,264 | 95.9% |
+| **3** | Subword | 0.8865 | 1.849 | 4.14 | 43,078 | 11.4% |
+| **4** | Word | 0.0201 🏆 | 1.014 | 1.03 | 315,962 | 98.0% |
+| **4** | Subword | 0.6527 | 1.572 | 2.59 | 178,117 | 34.7% |
+### Generated Text Samples (Word-based)
+Below are text samples generated from each word-based Markov chain model:
+**Context Size 1:**
+1. `w gminie wickò w nocë dlô biédnëch robòta jakno dzél wsë czelińskô hëta to béł pòlsczi`
+2. `je rëba z rodzëznë lycosidae òna rosce m jin na zôczątkù leno w pò ùpôdkù kòmùnizmù`
+3. `i białków zgòrzałégò zgòrzôłczi pòl jeziora potęgowskie to tak samò rok znoszą od średniowiecza do c...`
+**Context Size 2:**
+1. `to je dzél gardu grëdządza nad wisłą we zdrojach nova berlyn berlyn nigenberlin berlin berlinichen b...`
+2. `bùtnowé lënczi tadzino w geògraficznym słowôrzu pòlsczégò królestwa i jinëch słowiańsczich krajów pù...`
+3. `ùrodzëlë sã ùmarlë stolaté`
+**Context Size 3:**
+1. `wëdarzenia ùrodzëlë sã ùmarlë lesser giełdziński kòlekcjonéra dokôzów kùńsztu lesser giełdziński gaz...`
+2. `ùrodzëlë sã ùmarlë kalãdôrz na hewòtny rok juliańsczi 914 915 916 917 918 919 920 921 922 923`
+3. `w pòmòrsczim wòjewództwie w kartësczim krézu w gminie pòtãgòwò w stołpsczim krézu w gminie przedkòwò...`
+**Context Size 4:**
+1. `wëdarzenia ùrodzëlë sã ùmarlë kalãdôrz na hewòtny rok juliańsczi 948 949 950 951 952 953 954 955 956...`
+2. `p p p p p p p p p p p p p p p p p p p`
+3. `w pòmòrsczim wòjewództwie w kartësczim krézu w òbéńdze gminë somònino tu w szkòle dzece ùczą sã kasz...`
+### Generated Text Samples (Subword-based)
+Below are text samples generated from each subword-based Markov chain model:
 **Context Size 1:**
+1. `_kaństrdk_gô_òdz`
+2. `arstk:_todł_dnch`
+3. `icze_zegòriczëni`
 **Context Size 2:**
+1. `cziwónégò._maińst`
+2. `a_spòl.)_terticho`
+3. `_w_rowimòriart_ka`
 **Context Size 3:**
+1. `czi,_„roxy_dobis_z`
+2. `_w_chtërnym_są_z_d`
+3. `sczé_czajny),_mie_`
 **Context Size 4:**
+1. `sczi_egipsczégò_pòc`
+2. `czi_rôtësz_bëc_kòle`
+3. `_je_człowiańsczi_kò`
 ### Key Findings
+- **Best Predictability:** Context-4 (word) with 98.0% predictability
 - **Branching Factor:** Decreases with context size (more deterministic)
+- **Memory Trade-off:** Larger contexts require more storage (178,117 contexts)
 - **Recommendation:** Context-3 or Context-4 for text generation
 ---
 | Metric | Value |
 |--------|-------|
+| Vocabulary Size | 28,754 |
+| Total Tokens | 367,683 |
+| Mean Frequency | 12.79 |
 | Median Frequency | 3 |
+| Frequency Std Dev | 148.11 |
 ### Most Common Words
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | w | 17,439 |
+| 2 | je | 7,833 |
+| 3 | i | 6,889 |
+| 4 | na | 6,729 |
+| 5 | z | 5,037 |
+| 6 | to | 4,739 |
+| 7 | sã | 3,695 |
+| 8 | do | 3,401 |
+| 9 | rok | 3,185 |
+| 10 | a | 2,487 |
 ### Least Common Words (from vocabulary)
 | Rank | Word | Frequency |
 |------|------|-----------|
+| 1 | szahada | 2 |
+| 2 | allaha | 2 |
+| 3 | الله | 2 |
+| 4 | llāh | 2 |
+| 5 | tatarzy | 2 |
+| 6 | chtërzy | 2 |
+| 7 | prevost | 2 |
+| 8 | gwiôzdozbiór | 2 |
+| 9 | discover | 2 |
+| 10 | krakowska | 2 |
 ### Zipf's Law Analysis
 | Metric | Value |
 |--------|-------|
+| Zipf Coefficient | 0.9905 |
+| R² (Goodness of Fit) | 0.995948 |
 | Adherence Quality | **excellent** |
 ### Coverage Analysis
 | Top N Words | Coverage |
 |-------------|----------|
+| Top 100 | 36.0% |
+| Top 1,000 | 63.2% |
+| Top 5,000 | 79.8% |
+| Top 10,000 | 87.4% |
 ### Key Findings
+- **Zipf Compliance:** R²=0.9959 indicates excellent adherence to Zipf's law
+- **High Frequency Dominance:** Top 100 words cover 36.0% of corpus
+- **Long Tail:** 18,754 words needed for remaining 12.6% coverage
 ---
 ## 5. Word Embeddings Evaluation
 ![t-SNE Sentences](visualizations/tsne_sentences.png)
+### 5.1 Cross-Lingual Alignment
+> *Note: Multilingual alignment visualization not available for this language.*
+### 5.2 Model Comparison
+| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+|-------|-----------|----------|------------------|---------------|----------------|
+| **mono_32d** | 32 | 0.7759 🏆 | 0.3628 | N/A | N/A |
+| **mono_64d** | 64 | 0.4956 | 0.3193 | N/A | N/A |
+| **mono_128d** | 128 | 0.1441 | 0.3257 | N/A | N/A |
 ### Key Findings
+- **Best Isotropy:** mono_32d with 0.7759 (more uniform distribution)
+- **Semantic Density:** Average pairwise similarity of 0.3359. Lower values indicate better semantic separation.
+- **Alignment Quality:** No aligned models evaluated in this run.
+- **Recommendation:** 128d aligned for best cross-lingual performance
 ---
+## 6.  Morphological Analysis (Experimental)
+> ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+### 6.1 Productivity & Complexity
+| Metric | Value | Interpretation | Recommendation |
+|--------|-------|----------------|----------------|
+| Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+| Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+### 6.2 Affix Inventory (Productive Units)
+These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts.
+#### Productive Prefixes
+| Prefix | Examples |
+|--------|----------|
+| `-pr` | prozã, przerôbianié, prostonórta |
+| `-pò` | pòmòcë, pòtémù, pòmòcnégò |
+#### Productive Suffixes
+| Suffix | Examples |
+|--------|----------|
+| `-a` | plëszka, svôta, jóna |
+| `-ch` | chtërnich, artisticznëch, tarnowsczich |
+| `-ów` | kònkùrsów, splecënków, piesniów |
+| `-zi` | marokańsczi, hélsczi, esteticzi |
+| `-czi` | marokańsczi, hélsczi, esteticzi |
+### 6.3 Bound Stems (Lexical Roots)
+Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
+| Stem | Cohesion | Substitutability | Examples |
+|------|----------|------------------|----------|
+| `tërn` | 2.01x | 29 contexts | chtërną, chtërna, chtërnã |
+| `htër` | 2.05x | 23 contexts | chtërô, chtërë, chtëre |
+| `chtë` | 1.91x | 27 contexts | chtërô, chtërë, chtëre |
+| `szëb` | 1.99x | 22 contexts | kaszëb, kaszëbi, kaszëba |
+| `zeni` | 1.64x | 32 contexts | zenice, ùczenié, ùczeniô |
+| `odzë` | 1.79x | 22 contexts | rodzëc, rodzënë, godzëną |
+| `stol` | 1.78x | 20 contexts | stole, stolp, stolpe |
+| `rodz` | 1.40x | 44 contexts | rodzy, rodze, rodzą |
+| `aszë` | 1.91x | 14 contexts | kaszëb, kaszëbi, kaszëba |
+| `sczé` | 1.41x | 30 contexts | rusczé, wąsczé, nisczé |
+| `zëzn` | 1.40x | 29 contexts | rodzëzna, rodzëznë, żëdzëzna |
+| `zëbs` | 2.04x | 9 contexts | kaszëbskô, kaszëbskù, kaszëbskò |
+### 6.4 Affix Compatibility (Co-occurrence)
+This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
+| Prefix | Suffix | Frequency | Examples |
+|--------|--------|-----------|----------|
+| `-pr` | `-a` | 36 words | prałata, prawidła |
+| `-pò` | `-a` | 22 words | pòéta, pòetka |
+| `-pr` | `-ów` | 19 words | procëmników, przezeblôkańców |
+| `-pò` | `-ch` | 15 words | pòlsczich, pòswiãconëch |
+| `-pò` | `-zi` | 12 words | pòrénszi, pòwieczi |
+| `-pò` | `-ów` | 12 words | pòétów, pòkôzków |
+| `-pr` | `-ch` | 10 words | przédnich, prësach |
+| `-pò` | `-czi` | 10 words | pòwieczi, pòprôwczi |
+| `-pr` | `-zi` | 4 words | prëczkòwsczi, prekmùrsczi |
+| `-pr` | `-czi` | 4 words | prëczkòwsczi, prekmùrsczi |
+### 6.5 Recursive Morpheme Segmentation
+Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
+| Word | Suggested Split | Confidence | Stem |
+|------|-----------------|------------|------|
+| francesczi | **`frances-czi`** | 4.5 | `frances` |
+| przebendowsczich | **`pr-zebendows-czi-ch`** | 4.5 | `zebendows` |
+| rozmajitéch | **`rozmajité-ch`** | 4.5 | `rozmajité` |
+| misyjnych | **`misyjny-ch`** | 4.5 | `misyjny` |
+| kòloniach | **`kòlonia-ch`** | 4.5 | `kòlonia` |
+| instrumentów | **`instrument-ów`** | 4.5 | `instrument` |
+| òpòwiesców | **`òpòwiesc-ów`** | 4.5 | `òpòwiesc` |
+| rockòwich | **`rockòwi-ch`** | 4.5 | `rockòwi` |
+| nôrodnych | **`nôrodny-ch`** | 4.5 | `nôrodny` |
+| kòntinentów | **`kòntinent-ów`** | 4.5 | `kòntinent` |
+| chtërnich | **`chtërni-ch`** | 4.5 | `chtërni` |
+| pierszëch | **`pierszë-ch`** | 4.5 | `pierszë` |
+| napùlsczich | **`napùls-czi-ch`** | 3.0 | `napùls` |
+| pòwijôczowatëch | **`pò-wijôczowatë-ch`** | 3.0 | `wijôczowatë` |
+| profesorów | **`pr-ofesor-ów`** | 3.0 | `ofesor` |
+### 6.6 Linguistic Interpretation
+> **Automated Insight:**
+The language CSB appears to be more isolating or has a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
+---
+## 7. Summary & Recommendations
 ![Performance Dashboard](visualizations/performance_dashboard.png)
 | Component | Recommended | Rationale |
 |-----------|-------------|-----------|
+| Tokenizer | **64k BPE** | Best compression (4.52x) |
+| N-gram | **2-gram** | Lowest perplexity (459) |
+| Markov | **Context-4** | Highest predictability (98.0%) |
 | Embeddings | **100d** | Balanced semantic capture and isotropy |
 ---
 ## Appendix: Metrics Glossary & Interpretation Guide
   author = {Kamali, Omar},
   title = {Wikilangs: Open NLP Models for Wikipedia Languages},
   year = {2025},
+  doi = {10.5281/zenodo.18073153},
+  publisher = {Zenodo},
   url = {https://huggingface.co/wikilangs}
   institution = {Omneity Labs}
 }
 - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
 - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
 - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+- 🤝 Sponsor: [Featherless AI](https://featherless.ai)
 ---
 *Generated by Wikilangs Models Pipeline*
+*Report Date: 2026-01-03 10:37:34*
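The updated figures in the README diff can be cross-checked against each other. The sketch below assumes that perplexity is reported as 2 raised to the entropy in bits and that mean frequency is total tokens divided by vocabulary size. Neither definition is stated in the diff, so both are assumptions, and the helper names (`relative_gap`, `mean_frequency`, `long_tail`) are hypothetical. The numbers are copied from the `+` rows above.

```python
# (entropy in bits, reported perplexity) pairs copied from the "+" rows of the README diff.
NGRAM_ROWS = {
    "2-gram word": (10.95, 1973),
    "2-gram subword": (8.84, 459),
    "3-gram word": (11.04, 2109),
    "4-gram subword": (14.22, 19041),
}
MARKOV_ROWS = {
    "ctx1 word": (0.5452, 1.459),
    "ctx4 word": (0.0201, 1.014),
}


def relative_gap(entropy_bits, reported_perplexity):
    """Relative difference between 2**entropy and the reported perplexity.

    Assumption: the pipeline defines perplexity as 2**entropy (entropy in bits).
    """
    return abs(2 ** entropy_bits - reported_perplexity) / reported_perplexity


def mean_frequency(total_tokens, vocabulary_size):
    """Assumed definition: average number of occurrences per vocabulary entry."""
    return total_tokens / vocabulary_size


def long_tail(vocabulary_size, top_n, top_n_coverage_pct):
    """Words outside the top N and the share of the corpus they account for."""
    return vocabulary_size - top_n, round(100.0 - top_n_coverage_pct, 1)


if __name__ == "__main__":
    for name, (entropy, perplexity) in {**NGRAM_ROWS, **MARKOV_ROWS}.items():
        print(f"{name}: relative gap {relative_gap(entropy, perplexity):.4%}")
    print("mean frequency:", round(mean_frequency(367_683, 28_754), 2))
    print("long tail:", long_tail(28_754, 10_000, 87.4))
```

Because the table rounds entropy to two decimals for n-grams, a small gap between `2**entropy` and the reported perplexity is expected. A large gap would suggest a different definition or a reporting error.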

models/embeddings/monolingual/csb_128d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b25696cdb50896086fa8765164ac99b1a74f6baa87855743340870de8feb8846
-size 1033764191

 version https://git-lfs.github.com/spec/v1
+oid sha256:41e6f20527f7bdf6b8fa87b31ba74d5a6f79ab296290f0dc98660613cb8a2fe1
+size 1032826887
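The binary files in this commit are stored through Git LFS, so the diff shows only pointer files (`version`, `oid`, `size`) rather than the model contents. As a minimal sketch, the pointer text can be parsed to compare the old and new objects. The function name `parse_lfs_pointer` and the constants are hypothetical. The pointer values are copied from the hunk above.

```python
def parse_lfs_pointer(text):
    """Parse the space-separated key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid line has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


# Old and new pointers for csb_128d.bin, copied from the diff.
OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:b25696cdb50896086fa8765164ac99b1a74f6baa87855743340870de8feb8846
size 1033764191"""

NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:41e6f20527f7bdf6b8fa87b31ba74d5a6f79ab296290f0dc98660613cb8a2fe1
size 1032826887"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)

# A changed digest means the stored object was replaced; the size delta shows by how much.
content_changed = old["digest"] != new["digest"]
size_delta = new["size"] - old["size"]

if __name__ == "__main__":
    print("content changed:", content_changed)
    print("size delta (bytes):", size_delta)
```

The same parsing applies to the other `.bin`, `.parquet`, and `.model` entries in this commit, since each is listed with a `+2 -2` pointer change.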

models/embeddings/monolingual/csb_128d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 128,
   "version": "monolingual",
   "training_params": {
-    "dim": 128,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 9374
 }

   "dimension": 128,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 128
   },
+  "vocab_size": 8475
 }

models/embeddings/monolingual/csb_32d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:20720c555b098f34e4b779eef8775c678739812d35172517c163c47355e7324d
-size 258564959

 version https://git-lfs.github.com/spec/v1
+oid sha256:5cbc494ad59e671d08ac1ee2324782674a0601691c025b52f61a1dc262c55a50
+size 258318087

models/embeddings/monolingual/csb_32d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 32,
   "version": "monolingual",
   "training_params": {
-    "dim": 32,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 9374
 }

   "dimension": 32,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 32
   },
+  "vocab_size": 8475
 }

models/embeddings/monolingual/csb_64d.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b8f45f0bad986067c8e7d0286759aba19f46dd6e1543da500e24b416baeacabf
-size 516964703

 version https://git-lfs.github.com/spec/v1
+oid sha256:bf5715a3cae0ee5b9bb8ac08a4ce47911ac64ce20e714c3910d2bc8b96c6e509
+size 516487687

models/embeddings/monolingual/csb_64d_metadata.json CHANGED

@@ -3,11 +3,13 @@
   "dimension": 64,
   "version": "monolingual",
   "training_params": {
-    "dim": 64,
     "min_count": 5,
     "window": 5,
     "negative": 5,
-    "epochs": 5
   },
-  "vocab_size": 9374
 }

   "dimension": 64,
   "version": "monolingual",
   "training_params": {
+    "algorithm": "skipgram",
     "min_count": 5,
     "window": 5,
     "negative": 5,
+    "epochs": 5,
+    "encoding_method": "rope",
+    "dim": 64
   },
+  "vocab_size": 8475
 }

models/subword_markov/csb_markov_ctx1_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f5068a579f2b171e606792ca829e69b9e3143e65339a9efc3506a7ca5115d428
-size 67058

 version https://git-lfs.github.com/spec/v1
+oid sha256:3e05a853044730e6b38caaa21dbd01cd8c459c5c597d8ef1f2b019b62bd3a09b
+size 59753

models/subword_markov/csb_markov_ctx1_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "subword",
   "language": "csb",
-  "unique_contexts": 997,
-  "total_transitions": 3218844
 }

   "context_size": 1,
   "variant": "subword",
   "language": "csb",
+  "unique_contexts": 978,
+  "total_transitions": 2901533
 }

models/subword_markov/csb_markov_ctx2_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3967d1eec9d86350b77702cc6bf44691566e26d4118f3b6a5a951c72f64b857c
-size 417008

 version https://git-lfs.github.com/spec/v1
+oid sha256:9c40dba7c966b35d871440a303053f29b981727b16a9711bc7f3b928ef397d93
+size 348349

models/subword_markov/csb_markov_ctx2_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "subword",
   "language": "csb",
-  "unique_contexts": 8455,
-  "total_transitions": 3213321
 }

   "context_size": 2,
   "variant": "subword",
   "language": "csb",
+  "unique_contexts": 7148,
+  "total_transitions": 2896074
 }

models/subword_markov/csb_markov_ctx3_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1adb1ab503943920e66b3ff9b49e54d0155a7cf3929f47d77c4b006ada368b15
-size 1513527

 version https://git-lfs.github.com/spec/v1
+oid sha256:baf3ffd5d454791b7f04ed0e159954822bec5bc7bb1ca316a99d9ff1110cb21b
+size 1286038

models/subword_markov/csb_markov_ctx3_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "subword",
   "language": "csb",
-  "unique_contexts": 53427,
-  "total_transitions": 3207798
 }

   "context_size": 3,
   "variant": "subword",
   "language": "csb",
+  "unique_contexts": 43078,
+  "total_transitions": 2890615
 }

models/subword_markov/csb_markov_ctx4_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:033de4acd88afc42990f3445f74bc3b566c328579043dc2a44aba71413b171b1
-size 4278773

 version https://git-lfs.github.com/spec/v1
+oid sha256:4c7319c1902c4efc845966ed77f3e54a04e733309dd3f842ad88f03e692305da
+size 3560727

models/subword_markov/csb_markov_ctx4_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 4,
   "variant": "subword",
   "language": "csb",
-  "unique_contexts": 216692,
-  "total_transitions": 3202275
 }

   "context_size": 4,
   "variant": "subword",
   "language": "csb",
+  "unique_contexts": 178117,
+  "total_transitions": 2885156
 }

models/subword_ngram/csb_2gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:391e1d583df21fdc46a2aa0ff203e194cbae2e42cdfbc71444077b4e75b45f91
-size 45859

 version https://git-lfs.github.com/spec/v1
+oid sha256:cd44ffe8316f9e70b216abc0f7350ace6965b2fb8ca598790d222616a87aa285
+size 39483

models/subword_ngram/csb_2gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 2,
   "variant": "subword",
   "language": "csb",
-  "unique_ngrams": 3311,
-  "total_ngrams": 3218844
 }

   "n": 2,
   "variant": "subword",
   "language": "csb",
+  "unique_ngrams": 2759,
+  "total_ngrams": 2901533
 }

models/subword_ngram/csb_3gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4c3002a67a08d5faaeac52d5440402498e06e19305e35f54d276c9528d4a3152
-size 336547

 version https://git-lfs.github.com/spec/v1
+oid sha256:ae511b1aec2b3de495c0e9773d3546aa9f63c0161f1fd07bc6a21de62ebbc168
+size 284170

models/subword_ngram/csb_3gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 3,
   "variant": "subword",
   "language": "csb",
-  "unique_ngrams": 27002,
-  "total_ngrams": 3213321
 }

   "n": 3,
   "variant": "subword",
   "language": "csb",
+  "unique_ngrams": 22668,
+  "total_ngrams": 2896074
 }

models/subword_ngram/csb_4gram_subword.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:666072268ae0a2ccac186322c5838b59fae96d23c14780857ba54c7685d9adfc
-size 1404811

 version https://git-lfs.github.com/spec/v1
+oid sha256:12babc3b06e25fb857cc7fd5e7bd597cfa74946c127a0e861d2126235b75b740
+size 1210992

models/subword_ngram/csb_4gram_subword_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "n": 4,
   "variant": "subword",
   "language": "csb",
-  "unique_ngrams": 120017,
-  "total_ngrams": 3207798
 }

   "n": 4,
   "variant": "subword",
   "language": "csb",
+  "unique_ngrams": 103678,
+  "total_ngrams": 2890615
 }

models/tokenizer/csb_tokenizer_16k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6794824430b2159e34f31e9d5399a3a51fa91bd5ddb23993929d095a06f814ab
-size 513958

 version https://git-lfs.github.com/spec/v1
+oid sha256:c396d5d00fa2d75c975c56bc121841f40ca7c7c22b9020fd8d1cada0062c2b40
+size 512151

models/tokenizer/csb_tokenizer_16k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/csb_tokenizer_32k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:189c4d650dc51fd321cd823422ea651c99f67ced54e082e8b10a68d0fd5dcf41
-size 795214

 version https://git-lfs.github.com/spec/v1
+oid sha256:a45eb87760776eef205d9de1e20ab9be70fab5b15ea858044a82e52b6e3c863a
+size 794869

models/tokenizer/csb_tokenizer_32k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/csb_tokenizer_64k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e90c512e91a4cca37e1ed52b1935794a8711b3c1be3ed9d23759b91198934d8c
-size 1410697

 version https://git-lfs.github.com/spec/v1
+oid sha256:7f9ecb06a069b33f750a0a07c327a2ef366efe2b86633a89f1e83698fc9f3c0b
+size 1419438

models/tokenizer/csb_tokenizer_64k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/tokenizer/csb_tokenizer_8k.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb3964616e0fc66549431807a5f232c1bab21a60b780d08671a92f04877c67f4
-size 374311

 version https://git-lfs.github.com/spec/v1
+oid sha256:a6e6f3cf7f11f412d56e5d102ff3e025a5475742bdbb3e47358ced7e86ca13cf
+size 374756

models/tokenizer/csb_tokenizer_8k.vocab CHANGED

The diff for this file is too large to render. See raw diff

models/vocabulary/csb_vocabulary.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:43400e03ea20ace4b8bdb572e3883499c9f7b97bae69b50d4944fbe2ee536810
-size 512071

 version https://git-lfs.github.com/spec/v1
+oid sha256:d4ec8917a796775c330b809302d26f7ef684e595a22733614a196066a8dc7d30
+size 496629

models/vocabulary/csb_vocabulary_metadata.json CHANGED

@@ -1,16 +1,17 @@
 {
   "language": "csb",
-  "vocabulary_size": 29805,
   "statistics": {
-    "type_token_ratio": 0.1846802030013833,
     "coverage": {
-      "top_100": 0.30673194828090294,
-      "top_1000": 0.5521707445856843,
-      "top_5000": 0.7036886730290058,
-      "top_10000": 0.7713288910416694
     },
-    "hapax_count": 54838,
-    "hapax_ratio": 0.6478740120269839,
-    "total_documents": 5523
   }
 }

 {
   "language": "csb",
+  "vocabulary_size": 28754,
+  "variant": "full",
   "statistics": {
+    "type_token_ratio": 0.19407197802851062,
     "coverage": {
+      "top_100": 0.31498650560582103,
+      "top_1000": 0.5524925988895362,
+      "top_5000": 0.6979419562710293,
+      "top_10000": 0.7645555172454791
     },
+    "hapax_count": 52862,
+    "hapax_ratio": 0.6476916290923348,
+    "total_documents": 5459
   }
 }
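The reported `hapax_ratio` is numerically consistent with `hapax_count / (vocabulary_size + hapax_count)` for both the old and the new metadata. That is, the ratio appears to be taken over all word types, with `vocabulary_size` counting only the non-hapax types. This formula is inferred from the numbers in this diff, not taken from the pipeline's code, so treat the sketch below as a consistency check under that assumption.

```python
import math


def hapax_ratio(vocabulary_size: int, hapax_count: int) -> float:
    """Share of word types that occur exactly once.

    Assumes (inferred, not documented) that vocabulary_size excludes hapaxes.
    """
    return hapax_count / (vocabulary_size + hapax_count)


# Values copied from csb_vocabulary_metadata.json before and after this commit.
old_reported = 0.6478740120269839
new_reported = 0.6476916290923348

old_computed = hapax_ratio(vocabulary_size=29805, hapax_count=54838)
new_computed = hapax_ratio(vocabulary_size=28754, hapax_count=52862)

print(math.isclose(old_computed, old_reported, abs_tol=1e-9))
print(math.isclose(new_computed, new_reported, abs_tol=1e-9))
```

Under this reading, roughly 65% of distinct word forms in the corpus occur only once, both before and after the update.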

models/word_markov/csb_markov_ctx1_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4ed07c1fba3e9c978f2541528729168751fa7fe61ecec3655bb6f961a9aad71c
-size 2783036

 version https://git-lfs.github.com/spec/v1
+oid sha256:e732ca9b72afa9a30e582ebd48d54d02ff69454bc3acb08680e67d5bbf2db206
+size 2573619

models/word_markov/csb_markov_ctx1_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 1,
   "variant": "word",
   "language": "csb",
-  "unique_contexts": 84682,
-  "total_transitions": 603497
 }

   "context_size": 1,
   "variant": "word",
   "language": "csb",
+  "unique_contexts": 81304,
+  "total_transitions": 415086
 }

models/word_markov/csb_markov_ctx2_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ddf878b89b081ed8f670e77e8aa7eb3a65026f82687fdb4e392596d39e7081e1
-size 5412690

 version https://git-lfs.github.com/spec/v1
+oid sha256:2660c524e5ebffcf9e518ae94f1ef886c064dab42f134947c976da478ff1da2f
+size 4877767

models/word_markov/csb_markov_ctx2_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 2,
   "variant": "word",
   "language": "csb",
-  "unique_contexts": 273056,
-  "total_transitions": 597974
 }

   "context_size": 2,
   "variant": "word",
   "language": "csb",
+  "unique_contexts": 240607,
+  "total_transitions": 409627
 }

models/word_markov/csb_markov_ctx3_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cc1dc5ef590474e910c47d1150f28c5a6b16f78c91a67ace5ed90a83a7ba3a3a
-size 7212126

 version https://git-lfs.github.com/spec/v1
+oid sha256:7cdd39ad4b22a2d46c4dd86b7ba493cfc9aa00961c4a78695d6c28451dc5014e
+size 6113660

models/word_markov/csb_markov_ctx3_word_metadata.json CHANGED

@@ -2,6 +2,6 @@
   "context_size": 3,
   "variant": "word",
   "language": "csb",
-  "unique_contexts": 383141,
-  "total_transitions": 592451
 }

   "context_size": 3,
   "variant": "word",
   "language": "csb",
+  "unique_contexts": 299264,
+  "total_transitions": 404168
 }
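Across the Markov metadata in this commit, `total_transitions` drops by the same amount each time `context_size` grows by one, and that amount equals `total_documents` (5459) from the vocabulary metadata. One plausible reading is that each document contributes one fewer transition per extra context token, for example because contexts do not cross document boundaries. That interpretation is an assumption; the arithmetic itself can be checked with a short sketch (the function name `step_differences` is illustrative only).

```python
def step_differences(totals: dict[int, int]) -> list[int]:
    """Drop in total_transitions between consecutive context sizes."""
    sizes = sorted(totals)
    return [totals[a] - totals[b] for a, b in zip(sizes, sizes[1:])]


# New-side values copied from the metadata files in this diff.
total_documents = 5459
subword_transitions = {1: 2901533, 2: 2896074, 3: 2890615, 4: 2885156}
word_transitions = {1: 415086, 2: 409627, 3: 404168}

subword_steps = step_differences(subword_transitions)
word_steps = step_differences(word_transitions)

print(subword_steps)
print(word_steps)
print(all(step == total_documents for step in subword_steps + word_steps))
```

The same pattern holds for the old-side numbers with the previous document count of 5523, which is a quick sanity check that the metadata files were regenerated from the same corpus snapshot.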

models/word_markov/csb_markov_ctx4_word.parquet CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ba6380ce77c89d4260f28c12f1486e6c7dcf94d1aef3c758ae27ba08b8c73ea6
-size 8333568