omarkamali committed
Commit 9a7b63f · verified · 1 Parent(s): 725aee7

Upload all models and assets for bug (20251001)

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +301 -136
  2. models/embeddings/monolingual/bug_128d.bin +2 -2
  3. models/embeddings/monolingual/bug_128d_metadata.json +5 -3
  4. models/embeddings/monolingual/bug_32d.bin +2 -2
  5. models/embeddings/monolingual/bug_32d_metadata.json +5 -3
  6. models/embeddings/monolingual/bug_64d.bin +2 -2
  7. models/embeddings/monolingual/bug_64d_metadata.json +5 -3
  8. models/subword_markov/bug_markov_ctx1_subword.parquet +2 -2
  9. models/subword_markov/bug_markov_ctx1_subword_metadata.json +2 -2
  10. models/subword_markov/bug_markov_ctx2_subword.parquet +2 -2
  11. models/subword_markov/bug_markov_ctx2_subword_metadata.json +2 -2
  12. models/subword_markov/bug_markov_ctx3_subword.parquet +2 -2
  13. models/subword_markov/bug_markov_ctx3_subword_metadata.json +2 -2
  14. models/subword_markov/bug_markov_ctx4_subword.parquet +2 -2
  15. models/subword_markov/bug_markov_ctx4_subword_metadata.json +2 -2
  16. models/subword_ngram/bug_2gram_subword.parquet +2 -2
  17. models/subword_ngram/bug_2gram_subword_metadata.json +2 -2
  18. models/subword_ngram/bug_3gram_subword.parquet +2 -2
  19. models/subword_ngram/bug_3gram_subword_metadata.json +2 -2
  20. models/subword_ngram/bug_4gram_subword.parquet +2 -2
  21. models/subword_ngram/bug_4gram_subword_metadata.json +2 -2
  22. models/tokenizer/bug_tokenizer_16k.model +2 -2
  23. models/tokenizer/bug_tokenizer_16k.vocab +0 -0
  24. models/tokenizer/bug_tokenizer_32k.model +2 -2
  25. models/tokenizer/bug_tokenizer_32k.vocab +0 -0
  26. models/tokenizer/bug_tokenizer_8k.model +2 -2
  27. models/tokenizer/bug_tokenizer_8k.vocab +0 -0
  28. models/vocabulary/bug_vocabulary.parquet +2 -2
  29. models/vocabulary/bug_vocabulary_metadata.json +10 -9
  30. models/word_markov/bug_markov_ctx1_word.parquet +2 -2
  31. models/word_markov/bug_markov_ctx1_word_metadata.json +2 -2
  32. models/word_markov/bug_markov_ctx2_word.parquet +2 -2
  33. models/word_markov/bug_markov_ctx2_word_metadata.json +2 -2
  34. models/word_markov/bug_markov_ctx3_word.parquet +2 -2
  35. models/word_markov/bug_markov_ctx3_word_metadata.json +2 -2
  36. models/word_markov/bug_markov_ctx4_word.parquet +2 -2
  37. models/word_markov/bug_markov_ctx4_word_metadata.json +2 -2
  38. models/word_ngram/bug_2gram_word.parquet +2 -2
  39. models/word_ngram/bug_2gram_word_metadata.json +2 -2
  40. models/word_ngram/bug_3gram_word.parquet +2 -2
  41. models/word_ngram/bug_3gram_word_metadata.json +2 -2
  42. models/word_ngram/bug_4gram_word.parquet +2 -2
  43. models/word_ngram/bug_4gram_word_metadata.json +2 -2
  44. visualizations/embedding_isotropy.png +0 -0
  45. visualizations/embedding_norms.png +0 -0
  46. visualizations/embedding_similarity.png +2 -2
  47. visualizations/markov_branching.png +0 -0
  48. visualizations/markov_contexts.png +0 -0
  49. visualizations/markov_entropy.png +0 -0
  50. visualizations/model_sizes.png +0 -0
README.md CHANGED
@@ -23,14 +23,14 @@ dataset_info:
  metrics:
  - name: best_compression_ratio
  type: compression
- value: 3.778
  - name: best_isotropy
  type: isotropy
- value: 0.2564
  - name: vocabulary_size
  type: vocab
- value: 17585
- generated: 2025-12-28
  ---

  # BUG - Wikilangs Models
@@ -44,12 +44,13 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
- - N-gram models (2, 3, 4-gram)
- - Markov chains (context of 1, 2, 3 and 4)
  - Subword N-gram and Markov chains
- - Embeddings in various sizes and dimensions
  - Language Vocabulary
  - Language Statistics
  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation
@@ -59,7 +60,8 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
- - [6. Summary & Recommendations](#6-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)
@@ -68,59 +70,53 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
- | **8k** | 3.386x | 3.31 | 0.3368% | 52,547 |
- | **16k** | 3.508x | 3.43 | 0.3490% | 50,714 |
- | **32k** | 3.684x | 3.60 | 0.3665% | 48,290 |
- | **64k** | 3.778x 🏆 | 3.69 | 0.3758% | 47,094 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

- **Sample 1:** `Daours iyanaritu séuwa komun ri déparetema Somme ri Perancis.
-
- Ita to
- Komun r...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁da ours iyanarituséuwa ▁komun ▁ri ▁déparetema ▁sommeriperancis ... (+12 more)` | 22 |
- | 16k | `▁da ours iyanarituséuwa ▁komun ▁ri ▁déparetema ▁sommeriperancis ... (+12 more)` | 22 |
- | 32k | `▁daoursiyanarituséuwa ▁komun ▁ri ▁déparetema ▁sommeriperancis . ... (+11 more)` | 21 |
- | 64k | `▁daours ▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁somme ▁ri ▁perancis . ... (+11 more)` | 21 |
-
- **Sample 2:** `Damas-et-Bettegney iyanaritu séuwa komun ri déparetema Vosges ri Perancis.
-
- Ita...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁da mas - et - b ette gney iyanarituséuwa ... (+18 more)` | 28 |
- | 16k | `▁damas - et - bette gney iyanarituséuwakomunri ... (+16 more)` | 26 |
- | 32k | `▁damas - et - bettegney ▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ... (+15 more)` | 25 |
- | 64k | `▁damas - et - bettegney ▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ... (+15 more)` | 25 |

- **Sample 3:** `Ita to
- Komun ri déparetema Aisne
-
- Kategori:Komun ri Aisne`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
- | 8k | `▁itato ▁komun ▁ri ▁déparetema ▁aisnekategori : komunri ... (+1 more)` | 11 |
- | 16k | `▁itato ▁komun ▁ri ▁déparetema ▁aisnekategori : komunri ... (+1 more)` | 11 |
- | 32k | `▁itato ▁komun ▁ri ▁déparetema ▁aisnekategori : komunri ... (+1 more)` | 11 |
- | 64k | `▁ita ▁to ▁komun ▁ri ▁déparetema ▁aisne ▁kategori : komun ▁ri ... (+1 more)` | 11 |

  ### Key Findings

- - **Best Compression:** 64k achieves 3.778x compression
- - **Lowest UNK Rate:** 8k with 0.3368% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
@@ -129,57 +125,89 @@ Kategori:Komun ri Aisne`
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

- | N-gram | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
- |--------|------------|---------|----------------|------------------|-------------------|
- | **2-gram** | 158 🏆 | 7.31 | 3,164 | 74.9% | 95.6% |
- | **2-gram** | 243 🏆 | 7.93 | 2,007 | 71.2% | 99.3% |
- | **3-gram** | 256 | 8.00 | 6,284 | 66.5% | 92.3% |
- | **3-gram** | 849 | 9.73 | 13,719 | 54.2% | 82.2% |
- | **4-gram** | 498 | 8.96 | 15,843 | 55.1% | 87.5% |
- | **4-gram** | 1,815 | 10.83 | 58,717 | 50.9% | 71.2% |

  ### Top 5 N-grams by Size

- **2-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `komun ri` | 41,321 |
  | 2 | `ri déparetema` | 25,713 |
- | 3 | `kategori :` | 15,808 |
- | 4 | `: komun` | 15,504 |
- | 5 | `ita to` | 13,904 |

- **3-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
  | 1 | `komun ri déparetema` | 25,709 |
- | 2 | `kategori : komun` | 15,504 |
- | 3 | `: komun ri` | 15,485 |
  | 4 | `ita to komun` | 13,889 |
- | 5 | `to komun ri` | 13,889 |

- **4-grams:**

  | Rank | N-gram | Count |
  |------|--------|-------|
- | 1 | `kategori : komun ri` | 15,485 |
- | 2 | `ita to komun ri` | 13,889 |
- | 3 | `to komun ri déparetema` | 13,889 |
- | 4 | `. ita to komun` | 12,106 |
- | 5 | `perancis . ita to` | 12,102 |

  ### Key Findings

- - **Best Perplexity:** 2-gram with 158
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
- - **Coverage:** Top-1000 patterns cover ~71% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance

  ---
@@ -187,55 +215,86 @@
  ![Markov Entropy](visualizations/markov_entropy.png)

  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

- | Context | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
- |---------|-------------|------------|------------------|-----------------|----------------|
- | **1** | 0.3793 | 1.301 | 2.08 | 62,536 | 62.1% |
- | **1** | 0.4436 | 1.360 | 4.60 | 1,033 | 55.6% |
- | **2** | 0.0917 | 1.066 | 1.24 | 129,845 | 90.8% |
- | **2** | 0.9491 | 1.931 | 5.44 | 4,741 | 5.1% |
- | **3** | 0.0634 | 1.045 | 1.15 | 160,528 | 93.7% |
- | **3** | 0.9150 | 1.886 | 3.78 | 25,755 | 8.5% |
- | **4** | 0.0426 🏆 | 1.030 | 1.07 | 184,304 | 95.7% |
- | **4** | 0.6671 🏆 | 1.588 | 2.52 | 97,349 | 33.3% |

- ### Generated Text Samples

- Below are text samples generated from each Markov chain model:

  **Context Size 1:**

- 1. `ri déparetema yvelines kategori : komun ri perancis . ita to komun ri perancis . ita`
- 2. `- avit 16303 16300 challignac 16076 16350 le - d ' or ri déparetema gironde ri`
- 3. `komun ri picardy ri déparetema aube ri haute - d ' aurelle iyanaritu séuwa komun ri`

  **Context Size 2:**

- 1. `komun ri déparetema vosges ri perancis . ita to komun ri vienne kategori : komun ri déparetema`
- 2. `ri déparetema côtes - d ' auberoche 24285 24140 montagnac - montpezat iyanaritu séuwa komun ri aube`
- 3. `kategori : komun ri déparetema corrèze ri perancis . ita to komun ri déparetema yvelines kategori :`

  **Context Size 3:**

- 1. `komun ri déparetema côtes - d ' armor kategori : komun ri deux - sèvres kategori : komun`
- 2. `kategori : komun ri yvelines`
- 3. `: komun ri haute - garonne ri atang - launa perancis . ita to komun ri déparetema côte`

  **Context Size 4:**

- 1. `kategori : komun ri manche`
- 2. `to komun ri déparetema vosges kategori : komun ri haute - saône`
- 3. `ita to komun ri déparetema gironde kategori : komun ri gard`

  ### Key Findings

- - **Best Predictability:** Context-4 with 95.7% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
- - **Memory Trade-off:** Larger contexts require more storage (97,349 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation

  ---
@@ -251,64 +310,64 @@ Below are text samples generated from each Markov chain model:
  | Metric | Value |
  |--------|-------|
- | Vocabulary Size | 17,585 |
- | Total Tokens | 403,124 |
- | Mean Frequency | 22.92 |
  | Median Frequency | 2 |
- | Frequency Std Dev | 633.16 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | ri | 55,764 |
- | 2 | komun | 43,065 |
  | 3 | déparetema | 27,244 |
- | 4 | kategori | 15,808 |
- | 5 | to | 14,030 |
- | 6 | ita | 13,905 |
- | 7 | iyanaritu | 13,506 |
- | 8 | séuwa | 13,394 |
  | 9 | perancis | 12,636 |
- | 10 | haute | 6,362 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
- | 1 | museum | 2 |
- | 2 | tychy | 2 |
- | 3 | tangnga | 2 |
- | 4 | miniaturowej | 2 |
- | 5 | sztuki | 2 |
- | 6 | profesjonalnej | 2 |
- | 7 | wideo | 2 |
- | 8 | nietypowe | 2 |
- | 9 | sztalugi | 2 |
- | 10 | zapałek | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
- | Zipf Coefficient | 0.9525 |
- | R² (Goodness of Fit) | 0.969218 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
- | Top 100 | 76.2% |
- | Top 1,000 | 84.1% |
- | Top 5,000 | 92.6% |
- | Top 10,000 | 96.2% |

  ### Key Findings

- - **Zipf Compliance:** R²=0.9692 indicates excellent adherence to Zipf's law
- - **High Frequency Dominance:** Top 100 words cover 76.2% of corpus
- - **Long Tail:** 7,585 words needed for remaining 3.8% coverage

  ---
  ## 5. Word Embeddings Evaluation
@@ -321,24 +380,127 @@
  ![t-SNE Sentences](visualizations/tsne_sentences.png)

- ### Model Comparison

- | Model | Vocab Size | Dimension | Avg Norm | Std Norm | Isotropy |
- |-------|------------|-----------|----------|----------|----------|
- | **mono_32d** | 3,688 | 32 | 4.349 | 1.484 | 0.2564 🏆 |
- | **mono_64d** | 3,688 | 64 | 4.648 | 1.349 | 0.0819 |
- | **mono_128d** | 3,688 | 128 | 4.775 | 1.336 | 0.0113 |
- | **embeddings_enhanced** | 0 | 0 | 0.000 | 0.000 | 0.0000 |

  ### Key Findings

- - **Best Isotropy:** mono_32d with 0.2564 (more uniform distribution)
- - **Dimension Trade-off:** Higher dimensions capture more semantics but reduce isotropy
- - **Vocabulary Coverage:** All models cover 3,688 words
- - **Recommendation:** 100d for balanced semantic capture and efficiency

  ---
- ## 6. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)
@@ -346,11 +508,12 @@
  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
- | Tokenizer | **32k BPE** | Best compression (3.78x) with low UNK rate |
- | N-gram | **5-gram** | Lowest perplexity (158) |
- | Markov | **Context-4** | Highest predictability (95.7%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

  ---
  ## Appendix: Metrics Glossary & Interpretation Guide
@@ -540,7 +703,8 @@ If you use these models in your research, please cite:
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
- publisher = {HuggingFace},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
  }
@@ -556,7 +720,8 @@ MIT License - Free for academic and commercial use.
  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
  ---
  *Generated by Wikilangs Models Pipeline*

- *Report Date: 2025-12-28 09:19:46*
  metrics:
  - name: best_compression_ratio
  type: compression
+ value: 4.924
  - name: best_isotropy
  type: isotropy
+ value: 0.0631
  - name: vocabulary_size
  type: vocab
+ value: 0
+ generated: 2026-01-03
  ---

  # BUG - Wikilangs Models
 
  ### Models & Assets

  - Tokenizers (8k, 16k, 32k, 64k)
+ - N-gram models (2, 3, 4, 5-gram)
+ - Markov chains (context of 1, 2, 3, 4 and 5)
  - Subword N-gram and Markov chains
+ - Embeddings in various sizes and dimensions (aligned and unaligned)
  - Language Vocabulary
  - Language Statistics
+
  ![Performance Dashboard](visualizations/performance_dashboard.png)

  ### Analysis and Evaluation

  - [3. Markov Chain Evaluation](#3-markov-chain-evaluation)
  - [4. Vocabulary Analysis](#4-vocabulary-analysis)
  - [5. Word Embeddings Evaluation](#5-word-embeddings-evaluation)
+ - [6. Morphological Analysis (Experimental)](#6-morphological-analysis)
+ - [7. Summary & Recommendations](#7-summary--recommendations)
  - [Metrics Glossary](#appendix-metrics-glossary--interpretation-guide)
  - [Visualizations Index](#visualizations-index)
  ![Tokenizer Compression](visualizations/tokenizer_compression.png)

+ ![Tokenizer Fertility](visualizations/tokenizer_fertility.png)
+
+ ![Tokenizer OOV](visualizations/tokenizer_oov.png)
+
+ ![Total Tokens](visualizations/tokenizer_total_tokens.png)
+
  ### Results

  | Vocab Size | Compression | Avg Token Len | UNK Rate | Total Tokens |
  |------------|-------------|---------------|----------|--------------|
+ | **8k** | 4.284x | 4.31 | 0.4916% | 36,818 |
+ | **16k** | 4.514x | 4.54 | 0.5180% | 34,940 |
+ | **32k** | 4.924x 🏆 | 4.96 | 0.5650% | 32,035 |

  ### Tokenization Examples

  Below are sample sentences tokenized with each vocabulary size:

+ **Sample 1:** `Ita to Komun ri déparetema Allier Kategori:Komun ri Allier`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁itato ▁komun ▁ri ▁déparetema ▁allierkategori : komun ri ... (+1 more)` | 11 |
+ | 16k | `▁itato ▁komun ▁ri ▁déparetema ▁allierkategori : komun ri ... (+1 more)` | 11 |
+ | 32k | `▁itato ▁komun ▁ri ▁déparetema ▁allierkategori : komun ri ... (+1 more)` | 11 |

+ **Sample 2:** `iyanaritu séuwa komun ri déparetema Manche ri Perancis. Ita to Komun ri déparete...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁manche ▁riperancis . ita ... (+10 more)` | 20 |
+ | 16k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetemamancheriperancis . ita ... (+10 more)` | 20 |
+ | 32k | `▁iyanaritu ▁séuwa ▁komun ▁ri ▁déparetema ▁manche ▁ri ▁perancis . ▁ita ... (+10 more)` | 20 |

+ **Sample 3:** `iyanaritu séuwa komun ri déparetema Gard ri Perancis. Ita to Komun ri déparetema...`

  | Vocab | Tokens | Count |
  |-------|--------|-------|
+ | 8k | `▁iyanarituséuwa ▁komun ▁ri ▁déparetema ▁gardri ▁perancis .ita ... (+10 more)` | 20 |
+ | 16k | `▁iyanarituséuwa ▁komun ▁ri ▁déparetema ▁gardri ▁perancis .ita ... (+10 more)` | 20 |
+ | 32k | `▁iyanarituséuwa ▁komun ▁ri ▁déparetema ▁gardri ▁perancis .ita ... (+10 more)` | 20 |

  ### Key Findings

+ - **Best Compression:** 32k achieves 4.924x compression
+ - **Lowest UNK Rate:** 8k with 0.4916% unknown tokens
  - **Trade-off:** Larger vocabularies improve compression but increase model size
  - **Recommendation:** 32k vocabulary provides optimal balance for production use
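The tokenizers can be exercised directly. A minimal loading sketch, assuming the `.model`/`.vocab` pairs are SentencePiece artifacts; the `▁` word-boundary markers in the samples above suggest this, but the training toolkit is not stated in this commit:

```python
# Hedged sketch: assumes SentencePiece format for the released tokenizer files.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer/bug_tokenizer_32k.model")

text = "Ita to Komun ri déparetema Allier"   # sample sentence from the report
pieces = sp.encode(text, out_type=str)        # subword pieces, e.g. ['▁itato', ...]
ids = sp.encode(text, out_type=int)           # integer ids for model input

print(pieces)
# One common definition of the compression ratio: characters per token.
print(f"{len(ids)} tokens; compression = {len(text) / len(ids):.2f} chars/token")
print(sp.decode(ids))                         # decodes back to (normalized) text
```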
 
 
  ![N-gram Perplexity](visualizations/ngram_perplexity.png)

+ ![N-gram Unique](visualizations/ngram_unique.png)
+
  ![N-gram Coverage](visualizations/ngram_coverage.png)

  ### Results

+ | N-gram | Variant | Perplexity | Entropy | Unique N-grams | Top-100 Coverage | Top-1000 Coverage |
+ |--------|---------|------------|---------|----------------|------------------|-------------------|
+ | **2-gram** | Word | 75 🏆 | 6.23 | 1,721 | 84.8% | 98.5% |
+ | **2-gram** | Subword | 168 | 7.39 | 2,166 | 81.3% | 99.5% |
+ | **3-gram** | Word | 118 | 6.89 | 2,060 | 74.9% | 98.6% |
+ | **3-gram** | Subword | 512 | 9.00 | 10,883 | 62.7% | 89.5% |
+ | **4-gram** | Word | 228 | 7.84 | 4,992 | 61.5% | 96.5% |
+ | **4-gram** | Subword | 938 | 9.87 | 41,978 | 58.6% | 80.3% |

  ### Top 5 N-grams by Size

+ **2-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `komun ri` | 40,954 |
  | 2 | `ri déparetema` | 25,713 |
+ | 3 | `kategori komun` | 15,119 |
+ | 4 | `ita to` | 13,903 |
+ | 5 | `to komun` | 13,889 |

+ **3-grams (Word):**

  | Rank | N-gram | Count |
  |------|--------|-------|
  | 1 | `komun ri déparetema` | 25,709 |
+ | 2 | `kategori komun ri` | 15,118 |
+ | 3 | `to komun ri` | 13,889 |
  | 4 | `ita to komun` | 13,889 |
+ | 5 | `iyanaritu séuwa komun` | 13,324 |
+
+ **4-grams (Word):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `ita to komun ri` | 13,889 |
+ | 2 | `to komun ri déparetema` | 13,889 |
+ | 3 | `perancis ita to komun` | 12,095 |
+ | 4 | `iyanaritu séuwa komun ri` | 11,780 |
+ | 5 | `séuwa komun ri déparetema` | 11,779 |
+
+ **2-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `r i` | 90,073 |
+ | 2 | `a _` | 63,521 |
+ | 3 | `i _` | 58,103 |
+ | 4 | `_ r` | 57,562 |
+ | 5 | `t e` | 57,384 |

+ **3-grams (Subword):**

  | Rank | N-gram | Count |
  |------|--------|-------|
+ | 1 | `_ r i` | 56,241 |
+ | 2 | `r i _` | 55,682 |
+ | 3 | `m u n` | 43,032 |
+ | 4 | `u n _` | 42,982 |
+ | 5 | `k o m` | 42,818 |
+
+ **4-grams (Subword):**
+
+ | Rank | N-gram | Count |
+ |------|--------|-------|
+ | 1 | `_ r i _` | 55,380 |
+ | 2 | `o m u n` | 42,739 |
+ | 3 | `k o m u` | 42,738 |
+ | 4 | `m u n _` | 42,683 |
+ | 5 | `n _ r i` | 41,407 |

  ### Key Findings

+ - **Best Perplexity:** 2-gram (word) with 75
  - **Entropy Trend:** Decreases with larger n-grams (more predictable)
+ - **Coverage:** Top-1000 patterns cover ~80% of corpus
  - **Recommendation:** 4-gram or 5-gram for best predictive performance

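The coverage figures above can be recomputed from the released tables. A sketch, assuming the n-gram parquets hold one row per n-gram with columns named `ngram` and `count`; the actual schema is not documented in this commit:

```python
# Hedged sketch: column names "ngram" and "count" are assumptions.
import pandas as pd

df = pd.read_parquet("models/word_ngram/bug_2gram_word.parquet")
df = df.sort_values("count", ascending=False)

total = df["count"].sum()                       # total n-gram occurrences
top100 = df["count"].head(100).sum() / total
top1000 = df["count"].head(1000).sum() / total

print(f"unique n-grams: {len(df):,}")           # cf. 1,721 for the 2-gram word model
print(f"top-100 coverage:  {top100:.1%}")       # cf. 84.8%
print(f"top-1000 coverage: {top1000:.1%}")      # cf. 98.5%
```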
  ---
 
  ![Markov Entropy](visualizations/markov_entropy.png)

+ ![Markov Contexts](visualizations/markov_contexts.png)
+
  ![Markov Branching](visualizations/markov_branching.png)

  ### Results

+ | Context | Variant | Avg Entropy | Perplexity | Branching Factor | Unique Contexts | Predictability |
+ |---------|---------|-------------|------------|------------------|-----------------|----------------|
+ | **1** | Word | 0.5094 | 1.423 | 2.20 | 33,148 | 49.1% |
+ | **1** | Subword | 0.6420 | 1.560 | 6.04 | 1,115 | 35.8% |
+ | **2** | Word | 0.1229 | 1.089 | 1.21 | 72,816 | 87.7% |
+ | **2** | Subword | 0.6776 | 1.600 | 3.79 | 6,727 | 32.2% |
+ | **3** | Word | 0.0488 | 1.034 | 1.07 | 87,927 | 95.1% |
+ | **3** | Subword | 0.6911 | 1.614 | 3.05 | 25,458 | 30.9% |
+ | **4** | Word | 0.0143 🏆 | 1.010 | 1.02 | 93,631 | 98.6% |
+ | **4** | Subword | 0.5492 | 1.463 | 2.16 | 77,506 | 45.1% |

+ ### Generated Text Samples (Word-based)

+ Below are text samples generated from each word-based Markov chain model:

  **Context Size 1:**

+ 1. `ri déparetema gironde ri déparetema eure et loir kategori komun ri déparetema manche ri aisne katego...`
+ 2. `komun ri perancis ita to komun ri dordogne ri déparetema somme kategori kota ri déparetema haute`
+ 3. `déparetema gironde ri haute saône ri déparetema haute provence ri perancis ita to komun ri déparetem...`

  **Context Size 2:**

+ 1. `komun ri provinsi messina komun ri déparetema haute loire ri perancis ita to komun ri déparetema dor...`
+ 2. `ri déparetema manche ri perancis ita to komun ri déparetema dordogne ri perancis ita to komun ri`
+ 3. `kategori komun ri déparetema manche kategori komun ri déparetema ain kategori komun ri alpes de haut...`

  **Context Size 3:**

+ 1. `komun ri déparetema somme kategori komun ri eure et loir ri perancis ita to komun ri déparetema haut...`
+ 2. `kategori komun ri aisne`
+ 3. `to komun ri déparetema dordogne ri perancis ita to komun ri déparetema vosges ri perancis ita to kom...`

  **Context Size 4:**

+ 1. `ita to komun ri déparetema gironde ri perancis ita to komun ri déparetema haute saône ri perancis it...`
+ 2. `to komun ri déparetema vosges kategori komun ri vosges`
+ 3. `perancis ita to komun ri déparetema côtes d armor kategori komun ri côtes d armor`
+
+ ### Generated Text Samples (Subword-based)
+
+ Below are text samples generated from each subword-based Markov chain model:
+
+ **Context Size 1:**
+
+ 1. `_séunas._koiya_r`
+ 2. `ares._retoépesat`
+ 3. `riséunanetaŋn_pa`
+
+ **Context Size 2:**
+
+ 1. `ri_ube_katespia_f`
+ 2. `a_to_komun_ri:kom`
+ 3. `i_aretemay_(caven`
+
+ **Context Size 3:**
+
+ 1. `_ri_déparetema_vos`
+ 2. `ri_déparetema_eure`
+ 3. `mun_ri_perancis._i`
+
+ **Context Size 4:**
+
+ 1. `_ri_perancis._ita_t`
+ 2. `omun_ri_aisne_ri_pe`
+ 3. `komun_ri_déparetema`

  ### Key Findings

+ - **Best Predictability:** Context-4 (word) with 98.6% predictability
  - **Branching Factor:** Decreases with context size (more deterministic)
+ - **Memory Trade-off:** Larger contexts require more storage (77,506 contexts)
  - **Recommendation:** Context-3 or Context-4 for text generation
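A sketch of the kind of weighted random walk that produces samples like those above, assuming the Markov parquets store one row per transition with hypothetical columns `context`, `next`, and `count`:

```python
# Hedged sketch: the transition-table schema is an assumption.
import random
import pandas as pd

df = pd.read_parquet("models/word_markov/bug_markov_ctx2_word.parquet")

def generate(seed_words, steps=15):
    words = list(seed_words)
    for _ in range(steps):
        ctx = " ".join(words[-2:])                # last 2 words = context size 2
        rows = df[df["context"] == ctx]
        if rows.empty:                            # unseen context: stop early
            break
        nxt = random.choices(rows["next"].tolist(),
                             weights=rows["count"].tolist())[0]
        words.append(nxt)
    return " ".join(words)

print(generate(["komun", "ri"]))
```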
 
  ---

  | Metric | Value |
  |--------|-------|
+ | Vocabulary Size | 13,441 |
+ | Total Tokens | 358,235 |
+ | Mean Frequency | 26.65 |
  | Median Frequency | 2 |
+ | Frequency Std Dev | 719.11 |

  ### Most Common Words

  | Rank | Word | Frequency |
  |------|------|-----------|
+ | 1 | ri | 55,390 |
+ | 2 | komun | 42,680 |
  | 3 | déparetema | 27,244 |
+ | 4 | kategori | 15,401 |
+ | 5 | to | 14,028 |
+ | 6 | ita | 13,904 |
+ | 7 | iyanaritu | 13,503 |
+ | 8 | séuwa | 13,393 |
  | 9 | perancis | 12,636 |
+ | 10 | haute | 6,207 |

  ### Least Common Words (from vocabulary)

  | Rank | Word | Frequency |
  |------|------|-----------|
+ | 1 | ᨆᨘᨄᨗ | 2 |
+ | 2 | ᨕᨗᨊᨘᨂᨛᨊ | 2 |
+ | 3 | ᨒᨘ | 2 |
+ | 4 | ᨅᨀᨗ | 2 |
+ | 5 | ᨀᨀᨀᨀ | 2 |
+ | 6 | ᨉᨗᨛᨄᨗᨛ | 2 |
+ | 7 | days | 2 |
+ | 8 | after | 2 |
+ | 9 | federal | 2 |
+ | 10 | ᨔᨛᨀᨗᨈ | 2 |

  ### Zipf's Law Analysis

  | Metric | Value |
  |--------|-------|
+ | Zipf Coefficient | 0.9107 |
+ | R² (Goodness of Fit) | 0.956604 |
  | Adherence Quality | **excellent** |

  ### Coverage Analysis

  | Top N Words | Coverage |
  |-------------|----------|
+ | Top 100 | 83.0% |
+ | Top 1,000 | 89.7% |
+ | Top 5,000 | 95.1% |
+ | Top 10,000 | 98.1% |

  ### Key Findings

+ - **Zipf Compliance:** R²=0.9566 indicates excellent adherence to Zipf's law
+ - **High Frequency Dominance:** Top 100 words cover 83.0% of corpus
+ - **Long Tail:** 3,441 words needed for remaining 1.9% coverage
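The Zipf row above can be reproduced with the standard log-log least-squares fit of frequency against rank; the pipeline's exact fitting procedure is not documented here, and the `frequency` column name is an assumption:

```python
# Hedged sketch: fit log(frequency) = -a * log(rank) + b and report a and R².
import numpy as np
import pandas as pd

vocab = pd.read_parquet("models/vocabulary/bug_vocabulary.parquet")
freqs = np.sort(vocab["frequency"].to_numpy())[::-1]   # descending frequencies

ranks = np.arange(1, len(freqs) + 1)
x, y = np.log(ranks), np.log(freqs)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"Zipf coefficient: {-slope:.4f}")   # cf. 0.9107
print(f"R²: {r2:.6f}")                     # cf. 0.956604
```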
 
  ---
  ## 5. Word Embeddings Evaluation

  ![t-SNE Sentences](visualizations/tsne_sentences.png)

+ ### 5.1 Cross-Lingual Alignment
+
+ > *Note: Multilingual alignment visualization not available for this language.*
+
+ ### 5.2 Model Comparison
+
+ | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
+ |-------|-----------|----------|------------------|---------------|----------------|
+ | **mono_32d** | 32 | 0.0631 🏆 | 0.6817 | N/A | N/A |
+ | **mono_64d** | 64 | 0.0264 | 0.6525 | N/A | N/A |
+ | **mono_128d** | 128 | 0.0035 | 0.7173 | N/A | N/A |

  ### Key Findings

+ - **Best Isotropy:** mono_32d with 0.0631 (more uniform distribution)
+ - **Semantic Density:** Average pairwise similarity of 0.6838. Lower values indicate better semantic separation.
+ - **Alignment Quality:** No aligned models evaluated in this run.
+ - **Recommendation:** 128d aligned for best cross-lingual performance
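The report does not define its isotropy or semantic-density metrics. A sketch of two common choices, using a random stand-in matrix since the `.bin` serialization format is not documented in this commit:

```python
# Hedged sketch: isotropy proxy = ratio of smallest to largest singular value
# of the mean-centered embedding matrix; semantic density = mean off-diagonal
# pairwise cosine similarity. E is a stand-in for the real embeddings.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1443, 32))              # (vocab_size, dim) stand-in

E_centered = E - E.mean(axis=0)
s = np.linalg.svd(E_centered, compute_uv=False)
print(f"isotropy proxy (s_min/s_max): {s[-1] / s[0]:.4f}")

E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
cos = E_unit @ E_unit.T
n = len(cos)
density = (cos.sum() - n) / (n * (n - 1))    # exclude the diagonal of ones
print(f"semantic density: {density:.4f}")
```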
 
  ---
+ ## 6. Morphological Analysis (Experimental)
+
+ > ⚠️ **Warning:** This language shows low morphological productivity. The statistical signals used for this analysis may be noisy or less reliable than for morphologically rich languages.
+
+ This section presents an automated morphological analysis derived from the statistical divergence between word-level and subword-level models. By analyzing where subword predictability spikes and where word-level coverage fails, we can infer linguistic structures without supervised data.
+
+ ### 6.1 Productivity & Complexity
+
+ | Metric | Value | Interpretation | Recommendation |
+ |--------|-------|----------------|----------------|
+ | Productivity Index | **0.000** | Low morphological productivity | ⚠️ Likely unreliable |
+ | Idiomaticity Gap | **-1.000** | Low formulaic content | - |
+
+ ### 6.2 Affix Inventory (Productive Units)
+
+ These are the most productive prefixes and suffixes identified by sampling the vocabulary for global substitutability patterns. A unit is considered an affix if stripping it leaves a valid stem that appears in other contexts; a sketch of this test follows the tables below.
+
+ #### Productive Prefixes
+ | Prefix | Examples |
+ |--------|----------|
+ | `ma-` | marainville, mailhac, mauges |
+ | `mo-` | molliens, montmotier, morin |
+ | `ch-` | châtel, chèze, chauffourt |
+ | `co-` | confort, coulombiers, coux |
+ | `la-` | lacalm, lasse, lacour |
+
+ #### Productive Suffixes
+ | Suffix | Examples |
+ |--------|----------|
+ | `-s` | pozières, serbonnes, molliens |
+ | `-e` | givonne, marainville, roville |
+ | `-es` | pozières, serbonnes, mauges |
+ | `-rt` | confort, chauffourt, saucourt |
+ | `-le` | marainville, roville, touille |
+ | `-urt` | chauffourt, saucourt, pignicourt |
+ | `-ourt` | chauffourt, saucourt, pignicourt |
+ | `-lle` | marainville, roville, touille |
+ ### 6.3 Bound Stems (Lexical Roots)
444
+
445
+ Bound stems are high-frequency subword units that are semantically cohesive but rarely appear as standalone words. These often correspond to the 'core' of a word that requires inflection or derivation to be valid.
446
+
447
+ | Stem | Cohesion | Substitutability | Examples |
448
+ |------|----------|------------------|----------|
449
+ | `ngka` | 1.37x | 20 contexts | éngka, angka, engka |
450
+ | `appa` | 1.39x | 15 contexts | lappa, nappa, tappa |
451
+ | `engk` | 1.41x | 9 contexts | engka, engkai, engkaé |
452
+ | `seng` | 1.34x | 10 contexts | aseng, naseng, siseng |
453
+ | `asen` | 1.35x | 8 contexts | aseng, naseng, asenna |
454
+ | `unna` | 1.32x | 6 contexts | punna, umunna, punnai |
455
+ | `enna` | 1.37x | 5 contexts | asenna, lalenna, sisenna |
456
+ | `yana` | 1.30x | 5 contexts | iyana, iyanae, iyanaé |
457
+
458
+ ### 6.4 Affix Compatibility (Co-occurrence)
459
+
460
+ This table shows which prefixes and suffixes most frequently co-occur on the same stems, revealing the 'stacking' rules of the language's morphology.
461
+
462
+ | Prefix | Suffix | Frequency | Examples |
463
+ |--------|--------|-----------|----------|
464
+ | `-la` | `-e` | 76 words | lagardelle, lange |
465
+ | `-ch` | `-s` | 63 words | charnois, chassagnes |
466
+ | `-ma` | `-e` | 54 words | marville, maddare |
467
+ | `-co` | `-s` | 53 words | collonges, corps |
468
+ | `-mo` | `-s` | 46 words | moffans, moulédous |
469
+ | `-ch` | `-es` | 44 words | chassagnes, chaumes |
470
+ | `-ma` | `-s` | 43 words | mazères, martigues |
471
+ | `-ch` | `-e` | 41 words | challerange, champclause |
472
+ | `-co` | `-e` | 35 words | corbière, conie |
473
+ | `-co` | `-es` | 28 words | collonges, coyolles |
474
+
475
+ ### 6.5 Recursive Morpheme Segmentation
476
+
477
+ Using **Recursive Hierarchical Substitutability**, we decompose complex words into their constituent morphemes. This approach handles nested affixes (e.g., `prefix-prefix-root-suffix`).
478
+
479
+ | Word | Suggested Split | Confidence | Stem |
480
+ |------|-----------------|------------|------|
481
+ | lagardelle | **`la-garde-lle`** | 6.0 | `garde` |
482
+ | lavilledieu | **`la-villedieu`** | 4.5 | `villedieu` |
483
+ | laboissière | **`la-boissière`** | 4.5 | `boissière` |
484
+ | malaincourt | **`ma-la-inco-urt`** | 4.5 | `inco` |
485
+ | colleville | **`co-llev-ille`** | 3.0 | `llev` |
486
+ | maizières | **`ma-izièr-es`** | 3.0 | `izièr` |
487
+ | champenoises | **`ch-ampenois-es`** | 3.0 | `ampenois` |
488
+ | chavignon | **`ch-avign-on`** | 3.0 | `avign` |
489
+ | montescourt | **`mo-ntesc-ourt`** | 3.0 | `ntesc` |
490
+ | châteauredon | **`ch-âteaured-on`** | 3.0 | `âteaured` |
491
+ | chevannes | **`ch-evann-es`** | 3.0 | `evann` |
492
+ | lamasquère | **`la-ma-squère`** | 3.0 | `squère` |
493
+ | landaville | **`la-ndav-ille`** | 3.0 | `ndav` |
494
+ | mourvilles | **`mo-urvill-es`** | 3.0 | `urvill` |
495
+ | malvières | **`ma-lvièr-es`** | 3.0 | `lvièr` |
496
+
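A sketch of the recursive peeling idea behind splits like `la-garde-lle`, with a stand-in stem set; the report's confidence scoring is not documented here:

```python
# Hedged sketch: strip a known affix, recurse on the remainder, and accept a
# split only if it bottoms out in a validated stem. STEMS is a stand-in for
# the bound-stem inventory the pipeline would produce.
PREFIXES = ("la", "ma", "mo", "ch", "co")
SUFFIXES = ("ourt", "lle", "urt", "le", "es", "e", "s")
STEMS = {"garde", "inco", "villedieu", "boissière"}

def segment(word):
    """Return a morpheme list, or None if word can't be reduced to a stem."""
    if word in STEMS:
        return [word]
    for p in PREFIXES:
        if word.startswith(p):
            rest = segment(word[len(p):])
            if rest:
                return [p] + rest
    for s in SUFFIXES:
        if word.endswith(s):
            rest = segment(word[:-len(s)])
            if rest:
                return rest + [s]
    return None

print("-".join(segment("lagardelle")))    # la-garde-lle
print("-".join(segment("malaincourt")))   # ma-la-inco-urt
```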
+ ### 6.6 Linguistic Interpretation
+
+ > **Automated Insight:**
+ The language BUG appears to be more isolating or to have a highly fixed vocabulary. Word-level models perform nearly as well as subword models, indicating fewer productive morphological processes.
+
+ ---
+ ## 7. Summary & Recommendations

  ![Performance Dashboard](visualizations/performance_dashboard.png)

  | Component | Recommended | Rationale |
  |-----------|-------------|-----------|
+ | Tokenizer | **32k BPE** | Best compression (4.92x) |
+ | N-gram | **2-gram** | Lowest perplexity (75) |
+ | Markov | **Context-4** | Highest predictability (98.6%) |
  | Embeddings | **100d** | Balanced semantic capture and isotropy |

  ---
  ## Appendix: Metrics Glossary & Interpretation Guide

  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
+ doi = {10.5281/zenodo.18073153},
+ publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
  }

  - 🤗 Models: [huggingface.co/wikilangs](https://huggingface.co/wikilangs)
  - 📊 Data: [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
  - 👤 Author: [Omar Kamali](https://huggingface.co/omarkamali)
+ - 🤝 Sponsor: [Featherless AI](https://featherless.ai)
  ---
  *Generated by Wikilangs Models Pipeline*

+ *Report Date: 2026-01-03 08:55:12*
models/embeddings/monolingual/bug_128d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:5c1267bee5f2f288debdc605156a357bc5a3197216cd278eab56281cf3a8fe9b
- size 1027835758
  version https://git-lfs.github.com/spec/v1
+ oid sha256:60e643ec1ece75f273fcafba6e20d15dd273d9a060acc636c62b079cc3827632
+ size 1025503150
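Every binary in this commit is stored as a Git LFS pointer like the pair above: three `key value` lines. A sketch of parsing one to verify the advertised hash and size before fetching the payload:

```python
# Parse a Git LFS pointer file (format shown verbatim in the diffs here).
def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")   # split on the first space
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:60e643ec1ece75f273fcafba6e20d15dd273d9a060acc636c62b079cc3827632
size 1025503150"""

info = parse_lfs_pointer(pointer)
assert info["version"] == "https://git-lfs.github.com/spec/v1"
print(info["oid"])                  # sha256:60e643ec...
print(int(info["size"]) / 1e9)      # ~1.03 GB payload
```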
models/embeddings/monolingual/bug_128d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
- "dim": 128,
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
  },
- "vocab_size": 3688
  }
  "dimension": 128,
  "version": "monolingual",
  "training_params": {
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 128
  },
+ "vocab_size": 1443
  }
models/embeddings/monolingual/bug_32d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:aa7b81453bf1f3d813574a5fe4023cb26d6e3c3ad14e92b1861050d25b8000c2
- size 257003374
  version https://git-lfs.github.com/spec/v1
+ oid sha256:becf192777316400ac67a33873637aadaeabcd3ab768294b2064c3c0615f144b
+ size 256394926
models/embeddings/monolingual/bug_32d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
- "dim": 32,
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
  },
- "vocab_size": 3688
  }
  "dimension": 32,
  "version": "monolingual",
  "training_params": {
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 32
  },
+ "vocab_size": 1443
  }
models/embeddings/monolingual/bug_64d.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:58f8917956511c2d09a9a89fd35ff79f242caa683877d4d26cfd195ec0bba38f
- size 513947502
  version https://git-lfs.github.com/spec/v1
+ oid sha256:d9823040ce7cd355f045c8caad10719684ec4e18bf1bef4ea891663c355cb360
+ size 512764334
models/embeddings/monolingual/bug_64d_metadata.json CHANGED
@@ -3,11 +3,13 @@
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
- "dim": 64,
  "min_count": 5,
  "window": 5,
  "negative": 5,
- "epochs": 5
  },
- "vocab_size": 3688
  }
  "dimension": 64,
  "version": "monolingual",
  "training_params": {
+ "algorithm": "skipgram",
  "min_count": 5,
  "window": 5,
  "negative": 5,
+ "epochs": 5,
+ "encoding_method": "rope",
+ "dim": 64
  },
+ "vocab_size": 1443
  }
models/subword_markov/bug_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:efb62617feeefe41017648487e5f9fe4d79355c5c73d92f46456aa8b02bada56
- size 46959
  version https://git-lfs.github.com/spec/v1
+ oid sha256:ea6b54847a784f51e2f16c837e7951d96c474085d40f16a957be9eecbdceb2d7
+ size 56451
models/subword_markov/bug_markov_ctx1_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "subword",
  "language": "bug",
- "unique_contexts": 1033,
- "total_transitions": 2853689
  }
  "context_size": 1,
  "variant": "subword",
  "language": "bug",
+ "unique_contexts": 1115,
+ "total_transitions": 2377233
  }
models/subword_markov/bug_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bcd0c5ec057106f9379be647aa6557c9e1675aebb66eee2a9ac0b1bcba631a36
- size 217154
  version https://git-lfs.github.com/spec/v1
+ oid sha256:fc9770904232ce07f4aa4fa00d4f31741086ce42dde9337c557a94fcd91601d3
+ size 229577
models/subword_markov/bug_markov_ctx2_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "subword",
  "language": "bug",
- "unique_contexts": 4741,
- "total_transitions": 2837729
  }
  "context_size": 2,
  "variant": "subword",
  "language": "bug",
+ "unique_contexts": 6727,
+ "total_transitions": 2361714
  }
models/subword_markov/bug_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bb54eaef1652ce2ff8b0d6a660098b73b778cfdb536140d8cd83af9a4e6b8107
- size 718331
  version https://git-lfs.github.com/spec/v1
+ oid sha256:1851961810e4cd0121317f3bf6e74a5b2ad2d767fbf3ea615e494b1b33a0d9ff
+ size 651956
models/subword_markov/bug_markov_ctx3_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "subword",
  "language": "bug",
- "unique_contexts": 25755,
- "total_transitions": 2821769
  }
  "context_size": 3,
  "variant": "subword",
  "language": "bug",
+ "unique_contexts": 25458,
+ "total_transitions": 2346195
  }
models/subword_markov/bug_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a9990bb88ad20134384c84fd8445b8536cc8c81506abfc5c27b5d17e95fb8387
- size 1789921
  version https://git-lfs.github.com/spec/v1
+ oid sha256:1991e713220325ecadbfc81c2871898ab38be9aff994ef7d45517d9081e2d843
+ size 1459206
models/subword_markov/bug_markov_ctx4_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "subword",
  "language": "bug",
- "unique_contexts": 97349,
- "total_transitions": 2805809
  }
  "context_size": 4,
  "variant": "subword",
  "language": "bug",
+ "unique_contexts": 77506,
+ "total_transitions": 2330676
  }
models/subword_ngram/bug_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f7586bbc8270b23a1ffe3db4c89019f9ee1f1ec2e498a8622b998eb0cc08a5e3
- size 26754
  version https://git-lfs.github.com/spec/v1
+ oid sha256:bc0ffd67460f75034bcef5b208e6dcebf5026c731829639ee1c9b37cd94f0255
+ size 29845
models/subword_ngram/bug_2gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "subword",
  "language": "bug",
- "unique_ngrams": 2007,
- "total_ngrams": 2853689
  }
  "n": 2,
  "variant": "subword",
  "language": "bug",
+ "unique_ngrams": 2166,
+ "total_ngrams": 2377233
  }
models/subword_ngram/bug_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b99d42967059aa95572bee03832f6650d5606c60c552a3e23311a8b39c94977d
- size 158807
  version https://git-lfs.github.com/spec/v1
+ oid sha256:27dae62a90b04631ca45a23226c4fb6e3e5676dd3b56fda2064206ccd05cc271
+ size 133308
models/subword_ngram/bug_3gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "subword",
  "language": "bug",
- "unique_ngrams": 13719,
- "total_ngrams": 2837729
  }
  "n": 3,
  "variant": "subword",
  "language": "bug",
+ "unique_ngrams": 10883,
+ "total_ngrams": 2361714
  }
models/subword_ngram/bug_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ddc7c7c5395af68b1892528d9f2a92cb2bde6a7ee44f3a74fadf68caaaeee220
- size 657286
  version https://git-lfs.github.com/spec/v1
+ oid sha256:4d7a75327568e461ca60da5b19c31846f906735013d0944cb9687eb7e175681a
+ size 522479
models/subword_ngram/bug_4gram_subword_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "subword",
  "language": "bug",
- "unique_ngrams": 58717,
- "total_ngrams": 2821769
  }
  "n": 4,
  "variant": "subword",
  "language": "bug",
+ "unique_ngrams": 41978,
+ "total_ngrams": 2346195
  }
models/tokenizer/bug_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:847ae75bfd3f9bcc7cb922fcb9261145d659b67bfe89baf0d28fb3681260e212
- size 517198
  version https://git-lfs.github.com/spec/v1
+ oid sha256:f695d915517db069983efcf78f1827a707c9d7e5df1fe96fe7b390346018196e
+ size 518421
models/tokenizer/bug_tokenizer_16k.vocab CHANGED
The diff for this file is too large to render.
models/tokenizer/bug_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:24d02d78f6866b6998b3d1a0d53471e427eae50724834601ed703b9943caa674
- size 825657
  version https://git-lfs.github.com/spec/v1
+ oid sha256:41e741a9c77783f82b6eef2bd92933e422989f91b37dafb131fcbf96a6c44a90
+ size 821175
models/tokenizer/bug_tokenizer_32k.vocab CHANGED
The diff for this file is too large to render.
models/tokenizer/bug_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:725ff7d336e24ead292570325d0507e0807f84e914a9d6bd4d040ecaa2fcb7ef
- size 373322
  version https://git-lfs.github.com/spec/v1
+ oid sha256:ae5e6410beed7f97e2ba0a2f82bd06f81b54b111021264a322f628b6ebd7930f
+ size 372086
models/tokenizer/bug_tokenizer_8k.vocab CHANGED
The diff for this file is too large to render.
models/vocabulary/bug_vocabulary.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3a847582583e67e873e810037422c92790bcfba98a621670444ded809ac1218e
- size 270448
  version https://git-lfs.github.com/spec/v1
+ oid sha256:eaf37b2a8977cb6828c5e00937c105d6dd1b5753eb506d9a9c3d3715c43116ba
+ size 212963
models/vocabulary/bug_vocabulary_metadata.json CHANGED
@@ -1,16 +1,17 @@
  {
  "language": "bug",
- "vocabulary_size": 17585,
  "statistics": {
- "type_token_ratio": 0.13920667666916728,
  "coverage": {
- "top_100": 0.6859772978959026,
- "top_1000": 0.7572250205408495,
- "top_5000": 0.8330520130032508,
- "top_10000": 0.866185296324081
  },
- "hapax_count": 44764,
- "hapax_ratio": 0.7179585879484836,
- "total_documents": 15960
  }
  }
  {
  "language": "bug",
+ "vocabulary_size": 13441,
+ "variant": "full",
  "statistics": {
+ "type_token_ratio": 0.08785621316176548,
  "coverage": {
+ "top_100": 0.7870075448937048,
+ "top_1000": 0.8499830689622332,
+ "top_5000": 0.9016359615242167,
+ "top_10000": 0.9294954550745494
  },
+ "hapax_count": 19769,
+ "hapax_ratio": 0.5952725082806384,
+ "total_documents": 15519
  }
  }
models/word_markov/bug_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:520170615fad8f2979246c99345d11d4477a9ad8cc1d9c4f664f2161ccc9d05e
- size 1388757
  version https://git-lfs.github.com/spec/v1
+ oid sha256:11fe5e192c8147c9659e92fea9d4a1deda14c8f7dd8f48d353ccca9695ddc8e5
+ size 879712
models/word_markov/bug_markov_ctx1_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 1,
  "variant": "word",
  "language": "bug",
- "unique_contexts": 62536,
- "total_transitions": 540662
  }
  "context_size": 1,
  "variant": "word",
  "language": "bug",
+ "unique_contexts": 33148,
+ "total_transitions": 362485
  }
models/word_markov/bug_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f0888437ec604e6fcc3683f91808c61b7c8a78deb835da05e814bf0c3d3863cc
- size 2150665
  version https://git-lfs.github.com/spec/v1
+ oid sha256:6e286442d9daf9a0a40fbc601c40874bbd94485367efec9e0aa5fa8ada829a45
+ size 1390138
models/word_markov/bug_markov_ctx2_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 2,
  "variant": "word",
  "language": "bug",
- "unique_contexts": 129845,
- "total_transitions": 524703
  }
  "context_size": 2,
  "variant": "word",
  "language": "bug",
+ "unique_contexts": 72816,
+ "total_transitions": 346966
  }
models/word_markov/bug_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:66c6a1a50310828591a7fc9fdd598d094ba0e858b0553148f07738337cde224d
- size 2740626
  version https://git-lfs.github.com/spec/v1
+ oid sha256:c638220df23f4ff0bdbe9ae2d3083f2ce355e1c83d1a468b2bea91d89552f40e
+ size 1674825
models/word_markov/bug_markov_ctx3_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 3,
  "variant": "word",
  "language": "bug",
- "unique_contexts": 160528,
- "total_transitions": 508746
  }
  "context_size": 3,
  "variant": "word",
  "language": "bug",
+ "unique_contexts": 87927,
+ "total_transitions": 331447
  }
models/word_markov/bug_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c82446b2cb3044b779dff3dba7e325d7f0a8b7d7932315234e004bf3ee31dc47
- size 3223859
  version https://git-lfs.github.com/spec/v1
+ oid sha256:a0d9d69bcf81c502d74df001df57fd6eab15283698366763ee71dcd5e51ce1e9
+ size 1867230
models/word_markov/bug_markov_ctx4_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "context_size": 4,
  "variant": "word",
  "language": "bug",
- "unique_contexts": 184304,
- "total_transitions": 492791
  }
  "context_size": 4,
  "variant": "word",
  "language": "bug",
+ "unique_contexts": 93631,
+ "total_transitions": 315928
  }
models/word_ngram/bug_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e9cc2c3dafc07ee144969a574233d8bb31f769714ceee59b0a1ce33c8c51fe2b
- size 43228
  version https://git-lfs.github.com/spec/v1
+ oid sha256:4806b76280f953c69f3b47fae8e19150bc4d06b3e180825767704420821b1734
+ size 27390
models/word_ngram/bug_2gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 2,
  "variant": "word",
  "language": "bug",
- "unique_ngrams": 3164,
- "total_ngrams": 540662
  }
  "n": 2,
  "variant": "word",
  "language": "bug",
+ "unique_ngrams": 1721,
+ "total_ngrams": 362485
  }
models/word_ngram/bug_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1cf88a34b151a8907cda701492d96b6f3b8e7c248e90657973d941d81f6ff033
- size 87736
  version https://git-lfs.github.com/spec/v1
+ oid sha256:12c41464b2fe3b48442ab57eb81c6e3a89b357a0127c417369670f4fdb38b49d
+ size 35727
models/word_ngram/bug_3gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 3,
  "variant": "word",
  "language": "bug",
- "unique_ngrams": 6284,
- "total_ngrams": 524703
  }
  "n": 3,
  "variant": "word",
  "language": "bug",
+ "unique_ngrams": 2060,
+ "total_ngrams": 346966
  }
models/word_ngram/bug_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:dfb86ee029695e8be824706e44fb53d6b098fa0136f67e06ff8605d8ab133075
- size 236405
  version https://git-lfs.github.com/spec/v1
+ oid sha256:895c9e5bd7da4197c713d3ef826d4aed1e41933d7f3600e864f72cc89e2457ec
+ size 87627
models/word_ngram/bug_4gram_word_metadata.json CHANGED
@@ -2,6 +2,6 @@
  "n": 4,
  "variant": "word",
  "language": "bug",
- "unique_ngrams": 15843,
- "total_ngrams": 508746
  }
  "n": 4,
  "variant": "word",
  "language": "bug",
+ "unique_ngrams": 4992,
+ "total_ngrams": 331447
  }
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details (old)
  • SHA256: 878b69379f85bad476abe3ccaa355d79e729b2a35e423c0d0574502ea2ee99b5
  • Pointer size: 131 Bytes
  • Size of remote file: 157 kB

Git LFS Details (new)
  • SHA256: 09d2cc23fc1ce7296de33f52b0c174d2df278de0070be14066a84140b25b68ca
  • Pointer size: 131 Bytes
  • Size of remote file: 179 kB

visualizations/markov_branching.png CHANGED
visualizations/markov_contexts.png CHANGED
visualizations/markov_entropy.png CHANGED
visualizations/model_sizes.png CHANGED