Upload all models and assets for anp (latest)
- README.md +64 -64
- models/embeddings/aligned/anp_128d.bin +1 -1
- models/embeddings/aligned/anp_128d.projection.npy +1 -1
- models/embeddings/aligned/anp_32d.bin +1 -1
- models/embeddings/aligned/anp_32d.projection.npy +1 -1
- models/embeddings/aligned/anp_64d.bin +1 -1
- models/embeddings/aligned/anp_64d.projection.npy +1 -1
- models/embeddings/monolingual/anp_128d.bin +1 -1
- models/embeddings/monolingual/anp_32d.bin +1 -1
- models/embeddings/monolingual/anp_64d.bin +1 -1
- models/subword_markov/anp_markov_ctx1_subword.parquet +2 -2
- models/subword_markov/anp_markov_ctx2_subword.parquet +2 -2
- models/subword_markov/anp_markov_ctx3_subword.parquet +2 -2
- models/subword_markov/anp_markov_ctx4_subword.parquet +2 -2
- models/subword_ngram/anp_2gram_subword.parquet +2 -2
- models/subword_ngram/anp_3gram_subword.parquet +2 -2
- models/subword_ngram/anp_4gram_subword.parquet +2 -2
- models/subword_ngram/anp_5gram_subword.parquet +2 -2
- models/tokenizer/anp_tokenizer_16k.model +1 -1
- models/tokenizer/anp_tokenizer_32k.model +1 -1
- models/tokenizer/anp_tokenizer_8k.model +1 -1
- models/word_markov/anp_markov_ctx1_word.parquet +2 -2
- models/word_markov/anp_markov_ctx2_word.parquet +2 -2
- models/word_markov/anp_markov_ctx3_word.parquet +2 -2
- models/word_markov/anp_markov_ctx4_word.parquet +2 -2
- models/word_ngram/anp_2gram_word.parquet +2 -2
- models/word_ngram/anp_3gram_word.parquet +2 -2
- models/word_ngram/anp_4gram_word.parquet +2 -2
- models/word_ngram/anp_5gram_word.parquet +2 -2
- visualizations/embedding_alignment_quality.png +0 -0
- visualizations/embedding_isotropy.png +0 -0
- visualizations/embedding_norms.png +0 -0
- visualizations/embedding_similarity.png +2 -2
- visualizations/embedding_tsne_multilingual.png +2 -2
- visualizations/model_sizes.png +0 -0
- visualizations/ngram_perplexity.png +0 -0
- visualizations/performance_dashboard.png +2 -2
- visualizations/position_encoding_comparison.png +2 -2
- visualizations/tsne_sentences.png +2 -2
- visualizations/tsne_words.png +2 -2
README.md
CHANGED
|
@@ -36,7 +36,7 @@ metrics:
|
|
| 36 |
value: 3.777
|
| 37 |
- name: best_isotropy
|
| 38 |
type: isotropy
|
| 39 |
-
value: 0.
|
| 40 |
- name: vocabulary_size
|
| 41 |
type: vocab
|
| 42 |
value: 0
|
|
@@ -98,29 +98,29 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
|
|
| 98 |
|
| 99 |
Below are sample sentences tokenized with each vocabulary size:
|
| 100 |
|
| 101 |
-
**Sample 1:**
|
| 102 |
|
| 103 |
| Vocab | Tokens | Count |
|
| 104 |
|-------|--------|-------|
|
| 105 |
-
| 8k |
|
| 106 |
-
| 16k |
|
| 107 |
-
| 32k |
|
| 108 |
|
| 109 |
-
**Sample 2:**
|
| 110 |
|
| 111 |
| Vocab | Tokens | Count |
|
| 112 |
|-------|--------|-------|
|
| 113 |
-
| 8k |
|
| 114 |
-
| 16k |
|
| 115 |
-
| 32k |
|
| 116 |
|
| 117 |
-
**Sample 3:**
|
| 118 |
|
| 119 |
| Vocab | Tokens | Count |
|
| 120 |
|-------|--------|-------|
|
| 121 |
-
| 8k |
|
| 122 |
-
| 16k |
|
| 123 |
-
| 32k |
|
| 124 |
|
| 125 |
|
| 126 |
### Key Findings
|
|
@@ -270,27 +270,27 @@ Below are text samples generated from each word-based Markov chain model:
|
|
| 270 |
|
| 271 |
**Context Size 1:**
|
| 272 |
|
| 273 |
-
1. `के
|
| 274 |
-
2. `में
|
| 275 |
-
3. `छै
|
| 276 |
|
| 277 |
**Context Size 2:**
|
| 278 |
|
| 279 |
-
1. `के लिए
|
| 280 |
-
2. `के अनुसार
|
| 281 |
-
3. `छै जे
|
| 282 |
|
| 283 |
**Context Size 3:**
|
| 284 |
|
| 285 |
-
1. `छै जेकरा म
|
| 286 |
-
2. `जनगणना के अनुसार
|
| 287 |
-
3. `के रूप में
|
| 288 |
|
| 289 |
**Context Size 4:**
|
| 290 |
|
| 291 |
-
1. `छै जेकरा म कुल
|
| 292 |
-
2. `के औसत लिंग अनुपात
|
| 293 |
-
3. `छै जनगणना के अनुसार
|
| 294 |
|
| 295 |
|
| 296 |
### Generated Text Samples (Subword-based)
|
|
@@ -299,27 +299,27 @@ Below are text samples generated from each subword-based Markov chain model:
|
|
| 299 |
|
| 300 |
**Context Size 1:**
|
| 301 |
|
| 302 |
-
1. `_
|
| 303 |
-
2.
|
| 304 |
-
3.
|
| 305 |
|
| 306 |
**Context Size 2:**
|
| 307 |
|
| 308 |
-
1. `र_
|
| 309 |
-
2. `_के_
|
| 310 |
-
3. `के_
|
| 311 |
|
| 312 |
**Context Size 3:**
|
| 313 |
|
| 314 |
-
1. `_के_
|
| 315 |
-
2. `_में_
|
| 316 |
-
3. `_की_
|
| 317 |
|
| 318 |
**Context Size 4:**
|
| 319 |
|
| 320 |
-
1. `_और_
|
| 321 |
-
2. `_है।_
|
| 322 |
-
3. `_छै।_
|
| 323 |
|
| 324 |
|
| 325 |
### Key Findings
|
|
@@ -424,18 +424,18 @@ Below are text samples generated from each subword-based Markov chain model:
|
|
| 424 |
|
| 425 |
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|
| 426 |
|-------|-----------|----------|------------------|---------------|----------------|
|
| 427 |
-
| **mono_32d** | 32 | 0.
|
| 428 |
-
| **mono_64d** | 64 | 0.
|
| 429 |
-
| **mono_128d** | 128 | 0.
|
| 430 |
-
| **aligned_32d** | 32 | 0.
|
| 431 |
-
| **aligned_64d** | 64 | 0.
|
| 432 |
-
| **aligned_128d** | 128 | 0.
|
| 433 |
|
| 434 |
### Key Findings
|
| 435 |
|
| 436 |
-
- **Best Isotropy:** mono_32d with 0.
|
| 437 |
-
- **Semantic Density:** Average pairwise similarity of 0.
|
| 438 |
-
- **Alignment Quality:** Aligned models achieve up to 3.
|
| 439 |
- **Recommendation:** 128d aligned for best cross-lingual performance
|
| 440 |
|
| 441 |
---
|
|
@@ -461,7 +461,7 @@ These are the most productive prefixes and suffixes identified by sampling the v
|
|
| 461 |
#### Productive Suffixes
|
| 462 |
| Suffix | Examples |
|
| 463 |
|--------|----------|
|
| 464 |
-
| `-ों` |
|
| 465 |
|
| 466 |
### 6.3 Bound Stems (Lexical Roots)
|
| 467 |
|
|
@@ -469,9 +469,9 @@ Bound stems are high-frequency subword units that are semantically cohesive but
|
|
| 469 |
|
| 470 |
| Stem | Cohesion | Substitutability | Examples |
|
| 471 |
|------|----------|------------------|----------|
|
| 472 |
-
| `tion` | 2.
|
| 473 |
-
| `atio` | 2.
|
| 474 |
-
| `stat` | 2.
|
| 475 |
|
| 476 |
### 6.4 Affix Compatibility (Co-occurrence)
|
| 477 |
|
|
@@ -486,21 +486,21 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
|
|
| 486 |
|
| 487 |
| Word | Suggested Split | Confidence | Stem |
|
| 488 |
|------|-----------------|------------|------|
|
| 489 |
| महाविद्यालयों | **`महाविद्यालय-ों`** | 4.5 | `महाविद्यालय` |
|
| 490 |
-
|
|
| 491 |
-
| चमत्कारों | **`चमत्कार-ों`** | 4.5 | `चमत्कार` |
|
| 492 |
-
| विद्वानों | **`विद्वान-ों`** | 4.5 | `विद्वान` |
|
| 493 |
-
| व्याख्यानों | **`व्याख्यान-ों`** | 4.5 | `व्याख्यान` |
|
| 494 |
-
| कार्टूनों | **`कार्टून-ों`** | 4.5 | `कार्टून` |
|
| 495 |
-
| शास्त्रों | **`शास्त्र-ों`** | 4.5 | `शास्त्र` |
|
| 496 |
-
| कंप्यूटरों | **`कंप्यूटर-ों`** | 4.5 | `कंप्यूटर` |
|
| 497 |
-
| संस्कारों | **`संस्कार-ों`** | 4.5 | `संस्कार` |
|
| 498 |
-
| महासागरों | **`महासागर-ों`** | 4.5 | `महासागर` |
|
| 499 |
-
| पाठ्यक्रमों | **`पाठ्यक्रम-ों`** | 4.5 | `पाठ्यक्रम` |
|
| 500 |
-
| मुसलमानों | **`मुसलमान-ों`** | 4.5 | `मुसलमान` |
|
| 501 |
-
| महाद्वारों | **`महाद्वार-ों`** | 4.5 | `महाद्वार` |
|
| 502 |
-
| चालुक्यों | **`चालुक्य-ों`** | 4.5 | `चालुक्य` |
|
| 503 |
| प्रकाशकों | **`प्रकाशक-ों`** | 4.5 | `प्रकाशक` |
|
| 504 |
|
| 505 |
### 6.6 Linguistic Interpretation
|
| 506 |
|
|
@@ -734,4 +734,4 @@ MIT License - Free for academic and commercial use.
|
|
| 734 |
---
|
| 735 |
*Generated by Wikilangs Models Pipeline*
|
| 736 |
|
| 737 |
-
*Report Date: 2026-01-03
|
|
|
|
| 36 |
value: 3.777
|
| 37 |
- name: best_isotropy
|
| 38 |
type: isotropy
|
| 39 |
+
value: 0.8298
|
| 40 |
- name: vocabulary_size
|
| 41 |
type: vocab
|
| 42 |
value: 0
|
|
|
|
| 98 |
|
| 99 |
Below are sample sentences tokenized with each vocabulary size:
|
| 100 |
|
| 101 |
+
**Sample 1:** `साध्य रुप स॑ आइसलैण्ड दुनिया के सबसे पुराऽनो संसदीय लोकतंत्र छीकै। एकरा म॑ अभी 6...`
|
| 102 |
|
| 103 |
| Vocab | Tokens | Count |
|
| 104 |
|-------|--------|-------|
|
| 105 |
+
| 8k | `▁सा ध्य ▁रुप ▁स॑ ▁आइसलैण्ड ▁दुनिया ▁के ▁सबसे ▁पुरा ऽ ... (+26 more)` | 36 |
|
| 106 |
+
| 16k | `▁सा ध्य ▁रुप ▁स॑ ▁आइसलैण्ड ▁दुनिया ▁के ▁सबसे ▁पुरा ऽनो ... (+24 more)` | 34 |
|
| 107 |
+
| 32k | `▁साध्य ▁रुप ▁स॑ ▁आइसलैण्ड ▁दुनिया ▁के ▁सबसे ▁पुराऽनो ▁संसदीय ▁लोकतंत्र ... (+22 more)` | 32 |
|
| 108 |
|
| 109 |
+
**Sample 2:** `जनता दल एगो राष्ट्रीय दल छेकै। इतिहास एकरो देखौ बाहरी कड़ी संदर्भ`
|
| 110 |
|
| 111 |
| Vocab | Tokens | Count |
|
| 112 |
|-------|--------|-------|
|
| 113 |
+
| 8k | `▁जनता ▁दल ▁एगो ▁राष्ट्रीय ▁दल ▁छेकै । ▁इतिहास ▁एकरो ▁देखौ ... (+3 more)` | 13 |
|
| 114 |
+
| 16k | `▁जनता ▁दल ▁एगो ▁राष्ट्रीय ▁दल ▁छेकै । ▁इतिहास ▁एकरो ▁देखौ ... (+3 more)` | 13 |
|
| 115 |
+
| 32k | `▁जनता ▁दल ▁एगो ▁राष्ट्रीय ▁दल ▁छेकै । ▁इतिहास ▁एकरो ▁देखौ ... (+3 more)` | 13 |
|
| 116 |
|
| 117 |
+
**Sample 3:** `कोनो रोग सॆं मनुष्य के बचाव लेली जे विधि अपनैलॊ जाय छै, वोकरा चिकित्सा कहलॊ जाय ...`
|
| 118 |
|
| 119 |
| Vocab | Tokens | Count |
|
| 120 |
|-------|--------|-------|
|
| 121 |
+
| 8k | `▁कोनो ▁रोग ▁सॆं ▁मनुष्य ▁के ▁बच ाव ▁लेली ▁जे ▁विधि ... (+14 more)` | 24 |
|
| 122 |
+
| 16k | `▁कोनो ▁रोग ▁सॆं ▁मनुष्य ▁के ▁बचाव ▁लेली ▁जे ▁विधि ▁अपन ... (+12 more)` | 22 |
|
| 123 |
+
| 32k | `▁कोनो ▁रोग ▁सॆं ▁मनुष्य ▁के ▁बचाव ▁लेली ▁जे ▁विधि ▁अपनैलॊ ... (+9 more)` | 19 |
|
| 124 |
|
| 125 |
|
| 126 |
### Key Findings
|
|
|
|
| 270 |
|
| 271 |
**Context Size 1:**
|
| 272 |
|
| 273 |
+
1. `के महिमा बहुत थोड़ा या एक क्षेत्र क साबुन कारखानों में है प्रेमचंद अध्यापक फ्रांसिस प्रथम`
|
| 274 |
+
2. `में छै जेकरा मॅॆ कुल 650 महिला छै देवनागरी लिपि शब्दावली लिपि केरौ अधिकार प्राप्त छै`
|
| 275 |
+
3. `छै उदाहरणतः x11 रंगों के मौखिक संचार प्रतीक समूह भी पंचवटी प्रसिद्ध हुआ आज १५० से`
|
| 276 |
|
| 277 |
**Context Size 2:**
|
| 278 |
|
| 279 |
+
1. `के लिए मिस्र पर विजय प्राप्त करै छीयै जे कणोज स॑ भी अधिक अलग अलग रूप दिया`
|
| 280 |
+
2. `के अनुसार पत्रांग गांव के आबादी 105 छै जे गाँव के जनसंख्या छै जेकरा म 147 पुरुष`
|
| 281 |
+
3. `छै जे उत्तर प्रदेश राज्य मँ स्थित छै मानदंड के अनुसार कुंदरी सोन कुरहा हरला के कुल`
|
| 282 |
|
| 283 |
**Context Size 3:**
|
| 284 |
|
| 285 |
+
1. `छै जेकरा म 118 पुरुष आरु जबकि महिला छै तेलबाद्रो गांव म 0 6 आयु वर्ग के बच्चा`
|
| 286 |
+
2. `जनगणना के अनुसार हरवाडीह के बाल लिंग अनुपात 915 छै जे उत्तर प्रदेश के मिर्ज़ापुर जिले की बेलन`
|
| 287 |
+
3. `के रूप में देखा जाता है किंतु पाप के सभी परिणाम नष्ट नहीं होते उसके परिणाम दूर करने`
|
| 288 |
|
| 289 |
**Context Size 4:**
|
| 290 |
|
| 291 |
+
1. `छै जेकरा म कुल 72 पुरुष छै जबकि 80 महिला छै जैसनो कि के जनगणना म बतैलो गेलो छै`
|
| 292 |
+
2. `के औसत लिंग अनुपात 835 स कम छै`
|
| 293 |
+
3. `छै जनगणना के अनुसार सरोख गांव के आबादी 673 छेलै जेकरा म॑ स॑ 613 पुरुष आरू 503 महिला छै`
|
| 294 |
|
| 295 |
|
| 296 |
### Generated Text Samples (Subword-based)
|
|
|
|
| 299 |
|
| 300 |
**Context Size 1:**
|
| 301 |
|
| 302 |
+
1. `_(मेल_उत्पत्ति_सूत्रों_के_के_`
|
| 303 |
+
2. `र_हैं_रानर्जी_सम्परिसबना_दो`
|
| 304 |
+
3. `क_दौराजलड़कई_साथ_कुल_`
|
| 305 |
|
| 306 |
**Context Size 2:**
|
| 307 |
|
| 308 |
+
1. `र_के_इस_छै।_मुआविष्कार_दि`
|
| 309 |
+
2. `_के_ठीक_यौगिक_रक्षा_आवासी_`
|
| 310 |
+
3. `के_अध्ययन_में_लैटिन_का_दार्श`
|
| 311 |
|
| 312 |
**Context Size 3:**
|
| 313 |
|
| 314 |
+
1. `_के_भाई_थे_और_माना_जाता_है`
|
| 315 |
+
2. `_में_5%_छै।_जनताँत्रिक_रूप_`
|
| 316 |
+
3. `_की_जाती_हैं_जो_लगन_की_किता`
|
| 317 |
|
| 318 |
**Context Size 4:**
|
| 319 |
|
| 320 |
+
1. `_और_गैर-न्यायिक_सदन_की_आवृ`
|
| 321 |
+
2. `_है।_व्यापक_छै_तs_आखरी_सांस`
|
| 322 |
+
3. `_छै।_इतिहास_के_बाद_उसको_स`
|
| 323 |
|
| 324 |
|
| 325 |
### Key Findings
|
|
|
|
| 424 |
|
| 425 |
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|
| 426 |
|-------|-----------|----------|------------------|---------------|----------------|
|
| 427 |
+
| **mono_32d** | 32 | 0.8298 🏆 | 0.3551 | N/A | N/A |
|
| 428 |
+
| **mono_64d** | 64 | 0.7019 | 0.2957 | N/A | N/A |
|
| 429 |
+
| **mono_128d** | 128 | 0.3519 | 0.2719 | N/A | N/A |
|
| 430 |
+
| **aligned_32d** | 32 | 0.8298 | 0.3586 | 0.0160 | 0.0940 |
|
| 431 |
+
| **aligned_64d** | 64 | 0.7019 | 0.2950 | 0.0180 | 0.1240 |
|
| 432 |
+
| **aligned_128d** | 128 | 0.3519 | 0.2673 | 0.0300 | 0.1420 |
|
| 433 |
|
| 434 |
### Key Findings
|
| 435 |
|
| 436 |
+
- **Best Isotropy:** mono_32d with 0.8298 (more uniform distribution)
|
| 437 |
+
- **Semantic Density:** Average pairwise similarity of 0.3073. Lower values indicate better semantic separation.
|
| 438 |
+
- **Alignment Quality:** Aligned models achieve up to 3.0% R@1 in cross-lingual retrieval.
|
| 439 |
- **Recommendation:** 128d aligned for best cross-lingual performance
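
The summary figures in these findings can be re-derived from the six table rows in this hunk. The following is a minimal sketch, assuming the reported semantic density is the plain mean over all six rows and that R@1 is quoted as a percentage; `rows` and `summarize` are illustrative names, not part of the Wikilangs pipeline.

```python
# Rows copied from the embedding table in this hunk:
# (model, dimension, isotropy, semantic density, R@1, R@10); None stands for N/A.
rows = [
    ("mono_32d", 32, 0.8298, 0.3551, None, None),
    ("mono_64d", 64, 0.7019, 0.2957, None, None),
    ("mono_128d", 128, 0.3519, 0.2719, None, None),
    ("aligned_32d", 32, 0.8298, 0.3586, 0.0160, 0.0940),
    ("aligned_64d", 64, 0.7019, 0.2950, 0.0180, 0.1240),
    ("aligned_128d", 128, 0.3519, 0.2673, 0.0300, 0.1420),
]


def summarize(rows):
    """Return (best-isotropy model, mean semantic density, best R@1 in percent)."""
    # max() keeps the first row on ties, so mono_32d wins over aligned_32d.
    best_iso = max(rows, key=lambda r: r[2])
    mean_density = sum(r[3] for r in rows) / len(rows)
    best_r1 = max(r[4] for r in rows if r[4] is not None)
    return best_iso[0], round(mean_density, 4), round(best_r1 * 100, 1)


print(summarize(rows))
```

Under these assumptions the sketch reproduces the three headline values: mono_32d for isotropy, 0.3073 for average similarity, and 3.0% for R@1.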
|
| 440 |
|
| 441 |
---
|
|
|
|
| 461 |
#### Productive Suffixes
|
| 462 |
| Suffix | Examples |
|
| 463 |
|--------|----------|
|
| 464 |
+
| `-ों` | चक्रवातों, अनुक्रमों, मोहरों |
|
| 465 |
|
| 466 |
### 6.3 Bound Stems (Lexical Roots)
|
| 467 |
|
|
|
|
| 469 |
|
| 470 |
| Stem | Cohesion | Substitutability | Examples |
|
| 471 |
|------|----------|------------------|----------|
|
| 472 |
+
| `tion` | 2.65x | 15 contexts | motion, action, edition |
|
| 473 |
+
| `atio` | 2.66x | 12 contexts | nations, station, national |
|
| 474 |
+
| `stat` | 2.68x | 6 contexts | state, status, statue |
|
| 475 |
|
| 476 |
### 6.4 Affix Compatibility (Co-occurrence)
|
| 477 |
|
|
|
|
| 486 |
|
| 487 |
| Word | Suggested Split | Confidence | Stem |
|
| 488 |
|------|-----------------|------------|------|
|
| 489 |
+
| अविष्कारों | **`अविष्कार-ों`** | 4.5 | `अविष्कार` |
|
| 490 |
+
| रूपान्तरणों | **`रूपान्तरण-ों`** | 4.5 | `रूपान्तरण` |
|
| 491 |
| महाविद्यालयों | **`महाविद्यालय-ों`** | 4.5 | `महाविद्यालय` |
|
| 492 |
+
| यूरोपियनों | **`यूरोपियन-ों`** | 4.5 | `यूरोपियन` |
|
| 493 |
| प्रकाशकों | **`प्रकाशक-ों`** | 4.5 | `प्रकाशक` |
|
| 494 |
+
| अनुक्रमों | **`अनुक्रम-ों`** | 4.5 | `अनुक्रम` |
|
| 495 |
+
| सम्मेलनों | **`सम्मेलन-ों`** | 4.5 | `सम्मेलन` |
|
| 496 |
+
| सुल्तानों | **`सुल्तान-ों`** | 4.5 | `सुल्तान` |
|
| 497 |
+
| गणितज्ञों | **`गणितज्ञ-ों`** | 4.5 | `गणितज्ञ` |
|
| 498 |
+
| पुस्तकालयों | **`पुस्तकालय-ों`** | 4.5 | `पुस्तकालय` |
|
| 499 |
+
| महाकाव्यों | **`महाकाव्य-ों`** | 4.5 | `महाकाव्य` |
|
| 500 |
+
| गुणसूत्रों | **`गुणसूत्र-ों`** | 4.5 | `गुणसूत्र` |
|
| 501 |
+
| शास्त्रों | **`शास्त्र-ों`** | 4.5 | `शास्त्र` |
|
| 502 |
+
| संग्रहालयों | **`संग्रहालय-ों`** | 4.5 | `संग्रहालय` |
|
| 503 |
+
| कार्यालयों | **`कार्यालय-ों`** | 4.5 | `कार्यालय` |
|
| 504 |
|
| 505 |
### 6.6 Linguistic Interpretation
|
| 506 |
|
|
|
|
| 734 |
---
|
| 735 |
*Generated by Wikilangs Models Pipeline*
|
| 736 |
|
| 737 |
+
*Report Date: 2026-01-03 16:32:35*
|
models/embeddings/aligned/anp_128d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1036402426
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:876255fe497d906428a47921a579962c1e032f45f5946901e844e7919bdfac06
|
| 3 |
size 1036402426
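
Each binary in this commit is stored as a Git LFS pointer: a small text file with `version`, `oid`, and `size` lines. A minimal sketch of reading one, assuming the three-line layout shown in this diff; `parse_lfs_pointer` is a hypothetical helper, not part of Git LFS or the repository.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer text as it appears on the new side of this diff.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:876255fe497d906428a47921a579962c1e032f45f5946901e844e7919bdfac06
size 1036402426
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], len(info["digest"]), info["size"])
```

The `size` field is the byte length of the real file, so it can be compared against a downloaded copy before hashing it.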
|
models/embeddings/aligned/anp_128d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 65664
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a70ab46701d9206f9b8a719c9c664b879a041079240d00a9e962048f37dc4186
|
| 3 |
size 65664
|
models/embeddings/aligned/anp_32d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 259328506
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1c524ba6016b57f0e02ddf42b2eae20e6804fdfa771bcbc9f2b1ccebd5119e57
|
| 3 |
size 259328506
|
models/embeddings/aligned/anp_32d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 4224
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:012c0bfb01d3fd691249c16503dcbf4b0bd28739718ac0f3683b8881e0b9ee3d
|
| 3 |
size 4224
|
models/embeddings/aligned/anp_64d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 518353146
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0979ed932bb2ac48c53d041f0eb9021a264be14bbea5992e545f4d979173efb6
|
| 3 |
size 518353146
|
models/embeddings/aligned/anp_64d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 16512
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:80c8a841c44c0578d82a02d9f700e03dd48117d5cd8af75b4f8dbf8cf3fb09b4
|
| 3 |
size 16512
|
models/embeddings/monolingual/anp_128d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1036402426
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:876255fe497d906428a47921a579962c1e032f45f5946901e844e7919bdfac06
|
| 3 |
size 1036402426
|
models/embeddings/monolingual/anp_32d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 259328506
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1c524ba6016b57f0e02ddf42b2eae20e6804fdfa771bcbc9f2b1ccebd5119e57
|
| 3 |
size 259328506
|
models/embeddings/monolingual/anp_64d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 518353146
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0979ed932bb2ac48c53d041f0eb9021a264be14bbea5992e545f4d979173efb6
|
| 3 |
size 518353146
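
The `oid` values in this commit make it possible to spot files with identical content without downloading them. The sketch below groups the six embedding `.bin` pointers from this diff by digest; `group_by_oid` is an illustrative helper, and the only assumption is that equal SHA-256 digests mean equal file content.

```python
from collections import defaultdict

# Path -> sha256 digest, copied from the new side of the pointer diffs in this commit.
oids = {
    "models/embeddings/aligned/anp_128d.bin":
        "876255fe497d906428a47921a579962c1e032f45f5946901e844e7919bdfac06",
    "models/embeddings/aligned/anp_32d.bin":
        "1c524ba6016b57f0e02ddf42b2eae20e6804fdfa771bcbc9f2b1ccebd5119e57",
    "models/embeddings/aligned/anp_64d.bin":
        "0979ed932bb2ac48c53d041f0eb9021a264be14bbea5992e545f4d979173efb6",
    "models/embeddings/monolingual/anp_128d.bin":
        "876255fe497d906428a47921a579962c1e032f45f5946901e844e7919bdfac06",
    "models/embeddings/monolingual/anp_32d.bin":
        "1c524ba6016b57f0e02ddf42b2eae20e6804fdfa771bcbc9f2b1ccebd5119e57",
    "models/embeddings/monolingual/anp_64d.bin":
        "0979ed932bb2ac48c53d041f0eb9021a264be14bbea5992e545f4d979173efb6",
}


def group_by_oid(oids):
    """Return sorted lists of paths that share a digest (groups of two or more)."""
    groups = defaultdict(list)
    for path, oid in oids.items():
        groups[oid].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


duplicates = group_by_oid(oids)
for paths in duplicates:
    print(" == ".join(paths))
```

In this commit each aligned `.bin` shares its digest with the monolingual file of the same dimension, which the grouping surfaces as three pairs.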
|
models/subword_markov/anp_markov_ctx1_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7c9fb4c52e0e87e967f3f7d9dc773f6d5db95feedec9abcc43358cbdb9239cab
|
| 3 |
+
size 371363
|
models/subword_markov/anp_markov_ctx2_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d13edcb4f30fd5294da845295ac9232ceaf59bb9496d2e4c341ea4840e577c0a
|
| 3 |
+
size 1640712
|
models/subword_markov/anp_markov_ctx3_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8cddffef0719b682192bbf58fd6e27a7d002473ce11dbfd201c44fc6102cd421
|
| 3 |
+
size 4887791
|
models/subword_markov/anp_markov_ctx4_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9b703a3dd5c8ca51530d4e2593c9ca8d93ede38e4956d2428176c5bd2af93c85
|
| 3 |
+
size 11032755
|
models/subword_ngram/anp_2gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:32e6f72363f4a1ea1193f915ea32a93f20b722060c5203991a7010560918bbbb
|
| 3 |
+
size 268051
|
models/subword_ngram/anp_3gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:df0d7fa9cc4b96ba843e3c5d94d0c2eaa3d3c5c03fc8d7ddbdace8b7d98c73b9
|
| 3 |
+
size 1065691
|
models/subword_ngram/anp_4gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7af5a1ce3113561abca60cff8c8b58f8f2737fd0c0ac5d015610b269cb368800
|
| 3 |
+
size 3167187
|
models/subword_ngram/anp_5gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ff6f7948d501fa7ee5efdb1256f5675a3b99076844c1a5c4c065207facec2980
|
| 3 |
+
size 4368124
|
models/tokenizer/anp_tokenizer_16k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 618098
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:774cf4a58a99a822ce3c7b4af7d57a3be4b47122c38947133619943fc1403cc0
|
| 3 |
size 618098
|
models/tokenizer/anp_tokenizer_32k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1035857
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a6a6cbe586daf6b47f8bfad999334955395c69728ea6d64a4e58902776176b36
|
| 3 |
size 1035857
|
models/tokenizer/anp_tokenizer_8k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 425391
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a6c09e774f002b4bde4e5c27518cc493ae377ee62a8bcce5c0473a204c053ada
|
| 3 |
size 425391
|
models/word_markov/anp_markov_ctx1_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:54ed2e2bcfee33ce09d9df878d48f229122588ee4ece68768d30e9a1a5831518
|
| 3 |
+
size 3037808
|
models/word_markov/anp_markov_ctx2_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:df821ce38dfaecb93e34dd5c5dccd31a58bce0f027dcb573375dda1df58e0d12
|
| 3 |
+
size 8870407
|
models/word_markov/anp_markov_ctx3_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b79a6e8ae2e05ac46c852acd75949a109458e3dc426638c730e39155dda243af
|
| 3 |
+
size 13019964
|
models/word_markov/anp_markov_ctx4_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:26c9945b42073ab90074ee6c2f828aa30a8f5da23be39234422ddccd8d150c65
|
| 3 |
+
size 15594517
|
models/word_ngram/anp_2gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f55b02c0e85fea588ab9f833cab08654a573346d3c15d52bb7a6a9d9516c1db3
|
| 3 |
+
size 314568
|
models/word_ngram/anp_3gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a2c51e0ccd5cc11bcd63ab0935cb7fc37fa5f20aa9004e325d029757492ab9d8
|
| 3 |
+
size 351861
|
models/word_ngram/anp_4gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:85bcc06f5b322511920a2c409ef0efa4bacfb89aa0b15ede1e38e0b77763b3a1
|
| 3 |
+
size 711216
|
models/word_ngram/anp_5gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:667c91b499f3ded7261257ead532e3eeb3e4471a990cd98bfca470732d0bfd3b
|
| 3 |
+
size 564652
|
visualizations/embedding_alignment_quality.png
CHANGED
|
|
visualizations/embedding_isotropy.png
CHANGED
|
|
visualizations/embedding_norms.png
CHANGED
|
|
visualizations/embedding_similarity.png
CHANGED
|
|
visualizations/embedding_tsne_multilingual.png
CHANGED
|
|
visualizations/model_sizes.png
CHANGED
|
|
visualizations/ngram_perplexity.png
CHANGED
|
|
visualizations/performance_dashboard.png
CHANGED
|
|
visualizations/position_encoding_comparison.png
CHANGED
|
|
visualizations/tsne_sentences.png
CHANGED
|
|
visualizations/tsne_words.png
CHANGED
|
|