Upload all models and assets for an (latest)
- README.md +97 -96
- models/embeddings/aligned/an_128d.bin +1 -1
- models/embeddings/aligned/an_128d.projection.npy +1 -1
- models/embeddings/aligned/an_32d.bin +1 -1
- models/embeddings/aligned/an_32d.projection.npy +1 -1
- models/embeddings/aligned/an_64d.bin +1 -1
- models/embeddings/aligned/an_64d.projection.npy +1 -1
- models/embeddings/monolingual/an_128d.bin +1 -1
- models/embeddings/monolingual/an_32d.bin +1 -1
- models/embeddings/monolingual/an_64d.bin +1 -1
- models/subword_markov/an_markov_ctx1_subword.parquet +2 -2
- models/subword_markov/an_markov_ctx2_subword.parquet +2 -2
- models/subword_markov/an_markov_ctx3_subword.parquet +2 -2
- models/subword_markov/an_markov_ctx4_subword.parquet +2 -2
- models/subword_ngram/an_2gram_subword.parquet +2 -2
- models/subword_ngram/an_3gram_subword.parquet +2 -2
- models/subword_ngram/an_4gram_subword.parquet +2 -2
- models/subword_ngram/an_5gram_subword.parquet +2 -2
- models/tokenizer/an_tokenizer_16k.model +1 -1
- models/tokenizer/an_tokenizer_32k.model +1 -1
- models/tokenizer/an_tokenizer_64k.model +1 -1
- models/tokenizer/an_tokenizer_8k.model +1 -1
- models/word_markov/an_markov_ctx1_word.parquet +2 -2
- models/word_markov/an_markov_ctx2_word.parquet +2 -2
- models/word_markov/an_markov_ctx3_word.parquet +2 -2
- models/word_markov/an_markov_ctx4_word.parquet +2 -2
- models/word_ngram/an_2gram_word.parquet +2 -2
- models/word_ngram/an_3gram_word.parquet +2 -2
- models/word_ngram/an_4gram_word.parquet +2 -2
- models/word_ngram/an_5gram_word.parquet +2 -2
- visualizations/embedding_alignment_quality.png +0 -0
- visualizations/embedding_isotropy.png +0 -0
- visualizations/embedding_norms.png +0 -0
- visualizations/embedding_similarity.png +2 -2
- visualizations/embedding_tsne_multilingual.png +2 -2
- visualizations/ngram_perplexity.png +0 -0
- visualizations/performance_dashboard.png +2 -2
- visualizations/position_encoding_comparison.png +2 -2
- visualizations/tsne_sentences.png +2 -2
- visualizations/tsne_words.png +2 -2
README.md
CHANGED
|
@@ -36,7 +36,7 @@ metrics:
|
|
| 36 |
value: 4.275
|
| 37 |
- name: best_isotropy
|
| 38 |
type: isotropy
|
| 39 |
-
value: 0.
|
| 40 |
- name: vocabulary_size
|
| 41 |
type: vocab
|
| 42 |
value: 0
|
|
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
|
|
| 99 |
|
| 100 |
Below are sample sentences tokenized with each vocabulary size:
|
| 101 |
|
| 102 |
-
**Sample 1:** `
|
| 103 |
|
| 104 |
| Vocab | Tokens | Count |
|
| 105 |
|-------|--------|-------|
|
| 106 |
-
| 8k | `▁
|
| 107 |
-
| 16k | `▁
|
| 108 |
-
| 32k | `▁
|
| 109 |
-
| 64k | `▁
|
| 110 |
|
| 111 |
-
**Sample 2:** `
|
| 112 |
|
| 113 |
| Vocab | Tokens | Count |
|
| 114 |
|-------|--------|-------|
|
| 115 |
-
| 8k | `▁
|
| 116 |
-
| 16k | `▁
|
| 117 |
-
| 32k | `▁
|
| 118 |
-
| 64k | `▁
|
| 119 |
|
| 120 |
-
**Sample 3:** `
|
| 121 |
|
| 122 |
| Vocab | Tokens | Count |
|
| 123 |
|-------|--------|-------|
|
| 124 |
-
| 8k | `▁
|
| 125 |
-
| 16k | `▁
|
| 126 |
-
| 32k | `▁
|
| 127 |
-
| 64k | `▁
|
| 128 |
|
| 129 |
|
| 130 |
### Key Findings
|
|
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:
|
|
| 274 |
|
| 275 |
**Context Size 1:**
|
| 276 |
|
| 277 |
-
1. `de
|
| 278 |
-
2. `d
|
| 279 |
-
3. `a
|
| 280 |
|
| 281 |
**Context Size 2:**
|
| 282 |
|
| 283 |
-
1. `d a
|
| 284 |
-
2. `d o
|
| 285 |
-
3. `en a
|
| 286 |
|
| 287 |
**Context Size 3:**
|
| 288 |
|
| 289 |
-
1. `a provincia de
|
| 290 |
-
2. `d a provincia de
|
| 291 |
-
3. `una superficie de
|
| 292 |
|
| 293 |
**Context Size 4:**
|
| 294 |
|
| 295 |
-
1. `suya población ye de
|
| 296 |
-
2. `en una superficie de
|
| 297 |
-
3. `d a provincia de
|
| 298 |
|
| 299 |
|
| 300 |
### Generated Text Samples (Subword-based)
|
|
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:
|
|
| 303 |
|
| 304 |
**Context Size 1:**
|
| 305 |
|
| 306 |
-
1. `
|
| 307 |
-
2. `
|
| 308 |
-
3. `
|
| 309 |
|
| 310 |
**Context Size 2:**
|
| 311 |
|
| 312 |
-
1. `
|
| 313 |
-
2. `
|
| 314 |
-
3. `
|
| 315 |
|
| 316 |
**Context Size 3:**
|
| 317 |
|
| 318 |
-
1. `
|
| 319 |
-
2. `
|
| 320 |
-
3. `_d'
|
| 321 |
|
| 322 |
**Context Size 4:**
|
| 323 |
|
| 324 |
-
1. `
|
| 325 |
-
2. `
|
| 326 |
-
3. `_d'
|
| 327 |
|
| 328 |
|
| 329 |
### Key Findings
|
|
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:
|
|
| 428 |
|
| 429 |
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|
| 430 |
|-------|-----------|----------|------------------|---------------|----------------|
|
| 431 |
-
| **mono_32d** | 32 | 0.
|
| 432 |
-
| **mono_64d** | 64 | 0.
|
| 433 |
-
| **mono_128d** | 128 | 0.
|
| 434 |
-
| **aligned_32d** | 32 | 0.
|
| 435 |
-
| **aligned_64d** | 64 | 0.
|
| 436 |
-
| **aligned_128d** | 128 | 0.
|
| 437 |
|
| 438 |
### Key Findings
|
| 439 |
|
| 440 |
-
- **Best Isotropy:** mono_64d with 0.
|
| 441 |
-
- **Semantic Density:** Average pairwise similarity of 0.
|
| 442 |
-
- **Alignment Quality:** Aligned models achieve up to 37.
|
| 443 |
- **Recommendation:** 128d aligned for best cross-lingual performance
|
| 444 |
|
| 445 |
---
|
|
@@ -461,19 +461,20 @@ These are the most productive prefixes and suffixes identified by sampling the v
|
|
| 461 |
#### Productive Prefixes
|
| 462 |
| Prefix | Examples |
|
| 463 |
|--------|----------|
|
| 464 |
-
| `-co` |
|
| 465 |
-
| `-ca` |
|
| 466 |
-
| `-
|
| 467 |
-
| `-
|
|
|
|
| 468 |
|
| 469 |
#### Productive Suffixes
|
| 470 |
| Suffix | Examples |
|
| 471 |
|--------|----------|
|
| 472 |
-
| `-s` |
|
| 473 |
-
| `-a` |
|
| 474 |
-
| `-as` |
|
| 475 |
-
| `-os` |
|
| 476 |
-
| `-es` |
|
| 477 |
|
| 478 |
### 6.3 Bound Stems (Lexical Roots)
|
| 479 |
|
|
@@ -481,18 +482,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
|
|
| 481 |
|
| 482 |
| Stem | Cohesion | Substitutability | Examples |
|
| 483 |
|------|----------|------------------|----------|
|
| 484 |
-
| `
|
| 485 |
-
| `
|
| 486 |
-
| `
|
| 487 |
-
| `
|
| 488 |
-
| `
|
| 489 |
-
| `
|
| 490 |
-
| `
|
| 491 |
-
| `
|
| 492 |
-
| `
|
| 493 |
-
| `cion` | 1.
|
| 494 |
-
| `
|
| 495 |
-
| `mbre` | 1.55x | 75 contexts |
|
| 496 |
|
| 497 |
### 6.4 Affix Compatibility (Co-occurrence)
|
| 498 |
|
|
@@ -500,16 +501,16 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
|
|
| 500 |
|
| 501 |
| Prefix | Suffix | Frequency | Examples |
|
| 502 |
|--------|--------|-----------|----------|
|
| 503 |
-
| `-
|
| 504 |
-
| `-
|
| 505 |
-
| `-
|
| 506 |
-
| `-
|
| 507 |
-
| `-
|
| 508 |
-
| `-
|
| 509 |
-
| `-re` | `-
|
| 510 |
-
| `-
|
| 511 |
-
| `-
|
| 512 |
-
| `-
|
| 513 |
|
| 514 |
### 6.5 Recursive Morpheme Segmentation
|
| 515 |
|
|
@@ -517,21 +518,21 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
|
|
| 517 |
|
| 518 |
| Word | Suggested Split | Confidence | Stem |
|
| 519 |
|------|-----------------|------------|------|
|
| 520 |
-
|
|
| 521 |
-
|
|
| 522 |
-
|
|
| 523 |
-
|
|
| 524 |
-
|
|
| 525 |
-
|
|
| 526 |
-
|
|
| 527 |
-
|
|
| 528 |
-
|
|
| 529 |
-
|
|
| 530 |
-
|
|
| 531 |
-
|
|
| 532 |
-
|
|
| 533 |
-
|
|
| 534 |
-
|
|
| 535 |
|
| 536 |
### 6.6 Linguistic Interpretation
|
| 537 |
|
|
@@ -763,4 +764,4 @@ MIT License - Free for academic and commercial use.
|
|
| 763 |
---
|
| 764 |
*Generated by Wikilangs Models Pipeline*
|
| 765 |
|
| 766 |
-
*Report Date: 2026-01-03
|
|
|
|
| 36 |
value: 4.275
|
| 37 |
- name: best_isotropy
|
| 38 |
type: isotropy
|
| 39 |
+
value: 0.8232
|
| 40 |
- name: vocabulary_size
|
| 41 |
type: vocab
|
| 42 |
value: 0
|
|
|
|
| 99 |
|
| 100 |
Below are sample sentences tokenized with each vocabulary size:
|
| 101 |
|
| 102 |
+
**Sample 1:** `Bobadilla puet estar: Bobadilla, un municipio de La Rioja. Bobadilla del Campo, ...`
|
| 103 |
|
| 104 |
| Vocab | Tokens | Count |
|
| 105 |
|-------|--------|-------|
|
| 106 |
+
| 8k | `▁bob ad illa ▁puet ▁estar : ▁bob ad illa , ... (+17 more)` | 27 |
|
| 107 |
+
| 16k | `▁bob ad illa ▁puet ▁estar : ▁bob ad illa , ... (+17 more)` | 27 |
|
| 108 |
+
| 32k | `▁bob ad illa ▁puet ▁estar : ▁bob ad illa , ... (+17 more)` | 27 |
|
| 109 |
+
| 64k | `▁bobadilla ▁puet ▁estar : ▁bobadilla , ▁un ▁municipio ▁de ▁la ... (+11 more)` | 21 |
|
| 110 |
|
| 111 |
+
**Sample 2:** `Charleville-Mézières ye una localidat y comuna francesa, capital d'o departament...`
|
| 112 |
|
| 113 |
| Vocab | Tokens | Count |
|
| 114 |
|-------|--------|-------|
|
| 115 |
+
| 8k | `▁char le ville - m é zi ères ▁ye ▁una ... (+26 more)` | 36 |
|
| 116 |
+
| 16k | `▁char le ville - mé zi ères ▁ye ▁una ▁localidat ... (+25 more)` | 35 |
|
| 117 |
+
| 32k | `▁char le ville - mé zi ères ▁ye ▁una ▁localidat ... (+23 more)` | 33 |
|
| 118 |
+
| 64k | `▁charleville - mézières ▁ye ▁una ▁localidat ▁y ▁comuna ▁francesa , ... (+19 more)` | 29 |
|
| 119 |
|
| 120 |
+
**Sample 3:** `Schöngeising (en bavaro Scheegeising) ye un municipio de Bavera, Alemanya. Se tr...`
|
| 121 |
|
| 122 |
| Vocab | Tokens | Count |
|
| 123 |
|-------|--------|-------|
|
| 124 |
+
| 8k | `▁sch ön ge is ing ▁( en ▁bavaro ▁s che ... (+29 more)` | 39 |
|
| 125 |
+
| 16k | `▁schön ge is ing ▁( en ▁bavaro ▁sche e ge ... (+25 more)` | 35 |
|
| 126 |
+
| 32k | `▁schön ge ising ▁( en ▁bavaro ▁sche e ge ising ... (+20 more)` | 30 |
|
| 127 |
+
| 64k | `▁schön ge ising ▁( en ▁bavaro ▁sche e ge ising ... (+20 more)` | 30 |
|
| 128 |
|
| 129 |
|
| 130 |
### Key Findings
|
|
|
|
| 274 |
|
| 275 |
**Context Size 1:**
|
| 276 |
|
| 277 |
+
1. `de las neveras y rfef aprebó a pachina web oficial d afers son asociadas con os`
|
| 278 |
+
2. `d elba antiparte la provincia d as que se veiga torda collerada rafel vidaller tricas libro`
|
| 279 |
+
3. `a rendición de sattler torna ta partecipar en ifriquiya y cariño homenage vasallage en aragonés vinc...`
|
| 280 |
|
| 281 |
**Context Size 2:**
|
| 282 |
|
| 283 |
+
1. `d a ciudat de zaragoza tomo i de castiella y leyón espanya o escritor de lausbubengeschichte ye`
|
| 284 |
+
2. `d o reino se consolida la influyencia de l exercito estatounitesne en europa s extiende dende os`
|
| 285 |
+
3. `en a provincia de teruel d o cual en fan parte 4 cantons y 129 comunas lista`
|
| 286 |
|
| 287 |
**Context Size 3:**
|
| 288 |
|
| 289 |
+
1. `a provincia de zaragoza en a provincia de concepción y d as tres serols estando dimpués enamplato a`
|
| 290 |
+
2. `d a provincia de guipuzcua ta atros usos se veiga carlos ix carlos ix 27 de chunio de`
|
| 291 |
+
3. `una superficie de 158 60 km y una densidat de población de 346 35 hab km a suya`
|
| 292 |
|
| 293 |
**Context Size 4:**
|
| 294 |
|
| 295 |
+
1. `suya población ye de 81 habitants en una superficie de 194 49 km con una densidat de población de`
|
| 296 |
+
2. `en una superficie de 64 16 km con una densidat de población de 43 44 hab km demografía administració...`
|
| 297 |
+
3. `d a provincia de burgos ta atros usos se veiga fort yuma desambigación fort yuma títol orichinal en ...`
|
| 298 |
|
| 299 |
|
| 300 |
### Generated Text Samples (Subword-based)
|
|
|
|
| 303 |
|
| 304 |
**Context Size 1:**
|
| 305 |
|
| 306 |
+
1. `_un_der_dent_ckm`
|
| 307 |
+
2. `as_en_as_2_tacla`
|
| 308 |
+
3. `en_lern_don_vitr`
|
| 309 |
|
| 310 |
**Context Size 2:**
|
| 311 |
|
| 312 |
+
1. `a_saus_dabinascer`
|
| 313 |
+
2. `_derfica_sublosti`
|
| 314 |
+
3. `e_manaisitau_suyo`
|
| 315 |
|
| 316 |
**Context Size 3:**
|
| 317 |
|
| 318 |
+
1. `_dens._val_novant,`
|
| 319 |
+
2. `de_319_de_fuel,_qu`
|
| 320 |
+
3. `_d'o_primetada_cic`
|
| 321 |
|
| 322 |
**Context Size 4:**
|
| 323 |
|
| 324 |
+
1. `_de_jean-jose_(naix`
|
| 325 |
+
2. `_en_sido_per_bueno,`
|
| 326 |
+
3. `_d'anglés_jean_sabi`
|
| 327 |
|
| 328 |
|
| 329 |
### Key Findings
|
|
|
|
| 428 |
|
| 429 |
| Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
|
| 430 |
|-------|-----------|----------|------------------|---------------|----------------|
|
| 431 |
+
| **mono_32d** | 32 | 0.8168 | 0.3517 | N/A | N/A |
|
| 432 |
+
| **mono_64d** | 64 | 0.8232 🏆 | 0.2779 | N/A | N/A |
|
| 433 |
+
| **mono_128d** | 128 | 0.8044 | 0.2016 | N/A | N/A |
|
| 434 |
+
| **aligned_32d** | 32 | 0.8168 | 0.3524 | 0.1520 | 0.4840 |
|
| 435 |
+
| **aligned_64d** | 64 | 0.8232 | 0.2773 | 0.2480 | 0.6340 |
|
| 436 |
+
| **aligned_128d** | 128 | 0.8044 | 0.2034 | 0.3740 | 0.7380 |
|
| 437 |
|
| 438 |
### Key Findings
|
| 439 |
|
| 440 |
+
- **Best Isotropy:** mono_64d with 0.8232 (more uniform distribution)
|
| 441 |
+
- **Semantic Density:** Average pairwise similarity of 0.2774. Lower values indicate better semantic separation.
|
| 442 |
+
- **Alignment Quality:** Aligned models achieve up to 37.4% R@1 in cross-lingual retrieval.
|
| 443 |
- **Recommendation:** 128d aligned for best cross-lingual performance
|
| 444 |
|
| 445 |
---
|
|
|
|
| 461 |
#### Productive Prefixes
|
| 462 |
| Prefix | Examples |
|
| 463 |
|--------|----------|
|
| 464 |
+
| `-co` | confrontatos, conchecturau, coluche |
|
| 465 |
+
| `-ca` | casartelli, camprodón, canthus |
|
| 466 |
+
| `-re` | reitzenstein, reformata, reinando |
|
| 467 |
+
| `-de` | destruyir, denasalizadas, debucourt |
|
| 468 |
+
| `-ma` | marktes, matosinhos, marciac |
|
| 469 |
|
| 470 |
#### Productive Suffixes
|
| 471 |
| Suffix | Examples |
|
| 472 |
|--------|----------|
|
| 473 |
+
| `-s` | mourvilles, iliricas, mylonas |
|
| 474 |
+
| `-a` | cingüenda, lecinyena, reformata |
|
| 475 |
+
| `-as` | iliricas, mylonas, aeneas |
|
| 476 |
+
| `-os` | confrontatos, agnatos, estranios |
|
| 477 |
+
| `-es` | mourvilles, marktes, forbes |
|
| 478 |
|
| 479 |
### 6.3 Bound Stems (Lexical Roots)
|
| 480 |
|
|
|
|
| 482 |
|
| 483 |
| Stem | Cohesion | Substitutability | Examples |
|
| 484 |
|------|----------|------------------|----------|
|
| 485 |
+
| `ient` | 1.70x | 176 contexts | cient, oient, dient |
|
| 486 |
+
| `ento` | 1.74x | 126 contexts | sento, bento, cento |
|
| 487 |
+
| `rago` | 2.03x | 58 contexts | arago, trago, ragot |
|
| 488 |
+
| `ranc` | 1.64x | 141 contexts | franc, rance, ranca |
|
| 489 |
+
| `ació` | 2.09x | 47 contexts | nació, ación, fació |
|
| 490 |
+
| `enci` | 1.53x | 164 contexts | encia, renci, oencia |
|
| 491 |
+
| `obla` | 1.90x | 56 contexts | robla, pobla, nobla |
|
| 492 |
+
| `nter` | 1.50x | 146 contexts | anter, enter, inter |
|
| 493 |
+
| `ncia` | 1.72x | 61 contexts | encia, uncia, oencia |
|
| 494 |
+
| `cion` | 1.50x | 110 contexts | scion, nacion, accion |
|
| 495 |
+
| `idat` | 2.00x | 28 contexts | unidat, deidat, humidat |
|
| 496 |
+
| `mbre` | 1.55x | 75 contexts | ambre, ombre, umbre |
|
| 497 |
|
| 498 |
### 6.4 Affix Compatibility (Co-occurrence)
|
| 499 |
|
|
|
|
| 501 |
|
| 502 |
| Prefix | Suffix | Frequency | Examples |
|
| 503 |
|--------|--------|-----------|----------|
|
| 504 |
+
| `-co` | `-s` | 71 words | concilios, comenges |
|
| 505 |
+
| `-ca` | `-s` | 53 words | cabrinos, caracteres |
|
| 506 |
+
| `-ca` | `-a` | 49 words | cafeína, caixera |
|
| 507 |
+
| `-co` | `-a` | 49 words | cosida, conquiolina |
|
| 508 |
+
| `-ma` | `-s` | 41 words | mauriscus, mandos |
|
| 509 |
+
| `-ma` | `-a` | 36 words | mainila, mamma |
|
| 510 |
+
| `-re` | `-s` | 34 words | reprimius, rechiradors |
|
| 511 |
+
| `-re` | `-a` | 33 words | relochería, renacentista |
|
| 512 |
+
| `-de` | `-a` | 30 words | desidia, dentada |
|
| 513 |
+
| `-de` | `-s` | 30 words | demograficos, deverbativos |
|
| 514 |
|
| 515 |
### 6.5 Recursive Morpheme Segmentation
|
| 516 |
|
|
|
|
| 518 |
|
| 519 |
| Word | Suggested Split | Confidence | Stem |
|
| 520 |
|------|-----------------|------------|------|
|
| 521 |
+
| repoblatos | **`re-poblat-os`** | 6.0 | `poblat` |
|
| 522 |
+
| altoaragonesas | **`altoaragon-es-as`** | 6.0 | `altoaragon` |
|
| 523 |
+
| recullindo | **`re-cullindo`** | 4.5 | `cullindo` |
|
| 524 |
+
| reorganizar | **`re-organizar`** | 4.5 | `organizar` |
|
| 525 |
+
| romanticos | **`romantic-os`** | 4.5 | `romantic` |
|
| 526 |
+
| casellato | **`ca-sellato`** | 4.5 | `sellato` |
|
| 527 |
+
| discapacitatos | **`discapacitat-os`** | 4.5 | `discapacitat` |
|
| 528 |
+
| lexicales | **`lexical-es`** | 4.5 | `lexical` |
|
| 529 |
+
| monetarias | **`monetari-as`** | 4.5 | `monetari` |
|
| 530 |
+
| reprodución | **`re-produción`** | 4.5 | `produción` |
|
| 531 |
+
| deportaban | **`de-portaban`** | 4.5 | `portaban` |
|
| 532 |
+
| desconoixitas | **`de-sconoixit-as`** | 3.0 | `sconoixit` |
|
| 533 |
+
| caspolinas | **`ca-spolin-as`** | 3.0 | `spolin` |
|
| 534 |
+
| conservaderas | **`co-nservader-as`** | 3.0 | `nservader` |
|
| 535 |
+
| decimetros | **`de-cimetr-os`** | 3.0 | `cimetr` |
|
| 536 |
|
| 537 |
### 6.6 Linguistic Interpretation
|
| 538 |
|
|
|
|
| 764 |
---
|
| 765 |
*Generated by Wikilangs Models Pipeline*
|
| 766 |
|
| 767 |
+
*Report Date: 2026-01-03 17:05:39*
|
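The Alignment R@1 and R@10 columns in the README diff above report cross-lingual retrieval quality. The diff does not show how the pipeline computes them, so the following is only a minimal sketch of recall@k under an assumed setup: cosine-similarity nearest-neighbour retrieval against a gold mapping from each query vector to one target vector. The function names and the toy vectors are hypothetical and are not taken from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(queries, targets, gold, k):
    """Fraction of queries whose gold target index appears among the k most similar targets.

    queries, targets: lists of vectors (hypothetical aligned source / target embeddings)
    gold: gold[i] is the index in `targets` of the correct match for queries[i]
    """
    hits = 0
    for query, gold_index in zip(queries, gold):
        # Rank all target indices by descending similarity to the query.
        ranked = sorted(
            range(len(targets)),
            key=lambda j: cosine(query, targets[j]),
            reverse=True,
        )
        if gold_index in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy data, invented for illustration only.
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_targets = [[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0]]
toy_gold = [0, 1, 2]

r_at_1 = recall_at_k(toy_queries, toy_targets, toy_gold, k=1)
r_at_3 = recall_at_k(toy_queries, toy_targets, toy_gold, k=3)
```

Under this reading, the 0.3740 reported for `aligned_128d` would mean that the correct translation is the single nearest neighbour for about 37.4% of test pairs, and the 0.7380 for R@10 that it falls within the ten nearest neighbours for about 73.8%.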
models/embeddings/aligned/an_128d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1149568062
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ac84dbcff516b57e98d5bb2f38432f3dbb7fbcc5d376686ddcf84f5fe2c8ed90
|
| 3 |
size 1149568062
|
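Each binary in this commit is stored as a Git LFS pointer: a `version` line, an `oid sha256:<digest>` line, and a `size` line in bytes. As a hedged sketch (the helper names are hypothetical, and only the Python standard library is used), a pointer like the one above can be parsed and a downloaded file checked against it as follows:

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(pointer, data):
    """Return True if `data` (bytes) has the size and SHA-256 digest the pointer records."""
    return (
        pointer["algo"] == "sha256"
        and len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# The new-side pointer for models/embeddings/aligned/an_128d.bin, copied from the diff.
example_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ac84dbcff516b57e98d5bb2f38432f3dbb7fbcc5d376686ddcf84f5fe2c8ed90\n"
    "size 1149568062\n"
)

# Self-check on a small synthetic payload, since the real 1.1 GB file is not available here.
payload = b"hello"
synthetic_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + hashlib.sha256(payload).hexdigest() + "\n"
    "size " + str(len(payload)) + "\n"
)
payload_ok = matches_pointer(synthetic_pointer, payload)
```

For files as large as the embedding binaries, reading the whole file into memory is impractical; hashing in chunks with `hashlib.sha256().update(...)` would be the usual adjustment.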
models/embeddings/aligned/an_128d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 65664
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5c50c45cbc54bfd87537fecabc141a51b8cc24fab6a01562f7eab6f3f67380dd
|
| 3 |
size 65664
|
models/embeddings/aligned/an_32d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 288983358
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:044b1985e3203661353a60d5fc823ff5dcdf618f0d044f1c133ba2af374c3a98
|
| 3 |
size 288983358
|
models/embeddings/aligned/an_32d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 4224
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a78cf53ebf35034301a50aedca150893e531c05575b15e0fa980d7e00d38de78
|
| 3 |
size 4224
|
models/embeddings/aligned/an_64d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 575844926
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1125c90e30e6b620a58d0ab14de72ea4edb7f4ff2076ec610fe121f665628723
|
| 3 |
size 575844926
|
models/embeddings/aligned/an_64d.projection.npy
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 16512
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a1613a507b39a223bf7c8ea0a27567de19692f8961db6031ad1953e0aeac1b38
|
| 3 |
size 16512
|
models/embeddings/monolingual/an_128d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1149568062
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ac84dbcff516b57e98d5bb2f38432f3dbb7fbcc5d376686ddcf84f5fe2c8ed90
|
| 3 |
size 1149568062
|
models/embeddings/monolingual/an_32d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 288983358
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:044b1985e3203661353a60d5fc823ff5dcdf618f0d044f1c133ba2af374c3a98
|
| 3 |
size 288983358
|
models/embeddings/monolingual/an_64d.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 575844926
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1125c90e30e6b620a58d0ab14de72ea4edb7f4ff2076ec610fe121f665628723
|
| 3 |
size 575844926
|
models/subword_markov/an_markov_ctx1_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:418a8dd35bc3f35f4c9a9f8bd9e49a75ef9f982d2a681c378c9b1ab291b79e05
|
| 3 |
+
size 172370
|
models/subword_markov/an_markov_ctx2_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:64ec46893ee8e958fb992f8a0d6b1efbe094ad45c2bced5fab94ef6c4d5be0ae
|
| 3 |
+
size 926386
|
models/subword_markov/an_markov_ctx3_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c85ea7316bf72d0414be0d5c5a80ba3defafef47dd6564d73476a4f472ed3ccd
|
| 3 |
+
size 3820524
|
models/subword_markov/an_markov_ctx4_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:47995cbd03e3f597500e069be7d15ea5b0f3b8dc1809809a35fff9125930aa43
|
| 3 |
+
size 13273622
|
models/subword_ngram/an_2gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ea93eb082876f788ceac81f55884b7629479387260dac53859896c4df59aa158
|
| 3 |
+
size 93694
|
models/subword_ngram/an_3gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a4b994cdf96bfe61e18ebf7054adee26a9bed7de5843246f543ec78587c38b52
|
| 3 |
+
size 689250
|
models/subword_ngram/an_4gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c6577e2b6c3d47dd3140a98a011f6dced507fc1d8fedf05628f215cc75e72f0f
|
| 3 |
+
size 3361849
|
models/subword_ngram/an_5gram_subword.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:19eb72098a1a0f1357d659b77cb07df99d8c6d0d79d9a1a5a0d1e91653334361
|
| 3 |
+
size 10680494
|
models/tokenizer/an_tokenizer_16k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 511374
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:00b5360b1c6c573f0a26c6c9caa8f8616ad8b3590c412e044215e61b09ae45bc
|
| 3 |
size 511374
|
models/tokenizer/an_tokenizer_32k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 791864
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:af11636daefec89ab2ed283c1f92a1a80d6935eeb7b27eb28adc5ce1a6d61fc2
|
| 3 |
size 791864
|
models/tokenizer/an_tokenizer_64k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 1363420
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0b59aef8468601ed51d2b619bff9f61ab6385da57f1f5a42cc720807d71c8da8
|
| 3 |
size 1363420
|
models/tokenizer/an_tokenizer_8k.model
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 374577
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8826546ab7f635988b0fbd46cb99b29b0958ba8c0651006623da483a60e362bd
|
| 3 |
size 374577
|
models/word_markov/an_markov_ctx1_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:30996f7f6a3f5ddb43fa633991dcaae0b81892f02fbee1a2f05770048ecdcf48
|
| 3 |
+
size 23812223
|
models/word_markov/an_markov_ctx2_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:27f0b9160812153a28e0b058756dd952ab32f0301e008408a81a7b57503ce255
|
| 3 |
+
size 65243685
|
models/word_markov/an_markov_ctx3_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:df3fa39e60b02afcc490205e9561ff1be1ee90160d33ed2b69a5233b53751d46
|
| 3 |
+
size 104894050
|
models/word_markov/an_markov_ctx4_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c5116bdb5537880e6a2aace5c53eea87897150be7cb5c4b96eef2ccba19a2c81
|
| 3 |
+
size 132167575
|
models/word_ngram/an_2gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:05063ada0a28c7b63bdccdb07799f7bc8b489337a2361d7b3a631dd2fd65d936
|
| 3 |
+
size 3260032
|
models/word_ngram/an_3gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7948ddaa3f31659e4c68ba064ffff54008ef5201095f8ff9ea7f03e8f5c1e524
|
| 3 |
+
size 6869267
|
models/word_ngram/an_4gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:643b3056a717c8bce2f91de9f804d96ea3c0aa62a9621b1cc8daeae7a498e00c
|
| 3 |
+
size 14121384
|
models/word_ngram/an_5gram_word.parquet
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:92e8eda0653cd749f66c469b4dfa2cb5c19637bb90b6defb0be21b38d2e6c5b7
|
| 3 |
+
size 12899086
|
visualizations/embedding_alignment_quality.png
CHANGED
|
|
visualizations/embedding_isotropy.png
CHANGED
|
|
visualizations/embedding_norms.png
CHANGED
|
|
visualizations/embedding_similarity.png
CHANGED
|
|
visualizations/embedding_tsne_multilingual.png
CHANGED
|
|
visualizations/ngram_perplexity.png
CHANGED
|
|
visualizations/performance_dashboard.png
CHANGED
|
|
visualizations/position_encoding_comparison.png
CHANGED
|
|
visualizations/tsne_sentences.png
CHANGED
|
|
visualizations/tsne_words.png
CHANGED
|
|