omarkamali committed
Commit e4fdd3b · verified · 1 Parent(s): ecc905c

Upload all models and assets for an (latest)

Files changed (40)
  1. README.md +97 -96
  2. models/embeddings/aligned/an_128d.bin +1 -1
  3. models/embeddings/aligned/an_128d.projection.npy +1 -1
  4. models/embeddings/aligned/an_32d.bin +1 -1
  5. models/embeddings/aligned/an_32d.projection.npy +1 -1
  6. models/embeddings/aligned/an_64d.bin +1 -1
  7. models/embeddings/aligned/an_64d.projection.npy +1 -1
  8. models/embeddings/monolingual/an_128d.bin +1 -1
  9. models/embeddings/monolingual/an_32d.bin +1 -1
  10. models/embeddings/monolingual/an_64d.bin +1 -1
  11. models/subword_markov/an_markov_ctx1_subword.parquet +2 -2
  12. models/subword_markov/an_markov_ctx2_subword.parquet +2 -2
  13. models/subword_markov/an_markov_ctx3_subword.parquet +2 -2
  14. models/subword_markov/an_markov_ctx4_subword.parquet +2 -2
  15. models/subword_ngram/an_2gram_subword.parquet +2 -2
  16. models/subword_ngram/an_3gram_subword.parquet +2 -2
  17. models/subword_ngram/an_4gram_subword.parquet +2 -2
  18. models/subword_ngram/an_5gram_subword.parquet +2 -2
  19. models/tokenizer/an_tokenizer_16k.model +1 -1
  20. models/tokenizer/an_tokenizer_32k.model +1 -1
  21. models/tokenizer/an_tokenizer_64k.model +1 -1
  22. models/tokenizer/an_tokenizer_8k.model +1 -1
  23. models/word_markov/an_markov_ctx1_word.parquet +2 -2
  24. models/word_markov/an_markov_ctx2_word.parquet +2 -2
  25. models/word_markov/an_markov_ctx3_word.parquet +2 -2
  26. models/word_markov/an_markov_ctx4_word.parquet +2 -2
  27. models/word_ngram/an_2gram_word.parquet +2 -2
  28. models/word_ngram/an_3gram_word.parquet +2 -2
  29. models/word_ngram/an_4gram_word.parquet +2 -2
  30. models/word_ngram/an_5gram_word.parquet +2 -2
  31. visualizations/embedding_alignment_quality.png +0 -0
  32. visualizations/embedding_isotropy.png +0 -0
  33. visualizations/embedding_norms.png +0 -0
  34. visualizations/embedding_similarity.png +2 -2
  35. visualizations/embedding_tsne_multilingual.png +2 -2
  36. visualizations/ngram_perplexity.png +0 -0
  37. visualizations/performance_dashboard.png +2 -2
  38. visualizations/position_encoding_comparison.png +2 -2
  39. visualizations/tsne_sentences.png +2 -2
  40. visualizations/tsne_words.png +2 -2
README.md CHANGED
@@ -36,7 +36,7 @@ metrics:
36
  value: 4.275
37
  - name: best_isotropy
38
  type: isotropy
39
- value: 0.8230
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
@@ -99,32 +99,32 @@ We analyze tokenizers, n-gram models, Markov chains, vocabulary statistics, and
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
- **Sample 1:** `CA Monzón puet estar: O Centro Atlético Monzón. O Club de Fútbol Atlético de Mon...`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
- | 8k | `▁ca ▁monzón ▁puet ▁estar : ▁o ▁centro ▁at lético ▁monzón ... (+10 more)` | 20 |
107
- | 16k | `▁ca ▁monzón ▁puet ▁estar : ▁o ▁centro ▁atlético ▁monzón . ... (+8 more)` | 18 |
108
- | 32k | `▁ca ▁monzón ▁puet ▁estar : ▁o ▁centro ▁atlético ▁monzón . ... (+8 more)` | 18 |
109
- | 64k | `▁camonzón ▁puet ▁estar : ▁ocentroatléticomonzón . ... (+8 more)` | 18 |
110
 
111
- **Sample 2:** `En ista lista s'incluyen toz os presidents d'o Real Zaragoza dica hue: María Gay...`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
- | 8k | `▁en ▁ista ▁lista ▁s ' incluy en ▁tozospresid ... (+24 more)` | 34 |
116
- | 16k | `▁en ▁ista ▁lista ▁s ' incluyen ▁tozospresidentsd ... (+22 more)` | 32 |
117
- | 32k | `▁en ▁ista ▁lista ▁s ' incluyen ▁tozospresidentsd ... (+22 more)` | 32 |
118
- | 64k | `▁en ▁istalistas ' incluyen tozospresidentsd ... (+22 more)` | 32 |
119
 
120
- **Sample 3:** `Roitwalchen ye un lugar d'o municipio de Traunstein en o sud-este de Bavera, Ale...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
- | 8k | `▁ro it wal chen ▁yeunlugard ' o ... (+28 more)` | 38 |
125
- | 16k | `▁ro it wal chenye ▁unlugard ' o ... (+28 more)` | 38 |
126
- | 32k | `▁ro it wal chen ye ▁unlugard ' o ... (+28 more)` | 38 |
127
- | 64k | `▁ro it walchenye ▁unlugard ' o ▁municipio ... (+27 more)` | 37 |
128
 
129
 
130
  ### Key Findings
@@ -274,27 +274,27 @@ Below are text samples generated from each word-based Markov chain model:
274
 
275
  **Context Size 1:**
276
 
277
- 1. `de gúdar s j simpson cyril cusack georgi parvanov con a comarca d o mesmo significau`
278
- 2. `d a fundación d o comité d abril linear 8 d ixe paisache pocino de marzo`
279
- 3. `a circuscripción de linkedin pachina web autualment no la provincia de bixuteria fina cappa producti...`
280
 
281
  **Context Size 2:**
282
 
283
- 1. `d a provincia de beneviento 2 071 m o río xalón y provincia de zaragoza y atros`
284
- 2. `d o poro nucleyar afi komˈple k so di ˈi ˈpoɾo nuˈkliar ye per propio cualsiquier craba`
285
- 3. `en a pachina web oficial la procedencia d os mes aptos os caracters comuns con atras posesions`
286
 
287
  **Context Size 3:**
288
 
289
- 1. `a provincia de teruel comarca d a chacetania ye una d as prencipals actrices d o teatro y`
290
- 2. `d a provincia de castellón de la plana alta la suya población ye de 651 habitants en germany`
291
- 3. `una superficie de 62 99 km con una densidat de población de 8 hab km en ista localidat`
292
 
293
  **Context Size 4:**
294
 
295
- 1. `suya población ye de 100 habitants en una superficie de 9 77 km con una densidat de población de`
296
- 2. `en una superficie de 13 70 km con una densidat de población de 29 15 hab km cheografía a`
297
- 3. `d a provincia de cordoba ta atros usos se veiga o caire desambigación o caire u simplamnet caire الق...`
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
@@ -303,27 +303,27 @@ Below are text samples generated from each subword-based Markov chain model:
303
 
304
  **Context Size 1:**
305
 
306
- 1. `_e_ra_ce_aren_pi`
307
- 2. `al._iargo_rennye`
308
- 3. `ebalobalo,_opo_c`
309
 
310
  **Context Size 2:**
311
 
312
- 1. `a_cominelde_al_de`
313
- 2. `_d'a_s'esmarroixe`
314
- 3. `e_au_gottorioními`
315
 
316
  **Context Size 3:**
317
 
318
- 1. `_de_papezina_creye`
319
- 2. `de_heyemios_poblac`
320
- 3. `_d'africa_de_mayo»`
321
 
322
  **Context Size 4:**
323
 
324
- 1. `_de_l'anyos._ye_reg`
325
- 2. `_en_la_menclaude._v`
326
- 3. `_d'africantón_de_se`
327
 
328
 
329
  ### Key Findings
@@ -428,18 +428,18 @@ Below are text samples generated from each subword-based Markov chain model:
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
- | **mono_32d** | 32 | 0.8202 | 0.3403 | N/A | N/A |
432
- | **mono_64d** | 64 | 0.8230 🏆 | 0.2693 | N/A | N/A |
433
- | **mono_128d** | 128 | 0.8049 | 0.2129 | N/A | N/A |
434
- | **aligned_32d** | 32 | 0.8202 | 0.3495 | 0.1580 | 0.5000 |
435
- | **aligned_64d** | 64 | 0.8230 | 0.2690 | 0.2540 | 0.6280 |
436
- | **aligned_128d** | 128 | 0.8049 | 0.2090 | 0.3720 | 0.7380 |
437
 
438
  ### Key Findings
439
 
440
- - **Best Isotropy:** mono_64d with 0.8230 (more uniform distribution)
441
- - **Semantic Density:** Average pairwise similarity of 0.2750. Lower values indicate better semantic separation.
442
- - **Alignment Quality:** Aligned models achieve up to 37.2% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
@@ -461,19 +461,20 @@ These are the most productive prefixes and suffixes identified by sampling the v
461
  #### Productive Prefixes
462
  | Prefix | Examples |
463
  |--------|----------|
464
- | `-co` | comuni, concebitas, complices |
465
- | `-ca` | cassovia, cazadors, castieillo |
466
- | `-ma` | macromutacions, mariahilfkirche, maree |
467
- | `-re` | restauro, reumert, recomendable |
 
468
 
469
  #### Productive Suffixes
470
  | Suffix | Examples |
471
  |--------|----------|
472
- | `-s` | viescas, tomus, biggs |
473
- | `-a` | abaurrepea, telangana, actuaba |
474
- | `-as` | viescas, medas, rallatas |
475
- | `-os` | vientos, rasos, tecnolochicos |
476
- | `-es` | phoenicopteriformes, gabes, bolcheviques |
477
 
478
  ### 6.3 Bound Stems (Lexical Roots)
479
 
@@ -481,18 +482,18 @@ Bound stems are high-frequency subword units that are semantically cohesive but
481
 
482
  | Stem | Cohesion | Substitutability | Examples |
483
  |------|----------|------------------|----------|
484
- | `ento` | 1.72x | 126 contexts | rento, gento, sento |
485
- | `rago` | 2.03x | 58 contexts | crago, drago, ragot |
486
- | `ranc` | 1.60x | 141 contexts | rancó, ranch, rance |
487
- | `enci` | 1.55x | 164 contexts | renci, encia, encies |
488
- | `obla` | 1.95x | 56 contexts | nobla, robla, pobla |
489
- | `renc` | 1.77x | 82 contexts | renci, arenc, wrench |
490
- | `ació` | 2.02x | 47 contexts | nació, fació, ación |
491
- | `ient` | 1.50x | 176 contexts | oient, cient, aient |
492
- | `nter` | 1.55x | 146 contexts | anter, enter, unter |
493
- | `cion` | 1.56x | 110 contexts | scion, accion, nacion |
494
- | `ncia` | 1.69x | 61 contexts | encia, uncia, oencia |
495
- | `mbre` | 1.55x | 75 contexts | mbret, ambre, umbre |
496
 
497
  ### 6.4 Affix Compatibility (Co-occurrence)
498
 
@@ -500,16 +501,16 @@ This table shows which prefixes and suffixes most frequently co-occur on the sam
500
 
501
  | Prefix | Suffix | Frequency | Examples |
502
  |--------|--------|-----------|----------|
503
- | `-ca` | `-s` | 64 words | cavens, cabaixos |
504
- | `-co` | `-s` | 62 words | conejares, concordias |
505
- | `-co` | `-a` | 57 words | condeixa, cogota |
506
- | `-ma` | `-s` | 52 words | manganés, manus |
507
- | `-ca` | `-a` | 44 words | camberra, carola |
508
- | `-re` | `-s` | 33 words | relationships, refusadas |
509
- | `-re` | `-a` | 26 words | representa, relacionada |
510
- | `-ma` | `-a` | 23 words | maganza, magalona |
511
- | `-co` | `-as` | 19 words | concordias, contadas |
512
- | `-ca` | `-os` | 16 words | cabaixos, calibos |
513
 
514
  ### 6.5 Recursive Morpheme Segmentation
515
 
@@ -517,21 +518,21 @@ Using **Recursive Hierarchical Substitutability**, we decompose complex words in
517
 
518
  | Word | Suggested Split | Confidence | Stem |
519
  |------|-----------------|------------|------|
520
- | lombardes | **`lombard-es`** | 4.5 | `lombard` |
521
- | fanaticos | **`fanatic-os`** | 4.5 | `fanatic` |
522
- | retransmite | **`re-transmite`** | 4.5 | `transmite` |
523
- | dimitrios | **`dimitri-os`** | 4.5 | `dimitri` |
524
- | terroristas | **`terrorist-as`** | 4.5 | `terrorist` |
525
- | normandas | **`normand-as`** | 4.5 | `normand` |
526
- | castellan | **`ca-stellan`** | 4.5 | `stellan` |
527
- | coorganización | **`co-organización`** | 4.5 | `organización` |
528
- | tortosinas | **`tortosin-as`** | 4.5 | `tortosin` |
529
- | reportaje | **`re-portaje`** | 4.5 | `portaje` |
530
- | requiestas | **`re-quiest-as`** | 3.0 | `quiest` |
531
- | consumitos | **`co-nsumit-os`** | 3.0 | `nsumit` |
532
- | califatos | **`ca-lifat-os`** | 3.0 | `lifat` |
533
- | conservamos | **`co-nservam-os`** | 3.0 | `nservam` |
534
- | colonizatos | **`co-lonizat-os`** | 3.0 | `lonizat` |
535
 
536
  ### 6.6 Linguistic Interpretation
537
 
@@ -763,4 +764,4 @@ MIT License - Free for academic and commercial use.
763
  ---
764
  *Generated by Wikilangs Models Pipeline*
765
 
766
- *Report Date: 2026-01-03 14:50:06*
 
36
  value: 4.275
37
  - name: best_isotropy
38
  type: isotropy
39
+ value: 0.8232
40
  - name: vocabulary_size
41
  type: vocab
42
  value: 0
 
99
 
100
  Below are sample sentences tokenized with each vocabulary size:
101
 
102
+ **Sample 1:** `Bobadilla puet estar: Bobadilla, un municipio de La Rioja. Bobadilla del Campo, ...`
103
 
104
  | Vocab | Tokens | Count |
105
  |-------|--------|-------|
106
+ | 8k | `▁bob ad illa ▁puet ▁estar : ▁bob ad illa , ... (+17 more)` | 27 |
107
+ | 16k | `▁bob ad illa ▁puet ▁estar : ▁bob ad illa , ... (+17 more)` | 27 |
108
+ | 32k | `▁bob ad illa ▁puet ▁estar : ▁bob ad illa , ... (+17 more)` | 27 |
109
+ | 64k | `▁bobadilla ▁puet ▁estar : ▁bobadilla , unmunicipiode ▁la ... (+11 more)` | 21 |
110
 
111
+ **Sample 2:** `Charleville-Mézières ye una localidat y comuna francesa, capital d'o departament...`
112
 
113
  | Vocab | Tokens | Count |
114
  |-------|--------|-------|
115
+ | 8k | `▁char le ville - m é zi èresyeuna ... (+26 more)` | 36 |
116
+ | 16k | `▁char le ville - zi èresyeunalocalidat ... (+25 more)` | 35 |
117
+ | 32k | `▁char le ville - zi èresyeunalocalidat ... (+23 more)` | 33 |
118
+ | 64k | `▁charleville - mézières yeunalocalidatycomunafrancesa , ... (+19 more)` | 29 |
119
 
120
+ **Sample 3:** `Schöngeising (en bavaro Scheegeising) ye un municipio de Bavera, Alemanya. Se tr...`
121
 
122
  | Vocab | Tokens | Count |
123
  |-------|--------|-------|
124
+ | 8k | `▁sch ön ge is ing( en bavaros che ... (+29 more)` | 39 |
125
+ | 16k | `▁schön ge is ing( enbavarosche e ge ... (+25 more)` | 35 |
126
+ | 32k | `▁schön ge ising( enbavarosche e ge ising ... (+20 more)` | 30 |
127
+ | 64k | `▁schön ge ising( enbavarosche e ge ising ... (+20 more)` | 30 |
128
 
129
 
130
  ### Key Findings
 
274
 
275
  **Context Size 1:**
276
 
277
+ 1. `de las neveras y rfef aprebó a pachina web oficial d afers son asociadas con os`
278
+ 2. `d elba antiparte la provincia d as que se veiga torda collerada rafel vidaller tricas libro`
279
+ 3. `a rendición de sattler torna ta partecipar en ifriquiya y cariño homenage vasallage en aragonés vinc...`
280
 
281
  **Context Size 2:**
282
 
283
+ 1. `d a ciudat de zaragoza tomo i de castiella y leyón espanya o escritor de lausbubengeschichte ye`
284
+ 2. `d o reino se consolida la influyencia de l exercito estatounitesne en europa s extiende dende os`
285
+ 3. `en a provincia de teruel d o cual en fan parte 4 cantons y 129 comunas lista`
286
 
287
  **Context Size 3:**
288
 
289
+ 1. `a provincia de zaragoza en a provincia de concepción y d as tres serols estando dimpués enamplato a`
290
+ 2. `d a provincia de guipuzcua ta atros usos se veiga carlos ix carlos ix 27 de chunio de`
291
+ 3. `una superficie de 158 60 km y una densidat de población de 346 35 hab km a suya`
292
 
293
  **Context Size 4:**
294
 
295
+ 1. `suya población ye de 81 habitants en una superficie de 194 49 km con una densidat de población de`
296
+ 2. `en una superficie de 64 16 km con una densidat de población de 43 44 hab km demografía administració...`
297
+ 3. `d a provincia de burgos ta atros usos se veiga fort yuma desambigación fort yuma títol orichinal en ...`
298
 
299
 
300
  ### Generated Text Samples (Subword-based)
 
303
 
304
  **Context Size 1:**
305
 
306
+ 1. `_un_der_dent_ckm`
307
+ 2. `as_en_as_2_tacla`
308
+ 3. `en_lern_don_vitr`
309
 
310
  **Context Size 2:**
311
 
312
+ 1. `a_saus_dabinascer`
313
+ 2. `_derfica_sublosti`
314
+ 3. `e_manaisitau_suyo`
315
 
316
  **Context Size 3:**
317
 
318
+ 1. `_dens._val_novant,`
319
+ 2. `de_319_de_fuel,_qu`
320
+ 3. `_d'o_primetada_cic`
321
 
322
  **Context Size 4:**
323
 
324
+ 1. `_de_jean-jose_(naix`
325
+ 2. `_en_sido_per_bueno,`
326
+ 3. `_d'anglés_jean_sabi`
327
 
328
 
329
  ### Key Findings
 
428
 
429
  | Model | Dimension | Isotropy | Semantic Density | Alignment R@1 | Alignment R@10 |
430
  |-------|-----------|----------|------------------|---------------|----------------|
431
+ | **mono_32d** | 32 | 0.8168 | 0.3517 | N/A | N/A |
432
+ | **mono_64d** | 64 | 0.8232 🏆 | 0.2779 | N/A | N/A |
433
+ | **mono_128d** | 128 | 0.8044 | 0.2016 | N/A | N/A |
434
+ | **aligned_32d** | 32 | 0.8168 | 0.3524 | 0.1520 | 0.4840 |
435
+ | **aligned_64d** | 64 | 0.8232 | 0.2773 | 0.2480 | 0.6340 |
436
+ | **aligned_128d** | 128 | 0.8044 | 0.2034 | 0.3740 | 0.7380 |
437
 
438
  ### Key Findings
439
 
440
+ - **Best Isotropy:** mono_64d with 0.8232 (more uniform distribution)
441
+ - **Semantic Density:** Average pairwise similarity of 0.2774. Lower values indicate better semantic separation.
442
+ - **Alignment Quality:** Aligned models achieve up to 37.4% R@1 in cross-lingual retrieval.
443
  - **Recommendation:** 128d aligned for best cross-lingual performance
444
 
445
  ---
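The summary figures in the updated Key Findings can be cross-checked against the table rows. The sketch below assumes that "Semantic Density" in the findings is the plain mean of the six Semantic Density cells and that "up to 37.4% R@1" is the maximum of the Alignment R@1 column; the variable names are illustrative and not taken from the pipeline.

```python
from statistics import mean

# Semantic Density column of the updated embedding table in this commit.
semantic_density = {
    "mono_32d": 0.3517,
    "mono_64d": 0.2779,
    "mono_128d": 0.2016,
    "aligned_32d": 0.3524,
    "aligned_64d": 0.2773,
    "aligned_128d": 0.2034,
}

# Alignment R@1 column (reported for the aligned models only).
recall_at_1 = {
    "aligned_32d": 0.1520,
    "aligned_64d": 0.2480,
    "aligned_128d": 0.3740,
}

# Assumption: the reported density is an unweighted mean over all six models.
avg_density = round(mean(semantic_density.values()), 4)

# Best cross-lingual retrieval model and its R@1 as a percentage.
best_r1_model = max(recall_at_1, key=recall_at_1.get)
best_r1_pct = round(recall_at_1[best_r1_model] * 100, 1)

print(avg_density, best_r1_model, best_r1_pct)
```

Under these assumptions the table reproduces both the 0.2774 average and the 37.4% figure; the same arithmetic applied to the previous table's values gives the earlier 0.2750 and 37.2%.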
 
461
  #### Productive Prefixes
462
  | Prefix | Examples |
463
  |--------|----------|
464
+ | `-co` | confrontatos, conchecturau, coluche |
465
+ | `-ca` | casartelli, camprodón, canthus |
466
+ | `-re` | reitzenstein, reformata, reinando |
467
+ | `-de` | destruyir, denasalizadas, debucourt |
468
+ | `-ma` | marktes, matosinhos, marciac |
469
 
470
  #### Productive Suffixes
471
  | Suffix | Examples |
472
  |--------|----------|
473
+ | `-s` | mourvilles, iliricas, mylonas |
474
+ | `-a` | cingüenda, lecinyena, reformata |
475
+ | `-as` | iliricas, mylonas, aeneas |
476
+ | `-os` | confrontatos, agnatos, estranios |
477
+ | `-es` | mourvilles, marktes, forbes |
478
 
479
  ### 6.3 Bound Stems (Lexical Roots)
480
 
 
482
 
483
  | Stem | Cohesion | Substitutability | Examples |
484
  |------|----------|------------------|----------|
485
+ | `ient` | 1.70x | 176 contexts | cient, oient, dient |
486
+ | `ento` | 1.74x | 126 contexts | sento, bento, cento |
487
+ | `rago` | 2.03x | 58 contexts | arago, trago, ragot |
488
+ | `ranc` | 1.64x | 141 contexts | franc, rance, ranca |
489
+ | `ació` | 2.09x | 47 contexts | nació, ación, fació |
490
+ | `enci` | 1.53x | 164 contexts | encia, renci, oencia |
491
+ | `obla` | 1.90x | 56 contexts | robla, pobla, nobla |
492
+ | `nter` | 1.50x | 146 contexts | anter, enter, inter |
493
+ | `ncia` | 1.72x | 61 contexts | encia, uncia, oencia |
494
+ | `cion` | 1.50x | 110 contexts | scion, nacion, accion |
495
+ | `idat` | 2.00x | 28 contexts | unidat, deidat, humidat |
496
+ | `mbre` | 1.55x | 75 contexts | ambre, ombre, umbre |
497
 
498
  ### 6.4 Affix Compatibility (Co-occurrence)
499
 
 
501
 
502
  | Prefix | Suffix | Frequency | Examples |
503
  |--------|--------|-----------|----------|
504
+ | `-co` | `-s` | 71 words | concilios, comenges |
505
+ | `-ca` | `-s` | 53 words | cabrinos, caracteres |
506
+ | `-ca` | `-a` | 49 words | cafeína, caixera |
507
+ | `-co` | `-a` | 49 words | cosida, conquiolina |
508
+ | `-ma` | `-s` | 41 words | mauriscus, mandos |
509
+ | `-ma` | `-a` | 36 words | mainila, mamma |
510
+ | `-re` | `-s` | 34 words | reprimius, rechiradors |
511
+ | `-re` | `-a` | 33 words | relochería, renacentista |
512
+ | `-de` | `-a` | 30 words | desidia, dentada |
513
+ | `-de` | `-s` | 30 words | demograficos, deverbativos |
514
 
515
  ### 6.5 Recursive Morpheme Segmentation
516
 
 
518
 
519
  | Word | Suggested Split | Confidence | Stem |
520
  |------|-----------------|------------|------|
521
+ | repoblatos | **`re-poblat-os`** | 6.0 | `poblat` |
522
+ | altoaragonesas | **`altoaragon-es-as`** | 6.0 | `altoaragon` |
523
+ | recullindo | **`re-cullindo`** | 4.5 | `cullindo` |
524
+ | reorganizar | **`re-organizar`** | 4.5 | `organizar` |
525
+ | romanticos | **`romantic-os`** | 4.5 | `romantic` |
526
+ | casellato | **`ca-sellato`** | 4.5 | `sellato` |
527
+ | discapacitatos | **`discapacitat-os`** | 4.5 | `discapacitat` |
528
+ | lexicales | **`lexical-es`** | 4.5 | `lexical` |
529
+ | monetarias | **`monetari-as`** | 4.5 | `monetari` |
530
+ | reprodución | **`re-produción`** | 4.5 | `produción` |
531
+ | deportaban | **`de-portaban`** | 4.5 | `portaban` |
532
+ | desconoixitas | **`de-sconoixit-as`** | 3.0 | `sconoixit` |
533
+ | caspolinas | **`ca-spolin-as`** | 3.0 | `spolin` |
534
+ | conservaderas | **`co-nservader-as`** | 3.0 | `nservader` |
535
+ | decimetros | **`de-cimetr-os`** | 3.0 | `cimetr` |
536
 
537
  ### 6.6 Linguistic Interpretation
538
 
 
764
  ---
765
  *Generated by Wikilangs Models Pipeline*
766
 
767
+ *Report Date: 2026-01-03 17:05:39*
models/embeddings/aligned/an_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fd4098fa2ab7a020113c01a17200837ba8ac73638e61c8edb659c85aaef93fc
3
  size 1149568062
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ac84dbcff516b57e98d5bb2f38432f3dbb7fbcc5d376686ddcf84f5fe2c8ed90
3
  size 1149568062
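Each of the binary-file hunks in this commit is a diff of a three-line Git LFS pointer (`version`, `oid`, `size`) rather than of the file itself. As a minimal sketch, the helper below (`parse_lfs_pointer` is a hypothetical name, not part of any Git LFS tooling) parses the two pointer versions shown above for `an_128d.bin` and compares them.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Old and new pointers for models/embeddings/aligned/an_128d.bin, copied from the diff.
old = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:5fd4098fa2ab7a020113c01a17200837ba8ac73638e61c8edb659c85aaef93fc\n"
    "size 1149568062\n"
)
new = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ac84dbcff516b57e98d5bb2f38432f3dbb7fbcc5d376686ddcf84f5fe2c8ed90\n"
    "size 1149568062\n"
)

content_changed = old["digest"] != new["digest"]  # different hash, so different bytes
size_delta = new["size"] - old["size"]            # byte-size difference between versions

print(content_changed, size_delta)
```

For this file the digest changes while the size stays the same, which is why the hunk is counted as `+1 -1`; the parquet files change both lines and are counted as `+2 -2`.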
models/embeddings/aligned/an_128d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cd99fd2982f86d44f39d6905f1bab4d510cd2a435654c9497853d449b81da158
3
  size 65664
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5c50c45cbc54bfd87537fecabc141a51b8cc24fab6a01562f7eab6f3f67380dd
3
  size 65664
models/embeddings/aligned/an_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8162f33f0e10366ed7d3fdfb8fa216d05dbd6d4a16b2198859372aa289d98b14
3
  size 288983358
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:044b1985e3203661353a60d5fc823ff5dcdf618f0d044f1c133ba2af374c3a98
3
  size 288983358
models/embeddings/aligned/an_32d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:353d511257321af48a3269657d78455f93120c5fcf17d70f5f3b6de72a95f856
3
  size 4224
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a78cf53ebf35034301a50aedca150893e531c05575b15e0fa980d7e00d38de78
3
  size 4224
models/embeddings/aligned/an_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:95a4be660051d6fcd8c79a595fd6237cd7f54790bfab31f65b7a8f9c2533a8e2
3
  size 575844926
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1125c90e30e6b620a58d0ab14de72ea4edb7f4ff2076ec610fe121f665628723
3
  size 575844926
models/embeddings/aligned/an_64d.projection.npy CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0a69faaf2fde42a4fb0c85609a8ab9ff29cffe1093a3932f0aa3d238f32d74f9
3
  size 16512
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a1613a507b39a223bf7c8ea0a27567de19692f8961db6031ad1953e0aeac1b38
3
  size 16512
models/embeddings/monolingual/an_128d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fd4098fa2ab7a020113c01a17200837ba8ac73638e61c8edb659c85aaef93fc
3
  size 1149568062
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ac84dbcff516b57e98d5bb2f38432f3dbb7fbcc5d376686ddcf84f5fe2c8ed90
3
  size 1149568062
models/embeddings/monolingual/an_32d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:8162f33f0e10366ed7d3fdfb8fa216d05dbd6d4a16b2198859372aa289d98b14
3
  size 288983358
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:044b1985e3203661353a60d5fc823ff5dcdf618f0d044f1c133ba2af374c3a98
3
  size 288983358
models/embeddings/monolingual/an_64d.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:95a4be660051d6fcd8c79a595fd6237cd7f54790bfab31f65b7a8f9c2533a8e2
3
  size 575844926
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1125c90e30e6b620a58d0ab14de72ea4edb7f4ff2076ec610fe121f665628723
3
  size 575844926
models/subword_markov/an_markov_ctx1_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:82737fda9661e7c5ecf12fa0a67a2d2bcc18af9b5e9e30a7b0158eef5c952909
3
- size 176131
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:418a8dd35bc3f35f4c9a9f8bd9e49a75ef9f982d2a681c378c9b1ab291b79e05
3
+ size 172370
models/subword_markov/an_markov_ctx2_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f0fc7fa0b6576fa1c527465494c2d88de956a333230c36d23818149db6ffd093
3
- size 938570
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:64ec46893ee8e958fb992f8a0d6b1efbe094ad45c2bced5fab94ef6c4d5be0ae
3
+ size 926386
models/subword_markov/an_markov_ctx3_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:002be2f1eb632f8e6574678f853e93f4bbca357d8fe282182dc11766db2955b4
3
- size 3816590
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c85ea7316bf72d0414be0d5c5a80ba3defafef47dd6564d73476a4f472ed3ccd
3
+ size 3820524
models/subword_markov/an_markov_ctx4_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1a8a70c3406dca6ea2afc3bd1a4133f81559c04cc64ebb35b40bd88bf0512e2e
3
- size 13281906
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:47995cbd03e3f597500e069be7d15ea5b0f3b8dc1809809a35fff9125930aa43
3
+ size 13273622
models/subword_ngram/an_2gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0f6ff5966a58dcbf3f78684a10d07bc43305bed96d2231b9a562c78ffe158f5e
3
- size 93856
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea93eb082876f788ceac81f55884b7629479387260dac53859896c4df59aa158
3
+ size 93694
models/subword_ngram/an_3gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:2b9410a959938ec2ee19d3d785fdbe7c03b9888b017ac84c0907cc567cc67382
3
- size 690365
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4b994cdf96bfe61e18ebf7054adee26a9bed7de5843246f543ec78587c38b52
3
+ size 689250
models/subword_ngram/an_4gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:54766951ed2a769fc5227d30241529307ad56476f75a1c6ded8dfd17cfd005c6
3
- size 3361576
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c6577e2b6c3d47dd3140a98a011f6dced507fc1d8fedf05628f215cc75e72f0f
3
+ size 3361849
models/subword_ngram/an_5gram_subword.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b2d2a42c46ff9f023d7a1e77f90126884ff7127984bdea7beb92f1a4ea9ba286
3
- size 10691226
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:19eb72098a1a0f1357d659b77cb07df99d8c6d0d79d9a1a5a0d1e91653334361
3
+ size 10680494
models/tokenizer/an_tokenizer_16k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:be4e4495dd91b35e24a5508fe0c746604c258fa5e24d680caa9984c3edca42a6
3
  size 511374
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:00b5360b1c6c573f0a26c6c9caa8f8616ad8b3590c412e044215e61b09ae45bc
3
  size 511374
models/tokenizer/an_tokenizer_32k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fa703683fed7866aee835ffdc13a44b7a714de834a38b62b70558897787c1fb
3
  size 791864
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:af11636daefec89ab2ed283c1f92a1a80d6935eeb7b27eb28adc5ce1a6d61fc2
3
  size 791864
models/tokenizer/an_tokenizer_64k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f1989e3ff85462c98efc25880434afda45544dd0dff0bda857ccfabf9d8725dc
3
  size 1363420
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0b59aef8468601ed51d2b619bff9f61ab6385da57f1f5a42cc720807d71c8da8
3
  size 1363420
models/tokenizer/an_tokenizer_8k.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:24cd47460a4e12244cf3b7f7de3902be403e0466c77da2537d5c747ecd2e8c4d
3
  size 374577
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8826546ab7f635988b0fbd46cb99b29b0958ba8c0651006623da483a60e362bd
3
  size 374577
models/word_markov/an_markov_ctx1_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7b2b7d443bdb24601458b682628e19a693091fcc060312b8c29b8d0aeca42d42
3
- size 23695943
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:30996f7f6a3f5ddb43fa633991dcaae0b81892f02fbee1a2f05770048ecdcf48
3
+ size 23812223
models/word_markov/an_markov_ctx2_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f8a94e65ec2e1cbb3239d99357349f5de3f99c8f328da97ec36993f6bcaf9b2c
3
- size 65492964
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:27f0b9160812153a28e0b058756dd952ab32f0301e008408a81a7b57503ce255
3
+ size 65243685
models/word_markov/an_markov_ctx3_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:119d049347e9ea06e047d1c592bd25f448fb0db787357a8e2dde7fb81bdabf9a
3
- size 104857909
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:df3fa39e60b02afcc490205e9561ff1be1ee90160d33ed2b69a5233b53751d46
3
+ size 104894050
models/word_markov/an_markov_ctx4_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cf40f71c21fa590cd93a8da91c56eacb62e024b3a99a039bea79c763d3f480a5
3
- size 132108456
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c5116bdb5537880e6a2aace5c53eea87897150be7cb5c4b96eef2ccba19a2c81
3
+ size 132167575
models/word_ngram/an_2gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:c5767d5baec0276bb908cf29398dfa5499b6be1cc66006cd5708bac7b32e1a23
3
- size 3254179
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:05063ada0a28c7b63bdccdb07799f7bc8b489337a2361d7b3a631dd2fd65d936
3
+ size 3260032
models/word_ngram/an_3gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:32e5bc99c871366cc3acf00709b843ccb6ece85dff6f51fc4716cca234b59a2d
3
- size 6891480
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7948ddaa3f31659e4c68ba064ffff54008ef5201095f8ff9ea7f03e8f5c1e524
3
+ size 6869267
models/word_ngram/an_4gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5783556da79bc42a443ee21b9fcee41427cdd2ae4de0f99a8ee08fd84e0bca87
3
- size 14120282
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:643b3056a717c8bce2f91de9f804d96ea3c0aa62a9621b1cc8daeae7a498e00c
3
+ size 14121384
models/word_ngram/an_5gram_word.parquet CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6470d74718860413111212c0cf7d05e0f6074fa1f9835d5e95d0ff11cd0c4f45
3
- size 12888396
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:92e8eda0653cd749f66c469b4dfa2cb5c19637bb90b6defb0be21b38d2e6c5b7
3
+ size 12899086
visualizations/embedding_alignment_quality.png CHANGED
visualizations/embedding_isotropy.png CHANGED
visualizations/embedding_norms.png CHANGED
visualizations/embedding_similarity.png CHANGED

Git LFS Details

  • SHA256: 277961f2ebb8d82f5a370d964a5972ff00f7ebcd4c67af5dd811a67bd9224016
  • Pointer size: 131 Bytes
  • Size of remote file: 146 kB
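The 131-byte pointer size is consistent with the three-line pointer format used elsewhere in this commit. The sketch below assumes the pointer consists of exactly three newline-terminated lines and that the remote file's byte count has six digits; the exact count is not shown on the page, so `146_000` is a placeholder chosen only to match "146 kB", and `lfs_pointer_text` is a hypothetical helper.

```python
def lfs_pointer_text(digest_hex: str, size: int) -> str:
    """Build the text of a Git LFS pointer with a SHA-256 object id."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest_hex}\n"
        f"size {size}\n"
    )


# Digest copied from the details block above.
digest = "277961f2ebb8d82f5a370d964a5972ff00f7ebcd4c67af5dd811a67bd9224016"

# Placeholder size (assumption): any six-digit byte count gives the same pointer length.
pointer = lfs_pointer_text(digest, 146_000)
pointer_bytes = len(pointer.encode("utf-8"))

print(pointer_bytes)
```

The version line contributes 43 bytes, the oid line 76, and the size line 12 for a six-digit size, which adds up to the reported 131.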

Git LFS Details

  • SHA256: 6ae77173bc4fff940cc8569c7a4003654f5bc9c6ac4f201985293d9b6a3a2bcc
  • Pointer size: 131 Bytes
  • Size of remote file: 144 kB
visualizations/embedding_tsne_multilingual.png CHANGED

Git LFS Details

  • SHA256: dc5084370d9df7dcdeec114c0259e49e40e1142cbd66197df4f2d49a382b48f0
  • Pointer size: 131 Bytes
  • Size of remote file: 259 kB

Git LFS Details

  • SHA256: 4dd5a7f14e93d4be0df40aba8728f2c46e8be657ec8c38c646f16fdccf8dd193
  • Pointer size: 131 Bytes
  • Size of remote file: 246 kB
visualizations/ngram_perplexity.png CHANGED
visualizations/performance_dashboard.png CHANGED

Git LFS Details

  • SHA256: ef899c6687946afd9a501fefe6d55d020ea199123fdcb1f214343fbbee5d6e3e
  • Pointer size: 131 Bytes
  • Size of remote file: 365 kB

Git LFS Details

  • SHA256: 2c944ba9d4ec799f39c36bb99e917cb001ac80419a318a497d409f26dcc80368
  • Pointer size: 131 Bytes
  • Size of remote file: 365 kB
visualizations/position_encoding_comparison.png CHANGED

Git LFS Details

  • SHA256: 6592ca9237483f2a4247eca6db017047ea0214a0b8f1f2a281c9866703b432c9
  • Pointer size: 131 Bytes
  • Size of remote file: 116 kB

Git LFS Details

  • SHA256: bcb811aa6703849989320bbe7848fa926224a0a17f48f9e9b4ff148ae6053f03
  • Pointer size: 131 Bytes
  • Size of remote file: 117 kB
visualizations/tsne_sentences.png CHANGED

Git LFS Details

  • SHA256: 2f408364d335fe991e423167b4a0f5060be3b4e7e168a8e51a7a7f7fdd3c7c3e
  • Pointer size: 131 Bytes
  • Size of remote file: 283 kB

Git LFS Details

  • SHA256: ed8924cec141d02af5ad2a8e776d81045bf61e3bf4d1ea87a08d86545b838897
  • Pointer size: 131 Bytes
  • Size of remote file: 281 kB
visualizations/tsne_words.png CHANGED

Git LFS Details

  • SHA256: a5dede5a264913b6c6f9c98c41732cbbaf59badf61eb2e28abf9585a0417e173
  • Pointer size: 131 Bytes
  • Size of remote file: 664 kB

Git LFS Details

  • SHA256: 2d259645f9ec755c7485cde818ed94d33300c4ac8e88c3ea7696744d1b3a0877
  • Pointer size: 131 Bytes
  • Size of remote file: 665 kB