Check out our interactive website: The XPF Corpus
The preliminary manual of the corpus can be found here.
./Code contains the scripts needed to obtain phoneme translation statistics. ./Data contains language-specific information: each language's profile and phonemic grammar. ./docs contains only the files needed for the website. ./Guidelines and ./Manual contain documentation on the corpus and its curation.

| Language Code | Language (click for info) | Reason (more thorough explanation in Rmd files) | Comments |
|---|---|---|---|
| acr | Rabinal Achi' | suspect marking of vowel length | lacks lenition |
| ake | Akawaio | conflation between voiceless and voiced consonants | |
| amp | Alamblak | conflation between /ɘ/ and /o/ | |
| aoj | Mufian | conflation among vowels; ambiguity regarding vowel length and labialized consonant clusters | lacks lenition |
| ar | Arabic | ambiguous transcription of alif; conflation between vowels and glides | |
| arn | Mapudungun | ambiguous orthography; conflation between dental and alveolar consonants | |
| awx | Awara | conflation between /nd/, /mb/, /nɡ/ and /d/, /b/, /ɡ/, respectively | |
| bcl | Central Bikol | inconsistent marking of glottal stops | lacks lenition |
| bmu | Somba Siawari | phonetic alphabet | |
| btx | Batak Karo | conflation among /e/, /ɘ/, and /ɯ/ | |
| bzd | Bribri | phonetic alphabet; contradicting documentation | |
| bzh | Mapos Buang | conflation between /ɛ/ and other vowels | |
| ca | Catalan | conflation among vowels and glides; ambiguous phonological interpretations | |
| cav | Cavineña | ambiguity whether a digraph represents one phoneme or two, depending on syllable structure | lacks lenition |
| chf | Tabasco Chontal | conflation between ejectives and stop-glottal stop sequences | |
| chm | Mari | conflation with some palatalized and non-palatalized consonants; some vowels not always represented orthographically | lacks lenition |
| cho | Choctaw | phonetic alphabet | |
| cni | Asháninka | conflation among nasals | |
| cof | Colorado | orthographic ambiguity with glottal stops | |
| con | Cofan | conflation between consonants | |
| crm | Moose Cree | /h/ represented only when contrast is required | lacks lenition |
| dyo | Jola-Fogny | uncertainty around the marking of +ATR vowels | lacks lenition |
| es | Spanish | non-transparent transcription of diphthongs | |
| fuv | Nigerian Fulfulde | inconsistent marking of glottal stops; unclear transcription of palatalized glottal stop | |
| hi | Hindi | conflation between /æ/ and /ɛ/; vowel nasalization ambiguity; unreliable marking of some consonants | |
| id | Indonesian | conflation between /e/ and /ə/ | |
| ixl | Ixil | word-initial glottal stop not always marked; somewhat ambiguous orthography | |
| kea | Cape Verdean Creole | possible conflation between /a/ and /ɐ/, /e/ and /ɛ/, and /ɾ/ and /ʀ/ | lacks lenition |
| kek | Qeqchi | ambiguity between ejective stops and stop-glottal stop sequences | |
| kk | Kazakh | conflation between vowels and glides; widely contradicting phonological accounts of the language | |
| kmo | Kwoma | non-transparent transcription of glottal stops | |
| kyz | Kayabí | conflation between /i/ and /j/ | lacks lenition |
| mcf | Matsés | conflation between alveolar and retroflex consonants; conflation between vowels | |
| mek | Mekeo | non-transparent transcription of glottal stops | |
| mfe | Morisyen | highly suspect orthography; conflation among consonants | |
| ml | Malayalam | conflation between dental and alveolar /n/ | |
| mlp | Bargam | conflation between /n/ and /ŋ/ | lacks lenition |
| mnb | Muna | suspect orthography | |
| mpx | Misima-Panaeati | conflation between /e/ and /ɛ/ and between /o/ and /ɔ/ | lacks lenition |
| mt | Maltese | conflation between /ts/ and /dz/ and between /ʃ/ and /ʒ/ | |
| myv | Erzya | conflation between /n/ and /ŋ/ | lacks lenition |
| ne | Nepali | certain diacritics used interchangeably and inconsistently marked | |
| not | Nomatsiguenga | conflation among nasals | |
| or | Oriya | certain diacritics used interchangeably and inconsistently marked | |
| os | Ossetic | conflation among /u/, /w/, and /ʷ/; inconsistent marking of consonant gemination | |
| pag | Pangasinan | possible conflation between /ŋ/ and /nɡ/ | |
| pib | Yine | conflation between /n/ and /h̃/ | lacks lenition |
| plu | Palikúr | conflation between /ɡ/ and /ɣ/ | |
| qub | Huallaga Huanuco Quechua | suspect orthography; conflation between vowels and glides | |
| rwo | Rawa | conflation between /l/ and /r/ | |
| sah | Yakut | conflation between /j/ and /j̃/ | |
| sk | Slovak | non-transparent transcription of palatal consonants; ambiguity whether digraphs represent one phoneme or two | |
| sm | Samoan | marking of long vowels and glottal stops is suspect | |
| suz | Sunwar | conflation between /ɾ/, /ɭ/, and possibly /l̪/; inconsistent marking of glottal stops | |
| sw | Swahili | conflation between syllabic nasals and non-syllabic counterparts | |
| too | Xicotepec de Juárez Totonac | suspect transcription due to unclear documentation | |
| tpp | Pisaflores Tepehua | suspect marking of vowel length | |
| tzj | Tz'utujil | uncertainty around the marking of the glottal stop and the orthography | |
| tzm | Central Atlas Tamazight | conflation between /l̪/ and /l̪ˤ/, and between /ʒ/ and /ʒˀ/ | |
| wmw | Mwani | conflation between syllabic nasals and prenasalized stops | lacks lenition |
| zsm | Standard Malay | conflation between /e/ and /ə/; conflicting orthographies | |
| zza | Zaza | conflicting orthographies; conflation among vowels | |

| Language Code | Language | Reason |
|---|---|---|
| ace | Acehnese | non-transparent transcription of vowel nasalization |
| ach | Acholi | non-transparent transcription of tones |
| acu | Achuar-Shiwiar | non-transparent transcription of vowel nasalization |
| adh | Adhola | non-transparent transcription of tones |
| af | Afrikaans | non-transparent transcription of vowels, vowel length, and diphthongs |
| agd | Agarabi | non-transparent transcription of tones |
| agm | Angaataha | non-transparent transcription of tones |
| agr | Aguaruna | non-transparent transcription of vowel nasalization |
| ak | Akan | non-transparent transcription of tones |
| alq | Algonquin | non-transparent transcription of vowel length |
| am | Amharic | non-transparent transcription of consonant gemination |
| anv | Denya | non-transparent transcription of tones |
| as | Assamese | non-transparent transcription of vowels |
| aso | Dano | non-transparent transcription of tones |
| avt | Avar | non-transparent transcription of consonant gemination |
| ban | Bali | non-standardized orthography |
| bem | Bemba | non-transparent transcription of tones |
| bba | Bariba | non-transparent transcription of tones |
| bcw | Bana | non-transparent transcription of tones |
| bhl | Bimin | non-transparent transcription of tones |
| bm | Bambara | non-transparent transcription of tones |
| bmr | Muinane | non-transparent transcription of tones |
| bs | Bosnian | non-transparent transcription of vowel length and tones |
| bsn | Barasana-Eduria | non-transparent transcription of tones |
| bua | Buryat | non-transparent transcription of palatalization |
| byr | Baruya | non-transparent transcription of tones |
| cao | Chácobo | non-transparent transcription of tones |
| cax | Chiquitano | non-transparent transcription of vowel nasalization |
| cbc | Carapan | non-transparent transcription of tones |
| ce | Chechen | non-transparent transcription of vowel length |
| ceb | Cebuano | non-transparent transcription of vowel length |
| chr | Cherokee | non-transparent transcription of vowel length |
| cwk | Western Kaqchikel | non-transparent transcription of vowels |
| cnh | Haka Chin | non-transparent transcription of tones |
| coe | Koreguaja | non-transparent transcription of tones |
| ctd | Tedim Chin | non-transparent transcription of tones |
| cub | Cubeo | non-transparent transcription of tones |
| cuk | San Blas Kuna | non-transparent transcription |
| cy | Welsh | non-transparent transcription of vowel length |
| da | Danish | non-transparent transcription of vowels |
| daa | Dangaléat | non-transparent transcription of tones |
| des | Desano | non-transparent transcription of tones |
| dgo | Dogri | non-transparent transcription of tones |
| din | Dinka | non-transparent transcription of tones |
| dts | Toro So Dogon | non-transparent transcription of tones |
| dz | Dzongkha | non-transparent transcription |
| ee | Ewe | non-transparent transcription of tones |
| efi | Efik | non-transparent transcription of tones |
| emp | Northern Emberá | non-transparent transcription |
| enb | Markweeta | non-transparent transcription of tones |
| enq | Enga | non-transparent transcription of tones |
| et | Estonian | non-transparent transcription of contrastive syllable length |
| faa | Fasu | non-transparent transcription of tones |
| fi | Finnish | non-transparent transcription |
| fj | Fijian | non-transparent transcription of vowel length |
| fo | Faroese | non-transparent transcription of vowels |
| for | Fore | non-transparent transcription of tones |
| fur | Friulian | non-transparent transcription of vowels |
| fy | Frisian | non-transparent transcription of vowels |
| ga | Irish | non-transparent transcription |
| gah | Alekano | non-transparent transcription of tones |
| gd | Scottish Gaelic | non-transparent transcription of consonants and vowels |
| gl | Galician | non-transparent transcription |
| gmo | Gamo-Gofa-Dawro | three languages understood to be linguistically separate |
| grb | Grebo | non-transparent transcription of tones |
| grt | Garo | non-transparent transcription of vowels |
| gub | Guajajara | non-transparent transcription of vowel nasalization |
| gum | Guambiano | non-standardized orthography |
| gur | Farefare | non-transparent transcription of tones |
| gv | Manx Gaelic | non-transparent transcription of consonants and vowels |
| ha | Hausa | non-transparent transcription of vowel length |
| hbs | Serbo-Croatian | non-transparent transcription of tones |
| hch | Huichol | non-transparent transcription of tones |
| heh | Hehe | non-transparent transcription of tones |
| hr | Croatian | non-transparent transcription of vowel length |
| hub | Huambisa | non-transparent transcription of vowel nasalization |
| hui | Huli | non-transparent transcription of tones |
| huv | Huave | inconsistent phonological documentation |
| hz | Herero | non-transparent transcription of tones |
| ig | Igbo | non-transparent transcription of tones |
| ik | Inupiaq | insufficient tokens |
| is | Icelandic | non-transparent transcription of vowel length |
| jiv | Shuar | non-transparent transcription of vowel nasalization |
| kab | Kabyle | non-transparent transcription of consonants |
| kac | Jingpho | non-transparent transcription of tones |
| kaq | Capanahua | non-transparent transcription of tones |
| kbc | Kadiweu | non-transparent transcription of consonant gemination |
| kbr | Kafa | non-transparent transcription of tones |
| kha | Khasi | non-transparent transcription of vowel length |
| khk | Khalkha Mongolian | non-transparent transcription of vowels |
| ki | Gikuyu | non-transparent transcription of tones |
| kj | Kwanyama | non-transparent transcription of tones |
| kjs | East Kewa | non-transparent transcription of tones |
| kew | West Kewa | non-transparent transcription of tones |
| kmr | Northern Kurdish | non-transparent transcription of consonants |
| kmu | Kanite | non-transparent transcription of tones |
| ksd | Kuanua | non-transparent transcription of vowel length |
| kus | Kusaal | non-transparent transcription of tones and vowel length |
| kw | Cornish | non-transparent transcription of vowel length |
| lac | Lacandon | non-transparent transcription of vowel length |
| lb | Luxembourgish | non-transparent transcription of vowels |
| lef | Lelemi | non-transparent transcription of tones |
| lg | Luganda | non-transparent transcription of tones |
| ln | Lingala | non-transparent transcription of tones |
| loz | Lozi | non-transparent transcription of tones |
| lt | Lithuanian | non-transparent transcription of tones |
| luo | Dholuo | non-transparent transcription of tones |
| lus | Mizo | non-transparent transcription of tones |
| lv | Latvian | non-transparent transcription of tones |
| lvs | Standard Latvian | non-transparent transcription of tones |
| lwo | Luwo | non-transparent transcription of tones and breathy vowels |
| man | Mandingo | non-transparent transcription of tones |
| mas | Maasai | insufficient tokens |
| mcb | Machiguenga | non-transparent transcription of tones |
| mcd | Sharanahua | non-transparent transcription of tones |
| meu | Motu | non-transparent transcription of vowel length |
| mfi | Wandala | non-transparent transcription of tones |
| mfz | Mabaan | non-transparent transcription of tones |
| mhr | Eastern Mari | non-transparent transcription of palatalization |
| mi | Maori | non-transparent transcription of vowel length |
| miq | Miskito | non-transparent transcription of vowel nasalization and length |
| mni | Meitei | non-transparent transcription of tones |
| mos | Mossi | non-transparent transcription of tones |
| mps | Dadibi | non-transparent transcription of tones and vowel nasalization |
| mpt | Mian | non-transparent transcription of tones |
| ms | Malay | non-transparent transcription of vowels |
| my | Burmese | non-transparent transcription of tones |
| myu | Mundurukú | non-transparent transcription of tones and creaky vowels |
| myy | Macuna | non-transparent transcription of tones |
| nd | Northern Ndebele | insufficient tokens |
| nds | Low Saxon | non-transparent transcription |
| nfr | Nafaanra | non-transparent transcription of tones |
| nhg | Tetelcingo Nahuatl | non-transparent transcription of vowel length |
| no | Norwegian | non-transparent transcription of tones and vowel length |
| ntp | Northern Tepehuan | non-transparent transcription of tones |
| nv | Navajo | non-transparent transcription of vowel nasalization |
| ny | Chichewa | non-transparent transcription of tones |
| nyn | Nyankore | non-transparent transcription of tones |
| om | Oromo | non-transparent transcription of tones |
| opm | Oksapmin | non-transparent transcription of vowels |
| ood | Tohono O'odham | non-transparent transcription |
| ots | Estado de México Otomi | non-transparent transcription of tones |
| pab | Parecís | non-transparent transcription of vowel length and nasalization |
| pao | Northern Paiute | non-transparent transcription of vowel length |
| pap | Papiamentu | non-transparent transcription of vowels |
| pir | Wanano | non-transparent transcription of tones |
| pl | Polish | non-transparent transcription |
| pms | Piedmontese | non-transparent transcription |
| poh | Poqomchi' | insufficient documentation |
| rw | Kinyarwanda | non-transparent transcription of tones and vowel length |
| sd | Sindhi | non-transparent transcription of vowels |
| se | Northern Sami | non-transparent transcription |
| sg | Sango | non-transparent transcription of tones |
| sim | Mende | non-transparent transcription of tones |
| sll | Salt-Yui | non-transparent transcription of tones |
| sn | Shona | non-transparent transcription of tones |
| so | Somali | non-transparent transcription of tones |
| soq | Kanasi | non-transparent transcription of glottal stops |
| spp | Supyire Senoufo | non-transparent transcription of tones |
| ss | Swati | non-transparent transcription of tones |
| st | Sesotho | non-transparent transcription of tones |
| sv | Swedish | non-transparent transcription |
| swp | Suau | non-transparent transcription |
| sxb | Suba | non-transparent transcription of tones |
| tav | Tatuyo | non-transparent transcription of tones |
| tcc | Datooga | non-transparent transcription of tones |
| tcy | Tulu | non-transparent transcription of vowels |
| tcz | Thadou Chin | non-transparent transcription of tones |
| ti | Tigrinya | non-transparent transcription of gemination |
| tk | Turkmen | non-transparent transcription of vowel length |
| tl | Tagalog | non-transparent spelling of vowel length |
| tn | Tswana | non-transparent transcription of tones |
| toi | Tonga | non-transparent transcription of tones |
| trp | Kok Borok | non-transparent transcription of tones |
| ts | Tsonga | non-transparent transcription of tones |
| ttc | Tekiteko | non-transparent transcription of vowel length |
| tuf | Central Tunebo | non-transparent transcription of contrastive features (first syllable) |
| tw | Twi | non-transparent transcription of tones |
| ubu | Umbu-Ungu | non-transparent transcription of tones |
| udu | Uduk | non-transparent transcription of tones |
| ur | Urdu | non-transparent transcription of vowels |
| ura | Urarina | non-transparent transcription of tones |
| usp | Uspanteko | non-transparent transcription of tones |
| ve | Venda | non-transparent transcription of tones |
| vro | Võro | non-transparent transcription of vowels and palatalization |
| wa | Walloon | non-transparent transcription |
| wal | Wolaytta | non-transparent transcription of tones |
| war | Waray-Waray | insufficient documentation |
| wiu | Wiru | non-transparent transcription of tones |
| xal | Kalmyk-Oirat | non-transparent transcription of vowels |
| xav | Xavánte | non-transparent transcription of vowel length |
| xbi | Kombio | non-transparent transcription of vowels |
| xh | Xhosa | non-transparent transcription of tones |
| xla | Kamula | non-transparent transcription of vowels and tones |
| xsr | Sherpa | insufficient documentation |
| yaa | Yaminahua | non-transparent transcription of tones |
| yad | Yagua | non-transparent transcription of tones |
| yby | Yaweyuha | non-transparent transcription of tones |
| yo | Yoruba | non-transparent transcription of tones |
| zai | Zapotec | non-transparent transcription of tones |
| zca | Coatecas Altas Zapotec | non-transparent transcription of tones |
| zpi | Santa María Quiegolani Zapotec | non-transparent transcription of tones |
| zpq | Zoogocho Zapotec | non-transparent transcription of tones |
| zu | Zulu | non-transparent transcription of tones |
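
For readers who want to work with the tables above programmatically, here is a minimal sketch in Python that tallies languages by stated reason. It assumes the rows are available as pipe-delimited text; the names `parse_rows` and `count_reasons` and the `SAMPLE` string are illustrative only and are not part of the scripts in ./Code.

```python
from collections import Counter

# A few rows copied from the table above; a real run would read the full table.
SAMPLE = """\
| ace | Acehnese | non-transparent transcription of vowel nasalization |
| ach | Acholi | non-transparent transcription of tones |
| ik | Inupiaq | insufficient tokens |
| zu | Zulu | non-transparent transcription of tones |
"""


def parse_rows(text):
    """Split pipe-delimited table rows into lists of stripped cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, non-table lines, and |---|---| separator rows.
        if not line.startswith("|") or set(line) <= set("|-: "):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    return rows


def count_reasons(rows):
    """Count how many languages share each reason (third column)."""
    return Counter(row[2] for row in rows if len(row) >= 3)


if __name__ == "__main__":
    for reason, n in count_reasons(parse_rows(SAMPLE)).most_common():
        print(f"{n:3d}  {reason}")
```

Header rows are not filtered out by this sketch, so drop them (or skip the first parsed row) when feeding in a whole table.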