| | --- |
| | license: cc-by-4.0 |
| | --- |
| | |
| | # Clean ConceptNet Data for All Languages |
| |
|
| | ## Data Details |
| |
|
| | For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction involved several cleaning and analysis steps applied to the official [ConceptNet 5.7.0 assertions dump](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz). |
| |
|
| | The final extracted dataset, available in a separate [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we computed PPMI (positive pointwise mutual information) over the co-occurrence statistics between words and then applied SVD to the resulting PPMI matrix. |
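
To make the PPMI and SVD steps concrete, here is a minimal sketch on a toy symmetric co-occurrence matrix. It is an illustration under assumptions, not the pipeline used to build this repository: the helper names `ppmi_matrix` and `svd_embeddings`, the toy counts, and the choice to scale the singular vectors by the square root of the singular values are all hypothetical.

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Counts expected if rows and columns were independent.
    expected = row_sums * col_sums / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    # Zero counts give -inf (or nan); treat them as "no association".
    pmi[~np.isfinite(pmi)] = 0.0
    # Keep only positive associations.
    return np.maximum(pmi, 0.0)

def svd_embeddings(ppmi, dim):
    """Reduce a PPMI matrix to `dim` dimensions with a truncated SVD."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Assumed weighting: scale the left singular vectors by sqrt(singular values).
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy symmetric co-occurrence counts for four hypothetical words.
toy_counts = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
])
toy_ppmi = ppmi_matrix(toy_counts)
toy_vectors = svd_embeddings(toy_ppmi, dim=2)
```

Each row of `toy_vectors` is then a dense vector for one word. For vocabularies of realistic size, a sparse matrix and a truncated sparse solver would likely be needed instead of a dense `np.linalg.svd`.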
| |
|
| | We generate graph embeddings for 72 languages present in both CC100 and ConceptNet. |
| |
|
| | ### Dataset Structure |
| |
|
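| | Each file is a plain-text (.txt) file. Every line holds a word or phrase followed by its embedding values, all separated by single spaces. Because a phrase may itself contain spaces, the embedding is taken to be the last `embedding_size` values on the line. |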
| |
|
| | Use the following function to read in the embeddings: |
| |
|
| | ```python |
| | import numpy as np |
| | |
| | def read_embeddings_from_text(file_path, embedding_size=300): |
| |     """Read embeddings from a txt file into a {phrase: vector} dict.""" |
| |     embeddings = {} |
| |     with open(file_path, 'r', encoding='utf-8') as file: |
| |         for line in file: |
| |             parts = line.strip().split(' ') |
| |             if len(parts) <= embedding_size: |
| |                 continue  # skip blank or malformed lines |
| |             # The last `embedding_size` fields are the vector; the rest is the phrase. |
| |             embedding_start_index = len(parts) - embedding_size |
| |             phrase = ' '.join(parts[:embedding_start_index]) |
| |             embedding = np.array([float(val) for val in parts[embedding_start_index:]]) |
| |             embeddings[phrase] = embedding |
| |     return embeddings |
| | ``` |
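
Once loaded, the embeddings form a dictionary mapping each word or phrase to a NumPy vector. The sketch below shows one way such a dictionary might be queried for nearest neighbours by cosine similarity. The helpers `cosine_similarity` and `nearest_neighbors` and the three-dimensional toy entries are hypothetical stand-ins for illustration; they are not part of this repository.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def nearest_neighbors(embeddings, query, top_k=3):
    """Return the top_k (phrase, score) pairs most similar to `query`."""
    query_vec = embeddings[query]
    scores = [
        (phrase, cosine_similarity(query_vec, vec))
        for phrase, vec in embeddings.items()
        if phrase != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]

# Toy stand-in for the dict returned by the reader function (made-up values).
toy_embeddings = {
    "word_a": np.array([1.0, 0.0, 0.0]),
    "word_b": np.array([0.9, 0.1, 0.0]),
    "two word phrase": np.array([0.0, 1.0, 0.0]),
}
neighbors = nearest_neighbors(toy_embeddings, "word_a", top_k=2)
```

With real files, `toy_embeddings` would be replaced by the dictionary returned from the reader function, and `query` by any key present in that language's vocabulary.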
| |
|
| | ### Language Details |
| |
|
| | | Language Code | Language Name | Vocabulary Size| |
| | | --- | --- | --- | |
| | | af | Afrikaans | 12973 | |
| | | sc | Sardinian | 573 | |
| | | yo | Yoruba | 2283 | |
| | | gn | Guarani | 131 | |
| | | qu | Quechua | 5156 | |
| | | li | Limburgish | 485 | |
| | | ln | Lingala | 4109 | |
| | | wo | Wolof | 1196 | |
| | | zu | Zulu | 2758 | |
| | | rm | Romansh | 3919 | |
| | | ht | Haitian Creole | 2699 | |
| | | su | Sundanese | 2514 | |
| | | br | Breton | 11665 | |
| | | gd | Scottish Gaelic | 14418 | |
| | | xh | Xhosa | 2504 | |
| | | mg | Malagasy | 26575 | |
| | | jv | Javanese | 4919 | |
| | | fy | Frisian | 7608 | |
| | | sa | Sanskrit | 5789 | |
| | | my | Burmese | 4875 | |
| | | ug | Uyghur | 998 | |
| | | yi | Yiddish | 8054 | |
| | | or | Oriya | 109 | |
| | | ha | Hausa | 802 | |
| | | la | Latin | 848943 | |
| | | sd | Sindhi | 143 | |
| | | so | Somali | 593 | |
| | | ku | Kurdish | 9737 | |
| | | pa | Punjabi | 4488 | |
| | | ps | Pashto | 1087 | |
| | | ga | Irish | 29459 | |
| | | am | Amharic | 1909 | |
| | | km | Khmer | 3466 | |
| | | uz | Uzbek | 5224 | |
| | | ky | Kyrgyz | 3574 | |
| | | cy | Welsh | 13243 | |
| | | gu | Gujarati | 4427 | |
| | | eo | Esperanto | 91074 | |
| | | sw | Swahili | 9131 | |
| | | mr | Marathi | 5545 | |
| | | kn | Kannada | 3415 | |
| | | ne | Nepali | 4224 | |
| | | mn | Mongolian | 6740 | |
| | | si | Sinhala | 2062 | |
| | | te | Telugu | 18707 | |
| | | be | Belarusian | 14871 | |
| | | mk | Macedonian | 28935 | |
| | | gl | Galician | 52824 | |
| | | hy | Armenian | 23434 | |
| | | is | Icelandic | 40287 | |
| | | ml | Malayalam | 6750 | |
| | | bn | Bengali | 7306 | |
| | | ur | Urdu | 8476 | |
| | | kk | Kazakh | 13700 | |
| | | ka | Georgian | 25014 | |
| | | az | Azerbaijani | 13277 | |
| | | sq | Albanian | 16262 | |
| | | ta | Tamil | 9064 | |
| | | et | Estonian | 20088 | |
| | | lv | Latvian | 30059 | |
| | | ms | Malay | 88416 | |
| | | sl | Slovenian | 89210 | |
| | | lt | Lithuanian | 21184 | |
| | | he | Hebrew | 27283 | |
| | | sk | Slovak | 21657 | |
| | | el | Greek | 39667 | |
| | | th | Thai | 94281 | |
| | | bg | Bulgarian | 171740 | |
| | | da | Danish | 46600 | |
| | | uk | Ukrainian | 27682 | |
| | | ro | Romanian | 36206 | |
| |
|
| |
|
| | ### Licensing Information |
| |
|
| | This work includes data from ConceptNet 5, which was compiled by the |
| | Commonsense Computing Initiative. ConceptNet 5 is freely available under |
| | the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from |
| | http://conceptnet.io. |
| |
|
| | ### Citation Information |
| |
|
| | ``` |
| | @misc{gurgurov2024gremlinrepositorygreenbaseline, |
| | title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, |
| | author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann}, |
| | year={2024}, |
| | eprint={2409.18193}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CL}, |
| | url={https://arxiv.org/abs/2409.18193}, |
| | } |
| | |
| | @inproceedings{speer2017conceptnet, |
| | author = {Robyn Speer and Joshua Chin and Catherine Havasi}, |
| | title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge}, |
| | booktitle = {AAAI Conference on Artificial Intelligence}, |
| | year = {2017}, |
| | pages = {4444--4451}, |
| | keywords = {ConceptNet; knowledge graph; word embeddings}, |
| | url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972} |
| | } |
| | ``` |