	---
	license: cc-by-4.0
	---

	# Clean ConceptNet Data for All Languages

	## Data Details

For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction process involved several steps to clean and analyze the data from the [official ConceptNet 5.7.0 dump](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).

The final extracted dataset, available in a separate [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we compute PPMI over the co-occurrence statistics between words and then apply SVD to the resulting PPMI matrix.
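
As a rough illustration of the PPMI-then-SVD idea, here is a minimal sketch on a toy graph. It is not the project's training code: the edge list, the way co-occurrence counts are derived from edges, the helper names `ppmi_matrix` and `svd_embeddings`, and the scaling by the square root of the singular values are all assumptions made for the example.

```python
import numpy as np

# Toy undirected edges standing in for ConceptNet assertions (made-up data).
edges = [("cat", "animal"), ("dog", "animal"), ("cat", "dog"), ("car", "vehicle")]

vocab = sorted({word for edge in edges for word in edge})
index = {word: i for i, word in enumerate(vocab)}

# Assumption: two words "co-occur" once for every edge that links them.
cooc = np.zeros((len(vocab), len(vocab)))
for left, right in edges:
    cooc[index[left], index[right]] += 1.0
    cooc[index[right], index[left]] += 1.0


def ppmi_matrix(counts):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    expected = row_sums @ col_sums / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf or nan; treat as 0
    return np.maximum(pmi, 0.0)


def svd_embeddings(matrix, dim):
    """Keep the top `dim` singular directions as dense word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Scaling by sqrt(s) is one common choice, assumed here.
    return u[:, :dim] * np.sqrt(s[:dim])


ppmi = ppmi_matrix(cooc)
vectors = svd_embeddings(ppmi, dim=2)
```

Each row of `vectors` is the embedding of the word at the same position in `vocab`. The published embeddings would correspond to a much larger vocabulary and dimensionality than this toy example.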

	We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

	### Dataset Structure

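Each file is a plain-text (`.txt`) file with one entry per line: a word or phrase followed by its embedding values, all separated by single spaces. Because a phrase can itself contain spaces, the embedding is read from the end of the line.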

	Use the following function to read in the embeddings:

```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` fields are the vector; everything
            # before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
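
Once the embeddings are loaded into a dictionary of phrase-to-vector pairs, a typical next step is a similarity lookup. The sketch below is one possible way to do that with cosine similarity. The helper name `nearest_neighbors` and the 3-dimensional toy vectors are hypothetical and only stand in for a dictionary loaded from one of the files.

```python
import numpy as np


def nearest_neighbors(embeddings, query, top_k=5):
    """Rank all other phrases by cosine similarity to `query`."""
    query_vec = embeddings[query]
    query_norm = np.linalg.norm(query_vec)
    scores = []
    for phrase, vec in embeddings.items():
        if phrase == query:
            continue
        denom = query_norm * np.linalg.norm(vec)
        # Guard against all-zero vectors to avoid dividing by zero.
        sim = float(query_vec @ vec / denom) if denom > 0 else 0.0
        scores.append((phrase, sim))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Made-up toy vectors; real entries would have `embedding_size` dimensions.
toy = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "house cat": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
}
neighbors = nearest_neighbors(toy, "cat", top_k=2)
```

For the larger vocabularies in the table below, stacking the vectors into a single matrix and computing all similarities in one matrix product would likely be faster than this per-entry loop.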

	### Language Details

| Language Code | Language Name | Vocabulary Size |
| --- | --- | --- |
| af | Afrikaans | 12973 |
| sc | Sardinian | 573 |
| yo | Yoruba | 2283 |
| gn | Guarani | 131 |
| qu | Quechua | 5156 |
| li | Limburgish | 485 |
| ln | Lingala | 4109 |
| wo | Wolof | 1196 |
| zu | Zulu | 2758 |
| rm | Romansh | 3919 |
| ht | Haitian Creole | 2699 |
| su | Sundanese | 2514 |
| br | Breton | 11665 |
| gd | Scottish Gaelic | 14418 |
| xh | Xhosa | 2504 |
| mg | Malagasy | 26575 |
| jv | Javanese | 4919 |
| fy | Frisian | 7608 |
| sa | Sanskrit | 5789 |
| my | Burmese | 4875 |
| ug | Uyghur | 998 |
| yi | Yiddish | 8054 |
| or | Oriya | 109 |
| ha | Hausa | 802 |
| la | Latin | 848943 |
| sd | Sindhi | 143 |
| so | Somali | 593 |
| ku | Kurdish | 9737 |
| pa | Punjabi | 4488 |
| ps | Pashto | 1087 |
| ga | Irish | 29459 |
| am | Amharic | 1909 |
| km | Khmer | 3466 |
| uz | Uzbek | 5224 |
| ky | Kyrgyz | 3574 |
| cy | Welsh | 13243 |
| gu | Gujarati | 4427 |
| eo | Esperanto | 91074 |
| sw | Swahili | 9131 |
| mr | Marathi | 5545 |
| kn | Kannada | 3415 |
| ne | Nepali | 4224 |
| mn | Mongolian | 6740 |
| si | Sinhala | 2062 |
| te | Telugu | 18707 |
| be | Belarusian | 14871 |
| mk | Macedonian | 28935 |
| gl | Galician | 52824 |
| hy | Armenian | 23434 |
| is | Icelandic | 40287 |
| ml | Malayalam | 6750 |
| bn | Bengali | 7306 |
| ur | Urdu | 8476 |
| kk | Kazakh | 13700 |
| ka | Georgian | 25014 |
| az | Azerbaijani | 13277 |
| sq | Albanian | 16262 |
| ta | Tamil | 9064 |
| et | Estonian | 20088 |
| lv | Latvian | 30059 |
| ms | Malay | 88416 |
| sl | Slovenian | 89210 |
| lt | Lithuanian | 21184 |
| he | Hebrew | 27283 |
| sk | Slovak | 21657 |
| el | Greek | 39667 |
| th | Thai | 94281 |
| bg | Bulgarian | 171740 |
| da | Danish | 46600 |
| uk | Ukrainian | 27682 |
| ro | Romanian | 36206 |


	### Licensing Information

	This work includes data from ConceptNet 5, which was compiled by the
	Commonsense Computing Initiative. ConceptNet 5 is freely available under
	the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
	http://conceptnet.io.

	### Citation Information

```
@misc{gurgurov2024gremlinrepositorygreenbaseline,
  title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge},
  author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
  year={2024},
  eprint={2409.18193},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.18193},
}

@inproceedings{speer2017conceptnet,
  author = {Robyn Speer and Joshua Chin and Catherine Havasi},
  title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year = {2017},
  pages = {4444--4451},
  keywords = {ConceptNet; knowledge graph; word embeddings},
  url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```
