	---
	language: pt
	tags:
	- word-embeddings
	- static
	- portuguese
	- fasttext
	- cbow
	- 300d
	license: cc-by-4.0
	library_name: safetensors
	pipeline_tag: feature-extraction
	---

	# NILC Portuguese Word Embeddings — FastText CBOW 300d

	This repository contains the FastText CBOW 300d model in safetensors format.

	## About

	NILC-Embeddings is a repository for storing and sharing word embeddings for the Portuguese language. The goal is to provide ready-to-use vector resources for Natural Language Processing (NLP) and Machine Learning tasks.

	The NILC embeddings were trained on a large corpus of Brazilian and European Portuguese, assembled from 17 corpora (~1.39B tokens). Models were trained with four algorithms: Word2Vec, FastText, Wang2Vec, and GloVe.

	---

	## 📂 Files
	- `embeddings.safetensors` → embedding matrix (`[vocab_size, 300]`)
	- `vocab.txt` → vocabulary (one token per line, aligned with rows)
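
Because line `i` of `vocab.txt` corresponds to row `i` of the matrix, a token-to-row lookup table is enough to fetch a vector. The following is a minimal sketch under that assumption; `build_index` is a hypothetical helper (not part of this repository), and the toy data only stands in for the real files.

```python
import numpy as np


def build_index(vocab, vectors):
    """Map each token to its row in the embedding matrix.

    Raises ValueError if the vocabulary and the matrix are not aligned.
    """
    if len(vocab) != vectors.shape[0]:
        raise ValueError(
            f"vocab has {len(vocab)} entries but matrix has {vectors.shape[0]} rows"
        )
    return {token: row for row, token in enumerate(vocab)}


# Toy stand-ins for the real files (3 tokens, 4 dimensions instead of 300).
toy_vocab = ["casa", "gato", "livro"]
toy_vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

index = build_index(toy_vocab, toy_vectors)
gato_vector = toy_vectors[index["gato"]]  # row 1 of the matrix
```

The length check is a cheap way to catch a vocabulary file that was truncated or read with the wrong encoding before any lookups are made.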

	---

	## 🚀 Usage

	```python
	from huggingface_hub import hf_hub_download
	from safetensors.numpy import load_file

	# Download (and cache) the embedding matrix from the Hub.
	path = hf_hub_download(
	    repo_id="nilc-nlp/fasttext-cbow-300d",
	    filename="embeddings.safetensors",
	)

	data = load_file(path)
	vectors = data["embeddings"]  # numpy.ndarray, shape [vocab_size, 300]

	# Download the vocabulary; line i corresponds to row i of `vectors`.
	vocab_path = hf_hub_download(
	    repo_id="nilc-nlp/fasttext-cbow-300d",
	    filename="vocab.txt",
	)
	with open(vocab_path, encoding="utf-8") as f:
	    vocab = [w.strip() for w in f]

	print(vectors.shape)
	```
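
Once `vectors` and `vocab` are loaded, a common next step is to look up a word's nearest neighbours. The sketch below shows one way to do this with cosine similarity. `most_similar` is a hypothetical helper, not a function shipped with this repository, and the toy matrix only stands in for the real 300-dimensional data.

```python
import numpy as np


def most_similar(word, vocab, vectors, topn=5):
    """Return up to `topn` (token, cosine similarity) pairs closest to `word`."""
    index = {token: row for row, token in enumerate(vocab)}
    if word not in index:
        raise KeyError(f"{word!r} is not in the vocabulary")
    query = index[word]

    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)

    scores = unit @ unit[query]
    order = np.argsort(-scores)  # highest similarity first
    best = [i for i in order if i != query][:topn]  # skip the query itself
    return [(vocab[i], float(scores[i])) for i in best]


# Toy data standing in for the real matrix (3 tokens, 2 dimensions).
toy_vocab = ["rei", "rainha", "banana"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)

neighbours = most_similar("rei", toy_vocab, toy_vectors, topn=2)
```

For repeated queries against the full matrix, it is probably worth computing the normalised matrix and the index once and reusing them rather than rebuilding both on every call.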

	Or in PyTorch:

	```python
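	from safetensors.torch import load_file

	# Assumes embeddings.safetensors is in the current directory; otherwise pass
	# the path returned by hf_hub_download.
	tensors = load_file("embeddings.safetensors")
	vectors = tensors["embeddings"]  # torch.Tensor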
	```
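
A tensor loaded this way can be wrapped in a frozen embedding layer for use in a PyTorch model. The following is a rough sketch, not the authors' own setup: the toy tensor stands in for `tensors["embeddings"]`, and `token_to_id` is an illustrative mapping that would normally be built from `vocab.txt`.

```python
import torch

# Toy tensor standing in for tensors["embeddings"] (3 tokens, 4 dimensions).
toy_vectors = torch.arange(12, dtype=torch.float32).reshape(3, 4)
toy_vocab = ["casa", "gato", "livro"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}

# freeze=True keeps the pretrained weights fixed during training.
embedding = torch.nn.Embedding.from_pretrained(toy_vectors, freeze=True)

ids = torch.tensor([token_to_id[t] for t in ["gato", "casa"]])
looked_up = embedding(ids)  # shape: (2, 4)
```

Passing `freeze=False` instead would allow the vectors to be fine-tuned along with the rest of the model.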

	---

	## 📊 Corpus

	The embeddings were trained on a combination of 17 corpora (~1.39B tokens):

	| Corpus | Tokens | Types | Genre | Description |
	|--------|--------|-------|-------|-------------|
	| LX-Corpus [Rodrigues et al. 2016] | 714,286,638 | 2,605,393 | Mixed genres | Large collection of texts from 19 sources, mostly European Portuguese |
	| Wikipedia | 219,293,003 | 1,758,191 | Encyclopedic | Wikipedia dump (2016-10-20) |
	| GoogleNews | 160,396,456 | 664,320 | Informative | News crawled from Google News |
	| SubIMDB-PT | 129,975,149 | 500,302 | Spoken | Movie subtitles from IMDb |
	| G1 | 105,341,070 | 392,635 | Informative | News from G1 portal (2014–2015) |
	| PLN-Br [Bruckschen et al. 2008] | 31,196,395 | 259,762 | Informative | Corpus of PLN-BR project (1994–2005) |
	| Domínio Público | 23,750,521 | 381,697 | Prose | 138,268 literary works |
	| Lacio-Web [Aluísio et al. 2003] | 8,962,718 | 196,077 | Mixed | Literary, informative, scientific, law, didactic texts |
	| Literatura Brasileira | 1,299,008 | 66,706 | Prose | Classical Brazilian fiction e-books |
	| Mundo Estranho | 1,047,108 | 55,000 | Informative | Texts from Mundo Estranho magazine |
	| CHC | 941,032 | 36,522 | Informative | Texts from Ciência Hoje das Crianças |
	| FAPESP | 499,008 | 31,746 | Science communication | Texts from Pesquisa FAPESP magazine |
	| Textbooks | 96,209 | 11,597 | Didactic | Elementary school textbooks |
	| Folhinha | 73,575 | 9,207 | Informative | Children’s news from Folhinha (Folha de São Paulo) |
	| NILC subcorpus | 32,868 | 4,064 | Informative | Children’s texts (3rd–4th grade) |
	| Para Seu Filho Ler | 21,224 | 3,942 | Informative | Children’s news from Zero Hora |
	| SARESP | 13,308 | 3,293 | Didactic | School evaluation texts |
	| Total | 1,395,926,282 | 3,827,725 | - | - |

	---

	## 📖 Paper

	Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
	Hartmann, N. et al. (2017), STIL 2017.
	[ArXiv Paper](https://arxiv.org/abs/1708.06025)

	### BibTeX
	```bibtex
	@inproceedings{hartmann-etal-2017-portuguese,
	title = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
	author = {Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Silva, J{\'e}ssica and Alu{\'i}sio, Sandra},
	year = 2017,
	month = oct,
	booktitle = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
	publisher = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
	address = {Uberl{\^a}ndia, Brazil},
	pages = {122--131},
	url = {https://aclanthology.org/W17-6615/},
	editor = {Paetzold, Gustavo Henrique and Pinheiro, Vl{\'a}dia}
	}
	```

	---

	## 📜 License
	Creative Commons Attribution 4.0 International (CC BY 4.0)