oeg/Sent2vec_CelebA_Sp

celebFaces Attributes

eduar03yauri committed on Mar 20, 2023

Commit c35a712 · 1 Parent(s): 357d236

Update README.md

Files changed (1)

README.md +11 -8


@@ -19,15 +19,18 @@ tags:
 ## Description
-Sent2vec can be used directly for English texts. However, since this work is used with Spanish text, it has been necessary to train it
-previously using the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp)) with the following process:
-- Initial preprocessing of the Spanish corpus. For this purpose, a new file has been developed in which each of the entries of the original
-  corpus is saved and the other components, such as the names of the image it describes and symbols, are removed.
   A total of 192,209 sentences are available for training.
-- Apply a second pre-processing consisting of removing accents. _stopwords_ and connectors were retained as part of
-- the sentence structure during training.
-- Configure the libraries, e.g., _Sent2vec_ and _FastText_, and the parameters. The parameters have been set empirically,
-  being: 4,800 feature vector dimension, 5,000 epochs, 200 threads, 2 n-grams, and 0.05 learning rate.
 ## How to use

 ## Description
+Sent2vec can be used directly for English texts: it is enough to download the library and enter the text to be encoded, since most
+of these models were trained on English. However, since this work uses Spanish text, it was necessary
+to train the model from scratch in this language. Training used the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp))
+with the following process:
+- A corpus of Spanish sentences describing the characteristics of each face in the CelebA dataset was generated.
   A total of 192,209 sentences are available for training.
+- Pre-process the corpus by removing accents. _Stopwords_ and connectors were retained as part of the sentence structure during training.
+- Install the libraries _Sent2vec_ and _FastText_, and configure the parameters. The parameters were fixed empirically after several
+  tests: 4,800 feature-vector dimensions, 5,000 epochs, 200 threads, 2 n-grams, and a learning rate of 0.05.
+With these settings, training took 7 hours in total with all CPUs running at maximum performance.
+The result is a file with the _bin_ extension, which can be downloaded from this repository.
 ## How to use
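
Note on the added description: the accent-removal step and the training parameters it lists could be sketched roughly as below. This is only an illustration under stated assumptions, not the author's actual script. The function names, the `./fasttext` binary path and the file names are hypothetical. It also assumes that "2 n-grams" corresponds to the `-wordNgrams` option of the Sent2vec command line, and that "removing accents" means stripping every combining mark (which also turns `ñ` into `n`). The README does not confirm either point.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (accents) while keeping the base letters.

    Assumption: every combining mark is dropped, so "ñ" becomes "n".
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def build_train_command(corpus_path: str, output_prefix: str,
                        binary: str = "./fasttext") -> list:
    """Assemble (but do not run) a Sent2vec training command.

    The values mirror the README: 4,800 dimensions, 5,000 epochs,
    200 threads, 2 n-grams, learning rate 0.05. The binary path is
    hypothetical; the n-gram flag mapping is an assumption.
    """
    return [
        binary, "sent2vec",
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", "4800",
        "-epoch", "5000",
        "-thread", "200",
        "-wordNgrams", "2",
        "-lr", "0.05",
    ]


if __name__ == "__main__":
    # Stopwords and connectors are left untouched; only accents are removed.
    print(strip_accents("La mujer está sonriendo"))
    print(" ".join(build_train_command("corpus_sp.txt", "sent2vec_celeba_sp")))
```

The command is only assembled as a list so that it can be inspected before being passed to something like `subprocess.run`. A run with these parameters is expected to be long (the README reports 7 hours).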