Commit c35a712 · 1 Parent(s): 357d236
Update README.md
README.md CHANGED
@@ -19,15 +19,18 @@ tags:
 
 ## Description
 
-Sent2vec can be used directly for English texts.
-
-
-
+Sent2vec can be used directly for English texts. All you have to do is download the library and enter the text to be encoded, since most
+of these algorithms were trained with English as the original language. However, since this work deals with text in Spanish, it was necessary
+to train the model from scratch in this new language. This training was carried out on the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp))
+with the following process:
+- A corpus was generated, composed of Spanish sentences describing the characteristics of each face in the CelebA dataset.
 A total of 192,209 sentences are available for training.
-- Apply a
-- the
--
-
+- Apply a pre-processing step that removes accents. _Stopwords_ and connectors are retained as part of the sentence structure during training.
+- Install the _Sent2vec_ and _FastText_ libraries and configure the parameters. The parameters were fixed empirically after several
+  tests: 4,800-dimensional feature vectors, 5,000 epochs, 200 threads, 2 n-grams and a learning rate of 0.05.
+
+Training took 7 hours in total, with all CPUs working at maximum performance.
+The result is a file with the _bin_ extension, which can be downloaded from this repository.
 
 ## How to use
 
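The pre-processing and training configuration described in the updated Description could be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the function names, the `./fasttext` binary path and the file names are hypothetical, and the command-line flags are assumed to follow the training interface documented in the Sent2vec README. Only the parameter values come from the text above.

```python
import unicodedata
from typing import List


def remove_accents(sentence: str) -> str:
    """Strip combining accent marks and keep every word, stopwords included.

    Note: this also maps "ñ" to "n" and "ü" to "u". Whether the original
    pre-processing did the same is not stated in the README.
    """
    decomposed = unicodedata.normalize("NFD", sentence)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def build_training_command(corpus_path: str, model_path: str) -> List[str]:
    """Assemble a Sent2vec training command with the README's parameters.

    The flag names are an assumption based on the Sent2vec command-line
    interface; the binary location and paths are placeholders.
    """
    return [
        "./fasttext", "sent2vec",
        "-input", corpus_path,    # pre-processed corpus, one sentence per line
        "-output", model_path,    # output model path
        "-dim", "4800",           # 4,800-dimensional feature vectors
        "-epoch", "5000",         # 5,000 epochs
        "-thread", "200",         # 200 threads
        "-wordNgrams", "2",       # 2 n-grams
        "-lr", "0.05",            # learning rate of 0.05
    ]


if __name__ == "__main__":
    print(remove_accents("La mujer tiene el pelo castaño y está sonriendo"))
    print(" ".join(build_training_command("corpus_clean.txt", "sent2vec_es")))
```

The command is only built as a list of arguments here; it could be passed to `subprocess.run` once the Sent2vec binary has been compiled and the flags have been checked against the installed version.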