Commit 73d1f1f
Parent(s): 7068663
add code sample
- README.md +27 -5
- config.json +0 -1
README.md
CHANGED
|
@@ -1,3 +1,7 @@
|
|
| 1 |
# es-seq2seq-gender (encoder)
|
| 2 |
|
| 3 |
This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
|
|
@@ -14,11 +18,29 @@ Intended Examples:
|
|
| 14 |
People's names are unchanged in this version, but you can use packages
|
| 15 |
such as https://pypi.org/project/gender-guesser/
|
| 16 |
|
| 17 |
## Training
|
| 18 |
|
| 19 |
-
I originally developed
|
| 20 |
<a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
|
| 21 |
-
with
|
| 22 |
<a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
|
| 23 |
the Spanish-language BERT from Universidad de Chile,
|
| 24 |
and spaCy to parse dependencies in sentences.
|
|
@@ -26,7 +48,7 @@ and spaCy to parse dependencies in sentences.
|
|
| 26 |
More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
|
| 27 |
|
| 28 |
The seq2seq model is trained on gender-flipped text from that script run on the
|
| 29 |
-
<a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
|
| 30 |
and the first 6,853 lines from the
|
| 31 |
<a href="https://oscar-corpus.com/">OSCAR corpus</a>
|
| 32 |
(Spanish de-duped).
|
|
@@ -40,10 +62,10 @@ short of capturing gender diversity in the world and in the Spanish
|
|
| 40 |
language. Some communities prefer the plural -@s to represent
|
| 41 |
-os and -as, or -e and -es for gender-neutral or mixed-gender plural,
|
| 42 |
or use fewer gendered professional nouns (la juez and not jueza). This is not yet
|
| 43 |
-
embraced by the Royal Spanish Academy
|
| 44 |
and is not represented in the corpora and tokenizers used to build this project.
|
| 45 |
|
| 46 |
-
This seq2seq project and script could, in the future, help generate more text samples
|
| 47 |
and prepare NLP models to understand us all better.
|
| 48 |
|
| 49 |
#### Sources
|
|
|
|
| 1 |
+
---
|
| 2 |
+
language: es
|
| 3 |
+
---
|
| 4 |
+
|
| 5 |
# es-seq2seq-gender (encoder)
|
| 6 |
|
| 7 |
This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
|
|
|
|
| 18 |
People's names are unchanged in this version, but you can use packages
|
| 19 |
such as https://pypi.org/project/gender-guesser/
|
| 20 |
|
| 21 |
+
|
| 22 |
+
## Sample code
|
| 23 |
+
|
| 24 |
+
https://colab.research.google.com/drive/1Ta_YkXx93FyxqEu_zJ-W23PjPumMNHe5
|
| 25 |
+
|
| 26 |
+
```
|
| 27 |
+
import torch
|
| 28 |
+
from transformers import AutoTokenizer, EncoderDecoderModel
|
| 29 |
+
|
| 30 |
+
model = EncoderDecoderModel.from_encoder_decoder_pretrained("monsoon-nlp/es-seq2seq-gender-encoder", "monsoon-nlp/es-seq2seq-gender-decoder")
|
| 31 |
+
tokenizer = AutoTokenizer.from_pretrained('monsoon-nlp/es-seq2seq-gender-decoder') # all are same as BETO uncased original
|
| 32 |
+
|
| 33 |
+
input_ids = torch.tensor(tokenizer.encode("la profesora vieja")).unsqueeze(0)
|
| 34 |
+
generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
|
| 35 |
+
tokenizer.decode(generated.tolist()[0])
|
| 36 |
+
# > '[PAD] el profesor viejo profesor viejo profesor...'
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
## Training
|
| 40 |
|
| 41 |
+
I originally developed
|
| 42 |
<a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
|
| 43 |
+
with
|
| 44 |
<a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
|
| 45 |
the Spanish-language BERT from Universidad de Chile,
|
| 46 |
and spaCy to parse dependencies in sentences.
|
|
|
|
| 48 |
More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
|
| 49 |
|
| 50 |
The seq2seq model is trained on gender-flipped text from that script run on the
|
| 51 |
+
<a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
|
| 52 |
and the first 6,853 lines from the
|
| 53 |
<a href="https://oscar-corpus.com/">OSCAR corpus</a>
|
| 54 |
(Spanish de-duped).
|
|
|
|
| 62 |
language. Some communities prefer the plural -@s to represent
|
| 63 |
-os and -as, or -e and -es for gender-neutral or mixed-gender plural,
|
| 64 |
or use fewer gendered professional nouns (la juez and not jueza). This is not yet
|
| 65 |
+
embraced by the Royal Spanish Academy
|
| 66 |
and is not represented in the corpora and tokenizers used to build this project.
|
| 67 |
|
| 68 |
+
This seq2seq project and script could, in the future, help generate more text samples
|
| 69 |
and prepare NLP models to understand us all better.
|
| 70 |
|
| 71 |
#### Sources
|
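The sample output in the README code above starts with `[PAD]` and then repeats itself. The following is a minimal post-processing sketch and is not part of this commit. `trim_flipped` is a hypothetical helper, and it assumes a gender flip keeps the same number of words as the input, which holds for the README's example but may not hold in general.

```python
def trim_flipped(source: str, decoded: str, start_token: str = "[PAD]") -> str:
    """Strip the start token and cut the decoded text to the input's word count."""
    # The README sample seeds generation with the pad token, so it shows up in the output.
    words = [w for w in decoded.split() if w != start_token]
    # Assumption (not from the commit): flipping gender preserves the number of words.
    return " ".join(words[: len(source.split())])


print(trim_flipped("la profesora vieja",
                   "[PAD] el profesor viejo profesor viejo profesor"))
# prints: el profesor viejo
```

`tokenizer.decode` also accepts `skip_special_tokens=True`, which would likely remove `[PAD]` before this step; the word-count cut is only a heuristic for the repetition.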
config.json
CHANGED
|
@@ -1,5 +1,4 @@
|
|
| 1 |
{
|
| 2 |
-
"_name_or_path": "dccuchile/bert-base-spanish-wwm-uncased",
|
| 3 |
"architectures": [
|
| 4 |
"BertModel"
|
| 5 |
],
|
|
|
|
| 1 |
{
|
|
|
|
| 2 |
"architectures": [
|
| 3 |
"BertModel"
|
| 4 |
],
|
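The config.json change above only removes the `_name_or_path` entry. As an illustration, the same edit could be scripted roughly as follows; `drop_key` is a hypothetical helper, and the input string is a shortened stand-in for the real file, which contains more fields than the diff shows.

```python
import json


def drop_key(config_text: str, key: str = "_name_or_path") -> str:
    """Return the JSON config with one top-level key removed, if present."""
    config = json.loads(config_text)
    config.pop(key, None)  # no error if the key is already absent
    return json.dumps(config, indent=2)


# Shortened stand-in for config.json; only the fields visible in the diff.
before = ('{"_name_or_path": "dccuchile/bert-base-spanish-wwm-uncased", '
          '"architectures": ["BertModel"]}')
print(drop_key(before))
```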