| | --- |
| | language: es |
| | --- |
| | |
| | # es-seq2seq-gender (encoder) |
| |
|
| | This is the encoder half of a seq2seq model that "flips" gender in Spanish sentences. |
| | The model can augment your existing Spanish data, or generate counterfactuals |
| | to test another model's decisions (would changing the gender of the subject or speaker change the output?). |
| |
|
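A counterfactual check of the kind described above could be sketched as follows. This is a minimal illustration, not part of this project: `find_gender_sensitive` is a hypothetical helper, `flip` stands for any function that returns the gender-flipped sentence (for example a wrapper around this model), and `predict` stands for the model under test. The toy stand-ins at the bottom exist only so the sketch runs without downloading anything.

```python
from typing import Callable, Iterable, List, Tuple


def find_gender_sensitive(
    sentences: Iterable[str],
    flip: Callable[[str], str],
    predict: Callable[[str], str],
) -> List[Tuple[str, str, str, str]]:
    """Return (original, flipped, label, flipped_label) where the two labels differ."""
    changed = []
    for original in sentences:
        flipped = flip(original)
        if flipped == original:
            # Nothing to compare, e.g. "la biblioteca" has no person to flip
            continue
        label, flipped_label = predict(original), predict(flipped)
        if label != flipped_label:
            changed.append((original, flipped, label, flipped_label))
    return changed


# Toy stand-ins (assumptions for illustration only)
toy_flips = {"el profesor viejo": "la profesora vieja", "una actriz": "un actor"}
toy_flip = lambda s: toy_flips.get(s, s)
toy_predict = lambda s: "A" if s.startswith("la ") else "B"

result = find_gender_sensitive(
    ["el profesor viejo", "una actriz", "la biblioteca"], toy_flip, toy_predict
)
print(result)
# [('el profesor viejo', 'la profesora vieja', 'B', 'A')]
```

Any sentence returned here is one where the tested model's output changed only because the gender changed, which is the case worth inspecting by hand.
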
| | Intended Examples: |
| |
|
| | - el profesor viejo => la profesora vieja (article, noun, adjective all flip) |
| | - una actriz => un actor (irregular noun) |
| | - el lingüista => la lingüista (common-gender noun, only the article flips) |
| | - la biblioteca => la biblioteca (no person, no flip) |
| |
|
| | People's names are left unchanged in this version; to handle them, you can guess a name's likely gender with packages |
| | such as https://pypi.org/project/gender-guesser/ |
| |
|
| |
|
| | ## Sample code |
| |
|
| | https://colab.research.google.com/drive/1Ta_YkXx93FyxqEu_zJ-W23PjPumMNHe5 |
| |
|
| | ``` |
| | import torch |
| | from transformers import AutoTokenizer, EncoderDecoderModel |
| | |
| | # Combine the separately published encoder and decoder checkpoints |
| | model = EncoderDecoderModel.from_encoder_decoder_pretrained( |
| |     "monsoon-nlp/es-seq2seq-gender-encoder", |
| |     "monsoon-nlp/es-seq2seq-gender-decoder", |
| | ) |
| | # Encoder and decoder share the vocabulary of the original uncased BETO |
| | tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/es-seq2seq-gender-decoder") |
| | |
| | # Encode one sentence and add a batch dimension |
| | input_ids = torch.tensor(tokenizer.encode("la profesora vieja")).unsqueeze(0) |
| | # Start decoding from the decoder's [PAD] token |
| | generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id) |
| | print(tokenizer.decode(generated.tolist()[0])) |
| | # [PAD] el profesor viejo profesor viejo profesor... |
| | ``` |
| |
|
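The sample output begins with the `[PAD]` start token and then repeats the flipped phrase. One way to tidy it up is sketched below. `clean_generation` is a hypothetical helper, not part of this model or of `transformers`, and it rests on an assumption: the flipped sentence has the same number of whitespace-separated words as the input. That holds for the examples in this card but is not guaranteed in general.

```python
def clean_generation(decoded: str, source: str, pad_token: str = "[PAD]") -> str:
    """Drop pad/start tokens and cut the output to the source's word count."""
    words = [w for w in decoded.split() if w != pad_token]
    # Assumption: a gender flip keeps the number of words unchanged
    return " ".join(words[: len(source.split())])


cleaned = clean_generation(
    "[PAD] el profesor viejo profesor viejo profesor", "la profesora vieja"
)
print(cleaned)
# el profesor viejo
```

Capping generation with the `max_length` argument of `generate` may also shorten the repeated tail, though it will not remove the leading `[PAD]`.
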
| | ## Training |
| |
|
| | I originally developed |
| | <a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a> |
| | with |
| | <a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>, |
| | the Spanish-language BERT from Universidad de Chile, |
| | and spaCy to parse dependencies in sentences. |
| |
|
| | More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617 |
| |
|
| | The seq2seq model is trained on gender-flipped text from that script run on the |
| | <a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>, |
| | and the first 6,853 lines from the |
| | <a href="https://oscar-corpus.com/">OSCAR corpus</a> |
| | (Spanish, deduplicated). |
| |
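As a rough illustration of how such training data might be assembled, the sketch below pairs each of the first `limit` corpus lines with its gender-flipped version. It is an assumption about the general shape of the process, not the author's actual pipeline: `build_pairs` is a hypothetical helper, and `flip` stands in for the original gender flip script.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Tuple


def build_pairs(
    lines: Iterable[str], flip: Callable[[str], str], limit: int
) -> Iterator[Tuple[str, str]]:
    """Yield (source, flipped target) pairs from the first `limit` lines."""
    for line in islice(lines, limit):
        text = line.strip()
        if text:  # blank lines count toward the limit but produce no pair
            yield text, flip(text)


# Toy stand-ins (assumptions for illustration only)
toy_flips = {"el profesor viejo": "la profesora vieja", "una actriz": "un actor"}
corpus = ["el profesor viejo\n", "\n", "la biblioteca\n", "una actriz\n"]

pairs = list(build_pairs(corpus, lambda s: toy_flips.get(s, s), limit=3))
print(pairs)
# [('el profesor viejo', 'la profesora vieja'), ('la biblioteca', 'la biblioteca')]
```

Sentences with no person to flip map to themselves, which matches the "no person, no flip" behavior listed in the intended examples.
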
|
| | The encoder and decoder started with weights and vocabulary from BETO (uncased). |
| |
|
| | ## Non-binary gender |
| |
|
| | This model is useful for generating male and female text samples, but it falls |
| | short of capturing gender diversity in the world and in the Spanish |
| | language. Some communities prefer the plural -@s to represent both |
| | -os and -as, use -e and -es for gender-neutral or mixed-gender forms, |
| | or use fewer gendered professional nouns (la juez rather than la jueza). These usages are not yet |
| | embraced by the Royal Spanish Academy |
| | and are not represented in the corpora and tokenizers used to build this project. |
| |
|
| | This seq2seq project and script could, in the future, help generate more text samples |
| | and prepare NLP models to understand us all better. |
| |
|
| | #### Sources |
| |
|
| | - https://www.nytimes.com/2020/04/15/world/americas/argentina-gender-language.html |
| | - https://www.washingtonpost.com/dc-md-va/2019/12/05/teens-argentina-are-leading-charge-gender-neutral-language/?arc404=true |
| | - https://www.theguardian.com/world/2020/jan/19/gender-neutral-language-battle-spain |
| | - https://es.wikipedia.org/wiki/Lenguaje_no_sexista |
| | - https://remezcla.com/culture/argentine-company-re-imagines-little-prince-gender-neutral-language/ |
| |
|