## Model description

This model is a fine-tuned version of facebook/mbart-large-50, a multilingual sequence-to-sequence Transformer model, adapted to the task of Spanish gender neutralization.

The goal of the model is to transform gender-marked Spanish sentences into gender-neutral reformulations, preserving meaning while reducing grammatical gender marking. The task can be framed as a monolingual translation problem (Spanish → neutral Spanish).

The model was trained with the Hugging Face Transformers library and follows a standard encoder–decoder architecture, using transfer learning from the pretrained mBART model.

The resulting system performs controlled rewriting rather than translation between languages, making it suitable for experiments in:

- inclusive language generation
- stylistic rewriting
- bias reduction in text
- controlled text transformation
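As a sketch of how such a model could be queried once published, the helper below wraps generation for an mBART-50-style checkpoint. The repo id is a hypothetical placeholder; `es_XX` is the mBART-50 language code for Spanish, used on both sides because the rewriting is monolingual.

```python
def neutralize(text, model, tokenizer, max_length=128):
    """Rewrite a gendered Spanish sentence into a gender-neutral one
    using an mBART-style seq2seq model."""
    tokenizer.src_lang = "es_XX"  # mBART-50 language code for Spanish
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    output_ids = model.generate(
        **inputs,
        # Force decoding back into Spanish (monolingual "translation").
        forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"],
        max_length=max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


if __name__ == "__main__":
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    repo_id = "your-username/mbart-large-50-neutral-es"  # hypothetical repo id
    tokenizer = MBart50TokenizerFast.from_pretrained(repo_id)
    model = MBartForConditionalGeneration.from_pretrained(repo_id)
    print(neutralize("Los trabajadores están cansados.", model, tokenizer))
```

The helper takes the model and tokenizer as arguments, so the same function can be reused with any checkpoint fine-tuned for this task.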
## Intended uses & limitations

This model is intended for:

- Research experiments in NLP and inclusive language
- Educational purposes in courses on machine translation or text generation
- Demonstrations of transfer learning with multilingual seq2seq models
- Automatic rewriting of short Spanish sentences into gender-neutral forms
## Training and evaluation data

The model was trained on the Spanish Gender Neutralization dataset available on the Hugging Face Hub:

👉 hackathon-pln-es/neutral-es

The dataset contains pairs of aligned sentences:

- gendered: the original sentence with grammatical gender marking
- neutral: the reformulated gender-neutral version

The dataset already includes a predefined split:

- Training set
- Test set

The dataset is relatively small and is designed mainly for educational and experimental purposes, not for large-scale production systems.

Before training, the data was:

- tokenized with the mBART tokenizer
- truncated/padded to the model's length limits
- converted into input/label format for seq2seq training
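The preprocessing steps above can be sketched as a function mapped over the dataset. The `gendered`/`neutral` column names come from the dataset described above; the 128-token limit is an illustrative choice, not a documented value.

```python
def preprocess(batch, tokenizer, max_length=128):
    """Turn aligned (gendered, neutral) sentence pairs into seq2seq
    inputs/labels. `gendered` and `neutral` are the column names of
    hackathon-pln-es/neutral-es; max_length is illustrative."""
    # Tokenize the gendered source sentences.
    model_inputs = tokenizer(
        batch["gendered"], max_length=max_length, truncation=True, padding="max_length"
    )
    # Tokenize the neutral target sentences and attach them as labels.
    labels = tokenizer(
        text_target=batch["neutral"], max_length=max_length, truncation=True, padding="max_length"
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# With a real tokenizer and dataset this would be applied as, e.g.:
# tokenized = dataset.map(lambda b: preprocess(b, tokenizer), batched=True)
```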
Evaluation was performed with the BLEU score (sacrebleu), a standard metric in machine translation.
## Training procedure

The model was trained with the Hugging Face Trainer API for sequence-to-sequence learning.

Training steps:

1. The pretrained model facebook/mbart-large-50 was loaded.
2. The dataset was tokenized with the corresponding mBART tokenizer.
3. Inputs were formatted as:
   - source: gendered sentence
   - target: neutral sentence
4. The model was fine-tuned using transfer learning.
5. Training was performed on a GPU in Google Colab.
6. Evaluation during training used the sacrebleu metric.
7. The final model was uploaded to the Hugging Face Hub.

The model therefore learns to perform monolingual rewriting through a multilingual translation architecture.
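The steps above can be sketched with `Seq2SeqTrainer` as follows. The batch size and epoch count shown are illustrative placeholders, not the values actually used; the real values appear under Training hyperparameters.

```python
def main():
    # Imports are local so the module stays cheap to import.
    from datasets import load_dataset
    from transformers import (
        DataCollatorForSeq2Seq,
        MBart50TokenizerFast,
        MBartForConditionalGeneration,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    # 1. Load the pretrained model and its tokenizer (es_XX on both sides,
    #    since the task is monolingual rewriting).
    tokenizer = MBart50TokenizerFast.from_pretrained(
        "facebook/mbart-large-50", src_lang="es_XX", tgt_lang="es_XX"
    )
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

    # 2-3. Tokenize the dataset: gendered sentences as sources,
    #      neutral sentences as targets.
    dataset = load_dataset("hackathon-pln-es/neutral-es")

    def tokenize(batch):
        inputs = tokenizer(batch["gendered"], truncation=True, max_length=128)
        inputs["labels"] = tokenizer(
            text_target=batch["neutral"], truncation=True, max_length=128
        )["input_ids"]
        return inputs

    tokenized = dataset.map(tokenize, batched=True)

    # 4-6. Fine-tune with the Trainer API.
    args = Seq2SeqTrainingArguments(
        output_dir="mbart-neutral-es",
        per_device_train_batch_size=8,  # illustrative, not the actual value
        num_train_epochs=3,             # illustrative, not the actual value
        predict_with_generate=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

    # 7. Upload the fine-tuned model to the Hugging Face Hub.
    trainer.push_to_hub()


if __name__ == "__main__":
    main()
```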
### Training hyperparameters

The following hyperparameters were used during training: