# EsperBERTo Model Card
## Model Description
EsperBERTo is a RoBERTa-like model trained from scratch on Esperanto, using a large corpus drawn from the OSCAR corpus and the Leipzig Corpora Collection. It is trained for masked language modeling and is intended as a base for Esperanto text understanding and generation tasks.
### Datasets
- **OSCAR Corpus (Esperanto)**: Extracted from Common Crawl dumps, filtered by language classification.
- **Leipzig Corpora Collection (Esperanto)**: Includes texts from news, literature, and Wikipedia.
### Preprocessing
- Trained a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 tokens.
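The tokenizer training step above can be sketched with the `tokenizers` library. The inline sample text and file name below are stand-ins for the real OSCAR/Leipzig corpus files, and the `min_frequency` value is an assumption, not taken from the actual training run:

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus file: in the real run this would be the full
# OSCAR + Leipzig Esperanto text files.
Path("eo_sample.txt").write_text(
    "Jen la komenco de bela tago. La suno brilas super la urbo.",
    encoding="utf-8",
)

# Train a byte-level BPE tokenizer, as described above.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["eo_sample.txt"],
    vocab_size=52_000,  # vocabulary size reported for EsperBERTo
    min_frequency=2,    # assumed merge threshold
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa-style specials
)
print(tokenizer.get_vocab_size())
```

On a real corpus the vocabulary would reach the full 52,000 entries; on this tiny sample it stops early once no frequent pairs remain.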
### Hyperparameters
- **Number of Epochs**: 1
- **Batch Size per GPU**: 64
- **Checkpoint Save Interval**: every 10,000 training steps
- **Maximum Saved Checkpoints**: 2
- **Loss Calculation**: prediction loss only
### Software and Libraries
- **Library**: [Transformers](https://github.com/huggingface/transformers)
- **Training Script**: `run_language_modeling.py`
## How to Use
The model can be loaded through the `fill-mask` pipeline:
```python
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="SamJoshua/EsperBERTo",
    tokenizer="SamJoshua/EsperBERTo"
)
fill_mask("Jen la komenco de bela <mask>.")
```
## Evaluation Results
The model has not yet been evaluated on a standardized test set. Future updates will include evaluation metrics such as perplexity and accuracy on a held-out validation set.
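Until a formal evaluation exists, perplexity can be estimated as the exponential of the masked-LM cross-entropy loss on held-out text. The sketch below uses a tiny randomly initialized RoBERTa so it runs offline without downloading weights; to evaluate EsperBERTo itself, load the real checkpoint as noted in the comments:

```python
import math
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Tiny randomly initialized RoBERTa, used here only so the sketch runs offline.
# To evaluate the real model, use instead:
#   model = RobertaForMaskedLM.from_pretrained("SamJoshua/EsperBERTo")
config = RobertaConfig(vocab_size=52_000, hidden_size=64, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=128)
model = RobertaForMaskedLM(config)
model.eval()

# Token ids standing in for a tokenized held-out sentence; a real evaluation
# would encode text with the EsperBERTo tokenizer and mask a subset of tokens.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    loss = model(input_ids=input_ids, labels=input_ids).loss

# Perplexity as exp(cross-entropy). This is a crude estimate: proper MLM
# evaluation scores only masked positions rather than every token.
perplexity = math.exp(loss.item())
print(f"perplexity: {perplexity:.1f}")
```

A random model's perplexity will be near the vocabulary size; a trained model should score far lower on in-domain Esperanto text.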
## Intended Uses & Limitations
**Intended Uses**: This model is intended for researchers, developers, and language enthusiasts exploring Esperanto language processing, for tasks such as masked-token prediction, text generation, and downstream fine-tuning (e.g., sentiment analysis).
**Limitations**:
- The model was trained for only one epoch due to computational constraints, which may limit its grasp of more complex language structures.
- Because the model is trained on public web text, it may learn and replicate social biases present in the training data.
Contributions are welcome: fine-tune the model on specific tasks, or extend its training with more data or epochs. This model serves as a baseline for further research and development in Esperanto language modeling.