# EsperBERTo Model Card
## Model Description
EsperBERTo is a RoBERTa-like model trained from scratch on Esperanto, using a large corpus drawn from the OSCAR corpus and the Leipzig Corpora Collection. It is trained for masked language modeling and is intended as a base for Esperanto text understanding and generation tasks.
### Datasets
- **OSCAR Corpus (Esperanto)**: Extracted from Common Crawl dumps, filtered by language classification.
- **Leipzig Corpora Collection (Esperanto)**: Includes texts from news, literature, and Wikipedia.
### Preprocessing
- Trained a byte-level Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 tokens.
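The tokenizer training step above can be sketched with the `tokenizers` library. The inline sample text and file name below are stand-ins for the real OSCAR/Leipzig corpus files, and the `min_frequency` value is an assumption, not taken from the actual training run:

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus file: in the real run this would be the full
# OSCAR + Leipzig Esperanto text files.
Path("eo_sample.txt").write_text(
    "Jen la komenco de bela tago. La suno brilas super la urbo.",
    encoding="utf-8",
)

# Train a byte-level BPE tokenizer, as described above.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["eo_sample.txt"],
    vocab_size=52_000,  # vocabulary size reported for EsperBERTo
    min_frequency=2,    # assumed merge threshold
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa-style specials
)
print(tokenizer.get_vocab_size())
```

On a real corpus the vocabulary would reach the full 52,000 entries; on this tiny sample it stops early once no frequent pairs remain.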
### Hyperparameters
- **Number of Epochs**: 1
- **Batch Size per GPU**: 64
- **Checkpoint Save Interval**: every 10,000 training steps
- **Maximum Saved Checkpoints**: 2
- **Loss Calculation**: prediction loss only
### Software and Libraries
- **Library**: [Transformers](https://github.com/huggingface/transformers)
- **Training Script**: `run_language_modeling.py`
## How to Use
The model can be loaded through the `fill-mask` pipeline:
```python
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="SamJoshua/EsperBERTo",
    tokenizer="SamJoshua/EsperBERTo"
)
fill_mask("Jen la komenco de bela <mask>.")
```
## Evaluation Results
The model has not yet been evaluated on a standardized test set. Future updates will include evaluation metrics such as perplexity and accuracy on a held-out validation set.
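Until a formal evaluation exists, perplexity can be estimated as the exponential of the masked-LM cross-entropy loss on held-out text. The sketch below uses a tiny randomly initialized RoBERTa so it runs offline without downloading weights; to evaluate EsperBERTo itself, load the real checkpoint as noted in the comments:

```python
import math
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Tiny randomly initialized RoBERTa, used here only so the sketch runs offline.
# To evaluate the real model, use instead:
#   model = RobertaForMaskedLM.from_pretrained("SamJoshua/EsperBERTo")
config = RobertaConfig(vocab_size=52_000, hidden_size=64, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=128)
model = RobertaForMaskedLM(config)
model.eval()

# Token ids standing in for a tokenized held-out sentence; a real evaluation
# would encode text with the EsperBERTo tokenizer and mask a subset of tokens.
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    loss = model(input_ids=input_ids, labels=input_ids).loss

# Perplexity as exp(cross-entropy). This is a crude estimate: proper MLM
# evaluation scores only masked positions rather than every token.
perplexity = math.exp(loss.item())
print(f"perplexity: {perplexity:.1f}")
```

A random model's perplexity will be near the vocabulary size; a trained model should score far lower on in-domain Esperanto text.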
## Intended Uses & Limitations
**Intended Uses**: This model is intended for researchers, developers, and language enthusiasts exploring Esperanto language processing, for tasks such as masked-token prediction, text generation, and downstream fine-tuning (e.g., sentiment analysis).
**Limitations**:
- The model was trained for only one epoch due to computational constraints, which may limit its grasp of more complex language structures.
- Because the model is trained on public web text, it may learn and replicate social biases present in the training data.
Contributions are welcome: fine-tune the model on specific tasks, or extend its training with more data or epochs. This model serves as a baseline for further research and development in Esperanto language modeling.