mBERT-Occitan
A fine-tuned multilingual BERT model adapted to Medieval Occitan using a hybrid tokenization approach (mBERT + BPE).
Model Description
This model is based on bert-base-multilingual-cased and has been fine-tuned on Occitan text using Masked Language Modeling (MLM). The model uses a hybrid tokenization approach that combines:
- The original mBERT tokenizer vocabulary
- Additional BPE (Byte Pair Encoding) subword units trained specifically on Occitan text
Training Details
- Base Model: bert-base-multilingual-cased
- Training Objective: Masked Language Modeling (MLM)
- MLM Probability: 15%
- Epochs: 10
- Batch Size: 32
- Learning Rate: 5e-5
- Max Sequence Length: 512
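As a sketch of the MLM objective above: 15% of input tokens are selected as prediction targets, and in the standard BERT recipe 80% of those are replaced with [MASK], 10% with a random token, and 10% left unchanged. A minimal, self-contained illustration (the token IDs and function name are made up for the example, not taken from the actual training script):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, rng=None):
    """Standard BERT masking: pick mlm_prob of positions as MLM targets,
    then replace 80% with [MASK], 10% with a random token, keep 10%."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 = position ignored by the loss
    for i in range(len(inputs)):
        if rng.random() < mlm_prob:
            labels[i] = inputs[i]  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

masked, labels = mask_tokens(list(range(100)), mask_id=103,
                             vocab_size=119966, rng=random.Random(0))
```

In practice this bookkeeping is handled by `DataCollatorForLanguageModeling(mlm_probability=0.15)` from transformers.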
Performance
- Perplexity on Occitan validation set: 9.52
- Improvement over original mBERT: 98.99% reduction in perplexity (from 942.85 to 9.52)
- Improvement over traditional fine-tuning: 8.8% lower perplexity than mBERT fine-tuned without the hybrid tokenizer (9.52 vs 10.44)
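Perplexity, the metric reported above, is the exponential of the average per-token negative log-likelihood on held-out text. A minimal sketch (the loss values are made up for illustration):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model averaging 2.25 nats of loss per token has perplexity ~9.49,
# close to the 9.52 reported for this model.
losses = [2.1, 2.4, 2.3, 2.2]
print(round(perplexity(losses), 2))  # → 9.49
```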
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ahan2000/mBERT-Occitan")
tokenizer = AutoTokenizer.from_pretrained("ahan2000/mBERT-Occitan")

# Example: predict a masked token
text = "Lo temps es [MASK] uèi."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction at the masked position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
Tokenization
The hybrid tokenizer combines:
- Original mBERT vocabulary (119,547 tokens)
- Additional Occitan-specific BPE subword units (419 tokens)
- Total vocabulary size: 119,966 tokens
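The vocabulary merge can be pictured as appending any Occitan BPE units the base tokenizer does not already contain, each receiving the next free token ID (in the transformers API, `tokenizer.add_tokens` performs this step). A hedged sketch with stand-in tokens, not the actual merge script:

```python
def merge_vocab(base_vocab, new_units):
    """Append BPE units missing from the base vocabulary,
    assigning them the next free token IDs."""
    vocab = dict(base_vocab)
    for unit in new_units:
        if unit not in vocab:
            vocab[unit] = len(vocab)
    return vocab

base = {f"tok{i}": i for i in range(119547)}      # stand-in for mBERT's vocab
occitan_units = [f"##oc{i}" for i in range(419)]  # stand-in Occitan BPE units
merged = merge_vocab(base, occitan_units)
print(len(merged))  # → 119966
```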
Limitations
- The model is fine-tuned specifically for Occitan and may not perform as well on other languages as the original mBERT
- Training was done on a limited Occitan corpus
- The model maintains the same architecture as mBERT (12 layers, 768 hidden size)
License
Apache 2.0 (same as the base mBERT model)