
mBERT-Occitan

A multilingual BERT model fine-tuned for the Medieval Occitan language using a hybrid tokenization approach (mBERT + BPE).

Model Description

This model is based on bert-base-multilingual-cased and has been fine-tuned on Occitan text using Masked Language Modeling (MLM). The model uses a hybrid tokenization approach that combines:

  • The original mBERT tokenizer vocabulary
  • Additional BPE (Byte Pair Encoding) subword units trained specifically on Occitan text
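One plausible way to build such a hybrid tokenizer (a sketch under assumptions, not necessarily the exact procedure used for this model) is to train a small BPE model on Occitan text with the `tokenizers` library and add only its novel subwords to the mBERT tokenizer. The corpus below is placeholder text, not the actual training data:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer

# Tiny illustrative Occitan corpus (placeholder, not the real training data)
corpus = [
    "Lo temps es bèl uèi.",
    "La lenga occitana es una lenga romanica.",
]

# Train a small BPE model on the corpus
bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
bpe.train_from_iterator(corpus, trainer)

# Keep only subwords that mBERT does not already know
mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
new_tokens = [t for t in bpe.get_vocab() if t not in mbert_tok.get_vocab()]
added = mbert_tok.add_tokens(new_tokens)
print(f"Added {added} Occitan-specific tokens")

# The model's embedding matrix must then be resized to match:
# model.resize_token_embeddings(len(mbert_tok))
```

Resizing the embedding matrix gives the new subwords trainable embeddings that are then learned during MLM fine-tuning.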

Training Details

  • Base Model: bert-base-multilingual-cased
  • Training Objective: Masked Language Modeling (MLM)
  • MLM Probability: 15%
  • Epochs: 10
  • Batch Size: 32
  • Learning Rate: 5e-5
  • Max Sequence Length: 512
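With these hyperparameters, the fine-tuning setup can be sketched using the `transformers` Trainer API. This is a hypothetical script; the dataset pipeline and any arguments not listed above are assumptions:

```python
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Mask 15% of input tokens for the MLM objective
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters from the list above
args = TrainingArguments(
    output_dir="mbert-occitan",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

# With an Occitan dataset tokenized at max_length=512, training would be:
# model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_dataset)
# trainer.train()
```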

Performance

  • Perplexity on Occitan validation set: 9.52
  • Improvement over original mBERT: 98.99% reduction in perplexity (from 942.85 to 9.52)
  • Improvement over traditional fine-tuning: 8.8% better than traditional mBERT fine-tuning (9.52 vs 10.44)
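The quoted improvements follow directly from the perplexity numbers; a quick check of the arithmetic:

```python
# Relative perplexity reduction: (old - new) / old
mbert_ppl, finetuned_ppl, hybrid_ppl = 942.85, 10.44, 9.52

vs_mbert = (mbert_ppl - hybrid_ppl) / mbert_ppl * 100
vs_finetuned = (finetuned_ppl - hybrid_ppl) / finetuned_ppl * 100

print(f"vs. original mBERT: {vs_mbert:.2f}% reduction")        # 98.99%
print(f"vs. traditional fine-tuning: {vs_finetuned:.1f}% better")  # 8.8%
```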

Usage

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ahan2000/mBERT-Occitan")
tokenizer = AutoTokenizer.from_pretrained("ahan2000/mBERT-Occitan")

# Predict a masked token ("The weather is ___ today." in Occitan)
text = "Lo temps es [MASK] uèi."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))

Tokenization

The hybrid tokenizer combines:

  • Original mBERT vocabulary (119,547 tokens)
  • Additional Occitan-specific BPE subword units (419 tokens)
  • Total vocabulary size: 119,966 tokens

Limitations

  • The model is fine-tuned specifically for Occitan and may not perform as well on other languages as the original mBERT
  • Training was done on a limited Occitan corpus
  • The model maintains the same architecture as mBERT (12 layers, 768 hidden size)
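The unchanged architecture can be confirmed from the base model's configuration (requires network access to the Hugging Face Hub):

```python
from transformers import AutoConfig

# Fine-tuning does not alter mBERT's architecture
config = AutoConfig.from_pretrained("bert-base-multilingual-cased")
print(config.num_hidden_layers, config.hidden_size)  # 12 768
```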

License

Apache 2.0 (same as the base mBERT model)
