
mBERT-Occitan

A multilingual BERT model fine-tuned for the Medieval Occitan language using a hybrid tokenization approach (mBERT + BPE).

Model Description

This model is based on bert-base-multilingual-cased and has been fine-tuned on Occitan text using Masked Language Modeling (MLM). The model uses a hybrid tokenization approach that combines:

  • The original mBERT tokenizer vocabulary
  • Additional BPE (Byte Pair Encoding) subword units trained specifically on Occitan text
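One plausible way to build such a hybrid tokenizer (a sketch under assumptions, not necessarily the exact procedure used for this model) is to train a small BPE model on Occitan text with the `tokenizers` library and add only its novel subwords to the mBERT tokenizer. The corpus below is placeholder text, not the actual training data:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoTokenizer

# Tiny illustrative Occitan corpus (placeholder, not the real training data)
corpus = [
    "Lo temps es bèl uèi.",
    "La lenga occitana es una lenga romanica.",
]

# Train a small BPE model on the corpus
bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
bpe.train_from_iterator(corpus, trainer)

# Keep only subwords that mBERT does not already know
mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
new_tokens = [t for t in bpe.get_vocab() if t not in mbert_tok.get_vocab()]
added = mbert_tok.add_tokens(new_tokens)
print(f"Added {added} Occitan-specific tokens")

# The model's embedding matrix must then be resized to match:
# model.resize_token_embeddings(len(mbert_tok))
```

Resizing the embedding matrix gives the new subwords trainable embeddings that are then learned during MLM fine-tuning.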

Training Details

  • Base Model: bert-base-multilingual-cased
  • Training Objective: Masked Language Modeling (MLM)
  • MLM Probability: 15%
  • Epochs: 10
  • Batch Size: 32
  • Learning Rate: 5e-5
  • Max Sequence Length: 512
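With these hyperparameters, the fine-tuning setup can be sketched using the `transformers` Trainer API. This is a hypothetical script; the dataset pipeline and any arguments not listed above are assumptions:

```python
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Mask 15% of input tokens for the MLM objective
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Hyperparameters from the list above
args = TrainingArguments(
    output_dir="mbert-occitan",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

# With an Occitan dataset tokenized at max_length=512, training would be:
# model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_dataset)
# trainer.train()
```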

Performance

  • Perplexity on Occitan validation set: 9.52
  • Improvement over original mBERT: 98.99% reduction in perplexity (from 942.85 to 9.52)
  • Improvement over traditional fine-tuning: 8.8% better than traditional mBERT fine-tuning (9.52 vs 10.44)
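The quoted improvements follow directly from the perplexity numbers; a quick check of the arithmetic:

```python
# Relative perplexity reduction: (old - new) / old
mbert_ppl, finetuned_ppl, hybrid_ppl = 942.85, 10.44, 9.52

vs_mbert = (mbert_ppl - hybrid_ppl) / mbert_ppl * 100
vs_finetuned = (finetuned_ppl - hybrid_ppl) / finetuned_ppl * 100

print(f"vs. original mBERT: {vs_mbert:.2f}% reduction")        # 98.99%
print(f"vs. traditional fine-tuning: {vs_finetuned:.1f}% better")  # 8.8%
```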

Usage

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ahan2000/mBERT-Occitan")
tokenizer = AutoTokenizer.from_pretrained("ahan2000/mBERT-Occitan")

# Predict a masked token ("The weather is ___ today." in Occitan)
text = "Lo temps es [MASK] uèi."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))

Tokenization

The hybrid tokenizer combines:

  • Original mBERT vocabulary (119,547 tokens)
  • Additional Occitan-specific BPE subword units (419 tokens)
  • Total vocabulary size: 119,966 tokens

Limitations

  • The model is fine-tuned specifically for Occitan and may not perform as well on other languages as the original mBERT
  • Training was done on a limited Occitan corpus
  • The model maintains the same architecture as mBERT (12 layers, 768 hidden size)
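The unchanged architecture can be confirmed from the base model's configuration (requires network access to the Hugging Face Hub):

```python
from transformers import AutoConfig

# Fine-tuning does not alter mBERT's architecture
config = AutoConfig.from_pretrained("bert-base-multilingual-cased")
print(config.num_hidden_layers, config.hidden_size)  # 12 768
```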

License

Apache 2.0 (same as the base mBERT model)
