Maskara
maskara is a lightweight BERT token-classification model for detecting personally identifiable information (PII) in text.
This checkpoint is a continued fine-tune of the existing somukandula/maskara model on ai4privacy/open-pii-masking-500k-ai4privacy. It keeps the original maskara label taxonomy and does not expand the classifier head to every AI4Privacy label.
What Changed
- Continued training on all 464,150 rows from the AI4Privacy training split.
- Evaluated on a 20,000-row slice of the AI4Privacy validation split.
- Used the model's own tokenizer instead of the dataset's `mbert_tokens` columns.
- Converted AI4Privacy character spans from `privacy_mask` into token labels.
- Mapped compatible AI4Privacy labels into the existing `maskara` labels.
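The span-to-token conversion mentioned above can be sketched as follows. This is a minimal illustration, not the training script itself: the function name `spans_to_bio` is hypothetical, the `start` / `end` / `label` keys are an assumed layout for `privacy_mask` entries, and the offsets are hand-written stand-ins for what a fast tokenizer returns with `return_offsets_mapping=True`.

```python
def spans_to_bio(offsets, spans):
    """Assign a BIO tag to each token from character-level spans.

    offsets: list of (start, end) character offsets, one per token;
             (0, 0) marks special tokens such as [CLS] and [SEP].
    spans:   list of dicts with "start", "end", and "label" keys
             (assumed layout, not confirmed by the model card).
    """
    tags = ["O"] * len(offsets)
    for span in spans:
        inside = False  # becomes True once the span's first token is tagged
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # skip special tokens with empty offsets
            # A token belongs to the span if the two character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                tags[i] = ("I-" if inside else "B-") + span["label"]
                inside = True
    return tags


# Hand-written offsets for "My name is Priya Sharma." (illustrative only).
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 16), (17, 23), (23, 24), (0, 0)]
spans = [{"start": 11, "end": 23, "label": "PERSON_NAME"}]
tags = spans_to_bio(offsets, spans)
print(tags)
```

Working from character offsets rather than the dataset's pre-tokenized columns keeps the labels aligned with whatever tokenizer the model actually uses.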
Labels
The model predicts BIO tags for these PII classes:
- `ADDRESS`
- `API_KEY`
- `CREDIT_CARD`
- `DATE_OF_BIRTH`
- `DRIVER_LICENSE`
- `EMAIL`
- `IP_ADDRESS`
- `LOCATION`
- `PASSWORD`
- `PERSON_NAME`
- `PHONE`
- `SSN`
- `USERNAME`
The full label set is O plus B- and I- variants for each class above.
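As a rough sketch, the full tag list can be enumerated from the 13 classes above. The ordering and the names `PII_CLASSES`, `LABELS`, `label2id`, and `id2label` here are illustrative assumptions; the checkpoint's actual id assignment should be read from `model.config.id2label`.

```python
# The 13 PII classes listed in this model card.
PII_CLASSES = [
    "ADDRESS", "API_KEY", "CREDIT_CARD", "DATE_OF_BIRTH", "DRIVER_LICENSE",
    "EMAIL", "IP_ADDRESS", "LOCATION", "PASSWORD", "PERSON_NAME",
    "PHONE", "SSN", "USERNAME",
]

# "O" plus a B- and an I- tag per class. The id order is an assumption
# made for illustration, not the checkpoint's confirmed order.
LABELS = ["O"] + [f"{prefix}-{name}" for name in PII_CLASSES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))
```

With 13 classes this gives 1 + 2 × 13 = 27 tags.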
Usage
```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="somukandula/maskara",
    aggregation_strategy="simple",
)

text = "My name is Priya Sharma and my email is priya@example.com."
print(ner(text))
```
For lower-level control:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "somukandula/maskara"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
```
Training Details
| Item | Value |
|---|---|
| Source checkpoint | somukandula/maskara |
| Dataset | ai4privacy/open-pii-masking-500k-ai4privacy |
| Train rows | 464,150 |
| Validation rows | 20,000 |
| Epochs | 1 |
| Platform | Modal |
| GPU | A10G |
| Final Modal path | /outputs/full-openpii-500k/final |
The training script used source_text and privacy_mask spans from the dataset. It did not train directly on the dataset's mbert_tokens / mbert_token_classes, because this model is based on a small uncased BERT tokenizer rather than mBERT.
Evaluation
Evaluation was run after one epoch on a 20,000-row validation slice.
| Metric | Value |
|---|---|
| Eval loss | 0.1640 |
| Precision | 0.4926 |
| Recall | 0.5602 |
| F1 | 0.5243 |
| Accuracy | 0.9453 |
These metrics are for the mapped maskara label taxonomy, not for the full AI4Privacy taxonomy.
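As a quick consistency check, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded table values; the small difference from the reported 0.5243 is what rounding of the inputs would be expected to produce.

```python
precision = 0.4926
recall = 0.5602

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```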
Label Mapping
The AI4Privacy dataset includes labels that are not present in the original maskara taxonomy. Compatible labels were mapped into existing classes:
| AI4Privacy label examples | Maskara label |
|---|---|
| `GIVENNAME`, `SURNAME`, `FIRSTNAME`, `LASTNAME` | `PERSON_NAME` |
| `TELEPHONENUM`, `PHONENUMBER` | `PHONE` |
| `SOCIALNUM` | `SSN` |
| `DRIVERLICENSENUM` | `DRIVER_LICENSE` |
| `CITY`, `STATE`, `COUNTRY` | `LOCATION` |
| `STREET`, `BUILDINGNUM`, `ZIPCODE` | `ADDRESS` |
| `CREDITCARDNUMBER` | `CREDIT_CARD` |
| `EMAIL` | `EMAIL` |
Unsupported labels were labeled `O` during this run rather than forced into incorrect classes. Examples include:
- `DATE`
- `TIME`
- `AGE`
- `IDCARDNUM`
- `PASSPORTNUM`
- `TAXNUM`
- `TITLE`
- `SEX`
- `GENDER`
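The mapping described in this section can be expressed as a lookup with `O` as the fallback. This is a sketch built only from the example labels in the table, so it may not cover every label the training script handled; `AI4PRIVACY_TO_MASKARA` and `map_label` are hypothetical names.

```python
# Mapping taken from the table above; the table lists examples,
# so the real training script may have covered more labels.
AI4PRIVACY_TO_MASKARA = {
    "GIVENNAME": "PERSON_NAME", "SURNAME": "PERSON_NAME",
    "FIRSTNAME": "PERSON_NAME", "LASTNAME": "PERSON_NAME",
    "TELEPHONENUM": "PHONE", "PHONENUMBER": "PHONE",
    "SOCIALNUM": "SSN",
    "DRIVERLICENSENUM": "DRIVER_LICENSE",
    "CITY": "LOCATION", "STATE": "LOCATION", "COUNTRY": "LOCATION",
    "STREET": "ADDRESS", "BUILDINGNUM": "ADDRESS", "ZIPCODE": "ADDRESS",
    "CREDITCARDNUMBER": "CREDIT_CARD",
    "EMAIL": "EMAIL",
}


def map_label(ai4privacy_label):
    """Return the maskara class for a dataset label, or "O" if unsupported."""
    return AI4PRIVACY_TO_MASKARA.get(ai4privacy_label, "O")


print(map_label("GIVENNAME"), map_label("PASSPORTNUM"))
```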
Intended Use
Use this model for PII-oriented token classification where the existing maskara labels are sufficient. It is intended for experimentation, prototyping, and PII masking workflows that can tolerate the taxonomy above.
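For a masking workflow, one possible approach is to replace each detected span with a bracketed class name. The sketch below assumes entity dicts carrying the `entity_group`, `start`, and `end` fields that the Transformers pipeline returns when an aggregation strategy is set; the `mask_text` helper is hypothetical, and the entities shown are hand-written for illustration, not actual model output.

```python
def mask_text(text, entities):
    """Replace each entity span in text with a [LABEL] placeholder."""
    masked = text
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


text = "My name is Priya Sharma and my email is priya@example.com."
# Hand-written entities for illustration; real predictions may differ.
entities = [
    {"entity_group": "PERSON_NAME", "start": 11, "end": 23},
    {"entity_group": "EMAIL", "start": 40, "end": 57},
]
masked = mask_text(text, entities)
print(masked)
```

Given the recall reported in the evaluation section, a workflow like this will leave some PII unmasked, so it is best paired with review or additional detectors.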
Limitations
- This checkpoint does not detect AI4Privacy-only labels such as `DATE`, `TIME`, `AGE`, `IDCARDNUM`, or `PASSPORTNUM` as separate classes.
- The model is small and optimized for lightweight inference, not maximum recall.
- The dataset is multilingual, but this model uses a small uncased BERT tokenizer; evaluate carefully before relying on it for non-English text.
- The reported metrics are from a validation slice and should be re-measured on your target domain before production use.
Training Provenance
- Modal run: `full-openpii-500k`
- Hugging Face upload commit: `997b1cfc6454356e95faa5e0016a92b928a29c4e`