|
|
| --- |
| language: tl |
| tags: |
| - lexical-normalization |
| - filipino |
| - byt5 |
| base_model: google/byt5-base |
| --- |
| |
| # FiLex: Filipino Lexical Normalization |
|
|
| FiLex is a lexical normalization model for Filipino/Tagalog.
| It was created by fine-tuning Google's ByT5-base model on a custom dataset.
| It converts informal or noisy Filipino text (e.g. SMS, social media) into its canonical form.
|
|
| ## Usage |
| ```python |
| from transformers import AutoModelForSeq2SeqLM, AutoTokenizer |
| import torch |
| |
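| # Load the fine-tuned model and its tokenizer from the Hugging Face Hub |
| model = AutoModelForSeq2SeqLM.from_pretrained("Angelo25/Filipino-Lexical-Normalization") |
| tokenizer = AutoTokenizer.from_pretrained("Angelo25/Filipino-Lexical-Normalization") |
| model.eval()  # inference mode: disables dropout |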
| |
| inputs = tokenizer("Sample Input Text", return_tensors="pt").to(model.device) |
| # Inference only, so skip gradient tracking |
| with torch.no_grad(): |
|     output = model.generate( |
|         **inputs, |
|         # Allow the output to be somewhat longer than the input |
|         max_new_tokens=inputs["input_ids"].shape[1] + 50, |
|         num_beams=3, |
|         early_stopping=True, |
|         use_cache=True, |
|     ) |
| print(tokenizer.decode(output[0], skip_special_tokens=True)) |
| |