---
language: tl
tags:
  - lexical-normalization
  - filipino
  - byt5
base_model: google/byt5-base
---

# FiLex: Filipino Lexical Normalization

A lexical normalization model for Filipino/Tagalog, created by fine-tuning Google's ByT5-base model on a custom dataset.
It converts informal or noisy Filipino text (e.g. SMS, social media) into its canonical form.
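
Because the base model is ByT5, text is processed as raw UTF-8 bytes rather than subword tokens, so sequence lengths are measured in bytes, not words or characters. The sketch below illustrates this with a hypothetical helper, `byt5_token_ids`, which is not part of this model or of `transformers`. It assumes the standard ByT5 convention of three reserved IDs (pad, EOS, unk) followed by one ID per byte, with an EOS token appended.

```python
def byt5_token_ids(text: str) -> list[int]:
    """Approximate ByT5 input IDs for `text` (hypothetical helper for illustration)."""
    # Assumption: IDs 0-2 are reserved for pad, EOS, and unk, so each
    # UTF-8 byte maps to byte_value + 3, and EOS (ID 1) is appended.
    return [b + 3 for b in text.encode("utf-8")] + [1]


for sample in ["po", "niño"]:
    ids = byt5_token_ids(sample)
    print(f"{sample!r}: {len(sample)} characters -> {len(ids)} tokens")
```

Non-ASCII characters such as "ñ" take more than one byte, so the token count can exceed the character count by more than the single EOS token.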

## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

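# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained("Angelo25/Filipino-Lexical-Normalization")
tokenizer = AutoTokenizer.from_pretrained("Angelo25/Filipino-Lexical-Normalization")
model.eval()  # inference mode: disables dropout

# Tokenize the input and move the tensors to the model's device.
inputs = tokenizer("Sample Input Text", return_tensors="pt").to(model.device)

# Beam search; the output may be up to 50 tokens longer than the input.
output = model.generate(
    **inputs,
    max_new_tokens=inputs["input_ids"].shape[1] + 50,
    num_beams=3,
    early_stopping=True,
    use_cache=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))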