BERT NER - Brazilian Addresses

Token classification model fine-tuned from BERTimbau for Named Entity Recognition of Brazilian addresses.

Supported Entities

Label	Description
RUA	Street / Avenue / Road name
NUMERO	Street number
BAIRRO	Neighborhood
CIDADE	City
ESTADO	State (UF)
CEP	ZIP code
COMPLEMENTO	Address complement (apartment, block, lot, etc.)
REFERENCIA	Reference point / landmark
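
The usage code further down checks for `B-` and `I-` prefixes, which suggests the model emits BIO tags rather than the bare labels in the table. As a rough sketch, assuming a standard BIO scheme with a single `O` tag for non-entity tokens, the tag set could be enumerated as follows. The helper `build_bio_labels` and the id ordering are illustrative only; the authoritative mapping is `model.config.id2label`.

```python
# Entity types from the table above.
ENTITY_TYPES = [
    "RUA", "NUMERO", "BAIRRO", "CIDADE",
    "ESTADO", "CEP", "COMPLEMENTO", "REFERENCIA",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given types (assumed BIO scheme)."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation tokens
    return labels


bio_labels = build_bio_labels(ENTITY_TYPES)
# Hypothetical id mapping; the real one ships with the model config.
example_id2label = dict(enumerate(bio_labels))
print(len(bio_labels), bio_labels[:3])
```

Under this assumption, eight entity types yield seventeen tags: two per type plus `O`.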

Benchmark

Entity	Precision	Recall	F1
RUA	1.0000	1.0000	1.0000
NUMERO	1.0000	1.0000	1.0000
BAIRRO	1.0000	1.0000	1.0000
CIDADE	1.0000	1.0000	1.0000
ESTADO	1.0000	1.0000	1.0000
CEP	1.0000	1.0000	1.0000
COMPLEMENTO	0.8571	0.6000	0.7059
REFERENCIA	0.8182	0.9000	0.8571
Overall	0.9744	0.9580	0.9661
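
As a sanity check on the table, each F1 value should be the harmonic mean of the precision and recall in the same row. The sketch below recomputes F1 for the rows that are not all 1.0000. It assumes the standard definition F1 = 2PR / (P + R) and allows a small tolerance, since the table values are rounded to four decimals. How the "Overall" row was averaged is not stated in the source, so only its internal consistency is checked.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "COMPLEMENTO": (0.8571, 0.6000, 0.7059),
    "REFERENCIA": (0.8182, 0.9000, 0.8571),
    "Overall": (0.9744, 0.9580, 0.9661),
}

for name, (p, r, reported) in rows.items():
    computed = f1_score(p, r)
    # Inputs are rounded to four decimals, so compare with a tolerance.
    assert abs(computed - reported) < 1e-3, name
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```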

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ottema/bert-addresses-brazil")
model = AutoModelForTokenClassification.from_pretrained("ottema/bert-addresses-brazil")

text = "Rua das Flores 123, Apto 402, Centro, Sao Paulo - SP. CEP 01310-100"

encoding = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=128)
offsets = encoding["offset_mapping"][0].tolist()

with torch.no_grad():
    logits = model(input_ids=encoding["input_ids"], attention_mask=encoding["attention_mask"]).logits
    preds = torch.argmax(logits, dim=-1)[0].tolist()

id2label = model.config.id2label
entities = []
current_type = None
current_start = None
current_end = None

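# Walk the tokens with their character offsets and merge B-/I- tags into spans.
for pred, (start, end) in zip(preds, offsets):
    if start == end:
        # Special tokens such as [CLS] and [SEP] have empty offsets; skip them.
        continue
    # id2label is keyed by int once the config is loaded, so index with pred directly.
    label = id2label[pred]
    if label.startswith("B-"):
        # A new entity begins; flush the one in progress, if any.
        if current_type:
            entities.append((current_type, text[current_start:current_end].strip()))
        current_type = label[2:]
        current_start = start
        current_end = end
    elif label.startswith("I-") and current_type == label[2:]:
        # Continuation of the current entity; extend its end offset.
        current_end = end
    else:
        # "O" or an inconsistent I- tag closes the current entity.
        if current_type:
            entities.append((current_type, text[current_start:current_end].strip()))
            current_type = None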

if current_type:
    entities.append((current_type, text[current_start:current_end].strip()))

for entity_type, value in entities:
    print(f"{entity_type}: {value}")

Output:

RUA: Rua das Flores
NUMERO: 123
COMPLEMENTO: Apto 402
BAIRRO: Centro
CIDADE: Sao Paulo
ESTADO: SP
CEP: 01310-100
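
The span-merging step in the usage code can also be factored into a standalone function, which makes it testable without downloading the model. The following is a minimal sketch with the same merging rules. The function name `merge_bio_spans` is hypothetical, and the labels and offsets in the example are hand-written for illustration, not actual model output.

```python
def merge_bio_spans(text, labels, offsets):
    """Merge per-token BIO labels into (entity_type, surface_text) pairs.

    labels  -- one BIO tag per token, e.g. "B-RUA", "I-RUA", "O"
    offsets -- one (start, end) character span per token
    """
    entities = []
    current = None  # [entity_type, start, end] of the span being built

    def flush():
        nonlocal current
        if current is not None:
            kind, start, end = current
            entities.append((kind, text[start:end].strip()))
            current = None

    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # special tokens carry an empty offset
        if label.startswith("B-"):
            flush()
            current = [label[2:], start, end]
        elif label.startswith("I-") and current is not None and current[0] == label[2:]:
            current[2] = end  # extend the open span
        else:
            flush()  # "O" or an inconsistent I- tag closes the span
    flush()
    return entities


# Hand-written word-level tags for illustration only; not model output.
sample_text = "Rua das Flores 123, Centro"
sample_labels = ["O", "B-RUA", "I-RUA", "I-RUA", "B-NUMERO", "O", "B-BAIRRO", "O"]
sample_offsets = [(0, 0), (0, 3), (4, 7), (8, 14), (15, 18), (18, 19), (20, 26), (0, 0)]
print(merge_bio_spans(sample_text, sample_labels, sample_offsets))
```

With real model output, `labels` would come from mapping each predicted id through `model.config.id2label`, and `offsets` from the tokenizer's `offset_mapping`.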

Training Details

Base model: neuralmind/bert-base-portuguese-cased (BERTimbau)
Epochs: 4
Batch size: 16
Learning rate: 2e-5
Dropout: 0.2
Weight decay: 0.05
Label smoothing: 0.1
Early stopping patience: 2
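
For readers who want to reproduce a similar setup, the sketch below maps the listed hyperparameters onto parameter names from the Hugging Face `transformers` library. The source does not say which training loop was used, so the use of `TrainingArguments`, `EarlyStoppingCallback`, and the interpretation of "Dropout" as BERT's hidden and attention dropout are all assumptions. The dictionaries are plain Python so the block runs without any dependency.

```python
# Hyperparameters as listed above, keyed by the names of real
# transformers.TrainingArguments parameters. Whether the author used the
# Trainer API is an assumption.
training_kwargs = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 16,  # assumes a single device
    "learning_rate": 2e-5,
    "weight_decay": 0.05,
    "label_smoothing_factor": 0.1,
}

# Assumption: the dropout value applies to BERT's hidden and attention dropout
# (BertConfig.hidden_dropout_prob / attention_probs_dropout_prob). It might
# instead refer to the classifier head only.
model_config_overrides = {
    "hidden_dropout_prob": 0.2,
    "attention_probs_dropout_prob": 0.2,
}

# Assumption: early stopping via transformers.EarlyStoppingCallback, which
# also needs periodic evaluation and load_best_model_at_end=True.
early_stopping_kwargs = {"early_stopping_patience": 2}

# With transformers installed, these could be passed along as, for example:
#   TrainingArguments(output_dir="out", **training_kwargs)
#   EarlyStoppingCallback(**early_stopping_kwargs)
for name, value in {**training_kwargs, **model_config_overrides, **early_stopping_kwargs}.items():
    print(f"{name} = {value}")
```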

Weights format: Safetensors
Model size: 0.1B params
Tensor type: F32
