BERT NER - Brazilian Addresses

Token classification model fine-tuned from BERTimbau for Named Entity Recognition of Brazilian addresses.

Supported Entities

Label        Description
RUA          Street / Avenue / Road name
NUMERO       Street number
BAIRRO       Neighborhood
CIDADE       City
ESTADO       State (UF)
CEP          ZIP code
COMPLEMENTO  Address complement (apartment, block, lot, etc.)
REFERENCIA   Reference point / landmark
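
The usage code in this card reads labels with "B-" and "I-" prefixes, which suggests a BIO tagging scheme over the eight entity types above. The following is a minimal sketch of what such a label set could look like, assuming one "B-" and one "I-" tag per entity plus a single "O" tag. The helper name build_bio_labels is hypothetical, and the real label order and ids are whatever the model's config.id2label defines.

```python
# Entity types listed in the table above.
ENTITY_TYPES = [
    "RUA", "NUMERO", "BAIRRO", "CIDADE",
    "ESTADO", "CEP", "COMPLEMENTO", "REFERENCIA",
]


def build_bio_labels(entity_types):
    """Return a hypothetical BIO label list: "O" plus B-/I- tags per type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation token of the same entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))  # 17 labels under this assumption (1 + 2 * 8)
```

To see the label set the model actually uses, inspect model.config.id2label after loading it.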

Benchmark

Entity       Precision  Recall  F1
RUA          1.0000     1.0000  1.0000
NUMERO       1.0000     1.0000  1.0000
BAIRRO       1.0000     1.0000  1.0000
CIDADE       1.0000     1.0000  1.0000
ESTADO       1.0000     1.0000  1.0000
CEP          1.0000     1.0000  1.0000
COMPLEMENTO  0.8571     0.6000  0.7059
REFERENCIA   0.8182     0.9000  0.8571
Overall      0.9744     0.9580  0.9661
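
F1 is the harmonic mean of precision and recall, so each F1 value can be checked against the other two columns. The sketch below recomputes it for the three rows that are not perfect scores. The function name f1_score is illustrative, and because the inputs are already rounded to four decimals, the recomputed values can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the benchmark table.
rows = {
    "COMPLEMENTO": (0.8571, 0.6000, 0.7059),
    "REFERENCIA": (0.8182, 0.9000, 0.8571),
    "Overall": (0.9744, 0.9580, 0.9661),
}

for name, (precision, recall, reported) in rows.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: reported={reported:.4f} recomputed={recomputed:.4f}")
```

The low COMPLEMENTO recall (0.6000) is what pulls its F1 down: the model misses more complements than it mislabels.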

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ottema/bert-addresses-brazil")
model = AutoModelForTokenClassification.from_pretrained("ottema/bert-addresses-brazil")

text = "Rua das Flores 123, Apto 402, Centro, Sao Paulo - SP. CEP 01310-100"

encoding = tokenizer(text, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=128)
offsets = encoding["offset_mapping"][0].tolist()

with torch.no_grad():
    logits = model(input_ids=encoding["input_ids"], attention_mask=encoding["attention_mask"]).logits
    preds = torch.argmax(logits, dim=-1)[0].tolist()

id2label = model.config.id2label
entities = []
current_type = None
current_start = None
current_end = None

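for pred, (start, end) in zip(preds, offsets):
    # Special tokens ([CLS], [SEP]) have empty (0, 0) offsets; skip them.
    if start == end:
        continue
    # transformers converts id2label keys to int when it loads the config.
    label = id2label[pred]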
    if label.startswith("B-"):
        if current_type:
            entities.append((current_type, text[current_start:current_end].strip()))
        current_type = label[2:]
        current_start = start
        current_end = end
    elif label.startswith("I-") and current_type == label[2:]:
        current_end = end
    else:
        if current_type:
            entities.append((current_type, text[current_start:current_end].strip()))
            current_type = None

if current_type:
    entities.append((current_type, text[current_start:current_end].strip()))

for entity_type, value in entities:
    print(f"{entity_type}: {value}")

Output:

RUA: Rua das Flores
NUMERO: 123
COMPLEMENTO: Apto 402
BAIRRO: Centro
CIDADE: Sao Paulo
ESTADO: SP
CEP: 01310-100
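
The span-merging loop above can be factored into a function that runs without the model, which makes the BIO grouping logic easy to test in isolation. This is a minimal sketch: the name merge_bio_spans is hypothetical, and the labels and word-level offsets in the example are written by hand for illustration, not produced by the tokenizer or the model.

```python
from typing import List, Optional, Tuple


def merge_bio_spans(
    text: str,
    labels: List[str],
    offsets: List[Tuple[int, int]],
) -> List[Tuple[str, str]]:
    """Merge per-token BIO labels into (entity_type, surface_text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_start = current_end = 0

    def flush() -> None:
        # Emit the entity being built, if any.
        if current_type is not None:
            entities.append((current_type, text[current_start:current_end].strip()))

    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special tokens carry empty offsets
            continue
        if label.startswith("B-"):
            flush()
            current_type, current_start, current_end = label[2:], start, end
        elif label.startswith("I-") and current_type == label[2:]:
            current_end = end  # extend the open entity
        else:
            flush()  # "O" or a mismatched "I-" tag closes the entity
            current_type = None
    flush()
    return entities


# Hand-written example (hypothetical labels and offsets).
text = "Rua das Flores 123, Centro"
offsets = [(0, 0), (0, 3), (4, 7), (8, 14), (15, 18), (18, 19), (20, 26), (0, 0)]
labels = ["O", "B-RUA", "I-RUA", "I-RUA", "B-NUMERO", "O", "B-BAIRRO", "O"]
print(merge_bio_spans(text, labels, offsets))
# [('RUA', 'Rua das Flores'), ('NUMERO', '123'), ('BAIRRO', 'Centro')]
```

With the real model, labels would be the predicted ids mapped through model.config.id2label, and offsets would come from the tokenizer's offset mapping, as in the usage code above.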

Training Details

  • Base model: neuralmind/bert-base-portuguese-cased (BERTimbau)
  • Epochs: 4
  • Batch size: 16
  • Learning rate: 2e-5
  • Dropout: 0.2
  • Weight decay: 0.05
  • Label smoothing: 0.1
  • Early stopping patience: 2
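
The hyperparameters above could map onto a Hugging Face Trainer setup roughly as follows. This is a sketch under stated assumptions, not the author's training script: the card does not say which training loop was used, whether the batch size is per device, or which dropout probability was set to 0.2. The keys reuse parameter names from transformers.TrainingArguments, BertConfig, and EarlyStoppingCallback; the three dictionary names are hypothetical.

```python
# Keys named after transformers.TrainingArguments parameters.
training_args_kwargs = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 16,  # assumption: batch size is per device
    "learning_rate": 2e-5,
    "weight_decay": 0.05,
    "label_smoothing_factor": 0.1,
}

# Keys named after BertConfig fields.
# Assumption: the 0.2 dropout applies to both hidden and attention dropout;
# the card does not specify which dropout was changed.
model_config_overrides = {
    "hidden_dropout_prob": 0.2,
    "attention_probs_dropout_prob": 0.2,
}

# Key named after the EarlyStoppingCallback constructor argument.
early_stopping_kwargs = {"early_stopping_patience": 2}

for name, value in training_args_kwargs.items():
    print(f"{name} = {value}")
```

If these were passed to the Trainer, EarlyStoppingCallback would also need periodic evaluation, load_best_model_at_end=True, and a metric_for_best_model; the card does not state which metric was monitored.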