Turkish Address Classifier (State + City)

A multi-output BERT model that predicts both the state (province) and the city (district) of a Turkish street address.

Model Description

  • Base Model: dbmdz/bert-base-turkish-uncased
  • Task: Multi-output text classification
  • Language: Turkish (tr)
  • Outputs:
    • State (İl): 81 Turkish provinces
    • City (İlçe): 954 districts/municipalities
  • Training Data: Turkish street addresses with state and city labels

Usage

Installation

pip install transformers torch huggingface_hub scikit-learn

Basic Prediction

from transformers import AutoTokenizer, BertPreTrainedModel, BertModel
from torch import nn
import torch
import pickle
from huggingface_hub import hf_hub_download

# Define the model architecture
class MultiOutputModel(BertPreTrainedModel):
    def __init__(self, config, num_state_labels, num_city_labels):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
        self.state_classifier = nn.Linear(config.hidden_size, num_state_labels)
        self.city_classifier = nn.Linear(config.hidden_size, num_city_labels)
        
        self.num_state_labels = num_state_labels
        self.num_city_labels = num_city_labels
        
        self.post_init()
    
    def forward(self, input_ids=None, attention_mask=None, state_labels=None, city_labels=None, **kwargs):
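        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output is the [CLS] representation after BERT's pooling layer (same tensor as outputs[1])
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)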
        
        state_logits = self.state_classifier(pooled_output)
        city_logits = self.city_classifier(pooled_output)
        
        return {
            'state_logits': state_logits,
            'city_logits': city_logits
        }

# Hub repository that hosts the model weights, tokenizer, and label encoders
model_path = "ucanbaklava/turkish-address-classifier_new"

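# Load label encoders (pickled objects exposing classes_ and inverse_transform,
# presumably scikit-learn LabelEncoder instances; only unpickle files from a source you trust)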
le_state_path = hf_hub_download(repo_id=model_path, filename="state_label_encoder.pkl")
le_city_path = hf_hub_download(repo_id=model_path, filename="city_label_encoder.pkl")

with open(le_state_path, "rb") as f:
    le_state = pickle.load(f)
with open(le_city_path, "rb") as f:
    le_city = pickle.load(f)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)

from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_path)
# from_pretrained is a classmethod: call it on the class so the model is built only once.
# The extra keyword arguments are forwarded to MultiOutputModel.__init__.
model = MultiOutputModel.from_pretrained(
    model_path,
    config=config,
    num_state_labels=len(le_state.classes_),
    num_city_labels=len(le_city.classes_)
)

model.eval()

# Prediction function
def predict_address(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    
    with torch.no_grad():
        outputs = model(**inputs)
        state_logits = outputs['state_logits']
        city_logits = outputs['city_logits']
        
        state_probs = torch.nn.functional.softmax(state_logits, dim=-1)
        city_probs = torch.nn.functional.softmax(city_logits, dim=-1)
        
        state_conf, state_pred = torch.max(state_probs, dim=-1)
        city_conf, city_pred = torch.max(city_probs, dim=-1)
    
    predicted_state = le_state.inverse_transform([state_pred.item()])[0]
    predicted_city = le_city.inverse_transform([city_pred.item()])[0]
    
    return {
        'state': predicted_state,
        'state_confidence': state_conf.item(),
        'city': predicted_city,
        'city_confidence': city_conf.item()
    }

# Example usage
result = predict_address("atatürk caddesi no:5")
print(f"State: {result['state']} ({result['state_confidence']:.2%})")
print(f"City: {result['city']} ({result['city_confidence']:.2%})")

Example Output

>>> result = predict_address("bağdat caddesi")
>>> print(result)
{
    'state': 'istanbul',
    'state_confidence': 0.95,
    'city': 'kadıköy',
    'city_confidence': 0.92
}

Training Details

Hyperparameters

  • Epochs: 10 (full training, no early stopping)
  • Batch Size: 512
  • Learning Rate: 2e-5
  • Max Sequence Length: 128
  • Optimizer: AdamW with weight decay 0.01
  • Mixed Precision: BF16 + TF32 (A100 optimized)
  • Hardware: NVIDIA A100-80GB GPU
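
As a rough sanity check on these settings, the number of optimizer steps can be worked out from the batch size and the training-set size reported under Data Processing (1,033,830 addresses). This is a minimal sketch, assuming a single device, no gradient accumulation, and that the final partial batch is kept; none of these details are stated in this card.

```python
import math

# Values taken from this model card
train_size = 1_033_830   # training split (90% of 1,148,700)
batch_size = 512
epochs = 10

# Assumption: the last, smaller batch is kept (no drop_last), hence ceil
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")  # 2020
print(f"total steps:     {total_steps}")      # 20200
```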

Data Processing

  • Dataset Size: 1,148,700 Turkish addresses
  • Train/Validation Split: 90/10 (1,033,830 / 114,870 addresses)
  • Turkish-aware lowercasing (maps İ→i and I→ı instead of applying the default Unicode rules)
  • Street type transformations: (Sokak)→sokak, (Cadde)→caddesi, (Bulvar)→bulvarı
  • Removed (küme evler) patterns
  • Removed empty/missing values
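
The preprocessing steps above can be approximated as follows, which is useful for normalizing input the same way before calling the model. This is only a sketch: the function names `turkish_lower` and `normalize_address` are hypothetical, and it assumes that the street-type markers appear literally in parentheses in the raw text and that "(küme evler)" is stripped from the address rather than the whole row being dropped. The original training script may differ.

```python
import re

# Python's str.lower() turns "I" into "i" and "İ" into "i" + U+0307,
# so the two Turkish capitals are mapped explicitly first.
_TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})

# Assumed raw markers (after lowercasing) and their replacements
_STREET_TYPES = {
    "(sokak)": "sokak",
    "(cadde)": "caddesi",
    "(bulvar)": "bulvarı",
}
_KUME_EVLER = re.compile(r"\(küme evler\)")


def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for dotted and dotless i."""
    return text.translate(_TR_LOWER).lower()


def normalize_address(raw: str) -> str:
    """Lowercase, drop '(küme evler)', and expand street-type markers."""
    text = turkish_lower(raw)
    text = _KUME_EVLER.sub(" ", text)
    for marker, replacement in _STREET_TYPES.items():
        text = text.replace(marker, replacement)
    # Collapse any whitespace left behind by the removals
    return " ".join(text.split())


print(normalize_address("Atatürk (Cadde) No:5"))
```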

Model Architecture

The model uses BERT as a backbone with two separate classification heads:

  1. State Classifier: Predicts one of 81 Turkish provinces
  2. City Classifier: Predicts one of 954 districts/municipalities

Both classifiers share the same BERT encoder, making the model efficient and allowing it to learn shared representations for Turkish addresses.
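
To make the efficiency point concrete, the sketch below counts the parameters that the two linear heads add on top of the shared encoder. It assumes a hidden size of 768, the standard value for BERT-base checkpoints; the actual value is available as config.hidden_size. The helper `linear_params` is introduced here for illustration only.

```python
hidden_size = 768   # assumption: BERT-base hidden size (see config.hidden_size)
num_states = 81     # provinces, from this card
num_cities = 954    # districts, from this card


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a single nn.Linear layer."""
    return in_features * out_features + out_features


state_head = linear_params(hidden_size, num_states)
city_head = linear_params(hidden_size, num_cities)

print(f"state head: {state_head:,}")               # 62,289
print(f"city head:  {city_head:,}")                # 733,626
print(f"both heads: {state_head + city_head:,}")   # 795,915
```

Under that assumption the heads together add fewer than one million parameters, which is small next to the encoder both heads reuse.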

Limitations

  • Trained specifically on Turkish addresses
  • May not generalize well to addresses with very different formatting
  • Performance depends on the quality and coverage of training data
  • Some rare city/state combinations may have lower accuracy
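
Because accuracy can drop on rare or unusually formatted addresses, one option is to filter predictions by confidence before using them. The helper below is a hypothetical sketch that works on the dictionary returned by predict_address; the name `accept_prediction` and the threshold values are illustrative assumptions, not tuned or recommended settings.

```python
def accept_prediction(result: dict,
                      min_state_conf: float = 0.80,
                      min_city_conf: float = 0.60) -> dict:
    """Keep a label only if its softmax confidence clears the threshold.

    The city is kept only when the state is also accepted, since a
    district is of little use without a trusted province.
    """
    state_ok = result["state_confidence"] >= min_state_conf
    city_ok = state_ok and result["city_confidence"] >= min_city_conf
    return {
        "state": result["state"] if state_ok else None,
        "city": result["city"] if city_ok else None,
    }


# Dictionary in the same shape as the Example Output section
example = {
    "state": "istanbul",
    "state_confidence": 0.95,
    "city": "kadıköy",
    "city_confidence": 0.92,
}
print(accept_prediction(example))
```

Suitable thresholds would need to be chosen on held-out data, since softmax confidences are not guaranteed to be well calibrated.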

Citation

If you use this model, please cite:

@misc{turkish-address-classifier,
  author = {ucanbaklava},
  title = {Turkish Address Classifier (State + City)},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ucanbaklava/turkish-address-classifier_new}}
}

Model Card Authors

ucanbaklava

Model Card Contact

For questions or feedback, please open an issue on the model repository.
