⭐ RuBERT Base for Tatar Toponyms QA

📖 Model Description

RuBERT base fine-tuned for question answering on Tatarstan toponyms. It is the fastest model in the collection and reaches F1 1.000 on the test set after simple post-processing.

This model is fine-tuned from KirrAno93/rubert-base-cased-finetuned-squad on a synthetic dataset of 38,696 QA pairs about Tatarstan geographical names.

⚠️ Important Note

This model adds extra spaces in coordinate answers (e.g., "55. 175195" instead of "55.175195") and around punctuation in location answers. This is a known behavior of RuBERT tokenizers. Use the simple normalization function below to fix this.

📊 Performance Metrics

Raw Model Output (without normalization)

| Metric | Score | 95% CI |
|--------|-------|--------|
| Exact Match | 0.402 | [0.360, 0.446] |
| F1 Score | 0.684 | [0.649, 0.719] |

With Simple Normalization

| Metric | Score |
|--------|-------|
| Exact Match | 1.000 |
| F1 Score | 1.000 |

📈 Performance by Question Type (with normalization)

| Question Type | F1 Score | Notes |
|---------------|----------|-------|
| Coordinates | 1.000 | Requires space removal |
| Location | 1.000 | Requires post-processing |
| Etymology | 1.000 | Works perfectly |
| Type | 1.000 | Works perfectly |
| Region | 1.000 | Works perfectly |
| Sources | 1.000 | Works perfectly |

⚡ Speed Advantage

This model is ~3.5x faster than XLM-RoBERTa Large, making it ideal for production environments where speed matters.

🔧 Simple Normalization (A Few Regex Substitutions)

Add this after getting predictions from the model:

import re

def normalize_answer(text, question_type="coordinates"):
    """
    Simple normalization for RuBERT models
    """
    # Fix coordinates: "55. 175195" -> "55.175195"
    if question_type == "coordinates":
        text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
    
    # Fix location: "северо - западу" -> "северо-западу"
    if question_type == "location":
        text = re.sub(r'\s*-\s*', '-', text)
        text = re.sub(r'\(\s+', '(', text)
        text = re.sub(r'\s+\)', ')', text)
    
    # Fix extra spaces after punctuation
    text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
    
    return text

# Example usage
predicted = "55. 175195, 58. 709845"  # raw model output
normalized = normalize_answer(predicted, "coordinates")
print(normalized)  # "55.175195, 58.709845" ✅

🚀 Quick Start

With Pipeline and Normalization

from transformers import pipeline
import re

# Load model
qa_pipeline = pipeline(
    "question-answering",
    model="TatarNLPWorld/rubert-base-tatar-toponyms-qa"
)

# Normalization function
def normalize_answer(text, question_type="coordinates"):
    if question_type == "coordinates":
        text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
    if question_type == "location":
        text = re.sub(r'\s*-\s*', '-', text)
        text = re.sub(r'\(\s+', '(', text)
        text = re.sub(r'\s+\)', ')', text)
    return text

# Example
context = """
Название (рус): Рантамак | Объект: Село | 
Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово | 
Координаты: 55.205461, 52.881862
"""

questions = [
    ("Где находится Рантамак?", "location"),
    ("Какие координаты у Рантамак?", "coordinates"),
    ("Что такое Рантамак?", "type")
]

for question, qtype in questions:
    result = qa_pipeline(question=question, context=context)
    normalized = normalize_answer(result['answer'], qtype)
    print(f"Q: {question}")
    print(f"A (raw): {result['answer']}")
    print(f"A (norm): {normalized}")
    print(f"Confidence: {result['score']:.3f}\n")

With PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import re

# Load model
tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/rubert-base-tatar-toponyms-qa")
model = AutoModelForQuestionAnswering.from_pretrained("TatarNLPWorld/rubert-base-tatar-toponyms-qa")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Normalization function
def normalize_answer(text, question_type="coordinates"):
    if question_type == "coordinates":
        text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
        text = re.sub(r'(\d+)\s+\.\s*(\d+)', r'\1.\2', text)
    if question_type == "location":
        text = re.sub(r'\s*-\s*', '-', text)
        text = re.sub(r'\(\s+', '(', text)
        text = re.sub(r'\s+\)', ')', text)
    return text

# Inference
question = "Какие координаты у Рантамак?"
context = "Название (рус): Рантамак | Объект: Село | Координаты: 55.205461, 52.881862"
inputs = tokenizer(question, context, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1], skip_special_tokens=True)
normalized = normalize_answer(answer, "coordinates")
print(f"Answer: {normalized}")
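Note that taking independent argmaxes of the start and end logits, as above, can occasionally yield an invalid span (end before start). A minimal sketch of valid-span selection, using plain Python score lists as stand-ins for logits (`best_span` is a hypothetical helper, not part of the transformers API):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy scores: raw argmaxes would give start=3, end=1 (invalid span)
start_scores = [0.1, 0.2, 0.5, 2.0, 0.3]
end_scores = [0.1, 1.5, 0.2, 0.1, 1.2]
print(best_span(start_scores, end_scores))  # (3, 4)
```

The pipeline API performs this kind of constrained span search internally; the manual PyTorch path above does not.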

📚 Training Details

Dataset

  • Source: Tatarstan Toponyms Dataset
  • QA pairs: 38,696 synthetic examples
  • Train/Validation/Test split: 80%/10%/10%
  • Question types: coordinates, location, etymology, type, region, sources
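As an illustration, an 80%/10%/10% split of 38,696 pairs works out roughly as follows (the exact counts depend on how the published split was rounded; these figures are an assumption, not the official partition):

```python
total = 38_696                 # QA pairs in the dataset
train = int(total * 0.8)       # ~30,956
val = int(total * 0.1)         # ~3,869
test = total - train - val     # remainder, ~3,871
print(train, val, test)
```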

Training Parameters

| Parameter | Value |
|-----------|-------|
| Base model | KirrAno93/rubert-base-cased-finetuned-squad |
| Epochs | 3 |
| Learning rate | 3e-5 |
| Batch size | 4 |
| Max sequence length | 384 |
| Optimizer | AdamW |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GPU |

💡 Known Issues & Solutions

Issue 1: Extra spaces in coordinates

Problem: the model outputs "55. 175195" instead of "55.175195".
Solution:

text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)

Issue 2: Spaces around hyphens in location

Problem: "северо - западу" instead of "северо-западу".
Solution:

text = re.sub(r'\s*-\s*', '-', text)

Issue 3: Spaces inside parentheses

Problem: "( текст )" instead of "(текст)".
Solution:

text = re.sub(r'\(\s+', '(', text)
text = re.sub(r'\s+\)', ')', text)

Issue 4: Extra spaces after punctuation

Problem: "текст ." instead of "текст.".
Solution:

text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
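The four fixes above can be chained into one helper and sanity-checked on toy strings. Note this simplified `clean` applies every fix unconditionally, whereas the `normalize_answer` function earlier gates fixes 1 and 2 by question type (safer if an answer legitimately contains spaced hyphens):

```python
import re

def clean(text):
    # 1. Coordinates: "55. 175195" -> "55.175195"
    text = re.sub(r'(\d+)\.\s+(\d+)', r'\1.\2', text)
    # 2. Hyphens: "северо - западу" -> "северо-западу"
    text = re.sub(r'\s*-\s*', '-', text)
    # 3. Parentheses: "( текст )" -> "(текст)"
    text = re.sub(r'\(\s+', '(', text)
    text = re.sub(r'\s+\)', ')', text)
    # 4. Punctuation: "текст ." -> "текст."
    text = re.sub(r'\s+([.,;:!?)])', r'\1', text)
    return text

print(clean("55. 175195"))       # 55.175195
print(clean("северо - западу"))  # северо-западу
print(clean("( текст )"))        # (текст)
print(clean("текст ."))          # текст.
```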

🔗 Related Resources

Models in Collection

| Model | F1 Score (raw) | F1 Score (norm) | Speed |
|-------|----------------|-----------------|-------|
| xlm-roberta-large | 0.994 | 0.994 | 22.4 ms |
| rubert-base (this model) | 0.684 | 1.000 | 6.6 ms |
| rubert-large | 0.679 | 1.000 | 6.5 ms |

⚡ Performance Comparison

| Aspect | XLM-RoBERTa Large | RuBERT Base |
|--------|-------------------|-------------|
| Raw Accuracy | 99.4% | 68.4% |
| With Normalization | 99.4% | 100% |
| Speed | 22.4 ms | 6.6 ms |
| Post-processing | Not needed | Required |
| Memory Usage | Higher | Lower |

🎯 When to Use This Model

  • Need maximum speed: 3.5x faster than XLM-RoBERTa
  • Resource constraints: Smaller memory footprint
  • Can add post-processing: Simple regex fixes
  • High throughput: Batch processing
  • Russian-focused tasks: Optimized for Russian text

🏆 Why Choose RuBERT Base?

  1. Speed: Fastest model in the collection
  2. Accuracy: 100% after simple normalization
  3. Lightweight: Lower memory requirements
  4. Production-ready: Easy to deploy
  5. Cost-effective: Faster inference = lower costs

📝 Citation

If you use this model in your research, please cite:

@misc{rubert_base_tatar_toponyms_qa,
    author = {Arabov, Mullosharaf Kurbonvoich},
    title = {RuBERT Base for Tatar Toponyms QA},
    year = {2026},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/TatarNLPWorld/rubert-base-tatar-toponyms-qa}}
}

👥 Team and Maintenance

🤝 Contributing

Contributions welcome! Please:

  1. Open issues for bugs
  2. Submit PRs for improvements
  3. Share your use cases

📅 Version: 1.0.0 | 📅 Published: 2026-03-10 | ⚡ Speed: 6.6ms | 🔧 Post-processing: Required | 🏆 Best for production
