Named Entity Recognition for Azerbaijani Language

A Named Entity Recognition (NER) system for the Azerbaijani language, built on four fine-tuned transformer models and served through a FastAPI application with a web interface. The production model (XLM-RoBERTa base) reaches a 75.22% F1-score across 25 entity categories.

🚀 Live Demo

Try the live demo: Named Entity Recognition Demo (https://named-entity-recognition.fly.dev)

Note: The server runs on a free tier, so the first request after a period of inactivity may take 1-2 minutes while the instance starts.

🏗️ System Architecture

graph TD
    A[User Input] --> B[FastAPI Server]
    B --> C[XLM-RoBERTa Model]
    C --> D[Token Classification]
    D --> E[Entity Aggregation]
    E --> F[Label Mapping]
    F --> G[JSON Response]
    G --> H[Frontend Visualization]
    
    subgraph "Model Pipeline"
        C --> C1[Tokenization]
        C1 --> C2[XLM-RoBERTa Encoding]
        C2 --> C3[Classification Head]
        C3 --> D
    end
    
    subgraph "Entity Categories"
        I[Person] 
        J[Location]
        K[Organization]
        L[Date/Time]
        M[Government]
        N[25 Total Categories]
    end
    
    F --> I
    F --> J
    F --> K
    F --> L
    F --> M
    F --> N
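
The classification head in the diagram emits one raw score (logit) per label for each token. Below is a minimal sketch of how such logits are commonly turned into a predicted label and a confidence score. It assumes the confidence is a softmax probability, which the source does not state; the label names and logits are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability (the confidence)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical label set and logits for a single token.
LABELS = ["O", "B-PERSON", "I-PERSON", "B-LOCATION"]
label, confidence = predict_label([0.2, 3.1, 0.4, 1.0], LABELS)
print(label, round(confidence, 3))
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large scores.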

🤖 Model Training Pipeline

flowchart LR
    A[Azerbaijani NER Dataset] --> B[Data Preprocessing]
    B --> C[Tokenization]
    C --> D[Label Alignment]
    
    subgraph "Model Training"
        E[mBERT] --> F[Fine-tuning]
        G[XLM-RoBERTa] --> F
        H[XLM-RoBERTa Large] --> F
        I[Azeri-Turkish BERT] --> F
        F --> J[Model Evaluation]
    end
    
    D --> E
    D --> G
    D --> H
    D --> I
    
    J --> K[Best Model Selection]
    K --> L[Hugging Face Hub]
    L --> M[Production Deployment]
    
    subgraph "Performance Metrics"
        N[Precision: 76.44%]
        O[Recall: 74.05%]
        P[F1-Score: 75.22%]
    end
    
    J --> N
    J --> O
    J --> P
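
The "Label Alignment" step exists because subword tokenizers split one annotated word into several tokens, while the dataset has one label per word. A minimal sketch of a common alignment scheme follows. It assumes that only the first subword of each word keeps the label and that all other positions are masked with -100; the training notebooks may do this differently. The `word_ids` list mimics what a Hugging Face fast tokenizer reports, and the label ids are made up.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips this value by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword tokens.

    word_ids[i] is the index of the word that token i came from, or None for
    special tokens. Only the first subword of each word keeps the word's label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords,
# wrapped in start/end special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]  # made-up label ids
print(align_labels(word_ids, word_labels))  # [-100, 1, 2, -100, 0, -100]
```

Masking continuation subwords keeps the loss and the metrics at one prediction per annotated word.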

🔄 Data Flow Architecture

sequenceDiagram
    participant U as User
    participant F as Frontend
    participant API as FastAPI
    participant M as XLM-RoBERTa
    participant HF as Hugging Face
    
    U->>F: Enter Azerbaijani text
    F->>API: POST /predict/ 
    API->>M: Process text
    M->>M: Tokenize input
    M->>M: Generate predictions
    M->>API: Return entity predictions
    API->>API: Apply label mapping
    API->>API: Group entities by type
    API->>F: JSON response with entities
    F->>U: Display highlighted entities
    
    Note over M,HF: Model loaded from<br/>IsmatS/xlm-roberta-az-ner
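
The "Apply label mapping" and "Group entities by type" steps can be sketched as below. This is an illustration and not the code in main.py: the `LABEL_MAP` contents and the confidence scores are hypothetical, and the prediction dictionaries assume the shape produced by an aggregated Hugging Face token-classification pipeline (`entity_group`, `word`, `score`). The returned structure mirrors the `{"entities": {...}}` response shown in the usage example.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from raw model labels to display names.
LABEL_MAP = {"PERSON": "Person", "LOCATION": "Location", "DATE": "Date"}


def group_entities(
    predictions: List[dict], label_map: Dict[str, str]
) -> Dict[str, Dict[str, List[str]]]:
    """Map raw labels to readable names and collect entity texts per category."""
    grouped = defaultdict(list)
    for pred in predictions:
        # Fall back to the raw label when no display name is defined.
        category = label_map.get(pred["entity_group"], pred["entity_group"])
        grouped[category].append(pred["word"])
    return {"entities": dict(grouped)}


# Made-up predictions for illustration.
predictions = [
    {"entity_group": "DATE", "word": "2014", "score": 0.98},
    {"entity_group": "PERSON", "word": "İlham Əliyev", "score": 0.99},
    {"entity_group": "LOCATION", "word": "Salyanda", "score": 0.95},
]
result = group_entities(predictions, LABEL_MAP)
print(result)
```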

Project Structure

.
├── Dockerfile                # Docker image configuration
├── README.md                # Project documentation
├── fly.toml                 # Fly.io deployment configuration
├── main.py                  # FastAPI application entry point
├── models/                  # Model-related files
│   ├── NER_from_scratch.ipynb    # Custom NER implementation notebook
│   ├── README.md                 # Models documentation
│   ├── XLM-RoBERTa.ipynb        # XLM-RoBERTa training notebook
│   ├── azeri-turkish-bert-ner.ipynb  # Azeri-Turkish BERT training
│   ├── mBERT.ipynb              # mBERT training notebook
│   ├── push_to_HF.py            # Hugging Face upload script
│   ├── train-00000-of-00001.parquet  # Training data
│   └── xlm_roberta_large.ipynb  # XLM-RoBERTa Large training
├── requirements.txt         # Python dependencies
├── static/                  # Frontend assets
│   ├── app.js               # Frontend logic
│   └── style.css            # UI styling
└── templates/               # HTML templates
    └── index.html           # Main UI template

🧠 Models & Dataset

🏆 Available Models

Model	Parameters	F1-Score	Hugging Face	Status
mBERT Azerbaijani NER	180M	67.70%	✅	Released
XLM-RoBERTa Azerbaijani NER	~278M	75.22%	✅	Production
XLM-RoBERTa Large Azerbaijani NER	~560M	75.48%	✅	Released
Azerbaijani-Turkish BERT Base NER	110M	73.55%	✅	Released

📊 Supported Entity Types (25 Categories)

Category	Category	Category
Person	Government	Law
Location	Date	Language
Organization	Time	Position
Facility	Money	Nationality
Product	Percentage	Disease
Event	Contact	Quantity
Art	Project	Cardinal
Proverb	Ordinal	Miscellaneous
Other

📈 Dataset Information

Source: Azerbaijani NER Dataset
Content: Annotated Azerbaijani text corpus
Language: Azerbaijani (az)
Annotation: IOB2 format with 25 entity categories
Training Infrastructure: A100 GPU on Google Colab Pro+
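
In IOB2, the first token of every entity is tagged `B-<TYPE>`, the following tokens of the same entity `I-<TYPE>`, and everything else `O`. A minimal sketch of decoding such a tag sequence into entities is shown below; the tag names are hypothetical, since the dataset's exact label strings are not listed here.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Merge IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current_tokens and entity_type == current_type:
            current_tokens.append(token)  # continue the open entity
            continue
        if current_tokens:  # any other tag closes the open entity
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):  # a stray I- tag is treated as a start as well
            current_tokens, current_type = [token], entity_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["İlham", "Əliyev", "Salyanda", "olub", "."]
tags = ["B-PERSON", "I-PERSON", "B-LOCATION", "O", "O"]  # hypothetical tag names
print(iob2_to_spans(tokens, tags))
# [('İlham Əliyev', 'PERSON'), ('Salyanda', 'LOCATION')]
```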

📊 Model Performance Comparison

Model	F1-Score
mBERT	67.70%
XLM-RoBERTa Base	75.22%
XLM-RoBERTa Large	75.48%
Azeri-Turkish-BERT	73.55%

📈 Detailed Performance Metrics

mBERT Performance

Epoch	Training Loss	Validation Loss	Precision	Recall	F1	Accuracy
1	0.2952	0.2657	0.7154	0.6229	0.6659	0.9191
2	0.2486	0.2521	0.7210	0.6380	0.6770	0.9214
3	0.2068	0.2534	0.7049	0.6507	0.6767	0.9209

XLM-RoBERTa Base Performance

Epoch	Training Loss	Validation Loss	Precision	Recall	F1
1	0.3231	0.2755	0.7758	0.6949	0.7331
3	0.2486	0.2525	0.7515	0.7412	0.7463
5	0.2238	0.2522	0.7644	0.7405	0.7522
7	0.2097	0.2507	0.7607	0.7394	0.7499
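
The F1 column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it for the epoch 5 row of the XLM-RoBERTa Base table. The result agrees with the reported 0.7522 to within rounding, because the table's precision and recall are themselves rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from epoch 5 of the XLM-RoBERTa Base table.
f1 = f1_score(0.7644, 0.7405)
print(f"{f1:.4f}")
```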

XLM-RoBERTa Large Performance

Epoch	Training Loss	Validation Loss	Precision	Recall	F1
1	0.4075	0.2538	0.7689	0.7214	0.7444
3	0.2144	0.2488	0.7509	0.7489	0.7499
6	0.1526	0.2881	0.7831	0.7284	0.7548
9	0.1194	0.3316	0.7393	0.7495	0.7444

Azeri-Turkish-BERT Performance

Epoch	Training Loss	Validation Loss	Precision	Recall	F1
1	0.4331	0.3067	0.7390	0.6933	0.7154
3	0.2506	0.2751	0.7583	0.7094	0.7330
6	0.1992	0.2861	0.7551	0.7170	0.7355
9	0.1717	0.3138	0.7431	0.7255	0.7342
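
The F1 values in the model comparison table correspond to the best logged epoch of each run. The sketch below reproduces that selection from the per-epoch tables; picking the checkpoint by validation F1 is an assumption about how "Best Model Selection" was done, not something the notebooks are confirmed to implement.

```python
from typing import Dict, Tuple

# F1 per logged epoch, copied from the tables above.
F1_BY_EPOCH = {
    "mBERT": {1: 0.6659, 2: 0.6770, 3: 0.6767},
    "XLM-RoBERTa Base": {1: 0.7331, 3: 0.7463, 5: 0.7522, 7: 0.7499},
    "XLM-RoBERTa Large": {1: 0.7444, 3: 0.7499, 6: 0.7548, 9: 0.7444},
    "Azeri-Turkish-BERT": {1: 0.7154, 3: 0.7330, 6: 0.7355, 9: 0.7342},
}


def best_epoch(f1_by_epoch: Dict[int, float]) -> Tuple[int, float]:
    """Return the (epoch, F1) pair with the highest F1."""
    epoch = max(f1_by_epoch, key=f1_by_epoch.get)
    return epoch, f1_by_epoch[epoch]


for model, scores in F1_BY_EPOCH.items():
    epoch, f1 = best_epoch(scores)
    print(f"{model}: epoch {epoch}, F1 {f1:.2%}")
```

For the two longer runs (XLM-RoBERTa Large and Azeri-Turkish-BERT), validation loss rises after epoch 3 while training loss keeps falling, so selecting on a validation metric rather than taking the final epoch matters.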

⚡ Key Features

🎯 Accuracy: 75.22% F1-score with the production XLM-RoBERTa base model (75.48% with XLM-RoBERTa Large)
🌐 25 Entity Categories: Comprehensive coverage including Person, Location, Organization, Government, and more
🚀 Production Ready: Deployed on Fly.io with FastAPI backend
🎨 Interactive UI: Real-time entity highlighting with confidence scores
🔄 Multiple Models: Four different transformer models to choose from
📊 Confidence Scoring: Each prediction includes confidence metrics
🌍 Multilingual Foundation: Built on XLM-RoBERTa for cross-lingual understanding
📱 Responsive Design: Works seamlessly across desktop and mobile devices

🛠️ Technology Stack

graph LR
    subgraph "Frontend"
        A[HTML5] --> B[CSS3]
        B --> C[JavaScript]
    end
    
    subgraph "Backend"
        D[FastAPI] --> E[Python 3.8+]
        E --> F[Uvicorn]
    end
    
    subgraph "ML Stack"
        G[Transformers] --> H[PyTorch]
        H --> I[Hugging Face]
    end
    
    subgraph "Deployment"
        J[Docker] --> K[Fly.io]
        K --> L[Production]
    end
    
    C --> D
    F --> G
    I --> J

🚀 Setup Instructions

Local Development

Clone the repository

git clone https://huggingface.co/IsmatS/Named_Entity_Recognition
cd Named_Entity_Recognition

Set up Python environment

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Unix/macOS:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run the application

uvicorn main:app --host 0.0.0.0 --port 8080

Fly.io Deployment

Install Fly CLI

# On Unix/macOS
curl -L https://fly.io/install.sh | sh

Configure deployment

# Login to Fly.io
fly auth login

# Initialize app
fly launch

# Configure memory (minimum 2GB recommended)
fly scale memory 2048

Deploy application

fly deploy

# Monitor deployment
fly logs

💡 Usage

Quick Start

1. Access the application:
   - 🏠 Local: http://localhost:8080
   - 🌐 Production: https://named-entity-recognition.fly.dev
2. Enter Azerbaijani text in the input field.
3. Click "Submit" to run the model.
4. Review the results: entities are highlighted by category and shown with confidence scores.

Example Usage

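# Example API request: send the text as form data to the /predict/ endpoint
import requests

response = requests.post(
    "https://named-entity-recognition.fly.dev/predict/",
    data={"text": "2014-cü ildə Azərbaycan Respublikasının prezidenti İlham Əliyev Salyanda olub."},
    timeout=180,  # the free-tier server can take 1-2 minutes to wake up
)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error page

print(response.json())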
# Output: {
#   "entities": {
#     "Date": ["2014"],
#     "Government": ["Azərbaycan"],
#     "Organization": ["Respublikasının"],
#     "Position": ["prezidenti"],
#     "Person": ["İlham Əliyev"],
#     "Location": ["Salyanda"]
#   }
# }