File size: 11,669 Bytes

---
language:
- az
license: mit
tags:
- token-classification
- ner
- azerbaijani
- fastapi
- transformers
- xlm-roberta
pipeline_tag: token-classification
datasets:
- LocalDoc/azerbaijani-ner-dataset
---

# Named Entity Recognition for Azerbaijani Language

A state-of-the-art Named Entity Recognition (NER) system specifically designed for the Azerbaijani language, featuring multiple fine-tuned transformer models and a production-ready FastAPI deployment with an intuitive web interface.

## 🚀 Live Demo

Try the live demo: [Named Entity Recognition Demo](https://named-entity-recognition.fly.dev/)

**Note:** The server runs on a free tier and may take 1-2 minutes to initialize if inactive. Please be patient during startup.

## 🏗️ System Architecture

```mermaid
graph TD
    A[User Input] --> B[FastAPI Server]
    B --> C[XLM-RoBERTa Model]
    C --> D[Token Classification]
    D --> E[Entity Aggregation]
    E --> F[Label Mapping]
    F --> G[JSON Response]
    G --> H[Frontend Visualization]
    
    subgraph "Model Pipeline"
        C --> C1[Tokenization]
        C1 --> C2[BERT Encoding]
        C2 --> C3[Classification Head]
        C3 --> D
    end
    
    subgraph "Entity Categories"
        I[Person] 
        J[Location]
        K[Organization]
        L[Date/Time]
        M[Government]
        N[25 Total Categories]
    end
    
    F --> I
    F --> J
    F --> K
    F --> L
    F --> M
    F --> N
```

## 🤖 Model Training Pipeline

```mermaid
flowchart LR
    A[Azerbaijani NER Dataset] --> B[Data Preprocessing]
    B --> C[Tokenization]
    C --> D[Label Alignment]
    
    subgraph "Model Training"
        E[mBERT] --> F[Fine-tuning]
        G[XLM-RoBERTa] --> F
        H[XLM-RoBERTa Large] --> F
        I[Azeri-Turkish BERT] --> F
        F --> J[Model Evaluation]
    end
    
    D --> E
    D --> G
    D --> H
    D --> I
    
    J --> K[Best Model Selection]
    K --> L[Hugging Face Hub]
    L --> M[Production Deployment]
    
    subgraph "Performance Metrics"
        N[Precision: 76.44%]
        O[Recall: 74.05%]
        P[F1-Score: 75.22%]
    end
    
    J --> N
    J --> O
    J --> P
```

## 🔄 Data Flow Architecture

```mermaid
sequenceDiagram
    participant U as User
    participant F as Frontend
    participant API as FastAPI
    participant M as XLM-RoBERTa
    participant HF as Hugging Face
    
    U->>F: Enter Azerbaijani text
    F->>API: POST /predict/ 
    API->>M: Process text
    M->>M: Tokenize input
    M->>M: Generate predictions
    M->>API: Return entity predictions
    API->>API: Apply label mapping
    API->>API: Group entities by type
    API->>F: JSON response with entities
    F->>U: Display highlighted entities
    
    Note over M,HF: Model loaded from<br/>IsmatS/xlm-roberta-az-ner
```

## Project Structure

```
.
├── Dockerfile                # Docker image configuration
├── README.md                # Project documentation
├── fly.toml                 # Fly.io deployment configuration
├── main.py                  # FastAPI application entry point
├── models/                  # Model-related files
│   ├── NER_from_scratch.ipynb    # Custom NER implementation notebook
│   ├── README.md                 # Models documentation
│   ├── XLM-RoBERTa.ipynb        # XLM-RoBERTa training notebook
│   ├── azeri-turkish-bert-ner.ipynb  # Azeri-Turkish BERT training
│   ├── mBERT.ipynb              # mBERT training notebook
│   ├── push_to_HF.py            # Hugging Face upload script
│   ├── train-00000-of-00001.parquet  # Training data
│   └── xlm_roberta_large.ipynb  # XLM-RoBERTa Large training
├── requirements.txt         # Python dependencies
├── static/                  # Frontend assets
│   ├── app.js               # Frontend logic
│   └── style.css            # UI styling
└── templates/               # HTML templates
    └── index.html           # Main UI template
```

## 🧠 Models & Dataset

### 🏆 Available Models

| Model | Parameters | F1-Score | Hugging Face | Status |
|-------|------------|----------|--------------|---------|
| [mBERT Azerbaijani NER](https://huggingface.co/IsmatS/mbert-az-ner) | 180M | 67.70% | ✅ | Released |
| [XLM-RoBERTa Azerbaijani NER](https://huggingface.co/IsmatS/xlm-roberta-az-ner) | 125M | **75.22%** | ✅ | **Production** |
| [XLM-RoBERTa Large Azerbaijani NER](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner) | 355M | 75.48% | ✅ | Released |
| [Azerbaijani-Turkish BERT Base NER](https://huggingface.co/IsmatS/azeri-turkish-bert-ner) | 110M | 73.55% | ✅ | Released |

### 📊 Supported Entity Types (25 Categories)

| Category | Category | Category |
|----------|----------|----------|
| Person | Government | Law |
| Location | Date | Language |
| Organization | Time | Position |
| Facility | Money | Nationality |
| Product | Percentage | Disease |
| Event | Contact | Quantity |
| Art | Project | Cardinal |
| Proverb | Ordinal | Miscellaneous |
| Other | | |

### 📈 Dataset Information
- **Source:** [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- **Size:** High-quality annotated Azerbaijani text corpus
- **Language:** Azerbaijani (az)
- **Annotation:** IOB2 format with 25 entity categories
- **Training Infrastructure:** A100 GPU on Google Colab Pro+

## 📊 Model Performance Comparison

| Model | F1-Score |
|-------|----------|
| mBERT | 67.70% |
| XLM-RoBERTa Base | 75.22% |
| XLM-RoBERTa Large | **75.48%** |
| Azeri-Turkish-BERT | 73.55% |

## 📈 Detailed Performance Metrics

### mBERT Performance

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|---------|-------|-----------|
| 1 | 0.2952 | 0.2657 | 0.7154 | 0.6229 | 0.6659 | 0.9191 |
| 2 | 0.2486 | 0.2521 | 0.7210 | 0.6380 | 0.6770 | 0.9214 |
| 3 | 0.2068 | 0.2534 | 0.7049 | 0.6507 | 0.6767 | 0.9209 |

### XLM-RoBERTa Base Performance

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.3231 | 0.2755 | 0.7758 | 0.6949 | 0.7331 |
| 3 | 0.2486 | 0.2525 | 0.7515 | 0.7412 | 0.7463 |
| 5 | 0.2238 | 0.2522 | 0.7644 | 0.7405 | 0.7522 |
| 7 | 0.2097 | 0.2507 | 0.7607 | 0.7394 | 0.7499 |

### XLM-RoBERTa Large Performance

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 |
| 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 |
| 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 |
| 9 | 0.1194 | 0.3316 | 0.7393 | 0.7495 | 0.7444 |

### Azeri-Turkish-BERT Performance

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.4331 | 0.3067 | 0.7390 | 0.6933 | 0.7154 |
| 3 | 0.2506 | 0.2751 | 0.7583 | 0.7094 | 0.7330 |
| 6 | 0.1992 | 0.2861 | 0.7551 | 0.7170 | 0.7355 |
| 9 | 0.1717 | 0.3138 | 0.7431 | 0.7255 | 0.7342 |

## ⚡ Key Features

- 🎯 **State-of-the-art Accuracy**: 75.22% F1-score on Azerbaijani NER
- 🌐 **25 Entity Categories**: Comprehensive coverage including Person, Location, Organization, Government, and more
- 🚀 **Production Ready**: Deployed on Fly.io with FastAPI backend
- 🎨 **Interactive UI**: Real-time entity highlighting with confidence scores
- 🔄 **Multiple Models**: Four different transformer models to choose from
- 📊 **Confidence Scoring**: Each prediction includes confidence metrics
- 🌍 **Multilingual Foundation**: Built on XLM-RoBERTa for cross-lingual understanding
- 📱 **Responsive Design**: Works seamlessly across desktop and mobile devices

## 🛠️ Technology Stack

```mermaid
graph LR
    subgraph "Frontend"
        A[HTML5] --> B[CSS3]
        B --> C[JavaScript]
    end
    
    subgraph "Backend"
        D[FastAPI] --> E[Python 3.8+]
        E --> F[Uvicorn]
    end
    
    subgraph "ML Stack"
        G[Transformers] --> H[PyTorch]
        H --> I[Hugging Face]
    end
    
    subgraph "Deployment"
        J[Docker] --> K[Fly.io]
        K --> L[Production]
    end
    
    C --> D
    F --> G
    I --> J
```

## 🚀 Setup Instructions

### Local Development

1. **Clone the repository**
```bash
git clone https://huggingface.co/IsmatS/Named_Entity_Recognition
cd Named_Entity_Recognition
```

2. **Set up Python environment**
```bash
# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Unix/macOS:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

3. **Run the application**
```bash
uvicorn main:app --host 0.0.0.0 --port 8080
```

### Fly.io Deployment

1. **Install Fly CLI**
```bash
# On Unix/macOS
curl -L https://fly.io/install.sh | sh
```

2. **Configure deployment**
```bash
# Login to Fly.io
fly auth login

# Initialize app
fly launch

# Configure memory (minimum 2GB recommended)
fly scale memory 2048
```

3. **Deploy application**
```bash
fly deploy

# Monitor deployment
fly logs
```

## 💡 Usage

### Quick Start
1. **Access the application:**
   - 🏠 Local: http://localhost:8080
   - 🌐 Production: https://named-entity-recognition.fly.dev

2. **Enter Azerbaijani text** in the input field
3. **Click "Submit"** to process and view named entities
4. **View results** with entities highlighted by category and confidence scores

### Example Usage

```python
# Example API request
import requests

response = requests.post(
    "https://named-entity-recognition.fly.dev/predict/",
    data={"text": "2014-cü ildə Azərbaycan Respublikasının prezidenti İlham Əliyev Salyanda olub."}
)

print(response.json())
# Output: {
#   "entities": {
#     "Date": ["2014"],
#     "Government": ["Azərbaycan"],
#     "Organization": ["Respublikasının"],
#     "Position": ["prezidenti"],
#     "Person": ["İlham Əliyev"],
#     "Location": ["Salyanda"]
#   }
# }
```

## 🎯 Model Capabilities

- **Person Names**: İlham Əliyev, Heydər Əliyev, Nizami Gəncəvi
- **Locations**: Bakı, Salyanda, Azərbaycan, Gəncə
- **Organizations**: Respublika, Universitet, Şirkət
- **Dates & Times**: 2014-cü il, sentyabr ayı, səhər saatları
- **Government Entities**: prezident, nazir, məclis
- **And 20+ more categories...**

## 🤝 Contributing

We welcome contributions! Here's how you can help:

1. 🍴 Fork the repository
2. 🌿 Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. 💍 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. 📤 Push to the branch (`git push origin feature/AmazingFeature`)
5. 🔀 Open a Pull Request

### Development Areas
- 🧠 Model improvements and fine-tuning
- 🎨 UI/UX enhancements
- 📊 Performance optimizations
- 🧪 Additional test cases
- 📖 Documentation improvements

## 📄 License

This project is open source and available under the [MIT License](LICENSE).

## 🙏 Acknowledgments

- Hugging Face team for the transformer models and infrastructure
- Google Colab for providing A100 GPU access
- Fly.io for hosting the production deployment
- The Azerbaijani NLP community for dataset contributions



## 🔗 Related Projects

- [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- [mBERT Azerbaijani NER Model](https://huggingface.co/IsmatS/mbert-az-ner)
- [XLM-RoBERTa Azerbaijani NER Model](https://huggingface.co/IsmatS/xlm-roberta-az-ner)