---
language:
- az
license: mit
tags:
- token-classification
- ner
- azerbaijani
- fastapi
- transformers
- xlm-roberta
pipeline_tag: token-classification
datasets:
- LocalDoc/azerbaijani-ner-dataset
---
# Named Entity Recognition for Azerbaijani Language
A state-of-the-art Named Entity Recognition (NER) system specifically designed for the Azerbaijani language, featuring multiple fine-tuned transformer models and a production-ready FastAPI deployment with an intuitive web interface.
## 🚀 Live Demo
Try the live demo: [Named Entity Recognition Demo](https://named-entity-recognition.fly.dev/)
**Note:** The server runs on a free tier and may take 1-2 minutes to initialize if inactive. Please be patient during startup.
## 🏗️ System Architecture
```mermaid
graph TD
A[User Input] --> B[FastAPI Server]
B --> C[XLM-RoBERTa Model]
C --> D[Token Classification]
D --> E[Entity Aggregation]
E --> F[Label Mapping]
F --> G[JSON Response]
G --> H[Frontend Visualization]
subgraph "Model Pipeline"
C --> C1[Tokenization]
C1 --> C2[Transformer Encoding]
C2 --> C3[Classification Head]
C3 --> D
end
subgraph "Entity Categories"
I[Person]
J[Location]
K[Organization]
L[Date/Time]
M[Government]
N[25 Total Categories]
end
F --> I
F --> J
F --> K
F --> L
F --> M
F --> N
```
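The "Entity Aggregation" step above can be sketched in plain Python. This is an illustrative helper (`aggregate_entities` is not the app's actual code): it merges consecutive IOB2 token predictions into whole-entity spans.

```python
# Minimal sketch of entity aggregation -- hypothetical helper, not the
# repository's actual implementation.
def aggregate_entities(tokens, tags):
    """Merge consecutive B-/I- IOB2 tags of the same type into spans."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity starts here
            if current:
                entities.append(current)
            current = {"type": tag[2:], "text": token}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"] += " " + token  # continue the open entity
        else:                               # O tag or malformed I-: close out
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

tokens = ["İlham", "Əliyev", "Salyanda", "olub"]
tags = ["B-person", "I-person", "B-location", "O"]
print(aggregate_entities(tokens, tags))
# → [{'type': 'person', 'text': 'İlham Əliyev'}, {'type': 'location', 'text': 'Salyanda'}]
```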
## 🤖 Model Training Pipeline
```mermaid
flowchart LR
A[Azerbaijani NER Dataset] --> B[Data Preprocessing]
B --> C[Tokenization]
C --> D[Label Alignment]
subgraph "Model Training"
E[mBERT] --> F[Fine-tuning]
G[XLM-RoBERTa] --> F
H[XLM-RoBERTa Large] --> F
I[Azeri-Turkish BERT] --> F
F --> J[Model Evaluation]
end
D --> E
D --> G
D --> H
D --> I
J --> K[Best Model Selection]
K --> L[Hugging Face Hub]
L --> M[Production Deployment]
subgraph "Performance Metrics"
N[Precision: 76.44%]
O[Recall: 74.05%]
P[F1-Score: 75.22%]
end
J --> N
J --> O
J --> P
```
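The "Label Alignment" step can be sketched as follows. This is a hypothetical helper based on the common `word_ids()` approach for fast tokenizers, not necessarily the notebooks' exact code: word-level IOB2 labels are mapped onto sub-word tokens, with special tokens and trailing sub-tokens set to `-100` so the loss ignores them.

```python
# Hypothetical sketch of IOB2 label alignment onto sub-word tokens.
def align_labels(word_ids, word_labels, label2id):
    """word_ids: tokenizer.word_ids()-style list, None for special tokens."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special token ([CLS]/[SEP]/padding)
            aligned.append(-100)
        elif word_id != previous:      # first sub-token of a word: keep label
            aligned.append(label2id[word_labels[word_id]])
        else:                          # later sub-tokens: mask out of the loss
            aligned.append(-100)
        previous = word_id
    return aligned

label2id = {"O": 0, "B-person": 1, "I-person": 2}
# Example: two words split into three sub-tokens, plus two special tokens.
print(align_labels([None, 0, 0, 1, None], ["B-person", "I-person"], label2id))
# → [-100, 1, -100, 2, -100]
```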
## 🔄 Data Flow Architecture
```mermaid
sequenceDiagram
participant U as User
participant F as Frontend
participant API as FastAPI
participant M as XLM-RoBERTa
participant HF as Hugging Face
U->>F: Enter Azerbaijani text
F->>API: POST /predict/
API->>M: Process text
M->>M: Tokenize input
M->>M: Generate predictions
M->>API: Return entity predictions
API->>API: Apply label mapping
API->>API: Group entities by type
API->>F: JSON response with entities
F->>U: Display highlighted entities
Note over M,HF: Model loaded from<br/>IsmatS/xlm-roberta-az-ner
```
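The "Group entities by type" step in the sequence above can be sketched as a small helper (`group_by_type` is hypothetical, not the server's actual code); it shapes flat pipeline output into the grouped JSON the frontend renders:

```python
from collections import defaultdict

# Hypothetical sketch of shaping flat predictions into the grouped response.
def group_by_type(predictions):
    """Collect entity words under their entity_group label."""
    grouped = defaultdict(list)
    for prediction in predictions:
        grouped[prediction["entity_group"]].append(prediction["word"])
    return dict(grouped)

preds = [
    {"entity_group": "Person", "word": "İlham Əliyev"},
    {"entity_group": "Location", "word": "Salyanda"},
    {"entity_group": "Person", "word": "Heydər Əliyev"},
]
print(group_by_type(preds))
# → {'Person': ['İlham Əliyev', 'Heydər Əliyev'], 'Location': ['Salyanda']}
```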
## Project Structure
```
.
├── Dockerfile # Docker image configuration
├── README.md # Project documentation
├── fly.toml # Fly.io deployment configuration
├── main.py # FastAPI application entry point
├── models/ # Model-related files
│ ├── NER_from_scratch.ipynb # Custom NER implementation notebook
│ ├── README.md # Models documentation
│ ├── XLM-RoBERTa.ipynb # XLM-RoBERTa training notebook
│ ├── azeri-turkish-bert-ner.ipynb # Azeri-Turkish BERT training
│ ├── mBERT.ipynb # mBERT training notebook
│ ├── push_to_HF.py # Hugging Face upload script
│ ├── train-00000-of-00001.parquet # Training data
│ └── xlm_roberta_large.ipynb # XLM-RoBERTa Large training
├── requirements.txt # Python dependencies
├── static/ # Frontend assets
│ ├── app.js # Frontend logic
│ └── style.css # UI styling
└── templates/ # HTML templates
└── index.html # Main UI template
```
## 🧠 Models & Dataset
### 🏆 Available Models
| Model | Parameters | F1-Score | Hugging Face | Status |
|-------|------------|----------|--------------|---------|
| [mBERT Azerbaijani NER](https://huggingface.co/IsmatS/mbert-az-ner) | ~178M | 67.70% | ✅ | Released |
| [XLM-RoBERTa Azerbaijani NER](https://huggingface.co/IsmatS/xlm-roberta-az-ner) | ~270M | **75.22%** | ✅ | **Production** |
| [XLM-RoBERTa Large Azerbaijani NER](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner) | ~550M | 75.48% | ✅ | Released |
| [Azerbaijani-Turkish BERT Base NER](https://huggingface.co/IsmatS/azeri-turkish-bert-ner) | ~110M | 73.55% | ✅ | Released |
### 📊 Supported Entity Types (25 Categories)
| Category | Category | Category |
|----------|----------|----------|
| Person | Government | Law |
| Location | Date | Language |
| Organization | Time | Position |
| Facility | Money | Nationality |
| Product | Percentage | Disease |
| Event | Contact | Quantity |
| Art | Project | Cardinal |
| Proverb | Ordinal | Miscellaneous |
| Other | | |
### 📈 Dataset Information
- **Source:** [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- **Corpus:** High-quality annotated Azerbaijani text
- **Language:** Azerbaijani (az)
- **Annotation:** IOB2 format with 25 entity categories
- **Training Infrastructure:** A100 GPU on Google Colab Pro+
## 📊 Model Performance Comparison
| Model | F1-Score |
|-------|----------|
| mBERT | 67.70% |
| XLM-RoBERTa Base | 75.22% |
| XLM-RoBERTa Large | **75.48%** |
| Azeri-Turkish-BERT | 73.55% |
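The F1 scores in these tables are the harmonic mean of precision and recall; for example, the production model's 75.22% follows from its reported precision (76.44%) and recall (74.05%):

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# XLM-RoBERTa Base: P = 76.44%, R = 74.05%
print(round(f1(0.7644, 0.7405), 3))
# → 0.752 (the 75.22% reported above)
```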
## 📈 Detailed Performance Metrics
### mBERT Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|---------|-------|-----------|
| 1 | 0.2952 | 0.2657 | 0.7154 | 0.6229 | 0.6659 | 0.9191 |
| 2 | 0.2486 | 0.2521 | 0.7210 | 0.6380 | 0.6770 | 0.9214 |
| 3 | 0.2068 | 0.2534 | 0.7049 | 0.6507 | 0.6767 | 0.9209 |
### XLM-RoBERTa Base Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.3231 | 0.2755 | 0.7758 | 0.6949 | 0.7331 |
| 3 | 0.2486 | 0.2525 | 0.7515 | 0.7412 | 0.7463 |
| 5 | 0.2238 | 0.2522 | 0.7644 | 0.7405 | 0.7522 |
| 7 | 0.2097 | 0.2507 | 0.7607 | 0.7394 | 0.7499 |
### XLM-RoBERTa Large Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 |
| 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 |
| 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 |
| 9 | 0.1194 | 0.3316 | 0.7393 | 0.7495 | 0.7444 |
### Azeri-Turkish-BERT Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.4331 | 0.3067 | 0.7390 | 0.6933 | 0.7154 |
| 3 | 0.2506 | 0.2751 | 0.7583 | 0.7094 | 0.7330 |
| 6 | 0.1992 | 0.2861 | 0.7551 | 0.7170 | 0.7355 |
| 9 | 0.1717 | 0.3138 | 0.7431 | 0.7255 | 0.7342 |
## ⚡ Key Features
- 🎯 **State-of-the-Art Accuracy**: 75.22% F1-score for the production model (75.48% with XLM-RoBERTa Large)
- 🌐 **25 Entity Categories**: Comprehensive coverage including Person, Location, Organization, Government, and more
- 🚀 **Production Ready**: Deployed on Fly.io with FastAPI backend
- 🎨 **Interactive UI**: Real-time entity highlighting with confidence scores
- 🔄 **Multiple Models**: Four different transformer models to choose from
- 📊 **Confidence Scoring**: Each prediction includes confidence metrics
- 🌍 **Multilingual Foundation**: Built on XLM-RoBERTa for cross-lingual understanding
- 📱 **Responsive Design**: Works seamlessly across desktop and mobile devices
## 🛠️ Technology Stack
```mermaid
graph LR
subgraph "Frontend"
A[HTML5] --> B[CSS3]
B --> C[JavaScript]
end
subgraph "Backend"
D[FastAPI] --> E[Python 3.8+]
E --> F[Uvicorn]
end
subgraph "ML Stack"
G[Transformers] --> H[PyTorch]
H --> I[Hugging Face]
end
subgraph "Deployment"
J[Docker] --> K[Fly.io]
K --> L[Production]
end
C --> D
F --> G
I --> J
```
## 🚀 Setup Instructions
### Local Development
1. **Clone the repository**
```bash
git clone https://huggingface.co/IsmatS/Named_Entity_Recognition
cd Named_Entity_Recognition
```
2. **Set up Python environment**
```bash
# Create virtual environment
python -m venv .venv
# Activate virtual environment
# On Unix/macOS:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
3. **Run the application**
```bash
uvicorn main:app --host 0.0.0.0 --port 8080
```
### Fly.io Deployment
1. **Install Fly CLI**
```bash
# On Unix/macOS
curl -L https://fly.io/install.sh | sh
```
2. **Configure deployment**
```bash
# Login to Fly.io
fly auth login
# Initialize app
fly launch
# Configure memory (minimum 2GB recommended)
fly scale memory 2048
```
3. **Deploy application**
```bash
fly deploy
# Monitor deployment
fly logs
```
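For reference, a hypothetical minimal `fly.toml` matching the steps above (the app name, port, and memory values here are assumptions; the file in the repository root is authoritative):

```toml
# Hypothetical fly.toml sketch -- the repository's actual file is authoritative.
app = "named-entity-recognition"

[http_service]
  internal_port = 8080   # matches the uvicorn port used in local development
  force_https = true

[[vm]]
  memory = "2gb"         # loading the model needs headroom; see `fly scale memory 2048`
```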
## 💡 Usage
### Quick Start
1. **Access the application:**
- 🏠 Local: http://localhost:8080
- 🌐 Production: https://named-entity-recognition.fly.dev
2. **Enter Azerbaijani text** in the input field
3. **Click "Submit"** to process and view named entities
4. **View results** with entities highlighted by category and confidence scores
### Example Usage
```python
# Example API request
import requests
response = requests.post(
    "https://named-entity-recognition.fly.dev/predict/",
    data={"text": "2014-cü ildə Azərbaycan Respublikasının prezidenti İlham Əliyev Salyanda olub."},
)
print(response.json())
# Output: {
# "entities": {
# "Date": ["2014"],
# "Government": ["Azərbaycan"],
# "Organization": ["Respublikasının"],
# "Position": ["prezidenti"],
# "Person": ["İlham Əliyev"],
# "Location": ["Salyanda"]
# }
# }
```
## 🎯 Model Capabilities
- **Person Names**: İlham Əliyev, Heydər Əliyev, Nizami Gəncəvi
- **Locations**: Bakı, Salyanda, Azərbaycan, Gəncə
- **Organizations**: Respublika, Universitet, Şirkət
- **Dates & Times**: 2014-cü il, sentyabr ayı, səhər saatları
- **Government Entities**: prezident, nazir, məclis
- **And 20+ more categories...**
## 🤝 Contributing
We welcome contributions! Here's how you can help:
1. 🍴 Fork the repository
2. 🌿 Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. 💍 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. 📤 Push to the branch (`git push origin feature/AmazingFeature`)
5. 🔀 Open a Pull Request
### Development Areas
- 🧠 Model improvements and fine-tuning
- 🎨 UI/UX enhancements
- 📊 Performance optimizations
- 🧪 Additional test cases
- 📖 Documentation improvements
## 📄 License
This project is open source and available under the [MIT License](LICENSE).
## 🙏 Acknowledgments
- Hugging Face team for the transformer models and infrastructure
- Google Colab for providing A100 GPU access
- Fly.io for hosting the production deployment
- The Azerbaijani NLP community for dataset contributions
## 🔗 Related Projects
- [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- [mBERT Azerbaijani NER Model](https://huggingface.co/IsmatS/mbert-az-ner)
- [XLM-RoBERTa Azerbaijani NER Model](https://huggingface.co/IsmatS/xlm-roberta-az-ner)