---
language:
- az
license: mit
tags:
- token-classification
- ner
- azerbaijani
- fastapi
- transformers
- xlm-roberta
pipeline_tag: token-classification
datasets:
- LocalDoc/azerbaijani-ner-dataset
---
# Named Entity Recognition for Azerbaijani Language
A state-of-the-art Named Entity Recognition (NER) system specifically designed for the Azerbaijani language, featuring multiple fine-tuned transformer models and a production-ready FastAPI deployment with an intuitive web interface.
## 🚀 Live Demo
Try the live demo: [Named Entity Recognition Demo](https://named-entity-recognition.fly.dev/)
**Note:** The server runs on a free tier and may take 1-2 minutes to initialize if inactive. Please be patient during startup.
## 🏗️ System Architecture
```mermaid
graph TD
A[User Input] --> B[FastAPI Server]
B --> C[XLM-RoBERTa Model]
C --> D[Token Classification]
D --> E[Entity Aggregation]
E --> F[Label Mapping]
F --> G[JSON Response]
G --> H[Frontend Visualization]
subgraph "Model Pipeline"
C --> C1[Tokenization]
C1 --> C2[BERT Encoding]
C2 --> C3[Classification Head]
C3 --> D
end
subgraph "Entity Categories"
I[Person]
J[Location]
K[Organization]
L[Date/Time]
M[Government]
N[25 Total Categories]
end
F --> I
F --> J
F --> K
F --> L
F --> M
F --> N
```
## 🤖 Model Training Pipeline
```mermaid
flowchart LR
A[Azerbaijani NER Dataset] --> B[Data Preprocessing]
B --> C[Tokenization]
C --> D[Label Alignment]
subgraph "Model Training"
E[mBERT] --> F[Fine-tuning]
G[XLM-RoBERTa] --> F
H[XLM-RoBERTa Large] --> F
I[Azeri-Turkish BERT] --> F
F --> J[Model Evaluation]
end
D --> E
D --> G
D --> H
D --> I
J --> K[Best Model Selection]
K --> L[Hugging Face Hub]
L --> M[Production Deployment]
subgraph "Performance Metrics"
N[Precision: 76.44%]
O[Recall: 74.05%]
P[F1-Score: 75.22%]
end
J --> N
J --> O
J --> P
```
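The "Label Alignment" step above maps word-level IOB2 tags onto the subword tokens a transformer tokenizer produces. A minimal sketch of the usual approach, assuming odd label ids are B- tags and the following even id is the matching I- tag (function and encoding are illustrative, not taken from the training notebooks):

```python
def align_labels_with_tokens(labels, word_ids):
    """Map word-level IOB2 label ids onto subword tokens.

    labels   : one label id per original word (e.g. 0=O, 1=B-PER, 2=I-PER)
    word_ids : for each subword token, the index of the word it came from,
               or None for special tokens ([CLS], [SEP], <s>, ...)
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # ignored by the loss function
        elif word_id != previous_word:
            aligned.append(labels[word_id])  # first subword keeps the word's tag
        else:
            label = labels[word_id]
            # continuation subwords: turn B-X into I-X
            # (assumes B- tags sit at odd ids, I- tags at the next even id)
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous_word = word_id
    return aligned

# "İlham" (B-PER, split into two subwords) + "olub" (O), with special tokens:
print(align_labels_with_tokens([1, 0], [None, 0, 0, 1, None]))
# [-100, 1, 2, 0, -100]
```

The `-100` sentinel is the conventional `ignore_index` of PyTorch's cross-entropy loss, so special tokens never contribute to the training signal.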
## 🔄 Data Flow Architecture
```mermaid
sequenceDiagram
participant U as User
participant F as Frontend
participant API as FastAPI
participant M as XLM-RoBERTa
participant HF as Hugging Face
U->>F: Enter Azerbaijani text
F->>API: POST /predict/
API->>M: Process text
M->>M: Tokenize input
M->>M: Generate predictions
M->>API: Return entity predictions
API->>API: Apply label mapping
API->>API: Group entities by type
API->>F: JSON response with entities
F->>U: Display highlighted entities
Note over M,HF: Model loaded from<br/>IsmatS/xlm-roberta-az-ner
```
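The "Apply label mapping" and "Group entities by type" steps in the sequence above can be sketched as plain functions. The label map below is a small illustrative subset (not the full 25-category mapping), and the prediction shape assumed is that of a Hugging Face token-classification pipeline with aggregation enabled:

```python
# Illustrative subset of the model-tag -> display-name mapping
LABEL_MAP = {"PER": "Person", "LOC": "Location", "ORG": "Organization", "DATE": "Date"}

def group_entities(predictions):
    """Group token-classification output into {category: [entity text, ...]}.

    Each prediction is a dict such as:
    {"entity_group": "PER", "word": "İlham Əliyev", "score": 0.99}
    """
    grouped = {}
    for pred in predictions:
        # Fall back to the raw tag when a label has no display name
        category = LABEL_MAP.get(pred["entity_group"], pred["entity_group"])
        grouped.setdefault(category, []).append(pred["word"])
    return grouped

preds = [
    {"entity_group": "DATE", "word": "2014", "score": 0.98},
    {"entity_group": "PER", "word": "İlham Əliyev", "score": 0.99},
    {"entity_group": "LOC", "word": "Salyanda", "score": 0.95},
]
print(group_entities(preds))
# {'Date': ['2014'], 'Person': ['İlham Əliyev'], 'Location': ['Salyanda']}
```

The FastAPI endpoint then only has to wrap this grouped dictionary in the JSON response shown in the Usage section.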
## Project Structure
```
.
├── Dockerfile # Docker image configuration
├── README.md # Project documentation
├── fly.toml # Fly.io deployment configuration
├── main.py # FastAPI application entry point
├── models/ # Model-related files
│ ├── NER_from_scratch.ipynb # Custom NER implementation notebook
│ ├── README.md # Models documentation
│ ├── XLM-RoBERTa.ipynb # XLM-RoBERTa training notebook
│ ├── azeri-turkish-bert-ner.ipynb # Azeri-Turkish BERT training
│ ├── mBERT.ipynb # mBERT training notebook
│ ├── push_to_HF.py # Hugging Face upload script
│ ├── train-00000-of-00001.parquet # Training data
│ └── xlm_roberta_large.ipynb # XLM-RoBERTa Large training
├── requirements.txt # Python dependencies
├── static/ # Frontend assets
│ ├── app.js # Frontend logic
│ └── style.css # UI styling
└── templates/ # HTML templates
└── index.html # Main UI template
```
## 🧠 Models & Dataset
### 🏆 Available Models
| Model | Parameters | F1-Score | Hugging Face | Status |
|-------|------------|----------|--------------|---------|
| [mBERT Azerbaijani NER](https://huggingface.co/IsmatS/mbert-az-ner) | 180M | 67.70% | ✅ | Released |
| [XLM-RoBERTa Azerbaijani NER](https://huggingface.co/IsmatS/xlm-roberta-az-ner) | 125M | **75.22%** | ✅ | **Production** |
| [XLM-RoBERTa Large Azerbaijani NER](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner) | 355M | 75.48% | ✅ | Released |
| [Azerbaijani-Turkish BERT Base NER](https://huggingface.co/IsmatS/azeri-turkish-bert-ner) | 110M | 73.55% | ✅ | Released |
### 📊 Supported Entity Types (25 Categories)
| Category | Category | Category |
|----------|----------|----------|
| Person | Government | Law |
| Location | Date | Language |
| Organization | Time | Position |
| Facility | Money | Nationality |
| Product | Percentage | Disease |
| Event | Contact | Quantity |
| Art | Project | Cardinal |
| Proverb | Ordinal | Miscellaneous |
| Other | | |
### 📈 Dataset Information
- **Source:** [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- **Content:** High-quality annotated Azerbaijani text corpus
- **Language:** Azerbaijani (az)
- **Annotation:** IOB2 format with 25 entity categories
- **Training Infrastructure:** A100 GPU on Google Colab Pro+
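In IOB2, every token carries a `B-` (begin), `I-` (inside), or `O` (outside) tag, and each entity starts with a `B-` tag. A minimal decoder that recovers entity spans from such a tag sequence (tag names here are illustrative lowercase variants, not necessarily the dataset's exact label strings):

```python
def iob2_to_spans(tokens, tags):
    """Decode parallel token/IOB2-tag lists into (entity_text, category) pairs."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:                       # close any open entity
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)             # continue the open entity
        else:                                        # "O" or inconsistent I- tag
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                               # flush a trailing entity
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["İlham", "Əliyev", "Salyanda", "olub"]
tags = ["B-person", "I-person", "B-location", "O"]
print(iob2_to_spans(tokens, tags))
# [('İlham Əliyev', 'person'), ('Salyanda', 'location')]
```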
## 📊 Model Performance Comparison
| Model | F1-Score |
|-------|----------|
| mBERT | 67.70% |
| XLM-RoBERTa Base | 75.22% |
| XLM-RoBERTa Large | **75.48%** |
| Azeri-Turkish-BERT | 73.55% |
## 📈 Detailed Performance Metrics
### mBERT Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|---------|-------|-----------|
| 1 | 0.2952 | 0.2657 | 0.7154 | 0.6229 | 0.6659 | 0.9191 |
| 2 | 0.2486 | 0.2521 | 0.7210 | 0.6380 | 0.6770 | 0.9214 |
| 3 | 0.2068 | 0.2534 | 0.7049 | 0.6507 | 0.6767 | 0.9209 |
### XLM-RoBERTa Base Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.3231 | 0.2755 | 0.7758 | 0.6949 | 0.7331 |
| 3 | 0.2486 | 0.2525 | 0.7515 | 0.7412 | 0.7463 |
| 5 | 0.2238 | 0.2522 | 0.7644 | 0.7405 | 0.7522 |
| 7 | 0.2097 | 0.2507 | 0.7607 | 0.7394 | 0.7499 |
### XLM-RoBERTa Large Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 |
| 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 |
| 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 |
| 9 | 0.1194 | 0.3316 | 0.7393 | 0.7495 | 0.7444 |
### Azeri-Turkish-BERT Performance
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|---------|-------|
| 1 | 0.4331 | 0.3067 | 0.7390 | 0.6933 | 0.7154 |
| 3 | 0.2506 | 0.2751 | 0.7583 | 0.7094 | 0.7330 |
| 6 | 0.1992 | 0.2861 | 0.7551 | 0.7170 | 0.7355 |
| 9 | 0.1717 | 0.3138 | 0.7431 | 0.7255 | 0.7342 |
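The precision, recall, and F1 figures above are entity-level (seqeval-style) scores: a predicted entity counts as a true positive only when both its span and its type match the gold annotation exactly. A simplified version of that computation over already-decoded spans (a sketch, not the evaluation code from the notebooks):

```python
def entity_f1(gold_spans, pred_spans):
    """Entity-level precision, recall and F1 over (start, end, type) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # exact span-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One correct entity, one with the right span but the wrong type:
gold = [(0, 2, "person"), (3, 4, "location")]
pred = [(0, 2, "person"), (3, 4, "organization")]
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

This strictness explains why entity-level F1 (around 75%) sits well below token-level accuracy (around 92% in the mBERT table): a single wrong boundary token invalidates the whole entity.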
## ⚡ Key Features
- 🎯 **State-of-the-art Accuracy**: 75.22% F1-score with the production model (75.48% with XLM-RoBERTa Large)
- 🌐 **25 Entity Categories**: Comprehensive coverage including Person, Location, Organization, Government, and more
- 🚀 **Production Ready**: Deployed on Fly.io with FastAPI backend
- 🎨 **Interactive UI**: Real-time entity highlighting with confidence scores
- 🔄 **Multiple Models**: Four different transformer models to choose from
- 📊 **Confidence Scoring**: Each prediction includes confidence metrics
- 🌍 **Multilingual Foundation**: Built on XLM-RoBERTa for cross-lingual understanding
- 📱 **Responsive Design**: Works seamlessly across desktop and mobile devices
## 🛠️ Technology Stack
```mermaid
graph LR
subgraph "Frontend"
A[HTML5] --> B[CSS3]
B --> C[JavaScript]
end
subgraph "Backend"
D[FastAPI] --> E[Python 3.8+]
E --> F[Uvicorn]
end
subgraph "ML Stack"
G[Transformers] --> H[PyTorch]
H --> I[Hugging Face]
end
subgraph "Deployment"
J[Docker] --> K[Fly.io]
K --> L[Production]
end
C --> D
F --> G
I --> J
```
## 🚀 Setup Instructions
### Local Development
1. **Clone the repository**
```bash
git clone https://huggingface.co/IsmatS/Named_Entity_Recognition
cd Named_Entity_Recognition
```
2. **Set up Python environment**
```bash
# Create virtual environment
python -m venv .venv
# Activate virtual environment
# On Unix/macOS:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
3. **Run the application**
```bash
uvicorn main:app --host 0.0.0.0 --port 8080
```
### Fly.io Deployment
1. **Install Fly CLI**
```bash
# On Unix/macOS
curl -L https://fly.io/install.sh | sh
```
2. **Configure deployment**
```bash
# Login to Fly.io
fly auth login
# Initialize app
fly launch
# Configure memory (minimum 2GB recommended)
fly scale memory 2048
```
3. **Deploy application**
```bash
fly deploy
# Monitor deployment
fly logs
```
## 💡 Usage
### Quick Start
1. **Access the application:**
- 🏠 Local: http://localhost:8080
- 🌐 Production: https://named-entity-recognition.fly.dev
2. **Enter Azerbaijani text** in the input field
3. **Click "Submit"** to process and view named entities
4. **View results** with entities highlighted by category and confidence scores
### Example Usage
```python
# Example API request
import requests

response = requests.post(
    "https://named-entity-recognition.fly.dev/predict/",
    data={"text": "2014-cü ildə Azərbaycan Respublikasının prezidenti İlham Əliyev Salyanda olub."}
)
print(response.json())
# Output: {
# "entities": {
# "Date": ["2014"],
# "Government": ["Azərbaycan"],
# "Organization": ["Respublikasının"],
# "Position": ["prezidenti"],
# "Person": ["İlham Əliyev"],
# "Location": ["Salyanda"]
# }
# }
```
## 🎯 Model Capabilities
- **Person Names**: İlham Əliyev, Heydər Əliyev, Nizami Gəncəvi
- **Locations**: Bakı, Salyanda, Azərbaycan, Gəncə
- **Organizations**: Respublika, Universitet, Şirkət
- **Dates & Times**: 2014-cü il, sentyabr ayı, səhər saatları
- **Government Entities**: prezident, nazir, məclis
- **And 20+ more categories...**
## 🤝 Contributing
We welcome contributions! Here's how you can help:
1. 🍴 Fork the repository
2. 🌿 Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. 💍 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. 📤 Push to the branch (`git push origin feature/AmazingFeature`)
5. 🔀 Open a Pull Request
### Development Areas
- 🧠 Model improvements and fine-tuning
- 🎨 UI/UX enhancements
- 📊 Performance optimizations
- 🧪 Additional test cases
- 📖 Documentation improvements
## 📄 License
This project is open source and available under the [MIT License](LICENSE).
## 🙏 Acknowledgments
- Hugging Face team for the transformer models and infrastructure
- Google Colab for providing A100 GPU access
- Fly.io for hosting the production deployment
- The Azerbaijani NLP community for dataset contributions
## 🔗 Related Projects
- [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- [mBERT Azerbaijani NER Model](https://huggingface.co/IsmatS/mbert-az-ner)
- [XLM-RoBERTa Azerbaijani NER Model](https://huggingface.co/IsmatS/xlm-roberta-az-ner) |