--- language: - az license: mit tags: - token-classification - ner - azerbaijani - fastapi - transformers - xlm-roberta pipeline_tag: token-classification datasets: - LocalDoc/azerbaijani-ner-dataset --- # Named Entity Recognition for Azerbaijani Language A state-of-the-art Named Entity Recognition (NER) system specifically designed for the Azerbaijani language, featuring multiple fine-tuned transformer models and a production-ready FastAPI deployment with an intuitive web interface. ## 🚀 Live Demo Try the live demo: [Named Entity Recognition Demo](https://named-entity-recognition.fly.dev/) **Note:** The server runs on a free tier and may take 1-2 minutes to initialize if inactive. Please be patient during startup. ## 🏗️ System Architecture ```mermaid graph TD A[User Input] --> B[FastAPI Server] B --> C[XLM-RoBERTa Model] C --> D[Token Classification] D --> E[Entity Aggregation] E --> F[Label Mapping] F --> G[JSON Response] G --> H[Frontend Visualization] subgraph "Model Pipeline" C --> C1[Tokenization] C1 --> C2[BERT Encoding] C2 --> C3[Classification Head] C3 --> D end subgraph "Entity Categories" I[Person] J[Location] K[Organization] L[Date/Time] M[Government] N[25 Total Categories] end F --> I F --> J F --> K F --> L F --> M F --> N ``` ## 🤖 Model Training Pipeline ```mermaid flowchart LR A[Azerbaijani NER Dataset] --> B[Data Preprocessing] B --> C[Tokenization] C --> D[Label Alignment] subgraph "Model Training" E[mBERT] --> F[Fine-tuning] G[XLM-RoBERTa] --> F H[XLM-RoBERTa Large] --> F I[Azeri-Turkish BERT] --> F F --> J[Model Evaluation] end D --> E D --> G D --> H D --> I J --> K[Best Model Selection] K --> L[Hugging Face Hub] L --> M[Production Deployment] subgraph "Performance Metrics" N[Precision: 76.44%] O[Recall: 74.05%] P[F1-Score: 75.22%] end J --> N J --> O J --> P ``` ## 🔄 Data Flow Architecture ```mermaid sequenceDiagram participant U as User participant F as Frontend participant API as FastAPI participant M as XLM-RoBERTa participant HF as Hugging Face U->>F: Enter Azerbaijani text F->>API: POST /predict/ API->>M: Process text M->>M: Tokenize input M->>M: Generate predictions M->>API: Return entity predictions API->>API: Apply label mapping API->>API: Group entities by type API->>F: JSON response with entities F->>U: Display highlighted entities Note over M,HF: Model loaded from
IsmatS/xlm-roberta-az-ner ``` ## Project Structure ``` . ├── Dockerfile # Docker image configuration ├── README.md # Project documentation ├── fly.toml # Fly.io deployment configuration ├── main.py # FastAPI application entry point ├── models/ # Model-related files │ ├── NER_from_scratch.ipynb # Custom NER implementation notebook │ ├── README.md # Models documentation │ ├── XLM-RoBERTa.ipynb # XLM-RoBERTa training notebook │ ├── azeri-turkish-bert-ner.ipynb # Azeri-Turkish BERT training │ ├── mBERT.ipynb # mBERT training notebook │ ├── push_to_HF.py # Hugging Face upload script │ ├── train-00000-of-00001.parquet # Training data │ └── xlm_roberta_large.ipynb # XLM-RoBERTa Large training ├── requirements.txt # Python dependencies ├── static/ # Frontend assets │ ├── app.js # Frontend logic │ └── style.css # UI styling └── templates/ # HTML templates └── index.html # Main UI template ``` ## 🧠 Models & Dataset ### 🏆 Available Models | Model | Parameters | F1-Score | Hugging Face | Status | |-------|------------|----------|--------------|---------| | [mBERT Azerbaijani NER](https://huggingface.co/IsmatS/mbert-az-ner) | 180M | 67.70% | ✅ | Released | | [XLM-RoBERTa Azerbaijani NER](https://huggingface.co/IsmatS/xlm-roberta-az-ner) | 125M | **75.22%** | ✅ | **Production** | | [XLM-RoBERTa Large Azerbaijani NER](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner) | 355M | 75.48% | ✅ | Released | | [Azerbaijani-Turkish BERT Base NER](https://huggingface.co/IsmatS/azeri-turkish-bert-ner) | 110M | 73.55% | ✅ | Released | ### 📊 Supported Entity Types (25 Categories) | Category | Category | Category | |----------|----------|----------| | Person | Government | Law | | Location | Date | Language | | Organization | Time | Position | | Facility | Money | Nationality | | Product | Percentage | Disease | | Event | Contact | Quantity | | Art | Project | Cardinal | | Proverb | Ordinal | Miscellaneous | | Other | | | ### 📈 Dataset Information - **Source:** [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset) - **Size:** High-quality annotated Azerbaijani text corpus - **Language:** Azerbaijani (az) - **Annotation:** IOB2 format with 25 entity categories - **Training Infrastructure:** A100 GPU on Google Colab Pro+ ## 📊 Model Performance Comparison | Model | F1-Score | |-------|----------| | mBERT | 67.70% | | XLM-RoBERTa Base | 75.22% | | XLM-RoBERTa Large | **75.48%** | | Azeri-Turkish-BERT | 73.55% | ## 📈 Detailed Performance Metrics ### mBERT Performance | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy | |-------|---------------|-----------------|-----------|---------|-------|-----------| | 1 | 0.2952 | 0.2657 | 0.7154 | 0.6229 | 0.6659 | 0.9191 | | 2 | 0.2486 | 0.2521 | 0.7210 | 0.6380 | 0.6770 | 0.9214 | | 3 | 0.2068 | 0.2534 | 0.7049 | 0.6507 | 0.6767 | 0.9209 | ### XLM-RoBERTa Base Performance | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | |-------|---------------|-----------------|-----------|---------|-------| | 1 | 0.3231 | 0.2755 | 0.7758 | 0.6949 | 0.7331 | | 3 | 0.2486 | 0.2525 | 0.7515 | 0.7412 | 0.7463 | | 5 | 0.2238 | 0.2522 | 0.7644 | 0.7405 | 0.7522 | | 7 | 0.2097 | 0.2507 | 0.7607 | 0.7394 | 0.7499 | ### XLM-RoBERTa Large Performance | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | |-------|---------------|-----------------|-----------|---------|-------| | 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 | | 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 | | 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 | | 9 | 0.1194 | 0.3316 | 0.7393 | 0.7495 | 0.7444 | ### Azeri-Turkish-BERT Performance | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | |-------|---------------|-----------------|-----------|---------|-------| | 1 | 0.4331 | 0.3067 | 0.7390 | 0.6933 | 0.7154 | | 3 | 0.2506 | 0.2751 | 0.7583 | 0.7094 | 0.7330 | | 6 | 0.1992 | 0.2861 | 0.7551 | 0.7170 | 0.7355 | | 9 | 0.1717 | 0.3138 | 0.7431 | 0.7255 | 0.7342 | ## ⚡ Key Features - 🎯 **State-of-the-art Accuracy**: 75.22% F1-score on Azerbaijani NER - 🌐 **25 Entity Categories**: Comprehensive coverage including Person, Location, Organization, Government, and more - 🚀 **Production Ready**: Deployed on Fly.io with FastAPI backend - 🎨 **Interactive UI**: Real-time entity highlighting with confidence scores - 🔄 **Multiple Models**: Four different transformer models to choose from - 📊 **Confidence Scoring**: Each prediction includes confidence metrics - 🌍 **Multilingual Foundation**: Built on XLM-RoBERTa for cross-lingual understanding - 📱 **Responsive Design**: Works seamlessly across desktop and mobile devices ## 🛠️ Technology Stack ```mermaid graph LR subgraph "Frontend" A[HTML5] --> B[CSS3] B --> C[JavaScript] end subgraph "Backend" D[FastAPI] --> E[Python 3.8+] E --> F[Uvicorn] end subgraph "ML Stack" G[Transformers] --> H[PyTorch] H --> I[Hugging Face] end subgraph "Deployment" J[Docker] --> K[Fly.io] K --> L[Production] end C --> D F --> G I --> J ``` ## 🚀 Setup Instructions ### Local Development 1. **Clone the repository** ```bash git clone https://huggingface.co/IsmatS/Named_Entity_Recognition cd Named_Entity_Recognition ``` 2. **Set up Python environment** ```bash # Create virtual environment python -m venv .venv # Activate virtual environment # On Unix/macOS: source .venv/bin/activate # On Windows: .venv\Scripts\activate # Install dependencies pip install -r requirements.txt ``` 3. **Run the application** ```bash uvicorn main:app --host 0.0.0.0 --port 8080 ``` ### Fly.io Deployment 1. **Install Fly CLI** ```bash # On Unix/macOS curl -L https://fly.io/install.sh | sh ``` 2. **Configure deployment** ```bash # Login to Fly.io fly auth login # Initialize app fly launch # Configure memory (minimum 2GB recommended) fly scale memory 2048 ``` 3. **Deploy application** ```bash fly deploy # Monitor deployment fly logs ``` ## 💡 Usage ### Quick Start 1. **Access the application:** - 🏠 Local: http://localhost:8080 - 🌐 Production: https://named-entity-recognition.fly.dev 2. **Enter Azerbaijani text** in the input field 3. **Click "Submit"** to process and view named entities 4. **View results** with entities highlighted by category and confidence scores ### Example Usage ```python # Example API request import requests response = requests.post( "https://named-entity-recognition.fly.dev/predict/", data={"text": "2014-cü ildə Azərbaycan Respublikasının prezidenti İlham Əliyev Salyanda olub."} ) print(response.json()) # Output: { # "entities": { # "Date": ["2014"], # "Government": ["Azərbaycan"], # "Organization": ["Respublikasının"], # "Position": ["prezidenti"], # "Person": ["İlham Əliyev"], # "Location": ["Salyanda"] # } # } ``` ## 🎯 Model Capabilities - **Person Names**: İlham Əliyev, Heydər Əliyev, Nizami Gəncəvi - **Locations**: Bakı, Salyanda, Azərbaycan, Gəncə - **Organizations**: Respublika, Universitet, Şirkət - **Dates & Times**: 2014-cü il, sentyabr ayı, səhər saatları - **Government Entities**: prezident, nazir, məclis - **And 20+ more categories...** ## 🤝 Contributing We welcome contributions! Here's how you can help: 1. 🍴 Fork the repository 2. 🌿 Create your feature branch (`git checkout -b feature/AmazingFeature`) 3. 💍 Commit your changes (`git commit -m 'Add some AmazingFeature'`) 4. 📤 Push to the branch (`git push origin feature/AmazingFeature`) 5. 🔀 Open a Pull Request ### Development Areas - 🧠 Model improvements and fine-tuning - 🎨 UI/UX enhancements - 📊 Performance optimizations - 🧪 Additional test cases - 📖 Documentation improvements ## 📄 License This project is open source and available under the [MIT License](LICENSE). ## 🙏 Acknowledgments - Hugging Face team for the transformer models and infrastructure - Google Colab for providing A100 GPU access - Fly.io for hosting the production deployment - The Azerbaijani NLP community for dataset contributions ## 🔗 Related Projects - [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset) - [mBERT Azerbaijani NER Model](https://huggingface.co/IsmatS/mbert-az-ner) - [XLM-RoBERTa Azerbaijani NER Model](https://huggingface.co/IsmatS/xlm-roberta-az-ner)