docs: replace unsupported mermaid mindmap/xychart with markdown tables

f366db4 10 days ago

11.7 kB

	---
	language:
	- az
	license: mit
	tags:
	- token-classification
	- ner
	- azerbaijani
	- fastapi
	- transformers
	- xlm-roberta
	pipeline_tag: token-classification
	datasets:
	- LocalDoc/azerbaijani-ner-dataset
	---

	# Named Entity Recognition for Azerbaijani Language

	A state-of-the-art Named Entity Recognition (NER) system specifically designed for the Azerbaijani language, featuring multiple fine-tuned transformer models and a production-ready FastAPI deployment with an intuitive web interface.

	## 🚀 Live Demo

	Try the live demo: [Named Entity Recognition Demo](https://named-entity-recognition.fly.dev/)

	Note: The server runs on a free tier and may take 1-2 minutes to initialize if inactive. Please be patient during startup.

	## 🏗️ System Architecture

	```mermaid
	graph TD
	A[User Input] --> B[FastAPI Server]
	B --> C[XLM-RoBERTa Model]
	C --> D[Token Classification]
	D --> E[Entity Aggregation]
	E --> F[Label Mapping]
	F --> G[JSON Response]
	G --> H[Frontend Visualization]

	subgraph "Model Pipeline"
	C --> C1[Tokenization]
	C1 --> C2[BERT Encoding]
	C2 --> C3[Classification Head]
	C3 --> D
	end

	subgraph "Entity Categories"
	I[Person]
	J[Location]
	K[Organization]
	L[Date/Time]
	M[Government]
	N[25 Total Categories]
	end

	F --> I
	F --> J
	F --> K
	F --> L
	F --> M
	F --> N
	```

	## 🤖 Model Training Pipeline

	```mermaid
	flowchart LR
	A[Azerbaijani NER Dataset] --> B[Data Preprocessing]
	B --> C[Tokenization]
	C --> D[Label Alignment]

	subgraph "Model Training"
	E[mBERT] --> F[Fine-tuning]
	G[XLM-RoBERTa] --> F
	H[XLM-RoBERTa Large] --> F
	I[Azeri-Turkish BERT] --> F
	F --> J[Model Evaluation]
	end

	D --> E
	D --> G
	D --> H
	D --> I

	J --> K[Best Model Selection]
	K --> L[Hugging Face Hub]
	L --> M[Production Deployment]

	subgraph "Performance Metrics"
	N[Precision: 76.44%]
	O[Recall: 74.05%]
	P[F1-Score: 75.22%]
	end

	J --> N
	J --> O
	J --> P
	```

	## 🔄 Data Flow Architecture

	```mermaid
	sequenceDiagram
	participant U as User
	participant F as Frontend
	participant API as FastAPI
	participant M as XLM-RoBERTa
	participant HF as Hugging Face

	U->>F: Enter Azerbaijani text
	F->>API: POST /predict/
	API->>M: Process text
	M->>M: Tokenize input
	M->>M: Generate predictions
	M->>API: Return entity predictions
	API->>API: Apply label mapping
	API->>API: Group entities by type
	API->>F: JSON response with entities
	F->>U: Display highlighted entities

	Note over M,HF: Model loaded from<br/>IsmatS/xlm-roberta-az-ner
	```

	## Project Structure

	```
	.
	├── Dockerfile # Docker image configuration
	├── README.md # Project documentation
	├── fly.toml # Fly.io deployment configuration
	├── main.py # FastAPI application entry point
	├── models/ # Model-related files
	│ ├── NER_from_scratch.ipynb # Custom NER implementation notebook
	│ ├── README.md # Models documentation
	│ ├── XLM-RoBERTa.ipynb # XLM-RoBERTa training notebook
	│ ├── azeri-turkish-bert-ner.ipynb # Azeri-Turkish BERT training
	│ ├── mBERT.ipynb # mBERT training notebook
	│ ├── push_to_HF.py # Hugging Face upload script
	│ ├── train-00000-of-00001.parquet # Training data
	│ └── xlm_roberta_large.ipynb # XLM-RoBERTa Large training
	├── requirements.txt # Python dependencies
	├── static/ # Frontend assets
	│ ├── app.js # Frontend logic
	│ └── style.css # UI styling
	└── templates/ # HTML templates
	└── index.html # Main UI template
	```

	## 🧠 Models & Dataset

	### 🏆 Available Models

	\| Model \| Parameters \| F1-Score \| Hugging Face \| Status \|
	\|-------\|------------\|----------\|--------------\|---------\|
	\| [mBERT Azerbaijani NER](https://huggingface.co/IsmatS/mbert-az-ner) \| 180M \| 67.70% \| ✅ \| Released \|
	\| [XLM-RoBERTa Azerbaijani NER](https://huggingface.co/IsmatS/xlm-roberta-az-ner) \| 125M \| 75.22% \| ✅ \| Production \|
	\| [XLM-RoBERTa Large Azerbaijani NER](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner) \| 355M \| 75.48% \| ✅ \| Released \|
	\| [Azerbaijani-Turkish BERT Base NER](https://huggingface.co/IsmatS/azeri-turkish-bert-ner) \| 110M \| 73.55% \| ✅ \| Released \|

	### 📊 Supported Entity Types (25 Categories)

	\| Category \| Category \| Category \|
	\|----------\|----------\|----------\|
	\| Person \| Government \| Law \|
	\| Location \| Date \| Language \|
	\| Organization \| Time \| Position \|
	\| Facility \| Money \| Nationality \|
	\| Product \| Percentage \| Disease \|
	\| Event \| Contact \| Quantity \|
	\| Art \| Project \| Cardinal \|
	\| Proverb \| Ordinal \| Miscellaneous \|
	\| Other \| \| \|

	### 📈 Dataset Information
	- Source: [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
	- Size: High-quality annotated Azerbaijani text corpus
	- Language: Azerbaijani (az)
	- Annotation: IOB2 format with 25 entity categories
	- Training Infrastructure: A100 GPU on Google Colab Pro+

	## 📊 Model Performance Comparison

	\| Model \| F1-Score \|
	\|-------\|----------\|
	\| mBERT \| 67.70% \|
	\| XLM-RoBERTa Base \| 75.22% \|
	\| XLM-RoBERTa Large \| 75.48% \|
	\| Azeri-Turkish-BERT \| 73.55% \|

	## 📈 Detailed Performance Metrics

	### mBERT Performance

	\| Epoch \| Training Loss \| Validation Loss \| Precision \| Recall \| F1 \| Accuracy \|
	\|-------\|---------------\|-----------------\|-----------\|---------\|-------\|-----------\|
	\| 1 \| 0.2952 \| 0.2657 \| 0.7154 \| 0.6229 \| 0.6659 \| 0.9191 \|
	\| 2 \| 0.2486 \| 0.2521 \| 0.7210 \| 0.6380 \| 0.6770 \| 0.9214 \|
	\| 3 \| 0.2068 \| 0.2534 \| 0.7049 \| 0.6507 \| 0.6767 \| 0.9209 \|

	### XLM-RoBERTa Base Performance

	\| Epoch \| Training Loss \| Validation Loss \| Precision \| Recall \| F1 \|
	\|-------\|---------------\|-----------------\|-----------\|---------\|-------\|
	\| 1 \| 0.3231 \| 0.2755 \| 0.7758 \| 0.6949 \| 0.7331 \|
	\| 3 \| 0.2486 \| 0.2525 \| 0.7515 \| 0.7412 \| 0.7463 \|
	\| 5 \| 0.2238 \| 0.2522 \| 0.7644 \| 0.7405 \| 0.7522 \|
	\| 7 \| 0.2097 \| 0.2507 \| 0.7607 \| 0.7394 \| 0.7499 \|

	### XLM-RoBERTa Large Performance

	\| Epoch \| Training Loss \| Validation Loss \| Precision \| Recall \| F1 \|
	\|-------\|---------------\|-----------------\|-----------\|---------\|-------\|
	\| 1 \| 0.4075 \| 0.2538 \| 0.7689 \| 0.7214 \| 0.7444 \|
	\| 3 \| 0.2144 \| 0.2488 \| 0.7509 \| 0.7489 \| 0.7499 \|
	\| 6 \| 0.1526 \| 0.2881 \| 0.7831 \| 0.7284 \| 0.7548 \|
	\| 9 \| 0.1194 \| 0.3316 \| 0.7393 \| 0.7495 \| 0.7444 \|

	### Azeri-Turkish-BERT Performance

	\| Epoch \| Training Loss \| Validation Loss \| Precision \| Recall \| F1 \|
	\|-------\|---------------\|-----------------\|-----------\|---------\|-------\|
	\| 1 \| 0.4331 \| 0.3067 \| 0.7390 \| 0.6933 \| 0.7154 \|
	\| 3 \| 0.2506 \| 0.2751 \| 0.7583 \| 0.7094 \| 0.7330 \|
	\| 6 \| 0.1992 \| 0.2861 \| 0.7551 \| 0.7170 \| 0.7355 \|
	\| 9 \| 0.1717 \| 0.3138 \| 0.7431 \| 0.7255 \| 0.7342 \|

	## ⚡ Key Features

	- 🎯 State-of-the-art Accuracy: 75.22% F1-score on Azerbaijani NER
	- 🌐 25 Entity Categories: Comprehensive coverage including Person, Location, Organization, Government, and more
	- 🚀 Production Ready: Deployed on Fly.io with FastAPI backend
	- 🎨 Interactive UI: Real-time entity highlighting with confidence scores
	- 🔄 Multiple Models: Four different transformer models to choose from
	- 📊 Confidence Scoring: Each prediction includes confidence metrics
	- 🌍 Multilingual Foundation: Built on XLM-RoBERTa for cross-lingual understanding
	- 📱 Responsive Design: Works seamlessly across desktop and mobile devices

	## 🛠️ Technology Stack

	```mermaid
	graph LR
	subgraph "Frontend"
	A[HTML5] --> B[CSS3]
	B --> C[JavaScript]
	end

	subgraph "Backend"
	D[FastAPI] --> E[Python 3.8+]
	E --> F[Uvicorn]
	end

	subgraph "ML Stack"
	G[Transformers] --> H[PyTorch]
	H --> I[Hugging Face]
	end

	subgraph "Deployment"
	J[Docker] --> K[Fly.io]
	K --> L[Production]
	end

	C --> D
	F --> G
	I --> J
	```

	## 🚀 Setup Instructions

	### Local Development

	1. Clone the repository
	```bash
	git clone https://huggingface.co/IsmatS/Named_Entity_Recognition
	cd Named_Entity_Recognition
	```

	2. Set up Python environment
	```bash
	# Create virtual environment
	python -m venv .venv

	# Activate virtual environment
	# On Unix/macOS:
	source .venv/bin/activate
	# On Windows:
	.venv\Scripts\activate

	# Install dependencies
	pip install -r requirements.txt
	```

	3. Run the application
	```bash
	uvicorn main:app --host 0.0.0.0 --port 8080
	```

	### Fly.io Deployment

	1. Install Fly CLI
	```bash
	# On Unix/macOS
	curl -L https://fly.io/install.sh \| sh
	```

	2. Configure deployment
	```bash
	# Login to Fly.io
	fly auth login

	# Initialize app
	fly launch

	# Configure memory (minimum 2GB recommended)
	fly scale memory 2048
	```

	3. Deploy application
	```bash
	fly deploy

	# Monitor deployment
	fly logs
	```

	## 💡 Usage

	### Quick Start
	1. Access the application:
	- 🏠 Local: http://localhost:8080
	- 🌐 Production: https://named-entity-recognition.fly.dev

	2. Enter Azerbaijani text in the input field
	3. Click "Submit" to process and view named entities
	4. View results with entities highlighted by category and confidence scores

	### Example Usage

	```python
	# Example API request
	import requests

	response = requests.post(
	"https://named-entity-recognition.fly.dev/predict/",
	data={"text": "2014-cü ildə Azərbaycan Respublikasının prezidenti İlham Əliyev Salyanda olub."}
	)

	print(response.json())
	# Output: {
	# "entities": {
	# "Date": ["2014"],
	# "Government": ["Azərbaycan"],
	# "Organization": ["Respublikasının"],
	# "Position": ["prezidenti"],
	# "Person": ["İlham Əliyev"],
	# "Location": ["Salyanda"]
	# }
	# }
	```

	## 🎯 Model Capabilities

	- Person Names: İlham Əliyev, Heydər Əliyev, Nizami Gəncəvi
	- Locations: Bakı, Salyanda, Azərbaycan, Gəncə
	- Organizations: Respublika, Universitet, Şirkət
	- Dates & Times: 2014-cü il, sentyabr ayı, səhər saatları
	- Government Entities: prezident, nazir, məclis
	- And 20+ more categories...

	## 🤝 Contributing

	We welcome contributions! Here's how you can help:

	1. 🍴 Fork the repository
	2. 🌿 Create your feature branch (`git checkout -b feature/AmazingFeature`)
	3. 💍 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
	4. 📤 Push to the branch (`git push origin feature/AmazingFeature`)
	5. 🔀 Open a Pull Request

	### Development Areas
	- 🧠 Model improvements and fine-tuning
	- 🎨 UI/UX enhancements
	- 📊 Performance optimizations
	- 🧪 Additional test cases
	- 📖 Documentation improvements

	## 📄 License

	This project is open source and available under the [MIT License](LICENSE).

	## 🙏 Acknowledgments

	- Hugging Face team for the transformer models and infrastructure
	- Google Colab for providing A100 GPU access
	- Fly.io for hosting the production deployment
	- The Azerbaijani NLP community for dataset contributions



	## 🔗 Related Projects

	- [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
	- [mBERT Azerbaijani NER Model](https://huggingface.co/IsmatS/mbert-az-ner)
	- [XLM-RoBERTa Azerbaijani NER Model](https://huggingface.co/IsmatS/xlm-roberta-az-ner)