title: Vietnamese NER Demo
emoji: 🧠
colorFrom: indigo
colorTo: yellow
sdk: streamlit
sdk_version: 1.46.1
app_file: src/app.py
pinned: false
Vietnamese Named Entity Recognition (NER) 🧠
A Vietnamese Named Entity Recognition system built on PhoBERT embeddings, Conditional Random Fields (CRF), and ensemble methods.
Live Demo
Try the interactive demo: Vietnamese NER Demo
Project Workflow
🎯 Overview
This project implements a Vietnamese Named Entity Recognition system that identifies and classifies entities in Vietnamese text. It combines several approaches:
- PhoBERT-based embeddings for contextual understanding
- Conditional Random Fields (CRF) for sequence labeling
- Random Forest with semantic embeddings
- Rule-based methods for enhanced accuracy
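To make the task concrete, here is a minimal sketch of how per-token predictions can be grouped into entities. It assumes the common BIO tagging scheme and uses `PER` and `LOC` as illustrative label names; the function name `bio_to_entities` is hypothetical and is not taken from this repository.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        # An I- tag of the same type extends the entity in progress.
        if tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
            continue
        # Any other tag closes the entity in progress, if there is one.
        if current:
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        # A B- tag opens a new entity; O and stray I- tags are skipped.
        if tag.startswith("B-"):
            current, current_type = [token], tag[2:]
    if current:
        entities.append((" ".join(current), current_type))
    return entities


# Illustrative sentence and tags (assumed, not taken from the dataset).
tokens = ["Ông", "Nguyễn", "Văn", "A", "sống", "ở", "Hà", "Nội"]
tags = ["O", "B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_entities(tokens, tags))
# [('Nguyễn Văn A', 'PER'), ('Hà Nội', 'LOC')]
```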
Project Structure
VIETNAMESE_NER/
│
├── .github/workflows
│   └── main.yml                  # Auto-deploy to Hugging Face Spaces
│
├── data/                         # Dataset files
│   └── raw_data.csv              # Raw training data
│
├── notebooks/                    # Jupyter notebooks for experimentation
│   ├── Duc_Notebook.ipynb        # CRF + RandomForest experiments
│   ├── Softmax_PhoBERT.ipynb     # Softmax approach
│   ├── Kien_Rule_base.ipynb      # Rule-based method with RF
│   └── Kien_RF_lightgbm.ipynb    # RF with semantic embeddings
│
├── src/                          # Main source code
│   ├── __init__.py
│   ├── app.py                    # Streamlit web application
│   ├── front.py                  # Highlight function
│   ├── config.py                 # Project configuration
│   ├── data_loader.py            # Data loading utilities
│   ├── preprocessing.py          # Data preprocessing functions
│   ├── model.py                  # Model architecture definitions
│   ├── train.py                  # Training pipeline
│   ├── evaluate.py               # Model evaluation
│   └── predict.py                # Inference utilities
│
├── models/                       # Saved model artifacts
│   └── best_model.pt             # Best trained model weights
│
├── outputs/                      # Training outputs
│   ├── output.log                # Training logs (TensorBoard)
│   └── figures/                  # Visualization plots
│
├── tests/                        # Unit tests (planned)
│
├── requirements.txt              # Python dependencies
├── environment.yml               # Conda environment file
├── README.md                     # Project documentation
└── run.py                        # Main training script
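The structure lists `src/front.py` as the home of a highlight function. The sketch below shows one way such a function could turn tagged tokens into HTML for the demo. It is an assumption about the general idea, not the repository's actual code: the name `highlight`, the `COLORS` mapping, and the label names are all hypothetical.

```python
from html import escape

# Hypothetical background colour per entity type.
COLORS = {"PER": "#ffd6d6", "LOC": "#d6f5d6", "ORG": "#d6e4ff"}


def highlight(tokens, tags):
    """Return an HTML string with every non-O token wrapped in a coloured <mark>."""
    parts = []
    for token, tag in zip(tokens, tags):
        text = escape(token)  # never pass raw user text into HTML
        if tag == "O":
            parts.append(text)
            continue
        label = tag.split("-", 1)[-1]  # "B-LOC" -> "LOC"
        color = COLORS.get(label, "#eeeeee")
        parts.append(
            f'<mark style="background:{color}">{text} '
            f"<small>{escape(label)}</small></mark>"
        )
    return " ".join(parts)


html_out = highlight(["Tôi", "ở", "Hà", "Nội"], ["O", "O", "B-LOC", "I-LOC"])
print(html_out)
```

In a Streamlit app, a string like this can be displayed with `st.markdown(html_out, unsafe_allow_html=True)`.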
🏗️ Model Architecture
The system uses a hybrid architecture combining the strengths of different approaches:
Core Components:
- PhoBERT-Base: Generates contextual embeddings for Vietnamese text
- Linear + CRF Layer: Handles sequence labeling with context awareness
- Softmax/Random Forest: Alternative classifiers that predict the label of each token independently
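The practical difference between a CRF layer and per-token prediction shows up at decoding time. The sketch below is a plain-Python Viterbi decoder over emission scores (one row per token, as a linear layer would produce) and transition scores (as a CRF would learn). The three-label set and all numbers are made up for illustration; they are not values from the trained model.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    num_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_labels):
            # Best previous label for ending at label j.
            best_prev = max(range(num_labels),
                            key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final label.
    best = max(range(num_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


labels = ["O", "B-PER", "I-PER"]
emissions = [[2.0, 1.5, 0.0],   # token 1: "O" scores highest on its own
             [0.5, 0.0, 2.0]]   # token 2: "I-PER" scores highest on its own
transitions = [[0.0, 0.0, -1e4],  # O -> I-PER is heavily penalised
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

# Per-token argmax, as a softmax classifier would decode.
greedy = [max(range(len(labels)), key=lambda j: row[j]) for row in emissions]
print([labels[i] for i in greedy])                                   # ['O', 'I-PER']
print([labels[i] for i in viterbi_decode(emissions, transitions)])   # ['B-PER', 'I-PER']
```

Per-token argmax yields the invalid sequence `O, I-PER`, while Viterbi decoding with the transition penalty recovers `B-PER, I-PER`.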
Dataset & Performance
Dataset: VLSP2016
The model is trained on the VLSP2016 dataset extracted from Vietnamese news articles.
Dataset Statistics:
Model Performance:
🛠️ Installation & Setup
Prerequisites
- Python 3.10+
- Conda (recommended)
Option 1: Using requirements.txt
# Create and activate conda environment
conda create --name vnner python=3.10
conda activate vnner
# Install dependencies
pip install -r requirements.txt
Option 2: Using environment.yml
# Create environment from yml file
conda env create -f environment.yml
conda activate vnner
Quick Start
Training the Model
python run.py
Running the Streamlit App
streamlit run src/app.py
🧪 Experimental Approaches
The project explores multiple methodologies:
- PhoBERT + CRF: Sequential labeling with contextual embeddings
- PhoBERT + Softmax: Direct classification approach
- Random Forest + Rule-based: Traditional ML with linguistic rules
- Random Forest + Semantic Embeddings: Enhanced feature engineering
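For the Random Forest variants, each token has to be turned into a feature vector before classification. The sketch below shows the kind of hand-crafted, rule-style features that are often used for this. The feature set and the name `token_features` are assumptions for illustration; the notebooks may use different features.

```python
def token_features(tokens, index):
    """Build a small dictionary of surface features for one token."""
    token = tokens[index]
    prev_token = tokens[index - 1] if index > 0 else ""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_first": index == 0,
        "prev_is_capitalized": prev_token[:1].isupper(),
        "length": len(token),
    }


sentence = ["Ông", "Nguyễn", "đến", "năm", "2016"]
for i, tok in enumerate(sentence):
    print(tok, token_features(sentence, i))
```

Dictionaries like these can be converted to numeric vectors with scikit-learn's `DictVectorizer` and passed to a `RandomForestClassifier`, optionally concatenated with embedding vectors.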
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is open source. Please check the repository for license details.
Acknowledgments
- VLSP2016 dataset providers
- PhoBERT model creators
- Hugging Face for hosting the demo
Additional Resources
For better understanding of the project structure and technologies used:
Happy NER-ing! 🎯