title: Vietnamese NER Demo
emoji: 🧠
colorFrom: indigo
colorTo: yellow
sdk: streamlit
sdk_version: 1.46.1
app_file: src/app.py
pinned: false

Vietnamese Named Entity Recognition (NER) 🧠

A Vietnamese Named Entity Recognition system built on PhoBERT contextual embeddings, with CRF sequence labeling, Random Forest classifiers, and rule-based methods explored as alternative or complementary approaches.

🚀 Live Demo

Try the interactive demo: Vietnamese NER Demo

🔄 Project Workflow

🎯 Overview

This project implements a Vietnamese Named Entity Recognition system that identifies and classifies entities in Vietnamese text. It combines several approaches:

PhoBERT-based embeddings for contextual understanding
Conditional Random Fields (CRF) for sequence labeling
Random Forest with semantic embeddings
Rule-based methods for enhanced accuracy
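To make "identify and classify entities" concrete, the sketch below groups per-token labels into entity spans. It assumes the common BIO tagging scheme, the entity types `PER` and `LOC`, and underscore-joined word segmentation; none of these is confirmed by this README, and `bio_to_spans` is a hypothetical helper, not a function from `src/`.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def close_entity() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_entity()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (such a stray token is dropped in this sketch).
            close_entity()
            current_tokens, current_type = [], None
    close_entity()
    return spans


# Hypothetical example sentence with assumed labels.
tokens = ["Ông", "Nguyễn_Văn_A", "làm_việc", "tại", "Hà_Nội"]
tags = ["O", "B-PER", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('Nguyễn_Văn_A', 'PER'), ('Hà_Nội', 'LOC')]
```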

📂 Project Structure

VIETNAMESE_NER/
│
├── .github/workflows              
│   └── main.yml                   # Auto-deploy to Hugging Face Spaces
│
├── data/                          # Dataset files
│   └── raw_data.csv               # Raw training data
│
├── notebooks/                     # Jupyter notebooks for experimentation
│   ├── Duc_Notebook.ipynb         # CRF + RandomForest experiments
│   ├── Softmax_PhoBERT.ipynb      # Softmax approach
│   ├── Kien_Rule_base.ipynb       # Rule-based method with RF
│   └── Kien_RF_lightgbm.ipynb     # RF with semantic embeddings
│
├── src/                           # Main source code
│   ├── __init__.py
│   ├── app.py                     # Streamlit web application
│   ├── front.py                   # Highlight function
│   ├── config.py                  # Project configuration
│   ├── data_loader.py             # Data loading utilities
│   ├── preprocessing.py           # Data preprocessing functions
│   ├── model.py                   # Model architecture definitions
│   ├── train.py                   # Training pipeline
│   ├── evaluate.py                # Model evaluation
│   └── predict.py                 # Inference utilities
│
├── models/                        # Saved model artifacts
│   └── best_model.pt              # Best trained model weights
│
├── outputs/                       # Training outputs
│   ├── output.log                 # Training logs (TensorBoard)
│   └── figures/                   # Visualization plots
│
├── tests/                         # Unit tests (planned)
│
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment file
├── README.md                      # Project documentation
└── run.py                         # Main training script

🏗️ Model Architecture

The system uses a hybrid architecture combining the strengths of different approaches:

Core Components:

PhoBERT-Base: Generates contextual embeddings for Vietnamese text
Linear + CRF Layer: Handles sequence labeling with context awareness
Softmax/Random Forest: Alternative classifiers that predict each token's label independently
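The difference between a CRF layer and independent per-token prediction can be illustrated with a small Viterbi decoder. This is a pure-Python sketch, not the code in `src/model.py`: the emission scores stand in for what a PhoBERT + linear layer might produce, and the transition scores, label set, and function name `viterbi_decode` are made up for illustration.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions: one dict per token, mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score (default 0.0).
    """
    # best_score[label] = score of the best path ending in `label` so far.
    best_score = {lab: emissions[0][lab] for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in labels:
            prev_best = max(
                labels,
                key=lambda prev: best_score[prev] + transitions.get((prev, cur), 0.0),
            )
            new_score[cur] = (
                best_score[prev_best]
                + transitions.get((prev_best, cur), 0.0)
                + emit[cur]
            )
            pointers[cur] = prev_best
        best_score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best_score[lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-PER", "I-PER"]
# Toy emission scores for three tokens (assumed values).
emissions = [
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.5},
    {"O": 1.0, "B-PER": 0.2, "I-PER": 0.9},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.3},
]
# Toy transition scores: reward B-PER -> I-PER, penalize O -> I-PER.
transitions = {("B-PER", "I-PER"): 1.5, ("O", "I-PER"): -5.0}

# Independent (softmax-style) prediction picks the best label per token.
greedy = [max(labels, key=lambda lab: emit[lab]) for emit in emissions]
print(greedy)                                          # ['B-PER', 'O', 'O']
print(viterbi_decode(emissions, transitions, labels))  # ['B-PER', 'I-PER', 'O']
```

With these toy numbers the second token is labeled `O` when predicted in isolation, while the transition scores lead the decoder to continue the entity with `I-PER`.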

📊 Dataset & Performance

Dataset: VLSP2016

The model is trained on the VLSP2016 dataset, which was extracted from Vietnamese news articles.

Dataset Statistics:

Model Performance:

🛠️ Installation & Setup

Prerequisites

Python 3.10+
Conda (recommended)

Option 1: Using `requirements.txt`

# Create and activate conda environment
conda create --name vnner python=3.10
conda activate vnner

# Install dependencies
pip install -r requirements.txt

Option 2: Using `environment.yml`

# Create environment from yml file
conda env create -f environment.yml
conda activate vnner

🚀 Quick Start

Training the Model

python run.py

Running the Streamlit App

streamlit run src/app.py
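The project structure lists a highlight function in `src/front.py`, but its code is not shown here. The following is a hypothetical sketch of how such a helper might turn tagged tokens into HTML; the color palette, the entity types, and the name `highlight` are assumptions.

```python
import html

# Assumed palette, one background color per entity type.
COLORS = {"PER": "#ffd6d6", "LOC": "#d6f5d6", "ORG": "#d6e4ff", "MISC": "#fff2cc"}


def highlight(tokens, tags):
    """Wrap entity tokens in <mark> elements and return one HTML string."""
    parts = []
    for token, tag in zip(tokens, tags):
        # Undo underscore-joined word segmentation for display, then escape.
        text = html.escape(token.replace("_", " "))
        if tag == "O":
            parts.append(text)
        else:
            entity_type = tag.split("-", 1)[-1]  # "B-LOC" -> "LOC"
            color = COLORS.get(entity_type, "#eeeeee")
            parts.append(
                f'<mark style="background:{color}" '
                f'title="{html.escape(entity_type)}">{text}</mark>'
            )
    return " ".join(parts)


print(highlight(["tại", "Hà_Nội"], ["O", "B-LOC"]))
# tại <mark style="background:#d6f5d6" title="LOC">Hà Nội</mark>
```

In a Streamlit app, a string like this could be rendered with `st.markdown(result, unsafe_allow_html=True)`.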

🧪 Experimental Approaches

The project explores multiple methodologies:

PhoBERT + CRF: Sequential labeling with contextual embeddings
PhoBERT + Softmax: Direct classification approach
Random Forest + Rule-based: Traditional ML with linguistic rules
Random Forest + Semantic Embeddings: Enhanced feature engineering
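For the Random Forest variants, each token has to be described by features before a classifier can label it. The notebooks are not reproduced here, so the feature set below is only an illustrative guess at the kind of hand-crafted cues such a pipeline might use; `token_features` and every feature name are hypothetical.

```python
def token_features(tokens, index):
    """Describe one token by simple surface cues and its neighbors."""
    token = tokens[index]
    return {
        "lower": token.lower(),
        "is_title": token[:1].isupper(),   # first character is uppercase
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_underscore": "_" in token,    # word-segmented compound, e.g. "Hà_Nội"
        "prev_lower": tokens[index - 1].lower() if index > 0 else "<BOS>",
        "next_lower": tokens[index + 1].lower() if index < len(tokens) - 1 else "<EOS>",
    }


features = token_features(["tại", "Hà_Nội"], 1)
print(features["is_title"], features["prev_lower"], features["next_lower"])
# True tại <EOS>
```

A Random Forest needs numeric input, so dictionaries like this would still have to be vectorized (for example with scikit-learn's `DictVectorizer`) or concatenated with embedding vectors.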

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is open source. Please check the repository for license details.

🙏 Acknowledgments

VLSP2016 dataset providers
PhoBERT model creators
Hugging Face for hosting the demo

📚 Additional Resources

For better understanding of the project structure and technologies used:

Happy NER-ing! 🎯