metadata
title: Vietnamese NER Demo
emoji: 🧠
colorFrom: indigo
colorTo: yellow
sdk: streamlit
sdk_version: 1.46.1
app_file: src/app.py
pinned: false

Vietnamese Named Entity Recognition (NER) 🧠

A Vietnamese Named Entity Recognition system that combines PhoBERT contextual embeddings with CRF sequence labeling, Random Forest classifiers, and rule-based methods.

🚀 Live Demo

Try the interactive demo: Vietnamese NER Demo

🔄 Project Workflow

[Figure: Project Flowchart]

🎯 Overview

This project implements a Vietnamese Named Entity Recognition system that identifies and classifies named entities in Vietnamese text. It combines several approaches:

  • PhoBERT-based embeddings for contextual understanding
  • Conditional Random Fields (CRF) for sequence labeling
  • Random Forest with semantic embeddings
  • Rule-based methods for enhanced accuracy
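To make "identify and classify entities" concrete, the following minimal sketch groups token-level tags into entity spans. It assumes a BIO tagging scheme with labels such as PER and LOC; this README does not state the tag scheme, and `bio_to_spans` is a hypothetical helper, not a function from `src/`.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumes tags look like "B-PER", "I-PER", or "O" (an assumption,
    not something this README specifies).
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if there is one.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or a stray I- tag that does not continue the open
            # entity; in this sketch the stray token is simply dropped.
            flush()
    flush()
    return spans


tokens = ["Ông", "Nguyễn", "Văn", "A", "sống", "ở", "Hà", "Nội"]
tags = ["O", "B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('Nguyễn Văn A', 'PER'), ('Hà Nội', 'LOC')]
```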

📂 Project Structure

VIETNAMESE_NER/
│
├── .github/workflows/
│   └── main.yml                   # Auto-deploy to Hugging Face Spaces
│
├── data/                          # Dataset files
│   └── raw_data.csv               # Raw training data
│
├── notebooks/                     # Jupyter notebooks for experimentation
│   ├── Duc_Notebook.ipynb         # CRF + RandomForest experiments
│   ├── Softmax_PhoBERT.ipynb      # Softmax approach
│   ├── Kien_Rule_base.ipynb       # Rule-based method with RF
│   └── Kien_RF_lightgbm.ipynb     # RF with semantic embeddings
│
├── src/                           # Main source code
│   ├── __init__.py
│   ├── app.py                     # Streamlit web application
│   ├── front.py                   # Highlight function
│   ├── config.py                  # Project configuration
│   ├── data_loader.py             # Data loading utilities
│   ├── preprocessing.py           # Data preprocessing functions
│   ├── model.py                   # Model architecture definitions
│   ├── train.py                   # Training pipeline
│   ├── evaluate.py                # Model evaluation
│   └── predict.py                 # Inference utilities
│
├── models/                        # Saved model artifacts
│   └── best_model.pt              # Best trained model weights
│
├── outputs/                       # Training outputs
│   ├── output.log                 # Training logs (TensorBoard)
│   └── figures/                   # Visualization plots
│
├── tests/                         # Unit tests (planned)
│
├── requirements.txt               # Python dependencies
├── environment.yml                # Conda environment file
├── README.md                      # Project documentation
└── run.py                         # Main training script

🏗️ Model Architecture

The system uses a hybrid architecture combining the strengths of different approaches:

[Figure: Model Architecture]

Core Components:

  • PhoBERT-Base: Generates contextual embeddings for Vietnamese text
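  • Linear + CRF Layer: Projects each embedding to per-label scores, then decodes the label sequence jointly so that neighboring labels stay consistent
  • Softmax/Random Forest: Alternative classifiers that predict each token's label independently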
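The difference between the CRF layer and a per-token classifier shows up at decoding time. The sketch below is a textbook Viterbi decoder in plain Python, not the code in `src/model.py`: it assumes the linear layer has already produced per-label emission scores, and it leaves out start/end transitions and batching. `viterbi_decode` and the toy scores are hypothetical.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])          # best score ending in each label
    backpointers: List[List[int]] = []  # best previous label per step

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            best_prev = max(range(num_labels),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev]
                             + transitions[best_prev][j]
                             + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final label to the start.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy labels: 0 = O, 1 = B, 2 = I. The O -> I transition is penalised.
emissions = [[1.0, 0.5, 0.0], [0.2, 0.1, 1.0]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi_decode(emissions, transitions))  # [1, 2]
```

Taking the per-token argmax of the same emissions would give `[0, 2]` (O followed by I), which is not a valid BIO sequence; the transition scores steer the decoder to B followed by I instead.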

📊 Dataset & Performance

Dataset: VLSP2016

The model is trained on the VLSP2016 dataset, which is drawn from Vietnamese news articles.

Dataset Statistics:

[Figures: Entity Frequency, Entity Distribution, Token Length Distribution, Sentence Length Distribution]

Model Performance:

[Figures: F1 Score, Training Loss]

[Figure: Results Comparison]
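For readers unfamiliar with how an NER F1 score is usually computed, here is a minimal sketch of entity-level precision, recall, and F1 with exact span-and-type matching. This README does not say whether the reported F1 is entity-level or token-level, so treat this as one common convention rather than the project's evaluation code; `entity_f1` and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token, entity_type)


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span-and-type matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(1, 3, "PER"), (6, 7, "LOC")}
predicted = {(1, 3, "PER"), (6, 7, "ORG"), (9, 9, "LOC")}
precision, recall, f1 = entity_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
# P=0.33 R=0.50 F1=0.40
```

Under this convention a span with the right boundaries but the wrong type, such as `(6, 7, "ORG")` above, counts as both a false positive and a missed entity.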

🛠️ Installation & Setup

Prerequisites

  • Python 3.10+
  • Conda (recommended)

Option 1: Using requirements.txt

# Create and activate conda environment
conda create --name vnner python=3.10
conda activate vnner

# Install dependencies
pip install -r requirements.txt

Option 2: Using environment.yml

# Create environment from yml file
conda env create -f environment.yml
conda activate vnner

🚀 Quick Start

Training the Model

python run.py

Running the Streamlit App

streamlit run src/app.py
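The app's entity highlighting lives in `src/front.py`, whose contents are not shown in this README. As a rough illustration of the idea, the sketch below renders tagged text as HTML. The function name `highlight_entities`, the label set, and the colours are all assumptions.

```python
import html
from typing import Dict, List, Tuple

# Hypothetical colour map; the real app may use other labels and colours.
ENTITY_COLORS: Dict[str, str] = {
    "PER": "#ffd6d6",
    "LOC": "#d6ffd6",
    "ORG": "#d6e4ff",
}


def highlight_entities(tagged: List[Tuple[str, str]]) -> str:
    """Render (text, label) pairs as HTML, wrapping entities in <mark>.

    A label of "O" means the text is not an entity.
    """
    parts = []
    for text, label in tagged:
        safe = html.escape(text)  # never inject raw user text into HTML
        if label == "O":
            parts.append(safe)
        else:
            color = ENTITY_COLORS.get(label, "#eeeeee")
            parts.append(
                f'<mark style="background:{color}">{safe} '
                f"<sub>{html.escape(label)}</sub></mark>"
            )
    return " ".join(parts)


print(highlight_entities([("Tôi", "O"), ("sống ở", "O"), ("Hà Nội", "LOC")]))
```

In a Streamlit script, a string like this can be displayed with `st.markdown(html_string, unsafe_allow_html=True)`.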

🧪 Experimental Approaches

The project explores multiple methodologies:

  1. PhoBERT + CRF: Sequential labeling with contextual embeddings
  2. PhoBERT + Softmax: Direct classification approach
  3. Random Forest + Rule-based: Traditional ML with linguistic rules
  4. Random Forest + Semantic Embeddings: Enhanced feature engineering
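For the Random Forest approaches, each token has to be turned into a feature vector before classification. The notebooks are not reproduced here, so the following is only an illustrative sketch of hand-crafted, rule-style token features (capitalisation, digits, neighboring words); `token_features` and every feature name in it are hypothetical.

```python
from typing import Dict, List, Union

Feature = Union[bool, int, str]


def token_features(tokens: List[str], i: int) -> Dict[str, Feature]:
    """Build a small feature dict for tokens[i] with one token of context."""
    token = tokens[i]
    has_prev = i > 0
    has_next = i < len(tokens) - 1
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),   # capitalised words often start names
        "is_upper": token.isupper(),
        "is_digit": token.isdigit(),
        "length": len(token),
        "prev_lower": tokens[i - 1].lower() if has_prev else "<BOS>",
        "next_lower": tokens[i + 1].lower() if has_next else "<EOS>",
        "prev_is_title": tokens[i - 1].istitle() if has_prev else False,
    }


sentence = ["Ông", "Nam", "ở", "Huế"]
print(token_features(sentence, 1))
```

String-valued features such as `lower` would need to be vectorised (for example with scikit-learn's `DictVectorizer`) before being passed to a random forest, and embedding vectors could be concatenated to the result for the semantic-embedding variant.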

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is open source. Please check the repository for license details.

🙏 Acknowledgments

  • VLSP2016 dataset providers
  • PhoBERT model creators
  • Hugging Face for hosting the demo

📚 Additional Resources

For better understanding of the project structure and technologies used:


Happy NER-ing! 🎯