---
title: Vietnamese Legal Doc Retrieval
colorFrom: indigo
colorTo: pink
sdk: docker
pinned: false
short_description: Fine-tuned Retrieval System for Vietnamese Legal Documents
models:
- YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs
datasets:
- YuITC/Vietnamese-Legal-Doc-Retrieval-Data
---
# Vietnamese Legal Document Retrieval System

- [Hugging Face Space](https://huggingface.co/spaces/YuITC/Vietnamese-Legal-Doc-Retrieval)
- [Model](https://huggingface.co/YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs)
- [Dataset](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data)

A retrieval system designed for Vietnamese legal documents, built on a fine-tuned SBERT (Sentence-BERT) model.
## Overview

This project retrieves relevant Vietnamese legal documents for user queries. It uses a fine-tuned multilingual BERT model to encode legal queries and documents into a shared semantic vector space, so retrieval is based on meaning rather than on keyword matching alone.
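To make the idea of retrieval in a semantic vector space concrete, here is a minimal sketch in plain Python. It is not the project's code: the three-dimensional vectors, the document IDs, and the helper names `cosine` and `rank_documents` are all hypothetical stand-ins for real model embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs by descending similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings (assumption: invented for illustration only).
doc_vecs = {
    "doc_labor": [0.9, 0.1, 0.0],
    "doc_marriage": [0.0, 0.2, 0.9],
    "doc_criminal": [0.1, 0.9, 0.1],
}
# Stands in for the encoded form of a query about workers' rights.
query_vec = [0.8, 0.2, 0.1]

results = rank_documents(query_vec, doc_vecs, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Documents whose vectors point in nearly the same direction as the query vector rank first, regardless of whether they share any exact words with the query.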
## Key features

- Step-by-step notebooks that walk through each stage of the pipeline.
- Fine-tuned SBERT model specialized for Vietnamese legal document retrieval.
- FAISS indexing for efficient vector search.
- Evaluation based on MTEB.
- Interactive Gradio web interface for quick legal document search.
- Strong retrieval quality on legal passages (NDCG@10 of 60.389% on the BKAILegalDocRetrieval benchmark).
## Installation & Usage

```bash
# Install dependencies
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
pip install -r requirements.txt

# Run the application
python main.py
```

Running `main.py` starts a local web server with the Gradio interface, where you can enter legal queries and retrieve relevant documents.
## Project Structure

```
Vietnamese-Legal-Doc-Retrieval/
├── assets/                         # Visual assets for documentation
│   └── gradio_demo.png             # Screenshot of the Gradio demo interface
├── cache/                          # Cached model files
│   └── VN-legalDocs-SBERT/         # Cached BERT model files
├── data/                           # Dataset files
│   ├── original/                   # Original downloaded dataset
│   │   ├── corpus.csv              # Raw corpus documents
│   │   ├── train_split.csv         # Training data
│   │   ├── val_split.csv           # Validation data
│   │   └── ...
│   ├── processed/                  # Processed dataset files
│   │   ├── corpus_data.parquet     # Processed corpus for embedding
│   │   ├── train_data.parquet      # Processed training data
│   │   └── test_data.parquet       # Processed test data
│   └── retrieval/                  # Files for retrieval system
│       └── legal_faiss.index       # FAISS index for fast vector search
├── models/                         # Trained model files
│   └── VN-legalDocs-SBERT/         # Fine-tuned BERT model for legal documents
│       ├── model.safetensors       # Model weights
│       ├── config.json             # Model configuration
│       └── checkpoint-*/           # Training checkpoints
├── results/                        # Evaluation results
├── Dockerfile                      # Docker configuration for deployment
├── main.py                         # Main application entry point
├── requirements.txt                # Python dependencies
├── settings.py                     # Configuration settings
└── step_*_*.ipynb                  # Jupyter notebooks for each step of the process
```
## Dataset

The system is trained on a Vietnamese legal document corpus containing:

- Legal texts from various domains
- Query-document pairs for training and evaluation
- Data processed and structured for semantic search training

The dataset is available on [Hugging Face](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data). It is a modified version of the base dataset credited in the Acknowledgments.
## Model Training Process

The project builds the retrieval system in four steps:

1. **Data Preparation** (`step_01_Prepare_Data.ipynb`):
   - Processes raw legal documents
   - Creates query-document pairs for training
   - Formats data for the embedding model
2. **SBERT Fine-tuning** (`step_02_Finetune_SBERT.ipynb`):
   - Fine-tunes a multilingual BERT model on legal query-document pairs
   - Uses `CachedMultipleNegativesRankingLoss` for training
   - Optimizes for semantic similarity in the legal context
3. **Evaluation** (`step_03_Eval_with_MTEB.ipynb`):
   - Evaluates model performance using retrieval metrics
   - Compares the fine-tuned model with the pre-trained model
4. **Retrieval System Setup** (`step_04_Retrieval.ipynb`):
   - Creates a FAISS index from document embeddings
   - Implements efficient search functionality
   - Prepares for deployment
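The fine-tuning step names `CachedMultipleNegativesRankingLoss`. As a rough illustration of the in-batch negatives objective behind that loss family, the sketch below computes it in plain Python. It is not the notebook's code and not the library implementation: the function name `in_batch_negatives_loss`, the toy vectors, and the `scale=20.0` value are assumptions made for illustration. To my understanding, the cached variant optimizes the same objective while processing the batch in smaller chunks to reduce memory use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy over a batch of (query, positive document) pairs.

    Document i is the positive for query i; every other document in the
    batch serves as a negative for that query.
    """
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # Negative log-probability assigned to the correct document.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-D embeddings: each query matches its own document exactly.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
# Same documents in the wrong order: each query now matches a negative.
swapped = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])

print(f"aligned loss: {aligned:.6f}")
print(f"swapped loss: {swapped:.6f}")
```

The loss is near zero when each query is closest to its own document and grows when a negative scores higher, which is what pushes matching query-document pairs together during fine-tuning. Larger batches supply more negatives per query.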
## Usage Examples

The system accepts natural language queries in Vietnamese related to legal topics. Example queries:

- "Tội xúc phạm danh dự?" (Crimes against honor?)
- "Quyền lợi của người lao động?" (Rights of workers?)
- "Thủ tục đăng ký kết hôn?" (Marriage registration procedures?)
## Performance

The fine-tuned model was evaluated using the [MTEB benchmark](https://github.com/embeddings-benchmark/mteb) on the BKAILegalDocRetrieval dataset. Key results:

| Metric        | @k  | Pre-trained model score (%) | Fine-tuned model score (%) |
|---------------|-----|-----------------------------|----------------------------|
| **NDCG**      | 1   | 0.007                       | 42.425                     |
|               | 5   | 0.011                       | 57.387                     |
|               | 10  | 0.023                       | 60.389                     |
|               | 20  | 0.049                       | 62.160                     |
|               | 100 | 0.147                       | 63.894                     |
| **MAP**       | 1   | 0.007                       | 40.328                     |
|               | 5   | 0.009                       | 52.297                     |
|               | 10  | 0.014                       | 53.608                     |
|               | 20  | 0.021                       | 54.136                     |
|               | 100 | 0.033                       | 54.418                     |
| **Recall**    | 1   | 0.007                       | 40.328                     |
|               | 5   | 0.017                       | 70.466                     |
|               | 10  | 0.054                       | 79.407                     |
|               | 20  | 0.157                       | 86.112                     |
|               | 100 | 0.713                       | 94.805                     |
| **Precision** | 1   | 0.007                       | 42.425                     |
|               | 5   | 0.003                       | 15.119                     |
|               | 10  | 0.005                       | 8.587                      |
|               | 20  | 0.008                       | 4.687                      |
|               | 100 | 0.007                       | 1.045                      |
| **MRR**       | 1   | 0.007                       | 42.418                     |
|               | 5   | 0.010                       | 54.337                     |
|               | 10  | 0.014                       | 55.510                     |
|               | 20  | 0.021                       | 55.956                     |
|               | 100 | 0.033                       | 56.172                     |
- **NDCG@k (Normalized Discounted Cumulative Gain)**: Measures ranking quality by evaluating the relevance of results with logarithmic position-based discounting.
- **MAP@k (Mean Average Precision)**: Computes the average precision for each query up to rank k (the mean of the precision values at each rank where a relevant document is retrieved), then averages across all queries.
- **Recall@k**: The proportion of all relevant documents that are retrieved in the top k results.
- **Precision@k**: The proportion of the top k retrieved documents that are relevant.
- **MRR@k (Mean Reciprocal Rank)**: The average of the reciprocal of the rank position of the first relevant document across all queries.
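The definitions above can be sketched for a single query with binary relevance as follows. This is an independent illustration, not the code MTEB runs: the function names and the toy ranking are hypothetical, `relevant` is assumed to be non-empty, and normalizing average precision by the total number of relevant documents is one common convention among several. The benchmark scores are the means of such per-query values over all queries.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant result in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(ranked, relevant, k):
    """Sum of precision values at each relevant hit, over all relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG with a log2 position discount, divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

# Hypothetical ranking for one query; d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2"}

print("Precision@5:", precision_at_k(ranked, relevant, 5))
print("Recall@5:   ", recall_at_k(ranked, relevant, 5))
print("RR@5:       ", reciprocal_rank_at_k(ranked, relevant, 5))
print("AP@5:       ", average_precision_at_k(ranked, relevant, 5))
print("NDCG@5:     ", round(ndcg_at_k(ranked, relevant, 5), 4))
```

Because Precision@k divides by k, it shrinks as k grows when each query has only a few relevant documents, while Recall@k can only rise. That pattern matches the direction of the Precision and Recall rows in the table.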
The fine-tuned model far outperforms the pre-trained model: the main evaluation score, NDCG@10, rises from 0.023% to 60.389%, and Recall@100 reaches 94.805%.
## Docker Deployment

The project includes a Docker configuration for easy deployment. The Docker image is built on `continuumio/miniconda3` and includes GPU support via PyTorch CUDA and FAISS-GPU.

```bash
# Build the Docker image
docker build -t vietnamese-legal-retrieval .

# Run the container
docker run -p 7860:7860 vietnamese-legal-retrieval
```

The container:

- Uses Python 3.10 with CUDA 12.1 support
- Installs required dependencies from `requirements.txt`
- Exposes port 7860 for the Gradio web interface
- Sets environment variables for security and performance
- Runs as a non-root user for enhanced security

After starting the container, open `http://localhost:7860` to access the web interface.
## License

This project is licensed under the MIT License; feel free to modify and distribute it as needed.
## Acknowledgments

Thanks to:

- [BKAI Legal Retrieval Dataset](https://huggingface.co/datasets/tmnam20/BKAI-Legal-Retrieval) for the original data
- [Sentence Transformers](https://www.sbert.net/) library for the embedding model architecture
- [Hugging Face](https://huggingface.co/) for hosting the model and dataset

If you find this project useful, consider starring the repository or contributing improvements.
## Contact

For any questions or collaboration opportunities, feel free to reach out:

Email: tainguyenphu2502@gmail.com