# Saksi Translation: Nepali/Sinhala to English Machine Translation
This project provides a machine translation solution to translate text from Nepali and Sinhala to English. It leverages the power of the NLLB (No Language Left Behind) model from Meta AI, which is fine-tuned on a custom dataset for improved performance. The project includes a complete workflow from data acquisition to model deployment, featuring a REST API for easy integration.
## Table of Contents
- [Features](#features)
- [Workflow](#workflow)
- [Tech Stack](#tech-stack)
- [Model Details](#model-details)
- [API Endpoints](#api-endpoints)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Future Improvements](#future-improvements)
## Features
- **High-Quality Translation:** Utilizes a fine-tuned NLLB model for accurate translations.
- **Support for Multiple Languages:** Currently supports Nepali and Sinhala to English translation.
- **REST API:** Exposes the translation model through a high-performance FastAPI application.
- **Interactive Frontend:** A simple and intuitive web interface for easy translation.
- **Batch Translation:** Supports translating multiple texts in a single request.
- **PDF Translation:** Supports translating text directly from PDF files.
- **Scalable and Reproducible:** Built with a modular structure and uses MLflow for experiment tracking.
## Workflow
The project follows a standard machine learning workflow for building and deploying a translation model:
1. **Data Acquisition:** The process begins with collecting parallel text data (Nepali/Sinhala and English). The `scripts/fetch_parallel_data.py` script is used to download data from various online sources. The quality and quantity of this data are crucial for the model's performance.
2. **Data Cleaning and Preprocessing:** Raw data from the web is often noisy and requires cleaning. The `scripts/clean_text_data.py` script performs several preprocessing steps:
* **HTML Tag Removal:** Strips out HTML tags and other web artifacts.
* **Unicode Normalization:** Normalizes Unicode characters to ensure consistency.
* **Sentence Filtering:** Removes sentences that are too long or too short, which can negatively impact training.
* **Corpus Alignment:** Ensures a one-to-one correspondence between source and target sentences.
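A minimal sketch of the first three cleaning steps (the actual `scripts/clean_text_data.py` may differ; the function names and the 3/80-word length thresholds here are illustrative assumptions):

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # matches HTML tags like <p> or </div>

def clean_sentence(text: str) -> str:
    """Strip HTML tags, normalize Unicode, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)               # HTML tag removal
    text = unicodedata.normalize("NFC", text)  # Unicode normalization
    return " ".join(text.split())              # collapse runs of whitespace

def keep_sentence(text: str, min_words: int = 3, max_words: int = 80) -> bool:
    """Filter out sentences that are too short or too long for training."""
    n = len(text.split())
    return min_words <= n <= max_words
```

A cleaned corpus would then be the sentences where `keep_sentence(clean_sentence(s))` holds, applied to source and target files in lockstep so alignment is preserved.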
3. **Model Finetuning:** The core of the project is fine-tuning a pre-trained NLLB model on our custom parallel dataset. The `src/train.py` script, which leverages the Hugging Face `Trainer` API, handles this process. This script manages the entire training loop, including:
* Loading the pre-trained NLLB model and tokenizer.
* Creating a PyTorch Dataset from the preprocessed data.
* Configuring training arguments like learning rate, batch size, and number of epochs.
* Executing the training loop and saving the fine-tuned model checkpoints.
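Before a dataset can be built, the parallel files must be paired line by line. A minimal sketch of that pairing step (the real `src/train.py` uses the Hugging Face/PyTorch dataset machinery; `load_parallel_pairs` is a hypothetical helper):

```python
def load_parallel_pairs(src_path: str, tgt_path: str) -> list[dict]:
    """Pair each source line with its target line, skipping empty lines."""
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        return [
            {"source": s.strip(), "target": t.strip()}
            for s, t in zip(f_src, f_tgt)
            if s.strip() and t.strip()
        ]
```

Because `zip` pairs the files positionally, this only works if the corpus-alignment step above guaranteed a strict one-to-one line correspondence.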
4. **Model Evaluation:** After training, the model's performance is evaluated using the `src/evaluation.py` script. This script calculates the **BLEU (Bilingual Evaluation Understudy)** score, a widely accepted metric for machine translation quality. It works by comparing the model's translations of a test set with a set of high-quality reference translations.
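The core idea behind BLEU is clipped n-gram precision: count how many of the candidate's n-grams also appear in the reference, capping each count at the reference's count. A simplified single-sentence illustration of that building block (real evaluation should use a library such as sacreBLEU, which also applies a brevity penalty and aggregates over the whole test corpus):

```python
from collections import Counter

def modified_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision, the building block of the BLEU score."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clip each candidate n-gram count at its count in the reference
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0
```

For the candidate "the cat sat on the mat" against the reference "the cat is on the mat", 5 of 6 unigrams match, so the unigram precision is 5/6.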
5. **Inference and Deployment:** Once the model is trained and evaluated, it's ready for use.
* `interactive_translate.py`: A command-line script for quick, interactive translation tests.
* `fast_api.py`: A production-ready REST API built with FastAPI that serves the translation model. This allows other applications to easily consume the translation service.
## Tech Stack
The technologies used in this project were chosen to create a robust, efficient, and maintainable machine translation pipeline:
- **Python:** The primary language for the project, offering a rich ecosystem of libraries and frameworks for machine learning.
- **PyTorch:** A flexible and powerful deep learning framework that provides fine-grained control over the model training process.
- **Hugging Face Transformers:** The backbone of the project, providing easy access to pre-trained models like NLLB and a standardized interface for training and inference.
- **Hugging Face Datasets:** Simplifies the process of loading and preprocessing large datasets, with efficient data loading and manipulation capabilities.
- **FastAPI:** A modern, high-performance web framework for building APIs with Python. It's used to serve the translation model as a REST API.
- **Uvicorn:** A lightning-fast ASGI server, used to run the FastAPI application.
- **MLflow:** Used for experiment tracking to ensure reproducibility. It logs training parameters, metrics, and model artifacts, which is crucial for managing machine learning projects.
## Model Details
- **Base Model:** The project uses the `facebook/nllb-200-distilled-600M` model, a distilled version of the NLLB-200 model. This model is designed to be efficient while still providing high-quality translations for a large number of languages.
- **Fine-tuning:** The base model is fine-tuned on a custom dataset of Nepali-English and Sinhala-English parallel text to improve its performance on these specific language pairs.
- **Tokenizer:** The `NllbTokenizer` is used for tokenizing the text. It is a SentencePiece-based tokenizer designed specifically for the NLLB model.
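NLLB checkpoints identify languages by FLORES-200 codes, which the tokenizer takes as `src_lang` and which select the translation target at generation time. A small lookup for the languages relevant here (the codes are the standard NLLB-200 ones; the `to_nllb_code` helper itself is illustrative, not part of this project):

```python
# FLORES-200 language codes used by NLLB-200 checkpoints
NLLB_LANG_CODES = {
    "nepali": "npi_Deva",   # Nepali, Devanagari script
    "sinhala": "sin_Sinh",  # Sinhala, Sinhala script
    "english": "eng_Latn",  # English, Latin script
}

def to_nllb_code(language: str) -> str:
    """Map a human-readable language name to its NLLB/FLORES-200 code."""
    return NLLB_LANG_CODES[language.lower()]
```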
## API Endpoints
The FastAPI application provides the following endpoints:
- **`GET /`**: Returns the frontend HTML page.
- **`GET /languages`**: Returns a list of supported languages.
- **`POST /translate`**: Translates a single text.
- **Request Body:**
```json
{
"text": "string",
"source_language": "string"
}
```
- **Response Body:**
```json
{
"original_text": "string",
"translated_text": "string",
"source_language": "string"
}
```
- **`POST /batch-translate`**: Translates a batch of texts.
- **Request Body:**
```json
{
"texts": [
"string"
],
"source_language": "string"
}
```
- **Response Body:**
```json
{
"original_texts": [
"string"
],
"translated_texts": [
"string"
],
"source_language": "string"
}
```
- **`POST /translate-pdf`**: Translates a PDF file.
- **Request:** `source_language: str`, `file: UploadFile`
- **Response Body:**
```json
{
"filename": "string",
"translated_text": "string",
"source_language": "string"
}
```
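As an example of consuming the API, here is a minimal stdlib-only Python client for the `/translate` endpoint (assuming the server runs at the default `http://127.0.0.1:8000`; the exact `source_language` values accepted can be discovered via `GET /languages`):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed local default

def build_translate_request(text: str, source_language: str) -> urllib.request.Request:
    """Build a POST request matching the /translate request schema."""
    body = json.dumps({"text": text, "source_language": source_language})
    return urllib.request.Request(
        f"{BASE_URL}/translate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text: str, source_language: str) -> str:
    """Send the request and return the translated text from the response."""
    with urllib.request.urlopen(build_translate_request(text, source_language)) as resp:
        return json.loads(resp.read())["translated_text"]
```

The same pattern applies to `/batch-translate` with a `texts` list instead of a single `text` field.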
## Getting Started
### Prerequisites
- **Python 3.10 or higher:** Ensure you have a recent version of Python installed.
- **Git and Git LFS:** Git is required to clone the repository, and Git LFS is required to handle large model files.
- **(Optional) NVIDIA GPU with CUDA:** A GPU is highly recommended for training the model.
### Installation
1. **Clone the repository:**
```bash
git clone <repository-url>
cd saksi_translation
```
2. **Create and activate a virtual environment:**
```bash
python -m venv .venv
# On Windows
.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
## Usage
### Data Preparation
- **Fetch Parallel Data:**
```bash
python scripts/fetch_parallel_data.py --output_dir data/raw
```
- **Clean Text Data:**
```bash
python scripts/clean_text_data.py --input_dir data/raw --output_dir data/processed
```
### Training
- **Start Training:**
```bash
python src/train.py \
--model_name "facebook/nllb-200-distilled-600M" \
--dataset_path "data/processed" \
--output_dir "models/nllb-finetuned-nepali-en" \
--learning_rate 2e-5 \
--per_device_train_batch_size 8 \
--num_train_epochs 3
```
### Evaluation
- **Evaluate the Model:**
```bash
python src/evaluation.py \
    --model_path "models/nllb-finetuned-nepali-en" \
    --test_data_path "data/test_sets/test.ne" \
    --reference_data_path "data/test_sets/test.en"
```
### Interactive Translation
- **Run the interactive script:**
```bash
python interactive_translate.py
```
### API
- **Run the API:**
```bash
uvicorn fast_api:app --reload
```
Open your browser and navigate to `http://127.0.0.1:8000` to use the web interface.
## Project Structure
```
saksi_translation/
├── .gitignore
├── fast_api.py                # FastAPI application
├── interactive_translate.py   # Interactive translation script
├── README.md                  # Project documentation
├── requirements.txt           # Python dependencies
├── test_translation.py        # Script for testing the translation model
├── frontend/
│   ├── index.html             # Frontend HTML
│   ├── script.js              # Frontend JavaScript
│   └── styles.css             # Frontend CSS
├── data/
│   ├── processed/             # Processed data for training
│   ├── raw/                   # Raw data downloaded from the web
│   └── test_sets/             # Test sets for evaluation
├── mlruns/                    # MLflow experiment tracking data
├── models/
│   └── nllb-finetuned-nepali-en/  # Fine-tuned model
├── notebooks/                 # Jupyter notebooks for experimentation
├── scripts/
│   ├── clean_text_data.py
│   ├── create_test_set.py
│   ├── download_model.py
│   ├── fetch_parallel_data.py
│   └── scrape_bbc_nepali.py
└── src/
    ├── __init__.py
    ├── evaluation.py          # Script for evaluating the model
    ├── train.py               # Script for training the model
    └── translate.py           # Script for translating text
```
## Future Improvements
- **Support for more languages:** The project can be extended to support more languages by adding more parallel data and fine-tuning the model on it.
- **Improved Model:** The model can be improved by using a larger version of the NLLB model or by fine-tuning it on a larger and cleaner dataset.
- **Advanced Frontend:** The frontend can be improved by adding features like translation history, user accounts, and more advanced styling.
- **Containerization:** The application can be containerized using Docker for easier deployment and scaling.