---
title: Translate
emoji: 🌐
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---
# Saksi Translation: Nepali/Sinhala-to-English Machine Translation

This project translates text from Nepali and Sinhala to English. It builds on Meta AI's NLLB (No Language Left Behind) model, fine-tuned on a custom dataset for improved performance on these language pairs, and covers the complete workflow from data acquisition to model deployment, including a REST API for easy integration.

## Table of Contents

- [Features](#features)
- [Workflow](#workflow)
- [Tech Stack](#tech-stack)
- [Model Details](#model-details)
- [API Endpoints](#api-endpoints)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Future Improvements](#future-improvements)

## Features

-   **High-Quality Translation:** Utilizes a fine-tuned NLLB model for accurate translations.
-   **Support for Multiple Languages:** Currently supports Nepali and Sinhala to English translation.
-   **REST API:** Exposes the translation model through a high-performance FastAPI application.
-   **Interactive Frontend:** A simple and intuitive web interface for easy translation.
-   **Batch Translation:** Supports translating multiple texts in a single request.
-   **PDF Translation:** Supports translating text directly from PDF files.
-   **Scalable and Reproducible:** Built with a modular structure and uses MLflow for experiment tracking.

## Workflow

The project follows a standard machine learning workflow for building and deploying a translation model:

1.  **Data Acquisition:** The process begins with collecting parallel text data (Nepali/Sinhala and English). The `scripts/fetch_parallel_data.py` script is used to download data from various online sources. The quality and quantity of this data are crucial for the model's performance.

2.  **Data Cleaning and Preprocessing:** Raw data from the web is often noisy and requires cleaning. The `scripts/clean_text_data.py` script performs several preprocessing steps:
    *   **HTML Tag Removal:** Strips out HTML tags and other web artifacts.
    *   **Unicode Normalization:** Normalizes Unicode characters to ensure consistency.
    *   **Sentence Filtering:** Removes sentences that are too long or too short, which can negatively impact training.
    *   **Corpus Alignment:** Ensures a one-to-one correspondence between source and target sentences.
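
    The cleaning steps above can be sketched in a few lines. This is a simplified illustration, not the actual contents of `scripts/clean_text_data.py`; the helper names are hypothetical:

    ```python
    import re
    import unicodedata

    TAG_RE = re.compile(r"<[^>]+>")

    def clean_sentence(text: str) -> str:
        """Strip HTML tags, NFC-normalize Unicode, and collapse whitespace."""
        text = TAG_RE.sub(" ", text)
        text = unicodedata.normalize("NFC", text)
        return " ".join(text.split())

    def filter_by_length(sentences, min_words=3, max_words=80):
        """Drop sentences that are too short or too long for training."""
        return [s for s in sentences if min_words <= len(s.split()) <= max_words]
    ```

    Alignment is then a matter of applying the same filtering decisions to both sides of the corpus so the one-to-one sentence correspondence is preserved.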

3.  **Model Finetuning:** The core of the project is fine-tuning a pre-trained NLLB model on our custom parallel dataset. The `src/train.py` script, which leverages the Hugging Face `Trainer` API, handles this process. This script manages the entire training loop, including:
    *   Loading the pre-trained NLLB model and tokenizer.
    *   Creating a PyTorch Dataset from the preprocessed data.
    *   Configuring training arguments like learning rate, batch size, and number of epochs.
    *   Executing the training loop and saving the fine-tuned model checkpoints.
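
    As an illustration of the dataset-creation step, a minimal parallel-corpus dataset might look like the following. The class name is an assumption; in the real project it would subclass `torch.utils.data.Dataset`, but only `__len__` and `__getitem__` are required for the sketch:

    ```python
    from pathlib import Path

    class ParallelTextDataset:
        """Pairs up aligned source/target sentence files, one sentence per line."""

        def __init__(self, source_path, target_path):
            src = Path(source_path).read_text(encoding="utf-8").splitlines()
            tgt = Path(target_path).read_text(encoding="utf-8").splitlines()
            assert len(src) == len(tgt), "corpora must be aligned 1:1"
            self.pairs = list(zip(src, tgt))

        def __len__(self):
            return len(self.pairs)

        def __getitem__(self, idx):
            src, tgt = self.pairs[idx]
            return {"source": src, "target": tgt}
    ```

    In training, each pair would additionally be tokenized (source as model input, target as labels) before being handed to the `Trainer`.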

4.  **Model Evaluation:** After training, the model's performance is evaluated using the `src/evaluation.py` script. This script calculates the **BLEU (Bilingual Evaluation Understudy)** score, a widely accepted metric for machine translation quality. It works by comparing the model's translations of a test set with a set of high-quality reference translations.
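
    To make the metric concrete, here is a toy single-reference BLEU computation: modified n-gram precisions up to 4-grams, combined by geometric mean with a brevity penalty. Production code should use an established implementation such as sacreBLEU; this sketch is for illustration only:

    ```python
    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
        """Toy BLEU; hypotheses shorter than max_n tokens score 0 here."""
        ref, hyp = reference.split(), hypothesis.split()
        if not hyp:
            return 0.0
        log_precisions = []
        for n in range(1, max_n + 1):
            ref_counts = Counter(ngrams(ref, n))
            hyp_counts = Counter(ngrams(hyp, n))
            # Modified precision: clip each n-gram count by the reference count.
            overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total = max(len(hyp) - n + 1, 0)
            if total == 0 or overlap == 0:
                return 0.0  # any zero precision zeroes the geometric mean
            log_precisions.append(math.log(overlap / total))
        # Brevity penalty punishes translations shorter than the reference.
        bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
        return bp * math.exp(sum(log_precisions) / max_n)
    ```

    A perfect match scores 1.0, and scores fall as n-gram overlap with the reference drops or the hypothesis becomes too short.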

5.  **Inference and Deployment:** Once the model is trained and evaluated, it's ready for use.
    *   `interactive_translate.py`: A command-line script for quick, interactive translation tests.
    *   `fast_api.py`: A production-ready REST API built with FastAPI that serves the translation model. This allows other applications to easily consume the translation service.

## Tech Stack

The technologies used in this project were chosen to create a robust, efficient, and maintainable machine translation pipeline:

-   **Python:** The primary language for the project, offering a rich ecosystem of libraries and frameworks for machine learning.
-   **PyTorch:** A flexible and powerful deep learning framework that provides fine-grained control over the model training process.
-   **Hugging Face Transformers:** The backbone of the project, providing easy access to pre-trained models like NLLB and a standardized interface for training and inference.
-   **Hugging Face Datasets:** Simplifies the process of loading and preprocessing large datasets, with efficient data loading and manipulation capabilities.
-   **FastAPI:** A modern, high-performance web framework for building APIs with Python. It's used to serve the translation model as a REST API.
-   **Uvicorn:** A lightning-fast ASGI server, used to run the FastAPI application.
-   **MLflow:** Used for experiment tracking to ensure reproducibility. It logs training parameters, metrics, and model artifacts, which is crucial for managing machine learning projects.

## Model Details

-   **Base Model:** The project uses the `facebook/nllb-200-distilled-600M` model, a distilled version of the NLLB-200 model. This model is designed to be efficient while still providing high-quality translations for a large number of languages.
-   **Fine-tuning:** The base model is fine-tuned on a custom dataset of Nepali-English and Sinhala-English parallel text to improve its performance on these specific language pairs.
-   **Tokenizer:** Text is tokenized with `NllbTokenizer`, a SentencePiece-based tokenizer designed for the NLLB model.
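
NLLB identifies languages by FLORES-200 codes, which the tokenizer takes as `src_lang` and which select the target-language token at generation time. A minimal mapping for this project's languages might look like this; the function name and the accepted spellings are assumptions, not the project's actual API:

```python
# FLORES-200 codes understood by the NLLB tokenizer.
NLLB_CODES = {
    "nepali": "npi_Deva",
    "sinhala": "sin_Sinh",
    "english": "eng_Latn",
}

def to_nllb_code(language: str) -> str:
    """Map a human-readable language name to its FLORES-200 code."""
    try:
        return NLLB_CODES[language.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None
```

The resulting code would then be passed as `src_lang` when constructing the tokenizer, with `eng_Latn` forced as the first generated token to translate into English.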

## API Endpoints

The FastAPI application provides the following endpoints:

-   **`GET /`**: Returns the frontend HTML page.
-   **`GET /languages`**: Returns a list of supported languages.
-   **`POST /translate`**: Translates a single text.
    -   **Request Body:**
        ```json
        {
          "text": "string",
          "source_language": "string"
        }
        ```
    -   **Response Body:**
        ```json
        {
          "original_text": "string",
          "translated_text": "string",
          "source_language": "string"
        }
        ```
-   **`POST /batch-translate`**: Translates a batch of texts.
    -   **Request Body:**
        ```json
        {
          "texts": [
            "string"
          ],
          "source_language": "string"
        }
        ```
    -   **Response Body:**
        ```json
        {
          "original_texts": [
            "string"
          ],
          "translated_texts": [
            "string"
          ],
          "source_language": "string"
        }
        ```
-   **`POST /translate-pdf`**: Translates a PDF file.
    -   **Request:** `source_language: str`, `file: UploadFile`
    -   **Response Body:**
        ```json
        {
          "filename": "string",
          "translated_text": "string",
          "source_language": "string"
        }
        ```
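
The request bodies above can be built and sent with nothing more than the standard library. The snippet below constructs payloads matching the documented schemas; the exact `source_language` values the API accepts (e.g. `"nepali"`) are an assumption:

```python
import json

def translate_payload(text: str, source_language: str) -> bytes:
    """Build the JSON body for POST /translate."""
    return json.dumps({"text": text, "source_language": source_language}).encode("utf-8")

def batch_translate_payload(texts, source_language: str) -> bytes:
    """Build the JSON body for POST /batch-translate."""
    return json.dumps({"texts": list(texts), "source_language": source_language}).encode("utf-8")

# Sending the request (not executed here; requires the API to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/translate",
#     data=translate_payload("नमस्ते", "nepali"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["translated_text"])
```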

## Getting Started

### Prerequisites

-   **Python 3.10 or higher:** Ensure you have a recent version of Python installed.
-   **Git and Git LFS:** Git is required to clone the repository, and Git LFS is required to handle large model files.
-   **(Optional) NVIDIA GPU with CUDA:** A GPU is highly recommended for training the model.

### Installation

1.  **Clone the repository:**
    ```bash
    git clone <repository-url>
    cd saksi_translation
    ```

2.  **Create and activate a virtual environment:**
    ```bash
    python -m venv .venv
    # On Windows
    .venv\Scripts\activate
    # On macOS/Linux
    source .venv/bin/activate
    ```

3.  **Install dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

## Usage

### Data Preparation

-   **Fetch Parallel Data:**
    ```bash
    python scripts/fetch_parallel_data.py --output_dir data/raw
    ```

-   **Clean Text Data:**
    ```bash
    python scripts/clean_text_data.py --input_dir data/raw --output_dir data/processed
    ```

### Training

-   **Start Training:**
    ```bash
    python src/train.py \
        --model_name "facebook/nllb-200-distilled-600M" \
        --dataset_path "data/processed" \
        --output_dir "models/nllb-finetuned-nepali-en" \
        --learning_rate 2e-5 \
        --per_device_train_batch_size 8 \
        --num_train_epochs 3
    ```

### Evaluation

-   **Evaluate the Model:**
    ```bash
    python src/evaluation.py \
        --model_path "models/nllb-finetuned-nepali-en" \
        --test_data_path "data/test_sets/test.ne" \
        --reference_data_path "data/test_sets/test.en"
    ```

### Interactive Translation

-   **Run the interactive script:**
    ```bash
    python interactive_translate.py
    ```

### API

-   **Run the API:**
    ```bash
    uvicorn fast_api:app --reload
    ```
    Open your browser and navigate to `http://127.0.0.1:8000` to use the web interface.

## Project Structure

```
saksi_translation/
β”œβ”€β”€ .gitignore
β”œβ”€β”€ fast_api.py             # FastAPI application
β”œβ”€β”€ interactive_translate.py  # Interactive translation script
β”œβ”€β”€ README.md               # Project documentation
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ test_translation.py     # Script for testing the translation model
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ index.html          # Frontend HTML
β”‚   β”œβ”€β”€ script.js           # Frontend JavaScript
β”‚   └── styles.css          # Frontend CSS
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ processed/          # Processed data for training
β”‚   β”œβ”€β”€ raw/                # Raw data downloaded from the web
β”‚   └── test_sets/          # Test sets for evaluation
β”œβ”€β”€ mlruns/                 # MLflow experiment tracking data
β”œβ”€β”€ models/
β”‚   └── nllb-finetuned-nepali-en/ # Fine-tuned model
β”œβ”€β”€ notebooks/              # Jupyter notebooks for experimentation
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ clean_text_data.py
β”‚   β”œβ”€β”€ create_test_set.py
β”‚   β”œβ”€β”€ download_model.py
β”‚   β”œβ”€β”€ fetch_parallel_data.py
β”‚   └── scrape_bbc_nepali.py
└── src/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ evaluation.py       # Script for evaluating the model
    β”œβ”€β”€ train.py            # Script for training the model
    └── translate.py        # Script for translating text
```

## Future Improvements

-   **Support for more languages:** The project can be extended to support more languages by adding more parallel data and fine-tuning the model on it.
-   **Improved Model:** The model can be improved by using a larger version of the NLLB model or by fine-tuning it on a larger and cleaner dataset.
-   **Advanced Frontend:** The frontend can be improved by adding features like translation history, user accounts, and more advanced styling.
-   **Containerization:** The application can be containerized using Docker for easier deployment and scaling.