---
base_model: "None"
language:
- en
- ar
metrics:
- accuracy
---
<div align="center">

<img src="banner.png" alt="LinguaFlow Banner" width="100%">

# LinguaFlow

### *Advanced English-to-Arabic Neural Machine Translation*

[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/)
[TensorFlow](https://tensorflow.org/)
[Hugging Face Model](https://huggingface.co/Ali0044/LinguaFlow)

</div>

---
## Overview

**LinguaFlow** is a sequence-to-sequence (Seq2Seq) neural machine translation model that converts English text into Arabic. Built on an **LSTM (Long Short-Term Memory)** encoder-decoder architecture, it captures contextual relationships between words to produce high-quality translations of short-to-medium-length sentences.

### Key Features

- **LSTM-Based Architecture**: An efficient encoder-decoder framework.
- **Domain Specificity**: Trained on the `salehalmansour/english-to-arabic-translate` dataset of English-Arabic sentence pairs.
- **Easy Integration**: A simple Python API for quick deployment.
- **Bilingual Vocabulary**: 6,409 unique English words and 9,642 unique Arabic words.

---
## Technical Architecture

The model uses an **Encoder-Decoder** topology designed for sequence transduction: the encoder reads the English sequence into a fixed context vector, which the decoder expands into the Arabic output sequence.

```mermaid
graph LR
    A[English Input Sequence] --> B[Embedding Layer]
    B --> C[LSTM Encoder]
    C --> D[Context Vector]
    D --> E[Repeat Vector]
    E --> F[LSTM Decoder]
    F --> G[Dense Layer / Softmax]
    G --> H[Arabic Output Sequence]
```
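The diagram above maps onto a compact Keras stack. The following is an illustrative sketch only, with layer sizes taken from this model card; the repository's actual `Translation_model.keras` may differ in detail:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

# Sketch of the documented topology: embed -> encode -> repeat -> decode -> softmax
model = Sequential([
    Embedding(6409, 512),                                # English vocabulary embedding
    LSTM(512),                                           # encoder: final state as context
    RepeatVector(20),                                    # feed context at every output step
    LSTM(512, return_sequences=True),                    # decoder
    TimeDistributed(Dense(9642, activation="softmax")),  # per-step Arabic word distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# One dummy batch confirms the shapes: (batch, 20) -> (batch, 20, 9642)
out = model(np.zeros((1, 20), dtype="int32"))
```

The `RepeatVector` step is what lets a single context vector drive all 20 decoder timesteps.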
### Configuration Highlights

| Component | Specification |
| :--- | :--- |
| **Model Type** | Seq2Seq LSTM |
| **Hidden Units** | 512 |
| **Embedding Size** | 512 |
| **English Vocabulary** | 6,409 words |
| **Arabic Vocabulary** | 9,642 words |
| **Input Length** | 20 timesteps |
| **Output Length** | 20 timesteps |
| **Optimizer** | Adam |
| **Loss Function** | Sparse Categorical Crossentropy |

---
## Performance Benchmark

LinguaFlow generalizes well: training and validation accuracy are nearly identical.

| Metric | Training | Validation |
| :--- | :--- | :--- |
| **Accuracy** | 85.99% | 85.74% |
| **Loss** | 0.9594 | 1.1926 |

---
## Getting Started

### Prerequisites

```bash
pip install tensorflow numpy pandas scikit-learn huggingface_hub
```
### Usage Example

```python
from huggingface_hub import snapshot_download
import tensorflow as tf
import numpy as np
import os
import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Download model and tokenizers
repo_id = "Ali0044/LinguaFlow"
local_dir = snapshot_download(repo_id=repo_id)

# 2. Load resources
model = tf.keras.models.load_model(os.path.join(local_dir, "Translation_model.keras"))

with open(os.path.join(local_dir, "eng_tokenizer.pkl"), "rb") as f:
    eng_tokenizer = pickle.load(f)

with open(os.path.join(local_dir, "ar_tokenizer.pkl"), "rb") as f:
    ar_tokenizer = pickle.load(f)

# 3. Translation function
def translate(sentences):
    # Tokenize the English input
    seq = eng_tokenizer.texts_to_sequences(sentences)
    # Pad to the model's fixed input length (20 timesteps)
    padded = pad_sequences(seq, maxlen=20, padding='post')
    # Predict and take the most likely word index at each timestep
    preds = model.predict(padded)
    preds = np.argmax(preds, axis=-1)
    # Map indices back to Arabic words, skipping padding (index 0)
    results = []
    for pred in preds:
        text = [ar_tokenizer.index_word.get(int(i), '') for i in pred if i != 0]
        results.append(' '.join(text))
    return results

# 4. Try it out!
print(translate(["Hello, how are you?"]))
```
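The decoding step at the end of `translate` can be exercised on its own. Here is a toy example with a hypothetical `index_word` mapping standing in for the real one inside `ar_tokenizer`:

```python
import numpy as np

# Hypothetical mini-vocabulary; the real mapping is ar_tokenizer.index_word
index_word = {1: "مرحبا", 2: "كيف", 3: "حالك"}

preds = np.array([[1, 2, 3, 0, 0]])  # argmax output; index 0 is padding

results = []
for pred in preds:
    # Look up each non-padding index; unknown indices become empty strings
    words = [index_word.get(int(i), "") for i in pred if i != 0]
    results.append(" ".join(words))

print(results)  # ['مرحبا كيف حالك']
```

Skipping index 0 matters: the model pads every output to 20 timesteps, and without the filter each translation would trail off into padding tokens.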

---

## Limitations & Ethical Notes

- **Maximum Length**: Input and output sequences are capped at 20 words, which limits translation of longer sentences.
- **Vocabulary Coverage**: The limited vocabulary can produce out-of-vocabulary (OOV) tokens, which degrade translation quality.
- **Domain Bias**: Accuracy may vary on specialized technical or medical jargon absent from the training set.
- **Dataset Bias**: As with all language models, biases present in the open-source training data may occasionally be reflected in translations.
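The 20-word cap can be made concrete. This is a simplified pure-Python sketch of fixed-length encoding; note that Keras' `pad_sequences` truncates from the front by default (`truncating='pre'`), whereas this version drops the tail:

```python
def encode_fixed_length(indices, maxlen=20, pad=0):
    """Pad after the sequence with `pad`, dropping anything past maxlen."""
    return (indices + [pad] * maxlen)[:maxlen]

short = encode_fixed_length([4, 8, 15])         # padded out to 20 entries
long = encode_fixed_length(list(range(1, 31)))  # a 30-word sentence

print(len(short), len(long))  # 20 20
print(long)                   # words 21-30 are gone
```

Either way, anything beyond the 20th position never reaches the model, which is why longer inputs lose content.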
---

## Roadmap

- [ ] Implement an attention mechanism (Bahdanau/Luong).
- [ ] Upgrade to a Transformer architecture (Base/Large).
- [ ] Expand sequence length support to 50+ tokens.
- [ ] Continue training on larger Arabic datasets (e.g., OPUS).

---
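For the attention roadmap item, the core computation is small. A minimal NumPy sketch of Luong-style dot-product attention over encoder states (illustrative only, with arbitrary random values; not part of the current model):

```python
import numpy as np

rng = np.random.default_rng(0)
enc_states = rng.standard_normal((20, 512))  # one encoder output per input word
dec_state = rng.standard_normal(512)         # current decoder hidden state

scores = enc_states @ dec_state              # (20,) alignment scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax -> attention weights
context = weights @ enc_states               # (512,) weighted context vector
```

Replacing the single repeated context vector with this per-step weighted context is what lets attention models handle longer sentences.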
## Contributing

Contributions are welcome! Please feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.

## License

This project is licensed under the **MIT License**; see the [LICENSE](LICENSE) file for details.

---

<div align="center">
Developed by <a href="https://github.com/Ali0044">Ali Khalidalikhalid</a>
</div>