Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +67 -0
- Translation_model.keras +3 -0
- ar_tokenizer.pkl +3 -0
- eng_tokenizer.pkl +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+Translation_model.keras filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,67 @@
---
base_model: "None"
language:
- en
- ar
license: mit
tags:
- translation
- seq2seq
metrics:
- accuracy
---

# LinguaFlow: English-Arabic Neural Machine Translation Model

## Model Description

This is a sequence-to-sequence (Seq2Seq) model designed for translating English text to Arabic text. It employs an Encoder-Decoder architecture built with Long Short-Term Memory (LSTM) layers, a popular choice for handling sequential data like natural language.

## Model Details

* **Architecture**: Encoder-Decoder with LSTM layers.
  * **Encoder**: Processes the input English sequence.
  * **Decoder**: Generates the output Arabic sequence based on the encoder's context.
* **Input Language**: English (en)
* **Output Language**: Arabic (ar)
* **Input Sequence Length**: Maximum of `20` words.
* **Output Sequence Length**: Maximum of `20` words.
* **Vocabulary Size (English)**: `6409` unique words.
* **Vocabulary Size (Arabic)**: `9642` unique words.

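An architecture with these dimensions could be sketched in Keras as follows. This is a minimal illustration only: the embedding size (`256`) and LSTM units (`512`) are assumptions, not the model's actual hyperparameters, and the real model may differ in layer arrangement.

```python
# Illustrative LSTM encoder-decoder with the dimensions listed above.
# Embedding size and LSTM units are assumed values, not the real ones.
from tensorflow.keras import layers, models

eng_vocab, ar_vocab, seq_len = 6409, 9642, 20

model = models.Sequential([
    layers.Input(shape=(seq_len,)),                # padded English token ids
    layers.Embedding(eng_vocab, 256),              # embed English tokens
    layers.LSTM(512),                              # encoder: summarize the input
    layers.RepeatVector(seq_len),                  # feed context to every decoder step
    layers.LSTM(512, return_sequences=True),       # decoder: one state per output step
    layers.TimeDistributed(layers.Dense(ar_vocab, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The final softmax layer emits a probability distribution over the Arabic vocabulary at each of the 20 output positions, which matches the per-timestep `np.argmax` decoding described under Usage below.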
## Training Data

The model was trained on a subset of the `salehalmansour/english-to-arabic-translate` dataset, which contains English-Arabic sentence pairs. The training involved cleaning and tokenizing the text, and then encoding the sequences into numerical representations suitable for the neural network.

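The tokenize-and-encode step might look like the sketch below, assuming the Keras `Tokenizer` was used (the repository's actual preprocessing may differ); `encode_sequences` mirrors the helper mentioned under Usage, and the sample sentences are stand-ins.

```python
# Sketch of encoding text into fixed-length integer sequences.
# Assumes Keras' Tokenizer; sample sentences are illustrative stand-ins.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_sequences(tokenizer, length, texts):
    """Turn raw sentences into fixed-length integer sequences."""
    seqs = tokenizer.texts_to_sequences(texts)
    return pad_sequences(seqs, maxlen=length, padding="post")

eng_sentences = ["hello world", "how are you"]  # stand-in training data
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(eng_sentences)
X = encode_sequences(eng_tokenizer, 20, eng_sentences)
print(X.shape)  # one row per sentence, padded to length 20
```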
## Evaluation Metrics

During training, the model's performance was monitored using `accuracy` and `sparse_categorical_crossentropy` loss.

* **Training Accuracy**: 0.8599
* **Validation Accuracy**: 0.8574
* **Training Loss**: 0.9594
* **Validation Loss**: 1.1926

These metrics indicate how well the model predicts Arabic words given an English input and how well it generalizes to unseen data.

## Usage

To use this model for translation, you will need to:

1. **Install the necessary libraries**: Ensure you have `tensorflow`, `numpy`, `pandas`, `scikit-learn`, and `huggingface_hub` installed.
2. **Load the model and tokenizers**: Download `Translation_model.keras`, `eng_tokenizer.pkl`, and `ar_tokenizer.pkl` from this repository.
3. **Prepare your input**: Clean and tokenize your English input text with the loaded `eng_tokenizer`, then pad it to `eng_length` (20).
4. **Make a prediction**: Pass the encoded English sequence to the loaded Keras model's `predict` method.
5. **Decode the output**: Apply `np.argmax` to the model's output to get the predicted word indices, then convert those indices back to Arabic words using the `ar_tokenizer`.

For a detailed example of loading and using this model, refer to the Colab notebook or Python script where it was developed. There you will find functions like `encode_sequences` and `sequences_to_text`, which are crucial for preparing inputs and interpreting outputs.

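Steps 3-5 can be sketched as a reusable helper. The loading code in the comments assumes a placeholder repository id (`REPO_ID`), and this helper is a minimal stand-in for the notebook's own functions, not the exact implementation:

```python
# Sketch of steps 3-5: encode an English sentence, predict, decode Arabic.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Steps 1-2 (REPO_ID is a placeholder for this repository's id on the Hub):
#   import pickle, tensorflow as tf
#   from huggingface_hub import hf_hub_download
#   model = tf.keras.models.load_model(hf_hub_download(REPO_ID, "Translation_model.keras"))
#   eng_tokenizer = pickle.load(open(hf_hub_download(REPO_ID, "eng_tokenizer.pkl"), "rb"))
#   ar_tokenizer = pickle.load(open(hf_hub_download(REPO_ID, "ar_tokenizer.pkl"), "rb"))

def translate(model, eng_tokenizer, ar_tokenizer, text, eng_length=20):
    """Encode an English sentence, run the model, and decode the Arabic output."""
    # Step 3: tokenize and pad the input to the fixed encoder length.
    seq = pad_sequences(eng_tokenizer.texts_to_sequences([text]),
                        maxlen=eng_length, padding="post")
    # Step 4: model output has shape (1, out_length, ar_vocab) of probabilities.
    probs = model.predict(seq, verbose=0)
    # Step 5: argmax per timestep, then map indices back to words (0 = padding).
    idxs = np.argmax(probs[0], axis=-1)
    index_to_word = {i: w for w, i in ar_tokenizer.word_index.items()}
    return " ".join(index_to_word[i] for i in idxs if i in index_to_word)
```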
## Limitations

* **Domain Specificity**: Performance depends heavily on the domain and style of the training data; the model may not generalize well to texts outside the dataset's scope.
* **Vocabulary Size**: The limited vocabulary can produce out-of-vocabulary (OOV) tokens, which degrade translation quality.
* **Sequence Length**: The fixed maximum input and output lengths (20 words) limit the translation of longer sentences.

## Ethical Considerations

As with any language model, care should be taken when deploying this model in real-world applications. Biases present in the training data may be reflected in its translations; monitor the output and ensure fair and accurate use.
Translation_model.keras
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9cb7ab2e69f0a30cffc31b29835a71ae53a58be54d1521150c46f15b14a0aaf5
size 99444464
ar_tokenizer.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d235b938c9cdc5aaae745f2367ccdf54eca974d90e2eb812d57471906d0fbbe5
size 423394
eng_tokenizer.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a429ba138f89fe655f25a040b604106f385970930a93b92c269455ae42698c20
size 253609