Ali0044 committed on
Commit d93f298 · verified · 1 Parent(s): 5dd15f6

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ Translation_model.keras filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,67 @@
+
+ ---
+ base_model: "None"
+ language:
+ - en
+ - ar
+ license: mit
+ tags:
+ - translation
+ - seq2seq
+ metrics:
+ - accuracy
+ ---
+ # LinguaFlow: English-Arabic Neural Machine Translation Model
+
+ ## Model Description
+
+ This is a sequence-to-sequence (Seq2Seq) model designed for translating English text to Arabic text. It employs an Encoder-Decoder architecture built with Long Short-Term Memory (LSTM) layers, a popular choice for handling sequential data like natural language.
+
+ ## Model Details
+
+ * **Architecture**: Encoder-Decoder with LSTM layers.
+ * **Encoder**: Processes the input English sequence.
+ * **Decoder**: Generates the output Arabic sequence based on the encoder's context.
+ * **Input Language**: English (en)
+ * **Output Language**: Arabic (ar)
+ * **Input Sequence Length**: Maximum of `20` words.
+ * **Output Sequence Length**: Maximum of `20` words.
+ * **Vocabulary Size (English)**: `6409` unique words.
+ * **Vocabulary Size (Arabic)**: `9642` unique words.
+
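The details above can be sketched as a minimal Keras encoder-decoder. Only the vocabulary sizes and sequence lengths come from this card; the embedding dimension, LSTM width, and the `RepeatVector`-style decoder are illustrative assumptions, not the repository's actual hyperparameters.

```python
import numpy as np
from tensorflow.keras import layers, models

ENG_VOCAB, AR_VOCAB = 6409, 9642  # vocabulary sizes from this card
ENG_LEN, AR_LEN = 20, 20          # max sequence lengths from this card

# Simple seq2seq sketch: the encoder LSTM compresses the English sequence
# into a context vector, which is repeated once per output timestep and
# fed to the decoder LSTM. Dimensions (128, 256) are placeholders.
model = models.Sequential([
    layers.Input(shape=(ENG_LEN,)),
    layers.Embedding(ENG_VOCAB, 128),          # assumed embedding dim
    layers.LSTM(256),                          # encoder -> context vector
    layers.RepeatVector(AR_LEN),               # context at every decoder step
    layers.LSTM(256, return_sequences=True),   # decoder
    layers.TimeDistributed(layers.Dense(AR_VOCAB, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The output has shape `(batch, 20, 9642)`: one softmax distribution over the Arabic vocabulary per output position.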
+ ## Training Data
+
+ The model was trained on a subset of the `salehalmansour/english-to-arabic-translate` dataset, which contains English-Arabic sentence pairs. The training involved cleaning and tokenizing the text, and then encoding the sequences into numerical representations suitable for the neural network.
+
+ ## Evaluation Metrics
+
+ During training, the model's performance was monitored using `accuracy` and `sparse_categorical_crossentropy` loss.
+
+ * **Training Accuracy**: 0.8599
+ * **Validation Accuracy**: 0.8574
+ * **Training Loss**: 0.9594
+ * **Validation Loss**: 1.1926
+
+ These metrics indicate the model's ability to correctly predict Arabic words given an English input, and how well it generalizes to unseen data.
+
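For reference, `sparse_categorical_crossentropy` is the mean negative log-probability that the model assigns to the correct word index at each position. A small NumPy-only sketch with toy numbers (not the model's actual outputs):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability of the true class at each position.

    y_true: integer class indices, shape (n,)
    y_pred: softmax probabilities, shape (n, num_classes)
    """
    probs = np.clip(y_pred[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.log(probs).mean())

# Toy example: 3 output positions, 4-word vocabulary.
y_true = np.array([1, 0, 3])
y_pred = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.20, 0.20, 0.10, 0.50],
])
loss = sparse_categorical_crossentropy(y_true, y_pred)
# -(ln 0.7 + ln 0.8 + ln 0.5) / 3 ≈ 0.4243
```

"Sparse" here means the targets are integer indices rather than one-hot vectors, which keeps the 9642-word Arabic targets compact.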
+ ## Usage
+
+ To use this model for translation, you will need to:
+
+ 1. **Install the necessary libraries**: Ensure you have `tensorflow`, `numpy`, `pandas`, `scikit-learn`, and `huggingface_hub` installed.
+ 2. **Load the model and tokenizers**: Download `Translation_model.keras`, `eng_tokenizer.pkl`, and `ar_tokenizer.pkl` from this repository.
+ 3. **Prepare your input**: Clean and tokenize your English input text using the loaded `eng_tokenizer`, and then pad it to the `eng_length` (20).
+ 4. **Make a prediction**: Pass the encoded English sequence to the loaded Keras model's `predict` method.
+ 5. **Decode the output**: Use `np.argmax` on the model's output to get the predicted word indices, then convert these indices back to Arabic words using the `ar_tokenizer`.
+
+ For a detailed example of how to load and use this model, please refer to the Colab notebook or Python script where this model was developed. You will find functions like `encode_sequences` and `sequences_to_text`, which are crucial for preparing inputs and interpreting outputs.
+
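If the original notebook is unavailable, the two helpers can be approximated as below. This is a hedged sketch: it assumes a Keras-style tokenizer exposing a `word_index` dict (word → integer, with 0 reserved for padding), which is how tokenizers pickled like `eng_tokenizer.pkl` and `ar_tokenizer.pkl` are typically structured; the actual implementations may differ.

```python
import numpy as np

def encode_sequences(word_index, length, texts):
    """Steps 3: map texts to fixed-length integer sequences.

    Unknown words are dropped; sequences are truncated or zero-padded
    to `length` (20 for this model).
    """
    encoded = np.zeros((len(texts), length), dtype="int32")
    for i, text in enumerate(texts):
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        encoded[i, :length] = (ids[:length] + [0] * length)[:length]
    return encoded

def sequences_to_text(index_word, sequences):
    """Step 5: map integer sequences back to text, skipping padding (0)."""
    return [" ".join(index_word[i] for i in seq if i != 0) for seq in sequences]

# Toy two-word vocabulary standing in for the real tokenizers.
word_index = {"hello": 1, "world": 2}
index_word = {v: k for k, v in word_index.items()}

x = encode_sequences(word_index, 20, ["Hello world"])  # shape (1, 20)

# Step 5 applied to a dummy softmax output of shape (1, 20, vocab):
dummy_probs = np.zeros((1, 20, 3))
dummy_probs[0, 0, 2] = 1.0   # "world" most likely at position 0
dummy_probs[0, 1:, 0] = 1.0  # padding elsewhere
pred_ids = np.argmax(dummy_probs, axis=-1)
texts = sequences_to_text(index_word, pred_ids)  # ["world"]
```

With the real artifacts you would load the model via `tensorflow.keras.models.load_model("Translation_model.keras")`, unpickle both tokenizers, and pass `encode_sequences(...)` output to `model.predict`.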
+ ## Limitations
+
+ * **Domain Specificity**: The model's performance is highly dependent on the domain and style of the training data; it may not generalize well to text outside the dataset's scope.
+ * **Vocabulary Size**: The limited vocabulary can produce out-of-vocabulary (OOV) tokens, which degrade translation quality.
+ * **Sequence Length**: The fixed maximum input and output lengths (20 words each) can limit the translation of longer sentences.
+
+ ## Ethical Considerations
+
+ As with any language model, care should be taken when deploying this for real-world applications. Biases present in the training data may be reflected in the translations, so monitor its output and ensure fair and accurate use.
Translation_model.keras ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9cb7ab2e69f0a30cffc31b29835a71ae53a58be54d1521150c46f15b14a0aaf5
+ size 99444464
ar_tokenizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d235b938c9cdc5aaae745f2367ccdf54eca974d90e2eb812d57471906d0fbbe5
+ size 423394
eng_tokenizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a429ba138f89fe655f25a040b604106f385970930a93b92c269455ae42698c20
+ size 253609