Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +67 -0
- Translation_model.keras +3 -0
- ar_tokenizer.pkl +3 -0
- eng_tokenizer.pkl +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+Translation_model.keras filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,67 @@
---
base_model: "None"
language:
- en
- ar
license: mit
tags:
- translation
- seq2seq
metrics:
- accuracy
---

# LinguaFlow: English-Arabic Neural Machine Translation Model

## Model Description

This is a sequence-to-sequence (Seq2Seq) model designed for translating English text to Arabic text. It employs an Encoder-Decoder architecture built with Long Short-Term Memory (LSTM) layers, a popular choice for handling sequential data like natural language.

## Model Details

* **Architecture**: Encoder-Decoder with LSTM layers.
  * **Encoder**: Processes the input English sequence.
  * **Decoder**: Generates the output Arabic sequence based on the encoder's context.
* **Input Language**: English (en)
* **Output Language**: Arabic (ar)
* **Input Sequence Length**: Maximum of `20` words.
* **Output Sequence Length**: Maximum of `20` words.
* **Vocabulary Size (English)**: `6409` unique words.
* **Vocabulary Size (Arabic)**: `9642` unique words.

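An architecture with these dimensions could be sketched in Keras as follows. This is a minimal illustration only: the embedding size (`256`) and LSTM units (`512`) are assumptions, not the model's actual hyperparameters, and the real model may differ in layer arrangement.

```python
# Illustrative LSTM encoder-decoder with the dimensions listed above.
# Embedding size and LSTM units are assumed values, not the real ones.
from tensorflow.keras import layers, models

eng_vocab, ar_vocab, seq_len = 6409, 9642, 20

model = models.Sequential([
    layers.Input(shape=(seq_len,)),                # padded English token ids
    layers.Embedding(eng_vocab, 256),              # embed English tokens
    layers.LSTM(512),                              # encoder: summarize the input
    layers.RepeatVector(seq_len),                  # feed context to every decoder step
    layers.LSTM(512, return_sequences=True),       # decoder: one state per output step
    layers.TimeDistributed(layers.Dense(ar_vocab, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The final softmax layer emits a probability distribution over the Arabic vocabulary at each of the 20 output positions, which matches the per-timestep `np.argmax` decoding described under Usage below.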
## Training Data

The model was trained on a subset of the `salehalmansour/english-to-arabic-translate` dataset, which contains English-Arabic sentence pairs. The training involved cleaning and tokenizing the text, and then encoding the sequences into numerical representations suitable for the neural network.

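The tokenize-and-encode step might look like the sketch below, assuming the Keras `Tokenizer` was used (the repository's actual preprocessing may differ); `encode_sequences` mirrors the helper mentioned under Usage, and the sample sentences are stand-ins.

```python
# Sketch of encoding text into fixed-length integer sequences.
# Assumes Keras' Tokenizer; sample sentences are illustrative stand-ins.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_sequences(tokenizer, length, texts):
    """Turn raw sentences into fixed-length integer sequences."""
    seqs = tokenizer.texts_to_sequences(texts)
    return pad_sequences(seqs, maxlen=length, padding="post")

eng_sentences = ["hello world", "how are you"]  # stand-in training data
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(eng_sentences)
X = encode_sequences(eng_tokenizer, 20, eng_sentences)
print(X.shape)  # one row per sentence, padded to length 20
```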
## Evaluation Metrics

During training, the model's performance was monitored using `accuracy` and `sparse_categorical_crossentropy` loss.

* **Training Accuracy**: 0.8599
* **Validation Accuracy**: 0.8574
* **Training Loss**: 0.9594
* **Validation Loss**: 1.1926

These metrics indicate how well the model predicts Arabic words given an English input and how well it generalizes to unseen data.

## Usage

To use this model for translation, you will need to:

1. **Install the necessary libraries**: Ensure you have `tensorflow`, `numpy`, `pandas`, `scikit-learn`, and `huggingface_hub` installed.
2. **Load the model and tokenizers**: Download `Translation_model.keras`, `eng_tokenizer.pkl`, and `ar_tokenizer.pkl` from this repository.
3. **Prepare your input**: Clean and tokenize your English input text with the loaded `eng_tokenizer`, then pad it to `eng_length` (20).
4. **Make a prediction**: Pass the encoded English sequence to the loaded Keras model's `predict` method.
5. **Decode the output**: Apply `np.argmax` to the model's output to get the predicted word indices, then convert those indices back to Arabic words using the `ar_tokenizer`.

For a detailed example of loading and using this model, refer to the Colab notebook or Python script where it was developed. There you will find functions like `encode_sequences` and `sequences_to_text`, which are crucial for preparing inputs and interpreting outputs.

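Steps 3-5 can be sketched as a reusable helper. The loading code in the comments assumes a placeholder repository id (`REPO_ID`), and this helper is a minimal stand-in for the notebook's own functions, not the exact implementation:

```python
# Sketch of steps 3-5: encode an English sentence, predict, decode Arabic.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Steps 1-2 (REPO_ID is a placeholder for this repository's id on the Hub):
#   import pickle, tensorflow as tf
#   from huggingface_hub import hf_hub_download
#   model = tf.keras.models.load_model(hf_hub_download(REPO_ID, "Translation_model.keras"))
#   eng_tokenizer = pickle.load(open(hf_hub_download(REPO_ID, "eng_tokenizer.pkl"), "rb"))
#   ar_tokenizer = pickle.load(open(hf_hub_download(REPO_ID, "ar_tokenizer.pkl"), "rb"))

def translate(model, eng_tokenizer, ar_tokenizer, text, eng_length=20):
    """Encode an English sentence, run the model, and decode the Arabic output."""
    # Step 3: tokenize and pad the input to the fixed encoder length.
    seq = pad_sequences(eng_tokenizer.texts_to_sequences([text]),
                        maxlen=eng_length, padding="post")
    # Step 4: model output has shape (1, out_length, ar_vocab) of probabilities.
    probs = model.predict(seq, verbose=0)
    # Step 5: argmax per timestep, then map indices back to words (0 = padding).
    idxs = np.argmax(probs[0], axis=-1)
    index_to_word = {i: w for w, i in ar_tokenizer.word_index.items()}
    return " ".join(index_to_word[i] for i in idxs if i in index_to_word)
```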
## Limitations

* **Domain Specificity**: Performance depends heavily on the domain and style of the training data; the model may not generalize well to texts outside the dataset's scope.
* **Vocabulary Size**: The limited vocabulary can produce out-of-vocabulary (OOV) tokens, which degrade translation quality.
* **Sequence Length**: The fixed maximum input and output lengths (20 words) limit the translation of longer sentences.

## Ethical Considerations

As with any language model, care should be taken when deploying this model in real-world applications. Biases present in the training data may be reflected in its translations; monitor the output and ensure fair and accurate use.
Translation_model.keras
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9cb7ab2e69f0a30cffc31b29835a71ae53a58be54d1521150c46f15b14a0aaf5
size 99444464
ar_tokenizer.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d235b938c9cdc5aaae745f2367ccdf54eca974d90e2eb812d57471906d0fbbe5
size 423394
eng_tokenizer.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a429ba138f89fe655f25a040b604106f385970930a93b92c269455ae42698c20
size 253609