Ali0044 committed
Commit 4a59d3c · verified · 1 Parent(s): 47aa0b5

Update README.md

Files changed (1)
  1. README.md +102 -51
README.md CHANGED
@@ -1,4 +1,3 @@
-
 ---
 base_model: "None"
 language:
@@ -11,84 +10,108 @@ tags:
 metrics:
 - accuracy
 ---
- # LinguaFlow: English-Arabic Neural Machine Translation Model
-
- ## Model Description
-
- This is a sequence-to-sequence (Seq2Seq) model designed for translating English text to Arabic text. It employs an Encoder-Decoder architecture built with Long Short-Term Memory (LSTM) layers, a popular choice for handling sequential data like natural language.

- ## Model Details

- * **Architecture**: Encoder-Decoder with LSTM layers.
- * **Encoder**: Processes the input English sequence.
- * **Decoder**: Generates the output Arabic sequence based on the encoder's context.
- * **Input Language**: English (en)
- * **Output Language**: Arabic (ar)
- * **Input Sequence Length**: Maximum of `20` words.
- * **Output Sequence Length**: Maximum of `20` words.
- * **Vocabulary Size (English)**: `6409` unique words.
- * **Vocabulary Size (Arabic)**: `9642` unique words.
-
- ## Training Data
-
- The model was trained on a subset of the `salehalmansour/english-to-arabic-translate` dataset, which contains English-Arabic sentence pairs. The training involved cleaning and tokenizing the text, and then encoding the sequences into numerical representations suitable for the neural network.

- ## Evaluation Metrics

- During training, the model's performance was monitored using `accuracy` and `sparse_categorical_crossentropy` loss.

- * **Training Accuracy**: 0.8599
- * **Validation Accuracy**: 0.8574
- * **Training Loss**: 0.9594
- * **Validation Loss**: 1.1926

- These metrics indicate the model's ability to correctly predict Arabic words given an English input, and how well it generalizes to unseen data.

- ## Usage

- To use this model for translation, you will need to:
-
- 1. **Install the necessary libraries**: Ensure you have `tensorflow`, `numpy`, `pandas`, `scikit-learn`, and `huggingface_hub` installed.
- 2. **Load the model and tokenizers**: Download `Translation_model.keras`, `eng_tokenizer.pkl`, and `ar_tokenizer.pkl` from this repository.
- 3. **Prepare your input**: Clean and tokenize your English input text using the loaded `eng_tokenizer`, and then pad it to the `eng_length` (20).
- 4. **Make a prediction**: Pass the encoded English sequence to the loaded Keras model's `predict` method.
- 5. **Decode the output**: Use `np.argmax` on the model's output to get the predicted word indices, then convert these indices back to Arabic words using the `ar_tokenizer`.
-
- For a detailed example of how to load and use this model, please refer to the Colab notebook or Python script where this model was developed. You will find functions like `encode_sequences` and `sequences_to_text`, which are crucial for preparing inputs and interpreting outputs.

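The helper functions the steps refer to (`encode_sequences`, `sequences_to_text`) are not shown on this card; the sketch below is a plausible reconstruction of what they do, with toy vocabularies standing in for the real tokenizers. It is an illustration of steps 3 and 5, not the original implementation.

```python
import numpy as np

# Toy vocabularies standing in for the real pickled tokenizers.
eng_index = {"hello": 1, "world": 2}   # word -> index (English)
ar_words = {1: "مرحبا", 2: "بالعالم"}  # index -> word (Arabic)

def encode_sequences(word_index, max_len, texts):
    """Step 3: map each word to its index, then pad with 0 to max_len."""
    rows = []
    for text in texts:
        ids = [word_index.get(w, 0) for w in text.lower().split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return np.array(rows)

def sequences_to_text(index_word, sequences):
    """Step 5: turn predicted index sequences back into words, skipping padding."""
    return [" ".join(index_word[i] for i in seq if i in index_word)
            for seq in sequences]

encoded = encode_sequences(eng_index, 20, ["Hello world"])
print(encoded.shape)                                 # (1, 20)
print(sequences_to_text(ar_words, [[1, 2, 0, 0]]))   # ['مرحبا بالعالم']
```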
- ## Limitations

- * **Domain Specificity**: The model's performance is highly dependent on the domain and style of the training data. It might not generalize well to texts outside of the dataset's scope.
- * **Vocabulary Size**: Limited vocabulary might result in out-of-vocabulary (OOV) tokens, which can impact translation quality.
- * **Sequence Length**: The fixed maximum sequence lengths for input and output can limit the translation of very long sentences.

- ## Ethical Considerations

- As with any language model, care should be taken when deploying this for real-world applications. Potential biases present in the training data could be reflected in the translations. It's important to monitor its output and ensure fair and accurate use.
- ## 🚀 How to use

 ```python
 from huggingface_hub import snapshot_download
 import tensorflow as tf
 import numpy as np
 import os
- from tensorflow.keras.preprocessing.text import tokenizer_from_json
 from tensorflow.keras.preprocessing.sequence import pad_sequences

 repo_id = "Ali0044/LinguaFlow"
 local_dir = snapshot_download(repo_id=repo_id)

- model = tf.keras.models.load_model(os.path.join(local_dir, "Translation_model_for_hf.keras"))

- with open(os.path.join(local_dir, "tokenizer/eng_tokenizer.json"), "r", encoding="utf-8") as f:
-     eng_tokenizer = tokenizer_from_json(f.read())

- with open(os.path.join(local_dir, "tokenizer/ar_tokenizer.json"), "r", encoding="utf-8") as f:
-     ar_tokenizer = tokenizer_from_json(f.read())

 def translate(sentences):
     seq = eng_tokenizer.texts_to_sequences(sentences)
-     padded = pad_sequences(seq, maxlen=model.input_shape[1], padding='post')
     preds = model.predict(padded)
     preds = np.argmax(preds, axis=-1)
@@ -98,6 +121,34 @@ def translate(sentences):
         results.append(' '.join(text))
     return results

- # Example
 print(translate(["Hello, how are you?"]))
- """
+ <div align="center">
+ <img src="banner.png" alt="LinguaFlow Banner" width="100%">
+
+ # 🌊 LinguaFlow
+ ### *Advanced English-to-Arabic Neural Machine Translation*
+
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ [![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/)
+ [![TensorFlow](https://img.shields.io/badge/TensorFlow-2.0+-orange.svg)](https://tensorflow.org/)
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LinguaFlow-FFD21E)](https://huggingface.co/Ali0044/LinguaFlow)
+ </div>

+ ---

+ ## 📖 Overview

+ **LinguaFlow** is a sequence-to-sequence (Seq2Seq) neural machine translation model that converts English text into Arabic. Built on an **LSTM (Long Short-Term Memory)** encoder-decoder, it models the context of an input sentence to produce translations of short-to-medium-length sentences.

+ ### ✨ Key Features
+ - 🚀 **LSTM-Based Architecture**: Efficient encoder-decoder framework.
+ - 🎯 **Domain Specificity**: Optimized for the `salehalmansour/english-to-arabic-translate` dataset.
+ - 🛠️ **Easy Integration**: Simple Python API for quick deployment.
+ - 🌍 **Bilingual Support**: Full English-to-Arabic vocabulary coverage (En: 6,400+ | Ar: 9,600+).

+ ---

+ ## 🏗️ Technical Architecture
+
+ The model employs an **Encoder-Decoder** topology designed for sequence transduction tasks.
+
+ ```mermaid
+ graph LR
+ A[English Input Sequence] --> B[Embedding Layer]
+ B --> C[LSTM Encoder]
+ C --> D[Context Vector]
+ D --> E[Repeat Vector]
+ E --> F[LSTM Decoder]
+ F --> G[Dense Layer / Softmax]
+ G --> H[Arabic Output Sequence]
+ ```
+
+ ### Configuration Highlights
+ | Component | Specification |
+ | :--- | :--- |
+ | **Model Type** | Seq2Seq LSTM |
+ | **Hidden Units** | 512 |
+ | **Embedding Size** | 512 |
+ | **Input Length** | 20 timesteps |
+ | **Output Length** | 20 timesteps |
+ | **Optimizer** | Adam |
+ | **Loss Function** | Sparse Categorical Crossentropy |

+ ---

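As a rough sanity check on the table, the hyperparameters imply a model size in the low tens of millions of parameters. The arithmetic below assumes the Embedding → LSTM encoder → RepeatVector → LSTM decoder → Dense(softmax) layout from the diagram and the vocabulary sizes listed in the old Model Details section; it is back-of-the-envelope estimation, not an inspection of the actual checkpoint.

```python
# Back-of-the-envelope parameter count for the configuration above.
# Assumed layout: Embedding -> LSTM encoder -> RepeatVector
# -> LSTM decoder -> TimeDistributed Dense with softmax.

EMB = 512     # embedding size (table)
HID = 512     # LSTM hidden units (table)
V_EN = 6409   # English vocabulary size
V_AR = 9642   # Arabic vocabulary size

def lstm_params(input_dim, units):
    # An LSTM has 4 gates, each with input weights, recurrent
    # weights, and a bias vector.
    return 4 * (units * input_dim + units * units + units)

embedding = V_EN * EMB              # 3,281,408
encoder = lstm_params(EMB, HID)     # 2,099,200
decoder = lstm_params(HID, HID)     # 2,099,200
output = HID * V_AR + V_AR          # 4,946,346

total = embedding + encoder + decoder + output
print(f"~{total / 1e6:.1f}M parameters")  # ~12.4M parameters
```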
+ ## 📊 Performance Benchmark

+ Training and validation accuracy track closely, indicating reasonable generalization within the model's domain. Note that accuracy here is word-level prediction accuracy, not a translation-quality score such as BLEU.

+ | Metric | Training | Validation |
+ | :--- | :--- | :--- |
+ | **Accuracy** | 85.99% | 85.74% |
+ | **Loss** | 0.9594 | 1.1926 |

+ ---

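One way to read the loss column: sparse categorical cross-entropy is an average negative log-likelihood in nats, so exponentiating it gives a per-word perplexity, i.e. the effective number of words the model is choosing between at each position.

```python
import math

# Validation cross-entropy from the benchmark table above (in nats).
val_loss = 1.1926

# Perplexity = exp(cross-entropy): the model behaves as if choosing
# uniformly among roughly this many candidate words per position.
perplexity = math.exp(val_loss)
print(f"validation perplexity ≈ {perplexity:.2f}")  # ≈ 3.30
```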
+ ## 🚀 Getting Started

+ ### Prerequisites
+ ```bash
+ pip install tensorflow numpy pandas scikit-learn huggingface_hub
+ ```

+ ### Usage Example
 ```python
 from huggingface_hub import snapshot_download
 import tensorflow as tf
 import numpy as np
 import os
+ import pickle
 from tensorflow.keras.preprocessing.sequence import pad_sequences

+ # 1. Download model and tokenizers
 repo_id = "Ali0044/LinguaFlow"
 local_dir = snapshot_download(repo_id=repo_id)

+ # 2. Load resources
+ model = tf.keras.models.load_model(os.path.join(local_dir, "Translation_model.keras"))

+ with open(os.path.join(local_dir, "eng_tokenizer.pkl"), "rb") as f:
+     eng_tokenizer = pickle.load(f)

+ with open(os.path.join(local_dir, "ar_tokenizer.pkl"), "rb") as f:
+     ar_tokenizer = pickle.load(f)

+ # 3. Translation Function
 def translate(sentences):
+     # Clean and tokenize
     seq = eng_tokenizer.texts_to_sequences(sentences)
+     # Pad sequences to the model's fixed input length
+     padded = pad_sequences(seq, maxlen=20, padding='post')
+     # Predict
     preds = model.predict(padded)
     preds = np.argmax(preds, axis=-1)

         results.append(' '.join(text))
     return results

+ # 4. Try it out!
 print(translate(["Hello, how are you?"]))
+ ```
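The diff elides the loop that turns the predicted index matrix into Arabic text (between `np.argmax` and `results.append`). Below is a self-contained sketch of that decoding step, using a dummy probability tensor and an illustrative index-to-word mapping in place of real model output and the real tokenizer:

```python
import numpy as np

# Dummy model output: batch of 1 sentence, 4 timesteps, vocabulary of 5.
probs = np.zeros((1, 4, 5))
probs[0, 0, 3] = 1.0   # timestep 0 -> index 3
probs[0, 1, 1] = 1.0   # timestep 1 -> index 1
probs[0, 2, 0] = 1.0   # timesteps 2-3 -> index 0 (padding)
probs[0, 3, 0] = 1.0

preds = np.argmax(probs, axis=-1)      # shape (1, 4)

index_word = {1: "كيف", 3: "مرحبا"}    # illustrative index -> word map

results = []
for seq in preds:
    # Skip indices with no word entry (padding / OOV).
    text = [index_word[i] for i in seq if i in index_word]
    results.append(' '.join(text))

print(results)   # ['مرحبا كيف']
```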
+
+ ---
+
+ ## ⚠️ Limitations & Ethical Notes
+ - **Maximum Length**: Best results are achieved with sentences up to 20 words; longer inputs are truncated during padding.
+ - **Domain Bias**: Accuracy may drop on specialized technical or medical jargon not present in the training set.
+ - **Bias**: As with all language models, biases present in the open-source training dataset may be reflected in translations.
+
+ ---
+
+ ## 🗺️ Roadmap
+ - [ ] Implement an attention mechanism (Bahdanau/Luong).
+ - [ ] Upgrade to a Transformer architecture (Base/Large).
+ - [ ] Expand sequence-length support to 50+ tokens.
+ - [ ] Continue training on larger Arabic datasets (e.g., OPUS).
+
+ ---
+
+ ## 🤝 Contributing
+ Contributions are welcome! Please feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.
+
+ ## 📄 License
+ This project is licensed under the **MIT License**; see the [LICENSE](LICENSE) file for details.
+
+ ---
+ <div align="center">
+ Developed by <a href="https://github.com/Ali0044">Ali Khalidalikhalid</a>
+ </div>