---
base_model: "None"
language:
- en
- ar
license: mit
tags:
- translation
- seq2seq
metrics:
- accuracy
---
<div align="center">
<img src="banner.png" alt="LinguaFlow Banner" width="100%">
# LinguaFlow
### *Advanced English-to-Arabic Neural Machine Translation*
[License: MIT](https://opensource.org/licenses/MIT) · [Python](https://www.python.org/) · [TensorFlow](https://tensorflow.org/) · [Model on Hugging Face](https://huggingface.co/Ali0044/LinguaFlow)
</div>
---
## Overview
**LinguaFlow** is a robust Sequence-to-Sequence (Seq2Seq) neural machine translation model specialized in converting English text into Arabic. Leveraging a deep learning architecture based on **LSTM (Long Short-Term Memory)**, it captures complex linguistic relationships and contextual nuances to provide high-quality translations for short-to-medium length sentences.
### Key Features
- **LSTM-Based Architecture**: Encoder-decoder framework built on LSTM recurrent layers.
- **Domain Specificity**: Optimized for the `salehalmansour/english-to-arabic-translate` dataset.
- **Easy Integration**: Simple Python API for quick deployment.
- **Bilingual Vocabulary**: 6,400+ English and 9,600+ Arabic tokens.
---
## Technical Architecture
The model employs an **Encoder-Decoder** topology designed for sequence transduction tasks.
```mermaid
graph LR
A[English Input Sequence] --> B[Embedding Layer]
B --> C[LSTM Encoder]
C --> D[Context Vector]
D --> E[Repeat Vector]
E --> F[LSTM Decoder]
F --> G[Dense Layer / Softmax]
G --> H[Arabic Output Sequence]
```
### Configuration Highlights
| Component | Specification |
| :--- | :--- |
| **Model Type** | Seq2Seq LSTM |
| **Hidden Units** | 512 |
| **Embedding Size** | 512 |
| **Input Length** | 20 timesteps |
| **Output Length** | 20 timesteps |
| **Optimizer** | Adam |
| **Loss Function** | Sparse Categorical Crossentropy |
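The layer-by-layer implementation is not included in this card, but a minimal Keras sketch consistent with the diagram and configuration above might look like the following. The vocabulary sizes are the approximate figures quoted in Key Features, and `build_model` is a hypothetical helper, not the published training code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Approximate sizes from this card, not exact values.
EN_VOCAB, AR_VOCAB = 6400, 9600
SEQ_LEN, UNITS = 20, 512

def build_model():
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(EN_VOCAB, UNITS, mask_zero=True),  # Embedding layer
        layers.LSTM(UNITS),                                 # LSTM encoder -> context vector
        layers.RepeatVector(SEQ_LEN),                       # Repeat context for each output step
        layers.LSTM(UNITS, return_sequences=True),          # LSTM decoder
        layers.Dense(AR_VOCAB, activation='softmax'),       # Dense layer / softmax over Arabic vocab
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

The `RepeatVector` step is what lets a fixed-size context vector drive a 20-timestep decoder without an attention mechanism.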
---
## Performance Benchmark
LinguaFlow generalizes well after extensive training: training and validation accuracy differ by only ~0.25 points.
| Metric | Training | Validation |
| :--- | :--- | :--- |
| **Accuracy** | 85.99% | 85.74% |
| **Loss** | 0.9594 | 1.1926 |
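Note that these accuracy figures are per-token: each of the 20 output timesteps is scored independently, and padding positions are typically counted too unless masked, which inflates the number relative to whole-sentence quality. A small illustration (`token_accuracy` is an illustrative helper, not part of the model):

```python
import numpy as np

def token_accuracy(y_true, y_pred):
    # Fraction of timesteps where the predicted token id matches
    # the reference. Padding positions (id 0) count toward the
    # metric here, just as they would without an explicit mask.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

ref = [[5, 9, 2, 0, 0]]
hyp = [[5, 7, 2, 0, 0]]
print(token_accuracy(ref, hyp))  # 0.8 — one wrong word, but the padding matches
```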
---
## Getting Started
### Prerequisites
```bash
pip install tensorflow numpy pandas scikit-learn huggingface_hub
```
### Usage Example
```python
from huggingface_hub import snapshot_download
import tensorflow as tf
import numpy as np
import os
import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences
# 1. Download the model and tokenizers from the Hub
repo_id = "Ali0044/LinguaFlow"
local_dir = snapshot_download(repo_id=repo_id)

# 2. Load the model and the fitted tokenizers
model = tf.keras.models.load_model(os.path.join(local_dir, "Translation_model.keras"))
with open(os.path.join(local_dir, "eng_tokenizer.pkl"), "rb") as f:
    eng_tokenizer = pickle.load(f)
with open(os.path.join(local_dir, "ar_tokenizer.pkl"), "rb") as f:
    ar_tokenizer = pickle.load(f)

# 3. Translation function
def translate(sentences):
    # Tokenize the English input
    seq = eng_tokenizer.texts_to_sequences(sentences)
    # Pad (or truncate) to the model's fixed input length of 20 timesteps
    padded = pad_sequences(seq, maxlen=20, padding='post')
    # Greedy decoding: take the argmax token at each timestep
    preds = np.argmax(model.predict(padded), axis=-1)
    results = []
    for s in preds:
        # Map token ids back to Arabic words, skipping padding (id 0)
        # and any id missing from the tokenizer's vocabulary
        words = [ar_tokenizer.index_word.get(i, '') for i in s if i != 0]
        results.append(' '.join(w for w in words if w))
    return results

# 4. Try it out!
print(translate(["Hello, how are you?"]))
```
---
## Limitations & Ethical Notes
- **Maximum Length**: The model operates on fixed 20-token sequences; longer inputs are truncated, so best results are achieved with sentences of up to 20 words.
- **Domain Bias**: Accuracy may drop on specialized technical or medical jargon not present in the training set.
- **Dataset Bias**: As with all language models, biases in the open-source training data may occasionally surface in translations.
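The 20-token cap deserves emphasis: with Keras defaults, `pad_sequences(..., padding='post')` still truncates over-long sequences from the *front* (`truncating='pre'`), so the beginning of a long sentence is silently dropped before translation. A minimal stand-in illustrating that behavior (`pad_post` is a sketch for clarity, not part of the library):

```python
def pad_post(seq, maxlen=20, value=0):
    # Mimics pad_sequences(..., padding='post') for one sequence:
    # pads with zeros at the end, but truncates from the FRONT
    # when the input exceeds maxlen (the Keras default).
    if len(seq) > maxlen:
        seq = seq[-maxlen:]
    return seq + [value] * (maxlen - len(seq))

print(pad_post([3, 1, 4], maxlen=5))           # [3, 1, 4, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6] — token 1 is lost
```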
---
## Roadmap
- [ ] Implement Attention Mechanism (Bahdanau/Luong).
- [ ] Upgrade to Transformer architecture (Base/Large).
- [ ] Expand sequence length support to 50+ tokens.
- [ ] Continuous training on larger Arabic datasets (e.g., OPUS).
---
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## License
This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
---
<div align="center">
Developed by <a href="https://github.com/Ali0044">Ali Khalidalikhalid</a>
</div> |