Commit
·
43ed7e7
1
Parent(s):
43cba70
Update README.md
README.md
CHANGED
|
@@ -12,23 +12,23 @@ metrics:
|
|
| 12 |
- recall
|
| 13 |
library_name: transformers
|
| 14 |
---
|
| 15 |
-
#
|
| 16 |
-
**Fine_Tuned_HF_Language_Identification_Model:** Language Identification Model
|
| 17 |
|
| 18 |
<img src="https://miro.medium.com/v2/resize:fit:1400/1*G5AyGtaUAQBcVLikpxu6CQ.png" style="border-radius: 5%;">
|
| 19 |
|
| 20 |
-
##
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
##
|
| 27 |
-
- f1
|
| 28 |
-
-
|
| 29 |
-
-
|
| 30 |
-
-
|
| 31 |
-
##
|
|
| 32 |
|
| 33 |
## Overview
|
| 34 |
Language identification is a foundational task in Natural Language Processing (NLP). This project provides a language identification model fine-tuned from the XLM-RoBERTa architecture. It classifies text in five languages: English, French, German, Arabic, and Russian.
|
|
@@ -63,6 +63,13 @@ The model underwent a rigorous fine-tuning process using Hugging Face's Trainer
|
|
| 63 |
|
| 64 |
## Dataset Used
|
| 65 |
The model was trained on the corpus of © 2023 Universität Leipzig / Sächsische Akademie der Wissenschaften / InfAI.
|
| 66 |
|
| 67 |
|
| 68 |
## Technology Stack
|
|
@@ -132,6 +139,19 @@ f1 = eval_result["eval_f1"]
|
|
| 132 |
## Model Performance
|
| 133 |
Table of Model Performance
|
| 134 |
|
| 135 |
-
|
| 136 |
|
| 137 |
## Contributing
|
|
|
|
| 12 |
- recall
|
| 13 |
library_name: transformers
|
| 14 |
---
|
| 15 |
+
# Fine-Tuned Hugging Face Language Identification Model
|
| 16 |
|
| 17 |
<img src="https://miro.medium.com/v2/resize:fit:1400/1*G5AyGtaUAQBcVLikpxu6CQ.png" style="border-radius: 5%;">
|
| 18 |
|
| 19 |
+
## Languages Supported:
|
| 20 |
+
1. English (en)
|
| 21 |
+
2. French (fr)
|
| 22 |
+
3. German (de)
|
| 23 |
+
4. Russian (ru)
|
| 24 |
+
5. Arabic (ar)
|
| 25 |
+
## Metrics:
|
| 26 |
+
- F1 score
|
| 27 |
+
- Accuracy
|
| 28 |
+
- Precision
|
| 29 |
+
- Recall
|
| 30 |
+
## Library_name:
|
| 31 |
+
Transformers
|
| 32 |
|
| 33 |
## Overview
|
| 34 |
Language identification is a foundational task in Natural Language Processing (NLP). This project provides a language identification model fine-tuned from the XLM-RoBERTa architecture. It classifies text in five languages: English, French, German, Arabic, and Russian.
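
As a rough illustration of how the classifier might be called, the sketch below wraps any text-classification callable and maps its top label to a language name. It assumes the model's labels are the ISO 639-1 codes used in this README; `identify_language` and `stub_classifier` are hypothetical names, and the model id in the comment is a placeholder, not the actual Hub id.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: the model's labels are the ISO 639-1 codes from this README.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "ru": "Russian",
    "ar": "Arabic",
}

# Any callable returning [{"label": ..., "score": ...}], which is the shape
# a transformers text-classification pipeline returns for a single string.
Classifier = Callable[[str], List[Dict[str, object]]]


def identify_language(text: str, classifier: Classifier) -> Tuple[str, float]:
    """Return (language name, score) for the classifier's top prediction."""
    top = classifier(text)[0]
    label = str(top["label"])
    # Fall back to the raw label if it is not one of the five known codes.
    return LANGUAGE_NAMES.get(label, label), float(top["score"])


def stub_classifier(text: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in so the sketch runs without downloading a model."""
    label = "fr" if "bonjour" in text.lower() else "en"
    return [{"label": label, "score": 0.99}]


# With the real model, the stub would be swapped for something like:
#   from transformers import pipeline
#   classifier = pipeline("text-classification", model="<hub-id-of-this-model>")
print(identify_language("Bonjour tout le monde", stub_classifier))
```

Passing the classifier in as an argument keeps the label-mapping logic testable without network access.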
|
|
|
|
| 63 |
|
| 64 |
## Dataset Used
|
| 65 |
The model was trained on the corpus of © 2023 Universität Leipzig / Sächsische Akademie der Wissenschaften / InfAI.
|
+| Language | Size of Corpus (number of sentences) |
+| -------- | -------- |
+| **English** | 50002 |
+| **French** | 50002 |
+| **German** | 50002 |
+| **Russian** | 50002 |
+| **Arabic** | 36888 |
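
The corpus is not perfectly balanced: Arabic has fewer sentences than the other four languages. A minimal sketch of the per-language share, using only the counts from the table above (`CORPUS_SIZES` and `class_shares` are illustrative names, not part of the project code):

```python
# Sentence counts copied from the table above.
CORPUS_SIZES = {
    "English": 50002,
    "French": 50002,
    "German": 50002,
    "Russian": 50002,
    "Arabic": 36888,
}


def class_shares(sizes):
    """Return each language's fraction of the total sentence count."""
    total = sum(sizes.values())
    return {language: count / total for language, count in sizes.items()}


shares = class_shares(CORPUS_SIZES)
for language, share in shares.items():
    print(f"{language}: {share:.1%}")
```

Arabic makes up roughly 16% of the sentences, against about 21% for each of the other languages.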
|
| 73 |
|
| 74 |
|
| 75 |
## Technology Stack
|
|
|
|
| 139 |
## Model Performance
|
| 140 |
Table of Model Performance
|
| 141 |
|
+| Language | Precision | Recall | F1 Score | Accuracy |
+| -------- | -------- | -------- | -------- | -------- |
+| **English** | 1.0000 | 0.9994 | 0.9997 | 0.9994 |
+| **French** | 1.0000 | 0.9992 | 0.9996 | 0.9992 |
+| **German** | 1.0000 | 0.9998 | 0.9999 | 0.9998 |
+| **Arabic** | 1.0000 | 0.9997 | 0.9999 | 0.9997 |
+| **Russian** | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
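
The F1 column can be sanity-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below does this with the rounded values from the table above; it is an illustrative check, not the project's evaluation code, and `REPORTED` and `f1_score` are names chosen here.

```python
# (precision, recall, reported F1) per language, copied from the table above.
REPORTED = {
    "English": (1.0000, 0.9994, 0.9997),
    "French": (1.0000, 0.9992, 0.9996),
    "German": (1.0000, 0.9998, 0.9999),
    "Arabic": (1.0000, 0.9997, 0.9999),
    "Russian": (1.0000, 1.0000, 1.0000),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for language, (precision, recall, reported_f1) in REPORTED.items():
    computed = f1_score(precision, recall)
    print(f"{language}: computed {computed:.4f}, reported {reported_f1:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed value can differ from the reported one in the last digit; the Arabic row, for instance, comes out one unit lower in the fourth decimal.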
|
| 149 |
+
|
| 150 |
+
## Project Files Structure
|
| 151 |
+
The project's structure is organized as follows:
|
| 152 |
+
|
| 153 |
+
- `data/` : Contains datasets used for training and testing the model
|
| 154 |
+
- `src/` : Source code and Google Colab notebook
|
| 155 |
+
- `README.md` : This README file
|
| 156 |
|
| 157 |
## Contributing
|