Commit · 5dcaa2c
Parent(s): 43ed7e7

Update README.md

README.md (CHANGED)
[…]

<img src="https://miro.medium.com/v2/resize:fit:1400/1*G5AyGtaUAQBcVLikpxu6CQ.png" style="border-radius: 5%;">

> ## Languages Supported:
> 1. English (en)
> 2. French (fr)
> 3. German (de)
> 4. Russian (ru)
> 5. Arabic (ar)

> ## Metrics:
> 1. F1-score
> 2. Accuracy
> 3. Precision
> 4. Recall

> ## Library_name:
> Transformers

## Overview
Language identification is a foundational task in Natural Language Processing (NLP). This project provides a language identification model fine-tuned from the multilingual XLM-RoBERTa architecture. It classifies text in five languages: English, French, German, Arabic, and Russian.

## Table of Contents

1. Model Details
2. Training
3. Corpus Used
4. Technology Stack
5. Model Performance
6. Usage
7. Project File Structure
8. Contributing

## 1. Model Details

1. **Model Architecture:** The model architecture is based on XLM-RoBERTa, a multilingual variant of RoBERTa. This architecture is renowned for its contextual embeddings and multilingual capabilities (see the sketch after this list).

[…]

4. **Evaluation Metric:** The primary evaluation metrics used are accuracy and F1-score. These metrics provide insight into the model's overall classification performance.
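As a concrete illustration of this architecture choice, here is a minimal, hypothetical sketch of instantiating XLM-RoBERTa with a five-way classification head via the standard Transformers API. The base checkpoint name and label order are assumptions, not taken from this card:
<pre>
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative label set for the five supported languages
labels = ["en", "fr", "de", "ru", "ar"]
id2label = {i: lang for i, lang in enumerate(labels)}
label2id = {lang: i for i, lang in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",          # assumed base checkpoint
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
</pre>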
## 2. Training
The model was fine-tuned using Hugging Face's Trainer class. This section covers the main elements of the training process (a sketch of the setup follows the list):

1. **Number of Epochs:** The model was trained for two epochs, balancing training time against performance.

[…]

5. **Logging Steps:** The logging steps are derived from the size of the training dataset, so the logging frequency adapts to dataset variations and yields more informative logs during training.
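The following is a minimal, hypothetical sketch of this setup. Only the Trainer class, the two epochs, and the dataset-derived logging steps come from this card; the batch size, dataset names, and output directory are placeholder assumptions:
<pre>
from transformers import Trainer, TrainingArguments

batch_size = 32                                # assumed, not stated in this card
logging_steps = len(tok_train) // batch_size   # dynamic: scales with dataset size

args = TrainingArguments(
    output_dir="xlm-roberta-langid",           # placeholder name
    num_train_epochs=2,
    per_device_train_batch_size=batch_size,
    logging_steps=logging_steps,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok_train,   # tokenized training split (assumed name)
    eval_dataset=tok_test,     # tokenized test split, reused in section 6.4
)
trainer.train()
</pre>
The `eval_accuracy` and `eval_f1` keys read in section 6.4 imply that the Trainer was also given a `compute_metrics` function returning `accuracy` and `f1`; a sketch of one appears after that section.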
## 3. Corpus Used
The corpus used for training is from © 2023 Universität Leipzig / Sächsische Akademie der Wissenschaften / InfAI.

| Language | Size of Corpus (number of sentences) |
| -------- | -------- |
| **English** | 50002 |
| **French** | 50002 |
| […] | […] |
| **Arabic** | 36888 |

## 4. Technology Stack
1. **Python**: Python is the primary programming language used for developing the language identification model and its associated tools. Its simplicity, readability, and extensive libraries make it an ideal choice for Natural Language Processing (NLP) tasks.

2. **Hugging Face Transformers**: Hugging Face Transformers is a fundamental component of the technology stack. It provides access to pre-trained models, libraries for model fine-tuning, and tokenization tools. The project relies heavily on this open-source library for model loading, fine-tuning, and evaluation.

[…]

10. **Google Colab Notebooks**: Google Colab (Jupyter) notebooks were used for exploratory data analysis, code prototyping, and interactive documentation. They offer a convenient environment for experimenting with code and data visualization.

## 5. Model Performance

### 5.1 Overall Performance

| Accuracy | F1-Score |
| -------- | -------- |
| 0.9996 | 0.9996 |

### 5.2 Language-Wise Performance

| Language | Precision | Recall | F1-Score | Accuracy |
| -------- | -------- | -------- | -------- | -------- |
| **English** | 1.0000 | 0.9994 | 0.9997 | 0.9994 |
| **French** | 1.0000 | 0.9992 | 0.9996 | 0.9992 |
| **German** | 1.0000 | 0.9998 | 0.9999 | 0.9998 |
| **Arabic** | 1.0000 | 0.9997 | 0.9999 | 0.9997 |
| **Russian** | 1.0000 | 1.0000 | 1.0000 | 1.0000 |

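For reference, per-language precision, recall, and F1 numbers like those above can be produced with scikit-learn's classification report. This is an illustrative sketch, not the card's own evaluation code; `y_true` and `y_pred` are assumed arrays of integer labels and predictions for the test set:
<pre>
from sklearn.metrics import classification_report

label_names = ["en", "fr", "de", "ru", "ar"]  # illustrative label order
print(classification_report(y_true, y_pred, target_names=label_names, digits=4))
</pre>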
## 6. Usage
To use this model for language identification, you can follow these steps:

1. Install the necessary libraries and dependencies.

[…]

3. Tokenize the input text using the model's tokenizer.
4. Make predictions on the tokenized input to identify the language.

### 6.1 Installation
To use this language identification model, install the transformers library and the other essential dependencies:
<pre>
pip install transformers datasets
</pre>
The inference snippet in section 6.3 also imports torch, so make sure PyTorch is installed as well.

### 6.2 Loading the Model
The model can be loaded with the Hugging Face Transformers library. The following code loads the model and tokenizer (the checkpoint path below is a placeholder for this model's repository id):
<pre>
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_ckpt = "path/to/fine-tuned-checkpoint"  # placeholder: substitute the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)
</pre>

### 6.3 Language Identification
Identifying the language of a given text is a straightforward process: the pre-trained tokenizer prepares the text and the fine-tuned model makes the prediction. Here's a code snippet:
<pre>
import torch

text = "Your input text goes here"
inputs = tokenizer(text, return_tensors="pt")   # tokenize the input text
with torch.no_grad():                           # inference only, no gradient tracking
    outputs = model(**inputs)
predicted_language = model.config.id2label[torch.argmax(outputs.logits).item()]
</pre>
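Equivalently, the prediction can be wrapped in a pipeline. This is an alternative sketch, not code from this card; it reuses the model and tokenizer loaded in section 6.2:
<pre>
from transformers import pipeline

langid = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(langid("Bonjour tout le monde"))  # e.g. [{'label': 'fr', 'score': 0.99...}]
</pre>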

### 6.4 Evaluating the Model
The model's performance can be evaluated on your dataset using the provided evaluation script. It calculates accuracy and F1-score, giving you a comprehensive picture of how well the model classifies text.
<pre>
# `trainer` is the Trainer instance from fine-tuning; `tok_test` is the tokenized test set
eval_result = trainer.evaluate(eval_dataset=tok_test)
accuracy = eval_result["eval_accuracy"]
f1 = eval_result["eval_f1"]
</pre>
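The card does not show the `compute_metrics` function that produces these keys, but a minimal sketch that would yield them looks like this (the weighted F1 averaging is an assumption):
<pre>
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)   # predicted class per example
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),  # assumed averaging
    }
</pre>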

## 7. Project File Structure
The project's file structure is organized as follows:

- `data/` : Contains datasets used for training and testing the model
- `src/` : Source code and the Google Colab notebook
- `README.md` : This README file
- `/` : Model checkpoint files

## 8. Contributing
Contributions and suggestions are welcome. If you find issues or have ideas for improvements, please open an issue or submit a pull request.