Joshi-Aryan committed
Commit 5dcaa2c · Parent(s): 43ed7e7

Update README.md

Files changed (1): README.md (+59 −37)

README.md CHANGED

  <img src="https://miro.medium.com/v2/resize:fit:1400/1*G5AyGtaUAQBcVLikpxu6CQ.png" style="border-radius: 5%;">

> ## Languages Supported:
> 1. English (en)
> 2. French (fr)
> 3. German (de)
> 4. Russian (ru)
> 5. Arabic (ar)
> ## Metrics:
> 1. F1-score
> 2. Accuracy
> 3. Precision
> 4. Recall
> ## Library Name:
> Transformers
 
## Overview
Language identification is a foundational task in Natural Language Processing (NLP). This project introduces a language identification model fine-tuned from the XLM-RoBERTa architecture. It classifies text in five languages: English, French, German, Arabic, and Russian. The sections below describe the model, its training setup, the data, its performance, and how to use it.

## Table of Contents
1. Model Details
2. Training
3. Corpus Used
4. Technology Stack
5. Model Performance
6. Usage
7. Project File Structure
8. Contributing

## 1. Model Details

1. **Model Architecture:** The model architecture is based on XLM-RoBERTa, a multilingual variant of RoBERTa. This architecture is renowned for its contextual embeddings and multilingual capabilities.

4. **Evaluation Metric:** The primary evaluation metrics are accuracy and F1-score, which summarize the model's overall classification performance (the configured label set can be inspected as shown after this list).

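Once the model and tokenizer are loaded (see §6.2), the label configuration can be checked directly. This is a minimal sketch; the id-to-label ordering in the comment is an assumption and depends on how labels were encoded at training time.
<pre>
# Assumes `model` has been loaded as in section 6.2.
print(model.config.num_labels)  # expected: 5
print(model.config.id2label)    # e.g. {0: "ar", 1: "de", ...} -- exact mapping is repo-specific
</pre>
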
## 2. Training
The model was fine-tuned using Hugging Face's Trainer class. This section covers the training configuration:

1. **Number of Epochs:** The model was trained for two epochs, balancing training time against performance.

5. **Logging Steps:** The logging steps are derived from the size of the training dataset, so logging frequency adapts to dataset variations and yields more informative logs during training (see the sketch after this list).

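For orientation, here is a minimal sketch of what this Trainer setup might look like. The batch size and output directory are assumptions for illustration and are not taken from this README; `tok_train` and `tok_test` are the tokenized train/test splits (a sketch of producing them follows in §3), and `compute_metrics` is sketched in §6.4.
<pre>
from transformers import Trainer, TrainingArguments

batch_size = 64                               # assumed; not stated in this README
logging_steps = len(tok_train) // batch_size  # logging derived from dataset size, as described above

training_args = TrainingArguments(
    output_dir="xlm-roberta-language-id",     # hypothetical output path
    num_train_epochs=2,                       # two epochs, as stated above
    per_device_train_batch_size=batch_size,
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
)

trainer = Trainer(
    model=model,                              # loaded as in section 6.2
    args=training_args,
    train_dataset=tok_train,
    eval_dataset=tok_test,                    # also used in section 6.4
    compute_metrics=compute_metrics,          # accuracy + F1, sketched in section 6.4
)
trainer.train()
</pre>
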
## 3. Corpus Used
The training corpus is drawn from the Leipzig Corpora Collection (© 2023 Universität Leipzig / Sächsische Akademie der Wissenschaften / InfAI).

| Language | Size of Corpus (number of sentences) |
| -------- | -------- |
|**English**|50002|
|**French**|50002|
|**Arabic**|36888|

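The tokenized splits referenced elsewhere in this README (`tok_train`, `tok_test`) could be produced along these lines. This is a sketch: the `train_ds` / `test_ds` objects and their column names are assumptions, and the step of downloading the Leipzig corpora into them is not shown here.
<pre>
# Assumes `train_ds` / `test_ds` are Hugging Face `datasets.Dataset` objects
# with "text" and "label" columns, and `tokenizer` is loaded as in section 6.2.
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

tok_train = train_ds.map(tokenize, batched=True)
tok_test = test_ds.map(tokenize, batched=True)
</pre>
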
## 4. Technology Stack
1. **Python**: Python is the primary programming language used for developing the language identification model and its associated tools. Its simplicity, readability, and extensive libraries make it an ideal choice for Natural Language Processing (NLP) tasks.

2. **Hugging Face Transformers**: Hugging Face Transformers is a fundamental component of the technology stack. It provides access to pre-trained models, libraries for model fine-tuning, and tokenization tools. The project relies heavily on this open-source library for model loading, fine-tuning, and evaluation.

10. **Google Colab Notebooks**: Google Colab (Jupyter) notebooks were used for exploratory data analysis, code prototyping, and interactive documentation. They offer a convenient environment for experimenting with code and visualizing data.

## 5. Model Performance

### 5.1 Overall Performance
| Accuracy | F1-Score |
| -------- | -------- |
|0.9996|0.9996|

### 5.2 Language-wise Performance

| Language | Precision | Recall | F1-Score | Accuracy |
| -------- | -------- | -------- | -------- | -------- |
|**English**|1.0000|0.9994|0.9997|0.9994|
|**French**|1.0000|0.9992|0.9996|0.9992|
|**German**|1.0000|0.9998|0.9999|0.9998|
|**Arabic**|1.0000|0.9997|0.9999|0.9997|
|**Russian**|1.0000|1.0000|1.0000|1.0000|

## 6. Usage
To use this model for language identification, follow these steps:

1. Install the necessary libraries and dependencies.
2. Load the pre-trained model and tokenizer.
3. Tokenize the input text using the model's tokenizer.
4. Make predictions on the tokenized input to identify the language.

### 6.1 Installation
To use this language identification model, install the transformers library and the other essential dependencies (PyTorch is included here because the prediction snippet below relies on it):
<pre>
pip install transformers datasets torch
</pre>

### 6.2 Loading the Model
The model can be loaded with the Hugging Face Transformers library. The following code demonstrates how to load the model and tokenizer; the checkpoint id below is a placeholder for this model's Hub id:
<pre>
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_ckpt = "path-or-hub-id-of-this-model"  # placeholder; replace with the actual checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)
</pre>

### 6.3 Language Identification
Identifying the language of a given text is straightforward: the pre-trained tokenizer prepares the text and the fine-tuned model makes the prediction. Here's a code snippet:
<pre>
import torch

text = "Your input text goes here"
inputs = tokenizer(text, return_tensors="pt")  # tokenize the input
with torch.no_grad():                          # no gradients needed for inference
    outputs = model(**inputs)
predicted_language = model.config.id2label[torch.argmax(outputs.logits).item()]
</pre>

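Alternatively, the same prediction can be made through the Transformers pipeline API. This is a sketch under the same assumption as above, namely that `model_ckpt` holds this model's actual checkpoint id:
<pre>
from transformers import pipeline

# Bundles the tokenizer, model, and post-processing into one object.
classifier = pipeline("text-classification", model=model_ckpt)
print(classifier("Your input text goes here"))  # e.g. [{'label': 'en', 'score': 0.99...}]
</pre>
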
### 6.4 Evaluating the Model
The model's performance can be evaluated on your dataset with the Trainer's evaluate method. It reports accuracy and F1-score, giving a comprehensive picture of how well the model classifies text:
<pre>
eval_result = trainer.evaluate(eval_dataset=tok_test)
accuracy = eval_result["eval_accuracy"]
f1 = eval_result["eval_f1"]
</pre>

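For the `eval_accuracy` and `eval_f1` keys above to exist, the Trainer must be constructed with a `compute_metrics` function. The repo's exact implementation is not shown in this README; a minimal sketch using scikit-learn might look like this:
<pre>
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred          # the Trainer passes (predictions, labels)
    preds = np.argmax(logits, axis=-1)  # pick the highest-scoring language
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),  # weighted F1 across the five languages
    }
</pre>
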
## 7. Project File Structure
The project is organized as follows:

- `data/` : Datasets used for training and testing the model
- `src/` : Source code and the Google Colab notebook
- `README.md` : This README file
- `/` : Model checkpoint files

## 8. Contributing
Contributions and suggestions are welcome. If you find issues or have ideas for improvements, please open an issue or submit a pull request.