## Training Details

Base Model: xlm-roberta-base

Training Dataset: The model is trained on the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.

Training Framework: Hugging Face transformers library with a PyTorch backend.

Data Preprocessing: Tokenization was performed with the XLM-RoBERTa tokenizer, with attention paid to aligning token labels to subword tokens.

## Training Procedure

Here's a brief overview of the training procedure for the XLM-RoBERTa model for NER:
Setup Environment:
- Clone the repository and set up dependencies.
- Import necessary libraries and modules.
Load Data:
- Load the PAN-X subset from the XTREME dataset.
- Shuffle and sample data subsets for training and evaluation.
Data Preparation:
- Convert raw dataset into a format suitable for token classification.
- Define a mapping for entity tags and apply tokenization.
- Align NER tags with tokenized inputs.
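
The alignment step is the subtle part: XLM-RoBERTa splits words into subwords, and only the first subword of each word should keep its NER tag. A minimal, self-contained sketch of that logic (`word_ids` is the per-token word index a fast tokenizer exposes via `encoding.word_ids()`; it is written out by hand here so no tokenizer download is needed):

```python
def align_labels(word_ids, ner_tags):
    labels, previous = [], None
    for word_id in word_ids:
        if word_id is None:          # special tokens (<s>, </s>) get -100
            labels.append(-100)
        elif word_id != previous:    # first subword keeps the word's tag
            labels.append(ner_tags[word_id])
        else:                        # later subwords are masked from the loss
            labels.append(-100)
        previous = word_id
    return labels

# Two words, the second split into two subwords; tag 5 is B-LOC in PAN-X.
print(align_labels([None, 0, 1, 1, None], [0, 5]))  # [-100, 0, 5, -100, -100]
```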
Define Model:
- Initialize the XLM-RoBERTa model for token classification.
- Configure the model with the number of labels based on the dataset.
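
Model definition can be sketched as follows. The seven IOB2 tags are PAN-X's label set; the variable names are illustrative:

```python
from transformers import AutoModelForTokenClassification

# PAN-X labels person, organisation, and location spans in IOB2 format.
tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
index2tag = dict(enumerate(tags))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# The token-classification head is freshly initialised on top of the
# pretrained encoder; its size comes from num_labels.
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(tags),
    id2label=index2tag,
    label2id=tag2index,
)
```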
Setup Training Arguments:
- Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
- Configure logging and checkpointing.
Initialize Trainer:
- Create a Trainer instance with the model, training arguments, datasets, and data collator.
- Specify evaluation metrics to monitor performance.
Train the Model:
- Start the training process using the Trainer.
- Monitor training progress and metrics.
Evaluation and Results:
- Evaluate the model on the validation set.
- Compute metrics like F1 score for performance assessment.
Save and Push Model:
- Save the fine-tuned model locally or push to a model hub for sharing and further use.
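
Saving and pushing can be sketched as below. A tiny randomly initialised model again stands in for the fine-tuned checkpoint so the example runs without downloads; the local directory and Hub repository names are placeholders:

```python
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification

# Tiny random model as a stand-in for the fine-tuned checkpoint.
config = XLMRobertaConfig(vocab_size=64, hidden_size=32, num_hidden_layers=1,
                          num_attention_heads=2, intermediate_size=64,
                          num_labels=7)
model = XLMRobertaForTokenClassification(config)

model.save_pretrained("panx-ner-model")             # local save
# model.push_to_hub("your-username/panx-ner-model")  # requires a Hub login

# Anything saved locally can be reloaded the same way a Hub model is.
reloaded = XLMRobertaForTokenClassification.from_pretrained("panx-ner-model")
```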

## Usage

The helper below tags a sentence and returns the results as a DataFrame. Its body is a minimal sketch, and `ner_pipeline` is assumed to be a `transformers` `pipeline("ner", ...)` built from the fine-tuned checkpoint.

```python
import pandas as pd

def tag_text_with_pipeline(text, ner_pipeline):
    # Run the NER pipeline; each output dict carries the token, its
    # predicted entity tag, and a confidence score.
    outputs = ner_pipeline(text)
    df = pd.DataFrame([(o["word"], o["entity"], o["score"]) for o in outputs])
    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
    return df

text = "Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```