## Training Details

Base Model: xlm-roberta-base

Training Dataset: The model is trained on the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.

Training Framework: Hugging Face transformers library with a PyTorch backend.

Data Preprocessing: Tokenization was performed with the XLM-RoBERTa tokenizer, with attention paid to aligning token labels to subword tokens.

## Training Procedure

Here's a brief overview of the training procedure for the XLM-RoBERTa model for NER:
Setup Environment:
- Clone the repository and set up dependencies.
- Import necessary libraries and modules.
Load Data:
- Load the PAN-X subset from the XTREME dataset.
- Shuffle and sample data subsets for training and evaluation.
Data Preparation:
- Convert raw dataset into a format suitable for token classification.
- Define a mapping for entity tags and apply tokenization.
- Align NER tags with tokenized inputs.
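
The alignment step is the subtle part: XLM-RoBERTa splits words into subwords, and only the first subword of each word should keep its NER tag. A minimal, self-contained sketch of that logic (`word_ids` is the per-token word index a fast tokenizer exposes via `encoding.word_ids()`; it is written out by hand here so no tokenizer download is needed):

```python
def align_labels(word_ids, ner_tags):
    labels, previous = [], None
    for word_id in word_ids:
        if word_id is None:          # special tokens (<s>, </s>) get -100
            labels.append(-100)
        elif word_id != previous:    # first subword keeps the word's tag
            labels.append(ner_tags[word_id])
        else:                        # later subwords are masked from the loss
            labels.append(-100)
        previous = word_id
    return labels

# Two words, the second split into two subwords; tag 5 is B-LOC in PAN-X.
print(align_labels([None, 0, 1, 1, None], [0, 5]))  # [-100, 0, 5, -100, -100]
```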
Define Model:
- Initialize the XLM-RoBERTa model for token classification.
- Configure the model with the number of labels based on the dataset.
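
Model definition can be sketched as follows. The seven IOB2 tags are PAN-X's label set; the variable names are illustrative:

```python
from transformers import AutoModelForTokenClassification

# PAN-X labels person, organisation, and location spans in IOB2 format.
tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
index2tag = dict(enumerate(tags))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# The token-classification head is freshly initialised on top of the
# pretrained encoder; its size comes from num_labels.
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(tags),
    id2label=index2tag,
    label2id=tag2index,
)
```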
Setup Training Arguments:
- Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
- Configure logging and checkpointing.
Initialize Trainer:
- Create a Trainer instance with the model, training arguments, datasets, and data collator.
- Specify evaluation metrics to monitor performance.
Train the Model:
- Start the training process using the Trainer.
- Monitor training progress and metrics.
Evaluation and Results:
- Evaluate the model on the validation set.
- Compute metrics like F1 score for performance assessment.
Save and Push Model:
- Save the fine-tuned model locally or push to a model hub for sharing and further use.
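
Saving and pushing can be sketched as below. A tiny randomly initialised model again stands in for the fine-tuned checkpoint so the example runs without downloads; the local directory and Hub repository names are placeholders:

```python
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification

# Tiny random model as a stand-in for the fine-tuned checkpoint.
config = XLMRobertaConfig(vocab_size=64, hidden_size=32, num_hidden_layers=1,
                          num_attention_heads=2, intermediate_size=64,
                          num_labels=7)
model = XLMRobertaForTokenClassification(config)

model.save_pretrained("panx-ner-model")             # local save
# model.push_to_hub("your-username/panx-ner-model")  # requires a Hub login

# Anything saved locally can be reloaded the same way a Hub model is.
reloaded = XLMRobertaForTokenClassification.from_pretrained("panx-ner-model")
```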

## Usage

The helper below tags a sentence and returns the results as a DataFrame. Its body is a minimal sketch, and `ner_pipeline` is assumed to be a `transformers` `pipeline("ner", ...)` built from the fine-tuned checkpoint.

```python
import pandas as pd

def tag_text_with_pipeline(text, ner_pipeline):
    # Run the NER pipeline; each output dict carries the token, its
    # predicted entity tag, and a confidence score.
    outputs = ner_pipeline(text)
    df = pd.DataFrame([(o["word"], o["entity"], o["score"]) for o in outputs])
    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
    return df

text = "Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```