| --- |
| language: |
| - en |
| tags: |
| - sentiment |
| - bert |
| - sentiment-analysis |
| - transformers |
| pipeline_tag: text-classification |
| --- |
| |
| Sentiment Analysis Model for Hotel Reviews |
| This model classifies hotel reviews into one of three sentiment categories: Negative, Neutral, or Positive. |
|
|
| Model Description |
| The model is based on BERT (Bidirectional Encoder Representations from Transformers), specifically the bert-base-uncased checkpoint. |
|
|
| Training Procedure |
| The model was trained on the TripAdvisor hotel reviews dataset. Each review in the dataset is associated with a rating from 1 to 5. |
| The ratings were converted to sentiment labels as follows: |
|
|
| Ratings of 1 and 2 were labelled as 'Negative' |
| A rating of 3 was labelled as 'Neutral' |
| Ratings of 4 and 5 were labelled as 'Positive' |
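The rating-to-label conversion listed above can be sketched as a small function. This is only an illustration: the function name `rating_to_label` is hypothetical and is not taken from the original training code.

```python
def rating_to_label(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label (hypothetical helper)."""
    if rating in (1, 2):
        return "Negative"
    if rating == 3:
        return "Neutral"
    if rating in (4, 5):
        return "Positive"
    # Ratings outside 1-5 do not occur in the scheme described above.
    raise ValueError(f"rating must be between 1 and 5, got {rating!r}")


print([rating_to_label(r) for r in range(1, 6)])
```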
| Each review was preprocessed by lowercasing the text and removing punctuation, emojis, and stop words; the cleaned text was then tokenized with the BERT tokenizer. |
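A minimal sketch of the cleaning steps described above, using only the Python standard library. The stop-word set and the emoji code-point ranges are illustrative assumptions; the lists actually used during training are not documented here, and the BERT tokenization step is not shown.

```python
import re
import string

# Assumption: a tiny illustrative stop-word set, not the one used in training.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was", "were"}

# Assumption: these code-point ranges cover common emojis and symbols only.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(review: str) -> str:
    """Lowercase, then strip emojis, punctuation, and stop words."""
    text = review.lower()
    text = EMOJI_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(tokens)


print(preprocess("The hotel was GREAT!!! \U0001F600 and the staff were friendly."))
# prints: hotel great staff friendly
```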
|
|
| The model was trained with a learning rate of 2e-5, an epsilon of 1e-8, and a batch size of 6 for 5 epochs. |
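The hyperparameters above can be collected in a small configuration object, which also makes the number of optimizer steps easy to estimate. This is a sketch under stated assumptions: the class name `TrainingConfig` is hypothetical, the epsilon is assumed to be the numerical-stability term of an Adam-style optimizer, the dataset size of 1000 is invented for the example, and the step count assumes the final partial batch is kept and no gradient accumulation is used.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters stated in this model card (hypothetical container)."""
    learning_rate: float = 2e-5
    epsilon: float = 1e-8  # assumed to be the optimizer's stability term
    batch_size: int = 6
    epochs: int = 5

    def total_steps(self, num_examples: int) -> int:
        # One optimizer step per batch; the last batch may be smaller.
        return math.ceil(num_examples / self.batch_size) * self.epochs


config = TrainingConfig()
print(config.total_steps(1000))  # 167 batches per epoch * 5 epochs = 835
```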
|
|
| Evaluation |
| The model was evaluated using a weighted F1 score. |
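For reference, a weighted F1 score is the mean of the per-class F1 scores, each weighted by the number of true examples of that class. The pure-Python sketch below illustrates the computation; it is not the evaluation code used for this model, and the example labels are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Made-up labels for illustration only.
y_true = ["Positive", "Positive", "Positive", "Neutral", "Negative", "Negative"]
y_pred = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Positive"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6667
```

Because each class is weighted by its support, frequent classes dominate the score, which matters if the three sentiment classes are imbalanced.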
|
|
| Usage |
| To use the model, load the tokenizer and the classifier, then classify a review. For example: |
|
| from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
| # Load the tokenizer and the trained classifier |
| tokenizer = AutoTokenizer.from_pretrained("<Group209>") |
| model = AutoModelForSequenceClassification.from_pretrained("<Group209>") |
|
| text = "The hotel was great and the staff were very friendly." |
|
| # Tokenize the review and return PyTorch tensors |
| encoded_input = tokenizer(text, truncation=True, padding=True, return_tensors='pt') |
| output = model(**encoded_input) |
|
| # Index of the highest-scoring class for each input |
| predictions = output.logits.argmax(dim=1) |
|
| print(predictions) |
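The usage example prints a class index rather than a label name. The sketch below shows one way to turn a row of logits into a label and a confidence value. The index order (0 = Negative, 1 = Neutral, 2 = Positive) is an assumption; check `model.config.id2label` for the mapping stored with the model. The logits here are made-up numbers; with the usage example they could be obtained as `output.logits[0].tolist()`.

```python
import math

# Assumption: this index order is illustrative; verify it against
# model.config.id2label before relying on it.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw logits to probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, confidence = decode([-1.2, 0.3, 2.5])  # made-up logits
print(label)  # Positive
```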
| |
| Limitations and Bias |
| The model was trained only on English-language reviews, so it may perform poorly on reviews written in other languages. |
| It may also be biased toward phrases or words that occur frequently in the training dataset. |