---
library_name: transformers
tags:
- steam
- video games
- distilbert
license: apache-2.0
datasets:
- SebastianHops/steam-reviews-english
language:
- en
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---

# Distilbert Steam Sentiment (Small)

This is a fine-tuned version of the distilbert/distilbert-base-uncased model trained on the SebastianHops/steam-reviews-english dataset. It was made for simple sentiment analysis, particularly of video game reviews.

### Model Description

This model uses Distilbert as a base and is fine-tuned on a subset of the SebastianHops/steam-reviews-english dataset. I call this the "small" version because it uses only a fraction (100,000 lines) of the training dataset, to keep training and inference fast. Given the dataset and base model, Distilbert Steam Sentiment (Small) is well suited to sentiment analysis applications, especially within the video games and new media industries. The training data includes a lot of Gen Alpha/Gen Z internet slang, which sets it apart from other sentiment analysis models.

- **Developed by:** Trevor Keay
- **Model type:** Fine-tuned Transformer
- **Language(s) (NLP):** English
- **License:** Apache License 2.0
- **Finetuned from model:** distilbert/distilbert-base-uncased

### Model Sources

- **Base Model:** https://huggingface.co/distilbert/distilbert-base-uncased
- **Training Data:** https://huggingface.co/datasets/SebastianHops/steam-reviews-english

## Uses

While Distilbert is useful for a variety of sentence prediction and analysis applications, sentiment analysis is the primary purpose of this downstream version.
### Direct Use

Primarily sentiment analysis applications involving the new media / video games industry.

### Out-of-Scope Use

This model may not work as well on traditional literature or more formal text, as the training data consists of extremely informal text littered with modern slang. I do not endorse or condone the use of this model for any malicious or illegal purposes, and I do not believe it would work well for those applications anyway!

## Bias, Risks, and Limitations

This model reflects the biases present within both the base model and the training data. Due to response bias, users who voluntarily review games are more likely to hold extreme opinions than the average player, so the model is biased towards more extreme reactions. Additionally, due to cultural trends within the gaming community, racial and/or gender biases are likely present in the output.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Here is a really simple application of the model to get you going:

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

MODEL_NAME = "tjkeay/Distilbert_Steam_Sentiment_Small"

# Build a text-classification pipeline from the fine-tuned checkpoint,
# using the GPU when one is available.
sentiment_classifier = pipeline(
    task="text-classification",
    model=AutoModelForSequenceClassification.from_pretrained(MODEL_NAME),
    tokenizer=AutoTokenizer.from_pretrained(MODEL_NAME),
    device=0 if torch.cuda.is_available() else -1,
)

example_text = "10/10 could not stop dying"
result = sentiment_classifier(example_text)[0]
output = result["label"]
print("output (0 should be negative):", output)
```

## Training Details

The model was trained with custom arguments focused on being lightweight and efficient.

### Training Data

The training data contains a large number of reviews scraped directly from Steam.
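The subsetting step can be sketched roughly as follows. This is an illustration only, not the actual training script: the rows below are toy stand-ins for the dataset, and only the random-sampling logic mirrors what was done for the released model.

```python
import random

# Toy stand-in rows for SebastianHops/steam-reviews-english,
# showing only the columns kept for training.
rows = [
    {
        "game": f"game {i % 50}",
        "review": f"review text {i}",
        "voted_up": i % 2 == 0,
        "author_playtime_forever": i * 10,
        "author_playtime_at_review": i * 5,
    }
    for i in range(1000)
]

SAMPLE_SIZE = 100  # the released model sampled 100,000 rows
random.seed(42)  # seed chosen here for reproducibility of the sketch
subset = random.sample(rows, SAMPLE_SIZE)
print(len(subset))  # prints 100
```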
Only the 'game', 'review', 'voted_up', 'author_playtime_forever', and 'author_playtime_at_review' columns were included for training. Additionally, the model was trained on a random sample of only 100,000 entries from the dataset to make it faster to train and use.

## Evaluation

Evaluation results:

- `eval_train_loss`: 0.14118799567222595
- `eval_test_loss`: 0.1386687308549881

### Testing Data

A train-test split of the same Steam reviews dataset was used.

## Model Card Authors

Trevor Keay