distilroberta-base-goodreads-genres

Model Overview

This model is a fine-tuned version of distilroberta-base for Goodreads book review / genre classification. It is designed to classify book-related text, such as book reviews, descriptions, or summaries, into genre categories.

The model was developed as part of an MLOps assignment using Hugging Face Transformers, a Kaggle Notebook, the Hugging Face Hub, and Weights & Biases for experiment tracking.

Model Details

Model name: distilroberta-base-goodreads-genres
Base model: distilroberta-base
Model type: Transformer-based sequence classification model
Task: Text classification
Domain: Goodreads book reviews / genre classification
Language: English
Library: Hugging Face Transformers
Training platform: Kaggle Notebook
Experiment tracking: Weights & Biases
Model repository: https://huggingface.co/sumitp76/distilroberta-base-goodreads-genres

Important Links

Kaggle Notebook: https://www.kaggle.com/code/sumitpiitj/gr-book-review/edit
Hugging Face Model: https://huggingface.co/sumitp76/distilroberta-base-goodreads-genres
W&B Project Dashboard: https://wandb.ai/sumit-k-pal-76-iitj/mlops-assignment2/table?nw=nwusersumitkpal76

Setup Instructions

1. Clone or open the project

The training was performed in a Kaggle Notebook.

Kaggle Notebook:

https://www.kaggle.com/code/sumitpiitj/gr-book-review/edit

2. Install dependencies

Install the required Python libraries:

pip install transformers
pip install datasets
pip install evaluate
pip install accelerate
pip install huggingface_hub
pip install wandb
pip install scikit-learn
pip install pandas
pip install numpy
pip install torch

In Kaggle, many packages may already be installed. If needed, install missing packages inside a notebook cell:

!pip install transformers datasets evaluate accelerate huggingface_hub wandb scikit-learn

3. Set up Hugging Face token

To push the model to Hugging Face Hub, create a Hugging Face access token with Write permission.

In Kaggle:

Go to Add-ons.
Open Secrets.
Add your Hugging Face token.
Save it using the name:

HF_TOKEN

Then load it in the notebook:

from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")

login(token=HF_TOKEN)
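Outside Kaggle, the `kaggle_secrets` module is not available, so the snippet above will fail on import. The following is a minimal sketch, assuming the token is also exported as an environment variable named `HF_TOKEN`; the helper name `get_hf_token` is hypothetical and not part of the original notebook.

```python
import os


def get_hf_token(secret_name: str = "HF_TOKEN"):
    """Return the token from Kaggle Secrets, else from the environment."""
    try:
        # Only importable inside a Kaggle Notebook session.
        from kaggle_secrets import UserSecretsClient

        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Outside Kaggle, or if the secret is not attached to the notebook:
        # fall back to an environment variable of the same name.
        return os.environ.get(secret_name)
```

The returned value can then be passed to `login(token=...)` as before; if it is `None`, no token was found in either place.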

4. Set up W&B tracking

import wandb

wandb.login()

Initialize a W&B run:

wandb.init(
    project="mlops-assignment2",
    name="distilroberta-base-goodreads-genres"
)

W&B dashboard:

https://wandb.ai/sumit-k-pal-76-iitj/mlops-assignment2/table?nw=nwusersumitkpal76

5. Train the model

The model was trained using Hugging Face Trainer.

General training flow:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

model_name = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels must be defined beforehand as the number of distinct genre labels in the dataset
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels
)

Tokenize the dataset:

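def tokenize_function(examples):
    # Pad and truncate every example to 256 tokens, the max sequence length
    # listed under Training Configuration
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256
    )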

tokenized_dataset = dataset.map(tokenize_function, batched=True)
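To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: the function name `pad_or_truncate` and the token IDs are illustrative assumptions, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])             # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask


# Hypothetical token IDs; 1 is used as the pad ID for illustration.
ids, mask = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=1)
print(ids)   # [0, 713, 16, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every example ends up the same length, which lets the Trainer batch them without a custom collator, at the cost of spending compute on padding for short texts.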

Train using Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
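The `training_args` object passed to the Trainer is not defined in this card. The sketch below collects the values from the Training Configuration table as keyword arguments for `TrainingArguments`. The `output_dir` and `report_to` entries are assumptions, not confirmed notebook settings.

```python
# Keyword arguments for transformers.TrainingArguments, based on the
# Training Configuration table in this card.
training_kwargs = {
    "output_dir": "distilroberta-base-goodreads-genres",  # assumed folder name
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 6,
    "eval_strategy": "steps",  # called `evaluation_strategy` in older releases
    "report_to": "wandb",      # assumed, since runs were tracked in W&B
}

# In the notebook this would be used as:
#   training_args = TrainingArguments(**training_kwargs)
```

Keeping the settings in a plain dictionary also makes it easy to log the same values to W&B as the run configuration.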

6. Evaluate the model

After training, evaluate the model:

results = trainer.evaluate()
print(results)
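The `compute_metrics` callable given to the Trainer is not shown in this card. Below is a dependency-free sketch of one way it could produce the accuracy and F1 values reported under Results. The card does not state which F1 averaging was used, so the support-weighted average here is an assumption, and the helper name `accuracy_and_weighted_f1` is hypothetical.

```python
from collections import Counter


def accuracy_and_weighted_f1(y_true, y_pred):
    """Accuracy and support-weighted F1 for single-label classification."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    support = Counter(y_true)    # true examples per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    f1 = 0.0
    for label, count in support.items():
        # Per-class F1 = 2*TP / ((TP + FP) + (TP + FN))
        class_f1 = 2 * correct[label] / (predicted[label] + count)
        f1 += class_f1 * count / n  # weight by class support
    return {"accuracy": accuracy, "f1": f1}


def compute_metrics(eval_pred):
    """Trainer-style hook: eval_pred unpacks to (logits, labels)."""
    logits, labels = eval_pred
    # Argmax over each row of logits gives the predicted class ID.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    return accuracy_and_weighted_f1(labels, preds)
```

In practice the same numbers can be obtained with scikit-learn's `accuracy_score` and `f1_score`; the plain-Python version is shown only to make the computation explicit.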

7. Push the model to Hugging Face Hub

repo_id = "sumitp76/distilroberta-base-goodreads-genres"

model.push_to_hub(repo_id, token=HF_TOKEN)
tokenizer.push_to_hub(repo_id, token=HF_TOKEN)

Model link:

https://huggingface.co/sumitp76/distilroberta-base-goodreads-genres

How to use the model

You can use the model directly with the Hugging Face pipeline.

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="sumitp76/distilroberta-base-goodreads-genres"
)

text = "A young wizard discovers his magical powers and enters a hidden world of adventure."

result = classifier(text)

print(result)
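The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. For a single-label classifier, the score is by default the softmax probability of the top class. The sketch below mimics that post-processing in plain Python; the genre names and logits are made up for illustration and are not this model's actual labels or outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label):
    """Mimic the pipeline's output shape: [{'label': ..., 'score': ...}]."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return [{"label": id2label[best], "score": scores[best]}]


# Hypothetical genre names and logits, for illustration only.
id2label = {0: "fantasy", 1: "romance", 2: "mystery"}
print(top_prediction([2.0, 0.5, -1.0], id2label))
```

If the uploaded model config does not include genre names, the pipeline reports generic labels such as `LABEL_0`, which then need to be mapped back to genres manually.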

Training Details

Training Platform

The model was trained in a Kaggle Notebook.

Platform: Kaggle Notebook
Notebook link: https://www.kaggle.com/code/sumitpiitj/gr-book-review/edit
Framework: Hugging Face Transformers
Experiment tracking: Weights & Biases
Model hosting: Hugging Face Hub

Base Model

The base model used was:

distilroberta-base

distilroberta-base is a distilled version of RoBERTa-base: it is smaller and faster than the original while following the same training approach. It suits text classification tasks that need a balance between accuracy and efficiency.

Preprocessing Steps

The general preprocessing steps included:

Loading the Goodreads book review / genre dataset
Checking and cleaning missing values
Preparing the input text column
Encoding genre labels into numeric IDs
Splitting the dataset into training and evaluation sets
Tokenizing text using the distilroberta-base tokenizer
Applying truncation and padding
Training the model using Hugging Face Trainer
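The label-encoding and splitting steps above can be sketched as follows. This is an illustration in plain Python, not the notebook's code: the column names `text` and `genre`, the 20% evaluation fraction, and the seed are all assumptions.

```python
import random


def encode_labels(genres):
    """Map each distinct genre string to an integer ID (sorted for stability)."""
    label2id = {name: i for i, name in enumerate(sorted(set(genres)))}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def train_eval_split(rows, eval_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split off eval_fraction of them."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = int(len(rows) * eval_fraction)
    return rows[n_eval:], rows[:n_eval]


# Hypothetical rows; the real dataset columns may be named differently.
rows = [
    {"text": "A wizard school adventure.", "genre": "fantasy"},
    {"text": "A detective hunts a killer.", "genre": "mystery"},
    {"text": "Two strangers fall in love.", "genre": "romance"},
    {"text": "A dragon guards a kingdom.", "genre": "fantasy"},
    {"text": "A locked-room murder puzzle.", "genre": "mystery"},
]
label2id, id2label = encode_labels(row["genre"] for row in rows)
for row in rows:
    row["label"] = label2id[row["genre"]]

train_rows, eval_rows = train_eval_split(rows, eval_fraction=0.2)
print(label2id)                         # {'fantasy': 0, 'mystery': 1, 'romance': 2}
print(len(train_rows), len(eval_rows))  # 4 1
```

Passing `id2label` and `label2id` to `from_pretrained` when creating the model stores the genre names in the model config, so predictions come back with readable labels.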

Training Configuration

Confirm the values below against the final notebook settings:

Parameter	Value
Base model	distilroberta-base
Task	Text classification
Optimizer	AdamW
Loss function	Cross-entropy loss
Batch size (train / eval)	16 / 32
Learning rate	2e-5
Number of epochs	6
Max sequence length	256
Evaluation strategy	steps

Results

Metric	Score
Accuracy	0.61583
F1 Score	0.61632
Eval Loss	2.66787

Result Links

Kaggle Notebook: https://www.kaggle.com/code/sumitpiitj/gr-book-review/edit
Hugging Face model: https://huggingface.co/sumitp76/distilroberta-base-goodreads-genres
W&B dashboard: https://wandb.ai/sumit-k-pal-76-iitj/mlops-assignment2/table?nw=nwusersumitkpal76


Model size: 82.1M params (Safetensors)
Tensor type: F32
