---
license: mit
tags:
- roberta
- text-classification
- reddit
- nlp
- sequence-classification
- pytorch
- transformers
model-index:
- name: SubRoBERTa
results: []
language:
- en
metrics:
- accuracy
base_model:
- FacebookAI/roberta-base
---
# SubRoBERTa: Reddit Subreddit Classification Model
This model is a fine-tuned RoBERTa-base model for classifying text into 10 different subreddits. It was trained on a dataset of posts from various subreddits to predict which subreddit a given text belongs to.
## Model Description
- **Model type:** RoBERTa-base fine-tuned for sequence classification
- **Language:** English
- **License:** MIT
- **Fine-tuned from model:** [roberta-base](https://huggingface.co/roberta-base)
## Intended Uses & Limitations
This model is intended to be used for:
- Classifying text into one of the following subreddits:
- r/aitah
- r/buildapc
- r/dating_advice
- r/legaladvice
- r/minecraft
- r/nostupidquestions
- r/pcmasterrace
- r/relationship_advice
- r/techsupport
- r/teenagers
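Each class id corresponds to one of the subreddits above. The mapping below is a sketch only: the ordering is assumed (alphabetical) and should be verified against `model.config.id2label` on the loaded checkpoint.

```python
# Hypothetical id2label mapping -- the alphabetical ordering is an
# assumption; check model.config.id2label after loading the model.
id2label = {
    0: "aitah",
    1: "buildapc",
    2: "dating_advice",
    3: "legaladvice",
    4: "minecraft",
    5: "nostupidquestions",
    6: "pcmasterrace",
    7: "relationship_advice",
    8: "techsupport",
    9: "teenagers",
}
label2id = {label: idx for idx, label in id2label.items()}
```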
### Limitations
- The model was trained on English text only
- Performance may vary for texts that are significantly different from the training data
- The model may not perform well on texts that don't clearly belong to any of the target subreddits
## Usage
Here's how to use the model:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example text
text = "My computer won't turn on, what should I do?"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
# Run inference
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
probs = F.softmax(logits, dim=-1)
pred_id = torch.argmax(probs, dim=-1).item()
pred_label = model.config.id2label[pred_id]
print(f"Predicted subreddit: {pred_label}")
```
## Training and Evaluation Data
The model was trained on a dataset of posts from the 10 target subreddits, partitioned 80-20 into training and evaluation sets.
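An 80-20 split like this can be reproduced with scikit-learn's `train_test_split`. The toy data below is illustrative only, not the actual training set; whether the original split was stratified is an assumption.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: (post text, class id) pairs
# covering 10 balanced classes.
texts = [f"post {i}" for i in range(100)]
labels = [i % 10 for i in range(100)]

# 80-20 split, stratified so each subreddit keeps the same proportion
# in both partitions (stratification is assumed, not documented).
train_texts, eval_texts, train_labels, eval_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(train_texts), len(eval_texts))  # 80 20
print(Counter(eval_labels))               # 2 examples per class
```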
### Training Procedure
- **Training regime:** Fine-tuning
- **Learning rate:** 2e-5
- **Number of epochs:** 10
- **Batch size:** 128
- **Optimizer:** AdamW
- **Mixed precision:** FP16
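The hyperparameters above map onto a standard Hugging Face `TrainingArguments` configuration. This is a configuration sketch, not the original training script (which is not published); `output_dir` is a placeholder, and older `transformers` versions spell `eval_strategy` as `evaluation_strategy`.

```python
from transformers import TrainingArguments

# Configuration sketch mirroring the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="subroberta-checkpoints",  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    optim="adamw_torch",                  # AdamW optimizer
    fp16=True,                            # mixed precision
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",     # best model chosen by F1-macro
)
```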
### Training Results
The model was evaluated using accuracy and F1-macro scores. The best model was selected based on the F1-macro score.
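A metric function matching this evaluation setup might look like the following sketch, using scikit-learn (the original evaluation code is not published). It accepts the `(logits, labels)` pair that the Hugging Face `Trainer` passes to `compute_metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and macro-averaged F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }
```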
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{SubRoBERTa,
  author       = {Marco Allanda},
  title        = {SubRoBERTa: Reddit Subreddit Classification Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```