---
license: mit
tags:
- roberta
- text-classification
- reddit
- nlp
- sequence-classification
- pytorch
- transformers
model-index:
- name: SubRoBERTa
  results: []
language:
- en
metrics:
- accuracy
base_model:
- FacebookAI/roberta-base
---

# SubRoBERTa: Reddit Subreddit Classification Model

This model is a fine-tuned RoBERTa-base model for classifying text into 10 different subreddits. It was trained on a dataset of posts from various subreddits to predict which subreddit a given text belongs to.

## Model Description

- **Model type:** RoBERTa-base fine-tuned for sequence classification
- **Language:** English
- **License:** MIT
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Intended Uses & Limitations

This model is intended to be used for:
- Classifying text into one of the following subreddits:
  - r/aitah
  - r/buildapc
  - r/dating_advice
  - r/legaladvice
  - r/minecraft
  - r/nostupidquestions
  - r/pcmasterrace
  - r/relationship_advice
  - r/techsupport
  - r/teenagers

### Limitations

- The model was trained on English text only
- Performance may vary for texts that are significantly different from the training data
- The model may not perform well on texts that don't clearly belong to any of the target subreddits

## Usage

Here's how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
    pred_id = torch.argmax(probs, dim=-1).item()
    pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```

## Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits, split into training and evaluation sets at an 80/20 ratio.

### Training Procedure

- **Training regime:** Fine-tuning
- **Learning rate:** 2e-5
- **Number of epochs:** 10
- **Batch size:** 128
- **Optimizer:** AdamW
- **Mixed precision:** FP16
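The original training script is not part of this repository, but the hyperparameters above map onto a standard Hugging Face `Trainer` setup. The sketch below is a hypothetical reconstruction: only the listed hyperparameters come from this card, while the output path, dataset variables, and evaluation/save strategies are illustrative assumptions.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Base model and label count come from the model card; everything
# marked "assumed" below is an illustrative guess, not the actual script.
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=10
)

training_args = TrainingArguments(
    output_dir="subroberta",              # assumed output path
    learning_rate=2e-5,                   # from the card
    num_train_epochs=10,                  # from the card
    per_device_train_batch_size=128,      # from the card
    fp16=True,                            # mixed precision, from the card
    eval_strategy="epoch",                # assumed ("evaluation_strategy" in older transformers)
    save_strategy="epoch",                # assumed
    load_best_model_at_end=True,          # best checkpoint by F1-macro
    metric_for_best_model="f1_macro",     # assumed metric key
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,          # assumed: pre-tokenized splits
    eval_dataset=eval_dataset,
)
trainer.train()
```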

### Training Results

The model was evaluated using accuracy and macro-averaged F1 (F1-macro), and the best checkpoint was selected by F1-macro score.
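F1-macro averages the per-class F1 scores with equal weight, so less frequent subreddits count as much as common ones. A minimal pure-Python sketch of both metrics (the function names here are illustrative; libraries such as scikit-learn provide equivalents):

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_macro(y_true, y_pred, num_classes=10):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1s = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / num_classes
```

Because each class contributes 1/num_classes to the final score regardless of its support, a model that ignores a rare subreddit is penalized more by F1-macro than by accuracy.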

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```