---
license: mit
tags:
- roberta
- text-classification
- reddit
- nlp
- sequence-classification
- pytorch
- transformers
model-index:
- name: SubRoBERTa
  results: []
language:
- en
metrics:
- accuracy
base_model:
- FacebookAI/roberta-base
---

# SubRoBERTa: Reddit Subreddit Classification Model

This model is a fine-tuned RoBERTa-base model for classifying text into 10 different subreddits. It was trained on a dataset of posts from these subreddits to predict which subreddit a given text belongs to.

## Model Description

- **Model type:** RoBERTa-base fine-tuned for sequence classification
- **Language:** English
- **License:** MIT
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Intended Uses & Limitations

This model is intended to be used for classifying text into one of the following subreddits (the snippet after this list shows how to read the label mapping from the model config):
- r/aitah
- r/buildapc
- r/dating_advice
- r/legaladvice
- r/minecraft
- r/nostupidquestions
- r/pcmasterrace
- r/relationship_advice
- r/techsupport
- r/teenagers
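
The id-to-label mapping is stored in the model configuration and can be listed without downloading the full weights. A small sketch (assuming the uploaded config's `id2label` field is populated with the subreddit names above):

```python
from transformers import AutoConfig

# Load only the configuration and print the id-to-label mapping
config = AutoConfig.from_pretrained("marcoallanda/SubRoBERTa")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```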

### Limitations

- The model was trained on English text only
- Performance may vary for texts that are significantly different from the training data
- The model may not perform well on texts that don't clearly belong to any of the target subreddits

## Usage

Here's how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and take the most likely class
logits = outputs.logits
probs = F.softmax(logits, dim=-1)
pred_id = torch.argmax(probs, dim=-1).item()
pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```
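
To see how confident the model is across classes, the snippet above can be extended (reusing its `probs` tensor and `model`) to print the three most likely subreddits:

```python
# Top-3 predictions with their probabilities (continues the snippet above)
top_probs, top_ids = torch.topk(probs, k=3, dim=-1)
for p, i in zip(top_probs[0].tolist(), top_ids[0].tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```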

## Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits. The data was split into training and evaluation sets using an 80/20 ratio.
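
The dataset itself is not published with this card. A split of this kind is typically produced along the following lines (a sketch assuming the posts live in a local CSV with `text` and integer-encoded `label` columns; the file name is illustrative):

```python
from datasets import load_dataset

# Hypothetical local file of Reddit posts with "text" and "label" columns
dataset = load_dataset("csv", data_files="reddit_posts.csv")["train"]

# 80/20 train/evaluation split
split = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```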

### Training Procedure

The main settings (mirrored in the sketch after this list) were:

- **Training regime:** Fine-tuning
- **Learning rate:** 2e-5
- **Number of epochs:** 10
- **Batch size:** 128
- **Optimizer:** AdamW
- **Mixed precision:** FP16
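
The training script is not included in this repository; a minimal sketch of how these settings map onto the Hugging Face `Trainer` API, continuing from the `train_ds`/`eval_ds` split above and using a `compute_metrics` function like the one in the next section, could look like this:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Base checkpoint with a 10-way classification head
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=10
)

# Tokenize the raw text column from the split above
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="subroberta",            # illustrative output path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    fp16=True,                          # mixed precision; requires a GPU
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",   # best checkpoint chosen by F1-macro
)

# AdamW is the Trainer's default optimizer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,    # defined under "Training Results" below
)
trainer.train()
```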

### Training Results

The model was evaluated using accuracy and F1-macro scores. The best model was selected based on the F1-macro score.
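
The evaluation code is not part of this card; a `compute_metrics` function matching these metrics (accuracy plus macro-averaged F1, computed with scikit-learn) would look roughly like this and plugs into the `Trainer` sketch above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and macro-averaged F1 from Trainer predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }
```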

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```