---
license: mit
tags:
- roberta
- text-classification
- reddit
- nlp
- sequence-classification
- pytorch
- transformers
model-index:
- name: SubRoBERTa
  results: []
language:
- en
metrics:
- accuracy
base_model:
- FacebookAI/roberta-base
---

# SubRoBERTa: Reddit Subreddit Classification Model

This model is a fine-tuned RoBERTa-base model for classifying text into 10 different subreddits. It was trained on a dataset of posts from various subreddits to predict which subreddit a given text belongs to.

## Model Description

- **Model type:** RoBERTa-base fine-tuned for sequence classification
- **Language:** English
- **License:** MIT
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Intended Uses & Limitations

This model is intended to be used for:

- Classifying text into one of the following subreddits:
  - r/aitah
  - r/buildapc
  - r/dating_advice
  - r/legaladvice
  - r/minecraft
  - r/nostupidquestions
  - r/pcmasterrace
  - r/relationship_advice
  - r/techsupport
  - r/teenagers

### Limitations

- The model was trained on English text only
- Performance may vary for texts that are significantly different from the training data
- The model may not perform well on texts that don't clearly belong to any of the target subreddits

## Usage

Here's how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
    pred_id = torch.argmax(probs, dim=-1).item()
    pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```

## Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits. The data was split into training and evaluation sets with an 80-20 split.

### Training Procedure

- **Training regime:** Fine-tuning
- **Learning rate:** 2e-5
- **Number of epochs:** 10
- **Batch size:** 128
- **Optimizer:** AdamW
- **Mixed precision:** FP16

### Training Results

The model was evaluated using accuracy and F1-macro scores. The best model was selected based on the F1-macro score.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```
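## Appendix: Top-k Decoding

The usage snippet above decodes only the single most likely subreddit. When a post could plausibly fit several subreddits, it is often more useful to rank the top few candidates. Below is a minimal, dependency-free sketch of that softmax-and-rank step; note that the label ordering and logit values here are made up for illustration (the authoritative id-to-label mapping lives in `model.config.id2label`).

```python
import math

# The 10 target subreddits listed above, in a hypothetical id order;
# the real mapping is stored in model.config.id2label.
ID2LABEL = [
    "aitah", "buildapc", "dating_advice", "legaladvice", "minecraft",
    "nostupidquestions", "pcmasterrace", "relationship_advice",
    "techsupport", "teenagers",
]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(ID2LABEL, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits that strongly favour "techsupport"
logits = [0.1, 1.2, -0.3, 0.0, -1.0, 0.4, 1.0, -0.2, 3.5, -0.5]
for label, prob in top_k(logits):
    print(f"r/{label}: {prob:.3f}")
```

With the real model, the same ranking can be obtained by replacing the made-up `logits` with `outputs.logits[0].tolist()` from the inference snippet and indexing labels through `model.config.id2label`.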