+# SubRoBERTa: Reddit Subreddit Classification Model
+SubRoBERTa is a RoBERTa-base model fine-tuned to classify text into one of 10 subreddits. It was trained on posts from those subreddits and predicts which one a given text most likely belongs to.
+## Model Description
+- **Model type:** RoBERTa-base fine-tuned for sequence classification
+- **Language:** English
+- **License:** MIT
+- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
+## Intended Uses & Limitations
+This model is intended for classifying English text into one of the following subreddits:
+- r/aitah
+- r/buildapc
+- r/dating_advice
+- r/legaladvice
+- r/minecraft
+- r/nostupidquestions
+- r/pcmasterrace
+- r/relationship_advice
+- r/techsupport
+- r/teenagers
+### Limitations
+- The model was trained on English text only
+- Performance may vary for texts that are significantly different from the training data
+- The model always assigns one of the 10 target subreddits, so predictions for text that doesn't clearly belong to any of them may be unreliable
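A softmax classifier returns one of its known labels for any input, so one possible mitigation for out-of-scope text is to reject low-confidence predictions. The sketch below is a minimal, standard-library illustration of that idea. The `predict_with_threshold` helper, the 0.5 threshold, and the `"unknown"` fallback are assumptions made for illustration; they are not part of the released model, and a suitable threshold would need to be tuned on held-out data.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, id2label, threshold=0.5):
    """Return (label, confidence), or ("unknown", confidence) below the threshold.

    `threshold` and the "unknown" fallback are illustrative assumptions.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        return "unknown", confidence
    return id2label[best], confidence

# Hypothetical logits for a 3-class toy example (the real model has 10 classes)
toy_labels = {0: "buildapc", 1: "techsupport", 2: "minecraft"}
confident = predict_with_threshold([0.1, 4.0, 0.2], toy_labels)
uncertain = predict_with_threshold([1.0, 1.1, 0.9], toy_labels)
```

With real model output, the logits for a single input could be passed in as a plain list, for example `outputs.logits[0].tolist()`.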
+## Usage
+Here's how to use the model:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+import torch.nn.functional as F
+# Load model and tokenizer
+model_name = "marcoallanda/SubRoBERTa"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Example text
+text = "My computer won't turn on, what should I do?"
+# Tokenize input
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+# Run inference
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+    probs = F.softmax(logits, dim=-1)
+    pred_id = torch.argmax(probs, dim=-1).item()
+    pred_label = model.config.id2label[pred_id]
+print(f"Predicted subreddit: {pred_label}")
+```
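To inspect more than the single top prediction, the class probabilities can be ranked. The following is a small sketch, assuming the probabilities have been converted to a plain Python list (for instance with `probs[0].tolist()`); the `top_k_labels` helper and the toy values are hypothetical.

```python
def top_k_labels(probs, id2label, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:k]]

# Toy probabilities standing in for `probs[0].tolist()` from the snippet above
toy_id2label = {0: "buildapc", 1: "pcmasterrace", 2: "techsupport"}
toy_probs = [0.25, 0.05, 0.70]
top2 = top_k_labels(toy_probs, toy_id2label, k=2)
```

Looking at the runner-up labels can be useful for closely related subreddits such as r/buildapc, r/pcmasterrace, and r/techsupport, where posts may plausibly fit more than one class.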
+## Training and Evaluation Data
+The model was trained on posts from the 10 target subreddits. The data was divided into training and evaluation sets using an 80/20 split.
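The exact splitting procedure is not documented here. As one illustration of an 80/20 split, the sketch below shuffles the examples with a fixed seed and holds out 20% for evaluation. The function name, the seed, and the use of a plain random shuffle (rather than, say, a per-subreddit stratified split) are all assumptions.

```python
import random

def train_eval_split(examples, eval_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical (text, label_id) pairs standing in for real posts
posts = [(f"post {i}", i % 10) for i in range(100)]
train_set, eval_set = train_eval_split(posts)
```

Fixing the seed makes the split reproducible, so the evaluation set stays the same across training runs.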
+### Training Procedure
+- **Training regime:** Fine-tuning
+- **Learning rate:** 2e-5
+- **Number of epochs:** 10
+- **Batch size:** 128
+- **Optimizer:** AdamW
+- **Mixed precision:** FP16
+### Training Results
+The model was evaluated using accuracy and macro-averaged F1. The best model was selected based on its macro-averaged F1 score.
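For reference, both metrics can be written out from their definitions. This is a from-scratch sketch for illustration only; it is not the evaluation code used for this model, and the function names and toy labels are hypothetical. In practice, a library implementation such as scikit-learn's `f1_score` with `average="macro"` would typically be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels for illustration
y_true = ["buildapc", "buildapc", "techsupport", "minecraft"]
y_pred = ["buildapc", "techsupport", "techsupport", "minecraft"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging weights every subreddit equally regardless of how many posts it contributes, which is one reason it can be a more informative selection criterion than accuracy when classes are imbalanced.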
+## Environmental Impact
+- **Hardware Type:** GPU
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{SubRoBERTa,
+  author = {Marco Allanda},
+  title = {SubRoBERTa: Reddit Subreddit Classification Model},
+  year = {2024},
+  publisher = {Hugging Face},
+  journal = {Hugging Face Hub},
+  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
+}
+```
+## Contact
+For questions or feedback, please open an issue on the [GitHub repository](https://github.com/marcoallanda/NLP-Project) or contact me through Hugging Face.