---
license: mit
tags:
- roberta
- text-classification
- reddit
- nlp
- sequence-classification
- pytorch
- transformers
model-index:
- name: SubRoBERTa
  results: []
language:
- en
metrics:
- accuracy
base_model:
- FacebookAI/roberta-base
---

# SubRoBERTa: Reddit Subreddit Classification Model

This model is a fine-tuned RoBERTa-base model for classifying text into 10 different subreddits. It was trained on a dataset of posts from these subreddits to predict which subreddit a given text belongs to.

## Model Description

- **Model type:** RoBERTa-base fine-tuned for sequence classification
- **Language:** English
- **License:** MIT
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Intended Uses & Limitations

This model is intended to be used for classifying text into one of the following subreddits (the snippet after this list shows how to read the label mapping from the model config):
- r/aitah
- r/buildapc
- r/dating_advice
- r/legaladvice
- r/minecraft
- r/nostupidquestions
- r/pcmasterrace
- r/relationship_advice
- r/techsupport
- r/teenagers
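
The id-to-label mapping is stored in the model configuration and can be listed without downloading the full weights. A small sketch (assuming the uploaded config's `id2label` field is populated with the subreddit names above):

```python
from transformers import AutoConfig

# Load only the configuration and print the id-to-label mapping
config = AutoConfig.from_pretrained("marcoallanda/SubRoBERTa")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```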

### Limitations

- The model was trained on English text only
- Performance may vary for texts that are significantly different from the training data
- The model may not perform well on texts that don't clearly belong to any of the target subreddits

## Usage

Here's how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and take the most likely class
logits = outputs.logits
probs = F.softmax(logits, dim=-1)
pred_id = torch.argmax(probs, dim=-1).item()
pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```
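
To see how confident the model is across classes, the snippet above can be extended (reusing its `probs` tensor and `model`) to print the three most likely subreddits:

```python
# Top-3 predictions with their probabilities (continues the snippet above)
top_probs, top_ids = torch.topk(probs, k=3, dim=-1)
for p, i in zip(top_probs[0].tolist(), top_ids[0].tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```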

## Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits. The data was split into training and evaluation sets using an 80/20 ratio.
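
The dataset itself is not published with this card. A split of this kind is typically produced along the following lines (a sketch assuming the posts live in a local CSV with `text` and integer-encoded `label` columns; the file name is illustrative):

```python
from datasets import load_dataset

# Hypothetical local file of Reddit posts with "text" and "label" columns
dataset = load_dataset("csv", data_files="reddit_posts.csv")["train"]

# 80/20 train/evaluation split
split = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = split["train"], split["test"]
```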

### Training Procedure

The main settings (mirrored in the sketch after this list) were:

- **Training regime:** Fine-tuning
- **Learning rate:** 2e-5
- **Number of epochs:** 10
- **Batch size:** 128
- **Optimizer:** AdamW
- **Mixed precision:** FP16
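
The training script is not included in this repository; a minimal sketch of how these settings map onto the Hugging Face `Trainer` API, continuing from the `train_ds`/`eval_ds` split above and using a `compute_metrics` function like the one in the next section, could look like this:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Base checkpoint with a 10-way classification head
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=10
)

# Tokenize the raw text column from the split above
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="subroberta",            # illustrative output path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    fp16=True,                          # mixed precision; requires a GPU
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",   # best checkpoint chosen by F1-macro
)

# AdamW is the Trainer's default optimizer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,    # defined under "Training Results" below
)
trainer.train()
```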

### Training Results

The model was evaluated using accuracy and F1-macro scores. The best model was selected based on the F1-macro score.
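
The evaluation code is not part of this card; a `compute_metrics` function matching these metrics (accuracy plus macro-averaged F1, computed with scikit-learn) would look roughly like this and plugs into the `Trainer` sketch above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and macro-averaged F1 from Trainer predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }
```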

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```