marcoallanda committed · verified · commit 63fa175 · parent: bc891aa

Update README.md documentation

Files changed (1): README.md (+106, -3)
# SubRoBERTa: Reddit Subreddit Classification Model

This model is a fine-tuned RoBERTa-base model for classifying text into 10 subreddits. It was trained on posts from those subreddits to predict which one a given text belongs to.

## Model Description

- **Model type:** RoBERTa-base fine-tuned for sequence classification
- **Language:** English
- **License:** MIT
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)

## Intended Uses & Limitations

This model is intended for classifying text into one of the following subreddits:

- r/aitah
- r/buildapc
- r/dating_advice
- r/legaladvice
- r/minecraft
- r/nostupidquestions
- r/pcmasterrace
- r/relationship_advice
- r/techsupport
- r/teenagers

### Limitations

- The model was trained on English text only
- Performance may vary for texts that are significantly different from the training data
- The model may not perform well on texts that don't clearly belong to any of the target subreddits

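One way to mitigate the last limitation is to abstain whenever the model's top class probability is low. A minimal sketch in plain Python (the 0.5 threshold and the labels shown are illustrative assumptions, not values shipped with the model):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, labels, threshold=0.5):
    """Return the top label, or None if the model is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None  # abstain: the text may not fit any target subreddit
    return labels[best]
```

The same thresholding can be applied to the probabilities produced in the Usage section below.
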
## Usage

Here's how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
    pred_id = torch.argmax(probs, dim=-1).item()
    pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```

## Training and Evaluation Data

The model was trained on a dataset of posts from the 10 target subreddits. The data was split into training and evaluation sets with an 80-20 split.

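An 80-20 split like this is often done per class, so each subreddit keeps the same share of posts in both sets. A hypothetical sketch of such a stratified split (function name and seed are mine, not taken from the training code):

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, train_frac=0.8, seed=42):
    """Split (example, label) pairs 80-20, preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)
    train, evaluation = [], []
    for label, items in by_label.items():
        rng.shuffle(items)  # shuffle within each class for an unbiased split
        cut = int(len(items) * train_frac)
        train.extend((ex, label) for ex in items[:cut])
        evaluation.extend((ex, label) for ex in items[cut:])
    return train, evaluation
```
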
### Training Procedure

- **Training regime:** Fine-tuning
- **Learning rate:** 2e-5
- **Number of epochs:** 10
- **Batch size:** 128
- **Optimizer:** AdamW
- **Mixed precision:** FP16

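These settings map naturally onto Hugging Face `transformers` training configuration; a sketch of the same values as a plain dict (the key names follow `TrainingArguments` conventions and are my assumption, not taken from the author's script):

```python
# Hyperparameters listed above, using TrainingArguments-style key names.
training_config = {
    "learning_rate": 2e-5,
    "num_train_epochs": 10,
    "per_device_train_batch_size": 128,
    "optim": "adamw_torch",  # AdamW optimizer
    "fp16": True,            # mixed-precision training
}
```
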
### Training Results

The model was evaluated using accuracy and F1-macro scores. The best model was selected based on the F1-macro score.

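F1-macro averages the per-class F1 scores, so each of the 10 subreddits counts equally regardless of how many posts it has. A self-contained sketch of the metric in plain Python (in practice a library such as scikit-learn's `f1_score(average="macro")` would typically be used):

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: mean of per-class F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```
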
## Environmental Impact

- **Hardware Type:** GPU
- **Hours used:** [Add your training time]
- **Cloud Provider:** [Add your cloud provider if applicable]
- **Compute Region:** [Add your region if applicable]
- **Carbon Emitted:** [Add if you have this information]

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```

## Contact

For questions or feedback, please open an issue on the [GitHub repository](https://github.com/marcoallanda/NLP-Project) or contact me through Hugging Face.