Commit 2942687 by Anyuhhh (verified; parent 0027b40): Create README.md
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- distilbert
- fine-tuned
- pytorch
datasets:
- cassieli226/cities-text-dataset
base_model: distilbert-base-uncased

model-index:
- name: hw2-text-distilbert
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: cassieli226/cities-text-dataset
      name: Cities Text Dataset
      split: test
    metrics:
    - type: accuracy
      value: 99.5
      name: Test Accuracy
    - type: f1
      value: 99.5
      name: Test F1 Score (Macro)
---

# DistilBERT Text Classification Model

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for text classification.

## Model Description

This is a fine-tuned DistilBERT model for binary text classification that labels English text as relating to either Pittsburgh or Shanghai. It reaches 99.5% accuracy on the held-out test set.

- **Model type:** Text Classification (Binary)
- **Language(s) (NLP):** English
- **Base model:** distilbert-base-uncased
- **Classes:** Pittsburgh, Shanghai

## Intended Uses & Limitations

### Intended Uses
- Binary classification of Pittsburgh- vs. Shanghai-related text
- City-based text categorization tasks
- Research and educational purposes in NLP and text classification

### Limitations
- Limited to English-language text
- Performance may vary on out-of-domain data
- Inputs longer than 256 tokens are truncated

## Training and Evaluation Data

### Training Data
- **Base dataset:** [cassieli226/cities-text-dataset](https://huggingface.co/datasets/cassieli226/cities-text-dataset)
- **Original dataset:** 100 samples (50 Pittsburgh, 50 Shanghai)
- **Data augmentation:** Applied to grow the dataset from 100 to 1,000 samples
- **Classes (augmented dataset):** Pittsburgh (507 samples), Shanghai (493 samples)
- **Train/test split:** 80/20 with stratified sampling (800 train, 200 test)
- **External validation:** The original 100 samples were used for additional validation

### Preprocessing
- Text tokenized with the DistilBERT tokenizer
- Maximum sequence length: 256 tokens
- Longer sequences truncated

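The card does not include the splitting code; the sketch below shows one way an 80/20 stratified split over the augmented class counts could look (function name and seed are illustrative, not taken from the original training script). With 507/493 samples per class, a 20% per-class fraction reproduces the 800/200 split reported above.

```python
import random

def stratified_split(samples, labels, test_frac=0.2, seed=42):
    """Split (sample, label) pairs so each class contributes the same test fraction."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, test = [], []
    for y, items in sorted(by_class.items()):
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)  # per-class test count
        test += [(s, y) for s in items[:n_test]]
        train += [(s, y) for s in items[n_test:]]
    return train, test

# Class sizes from the augmented dataset: 507 Pittsburgh, 493 Shanghai
texts = [f"text-{i}" for i in range(1000)]
labels = ["Pittsburgh"] * 507 + ["Shanghai"] * 493
train, test = stratified_split(texts, labels)
print(len(train), len(test))  # 800 200
```

Note that rounding the per-class fractions (507 → 101 test, 493 → 99 test) also matches the per-class test counts listed under Detailed Performance.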
## Training Procedure

### Training Hyperparameters
- **Learning rate:** 5e-5
- **Training batch size:** 16
- **Evaluation batch size:** 32
- **Number of epochs:** 4
- **Weight decay:** 0.01
- **Warmup ratio:** 0.1
- **LR scheduler:** Linear
- **Gradient accumulation steps:** 1
- **Mixed precision:** FP16 (when a GPU is available)

### Training Configuration
- **Optimizer:** AdamW (default)
- **Early stopping:** Enabled, patience of 2 epochs
- **Best model selection:** By macro F1 score
- **Evaluation strategy:** Every epoch
- **Save strategy:** Every epoch (best model only)

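The settings above map naturally onto `transformers`' `TrainingArguments` and `Trainer`. This is a sketch of that configuration, not the author's actual script: the output directory, metric key, and dataset variables are assumptions, and the `eval_strategy` argument is named `evaluation_strategy` in older `transformers` releases.

```python
import numpy as np
import torch
from transformers import (AutoModelForSequenceClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

def compute_metrics(eval_pred):
    """Accuracy and macro F1, as used for best-model selection."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = float((preds == labels).mean())
    f1s = []
    for c in np.unique(labels):
        tp = int(((preds == c) & (labels == c)).sum())
        fp = int(((preds == c) & (labels != c)).sum())
        fn = int(((preds != c) & (labels == c)).sum())
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return {"accuracy": acc, "f1": float(np.mean(f1s))}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="hw2-text-distilbert",   # checkpoint directory (name assumed)
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),     # FP16 only when a GPU is present
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keeps only the best checkpoint
    metric_for_best_model="f1",         # key returned by compute_metrics
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,        # tokenized splits, prepared elsewhere
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```

`EarlyStoppingCallback` with `load_best_model_at_end=True` implements both the early stopping and the best-model selection described above.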
## Evaluation

### Metrics
The model was evaluated using:
- **Accuracy:** Overall classification accuracy
- **F1 Score (Macro):** Macro-averaged F1 across both classes
- **Per-class accuracy:** Individual class performance

### Results
- **Test set:**
  - Accuracy: 99.5%
  - F1 Score (Macro): 99.5%
- **External validation (original 100 samples):**
  - Accuracy: 100.0%
  - F1 Score (Macro): 100.0%

### Detailed Performance
- **Pittsburgh class:** 99.01% accuracy (101 samples)
- **Shanghai class:** 100.0% accuracy (99 samples)
- **Confusion matrix:** 1 misclassification out of 200 test samples

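As a sanity check, these figures are mutually consistent: a single error among 101 Pittsburgh and 99 Shanghai test samples yields exactly the reported accuracy and macro F1. A small pure-Python re-computation (the specific confusion pattern is inferred from the counts above):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Test set: 101 Pittsburgh, 99 Shanghai; one Pittsburgh sample predicted as Shanghai
y_true = ["Pittsburgh"] * 101 + ["Shanghai"] * 99
y_pred = ["Pittsburgh"] * 100 + ["Shanghai"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                   # 0.995
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")   # 0.995
```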
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")
```