omzn committed d845a71 (parent 06d0354): Create README.md

---
license: cc-by-sa-4.0
language: ja
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
---
# Facemark Detection

This model classifies a given text as a facemark (1) or not (0).

This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

## Model description

This model classifies a given text as a facemark (1) or not (0).

## Intended uses & limitations

Extract a facemark-prone portion of text and pass that text to the model.
Facemark candidates can be extracted with a regular expression, but such matches usually include many non-facemarks.

For example, I used the following Perl regex pattern to extract facemark-prone text.

```perl
use strict;
use warnings;
use utf8;

my $input_text = "facemark-prone text";  # placeholder input

# Character classes: ordinary text, everything else, characters allowed
# around a face, half-width kana, and opening/closing brackets.
my $text = '[0-9A-Za-zぁ-ヶ一-龠]';
my $non_text = '[^0-9A-Za-zぁ-ヶ一-龠]';
my $allow_text = '[ovっつ゜ニノ三二]';
my $hw_kana = '[ヲ-゚]';
my $open_bracket = '[\(∩꒰(]';
my $close_bracket = '[\)∩꒱)]';

# A face is 3-8 bracketed characters that are not mostly ordinary text.
my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char = $around_face . $open_bracket . $face . $close_bracket . $around_face;

my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```
Examples of facemarks are:
```
(^U^)←
。\n\n⊂( *・ω・ )⊃
っ(。>﹏<)
タカ( ˘ω' ) ヤスゥ…
。(’↑▽↑)
……💰( ˘ω˘ )💰
ーーー(*´꒳`*)!(
…(o:∇:o)
!!…(;´Д`)?
(*´﹃ `*)✿
```
Examples of non-facemarks are:
```
(3,000円)
: (1/3)
(@nVApO)
(10/7~)
?<<「ニャア(しゃーねぇな)」プイッ
(残り 51字)
(-0.1602)
(25-0)
(コーヒー飲んだ)
(※軽トラ)
```

This model is intended for facemark-prone text like the examples above.

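For reference, the Perl pattern above can be translated to Python almost verbatim. This is a minimal sketch (the `extract_facemark` helper name is my own, not part of the model). Note that the pattern deliberately over-matches: non-facemarks such as `(3,000円)` come through as well, and filtering those out is exactly what the classifier is for.

```python
import re

# Same character classes as the Perl version above.
TEXT = r"[0-9A-Za-zぁ-ヶ一-龠]"
NON_TEXT = r"[^0-9A-Za-zぁ-ヶ一-龠]"
ALLOW_TEXT = r"[ovっつ゜ニノ三二]"
HW_KANA = r"[\uFF66-\uFF9F]"  # half-width katakana, mirrors Perl $hw_kana
OPEN_BRACKET = r"[\(∩꒰(]"
CLOSE_BRACKET = r"[\)∩꒱)]"

AROUND_FACE = f"(?:{NON_TEXT}|{ALLOW_TEXT})*"
FACE = f"(?!(?:{TEXT}|{HW_KANA}){{3,8}}).{{3,8}}"
FACE_RE = re.compile(f"({AROUND_FACE}{OPEN_BRACKET}{FACE}{CLOSE_BRACKET}{AROUND_FACE})")

def extract_facemark(text):
    """Return the first facemark-prone span in text, or None."""
    m = FACE_RE.search(text)
    return m.group(1) if m else None
```

The extracted span (facemark or not) is then what gets passed to the model for classification.
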
## Training and evaluation data

The facemark data was collected manually and automatically from Twitter timelines.

* train.csv : 35591 samples (29911 facemark, 5680 non-facemark)
* test.csv : 3954 samples (3315 facemark, 639 non-facemark)

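These CSV files are consumed directly by `run_glue.py`. As a minimal sketch of the expected shape (the `sentence` column name is an assumption; `run_glue.py` requires a `label` column and treats the remaining column as input text), such a file could be written like this:

```python
import csv

# Hypothetical example rows: label 1 = facemark, 0 = non-facemark.
rows = [
    ("(^U^)←", 1),
    ("(*´﹃ `*)✿", 1),
    ("(3,000円)", 0),
    ("(残り 51字)", 0),
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence", "label"])  # header row: text column + label
    writer.writerows(rows)
```
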
## Training procedure

```bash
python ./examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
  --do_train --do_eval \
  --max_seq_length=128 --per_device_train_batch_size=32 \
  --use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \
  --output_dir=facemark_classify \
  --save_steps=1000 --save_total_limit=3 \
  --train_file=train.csv \
  --validation_file=test.csv
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0

### Training results

It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2