omzn committed d845a71 (parent 06d0354): Create README.md

---
license: cc-by-sa-4.0
language: ja
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
---
# Facemark Detection

This model classifies a given text as a facemark (1) or not (0).

This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

## Model description

This model classifies a given text as a facemark (1) or not (0).

## Intended uses & limitations

Extract a facemark-prone portion of text and pass that text to the model.
Facemark candidates can be extracted with a regular expression, but such matches usually include many non-facemarks.

For example, I used the following Perl regex pattern to extract facemark-prone text.

```perl
use strict;
use warnings;
use utf8;

my $input_text = "facemark-prone text";  # placeholder input

# Character classes: ordinary text, everything else, characters allowed
# around a face, half-width kana, and opening/closing brackets.
my $text = '[0-9A-Za-zぁ-ヶ一-龠]';
my $non_text = '[^0-9A-Za-zぁ-ヶ一-龠]';
my $allow_text = '[ovっつ゜ニノ三二]';
my $hw_kana = '[ヲ-゚]';
my $open_bracket = '[\(∩꒰(]';
my $close_bracket = '[\)∩꒱)]';

# A face is 3-8 bracketed characters that are not mostly ordinary text.
my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char = $around_face . $open_bracket . $face . $close_bracket . $around_face;

my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```
Examples of facemarks are:
```
(^U^)←
。\n\n⊂( *・ω・ )⊃
っ(。>﹏<)
タカ( ˘ω' ) ヤスゥ…
。(’↑▽↑)
……💰( ˘ω˘ )💰
ーーー(*´꒳`*)!(
…(o:∇:o)
!!…(;´Д`)?
(*´﹃ `*)✿
```
Examples of non-facemarks are:
```
(3,000円)
: (1/3)
(@nVApO)
(10/7~)
?<<「ニャア(しゃーねぇな)」プイッ
(残り 51字)
(-0.1602)
(25-0)
(コーヒー飲んだ)
(※軽トラ)
```

This model is intended for facemark-prone text like the examples above.

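For reference, the Perl pattern above can be translated to Python almost verbatim. This is a minimal sketch (the `extract_facemark` helper name is my own, not part of the model). Note that the pattern deliberately over-matches: non-facemarks such as `(3,000円)` come through as well, and filtering those out is exactly what the classifier is for.

```python
import re

# Same character classes as the Perl version above.
TEXT = r"[0-9A-Za-zぁ-ヶ一-龠]"
NON_TEXT = r"[^0-9A-Za-zぁ-ヶ一-龠]"
ALLOW_TEXT = r"[ovっつ゜ニノ三二]"
HW_KANA = r"[\uFF66-\uFF9F]"  # half-width katakana, mirrors Perl $hw_kana
OPEN_BRACKET = r"[\(∩꒰(]"
CLOSE_BRACKET = r"[\)∩꒱)]"

AROUND_FACE = f"(?:{NON_TEXT}|{ALLOW_TEXT})*"
FACE = f"(?!(?:{TEXT}|{HW_KANA}){{3,8}}).{{3,8}}"
FACE_RE = re.compile(f"({AROUND_FACE}{OPEN_BRACKET}{FACE}{CLOSE_BRACKET}{AROUND_FACE})")

def extract_facemark(text):
    """Return the first facemark-prone span in text, or None."""
    m = FACE_RE.search(text)
    return m.group(1) if m else None
```

The extracted span (facemark or not) is then what gets passed to the model for classification.
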
## Training and evaluation data

The facemark data was collected manually and automatically from Twitter timelines.

* train.csv : 35591 samples (29911 facemark, 5680 non-facemark)
* test.csv : 3954 samples (3315 facemark, 639 non-facemark)

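These CSV files are consumed directly by `run_glue.py`. As a minimal sketch of the expected shape (the `sentence` column name is an assumption; `run_glue.py` requires a `label` column and treats the remaining column as input text), such a file could be written like this:

```python
import csv

# Hypothetical example rows: label 1 = facemark, 0 = non-facemark.
rows = [
    ("(^U^)←", 1),
    ("(*´﹃ `*)✿", 1),
    ("(3,000円)", 0),
    ("(残り 51字)", 0),
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence", "label"])  # header row: text column + label
    writer.writerows(rows)
```
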
## Training procedure

```bash
python ./examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
  --do_train --do_eval \
  --max_seq_length=128 --per_device_train_batch_size=32 \
  --use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \
  --output_dir=facemark_classify \
  --save_steps=1000 --save_total_limit=3 \
  --train_file=train.csv \
  --validation_file=test.csv
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0

### Training results

It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2