---
license: cc-by-sa-4.0
language: ja
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
widget:
- text: "💪(^ω^ 🍤)"
  example_title: "Facemark 1"
- text: "(੭ु∂∀6)੭ु⁾⁾ ஐ•*¨*•.¸¸"
  example_title: "Facemark 2"
- text: ":-P"
  example_title: "Facemark 3"
- text: "(o.o)"
  example_title: "Facemark 4"
- text: "(10/7~)"
  example_title: "Non-facemark 1"
- text: "??<<「ニャア(しゃーねぇな)」プイッ"
  example_title: "Non-facemark 2"
- text: "(0.01)"
  example_title: "Non-facemark 3"
---

# Facemark Detection

This model classifies a given text as a facemark (1) or not (0).

This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

## Model description

This model classifies a given text as a facemark (1) or not (0).
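
A minimal inference sketch with the 🤗 Transformers `pipeline` API is shown below. The repository id is a placeholder for wherever this model is hosted, and the Japanese tokenizer of the base model typically needs `fugashi` and `ipadic` installed.

```python
from transformers import pipeline

# Placeholder repository id: replace it with the actual name of this model repo.
classifier = pipeline("text-classification", model="your-namespace/facemark-detection")

# The card defines label 1 as "facemark" and label 0 as "not a facemark";
# the raw pipeline output uses the model's own label names (e.g. LABEL_0 / LABEL_1).
print(classifier("💪(^ω^ 🍤)"))   # expected to be classified as a facemark
print(classifier("(3,000円)"))    # expected to be classified as a non-facemark
```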

## Intended uses & limitations

Extract a facemark-prone portion of text and pass that portion to the model.
Candidate facemarks can be extracted with a regex, but such extraction usually also picks up many non-facemarks.

For example, I used the following Perl regex pattern to extract facemark-prone text.

```perl
# Input to test; replace with the actual facemark-prone text.
my $input_text = "facemark prone text";

# Character classes used to build the pattern.
my $text = '[0-9A-Za-zぁ-ヶ一-龠]';       # alphanumerics, kana, kanji
my $non_text = '[^0-9A-Za-zぁ-ヶ一-龠]';  # everything else
my $allow_text = '[ovっつ゜ニノ三二]';     # ordinary characters allowed around a face
my $hw_kana = '[ヲ-゚]';                  # half-width katakana
my $open_branket = '[\(∩꒰(]';
my $close_branket = '[\)∩꒱)]';

# A candidate is a bracketed span of 3-8 characters that does not start with a
# run of ordinary text or half-width kana, plus optional decoration around it.
my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char = $around_face . $open_branket . $face . $close_branket . $around_face;

# Capture the first facemark-prone span, if any.
my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```

Examples of facemarks are:
```
(^U^)←
。\n\n⊂( *・ω・ )⊃
っ(。>﹏<)
タカ( ˘ω' ) ヤスゥ…
。(’↑▽↑)
……💰( ˘ω˘ )💰
ーーー(*´꒳`*)!(
…(o:∇:o)
!!…(;´Д`)?
(*´﹃ `*)✿
```

Examples of non-facemarks are:
```
(3,000円)
: (1/3)
(@nVApO)
(10/7~)
?<<「ニャア(しゃーねぇな)」プイッ
(残り 51字)
(-0.1602)
(25-0)
(コーヒー飲んだ)
(※軽トラ)
```

This model is intended to be used on facemark-prone text like the examples above.
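
As a rough illustration of that workflow, the hypothetical sketch below pulls bracket-delimited candidates out of a longer text with a deliberately simplified Python regex (not a faithful port of the Perl pattern above) and classifies each candidate; the model id is again a placeholder.

```python
import re
from transformers import pipeline

# Placeholder model id; replace with the actual repository name of this model.
classifier = pipeline("text-classification", model="your-namespace/facemark-detection")

# A deliberately simplified candidate pattern: any short parenthesized span
# plus a little surrounding context. The Perl pattern above is stricter.
CANDIDATE = re.compile(r".{0,3}[(（][^()（）]{1,10}[)）].{0,3}")

def detect_facemarks(text: str):
    """Return (candidate, label, score) for every facemark-prone span found."""
    results = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        pred = classifier(candidate)[0]
        results.append((candidate, pred["label"], pred["score"]))
    return results

print(detect_facemarks("タカ( ˘ω' ) ヤスゥ… (3,000円)"))
```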

## Training and evaluation data

Facemark data was collected from Twitter timelines, both manually and automatically.

* train.csv : 35591 samples (29911 facemark, 5680 non-facemark)
* test.csv : 3954 samples (3315 facemark, 639 non-facemark)
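
The exact column layout of the CSV files is not documented here; run_glue.py accepts a single text column plus a `label` column, so loading and inspecting the data might look roughly like this (the `text` column name is an assumption):

```python
from datasets import load_dataset

# Assumed layout: one text column plus an integer "label" column
# (1 = facemark, 0 = non-facemark); the real column names are not documented in this card.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "test.csv"})

print(dataset["train"].num_rows, dataset["validation"].num_rows)  # 35591 and 3954
print(dataset["train"][0])  # e.g. {"text": "(^U^)←", "label": 1}
```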

## Training procedure

The model was fine-tuned with the 🤗 Transformers run_glue.py example script:

```bash
python ./examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
  --do_train --do_eval \
  --max_seq_length=128 --per_device_train_batch_size=32 \
  --use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \
  --output_dir=facemark_classify \
  --save_steps=1000 --save_total_limit=3 \
  --train_file=train.csv \
  --validation_file=test.csv
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0

### Training results

It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2
|