---
license: cc-by-sa-4.0
language: ja
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
widget:
- text: "💪(^ω^ 🍤)"
  example_title: "Facemark 1"
- text: "(੭ु∂∀6)੭ु⁾⁾ ஐ•*¨*•.¸¸"
  example_title: "Facemark 2"
- text: ":-P"
  example_title: "Facemark 3"
- text: "(o.o)"
  example_title: "Facemark 4"
- text: "(10/7~)"
  example_title: "Non-facemark 1"
- text: "??<<「ニャア(しゃーねぇな)」プイッ"
  example_title: "Non-facemark 2"
- text: "(0.01)"
  example_title: "Non-facemark 3"
---

# Facemark Detection

This model classifies a given text as a facemark (1) or not (0).

This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

## Model description

This model classifies a given text as a facemark (1) or not (0).
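
A minimal inference sketch with the 🤗 Transformers `pipeline` API is shown below. The repository id is a placeholder for wherever this model is hosted, and the Japanese tokenizer of the base model typically needs `fugashi` and `ipadic` installed.

```python
from transformers import pipeline

# Placeholder repository id: replace it with the actual name of this model repo.
classifier = pipeline("text-classification", model="your-namespace/facemark-detection")

# The card defines label 1 as "facemark" and label 0 as "not a facemark";
# the raw pipeline output uses the model's own label names (e.g. LABEL_0 / LABEL_1).
print(classifier("💪(^ω^ 🍤)"))   # expected to be classified as a facemark
print(classifier("(3,000円)"))    # expected to be classified as a non-facemark
```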

## Intended uses & limitations

Extract a facemark-prone portion of text and pass that portion to the model.
Candidate facemarks can be extracted with a regex, but such extraction usually also picks up many non-facemarks.

For example, I used the following Perl regex pattern to extract facemark-prone text.

```perl
# Input to test; replace with the actual facemark-prone text.
my $input_text = "facemark prone text";

# Character classes used to build the pattern.
my $text = '[0-9A-Za-zぁ-ヶ一-龠]';       # alphanumerics, kana, kanji
my $non_text = '[^0-9A-Za-zぁ-ヶ一-龠]';  # everything else
my $allow_text = '[ovっつ゜ニノ三二]';     # ordinary characters allowed around a face
my $hw_kana = '[ヲ-゚]';                  # half-width katakana
my $open_branket = '[\(∩꒰(]';
my $close_branket = '[\)∩꒱)]';

# A candidate is a bracketed span of 3-8 characters that does not start with a
# run of ordinary text or half-width kana, plus optional decoration around it.
my $around_face = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char = $around_face . $open_branket . $face . $close_branket . $around_face;

# Capture the first facemark-prone span, if any.
my $facemark;
if ($input_text =~ /($face_char)/) {
    $facemark = $1;
}
```

Examples of facemarks are:
```
(^U^)←
。\n\n⊂( *・ω・ )⊃
っ(。>﹏<)
タカ( ˘ω' ) ヤスゥ…
。(’↑▽↑)
……💰( ˘ω˘ )💰
ーーー(*´꒳`*)!(
…(o:∇:o)
!!…(;´Д`)?
(*´﹃ `*)✿
```

Examples of non-facemarks are:
```
(3,000円)
: (1/3)
(@nVApO)
(10/7~)
?<<「ニャア(しゃーねぇな)」プイッ
(残り 51字)
(-0.1602)
(25-0)
(コーヒー飲んだ)
(※軽トラ)
```

This model is intended to be used on facemark-prone text like the examples above.
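
As a rough illustration of that workflow, the hypothetical sketch below pulls bracket-delimited candidates out of a longer text with a deliberately simplified Python regex (not a faithful port of the Perl pattern above) and classifies each candidate; the model id is again a placeholder.

```python
import re
from transformers import pipeline

# Placeholder model id; replace with the actual repository name of this model.
classifier = pipeline("text-classification", model="your-namespace/facemark-detection")

# A deliberately simplified candidate pattern: any short parenthesized span
# plus a little surrounding context. The Perl pattern above is stricter.
CANDIDATE = re.compile(r".{0,3}[(（][^()（）]{1,10}[)）].{0,3}")

def detect_facemarks(text: str):
    """Return (candidate, label, score) for every facemark-prone span found."""
    results = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        pred = classifier(candidate)[0]
        results.append((candidate, pred["label"], pred["score"]))
    return results

print(detect_facemarks("タカ( ˘ω' ) ヤスゥ… (3,000円)"))
```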

## Training and evaluation data

Facemark data was collected from Twitter timelines, both manually and automatically.

* train.csv : 35591 samples (29911 facemark, 5680 non-facemark)
* test.csv : 3954 samples (3315 facemark, 639 non-facemark)
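
The exact column layout of the CSV files is not documented here; run_glue.py accepts a single text column plus a `label` column, so loading and inspecting the data might look roughly like this (the `text` column name is an assumption):

```python
from datasets import load_dataset

# Assumed layout: one text column plus an integer "label" column
# (1 = facemark, 0 = non-facemark); the real column names are not documented in this card.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "test.csv"})

print(dataset["train"].num_rows, dataset["validation"].num_rows)  # 35591 and 3954
print(dataset["train"][0])  # e.g. {"text": "(^U^)←", "label": 1}
```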

## Training procedure

The model was fine-tuned with the 🤗 Transformers run_glue.py example script:

```bash
python ./examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
  --do_train --do_eval \
  --max_seq_length=128 --per_device_train_batch_size=32 \
  --use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50 \
  --output_dir=facemark_classify \
  --save_steps=1000 --save_total_limit=3 \
  --train_file=train.csv \
  --validation_file=test.csv
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0

### Training results

It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2
|