<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Whole Word Mask Language Model

These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
need extra processing on your datasets.

The following examples will run on datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.

The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
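
If you want to see what whole word masking does before launching a training run, here is a minimal sketch (not part
of the example scripts) using `DataCollatorForWholeWordMask`, the collator `run_mlm_wwm.py` relies on. The model name
and sentence are only illustrative, and the collator expects a BERT-style WordPiece tokenizer that marks sub-words
with `##`:

```python
# Minimal sketch: DataCollatorForWholeWordMask selects whole words for masking,
# so all WordPiece tokens of a chosen word are masked together instead of
# individual tokens being masked independently.
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# "masking" splits into ["mask", "##ing"]; with whole word masking either both
# pieces are selected for masking or neither is.
ids = tokenizer("Whole word masking groups subword pieces together.")["input_ids"]
batch = collator([{"input_ids": ids}])
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
print(batch["labels"][0])  # original ids at masked positions, -100 elsewhere
```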

To fine-tune a model using whole word masking, use the following script:

```bash
python run_mlm_wwm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm-wwm
```

For Chinese models, we need to generate a reference file (which requires the `ltp` library), because Chinese text is
tokenized at the character level.

**Q :** Why a reference file?

**A :** Suppose we have a Chinese sentence like `我喜欢你`. The original Chinese-BERT tokenizer will tokenize it as
`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. To apply whole word masking, we need a result
like `['我','喜','##欢','你']`, so we need a reference file to tell the script which positions of the original BERT
tokens should get a `##` prefix.
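
The idea is simple enough to sketch in a few lines (illustrative only, not the actual implementation in
`run_chinese_ref.py`): given the character-level BERT tokens and word-level segments such as LTP produces, every
non-initial character of a multi-character word receives the `##` prefix.

```python
# Illustrative sketch: mark every non-initial character of a multi-character
# word with the "##" prefix, given character tokens and word segments.
def mark_subword_positions(bert_tokens, words):
    marked, i = list(bert_tokens), 0
    for word in words:
        for j in range(len(word)):
            if j > 0:  # non-initial character of this word
                marked[i + j] = "##" + marked[i + j]
        i += len(word)
    return marked

print(mark_subword_positions(["我", "喜", "欢", "你"], ["我", "喜欢", "你"]))
# ['我', '喜', '##欢', '你']
```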

**Q :** Why LTP?

**A :** Because the best-known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
It works well on many Chinese tasks such as CLUE (the Chinese GLUE). They use LTP, so if we want to fine-tune their
model, we need LTP.

To generate the reference file, you could run the following:

```bash
export TRAIN_FILE=/path/to/train/file
export LTP_RESOURCE=/path/to/ltp/tokenizer
export BERT_RESOURCE=/path/to/bert/tokenizer
export SAVE_PATH=/path/to/data/ref.txt

python run_chinese_ref.py \
    --file_name=$TRAIN_FILE \
    --ltp=$LTP_RESOURCE \
    --bert=$BERT_RESOURCE \
    --save_path=$SAVE_PATH
```

Then you can run the script like this:

```bash
export TRAIN_FILE=/path/to/train/file
export VALIDATION_FILE=/path/to/validation/file
export TRAIN_REF_FILE=/path/to/train/chinese_ref/file
export VALIDATION_REF_FILE=/path/to/validation/chinese_ref/file
export OUTPUT_DIR=/tmp/test-mlm-wwm

python run_mlm_wwm.py \
    --model_name_or_path roberta-base \
    --train_file $TRAIN_FILE \
    --validation_file $VALIDATION_FILE \
    --train_ref_file $TRAIN_REF_FILE \
    --validation_ref_file $VALIDATION_REF_FILE \
    --do_train \
    --do_eval \
    --output_dir $OUTPUT_DIR
```

**Note1:** On TPU, you should use the flag `--pad_to_max_length` to make sure all your batches have the same length.

**Note2:** If you have any questions or something goes wrong when running this code, don't hesitate to ping @wlhgtc.