Are there any special tokens formatted as '<PERSON>', '<LOC>' in the training set or fine-tuning set?

#47

by tingxinli - opened Aug 5, 2023

edited Aug 5, 2023

We fine-tuned BLOOMZ on a custom translation task, and it works surprisingly well on text masked with entity labels such as '<PERSON>' and '<LOC>'. However, when we changed the labels slightly to the form '<PERSON_id>' (to distinguish between different entities), performance dropped dramatically. We therefore suspect that labels like '<PERSON>' receive some special treatment during pretraining or multitask fine-tuning. Is our guess correct? If not, what else could explain this? Thanks!

Muennighoff

BigScience Workshop org Aug 5, 2023

All of the model's special tokens are listed here: https://huggingface.co/bigscience/bloomz/blob/main/special_tokens_map.json
and they do not include such tokens.

I imagine that things like <PERSON> may appear naturally somewhere in the datasets, but we did not add them on purpose, at least not to the finetuning data.
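
For anyone who wants to run the same check on their own labels, here is a minimal sketch. It assumes you have a local copy of special_tokens_map.json; the sample contents, the helper name `collect_special_tokens`, and the label list are illustrative assumptions, not taken from the actual file or from the fine-tuning setup described above.

```python
import json


def collect_special_tokens(special_tokens_map):
    """Flatten a special_tokens_map.json-style dict into a set of token strings."""
    tokens = set()
    for value in special_tokens_map.values():
        # A value is either a single entry or a list of entries.
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            # An entry is either a plain string or a dict with a "content" field.
            if isinstance(entry, dict):
                tokens.add(entry["content"])
            else:
                tokens.add(entry)
    return tokens


# Assumed sample contents, for illustration only. In practice, load the real file:
#   with open("special_tokens_map.json") as f:
#       special_tokens_map = json.load(f)
sample_json = (
    '{"bos_token": "<s>", "eos_token": "</s>", '
    '"pad_token": "<pad>", "unk_token": "<unk>"}'
)
special_tokens = collect_special_tokens(json.loads(sample_json))

# Hypothetical entity labels to look up.
entity_labels = ["<PERSON>", "<LOC>", "<PERSON_id>"]
registered = {label: label in special_tokens for label in entity_labels}
print(registered)
```

A label that is not registered as a special token gets split by the ordinary tokenizer like any other text, so '<PERSON>' and '<PERSON_id>' may end up as quite different subword sequences. That could be one place to look when comparing the two label formats, though it is only a guess.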

tingxinli

Aug 6, 2023

We guess so. Thanks for your reply!

tingxinli changed discussion status to closed Aug 6, 2023
