do_lower_case=True not seeming to work
#2
by thiagotps - opened
I'm testing version v1.1.1 with the following code:
from transformers import AutoTokenizer

# Load the tokenizer pinned to the v1.1.1 tag, explicitly requesting lowercasing
_tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", revision="v1.1.1", trust_remote_code=True, do_lower_case=True
)
_tokens = _tokenizer("Gallia est omnis divisa in partes tres.", return_tensors='pt')
_token_ids = _tokens['input_ids'][0]
# Map the ids back to token strings to inspect how the text was split
_token_texts = _tokenizer.convert_ids_to_tokens(_token_ids)
_token_texts
and the result is:
[
"[CLS]",
"\\",
"71",
";",
"allia",
"_",
"\\",
"32",
";_",
"est_",
"\\",
"32",
";_",
"omnis_",
"\\",
"32",
";_",
"divisa_",
"\\",
"32",
";_",
"in_",
"\\",
"32",
";_",
"partes_",
"\\",
"32",
";_",
"tres_",
"._",
"[SEP]"
]
It seems like the lower() method is still not being applied internally: the capital G in Gallia was escaped by the tokenizer (as \71;) instead of being lowercased.
Thank you for posting the issue. I have been able to replicate this behavior. This turned out to be a packaging error, not a code/model error, so I am going to force-update the v1.1.1 tag. The original snippet should now work, even if you do not explicitly pass do_lower_case=True, since it is the config default. Let me know if this works on your end, and thanks again for the report.
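For anyone who wants a quick check after re-downloading the tag, here is a minimal sketch that scans a token list for escaped uppercase letters. It assumes the \NN; sequences in the output above are decimal character codes (71 is G, 32 is a space), which is inferred from this thread rather than from the tokenizer's documentation. The helper name escaped_uppercase is hypothetical and not part of the package.

```python
import re

# Assumption: the tokenizer escapes some characters as a backslash, a decimal
# character code, and a semicolon (e.g. \71; for "G"), as seen in this thread.
ESCAPE = re.compile(r"\\(\d+);")


def escaped_uppercase(tokens):
    """Return the uppercase characters that show up as \\NN; escapes."""
    # Join first, because one escape can be split across several tokens.
    text = "".join(tokens)
    chars = [chr(int(code)) for code in ESCAPE.findall(text)]
    return [c for c in chars if c.isupper()]


# Abbreviated token list from the report above
tokens = ["[CLS]", "\\", "71", ";", "allia", "_", "\\", "32", ";_", "est_", "[SEP]"]
print(escaped_uppercase(tokens))  # ['G']
```

If lowercasing is applied, the list should be empty for this sentence.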
diyclassics changed discussion status to closed