do_lower_case=True does not seem to work

#2
by thiagotps - opened

I'm testing version v1.1.1 with the following code:

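from transformers import AutoTokenizer

# Load the tokenizer at the v1.1.1 tag, explicitly requesting lowercasing
_tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", revision="v1.1.1", trust_remote_code=True, do_lower_case=True
)
_tokens = _tokenizer("Gallia est omnis divisa in partes tres.", return_tensors='pt')
_token_ids = _tokens['input_ids'][0]
# Map the ids back to their subword strings for inspection
_token_texts = _tokenizer.convert_ids_to_tokens(_token_ids)
_token_texts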

and the result is:

[
  "[CLS]",
  "\\",
  "71",
  ";",
  "allia",
  "_",
  "\\",
  "32",
  ";_",
  "est_",
  "\\",
  "32",
  ";_",
  "omnis_",
  "\\",
  "32",
  ";_",
  "divisa_",
  "\\",
  "32",
  ";_",
  "in_",
  "\\",
  "32",
  ";_",
  "partes_",
  "\\",
  "32",
  ";_",
  "tres_",
  "._",
  "[SEP]"
]

It seems the lower() method is still not being applied internally: the capital G in Gallia was escaped by the tokenizer (the pieces "\", "71", ";", where 71 is the code point of "G") rather than lowercased.
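To make the escaped pieces easier to read, here is a minimal sketch that joins a few of the tokens from the output and expands the escapes. It assumes the escape scheme is a backslash, a decimal code point, and a semicolon, which is only what the output suggests; `decode_escapes` is a hypothetical helper, not part of the tokenizer.

```python
import re

# A few pieces copied from the output above (special tokens omitted).
tokens = ["\\", "71", ";", "allia", "_", "\\", "32", ";_", "est_"]

def decode_escapes(pieces):
    # Join the subword pieces, then replace each backslash + decimal number
    # + semicolon with the character that has that code point.
    # Assumption: this mirrors the tokenizer's escape format.
    text = "".join(pieces)
    return re.sub(r"\\(\d+);", lambda m: chr(int(m.group(1))), text)

decoded = decode_escapes(tokens)
print(repr(decoded))
```

Under that assumption, the escapes decode to the capital "G" (71) and the space character (32), so neither seems to be handled by the tokenizer's normal preprocessing.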

LatinCy org

Thank you for posting the issue. I have been able to replicate this behavior. It turned out to be a packaging error, not a code or model error, so I am going to force-update the v1.1.1 tag. The original snippet should now work, even if you do not explicitly pass do_lower_case=True, since it is the config default. Let me know if this works on your end, and thanks again for the report.
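Because the tag was force-updated, a locally cached copy of the old files may still be picked up. The following is a rough sketch of one way to re-check, assuming that passing `force_download=True` to `from_pretrained` is enough to refresh the cached files; `looks_escaped` and `reload_and_check` are hypothetical helper names introduced here for illustration.

```python
def looks_escaped(tokens):
    # True if a lone backslash piece is directly followed by an all-digit
    # piece, which is the pattern seen in the output reported above.
    return any(a == "\\" and b.isdigit() for a, b in zip(tokens, tokens[1:]))

def reload_and_check(text="Gallia est omnis divisa in partes tres."):
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer

    # Assumption: force_download=True bypasses any stale cached files.
    tokenizer = AutoTokenizer.from_pretrained(
        "latincy/latin-bert",
        revision="v1.1.1",
        trust_remote_code=True,
        force_download=True,
    )
    ids = tokenizer(text)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return tokens, looks_escaped(tokens)
```

Calling `reload_and_check()` needs network access; if the second returned value is still `True`, the escaped pieces are still present.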

diyclassics changed discussion status to closed
