do_lower_case=True not seeming to work

by thiagotps - opened May 4

I'm testing version v1.1.1 with the following code:

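from transformers import AutoTokenizer

# Load the tokenizer pinned to the v1.1.1 tag, explicitly requesting lowercasing.
_tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", revision="v1.1.1", trust_remote_code=True, do_lower_case=True
)
_tokens = _tokenizer("Gallia est omnis divisa in partes tres.", return_tensors='pt')
_token_ids = _tokens['input_ids'][0]
# Map the ids back to token strings to inspect how the input was split.
_token_texts = _tokenizer.convert_ids_to_tokens(_token_ids)
_token_texts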

and the result is:

[
  "[CLS]",
  "\\",
  "71",
  ";",
  "allia",
  "_",
  "\\",
  "32",
  ";_",
  "est_",
  "\\",
  "32",
  ";_",
  "omnis_",
  "\\",
  "32",
  ";_",
  "divisa_",
  "\\",
  "32",
  ";_",
  "in_",
  "\\",
  "32",
  ";_",
  "partes_",
  "\\",
  "32",
  ";_",
  "tres_",
  "._",
  "[SEP]"
]

It seems that lowercasing is still not being applied internally: the capital G in Gallia was escaped by the tokenizer (as \71;, where 71 is the character code of "G") instead of being lowercased.
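To make the escaped pieces easier to read, the numeric sequences in the output can be decoded back into characters. This is only a minimal sketch: it assumes that the `\NN;` form encodes a decimal character code (an inference from the output above, not something confirmed by the tokenizer's documentation), and `decode_escapes` is a hypothetical helper, not part of the library.

```python
import re


def decode_escapes(text: str) -> str:
    """Replace each \\NN; sequence with the character whose decimal code is NN.

    Hypothetical helper; the escape format is assumed from the output above.
    """
    return re.sub(r"\\(\d+);", lambda m: chr(int(m.group(1))), text)


# First few tokens copied from the output above (special tokens omitted).
tokens = ["\\", "71", ";", "allia", "_", "\\", "32", ";_", "est_"]

joined = "".join(tokens)
decoded = decode_escapes(joined)

print(decoded)                              # Gallia_ _est_
print(any(ch.isupper() for ch in decoded))  # True: the capital G survived
```

Under that assumption, 71 decodes to "G" and 32 to a space, so the capital letter reached the vocabulary lookup unchanged rather than as "g".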

diyclassics

LatinCy org May 4

Thank you for posting the issue. I have been able to replicate this behavior. It turned out to be a packaging error, not a code or model error, so I am going to force-update the v1.1.1 tag. The original snippet should now work, even if you do not explicitly pass do_lower_case=True, since it is the config default. Let me know if this works on your end, and thanks again for the report.

diyclassics changed discussion status to closed May 4
