token limit - warning

by msperka - opened Nov 5, 2023

edited Nov 5, 2023

Hi,
As mentioned in Issue #2:
"the model can't handle inputs of longer than 512 tokens"

Is there a warning I can get when I exceed the limit?
I split the text into sentences, and in most cases they are well within the limit, but there are exceptions. Is there any way to flag these exceptions before I run the "dictabert-morph" model?
Maybe by running only the tokenizer (without the morphology), so that if I reach 512 tokens I know I probably need to split the sentence before running the morph model?

Shaltiel

DICTA: The Israel Center for Text Analysis org Nov 5, 2023

Right now, the code automatically truncates any sentence that exceeds 512 tokens.
A good solution would be to run the tokenizer on its own and check whether the number of tokens exceeds 512.
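
For anyone landing here later, a minimal sketch of that check. It assumes the tokenizer follows the usual Hugging Face calling convention (it returns a mapping with `input_ids`) and that the 512 limit includes the special tokens. The names `flag_overlong` and `load_tokenizer` are hypothetical helpers and not part of the model's interface. The Hub id `dicta-il/dictabert-morph` is also an assumption, so adjust it to the id you actually load.

```python
MAX_TOKENS = 512  # limit quoted in this thread


def flag_overlong(sentences, tokenizer, max_tokens=MAX_TOKENS):
    """Return (index, token_count) for each sentence that exceeds max_tokens."""
    flagged = []
    for i, sentence in enumerate(sentences):
        # truncation=False keeps the full length visible; special tokens
        # are included on the assumption that they count toward the limit.
        encoded = tokenizer(sentence, add_special_tokens=True, truncation=False)
        n_tokens = len(encoded["input_ids"])
        if n_tokens > max_tokens:
            flagged.append((i, n_tokens))
    return flagged


def load_tokenizer(model_id="dicta-il/dictabert-morph"):
    """Hypothetical helper: load only the tokenizer (model id is an assumption)."""
    from transformers import AutoTokenizer  # imported lazily; needs `transformers`
    return AutoTokenizer.from_pretrained(model_id)
```

Calling `flag_overlong(sentences, load_tokenizer())` returns the positions of the sentences that would be truncated, so they can be split further before being passed to the morph model.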

Alternatively, if you would prefer a solution that needs to be added to the interface, feel free to make the modifications and open a PR. We welcome contributions :)

msperka changed discussion status to closed Nov 21, 2023
