Tagging texts which are longer than 512 tokens.

#4
by EmilA - opened

I am using your tool to tag call texts from the National Institutes of Health (NIH) with MeSH terms, e.g., Part II, Section I of https://grants.nih.gov/grants/guide/rfa-files/RFA-RM-09-020.html. The issue is that the call texts are often longer than the 512 tokens permitted by the model. Is this something you have handled yourself, or would you have any idea how to handle it?

Simple truncation is not really an option. The paper the model is based on (https://pubmed.ncbi.nlm.nih.gov/32976559/) describes ideas for concatenating sections of papers to handle more than 512 tokens, but I cannot implement those, as it would require redeveloping your model.

@EmilA one option is just to split the document into 512-token chunks, run inference on each chunk, and then concatenate the predictions from the various chunks. I believe that's how we did it in the past.
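A minimal sketch of that chunking approach, independent of any particular model: split the token sequence into windows that fit the 512-token limit, tag each window separately, and merge the predicted labels. The `tag_chunk` callable here is a hypothetical stand-in for the actual MeSH tagging model's inference call, and the overlap (`stride`) parameter is an assumption, added to reduce the chance of splitting a relevant phrase at a chunk boundary.

```python
def chunk_tokens(tokens, max_len=512, stride=512):
    """Yield successive windows of at most `max_len` tokens.

    With stride < max_len the windows overlap, which helps avoid
    cutting a relevant phrase in half at a chunk boundary.
    """
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + max_len]
        if chunk:
            yield chunk
        if start + max_len >= len(tokens):
            break

def tag_document(tokens, tag_chunk, max_len=512, stride=384):
    """Run `tag_chunk` on every window and merge the predicted labels.

    `tag_chunk` is a placeholder for the model's per-chunk inference;
    for document-level tagging, taking the union of per-chunk labels
    is one simple way to combine the results.
    """
    labels = set()
    for chunk in chunk_tokens(tokens, max_len=max_len, stride=stride):
        labels.update(tag_chunk(chunk))
    return sorted(labels)
```

Note that `tokens` should be the model tokenizer's output (subword tokens), not whitespace-split words, since the 512 limit applies to subword tokens, and each chunk still needs the model's special tokens (e.g. `[CLS]`/`[SEP]`) added before inference.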