Training cut-off date?

#2
by mapto - opened

Hello, thanks for sharing such an important effort.
I'm doing research on historical language models. That's why the cut-off date is extremely important for me.
You describe your data as going up to the 16th century, yet the first row of your table lists CC100 with a range extending to the 18th century. Could you please elaborate on what appears to be a contradiction?
Thanks in advance!
Martin

Hi there, that's just because CC100 doesn't come with any metadata, and we're not entirely sure we've filtered out all the Neo-Latin texts. This means an undetermined number of 17th–18th century texts may have been absorbed by the model, along with some fake Latin or short Greek quotations from scholastic authors. As you know, cleaning raw crawled text can be exhausting! The dates for all the other corpora are well established, or at least have ante quem and post quem dates.
