Are you sure you limited this to only 1899?

#3
by Nafnlaus - opened

Random example of generated content: "mechanisation remains a calling for men and women alike". The issue isn't the "men and women alike" part. "Mechanisation" was an exceedingly rare word in the Victorian era, only taking off in the 1920s. I've searched through the corpus up to 1899: it occurs only twice, both of them duplicates, from the 1840s. Now, it's possible the model coined the word from "mechanise" - a word one would find occasionally in the corpus. But "mechanise" had a different meaning back then: it meant "to make a person like a machine" (e.g. enslave, zombify), not "to replace human labour with that of machines". So even if it had coined the word from "mechanise", it seems very unlikely that it would further decide mechanisation was a "calling". "Mechanist" was a calling (certainly not for women at the time), but it seems very unlikely the model would coin "mechanisation" from "mechanist" and decide it's a calling. I know it's quite a primitive model and very prone to confabulation, but in this example (and others) its behaviour seems... suspicious.
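For what it's worth, the kind of corpus search I did can be sketched roughly like this (the corpus format here is a stand-in - I'm just assuming an iterable of (year, text) pairs, which is not how the actual British Library dump is laid out):

```python
import re
from collections import Counter

def decade_counts(corpus, word):
    """Count whole-word occurrences of `word` per decade.

    `corpus` is any iterable of (publication_year, full_text) tuples;
    this is an illustrative stand-in, not the real dataset format.
    """
    counts = Counter()
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    for year, text in corpus:
        hits = len(pattern.findall(text))
        if hits:
            counts[(year // 10) * 10] += hits
    return dict(counts)

# Toy data mirroring the observation: "mechanisation" appears only in the 1840s,
# while "mechanise" (different sense) appears later.
toy = [
    (1843, "the mechanisation of labour"),
    (1845, "on mechanisation again"),
    (1881, "to mechanise a man is to enslave him"),
]
print(decade_counts(toy, "mechanisation"))  # {1840: 2}
```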

Short answer: the model is pretrained on British Library books from 1837 to 1899, but instruct-tuned on modern "synthetic" data to teach it to do turn-based conversation and recognize modern input. That might be where/how it's generating some of what you're noticing. I'm hoping future iterations will clean that up. For the long answer, check out the narrative documentation here: https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/

That would explain it, and unfortunately I don't think that's what most people who have been sharing this model thought it was. Thanks for clearing that up though.

May I suggest an approach for getting QA pairs for finetuning: have an existing (lightweight) model scan the dataset for dialogue between two individuals and extract the dialogue into QA pairs (with varying numbers of QA entries in the context)? That should be an easy task even for a model running on a cheap gaming GPU.
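Just to sketch the shape of the extraction I mean - in practice you'd want the lightweight model to judge whether the quotes really are a two-person exchange, but a crude heuristic version (pairing up alternating quoted utterances, with a bit of preceding history as context) might look like:

```python
import re

def extract_qa_pairs(text, max_history=2):
    """Pair up alternating quoted utterances from a passage.

    Assumption: the quotes form a two-person exchange in order. A real
    pipeline would have a small model verify this instead of assuming it.
    """
    quotes = re.findall(r'"([^"]+)"', text)
    pairs = []
    for i in range(0, len(quotes) - 1, 2):
        # Carry up to `max_history` earlier turns as conversational context.
        history = [(p["question"], p["answer"]) for p in pairs[-max_history:]]
        pairs.append({
            "question": quotes[i],
            "answer": quotes[i + 1],
            "history": history,
        })
    return pairs

passage = ('"Where are you going?" asked Tom. "To the market." said Ann. '
           '"Will you be long?" "No more than an hour."')
for qa in extract_qa_pairs(passage):
    print(qa["question"], "->", qa["answer"])
```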

It's a very interesting project though (even though such a "pure" Victorian model will always be fundamentally weak), and there really do seem to be two directions to go, both of which have their own merits. At least with Victorian texts you have the option to go for a "fully pure" model, like you're working on. But once you get to, say, ancient Latin or Greek texts, let alone other less common ancient languages, there's not enough data to train a foundation (and even a foundation from Victorian texts is pretty weak). I've been thinking a lot about how best to do a non-fully-pure model. I'm torn between (A) synthetic data created from a fully pure model (e.g. the model asked to expound upon / derive implications of chunks of preexisting texts, or given bullet-pointed topics / facts and asked to write about them in its own voice), vs. (B) abliterating modern knowledge and vocabulary from an existing high-quality model and then fine-tuning on historic data (or the other way around, or perhaps historic data -> abliteration -> historic data).
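For option (B), the core linear-algebra step of abliteration is just projecting an unwanted direction out of a layer's weights. The hard part - estimating a "modern knowledge" direction, e.g. from mean activation differences between modern and period text - is assumed away here; this is only a sketch of the removal step:

```python
import numpy as np

def ablate_direction(W, d):
    """Remove the component along direction `d` from each row of `W`.

    After this, the layer can no longer write along `d`. Estimating a
    meaningful `d` from activations is the real work and is not shown.
    """
    d = d / np.linalg.norm(d)          # unit direction
    return W - np.outer(W @ d, d)      # subtract each row's projection onto d

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))            # toy weight matrix
d = rng.normal(size=8)                 # toy concept direction
W2 = ablate_direction(W, d)

# Rows of W2 are now orthogonal to d.
print(np.allclose(W2 @ (d / np.linalg.norm(d)), 0))  # True
```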

Anyway, neat stuff, and thanks :)

(I also find this project interesting not just from the perspective of virtually resurrecting past cultures, but also because it's an interesting research challenge in how efficiently one can learn from a limited dataset - a problem with great implications for model training in general.)
