Amazing work @tomaarsen, thanks for everything you do for open source!
May I ask a few questions about the recipe?
- If I understand correctly, you are mixing the LightOn hard-negatives data (with the stratified sampling from Jang et al.) with the broader lightonai/embeddings-pre-training data, which (I think) doesn't include mined negatives. Is that right?
- Did you try several mixes of pretraining data and resampled hard negatives?
- In ST, you advise discarding ArguAna and Touché when using Nanobeir13. I guess it works best with those in the end?
Best,