Massive Text Embedding Benchmark

non-profit

https://github.com/embeddings-benchmark

embeddings-benchmark

Activity Feed

AI & ML interests

Massive Text Embeddings Benchmark

Recent Activity

Samoed updated a dataset 2 days ago

mteb/AESDD-fixed

Samoed published a dataset 2 days ago

mteb/AESDD-fixed

Samoed updated a dataset 19 days ago

mteb/STSBenchmarkv2


Papers

MVEB: Massive Video Embedding Benchmark

MAEB: Massive Audio Embedding Benchmark



Samoed

published a dataset 19 days ago

mteb/STSBenchmarkv2

Viewer • Updated 19 days ago • 8.6k • 95

Samoed

in mteb/leaderboard 25 days ago

Question: can this run on mobile?

#190 opened 28 days ago by

3morixd

mmhamdy

posted an update 25 days ago

Post

158

Decades before the modern scaling laws, this paper showed that neural network behavior under scale follows remarkably predictable laws.

In 1993, researchers at Bell Labs were grappling with a constraint that feels entirely familiar (and contemporary): datasets were outgrowing the available hardware, and training a model to completion was becoming too expensive. Evaluating an architectural tweak to a state-of-the-art model (at the time, LeNet) on 60,000 samples meant burning up to three weeks of compute time.

To save compute, people would train candidate architectures on small subsets of the data, assuming that the top performer at small scale would remain the top performer at full scale. With the benefit of hindsight, we know this assumption does not hold.

In "Learning Curves: Asymptotic Values and Rate of Convergence" (NeurIPS 1993), the authors drew on insights from statistical mechanics to propose a practical and principled method for predicting the performance of classifiers trained on large datasets (at the time, models were assumed to be large enough). The method was based on simple power-law models of the expected training and test errors.
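To make the power-law idea concrete, here is a minimal sketch of how such an extrapolation could work. It assumes the test error behaves like a + b / n^alpha and the training error like a - c / n^alpha, with a shared asymptote a and a shared exponent alpha; that parameterization, the function names, and the synthetic numbers are illustrative assumptions, not the paper's exact procedure or data.

```python
import numpy as np

def fit_learning_curves(sizes, test_err, train_err):
    """Fit a, b, c, alpha under the assumed model
    test_err = a + b * n**(-alpha), train_err = a - c * n**(-alpha).
    Requires test_err > train_err at every size."""
    sizes = np.asarray(sizes, dtype=float)
    test_err = np.asarray(test_err, dtype=float)
    train_err = np.asarray(train_err, dtype=float)

    # The gap (b + c) * n**(-alpha) is a straight line in log-log space.
    slope, intercept = np.polyfit(np.log(sizes), np.log(test_err - train_err), 1)
    alpha = -slope
    b_plus_c = np.exp(intercept)

    # The sum 2a + (b - c) * n**(-alpha) is linear in n**(-alpha).
    b_minus_c, two_a = np.polyfit(sizes ** (-alpha), test_err + train_err, 1)

    a = two_a / 2
    b = (b_plus_c + b_minus_c) / 2
    c = (b_plus_c - b_minus_c) / 2
    return a, b, c, alpha

def predict_test_error(n, a, b, alpha):
    """Extrapolate the fitted test-error curve to a training set of size n."""
    return a + b * n ** (-alpha)

# Synthetic, noise-free measurements on small subsets (made-up values).
true_a, true_b, true_c, true_alpha = 0.05, 2.0, 0.8, 0.5
sizes = np.array([500, 1000, 2000, 4000, 8000])
test_err = true_a + true_b * sizes ** (-true_alpha)
train_err = true_a - true_c * sizes ** (-true_alpha)

a_hat, b_hat, c_hat, alpha_hat = fit_learning_curves(sizes, test_err, train_err)
predicted_full = predict_test_error(60_000, a_hat, b_hat, alpha_hat)
print(f"asymptotic error: {a_hat:.4f}, exponent: {alpha_hat:.3f}")
print(f"predicted test error at 60,000 samples: {predicted_full:.4f}")
```

The point of the sketch is the ranking argument: two architectures can swap places between small and large n if their fitted asymptotes and exponents differ, which is why comparing them only on a small subset can mislead.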

It is often noted that many of today's breakthroughs in AI and deep learning are actually decades-old concepts that simply lacked the computational power to be tested at the time. While there is some truth to that, it highlights a more valuable lesson: there is immense worth in revisiting early literature and reflecting on foundational ideas we may have prematurely left behind.

So, go explore and find your own inspiration. The current trend has enough champions already!

2 replies

AyushM6

in mteb/leaderboard 27 days ago

Question: can this run on mobile?

#190 opened 28 days ago by

3morixd

mmhamdy

posted an update 30 days ago

Post

299

It has been more than a decade now since the knowledge distillation paper came out.

Knowledge Distillation (KD) is one of my favorite topics, but I have to confess that I'm not a huge fan of the term because I find it confusing (or at least, it has become so over time).

The idea behind KD is not novel; it was there almost a decade before the paper came out (and arguably even a decade before that, back to 1990-91). But this paper is the one that clicked, the one that made the topic much more popular and introduced it to a broader audience.

First, the timing and the authors played a big role: we have Geoffrey Hinton, Oriol Vinyals, and Jeff Dean here. And second, Geoffrey Hinton is really good at idea branding: Model compression?! No, no, no! Let's call it "Knowledge Distillation" and use evocative terms such as "Dark Knowledge" to describe what is being transferred.

It's a great name, but as time has passed, the term has become a bit of a relic. KD is no longer solely about compression (KD used to be introduced as a method for model compression, but now model compression is just one application of KD). The other thing is that the word "distillation" implies some sort of potency, that the student is somehow more powerful than the teacher, which is not the case (though many counterarguments could be made; for example, the student may be more powerful than the same model trained with no teacher).

Nevertheless, the paper is incredibly well-written, short, and fun to read. It's one of the few papers that I have read several times. Check it out, and maybe share your thoughts on the topic with us here!

If you had to choose another name for Knowledge Distillation, what would it be?