Why do we call "download counts" a noisy variable?

#1
by yjernite - opened
Hugging Face Data for Research org
edited 9 days ago

It can be tempting to look for the "true download count" for a Hub artifact on any given day, but in practice what counts as a download depends on many design choices and (sometimes unobserved) variables, such as:

  • Whether HEAD, GET, and other request types are counted the same way
  • How local cache hits for a model or dataset are counted
  • IP variation for virtual machines without a fixed address
  • How the Git/Xet backend is queried by the different download APIs (transformers, hub, datasets)
  • Whether some downloads are identified as spam and how their removal is handled, proactively or retroactively
  • VPN use
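
To make the effect of these choices concrete, here is a minimal sketch using a made-up request log. The log entries, the `Request` tuple, and the `count_downloads` helper are all hypothetical and are not how the Hub computes its numbers; the point is only that the same traffic yields different counts under different rules.

```python
from collections import namedtuple

Request = namedtuple("Request", ["ip", "method", "path"])

# Hypothetical request log for one repository on one day (made-up data).
LOG = [
    Request("10.0.0.1", "HEAD", "config.json"),
    Request("10.0.0.1", "GET", "config.json"),
    Request("10.0.0.1", "GET", "model.safetensors"),
    Request("10.0.0.2", "HEAD", "config.json"),  # e.g. a cache hit: only a HEAD check
    Request("10.0.0.3", "GET", "config.json"),   # e.g. a known VM coming back with a new IP
]


def count_downloads(log, methods, dedupe_by_ip):
    """Count requests to config.json under one particular (assumed) counting rule."""
    hits = [r for r in log if r.path == "config.json" and r.method in methods]
    if dedupe_by_ip:
        return len({r.ip for r in hits})
    return len(hits)


counts = {
    "GET only, every request": count_downloads(LOG, {"GET"}, False),
    "GET+HEAD, every request": count_downloads(LOG, {"GET", "HEAD"}, False),
    "GET+HEAD, one per IP": count_downloads(LOG, {"GET", "HEAD"}, True),
}

for rule, n in counts.items():
    print(f"{rule}: {n}")
```

In this toy log the three rules give 2, 4, and 3 "downloads" for the same day of traffic.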

In particular, since most repositories have multiple files, we need to decide on a strategy for which files to keep track of. The default behavior on the Hub uses a library-dependent set of basic files, so models in different modalities, with different architectures, or with different quantization methods have different default sets of tracked files. As a result, repositories that use the Hub in less standard ways, for example by packaging an adapter with a base model or by keeping several quantization types in a single repository, are counted differently from repositories that follow a more conventional layout.
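
As an illustration of how the tracked-file choice matters, the sketch below counts the same five download sessions against different tracked-file sets. The library names, file names, and sessions are invented for the example, and the actual per-library rules on the Hub may differ.

```python
# Hypothetical tracked-file sets; the real per-library rules on the Hub may differ.
TRACKED = {
    "library_a": {"config.json"},
    "library_b": {"adapter_config.json"},
}

# Made-up sets of files fetched in five separate sessions against one repository
# that bundles a base model, an adapter, and two quantized variants.
SESSIONS = [
    {"config.json", "model.safetensors"},
    {"config.json", "model-q4.gguf"},
    {"adapter_config.json", "adapter_model.safetensors"},
    {"model-q8.gguf"},
    {"config.json", "adapter_config.json", "adapter_model.safetensors"},
]


def count_sessions(sessions, tracked_files):
    """Count sessions that fetched at least one tracked file."""
    return sum(1 for files in sessions if files & tracked_files)


count_a = count_sessions(SESSIONS, TRACKED["library_a"])
count_b = count_sessions(SESSIONS, TRACKED["library_b"])
count_any = len(SESSIONS)  # alternative rule: every session counts

print(count_a, count_b, count_any)
```

Here the same activity is reported as 3, 2, or 5 downloads depending on which files are tracked, and the session that only fetched a quantized file is invisible under both tracked-file rules.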

This is just one example of the ways in which the flexibility of the Hub can be at odds with having a single metric that is robust across research questions. As a result, we recommend treating the counts available through existing datasets and the Hub API as a noisy variable in your experimental design and accounting for its variability when presenting findings.
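
One possible way to account for that variability is a simple sensitivity check: perturb the counts and see how often a conclusion survives. The sketch below is only an illustration. The uniform multiplicative noise model, the 30% noise level, the function names, and the example counts are all assumptions chosen for the example, not an estimate of the actual noise in Hub download counts.

```python
import random
import statistics


def perturb(count, rel_noise, rng):
    """Scale a count by a random factor in [1 - rel_noise, 1 + rel_noise] (assumed noise model)."""
    return count * rng.uniform(1 - rel_noise, 1 + rel_noise)


def share_of_runs_a_above_b(group_a, group_b, rel_noise=0.3, n_runs=1000, seed=0):
    """Fraction of perturbed runs in which group A's median count exceeds group B's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_runs):
        med_a = statistics.median(perturb(c, rel_noise, rng) for c in group_a)
        med_b = statistics.median(perturb(c, rel_noise, rng) for c in group_b)
        wins += med_a > med_b
    return wins / n_runs


# Made-up download counts for two groups of repositories.
clearly_apart = share_of_runs_a_above_b([1000, 1200, 900], [100, 120, 90])
close_together = share_of_runs_a_above_b([100, 110, 120], [105, 108, 115])

print(f"clearly separated groups: A above B in {clearly_apart:.0%} of runs")
print(f"nearly equal groups:      A above B in {close_together:.0%} of runs")
```

A comparison that holds in nearly all perturbed runs is reasonably robust to this kind of noise; one that flips in a large share of runs should be reported with that caveat.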

