Why do we call "download counts" a noisy variable?
It can be tempting to look for the "true download count" for a Hub artifact on any given day, but in practice what counts as a download depends on many design choices and (sometimes unobserved) variables, such as:
- Whether HEAD, GET, and other request types are counted the same way
- How local cache hits for a model or dataset are counted
- IP variation for virtual machines without a fixed address
- How the Git/Xet backend is queried by the different download APIs (transformers, hub, datasets)
- Whether some downloads are identified as spam and how their removal is handled, proactively or retroactively
- VPN use
In particular, since most repositories contain multiple files, we need to decide which files to track. The default behavior on the Hub uses a library-dependent set of basic files, so models in different modalities, with different architectures, or with different quantization methods have different default sets of tracked files. As a result, repositories that use the Hub in less standard ways, for example by packaging an adapter with a base model or by hosting several quantization types in a single repository, are counted differently.
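To make the effect of the tracked-file choice concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the request log, the repository name, the file names, and the `count_downloads` helper are hypothetical and do not reproduce the Hub's actual counting logic. It only shows that the same traffic can yield different totals under different counting rules.

```python
from collections import Counter

# Hypothetical request log: (repo_id, filename, http_method).
# This is made-up data, not real Hub traffic.
REQUESTS = [
    ("org/base-with-adapter", "config.json", "GET"),
    ("org/base-with-adapter", "model.safetensors", "GET"),
    ("org/base-with-adapter", "adapter_config.json", "GET"),
    ("org/base-with-adapter", "adapter_config.json", "GET"),
    ("org/base-with-adapter", "config.json", "HEAD"),
]


def count_downloads(requests, tracked_files, methods=("GET",)):
    """Count requests whose filename is tracked and whose method is counted.

    Illustrative helper only; the rule used here is an assumption.
    """
    counts = Counter()
    for repo_id, filename, method in requests:
        if filename in tracked_files and method in methods:
            counts[repo_id] += 1
    return counts


# Three assumed counting rules applied to the same traffic.
by_config = count_downloads(REQUESTS, tracked_files={"config.json"})
by_adapter = count_downloads(REQUESTS, tracked_files={"adapter_config.json"})
with_head = count_downloads(
    REQUESTS, tracked_files={"config.json"}, methods=("GET", "HEAD")
)

for name, result in [
    ("config.json, GET only", by_config),
    ("adapter_config.json, GET only", by_adapter),
    ("config.json, GET + HEAD", with_head),
]:
    print(name, dict(result))
```

In this toy log, the repository's total changes depending on which file is tracked and on whether HEAD requests are included, even though the underlying traffic is identical.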
This is just one example of how the flexibility of the Hub can be at odds with having a single metric that is robust across research questions. We therefore recommend treating the counts available through existing datasets and the Hub API as a noisy variable in your experimental design, and accounting for that variability when presenting findings.
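One possible way to act on this recommendation is a simple sensitivity check: perturb the observed counts and see whether a conclusion survives. The sketch below is only an illustration under stated assumptions. The multiplicative log-normal noise model, the `rel_noise` value, and the `ranking_stability` helper are all hypothetical choices, not an estimate of the actual noise in Hub download counts.

```python
import random


def ranking_stability(count_a, count_b, rel_noise=0.3, trials=2000, seed=0):
    """Fraction of noisy resamples in which repo A still outranks repo B.

    Assumes independent multiplicative log-normal noise on each count;
    `rel_noise` is the standard deviation of the noise on the log scale.
    Both the noise model and its scale are assumptions for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    wins = 0
    for _ in range(trials):
        noisy_a = count_a * rng.lognormvariate(0.0, rel_noise)
        noisy_b = count_b * rng.lognormvariate(0.0, rel_noise)
        if noisy_a > noisy_b:
            wins += 1
    return wins / trials


# A large gap should be robust to this amount of noise;
# a small gap may not be.
print("large gap:", ranking_stability(1_000_000, 1_000))
print("small gap:", ranking_stability(12_000, 10_000))
```

If a comparison holds in nearly all resamples, it is less likely to be an artifact of how downloads were counted. If it flips often, the finding is better reported with that caveat, or under several plausible noise levels.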