Why do we call "download counts" a noisy variable?

by yjernite - opened Mar 17

Discussion

yjernite

Hugging Face Data for Research org Mar 17

edited Mar 20

It can be tempting to look for the "true download count" of a Hub artifact on any given day. In practice, what counts as a download depends on many choices and (sometimes unobserved) variables, such as:

Whether HEAD, GET, and other request types are counted the same way
How local cache hits for a model or dataset are counted
How IP addresses vary for virtual machines without a fixed address
How the Git/Xet backend is queried by the different download APIs (transformers, hub, datasets)
Whether some downloads are identified as spam, and whether they are removed proactively or retroactively
Whether a VPN is in use
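
To make the first few points concrete, here is a minimal sketch of how the same request log can yield different "download counts" depending on which request types are counted and whether requests are deduplicated by IP. The log, the `Request` fields, and the `count_downloads` helper are all hypothetical and only illustrate the idea; they do not describe how the Hub actually counts.

```python
from collections import namedtuple

# Hypothetical request log: the fields and values are invented for illustration.
Request = namedtuple("Request", ["method", "ip"])

LOG = [
    Request("HEAD", "10.0.0.1"),
    Request("GET", "10.0.0.1"),
    Request("GET", "10.0.0.1"),
    Request("GET", "10.0.0.2"),
    Request("HEAD", "10.0.0.3"),
]


def count_downloads(log, methods=("GET",), dedupe_by_ip=False):
    """Count requests as downloads under one particular set of choices."""
    kept = [r for r in log if r.method in methods]
    if dedupe_by_ip:
        # Count each distinct IP address once.
        return len({r.ip for r in kept})
    return len(kept)


counts = {
    "GET only": count_downloads(LOG),
    "GET and HEAD": count_downloads(LOG, methods=("GET", "HEAD")),
    "GET only, one per IP": count_downloads(LOG, dedupe_by_ip=True),
    "GET and HEAD, one per IP": count_downloads(
        LOG, methods=("GET", "HEAD"), dedupe_by_ip=True
    ),
}

for rule, n in counts.items():
    print(f"{rule}: {n}")
```

The four rules give four different numbers for the same five requests, and none of them is obviously the "true" one.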

Since most repositories have multiple files, we also need to decide which files to keep track of. The default behavior on the Hub uses a library-dependent set of basic files, so models in different modalities, with different architectures, or with different quantization methods have different default sets of tracked files. As a result, repositories that use the Hub in less standard ways, for example by packaging an adapter with a base model or by keeping several quantization types in a single repository, would be counted differently.
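
As a rough illustration of the file-tracking point, the sketch below applies three different tracking rules to the same per-file request counts for a single repository. The file names, the counts, and the rules are assumptions made up for this example; the actual default sets depend on the library and are not reproduced here.

```python
# Hypothetical per-file request counts for one repository on one day.
FILE_REQUESTS = {
    "config.json": 120,
    "adapter_config.json": 45,
    "model-q4.gguf": 300,
    "model-q8.gguf": 80,
}

# Hypothetical tracking rules: each one names the files whose requests
# are counted as a download of the repository.
TRACKING_RULES = {
    "base model config": {"config.json"},
    "adapter config": {"adapter_config.json"},
    "any GGUF file": {"model-q4.gguf", "model-q8.gguf"},
}


def downloads_under(rule_files, file_requests):
    """Sum the requests for the files that a given rule tracks."""
    return sum(n for name, n in file_requests.items() if name in rule_files)


by_rule = {
    rule: downloads_under(files, FILE_REQUESTS)
    for rule, files in TRACKING_RULES.items()
}

for rule, n in by_rule.items():
    print(f"{rule}: {n}")
```

A repository that mixes an adapter, a base model, and several quantized files gets a very different count depending on which rule applies to it.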

This is just one example of how the flexibility of the Hub can be at odds with having a single metric that is robust across research questions. We therefore recommend treating the counts available through existing datasets and the Hub API as a noisy variable in your experimental design, and accounting for its variability when presenting findings.
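
One possible way to account for that variability is a simple sensitivity check: perturb the counts with random noise many times and report how often a conclusion survives. The sketch below does this for the claim "group B has a higher median download count than group A". The multiplicative log-normal noise model, the `sigma` value, and the `conclusion_stability` helper are assumptions chosen for illustration, not measured properties of Hub download counts.

```python
import math
import random
import statistics


def conclusion_stability(group_a, group_b, sigma=0.5, trials=1000, seed=0):
    """Fraction of noisy replays in which median(group_b) > median(group_a).

    Each count is multiplied by exp(N(0, sigma)). Both the noise model and
    the default sigma are illustrative assumptions.
    """
    rng = random.Random(seed)

    def perturb(counts):
        return [c * math.exp(rng.gauss(0.0, sigma)) for c in counts]

    holds = sum(
        statistics.median(perturb(group_b)) > statistics.median(perturb(group_a))
        for _ in range(trials)
    )
    return holds / trials


# Groups that differ by orders of magnitude versus groups that differ by a few percent.
well_separated = conclusion_stability([10, 20, 30], [10_000, 20_000, 30_000])
close = conclusion_stability([100, 120, 140], [110, 130, 150])

print(f"well separated groups: conclusion holds in {well_separated:.0%} of replays")
print(f"close groups: conclusion holds in {close:.0%} of replays")
```

A finding that only holds in a bare majority of replays is probably within the noise, while one that holds in nearly all of them is more likely to be robust to how downloads were counted.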

yjernite changed discussion status to closed Mar 20

yjernite changed discussion status to open Mar 20
