| # Server infrastructure | |
| The [dataset viewer](https://github.com/huggingface/dataset-viewer) has two main components that work together to answer queries about a dataset instantly: | |
| - a user-facing web API for exploring and returning information about a dataset | |
| - a server that runs the queries ahead of time and caches the results in a database | |
| While most of the documentation is focused on the web API, the server is crucial because it performs all the time-consuming preprocessing and stores the results so the web API can retrieve and serve them to the user. This saves a user time because instead of generating the response every time it gets requested, the dataset viewer can return the preprocessed results instantly from the cache. | |
| There are three elements that keep the server running: the job queue, workers, and the cache. | |
| ## Job queue | |
| The job queue is a list of jobs, stored in a Mongo database, that the workers should complete. The jobs are practically identical to the endpoints the user calls; the difference is that the server runs the jobs ahead of time, and the user gets the results when they call the endpoint. | |
| There are three jobs: | |
| - `/splits` corresponds to the `/splits` endpoint. It refreshes a dataset and then returns that dataset's splits and subsets. For every split in the dataset, it'll create a new job. | |
| - `/first-rows` corresponds to the `/first-rows` endpoint. It gets the first 100 rows of a dataset split, together with the split's columns. | |
| - `/parquet` corresponds to the `/parquet` endpoint. It downloads the whole dataset, converts it to [parquet](https://parquet.apache.org/) and publishes the parquet files to the Hub. | |
| You might've noticed the `/rows` and `/search` endpoints don't have a job in the queue. The responses from these endpoints are generated on demand. | |
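The fan-out described for `/splits` can be sketched in a few lines. This is only an illustration, not the dataset viewer's code: an in-memory `deque` stands in for the Mongo-backed queue, `Job`, `fake_list_splits`, `run_job` and `drain` are hypothetical names, and the sketch assumes that the job created for each split is a `/first-rows` job.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; the real queue stores jobs in a Mongo database.
    kind: str  # "/splits", "/first-rows" or "/parquet"
    dataset: str
    split: Optional[str] = None


def fake_list_splits(dataset: str) -> List[str]:
    # Stand-in for the real preprocessing step; the split names are made up.
    return ["train", "test"]


def run_job(job: Job, queue: deque) -> None:
    # A "/splits" job fans out: it enqueues one follow-up job per split.
    # Assumption: the follow-up job is a "/first-rows" job.
    if job.kind == "/splits":
        for split in fake_list_splits(job.dataset):
            queue.append(Job("/first-rows", job.dataset, split))


def drain(queue: deque) -> List[Job]:
    # Process jobs in FIFO order and return them in the order they ran.
    done = []
    while queue:
        job = queue.popleft()
        run_job(job, queue)
        done.append(job)
    return done


queue = deque([Job("/splits", "my-dataset")])
processed = drain(queue)
```

Starting from a single `/splits` job, the queue grows by one job per discovered split, so the work for each split is scheduled without the user having to request it.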
| ## Workers | |
| Workers are responsible for executing the jobs in the queue. They complete the actual preprocessing requests, such as getting a list of splits and subsets. The workers can be controlled by configurable environment variables, like the minimum or the maximum number of rows returned by a worker or the maximum number of jobs to start per dataset user or organization. | |
| Take a look at the [workers configuration](https://github.com/huggingface/dataset-viewer/tree/main/services/worker#configuration) for a complete list of the environment variables if you're interested in learning more. | |
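As a rough illustration of configuration through environment variables, the sketch below reads three limits and applies the row limits to a requested row count. The variable names (`EXAMPLE_ROWS_MIN_NUMBER`, `EXAMPLE_ROWS_MAX_NUMBER`, `EXAMPLE_MAX_JOBS_PER_NAMESPACE`), their defaults, and the clamping behavior are all assumptions made for this example; the real names and semantics are in the workers configuration linked above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    rows_min_number: int
    rows_max_number: int
    max_jobs_per_namespace: int


def load_config(env=os.environ) -> WorkerConfig:
    # Variable names and defaults are hypothetical, not the real worker settings.
    config = WorkerConfig(
        rows_min_number=int(env.get("EXAMPLE_ROWS_MIN_NUMBER", "10")),
        rows_max_number=int(env.get("EXAMPLE_ROWS_MAX_NUMBER", "100")),
        max_jobs_per_namespace=int(env.get("EXAMPLE_MAX_JOBS_PER_NAMESPACE", "1")),
    )
    if config.rows_min_number > config.rows_max_number:
        raise ValueError("minimum number of rows exceeds the maximum")
    return config


def clamp_row_count(requested: int, config: WorkerConfig) -> int:
    # Keep the number of returned rows between the configured bounds.
    return max(config.rows_min_number, min(requested, config.rows_max_number))


# Passing a plain dict instead of os.environ keeps the example reproducible.
config = load_config({"EXAMPLE_ROWS_MAX_NUMBER": "50"})
```

Reading every limit in one place and validating it at startup means a misconfigured worker fails immediately instead of partway through a job.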
| ## Cache | |
| Once the workers complete a job, the results are stored (or _cached_) in a Mongo database. When a user makes a request with an endpoint like `/first-rows`, the dataset viewer retrieves the preprocessed response from the cache and serves it to the user. This eliminates the time a user would've waited if the server hadn't already completed the job and stored the response. | |
| As a result, users can get their requested information about a dataset (even a large one) nearly instantaneously! | |
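The write-then-read flow above can be sketched with a dictionary standing in for the Mongo-backed cache. Everything here is hypothetical and chosen for illustration: the `ResponseCache` class, the `(endpoint, dataset, split)` key, and the shape of the value returned by `handle_request` are not taken from the dataset viewer.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical cache key: (endpoint, dataset, split).
CacheKey = Tuple[str, str, Optional[str]]


class ResponseCache:
    # In-memory stand-in for the database that holds preprocessed responses.
    def __init__(self) -> None:
        self._entries: Dict[CacheKey, Any] = {}

    def store(self, endpoint: str, dataset: str, split: Optional[str], response: Any) -> None:
        # Called by a worker once it has finished a job.
        self._entries[(endpoint, dataset, split)] = response

    def get(self, endpoint: str, dataset: str, split: Optional[str] = None) -> Any:
        # Returns None when no worker has stored a response for this key yet.
        return self._entries.get((endpoint, dataset, split))


def handle_request(cache: ResponseCache, endpoint: str, dataset: str,
                   split: Optional[str] = None) -> Dict[str, Any]:
    # The API side only reads from the cache; it never recomputes the response.
    cached = cache.get(endpoint, dataset, split)
    return {"ready": cached is not None, "response": cached}


cache = ResponseCache()
cache.store("/first-rows", "my-dataset", "train", {"rows": [1, 2]})
hit = handle_request(cache, "/first-rows", "my-dataset", "train")
miss = handle_request(cache, "/first-rows", "my-dataset", "test")
```

Because the request handler performs a single lookup, its response time does not depend on the size of the dataset; the expensive work has already been done by a worker.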