	# Server infrastructure

	The [dataset viewer](https://github.com/huggingface/dataset-viewer) has two main components that work together to return queries about a dataset instantly:

	- a user-facing web API for exploring and returning information about a dataset
	- a server that runs the queries ahead of time and caches the results in a database

	While most of the documentation focuses on the web API, the server is crucial because it performs all the time-consuming preprocessing and stores the results so the web API can retrieve and serve them to the user. This saves the user time: instead of generating a response each time it is requested, the dataset viewer returns the preprocessed result instantly from the cache.

	There are three elements that keep the server running: the job queue, workers, and the cache.

	## Job queue

	The job queue is a list of jobs, stored in a Mongo database, that the workers should complete. The jobs closely mirror the endpoints the user calls; the difference is that the server runs the jobs ahead of time, so the results are already available when the user calls the endpoint.

	There are three jobs:

	- `/splits` corresponds to the `/splits` endpoint. It refreshes a dataset and then returns that dataset's splits and subsets. For every split in the dataset, it'll create a new job.
	- `/first-rows` corresponds to the `/first-rows` endpoint. It gets the first 100 rows of a dataset split, along with the split's columns.
	- `/parquet` corresponds to the `/parquet` endpoint. It downloads the whole dataset, converts it to [parquet](https://parquet.apache.org/) and publishes the parquet files to the Hub.

	You might've noticed the `/rows` and `/search` endpoints don't have a job in the queue. The responses from these endpoints are generated on demand.
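The way one job can create others is easier to see in code. The following is a minimal in-memory sketch, not the dataset viewer's implementation: the `Job` class, `list_splits`, and `run_job` are hypothetical names, the queue is a Python `deque` rather than a Mongo collection, and it assumes that the job created for each split is a `/first-rows` job.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    kind: str                    # "/splits", "/first-rows", or "/parquet"
    dataset: str
    split: Optional[str] = None  # only set for per-split jobs


def list_splits(dataset: str) -> list:
    # Hypothetical stand-in: a real worker would inspect the dataset itself.
    return ["train", "test"]


def run_job(job: Job, queue: deque) -> dict:
    """Run one job and enqueue any follow-up jobs it creates."""
    if job.kind == "/splits":
        splits = list_splits(job.dataset)
        # One new job per split (assumed here to be a `/first-rows` job).
        for split in splits:
            queue.append(Job("/first-rows", job.dataset, split))
        return {"splits": splits}
    if job.kind == "/first-rows":
        return {"split": job.split, "rows": []}  # placeholder payload
    raise ValueError(f"unsupported job kind: {job.kind}")


# Seed the queue with a single `/splits` job and drain it.
queue = deque([Job("/splits", "user/dataset")])
results = []
while queue:
    job = queue.popleft()
    results.append((job, run_job(job, queue)))

for job, _ in results:
    print(job.kind, job.split)
```

Seeding the queue with one `/splits` job is enough to fan out into one follow-up job per split, which is why a single refresh can populate several cached responses.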

	## Workers

	Workers are responsible for executing the jobs in the queue. They complete the actual preprocessing requests, such as getting a list of splits and subsets. The workers can be controlled through environment variables, such as the minimum or maximum number of rows returned by a worker, or the maximum number of jobs to start per dataset user or organization.

	Take a look at the [workers configuration](https://github.com/huggingface/dataset-viewer/tree/main/services/worker#configuration) for a complete list of the environment variables if you're interested in learning more.
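As an illustration of configuring a worker through environment variables, here is a minimal sketch. The variable names (`EXAMPLE_ROWS_MIN_NUMBER` and so on), the defaults, and the `WorkerConfig` class are all hypothetical and chosen only for this example; the real names are in the workers configuration linked above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    min_rows: int
    max_rows: int
    max_jobs_per_namespace: int


def load_config(env=os.environ) -> WorkerConfig:
    # Variable names and defaults are hypothetical, for illustration only.
    config = WorkerConfig(
        min_rows=int(env.get("EXAMPLE_ROWS_MIN_NUMBER", "10")),
        max_rows=int(env.get("EXAMPLE_ROWS_MAX_NUMBER", "100")),
        max_jobs_per_namespace=int(env.get("EXAMPLE_MAX_JOBS_PER_NAMESPACE", "1")),
    )
    if config.min_rows > config.max_rows:
        raise ValueError("minimum number of rows exceeds the maximum")
    return config


# Pass a plain dict instead of os.environ to try out different settings.
config = load_config({"EXAMPLE_ROWS_MAX_NUMBER": "50"})
print(config)
```

Reading every setting in one place, with a default and a sanity check, keeps a misconfigured worker from starting with contradictory limits.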

	## Cache

	Once the workers complete a job, the results are stored (_cached_) in a Mongo database. When a user makes a request to an endpoint like `/first-rows`, the dataset viewer retrieves the preprocessed response from the cache and serves it to the user. This eliminates the time the user would otherwise spend waiting for the server to complete the job.

	As a result, users can get the information they request about a dataset, even a large one, nearly instantaneously!
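The split between writing and reading the cache can be sketched as follows. This is an assumption-laden illustration rather than the actual code: a Python dictionary stands in for the Mongo database, and `store_result`, `serve`, and the "not ready" behaviour for a missing entry are hypothetical.

```python
from typing import Optional

# In-memory stand-in for the cache; the real one is a Mongo database.
cache = {}


def store_result(kind: str, dataset: str, split: Optional[str], response: dict) -> None:
    """Called by a worker once a job finishes."""
    cache[(kind, dataset, split)] = response


def serve(kind: str, dataset: str, split: Optional[str] = None) -> tuple:
    """Called by the web API: a lookup only, no preprocessing."""
    response = cache.get((kind, dataset, split))
    if response is None:
        # Hypothetical behaviour when the job has not completed yet.
        return False, None
    return True, response


# A worker stores the result of a finished `/first-rows` job...
store_result("/first-rows", "user/dataset", "train", {"rows": [{"text": "hello"}]})

# ...and later requests are answered straight from the cache.
hit = serve("/first-rows", "user/dataset", "train")
miss = serve("/first-rows", "user/dataset", "test")
print(hit)
print(miss)
```

Because `serve` only performs a lookup, its cost does not depend on the size of the dataset, which is what makes responses for large datasets fast.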
