cuda-12.4/gds/tools/README
The gds-tools package provides binaries for data verification, GDS configuration verification, and a GPU-based synthetic IO benchmarking tool.
The tools are installed at /usr/local/cuda-x.y/gds/tools.
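For example, on a CUDA 12.4 installation (the exact path depends on the installed CUDA version):
$ cd /usr/local/cuda-12.4/gds/tools
$ ./gdscheck.py -p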
1: gdsio
This is a synthetic IO benchmarking tool that uses the cuFile APIs.
Here is a sample usage:
gdsio version: 1.2
Usage [using cmd line options]:
$ ./gdsio
-f <file name>
-D <directory name>
-d <gpu_index (refer nvidia-smi)>
-n <numa node>
-w <worker_count>
-s <file size(K|M|G)>
-o <start offset(K|M|G)>
-i <io_size(K|M|G)> <min_size:max_size:step_size>
-p <enable nvlinks>
-b <skip bufregister>
-V <verify IO>
-x <xfer_type>
-I <(read) 0 | (write) 1 | (randread) 2 | (randwrite) 3>
-T <duration in seconds>
-k <random_seed (e.g. 3456), to be used with random read/write>
-U <use unaligned(4K) random offsets>
-R <fill io buffer with random data>
-F <refill io buffer with random data during each write>
-B <batch size>
Usage [using config file]:
(refer to the rw-sample.gdsio provided as a sample)
$ ./gdsio rw-sample.gdsio
xfer_type:
0 - Storage -> GPU (GDS)
1 - Storage -> CPU
2 - Storage -> CPU -> GPU
3 - Storage -> CPU -> GPU_ASYNC
4 - Storage -> PAGE_CACHE -> CPU -> GPU
5 - Storage -> GPU_ASYNC
6 - Storage -> GPU (GDS) in batch mode
Note:
A read test (-I 0) with the verify option (-V) should be used with files written (-I 1) with the -V option (see the file-mode example below)
A random read test (-I 2) with the verify option (-V) should be used with files written (-I 3) with the -V option and the same random seed (-k),
the same number of threads, offset, and data size
A write test (-I 1/3) with the verify option (-V) will perform writes followed by reads for verification
In batch mode, IO sizes must be 4K-aligned; otherwise, an error is returned.
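For example, a minimal file-mode verification workflow (the file path, GPU index, and
sizes below are placeholders) writes a file with -V and then reads it back with -V
using the same thread count, IO size, and data size:
# write 1GiB with a verification pattern
$ ./gdsio -f /mnt/test/verify_file -d 0 -n 0 -w 4 -s 1G -i 1M -x 0 -I 1 -V
# read the same file back and verify the contents
$ ./gdsio -f /mnt/test/verify_file -d 0 -n 0 -w 4 -s 1G -i 1M -x 0 -I 0 -V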
gdsio config file options:
==========================
A gdsio config file (refer to rw-sample.gdsio as an example) can be used to issue multiple parallel jobs.
The config file has two sections: a global section and per-job sections.
Note: the gdsio config file has two per-job options that are not currently available on the command line.
1) per-job start offset ("start_offset") - specifies the start offset for a particular job. If not defined, the global start_offset is used
2) per-job size ("size") - defines the size for a job in the config file. If not defined, the global size is used
e.g.
[job1]
filename=/mnt/test/testfile
start_offset=1M
size=2M
This will start IO of size 2M at offset 1M for job1.
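A fuller sketch with a global section and two jobs might look like the following;
the file paths are placeholders, and key names other than filename, start_offset,
and size should be cross-checked against the shipped rw-sample.gdsio before use
(other per-job keys such as the GPU index and thread count are omitted here):
[global]
size=4M
start_offset=0
[job1]
filename=/mnt/test/testfile1
start_offset=1M
size=2M
[job2]
filename=/mnt/test/testfile2
Here job1 overrides the global values, while job2 falls back to the global size (4M) and start_offset (0).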
gdsio command line options:
============================
[ job options ]
-f - The file path to use (e.g. /mnt/gdsio.txt)
-D - The directory to use (e.g. /mnt/gdsio_dir). This option requires files created in the directory using
     -I 1 -w <n>. The files will have the pattern (gdsio.0, gdsio.1, .. gdsio.<n-1>).
     Note: -D and -f cannot be used at the same time
-V - Verify the contents of the file based on a specific IO pattern.
     To verify the data, the file's IO pattern must first be generated using the -V, -I 1, and -w <n> options
-d - GPU device index (0 - 15, refer nvidia-smi). Each file is matched one to one with its device
-w - Number of threads per file
-n - NUMA node
[ global options ]
-s - Size of the file (e.g. -s 1G, -s 10M, -s 3.5g) (For reads, if -s is not specified, the file size is used by default)
-i - IO size to use when reading or writing (choose somewhere between 1024K and 8192K)
-I - IO type: 0 - sequential read, 1 - sequential write, 2 - random read, 3 - random write
-x - Transfer type, to test different ways of transferring data from storage
     -x 0 when you want to test GPUDirect Storage
     -x 2 to test the CPU path with pread and then cudaMemcpy to the GPU
-o - Starting file offset in each thread to read from.
     E.g. for aligned file reads specify -o 4K, -o 1M
-p - Enable p2p for all CUDA_VISIBLE_DEVICES used for dynamic routing; this may improve performance if IO has to traverse the QPI/UPI path
-T - Duration of the test in seconds
-U - Use unaligned (4K) random offsets
-k - Random seed for use with randread/randwrite (-I 2/3) (see the timed example below)
-R - Fill the IO buffer with random data
-F - Refill the IO buffer with random data during each write
This is a write (-I 1) benchmark that does 4K (-i) IO to create a file of size 1 GiB (-s)
# 4KiB GDS WRITE test on GPU 0 with 2 worker threads on a single file for 1GiB dataset
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 1
IoType: WRITE XferType: GPUD Threads: 2 DataSetSize: 1073442816/1073741824 IOSize: 4(KiB),Throughput: 0.167347 GiB/sec, Avg_Latency: 45.588810 usecs ops: 071 total_latency 5973939.000000
# 4KiB GDS READ test on GPU 0 with 2 worker threads on a single file for 1GiB dataset
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 0
IoType: READ XferType: GPUD Threads: 2 DataSetSize: 1073475584/1073741824 IOSize: 4(KiB),Throughput: 0.079856 GiB/sec, Avg_Latency: 95.536943 usecs ops: 079 total_latency 12519361.000000
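For example, a timed random-read variant of the above (the values below are placeholders)
combines -T, -U, and -k to run for a fixed duration with unaligned offsets and a fixed random seed:
# 30-second 4KiB GDS RANDOM READ test on GPU 0 with 2 worker threads
$ ./gdsio -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 2 -T 30 -U -k 3456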
For performance testing, users can also launch multiple IOs on different files (under different mount points) as shown below (this example is from a 16-GPU DGX-2 system):
# GPUDirect Storage performance test for READS with 1MiB IO SIZE on 512G dataset using 8 workers
$ WORKERS=8; IO_TYPE=0; XFER_TYPE=0; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
-f /mnt/dir1/test -d 0 -n 0 -w $WORKERS \
-f /mnt/dir2/test -d 3 -n 0 -w $WORKERS \
-f /mnt/dir3/test -d 4 -n 0 -w $WORKERS \
-f /mnt/dir4/test -d 7 -n 0 -w $WORKERS \
-f /mnt/dir5/test -d 8 -n 1 -w $WORKERS \
-f /mnt/dir6/test -d 11 -n 1 -w $WORKERS \
-f /mnt/dir7/test -d 12 -n 1 -w $WORKERS \
-f /mnt/dir8/test -d 15 -n 1 -w $WORKERS
# Compare with Storage to GPU using traditional method for READS with 1MiB IO SIZE on 512G dataset using 8 workers
$ WORKERS=8; IO_TYPE=0; XFER_TYPE=2; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
-f /mnt/dir1/test -d 0 -n 0 -w $WORKERS \
-f /mnt/dir2/test -d 3 -n 0 -w $WORKERS \
-f /mnt/dir3/test -d 4 -n 0 -w $WORKERS \
-f /mnt/dir4/test -d 7 -n 0 -w $WORKERS \
-f /mnt/dir5/test -d 8 -n 1 -w $WORKERS \
-f /mnt/dir6/test -d 11 -n 1 -w $WORKERS \
-f /mnt/dir7/test -d 12 -n 1 -w $WORKERS \
-f /mnt/dir8/test -d 15 -n 1 -w $WORKERS
# Users can also use the directory option with gdsio. This is a file-per-thread mode.
Files must first be created (using IO type write, -I 1) before they can be read.
Note: the directory (-D) option must not be used simultaneously with file mode (-f)
$ WORKERS=8; IO_TYPE=1; XFER_TYPE=0; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
-D /mnt/dir1/ -d 0 -n 0 -w $WORKERS \
-D /mnt/dir2/ -d 5 -n 0 -w $WORKERS \
-D /mnt/dir3/ -d 9 -n 0 -w $WORKERS \
-D /mnt/dir4/ -d 13 -n 0 -w $WORKERS
# Verification of data
$ WORKERS=8; IO_TYPE=1; XFER_TYPE=1; IO_SIZE=1M; DATASET_SIZE=512G
$ ./gdsio -V -x $XFER_TYPE -I $IO_TYPE -i $IO_SIZE -s $DATASET_SIZE \
-D /mnt/dir1/ -d 0 -n 0 -w $WORKERS \
-D /mnt/dir2/ -d 5 -n 0 -w $WORKERS \
-D /mnt/dir3/ -d 9 -n 0 -w $WORKERS \
-D /mnt/dir4/ -d 13 -n 0 -w $WORKERS
# Use variable block size and choose the IO pattern
Sequential Read
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 0
Sequential Write
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 1
Random Read
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 2
Random Write
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 32K:1024K:1K -x 0 -I 3
# gdsio examples for batch mode
Sequential read in batch mode with a batch size of 4 for a single file
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 0
Sequential write in batch mode with a batch size of 4 for a single file
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 1
Sequential write in batch mode with a batch size of 4 for a single file, with verification
$ ./gdsio -x 6 -f /mnt/foo -d 0 -w 4 -s 128K -i 4k -I 1 -V
# For user-space RDMA tests:
Run Server
$ ./rdma_dci_server.sh (update the IP addresses to match those configured on the system)
Run Client
write: ./gdsio -P sockfs://IPV4:PORT -d 0 -n 0 -w 4 -P sockfs://IPV4:PORT -d 1 -n 0 -w 4 -s 1G -i 1M -x 0 -I 1
read : ./gdsio -P sockfs://IPV4:PORT -d 0 -n 0 -w 4 -P sockfs://IPV4:PORT -d 1 -n 0 -w 4 -s 1G -i 1M -x 0 -I 0
# Use the refill buffer option (-F). This fills the IO buffer with random data at every write
$ ./gdsio -D /mnt/dir/ -d 0 -n 0 -w 32 -s 8G -i 1024K -x 0 -I 1 -F -k 3456
2: gdsio_verify
This is a data verification tool that checks data integrity using the cuFile APIs.
$ ./gdsio_verify -h
--gpu(d) <gpu-index>
--file(f) <filename>
--gpu_offset(t) <gpu_offset(K|M|G)>
--gpu_devptr_offset(b) <gpu_devptr_offset(K|M|G)>
--gpubufalignment(g) <offset(K|M|G)>
--fileoffset(o) <offsetbytes(K|M|G)>
--iosize(s) <size in (K|M|G)>
--chunksize(c) <chunk size in (K|M|G)>
--nr(n) <number of ios>
--sync(m) <mode sync(1) or async(0)>
--skipregister(S) <skip buffer register>
--verbose(V) <verbose>
--fsync(p) <O_SYNC (1)>
--batch(B) <no of batch entries per I/O>
--version(v) <version>
NOTE: for batch mode (-B), -b, -g, -t, -o, and -c must be 4K aligned, and -S is not supported.
iosize (-s) represents the IO size of each batch entry, e.g. with 4 batch entries and
a 256MB iosize, the total amount of I/O would be 1GB.
Example:
Make sure the test file is not empty.
# verify reading 1G data using GPUDirect Storage
$ ./gdsio_verify -d 0 -f /mnt/test -o 0 -s 1G -n 1 -m 1
gpu index :0,file :/mnt/test, RING buffer size :0, gpu buffer alignment :0, gpu buffer offset :0, file offset :0, io_requested :1073741824, sync :1, nr ios :1,
address = 7fa27e000000
This test reads 1G from /mnt/test to GPU 0 using cuFileRead, writes it back to /mnt/ using cuFileWrite,
and verifies that the source and target data match.
# verify reading 256MB data using GPUDirect Storage batch mode with batch size of 4
$ ./gdsio_verify -B 4 -f /mnt/foo -s 64K -d 0 -c 4K -o 0
3: gdscheck
This tool performs basic platform, driver, and filesystem-specific checks to test for GPUDirect Storage support.
$ ./gdscheck.py -h
usage: gdscheck.py [-h] [-p] [-f FILE] [-v] [-V]
GPUDirectStorage platform checker
optional arguments:
-h, --help show this help message and exit
-p gds platform check
-f FILE gds file check
-v gds version checks
-V gds fs checks
example:
(for version information)
$ ./gdscheck.py -v
GDS release version (beta): 0.9.0.14
nvidia_fs version: 2.3 libcufile version: 2.3
(for only platform check)
$ ./gdscheck.py -p
GDS release version (beta): 0.95.0.49
nvidia_fs version: 2.6 libcufile version: 2.3
cuFile CONFIGURATION:
NVMe : Supported
NVMeOF : Supported
SCSI : Unsupported
SCALEFLUX CSD : Supported
NVMesh : Supported
LUSTRE : Supported
GPFS : Unsupported
NFS : Supported
WEKAFS : Supported
USERSPACE RDMA : Supported
--MOFED peer direct : enabled
--rdma library : Loaded (libcufile_rdma.so)
--rdma devices : Configured
--rdma_device_status : Up: 1 Down: 0
properties.use_compat_mode : 1
properties.use_poll_mode : 0
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 1
properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
fs.generic.posix_unaligned_writes : 0
fs.lustre.posix_gds_min_kb: 0
fs.weka.rdma_write_support: 0
profile.nvtx : 0
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : 0
GPU INFO:
GPU index 0 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 1 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 2 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 3 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 4 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 5 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 6 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 7 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 8 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 9 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 10 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 11 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 12 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 13 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 14 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
GPU index 15 Tesla V100-SXM3-32GB bar:1 bar size (MiB):32768 supports GDS
IOMMU : disabled
Platform verification succeeded
(for only file check)
$ ./gdscheck.py -f /mnt/test
GDS register success
generating 4k read latency matrix :
GPU 34:00:00 : 250.56(us) read_verification: pass
GPU 36:00:00 : 250.00(us) read_verification: pass
GPU 39:00:00 : 250.05(us) read_verification: pass
GPU 3b:00:00 : 243.88(us) read_verification: pass
(for checking client filesystem version support)
$ /usr/local/gds/tools/gdscheck.py -v -V
GDS release version (beta): 0.95.0
nvidia_fs version: 2.6 libcufile version: 2.3
FILESYSTEM VERSION CHECK:
LUSTRE:
current version: 2.6.99 (Unsupported)
min version supported: 2.12.3_ddn28
WEKAFS:
GDS RDMA read: supported
GDS RDMA write: supported
current version: 3.8.0.9-dg
min version supported: 3.8.0
4: gdscp
This tool copies a file from one location to another using the cuFile APIs. It mimics "cp" behaviour.
Make sure the test file is not empty.
$ ./gdscp /mnt/test /mnt/test_copy 0 -v
gpu md5:90672a90fba312a386b25b8861e8bd9
cpu md5:90672a90fba312a386b25b8861e8bd9
md5sum Match!!
In the above example, data is copied from /mnt/test to /mnt/test_copy;
the data is routed through GPU memory using the cuFile APIs.
6: gds_stats
This tool is used to read user-space statistics exported by libcufile per process.
$ ./gds_stats -p <process id> -l <verbosity level>
-l is the verbosity level and can be 1, 2, or 3.
Before trying to read the statistics, ensure that cuFile statistics are enabled
by setting the JSON configuration key profile.cufile_stats to a valid level.
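For example, assuming a gdsio run is in flight and profile.cufile_stats is set to 3
in that process's cufile.json (the PID lookup below is illustrative):
# read level-3 cuFile statistics from a running gdsio process
$ ./gds_stats -p $(pidof gdsio) -l 3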
7: gdsio_static
Functionally and usage-wise it is the same as gdsio, but it uses the cuFile static libraries.
For more details, refer to the gdsio examples above.
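For example, the earlier single-file GDS read test could be run unchanged with the static binary
(the path and values are placeholders):
$ ./gdsio_static -f /mnt/test -d 0 -n 0 -w 2 -s 1G -i 4K -x 0 -I 0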
8: gds_log_collection.py
This tool is used to collect logs from the system that are relevant for debugging.
It collects logs such as OS and kernel info, nvidia-fs stats, dmesg logs, syslogs,
system map files, and per-process logs like cufile.json, cufile.log, gds_stats output, process stack, etc.
Usage: ./gds_log_collection.py [options]
options:
-h help
-f file_path1,file_path2,.. (Note: there should be no spaces around the ',')
e.g.
sudo ./gds_log_collection.py - Collects all the relevant logs
sudo ./gds_log_collection.py -f file_path1,file_path2 - Collects all the relevant logs as well as the user-specified files. These could be crash files or any other relevant files