Ray Cluster Setup and Execution Guide
Quick Commands
Spin up the ray cluster:
ray up ray_cluster_configs/cluster_west.yaml
Access the ray cluster:
ray attach ray_cluster_configs/cluster_west.yaml
Transfer the tokenize_shuffle.py script to the cluster:
ray rsync_up ray_cluster_configs/cluster_west.yaml tokenize_shuffle.py /home/ubuntu
Tokenize with shuffling:
python tokenize_shuffle.py --input "s3://dcnlp-data/redpajamas-raw/c4-train.{00000..00063}-of-01024.jsonl" --output s3://dcnlp-data/tokenize-shuffle-test/
Note: Ensure that the S3 paths specified above are in the same AWS region as the one set in the ray yaml file (currently us-west-2).
- Exit and re-enter the cluster as required.
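The --input path in the tokenize command uses a brace range, {00000..00063}, and is quoted so that the shell passes the pattern through unexpanded. The sketch below shows one way such a pattern could be expanded into individual shard paths, assuming bash-style semantics with zero padding preserved. The helper name expand_brace_range is hypothetical, and this is not necessarily how tokenize_shuffle.py handles the pattern internally.

```python
import re
from typing import List

# Matches one numeric brace range such as {00000..00063}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_brace_range(pattern: str) -> List[str]:
    """Expand the first {start..end} range in pattern (hypothetical helper)."""
    match = _RANGE.search(pattern)
    if match is None:
        # No brace range present: the pattern is already a single path.
        return [pattern]
    start_text, end_text = match.groups()
    start, end = int(start_text), int(end_text)
    # Like bash, pad with zeros only when a bound is written with a leading zero.
    padded = any(t.startswith("0") and len(t) > 1 for t in (start_text, end_text))
    width = max(len(start_text), len(end_text)) if padded else 0
    step = 1 if end >= start else -1
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{str(n).zfill(width)}{suffix}"
        for n in range(start, end + step, step)
    ]


shards = expand_brace_range(
    "s3://dcnlp-data/redpajamas-raw/c4-train.{00000..00063}-of-01024.jsonl"
)
print(len(shards))
print(shards[0])
print(shards[-1])
```

For the path above this yields 64 shard paths, from c4-train.00000-of-01024.jsonl through c4-train.00063-of-01024.jsonl, which can be handy for checking that every expected input file exists before launching the job.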
Detailed Workflow
Configure AWS:
Start by setting up your AWS credentials:
aws configure
Initialize the cluster:
ray up ray_cluster_configs/cluster_west.yaml
Copy the script to the cluster:
ray rsync_up ray_cluster_configs/cluster_west.yaml tokenize_shuffle.py /home/ubuntu
Copy the default_dataset_yaml as well if used.
SSH into the cluster:
ray attach ray_cluster_configs/cluster_west.yaml
Enter tmux and execute the job:
tmux new-session -d -s ray_tokenize_shuffle 'python tokenize_shuffle.py'
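The tmux command wraps the whole python invocation in one quoted string, which gets awkward once the script arguments themselves need quoting (for example the brace pattern passed to --input in Quick Commands). As a rough sketch, the command line can be assembled with Python's shlex module so that both levels of quoting are handled automatically. The helper name build_tmux_command is hypothetical; the session name and flags are the ones used in this guide.

```python
import shlex
from typing import List


def build_tmux_command(session: str, script_args: List[str]) -> str:
    """Build a detached tmux command line for tokenize_shuffle.py (hypothetical helper)."""
    # Inner command: the script invocation, with each argument quoted as needed.
    inner = shlex.join(["python", "tokenize_shuffle.py", *script_args])
    # Outer command: tmux receives the inner command as one quoted argument.
    return shlex.join(["tmux", "new-session", "-d", "-s", session, inner])


command = build_tmux_command(
    "ray_tokenize_shuffle",
    [
        "--input",
        "s3://dcnlp-data/redpajamas-raw/c4-train.{00000..00063}-of-01024.jsonl",
        "--output",
        "s3://dcnlp-data/tokenize-shuffle-test/",
    ],
)
print(command)
```

The printed string can be pasted into the shell on the head node. Because the session is started detached (-d), the job keeps running after you exit the cluster.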
Heads up: This is version 0 of this script. The user interface will be improved in future versions. Currently, objects are being spilled to dcnlp-hub.