| <!doctype html> |
| <html lang="en"> |
| <head> |
<meta charset="utf-8">
|
|
| <title>Training Transformers Together</title> |
| <meta name="description" content="A NeurIPS'21 demonstration that explains how to train large models together with multiple collaborators."> |
| <link rel="mask-icon" href="https://learning-at-home.github.io/logo_small.png"> |
| <link rel="alternate icon" class="js-site-favicon" type="image/png" href="https://learning-at-home.github.io/logo.png"> |
| <link rel="icon" class="js-site-favicon" type="image/png" href="https://learning-at-home.github.io/logo.png"> |
| <meta property="og:url" content="https://training-transformers-together.github.io"> |
| <meta property="og:site_name" content="Training Transformers Together"> |
| <meta property="og:title" content="Train vast neural networks together"> |
| <meta property="og:description" content="A NeurIPS'21 demonstration that explains how to train large models together with multiple collaborators."> |
| <meta property="og:image" content="https://learning-at-home.github.io/logo_small.png"> |
| <meta property="og:image:type" content="image/png"> |
| <meta property="og:image:width" content="96"> |
| <meta property="og:image:height" content="96"> |
| <meta property="twitter:site" content="https://training-transformers-together.github.io"> |
| <meta property="twitter:creator" content="Yandex, Hugging Face, University of Washington, Hivemind team & contributors"> |
| <meta property="twitter:card" content="summary_large_image"> |
| <meta property="twitter:title" content="Training Transformers Together"> |
| <meta property="twitter:description" content="A NeurIPS'21 demonstration that explains how to train large models together with multiple collaborators."> |
| <meta property="twitter:image:src" content="https://learning-at-home.github.io/logo_horizontal.png"> |
| <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> |
|
|
| <link href="https://bootswatch.com/5/flatly/bootstrap.css" rel="stylesheet"> |
|
|
| <link href="./style.css" rel="stylesheet"> |
| </head> |
|
|
| <body> |
| <div id="header_main" style="display: block;" class="mb-0 pb-0"> |
| <canvas></canvas> |
| <div id="overlay"> |
| <div id="header_window"> |
| <div id="header"> |
| <img src="https://learning-at-home.github.io/logo.png" id="bug-logo" |
| style="width: 40%; max-height: 320px; max-width: 320px; z-index:1000; position: relative;"> |
| <br> |
| <h1 class="faded title title_elem mb-1 pb-1" style="margin-top:-25px; margin-bottom:-10px"> |
| <p style="margin-top: 0px; font-weight:bolder; margin-bottom:0px;"> |
| <span id="title_text">Training Transformers Together</span> |
| </p> |
| <p style="font-size: 18px; margin-top:0px; margin-bottom:5px;"> |
| large-scale deep learning for everyone, by everyone</p> |
| <p style="font-size: 18px; font-weight:lighter; margin-top:0px; margin-bottom:0px;"> |
| A NeurIPS 2021 Demonstration</p> |
| </h1> |
| </div> |
| </div> |
| </div> |
| </div> |
| <script src="./header-animate.js"></script> |
|
|
| <div class="container d-flex justify-content-center mb-2 pb-2" style="max-width: 500px"> |
| <div class="row text-center align-items-center justify-content-center"> |
| <div class="col-3"> |
| <a href="https://research.yandex.com/"> |
| <img src="logos/yandex.png" class="img-fluid center-block" style="max-width: 66%" alt="Yandex Research"> |
| </a> |
| </div> |
| <div class="col-3 px-2"> |
| <a href="https://huggingface.co/"> |
| <img src="logos/huggingface.png" class="img-fluid center-block" style="max-width: 66%" alt="Hugging Face"> |
| </a> |
| </div> |
| <div class="col-3 px-3"> |
| <a href="https://www.hse.ru/en/"> |
| <img src="logos/hse.png" class="img-fluid center-block" style="max-width: 66%" alt="HSE University"> |
| </a> |
| </div> |
| <div class="col-3 px-2"> |
| <a href="http://www.washington.edu/"> |
| <img src="logos/uwash.png" class="img-fluid center-block" alt="University of Washington"> |
| </a> |
| </div> |
| </div> |
| </div> |
|
|
| <div class="container" style="display: block;"> |
| <p> |
| There was a time when you could comfortably train state-of-the-art vision and language models at home on your workstation. |
| The first convolutional neural net to beat ImageNet |
| (<a target="_blank" href="https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf">AlexNet</a>) |
| was trained for 5-6 days on two gamer-grade GPUs. In contrast, today's Top-1 ImageNet model |
| (<a target="_blank" href="https://arxiv.org/abs/2106.04803">CoAtNet</a>) |
| takes 20,000 TPU-v3 days. And things are even worse in the NLP world: training |
| <a target="_blank" href="https://arxiv.org/abs/2005.14165">GPT‑3</a> |
| on a top-tier server with 8x A100 would take decades. |
| </p> |
| <p> |
So, can individual researchers and small labs still train state-of-the-art models? Yes, we can!
| All it takes is for a bunch of us to come together. In fact, we're doing it right now and <b>you are invited to join!</b> |
| </p> |
| <iframe id="iframe_main" src="https://hf.space/streamlitiframe/training-transformers-together/dashboard-embedded/+" |
| data-src="https://hf.space/streamlitiframe/training-transformers-together/dashboard-embedded/+" |
| data-sdk="streamlit" |
| title="Streamlit app" class="container p-0 flex-grow space-iframe" |
        allow="accelerometer; ambient-light-sensor; autoplay; battery; camera; document-domain; encrypted-media; fullscreen; geolocation; gyroscope; layout-animations; legacy-image-formats; magnetometer; microphone; midi; oversized-images; payment; picture-in-picture; publickey-credentials-get; sync-xhr; usb; vr; wake-lock; xr-spatial-tracking"
        sandbox="allow-forms allow-modals allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-downloads"
        style="top:-200px; left:0; bottom:0; right:0; width:100%; height:200px; border:none; margin:0; padding:0; z-index:999999;" scrolling="no">
| <p>This was meant to be an IFrame, but your browser did not display it.</p> |
| <p>Please go to <a href="https://huggingface.co/spaces/training-transformers-together/demo">https://huggingface.co/spaces/training-transformers-together/demo</a>.</p> |
| </iframe> |
| <p> |
| In this demo, we train a model similar to <a target="_blank" href="https://openai.com/blog/dall-e/">OpenAI DALL-E</a> — |
| a Transformer model that generates images from text descriptions. |
| It is trained on <a target="_blank" href="https://laion.ai/laion-400-open-dataset/">LAION-400M</a>, |
| the world's largest openly available image-text-pair dataset with 400 million samples. Our model is based on |
| the <a target="_blank" href="https://github.com/lucidrains/DALLE-pytorch">dalle‑pytorch</a> implementation |
| by <a target="_blank" href="https://github.com/lucidrains">Phil Wang</a> with a few tweaks to make it communication-efficient. |
| </p> |
| <div class="accordion" id="accordionExample"> |
| <div class="accordion-item"> |
| <h2 class="accordion-header" id="headingOne"> |
| <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#collapseOne" aria-expanded="false" aria-controls="collapseOne"> |
| How to train efficiently over the Internet? |
| </button> |
| </h2> |
| <div id="collapseOne" class="accordion-collapse collapse" aria-labelledby="headingOne" data-bs-parent="#accordionExample"> |
| <div class="accordion-body"> |
| <p> |
Modern distributed training algorithms are designed for HPC clusters with network bandwidth of 10-100 gigabits per second.
In contrast, a typical home Internet connection runs at 10-100 megabits per second: three orders of magnitude slower.
To make distributed training efficient, you need to win back those three orders of magnitude.
This may seem daunting at first, but in reality, DL researchers have already developed all the pieces needed to solve this puzzle:
| </p> |
| <table class="table table-hover"> |
| <thead> |
| <tr> |
| <th scope="col">Speed‑up</th> |
| <th scope="col">How to achieve</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr><td class="centered"><strong>4-16x</strong></td><td> |
<strong>Large-batch training:</strong> <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1904.00962">You et al. (2019)</a> proposed a way to train neural networks efficiently with much larger batches and, hence, fewer communication rounds.
| </td></tr> |
| <tr><td class="centered"><strong>4-32x</strong></td><td> |
| <strong>Gradient compression:</strong> from simple <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1511.04561">8-bit quantization</a> |
| to advanced techniques such as <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1712.01887">Deep Gradient Compression</a>, |
| <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1905.13727">PowerSGD</a>, <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2102.02888">1-bit Adam</a>, |
| and many others. As a rule of thumb, these techniques can safely reduce communication by 16-32x. More extreme compression is often |
| possible, but it may affect stability or final quality. |
| </td></tr> |
| <tr><td class="centered"><strong>4-24x</strong></td><td> |
| <strong>Parameter sharing:</strong> reusing parameters between model layers results in a model with fewer parameters, |
| and hence, fewer gradients to communicate. <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1909.11942">Lan et al. (2019)</a> and |
| <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/pdf/2107.11817.pdf">Xue et al. (2021)</a> propose efficient parameter sharing architectures |
| for NLP and computer vision. |
| </td></tr> |
| <tr><td class="centered"><strong>1.5-2x</strong></td><td> |
<strong>Overlapping computation with communication:</strong> running network communication in the background while
computing the next portion of gradients. This is a <a target="_blank" rel="noopener noreferrer" href="https://ur.booksc.eu/book/1624068/2d0506">long-standing trick from HPC</a>
that was recently adapted for DL training. <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">Ren et al. (2021)</a> show that
updating parameters in the background while computing the next batch of gradients does not harm convergence.
| </td></tr> |
| </tbody> |
| </table> |
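<p>As a toy illustration of the gradient compression row above, here is a minimal "absmax" 8-bit quantization sketch in pure Python. It is a simplified stand-in for explanation only, not the exact scheme used in this demo:</p>

```python
# Toy "absmax" 8-bit gradient compression: each float becomes one
# int8 code, plus a single shared scale for the whole tensor.

def quantize_8bit(grads):
    scale = max(abs(g) for g in grads) / 127 or 1.0   # avoid /0 for all-zero grads
    codes = [round(g / scale) for g in grads]         # int8 codes in [-127, 127]
    return codes, scale

def dequantize_8bit(codes, scale):
    return [c * scale for c in codes]

grads = [0.03, -1.2, 0.5, 0.0007]
codes, scale = quantize_8bit(grads)
restored = dequantize_8bit(codes, scale)

# 1 byte per value instead of 4 (plus one shared scale): a ~4x reduction,
# at the cost of a bounded rounding error of at most scale / 2.
assert max(abs(g - r) for g, r in zip(grads, restored)) <= scale / 2
```

<p>Real implementations quantize tensors blockwise and in vectorized code, but the accounting is the same: one byte per value plus a shared scale, instead of four bytes per fp32 gradient entry.</p>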
| <p> |
Combined, these techniques are more than enough to compensate for 1000x slower communication.
This means that, in practice, you can pick and choose which of them to use in your training run.
For this demo, we use 8x larger batches, 4x compression, 12x parameter sharing, and partial overlapping.
If you don't want parameter sharing, you can trade it for more aggressive gradient compression or even larger batches.
| </p> |
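<p>The overlapping trick itself can be sketched with plain Python threads: while the gradients of one step are being "averaged" in the background, the worker already computes the next step's gradients. This is a simplified simulation, not the demo's actual pipeline:</p>

```python
import threading

# Simplified overlap of computation and communication: while the
# gradients of step t are being "averaged" by a background thread
# (standing in for a slow network all-reduce), the main thread is
# already computing the gradients of step t + 1.

def compute_gradients(step):
    return [float(step)] * 4                     # stand-in for forward/backward

def average_gradients(grads, results, step):
    results[step] = [g / 2 for g in grads]       # stand-in for all-reduce

results = {}
pending = None                                   # communication thread of the previous step

for step in range(3):
    grads = compute_gradients(step)              # compute step t
    if pending is not None:
        pending.join()                           # finish communicating step t - 1
    pending = threading.Thread(
        target=average_gradients, args=(grads, results, step))
    pending.start()                              # communicate step t in the background

pending.join()
assert sorted(results) == [0, 1, 2]
```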
| </div> |
| </div> |
| </div> |
| </div> |
| <div class="accordion" id="accordionAnother" style="margin-top: 10px;"> |
| <div class="accordion-item"> |
| <h2 class="accordion-header" id="headingTwo"> |
| <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#collapseTwo" aria-expanded="false" aria-controls="collapseOne"> |
| How to train with different device types? |
| </button> |
| </h2> |
| <div id="collapseTwo" class="accordion-collapse collapse" aria-labelledby="headingTwo" data-bs-parent="#accordionAnother"> |
| <div class="accordion-body"> |
| <p> |
| Most distributed DL frameworks assume that the computation is performed by a fleet of identical devices, |
| typically GPU servers or TPU cores. Under this assumption, each device can be assigned an equal part of |
computation, such as processing a fixed number of training samples.
| However, this quickly breaks down if workers use different device types. If one participant uses a GPU (e.g. P100) |
| and another runs on TPU-v2-8, it is difficult to find a regime where both devices will be fully utilized. |
| </p> |
| <p> |
| To make the best use of all available devices, we let each device accumulate gradients at its own pace |
| with individually tuned batch size and some other features (e.g. gradient checkpointing or using XLA). |
| Once workers collectively aggregate some predefined global batch size, they average their gradients |
| with weights proportional to each worker's individual contribution (i.e. number of samples processed). |
| </p> |
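<p>The weighted averaging step can be sketched in a few lines of pure Python (the gradients and sample counts below are made up for illustration):</p>

```python
# Peers accumulate gradients at their own pace; once the collective
# target batch size is reached, gradients are averaged with weights
# proportional to the number of samples each peer processed.

def weighted_average(peer_grads, peer_samples):
    total = sum(peer_samples)
    dim = len(peer_grads[0])
    return [
        sum(w * g[i] for w, g in zip(peer_samples, peer_grads)) / total
        for i in range(dim)
    ]

# e.g. a fast GPU peer processed 768 samples, a slower one 256
grads = [[1.0, 2.0], [5.0, 6.0]]
samples = [768, 256]
assert weighted_average(grads, samples) == [2.0, 3.0]
```

<p>Note that this is equivalent to averaging per-sample gradients over the whole global batch, which is why device heterogeneity does not change the optimization trajectory.</p>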
| <a class="block overflow-hidden"> |
| <div class="w-full h-40 mb-2 bg-gray-900 group-hover:bg-gray-850 rounded-lg flex items-start justify-start overflow-hidden"> |
| <iframe src="https://www.youtube.com/embed/zdVsg5zsGdc" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" |
| style="width: 100%; height: 240px"></iframe> |
| </div> |
| </a> |
| <p> |
| This technique allows the "swarm" to automatically adjust its behavior as peers join, leave or fail. |
| For instance, if several high-performance peers join the experiment, other peers will need to process a smaller |
| number of samples per optimizer step, and hence, the collaboration will train faster with the same hyperparameters. |
In turn, if one of the workers fails and loses its progress (e.g. due to an fp16 overflow), others will make
| up for that by processing slightly more. For more details on how this works, please refer to |
| <a target="_blank" rel="noopener noreferrer" href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html"> |
| "Deep Learning In Open Collaborations"</a> paper or the corresponding <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/blog/collaborative-training">blog post</a>. |
| </p> |
| </div> |
| </div> |
| </div> |
| </div> |
|
|
|
|
| <h3 class="my-4">How do I join?</h3> |
|
|
| <p>This section will be updated <strong>on December 7</strong>.</p> |
|
|
| <h3 class="my-4">Practical aspects</h3> |
|
|
| <div class="border-bottom pb-3"> |
| <ul class="nav nav-tabs m-3"> |
| <li class="nav-item"> |
| <a class="nav-link active" data-bs-toggle="tab" href="#memory-efficiency">Memory-Efficient Training</a> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" data-bs-toggle="tab" href="#security">Security</a> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link" data-bs-toggle="tab" href="#make-your-own">Make Your Own</a> |
| </li> |
| </ul> |
|
|
| <div class="tab-content"> |
| <div class="tab-pane fade active show" id="memory-efficiency"> |
| <p> |
| Our aim is to train a large model in a decentralized fashion on consumer hardware or low-end cloud instances. |
| This means we need to make the model, dataset, and other memory buffers fit onto a few GB of disk, 12-16 GB of CPU RAM, |
| and 8-12 GB of GPU memory. Unfortunately, this rules out many popular techniques such as |
| <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">ZeRO-Offload</a>: |
| there is simply not enough RAM for that. Instead, we must make better use of what limited memory we have. |
| To do this, we use two techniques: 8-bit Optimizers for GPU memory and dataset streaming for RAM & HDD. |
| </p> |
| <p> |
| <b>8-bit optimizers:</b> |
Stateful optimizers such as LAMB or Adam require four times as much GPU memory as storing the model parameters themselves (8 bytes vs 2 bytes)
because they keep two additional statistics for each parameter.
As a result, when training large models, the optimizer state occupies the single largest share of memory.
With 8-bit optimizers, this amount is reduced by 75% (to 2 bytes per parameter), making it much easier to fit large models onto consumer GPUs.
| </p><p> |
| Naturally, we can combine this technique with offloading and store 8-bit optimizer states in the CPU memory rather |
| than in the GPU memory (0 bytes GPU, 2 bytes CPU). To perform an optimizer update, we transfer the GPU gradients |
| to the CPU, update the model parameters, and then copy the new weights to the GPU. |
| We can do this for each weight one-by-one so that the additional CPU memory required for the |
| optimizer update is minimal. |
| This combination of offloading and 8-bit optimizers means that we conserve GPU memory (0 bytes per parameter) |
| and also use only a limited amount of CPU memory (2 bytes per parameter). |
|
|
| </p> |
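<p>To make these byte counts concrete, here is a back-of-the-envelope calculation for a hypothetical 1-billion-parameter model (gradients and activations are left out for simplicity):</p>

```python
# Back-of-the-envelope memory budget per parameter: fp16 weights plus
# Adam/LAMB-style optimizer state. Gradients and activations excluded.
N = 1_000_000_000        # hypothetical 1B-parameter model
GiB = 1024 ** 3

weights_gpu = 2 * N      # fp16 parameters: 2 bytes each, on GPU
adam32_state = 8 * N     # two fp32 statistics per parameter (regular Adam/LAMB)
adam8_state = 2 * N      # two 8-bit statistics per parameter

standard = (weights_gpu + adam32_state) / GiB    # ~9.3 GiB of GPU memory
eight_bit = (weights_gpu + adam8_state) / GiB    # ~3.7 GiB of GPU memory
offloaded_gpu = weights_gpu / GiB                # ~1.9 GiB GPU (+ ~1.9 GiB CPU RAM
                                                 #  when the 8-bit state is offloaded)

assert adam8_state == adam32_state // 4          # the "75% reduction"
```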
| <p> |
| <b>Dataset streaming:</b> |
Usually, data is stored on disk and needs to be fully or partially loaded into RAM for training.
| Large datasets used for pretraining measure in <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a> |
| or even <a target="_blank" rel="noopener noreferrer" href="https://laion.ai/laion-400-open-dataset/">terabytes</a>. |
| This can pose a significant problem, as most desktop and cheap cloud instances simply do not have that much free space. |
Furthermore, downloading the data over the Internet would take hours before one could even begin training.
| </p> |
| <center> |
<img src="./logos/stream.gif" id="stream" alt="Animation of dataset streaming"
     style="width: 80%; max-height: 200px; max-width: 640px; z-index:1000; top:-10px; position: relative;">
| </center> |
| <p> |
| To circumvent these problems, it is possible to stream the data in the same way as you stream online videos. |
| Participants download a small random portion of the training dataset and immediately begin training on it, |
| while additional data is loaded in the background. As such, we can train a model with virtually no storage |
| overhead from the dataset, and switching to a new dataset is as simple as changing an argument of the dataset class. |
| </p> |
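<p>🤗 Datasets supports this out of the box via <code>load_dataset(..., streaming=True)</code>. The core idea can be sketched with a plain generator and a small shuffle buffer; this is a simplified stand-in, not the library's actual implementation:</p>

```python
import random
from itertools import islice

# Simplified dataset streaming: samples arrive as an (endless) iterator,
# and a small in-memory buffer provides approximate shuffling -- no need
# to store the full dataset on disk.

def stream_samples():
    # stand-in for downloading shards of a huge dataset over the network
    i = 0
    while True:
        yield {"id": i, "text": f"sample {i}"}
        i += 1

def shuffled_stream(stream, buffer_size=1000, seed=0):
    rng = random.Random(seed)
    buffer = list(islice(stream, buffer_size))   # only buffer_size samples in RAM
    for sample in stream:
        j = rng.randrange(buffer_size)
        yield buffer[j]                          # emit a random buffered sample...
        buffer[j] = sample                       # ...and replace it with a fresh one

batch = list(islice(shuffled_stream(stream_samples()), 8))
assert len(batch) == 8
assert all(s["id"] < 1008 for s in batch)        # only ~1000 samples ever in memory
```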
| <h5><b>Here's our tutorial covering these methods:</b> |
| <a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/gist/justheuristic/75f6a2a731f05a213a55cd2c8a458aaf/fine-tune-a-language-model-with-dataset-streaming-and-8-bit-optimizers.ipynb"> |
| <span> |
| <img src="https://colab.research.google.com/assets/colab-badge.svg" width="150px"> |
| </span> |
| </a></h5> |
|
|
| </div> |
| <div class="tab-pane fade" id="security"> |
| <p>In this section, we discuss common concerns related to security of collaborative training:</p> |
|
|
| <p> |
| <b>Q: If I join a collaborative experiment, do I allow other people to execute code on my computer?</b> |
| </p> |
|
|
| <p> |
| <b>A:</b> During the training, participants only exchange data (gradients, statistics, model weights) and never send code to each other. |
| No other peer can execute arbitrary code on your computer. |
| </p> |
|
|
| <p> |
| To join the experiment, you typically need to run the code (implementing the model, data streaming, training loop, etc.) |
| from a repository or a Colab notebook provided by the authors of the experiment. |
| This is no different from running any other open source project/Colab notebook. |
| </p> |
|
|
| <p> |
| <b>Q: Can a malicious participant influence the training outcome?</b> |
| </p> |
|
|
| <p> |
| <b>A:</b> It is indeed possible unless we use some defense mechanisms. |
| For instance, a malicious participant can damage model weights by sending large numbers instead of correct gradients. |
| The same can happen due to broken hardware or misconfiguration. |
| </p> |
|
|
| <ul> |
| <li> |
| <p> |
| One possible defense is using <b>authentication</b> combined with <b>model checkpointing</b>. |
| In this case, participants should log in (e.g. with their Hugging Face account) to interact with the rest of the collaboration. |
| In turn, moderators can screen potential participants and add them to an allowlist. |
| If something goes wrong (e.g. a participant sends invalid gradients and the model diverges), |
| the moderators remove them from the list and revert the model to the latest checkpoint unaffected by the attack. |
| </p> |
|
|
|
|
| <p> |
| Nice bonus: using this data, the moderators can acknowledge the personal contribution of each participant. |
| </p> |
| </li> |
| <li> |
| <p> |
| Another defense is replacing the naive averaging of the peers' gradients with an <b>aggregation technique that is robust to outliers</b>. |
| <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a> |
| suggested such a technique (named CenteredClip) and proved that it does not significantly affect the model's convergence. |
| </p> |
|
|
|
|
| <p> |
| In our case, CenteredClip is useful but not enough to protect from malicious participants, |
| since it implies that the CenteredClip procedure itself is performed by a trusted server. |
| By contrast, in our decentralized system, all participants can aggregate a part of the gradients, |
| and we cannot assume any of them to be trusted. |
| </p> |
|
|
| <p> |
| Recently, <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2106.11257">Gorbunov et al. (2021)</a> |
| proposed a robust aggregation protocol for decentralized systems that does not require this assumption. |
| This protocol uses CenteredClip as a subroutine but is able to detect and ban participants who performed it incorrectly. |
| </p> |
| </li> |
| </ul> |
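<p>To give a feel for how CenteredClip bounds the influence of outliers, here is a minimal pure-Python sketch (simplified to scalar "gradients" and illustrative numbers; see the papers above for the real protocol):</p>

```python
# Minimal sketch of the CenteredClip idea (Karimireddy et al., 2020),
# simplified to scalar gradients. Each iteration moves the estimate v
# towards the inputs, but every peer's contribution is clipped to
# radius tau, so a single malicious outlier has bounded influence.

def centered_clip(xs, tau=1.0, iters=50):
    v = 0.0
    for _ in range(iters):
        v += sum(
            (min(1.0, tau / abs(x - v)) * (x - v)) if x != v else 0.0
            for x in xs
        ) / len(xs)
    return v

honest = [0.9, 1.0, 1.1, 1.0]
attacked = honest + [1e6]                 # one malicious peer sends a huge "gradient"

naive = sum(attacked) / len(attacked)     # plain mean: ~200000.8, training is ruined
robust = centered_clip(attacked)          # stays close to the honest mean of 1.0

assert naive > 1000
assert abs(robust - 1.0) < 0.5
```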
| </div> |
| <div class="tab-pane fade" id="make-your-own"> |
| <p>In this section, we provide a recipe for you to run a collaborative training experiment yourself.</p> |
| <p> |
| <b>Got confused?</b> Feel free to ask any questions in our <a target="_blank" rel="noopener noreferrer" href="https://discord.gg/uGugx9zYvN">Discord</a>! |
| </p> |
| <ol> |
| <li class="mb-2"> |
| Set up dataset streaming: |
| <ul> |
| <li> |
| <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/datasets/share_dataset.html">Upload</a> your dataset to the Hugging Face Hub |
| in a streaming-friendly format (<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">example</a>). |
| </li> |
| <li>Set up dataset streaming (see the "Memory-Efficient Training" section).</li> |
| </ul> |
| </li> |
| <li class="mb-2"> |
| Write the code of training peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_trainer.py">example</a>): |
| <ul> |
| <li>Implement your model, set up dataset streaming, and write the training loop.</li> |
| <li> |
| Get familiar with the <a href="https://github.com/learning-at-home/hivemind">hivemind</a> library |
| (<a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html">quickstart</a>). |
| </li> |
| <li> |
| In the training loop, wrap up your PyTorch optimizer with |
| <a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/modules/optim.html#hivemind.optim.experimental.optimizer.Optimizer">hivemind.Optimizer</a> |
| (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/task.py#L121">example</a>). |
| </li> |
| </ul> |
| </li> |
| <li class="mb-2"> |
| <b>(optional)</b> Write the code of auxiliary peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_aux_peer.py">example</a>): |
| <ul> |
| <li> |
Auxiliary peers are a special kind of peer responsible for
| logging experiment progress (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://wandb.ai/">Weights & Biases</a>) |
| and uploading model checkpoints (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/transformers/model_sharing">Hugging Face Hub</a>). |
| </li> |
| <li> |
| Such peers don't need to calculate gradients and may be launched on cheap machines without GPUs. |
| </li> |
| <li> |
| They can serve as a convenient entry point to |
| <a href="https://learning-at-home.readthedocs.io/en/latest/modules/dht.html">hivemind.DHT</a> |
| (i.e., their address can be specified as <code>initial_peers</code>). |
| </li> |
| <li> |
| It is useful to fix their address by providing <code>host_maddrs</code> and <code>identity_path</code> |
| arguments to <code>hivemind.DHT</code> |
| (these are forwarded to the underlying <a target="_blank" rel="noopener noreferrer" href="https://libp2p.io/">libp2p</a> daemon). |
| </li> |
| </ul> |
| </li> |
| <li class="mb-2"> |
| <b>(optional)</b> Make it easier for other people to join: |
| <ul> |
| <li> |
| Create notebooks for free GPU providers (Google Colab, Kaggle, AWS SageMaker, etc.). |
| People may run them online and/or download and run them on their own hardware. |
| </li> |
| <li> |
| <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/organizations/new">Create</a> a Hugging Face organization |
| with all resources related to the training |
| (dataset, model, inference demo, how-to-join walkthrough, links to a dashboard with loss and other metrics, etc.). |
| Look at <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/training-transformers-together">ours</a> for an example. |
| </li> |
| <li> |
| Set up an authentication system (see the "Security" section). |
| For example, you can ask people to join your organization with their Hugging Face accounts |
| (the website allows either sharing a link for joining or manually approving new participants). |
| This allows you to screen the peers, |
| acknowledge their contributions (e.g., make a leaderboard), and |
ban accounts that behave maliciously. You can use our <a href="https://collaborative-training-auth.huggingface.co/docs">authentication system</a> or deploy your own
| (our <a href="https://github.com/huggingface/collaborative-training-auth/tree/demo-neurips">server implementation</a> might be a good start). |
| </li> |
| <li> |
| Set up an inference demo for your model (e.g., using <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/spaces">Spaces</a>) or |
| a script that periodically uploads the inference results to show the training progress. |
| </li> |
| </ul> |
| </li> |
| </ol> |
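<p>Putting steps 1&ndash;3 together, the core of a training peer is a regular PyTorch loop with the optimizer wrapped in <code>hivemind.Optimizer</code>. Below is a hedged sketch: the model, all hyperparameter values, and the peer address are illustrative placeholders, not the demo's settings; see the hivemind quickstart linked above for the authoritative API.</p>

```python
import torch
import hivemind

# Illustrative sketch of wrapping a PyTorch optimizer with hivemind.Optimizer.
# Every value below (model, addresses, hyperparameters) is a made-up placeholder.

model = torch.nn.Linear(512, 512)  # stand-in for your real model

dht = hivemind.DHT(
    initial_peers=["/ip4/203.0.113.1/tcp/31337/p2p/PEER_ID_HERE"],  # e.g. an auxiliary peer
    start=True,
)

opt = hivemind.Optimizer(
    dht=dht,
    run_id="my-collaborative-run",      # unique name shared by all peers in this run
    batch_size_per_step=32,             # samples this peer contributes per step
    target_batch_size=16384,            # global batch size for one optimizer step
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    matchmaking_time=3.0,               # how long to look for peers before averaging
    averaging_timeout=60.0,
    verbose=True,
)
```

<p>After this, you call <code>opt.step()</code> in the training loop just like with a regular optimizer; hivemind handles gradient accumulation and averaging across peers behind the scenes.</p>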
| </div> |
| </div> |
|
|
| </div> |
|
|
| <h3 class="my-3">Organizers</h3> |
|
|
| This demonstration was created by |
| <a href="https://twitter.com/sasha_borzunov">Alexander Borzunov*</a>, |
| <a href="https://twitter.com/m_ryabinin">Max Ryabinin*</a>, |
| <a href="https://twitter.com/Tim_Dettmers">Tim Dettmers*</a>, |
| <a href="https://twitter.com/qlhoest">Quentin Lhoest*</a>, |
| <a href="https://twitter.com/LucileSaulnier">Lucile Saulnier*</a>, |
| <a href="https://twitter.com/michael_diskin">Michael Diskin</a>, |
| <a href="https://twitter.com/YJernite">Yacine Jernite</a>, and |
| <a href="https://twitter.com/Thom_Wolf">Thomas Wolf</a>. |
|
|
| <h3 class="my-3">Learn more</h3> |
|
|
| <ul class="mb-5"> |
| <li>A NeurIPS 2021 <a href="https://arxiv.org/abs/2106.10207">paper</a> on collaborative deep learning.</li> |
| <li><a href="https://github.com/learning-at-home/hivemind">hivemind</a> is a PyTorch library for decentralized deep learning.</li> |
| <li><a href="https://github.com/huggingface/datasets">🤗 Datasets</a> allows uploading and streaming training data from the Hub.</li> |
| <li><a href="https://github.com/facebookresearch/bitsandbytes">bitsandbytes</a> contains implementations of 8-bit optimizers.</li> |
| <li>A <a href="https://arxiv.org/abs/2110.02861">paper</a> on blockwise quantization for communication-efficient training.</li> |
| </ul> |
|
|
| <script src="https://getbootstrap.com/docs/5.0/dist/js/bootstrap.min.js"></script> |
| </body> |
| </html> |
|
|