---
title: README
emoji: π
colorFrom: orange
colorTo: indigo
sdk: static
pinned: false
---

<div>
<img src="https://raw.githubusercontent.com/NCAI-Research/CALM/main/assets/logo.png" width="380" alt="CALM Logo" />
<p class="mb-2" style="font-size:30px;font-weight:bold">
CALM: Collaborative Arabic Language Model
</p>
<p class="mb-2">
The CALM project is a joint effort led by <u><a target="_blank" href="https://sdaia.gov.sa/ncai/?Lang=en">NCAI</a></u> in collaboration with
<u><a target="_blank" href="https://yandex.com/">Yandex</a></u>, <u><a href="https://huggingface.co/">Hugging Face</a></u>, and <u><a href="http://www.washington.edu/">UW</a></u> to train an Arabic language model with
volunteers from around the globe. The project is an adaptation of the framework presented in the NeurIPS 2021 demonstration
<u><a target="_blank" href="https://huggingface.co/training-transformers-together">Training Transformers Together</a></u>.
</p>
<p class="mb-2">
One of the main obstacles facing many researchers in the Arabic NLP community is the lack of computing resources needed to train large models. Models with
leading performance on Arabic NLP tasks, such as <u><a target="_blank" href="https://github.com/aub-mind/arabert">AraBERT</a></u>,
<u><a href="https://github.com/CAMeL-Lab/CAMeLBERT" target="_blank">CAMeLBERT</a></u>,
<u><a href="https://huggingface.co/aubmindlab/araelectra-base-generator" target="_blank">AraELECTRA</a></u>, and
<u><a href="https://huggingface.co/qarib">QARiB</a></u>,
took days to train on TPUs. In the spirit of democratizing AI and enabling the community, a core value at NCAI, CALM aims to demonstrate the effectiveness
of collaborative training and to form a community of volunteers for Arabic NLP researchers who have only entry-level cloud GPUs but wish to train their own models collaboratively.
</p>
<p class="mb-2">
CALM trains a single BERT model on a dataset that combines MSA, from the OSCAR and Arabic Wikipedia corpora, with dialectal data for the Gulf region from existing open-source datasets.
Each volunteer GPU trains the model locally at its own pace on a portion of the dataset while another portion is streamed in the background to reduce local
memory consumption. Gradients are computed and aggregated in a distributed manner, according to the computing abilities of each participating
volunteer. Details of the distributed training process are further described in the paper
<u><a target="_blank" href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html">Deep Learning in Open Collaborations</a></u>.
</p>
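<p class="mb-2">
Stripped of all networking, the core aggregation idea can be sketched as a weighted average of peer gradients, where faster volunteers contribute more samples between synchronizations. This is a toy illustration only; the actual run uses a full distributed framework, and the function and variable names below are invented for this sketch.
</p>

```python
from typing import List

def aggregate_gradients(peer_grads: List[List[float]],
                        peer_batches: List[int]) -> List[float]:
    """Toy all-reduce: average peer gradients, weighted by how many
    samples each volunteer processed since the last synchronization."""
    total = sum(peer_batches)
    merged = [0.0] * len(peer_grads[0])
    for grads, n in zip(peer_grads, peer_batches):
        for i, g in enumerate(grads):
            merged[i] += g * (n / total)
    return merged

# A fast GPU that processed 30 samples outweighs a slow one with 10:
print(aggregate_gradients([[1.0, 2.0], [3.0, 6.0]], [30, 10]))  # → [1.5, 3.0]
```

<p class="mb-2">
In the real system, this averaging happens peer-to-peer over the internet, so no volunteer needs to hold the whole batch or stay online for the entire run.
</p>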

<p class="mb-2" style="font-size:20px;font-weight:bold">
How to participate in training?
</p>
<p class="mb-2">
To join the collaborative training, all you have to do is keep a notebook running for at <b>least 15 minutes</b>; you're free to close it after that and rejoin
at another time. There are a few steps to complete before running the notebook:
</p>
| | |
| | <ul class="mb-2"> |
| | <li>π Create an account on <u><a target="_blank" href="https://huggingface.co">Huggingface</a></u>.</li> |
| | <li>π Join the <u><a target="_blank" href="https://huggingface.co/CALM">NCAI-CALM Organization</a></u> on Huggingface through the invitation link shared with you by email.</li> |
| | <li>π Get your Access Token, it's later required in the notebook. |
| | </li> |
| | </ul> |
| | |
| | <p class="h2 mb-2" style="font-size:18px;font-weight:bold">How to get my Huggingface Access Token</p> |
| | <ul class="mb-2"> |
| | <li>π Go to your <u><a target="_blank" href="https://huggingface.co">HF account</a></u>.</li> |
| | <li>π Go to Settings β Access Tokens.</li> |
| | <li>π Generate a new Access Token and enter any name for "what's this token for".</li> |
| | <li>π Select <code>read</code> role.</li> |
| | <li>π Copy your access token.</li> |
| | <li>π In cell 4, it will ask you for an Access Token, paste it there.</li> |
| | </ul> |
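<p class="mb-2">
The notebook prompts for the token interactively, but if you prefer to keep it out of the notebook cells you can expose it through an environment variable. The variable name <code>HF_TOKEN</code> and the helper below are illustrative, not part of the official notebook.
</p>

```python
import os

def get_hf_token() -> str:
    """Read the Hugging Face access token from the environment so it
    is never hard-coded or committed by accident."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("Set HF_TOKEN to the read-role token from your HF settings.")
    return token

# The volunteer notebook would then authenticate with something like:
#   from huggingface_hub import login
#   login(token=get_hf_token())
```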

<p class="mb-2" style="font-size:20px;font-weight:bold">
Start training
</p>
<p class="mb-2">Pick one of the following methods to run the training code.
<br /><em>NOTE: Kaggle gives you around 40 hours of GPU time per week, so it is preferred over Colab unless you have Colab Pro or Colab Pro+.</em></p>
<ul class="mb-2">
<li><span><a href="https://www.kaggle.com/prmais/volunteer-gpu-notebook">
<img style="display:inline;margin:0px" src="https://img.shields.io/badge/kaggle-Open%20in%20Kaggle-blue.svg"/>
</a></span> <b>(recommended)</b>
</li>
<li><span><a href="https://colab.research.google.com/github/NCAI-Research/CALM/blob/main/notebooks/volunteer-gpu-notebook.ipynb">
<img style="display:inline;margin:0px" src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a></span>
</li>
<li>Running locally: if you have additional local GPUs, please visit our Discord channel for instructions on setting them up.
</li>
</ul>

<p class="mb-2" style="font-size:20px;font-weight:bold">
Issues or questions?
</p>

<p class="mb-2">
Feel free to reach out to us on <u><a target="_blank" href="https://discord.gg/peU5Nx77">Discord</a></u> if you have any questions.
</p>
</div>