Upload folder using huggingface_hub

Files changed (7)

.ipynb_checkpoints/README-checkpoint.md +175 -0
.ipynb_checkpoints/adapt_tokenizer-checkpoint.py +40 -0
.ipynb_checkpoints/special_tokens_map-checkpoint.json +4 -0
.ipynb_checkpoints/tokenizer-checkpoint.model +3 -0
.ipynb_checkpoints/tokenizer_config-checkpoint.json +34 -0
tokenizer.model +2 -2
tokenizer_config.json +4 -1

.ipynb_checkpoints/README-checkpoint.md ADDED

	@@ -0,0 +1,175 @@

+---
+license: mit
+---
+This is a version of the [sealion7b](https://huggingface.co/aisingapore/sealion7b) model, sharded to 2 GB chunks.
+Please refer to the previously linked repo for details on usage/implementation/etc. This model was downloaded from the original repo and is redistributed under the same license.
+# SEA-LION
+SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
+The sizes of the models range from 3 billion to 7 billion parameters.
+This is the card for the SEA-LION 7B base model.
+SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.
+## Model Details
+### Model Description
+The SEA-LION model is a significant leap forward in the field of Natural Language Processing,
+specifically trained to understand the SEA regional context.
+SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256K.
+For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance.
+The training data for SEA-LION encompasses 980B tokens.
+- **Developed by:** Products Pillar, AI Singapore
+- **Funded by:** Singapore NRF
+- **Model type:** Decoder
+- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
+- **License:** MIT License
+### Performance Benchmarks
+SEA-LION has an average performance on general tasks in English (as measured by Hugging Face's LLM Leaderboard):
+| Model       | ARC   | HellaSwag | MMLU  | TruthfulQA | Average |
+|-------------|:-----:|:---------:|:-----:|:----------:|:-------:|
+| SEA-LION 7B | 39.93 | 68.51     | 26.87 |      35.09 | 42.60   |
+## Training Details
+### Data
+SEA-LION was trained on 980B tokens of the following data:
+| Data Source               | Unique Tokens | Multiplier | Total Tokens | Percentage |
+|---------------------------|:-------------:|:----------:|:------------:|:----------:|
+| RefinedWeb - English      |        571.3B |          1 |       571.3B |     58.20% |
+| mC4 - Chinese             |         91.2B |          1 |        91.2B |      9.29% |
+| mC4 - Indonesian          |         3.68B |          4 |        14.7B |      1.50% |
+| mC4 - Malay               |         0.72B |          4 |         2.9B |      0.29% |
+| mC4 - Filipino            |         1.32B |          4 |         5.3B |      0.54% |
+| mC4 - Burmese             |          1.2B |          4 |         4.9B |      0.49% |
+| mC4 - Vietnamese          |         63.4B |          1 |        63.4B |      6.46% |
+| mC4 - Thai                |          5.8B |          2 |        11.6B |      1.18% |
+| WangChanBERTa - Thai      |            5B |          2 |          10B |      1.02% |
+| mC4 - Lao                 |         0.27B |          4 |         1.1B |      0.12% |
+| mC4 - Khmer               |         0.97B |          4 |         3.9B |      0.40% |
+| mC4 - Tamil               |         2.55B |          4 |        10.2B |      1.04% |
+| the Stack - Python        |         20.9B |          2 |        41.8B |      4.26% |
+| the Stack - Javascript    |         55.6B |          1 |        55.6B |      5.66% |
+| the Stack - Shell         |         1.25B |          2 |         2.5B |      0.26% |
+| the Stack - SQL           |          6.4B |          2 |        12.8B |      1.31% |
+| the Stack - Markdown      |         26.6B |          1 |        26.6B |      2.71% |
+| RedPajama - StackExchange |         21.2B |          1 |        21.2B |      2.16% |
+| RedPajama - ArXiv         |         30.6B |          1 |        30.6B |      3.12% |
+### Infrastructure
+SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
+on the following hardware:
+| Training Details     | SEA-LION 7B  |
+|----------------------|:------------:|
+| AWS EC2 p4d.24xlarge | 32 instances |
+| Nvidia A100 40GB GPU | 256          |
+| Training Duration    | 22 days      |
+### Configuration
+| HyperParameter    | SEA-LION 7B        |
+|-------------------|:------------------:|
+| Precision         | bfloat16           |
+| Optimizer         | decoupled_adamw    |
+| Scheduler         | cosine_with_warmup |
+| Learning Rate     | 6.0e-5             |
+| Global Batch Size | 2048               |
+| Micro Batch Size  | 4                  |
+## Technical Specifications
+### Model Architecture and Objective
+SEA-LION is a decoder model using the MPT architecture.
+| Parameter       | SEA-LION 7B |
+|-----------------|:-----------:|
+| Layers          | 32          |
+| d_model         | 4096        |
+| head_dim        | 32          |
+| Vocabulary      | 256000      |
+| Sequence Length | 2048        |
+### Tokenizer Details
+We sample 20M lines from the training data to train the tokenizer.<br>
+The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
+The tokenizer type is Byte-Pair Encoding (BPE).
+## The Team
+Lam Wen Zhi Clarence<br>
+Leong Wei Qi<br>
+Li Yier<br>
+Liu Bing Jie Darius<br>
+Lovenia Holy<br>
+Montalan Jann Railey<br>
+Ng Boon Cheong Raymond<br>
+Ngui Jian Gang<br>
+Nguyen Thanh Ngan<br>
+Ong Tat-Wee David<br>
+Rengarajan Hamsawardhini<br>
+Susanto Yosephine<br>
+Tai Ngee Chia<br>
+Tan Choon Meng<br>
+Teo Jin Howe<br>
+Teo Eng Sipp Leslie<br>
+Teo Wei Yi<br>
+Tjhi William<br>
+Yeo Yeow Tong<br>
+Yong Xianbin<br>
+## Acknowledgements
+AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
+Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
+## Contact
+For more info, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6)
+[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
+## Disclaimer
+This is the repository for the base model.
+The model has _not_ been aligned for safety.
+Developers and users should perform their own safety fine-tuning and related security measures.
+In no event shall the authors be held liable for any claim, damages, or other liability
+arising from the use of the released weights and codes.
+## References
+```bibtex
+@misc{lowphansirikul2021wangchanberta,
+    title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
+    author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
+    year={2021},
+    eprint={2101.09635},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
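The token accounting in the README's data table can be spot-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the commit: the row values are copied from the table, while the function names and the rounding tolerance of 0.06B are assumptions chosen for illustration.

```python
# Selected rows copied from the data table:
# (source, unique tokens [B], multiplier, stated total tokens [B])
ROWS = [
    ("mC4 - Indonesian", 3.68, 4, 14.7),
    ("mC4 - Thai", 5.8, 2, 11.6),
    ("mC4 - Tamil", 2.55, 4, 10.2),
    ("the Stack - Python", 20.9, 2, 41.8),
    ("the Stack - SQL", 6.4, 2, 12.8),
]

# "Total Tokens" column of all 19 rows, in table order (billions).
TOTALS_B = [
    571.3, 91.2, 14.7, 2.9, 5.3, 4.9, 63.4, 11.6, 10.0, 1.1,
    3.9, 10.2, 41.8, 55.6, 2.5, 12.8, 26.6, 21.2, 30.6,
]


def row_is_consistent(unique_b, multiplier, stated_total_b, tol=0.06):
    """True if unique * multiplier matches the stated total within rounding.

    The tolerance is an assumption: the table rounds totals to one decimal.
    """
    return abs(unique_b * multiplier - stated_total_b) <= tol


def percentage_of_total(part_b, totals_b):
    """Share of one source relative to the sum of all stated totals."""
    return 100.0 * part_b / sum(totals_b)


for name, unique_b, mult, total_b in ROWS:
    ok = row_is_consistent(unique_b, mult, total_b)
    print(f"{name}: {unique_b * mult:.2f}B computed, {total_b}B stated, ok={ok}")

print(f"sum of totals: {sum(TOTALS_B):.1f}B")
print(f"RefinedWeb share: {percentage_of_total(571.3, TOTALS_B):.2f}%")
```

The stated totals sum to slightly more than the rounded 980B headline figure, and the Percentage column appears to be computed against that row sum rather than against a flat 980B.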

.ipynb_checkpoints/adapt_tokenizer-checkpoint.py ADDED

	@@ -0,0 +1,40 @@

+from typing import Any
+from transformers import AutoTokenizer, PreTrainedTokenizerBase
+NUM_SENTINEL_TOKENS: int = 100
+def adapt_tokenizer_for_denoising(tokenizer: PreTrainedTokenizerBase) -> None:
+    """Adds sentinel tokens and padding token (if missing).
+    Expands the tokenizer vocabulary to include sentinel tokens
+    used in mixture-of-denoiser tasks as well as a padding token.
+    All added tokens are added as special tokens. No tokens are
+    added if sentinel tokens and padding token already exist.
+    """
+    sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
+    tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
+    if tokenizer.pad_token is None:
+        tokenizer.add_tokens('<pad>', special_tokens=True)
+        tokenizer.pad_token = '<pad>'
+        assert tokenizer.pad_token_id is not None
+    sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
+    _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
+    tokenizer.sentinel_token_ids = _sentinel_token_ids
+class AutoTokenizerForMOD(AutoTokenizer):
+    """AutoTokenizer + Adaptation for MOD.
+    A simple wrapper around AutoTokenizer to make instantiating
+    an MOD-adapted tokenizer a bit easier.
+    MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
+    a padding token, and a property to get the token ids of the
+    sentinel tokens.
+    """
+    @classmethod
+    def from_pretrained(cls, *args: Any, **kwargs: Any) -> PreTrainedTokenizerBase:
+        """See `AutoTokenizer.from_pretrained` docstring."""
+        tokenizer = super().from_pretrained(*args, **kwargs)
+        adapt_tokenizer_for_denoising(tokenizer)
+        return tokenizer
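As a rough illustration of what `adapt_tokenizer_for_denoising` registers, the sentinel strings can be generated on their own. This sketch only reproduces the naming scheme from the file above; the helper name `sentinel_tokens` is hypothetical, and no real tokenizer is involved.

```python
from typing import List

NUM_SENTINEL_TOKENS = 100  # same value as the constant in adapt_tokenizer.py


def sentinel_tokens(n: int = NUM_SENTINEL_TOKENS) -> List[str]:
    """Build the sentinel strings <extra_id_0> ... <extra_id_{n-1}>."""
    return [f"<extra_id_{i}>" for i in range(n)]


tokens = sentinel_tokens()
print(len(tokens), tokens[0], tokens[-1])
```

In the committed file this list is built twice: once to add the tokens to the vocabulary, and once as a single joined string that is tokenized to recover the sentinel token ids.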

.ipynb_checkpoints/special_tokens_map-checkpoint.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<unk>"
+}

.ipynb_checkpoints/tokenizer-checkpoint.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d3243fc67ced759a4adcca01c0356f5b722057158e99d3cb9502c2572dbda0cf
+size 132

.ipynb_checkpoints/tokenizer_config-checkpoint.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "auto_map": {
+    "AutoTokenizer": ["tokenization_SEA_BPE.SEABPETokenizer", null]
+  },
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "legacy": true,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": null,
+  "sp_model_kwargs": {},
+  "tokenizer_class": "SEABPETokenizer",
+  "unk_token": "<unk>"
+}
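The `auto_map` entry is what points `AutoTokenizer` at the custom `SEABPETokenizer` class. In this commit it appears in two forms: a bare `module.ClassName` reference (in the checkpoint copy above) and a reference prefixed with a Hub repo id and `--` (in the updated `tokenizer_config.json`). The sketch below shows how such a string splits into its parts. It is an illustration under the assumption that the text before `--` names the repo hosting the code; the function `split_auto_map_ref` is hypothetical and is not the `transformers` implementation.

```python
from typing import Optional, Tuple


def split_auto_map_ref(ref: str) -> Tuple[Optional[str], str, str]:
    """Split an auto_map reference into (repo_id, module, class_name).

    repo_id is None when the reference has no "repo--" prefix, which is
    read here as "the code lives in the same repo as the config".
    """
    repo_id, sep, class_ref = ref.rpartition("--")
    module, _, class_name = class_ref.rpartition(".")
    return (repo_id if sep else None), module, class_name


old_ref = "tokenization_SEA_BPE.SEABPETokenizer"
new_ref = "aisingapore/sealion7b--tokenization_SEA_BPE.SEABPETokenizer"

print(split_auto_map_ref(old_ref))
print(split_auto_map_ref(new_ref))
```

Either way, loading a tokenizer whose class comes from `auto_map` requires passing `trust_remote_code=True` to `AutoTokenizer.from_pretrained`, since the class is defined in repo code rather than in the library.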

tokenizer.model CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d3243fc67ced759a4adcca01c0356f5b722057158e99d3cb9502c2572dbda0cf
-size 132
+oid sha256:c0c576972c98fa150efff77f61a30b46afbc1247ff4697f39e51e90d0a8b2190
+size 4569957
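What changes in `tokenizer.model` is a Git LFS pointer, not the binary itself: a small text file with `version`, `oid`, and `size` lines. A minimal sketch of reading such a pointer follows; the function name `parse_lfs_pointer` and the returned dictionary layout are assumptions made for illustration, and the pointer text is copied from the new side of the hunk.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real object in bytes
    }


new_pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c0c576972c98fa150efff77f61a30b46afbc1247ff4697f39e51e90d0a8b2190
size 4569957
"""

info = parse_lfs_pointer(new_pointer)
print(info["hash_algo"], info["size"])
```

Read this way, the hunk replaces an object recorded as 132 bytes with one recorded as 4569957 bytes.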

tokenizer_config.json CHANGED

@@ -20,7 +20,10 @@
     }
   },
   "auto_map": {
-    "AutoTokenizer": ["tokenization_SEA_BPE.SEABPETokenizer", null]
+    "AutoTokenizer": [
+      "aisingapore/sealion7b--tokenization_SEA_BPE.SEABPETokenizer",
+      null
+    ]
   },
   "bos_token": null,
   "clean_up_tokenization_spaces": false,