Kishoreuses5 committed on
Commit 1ca8437 · verified · 1 Parent(s): 6ae1619

Upload 12 files

codet5p/README.md ADDED
@@ -0,0 +1,116 @@
---
license: bsd-3-clause
---

# CodeT5+ 110M Embedding Model

## Model description

[CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models
with an encoder-decoder architecture that can flexibly operate in different modes (i.e., _encoder-only_, _decoder-only_,
and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
It is introduced in the paper:

[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://arxiv.org/pdf/2305.07922.pdf)
by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (\* indicates equal contribution).

Compared to the original CodeT5 family (base: `220M`, large: `770M`), CodeT5+ is pretrained with a diverse set of
pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code
matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model
components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale
up the model (i.e., `2B`, `6B`, `16B`), and adopts a "shallow encoder and deep decoder" architecture.
Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B)
following [Code Alpaca](https://github.com/sahil280114/codealpaca).

## How to use

This checkpoint consists of the encoder of the CodeT5+ 220M model (pretrained in two stages on both unimodal and bimodal data) and a projection layer, which together extract 256-dimensional code embeddings. It can be loaded with the `AutoModel` class and uses the same [CodeT5](https://github.com/salesforce/CodeT5) tokenizer.

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "Salesforce/codet5p-110m-embedding"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
print(f'Dimension of the embedding: {embedding.size()[0]}, with norm={embedding.norm().item()}')
# Dimension of the embedding: 256, with norm=1.0
print(embedding)
# tensor([ 0.0185,  0.0229, -0.0315, -0.0307, -0.1421, -0.0575, -0.0275,  0.0501,
#          0.0203,  0.0337, -0.0067, -0.0075, -0.0222, -0.0107, -0.0250, -0.0657,
#          0.1571, -0.0994, -0.0370,  0.0164, -0.0948,  0.0490, -0.0352,  0.0907,
#         -0.0198,  0.0130, -0.0921,  0.0209,  0.0651,  0.0319,  0.0299, -0.0173,
#         -0.0693, -0.0798, -0.0066, -0.0417,  0.1076,  0.0597, -0.0316,  0.0940,
#         -0.0313,  0.0993,  0.0931, -0.0427,  0.0256,  0.0297, -0.0561, -0.0155,
#         -0.0496, -0.0697, -0.1011,  0.1178,  0.0283, -0.0571, -0.0635, -0.0222,
#          0.0710, -0.0617,  0.0423, -0.0057,  0.0620, -0.0262,  0.0441,  0.0425,
#         -0.0413, -0.0245,  0.0043,  0.0185,  0.0060, -0.1727, -0.1152,  0.0655,
#         -0.0235, -0.1465, -0.1359,  0.0022,  0.0177, -0.0176, -0.0361, -0.0750,
#         -0.0464, -0.0846, -0.0088,  0.0136, -0.0221,  0.0591,  0.0876, -0.0903,
#          0.0271, -0.1165, -0.0169, -0.0566,  0.1173, -0.0801,  0.0430,  0.0236,
#          0.0060, -0.0778, -0.0570,  0.0102, -0.0172, -0.0051, -0.0891, -0.0620,
#         -0.0536,  0.0190, -0.0039, -0.0189, -0.0267, -0.0389, -0.0208,  0.0076,
#         -0.0676,  0.0630, -0.0962,  0.0418, -0.0172, -0.0229, -0.0452,  0.0401,
#          0.0270,  0.0677, -0.0111, -0.0089,  0.0175,  0.0703,  0.0714, -0.0068,
#          0.1214, -0.0004,  0.0020,  0.0255,  0.0424, -0.0030,  0.0318,  0.1227,
#          0.0676, -0.0723,  0.0970,  0.0637, -0.0140, -0.0283, -0.0120,  0.0343,
#         -0.0890,  0.0680,  0.0514,  0.0513,  0.0627, -0.0284, -0.0479,  0.0068,
#         -0.0794,  0.0202,  0.0208, -0.0113, -0.0747,  0.0045, -0.0854, -0.0609,
#         -0.0078,  0.1168,  0.0618, -0.0223, -0.0755,  0.0182, -0.0128,  0.1116,
#          0.0240,  0.0342,  0.0119, -0.0235, -0.0150, -0.0228, -0.0568, -0.1528,
#          0.0164, -0.0268,  0.0727, -0.0569,  0.1306,  0.0643, -0.0158, -0.1070,
#         -0.0107, -0.0139, -0.0363,  0.0366, -0.0986, -0.0628, -0.0277,  0.0316,
#          0.0363,  0.0038, -0.1092, -0.0679, -0.1398, -0.0648,  0.1711, -0.0666,
#          0.0563,  0.0581,  0.0226,  0.0347, -0.0672, -0.0229, -0.0565,  0.0623,
#          0.1089, -0.0687, -0.0901, -0.0073,  0.0426,  0.0870, -0.0390, -0.0144,
#         -0.0166,  0.0262, -0.0310,  0.0467, -0.0164, -0.0700, -0.0602, -0.0720,
#         -0.0386,  0.0067, -0.0337, -0.0053,  0.0829,  0.1004,  0.0427,  0.0026,
#         -0.0537,  0.0951,  0.0584, -0.0583, -0.0208,  0.0124,  0.0067,  0.0403,
#          0.0091, -0.0044, -0.0036,  0.0524,  0.1103, -0.1511, -0.0479,  0.1709,
#          0.0772,  0.0721, -0.0332,  0.0866,  0.0799, -0.0581,  0.0713,  0.0218],
#        device='cuda:0', grad_fn=<SelectBackward0>)
```
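Because the returned embeddings are L2-normalized (note `norm=1.0` above), the dot product of two embeddings is directly their cosine similarity, so nearest-neighbor code search reduces to dot products over precomputed vectors. A minimal, dependency-free sketch of that ranking step, using small made-up vectors in place of real model outputs (the corpus names and values are purely illustrative):

```python
import math

def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit L2 norm, as the model does for its embeddings.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Hypothetical precomputed, L2-normalized embeddings for an indexed code corpus.
corpus = {
    "print_hello": normalize([0.9, 0.1, 0.0]),
    "sum_list":    normalize([0.1, 0.9, 0.1]),
    "read_file":   normalize([0.0, 0.2, 0.9]),
}

# Hypothetical embedding of a natural-language query.
query = normalize([0.8, 0.2, 0.1])

# Rank corpus entries by cosine similarity (= dot product for unit vectors).
ranked = sorted(corpus, key=lambda name: dot(query, corpus[name]), reverse=True)
print(ranked[0])  # the snippet most similar to the query
```

In practice the corpus vectors would come from the model above and be stacked into a matrix so the whole ranking is one matrix-vector product.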

## Pretraining data

This checkpoint is trained on the stricter permissive subset of the deduplicated version of
the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause",
"cc0-1.0", "unlicense", "isc").
Supported languages (9 in total) are as follows:
`c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby`.

## Training procedure

This checkpoint is first trained on unimodal code data in the first-stage pretraining and then on bimodal text-code
pair data using the proposed mixture of pretraining tasks.
Please refer to the paper for more details.

## Evaluation results

We show the zero-shot results of this checkpoint on six downstream code retrieval tasks from CodeXGLUE in the following table.

| Ruby | JavaScript | Go | Python | Java | PHP | Overall |
| ----- | ---------- | ----- | ------ | ----- | ----- | ------- |
| 74.51 | 69.07 | 90.69 | 71.55 | 71.82 | 67.72 | 74.23 |
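These scores follow the CodeXGLUE text-to-code retrieval setup, which (assuming the standard protocol for this benchmark) evaluates with Mean Reciprocal Rank (MRR): each query contributes the reciprocal of the rank at which its true code snippet is retrieved, averaged over queries and reported as a percentage. A toy illustration of the metric (made-up ranks, not the actual evaluation data):

```python
def mrr(ranks):
    """Mean Reciprocal Rank: `ranks` holds the 1-based position of each
    query's correct snippet in the retrieved ranking."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy example: four queries whose correct snippets were retrieved at ranks 1, 2, 1, 4.
ranks = [1, 2, 1, 4]
score = round(mrr(ranks) * 100, 2)  # reported as a percentage, like the table above
print(score)
```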

## BibTeX entry and citation info

```bibtex
@article{wang2023codet5plus,
  title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
  author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
  journal={arXiv preprint},
  year={2023}
}
```

## Ethical Considerations

This release is for research purposes only, in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
codet5p/added_tokens.json ADDED
@@ -0,0 +1,5 @@
{
  "[CDEC]": 32102,
  "[ENC]": 32100,
  "[TDEC]": 32101
}
codet5p/config.json ADDED
@@ -0,0 +1,44 @@
{
  "_name_or_path": "Salesforce/codet5p-110m-embedding",
  "architectures": [
    "CodeT5p_Embedding"
  ],
  "auto_map": {
    "AutoConfig": "configuration_codet5p_embedding.CodeT5pEmbeddingConfig",
    "AutoModel": "modeling_codet5p_embedding.CodeT5pEmbeddingModel"
  },
  "bos_token_id": 1,
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "embed_dim": 256,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 2,
  "feed_forward_proj": "relu",
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-06,
  "model_type": "codet5p_embedding",
  "n_positions": 512,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "torch_dtype": "float32",
  "transformers_version": "4.21.3",
  "use_cache": true,
  "vocab_size": 32103
}
codet5p/configuration_codet5p_embedding.py ADDED
@@ -0,0 +1,72 @@
# coding=utf-8
# Copyright 2023 Salesforce authors, The EleutherAI, and HuggingFace Teams. All rights reserved.

""" CodeT5+ embedding model configuration"""
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging

logger = logging.get_logger(__name__)


class CodeT5pEmbeddingConfig(PretrainedConfig):
    model_type = "codet5p_embedding"
    keys_to_ignore_at_inference = ["past_key_values"]
    attribute_map = {"hidden_size": "d_model", "num_attention_heads": "num_heads", "num_hidden_layers": "num_layers"}

    def __init__(
        self,
        vocab_size=32103,
        d_model=768,
        embed_dim=256,
        d_kv=64,
        d_ff=3072,
        num_layers=12,
        num_heads=12,
        relative_attention_num_buckets=32,
        relative_attention_max_distance=128,
        dropout_rate=0.1,
        layer_norm_epsilon=1e-6,
        initializer_factor=1.0,
        feed_forward_proj="relu",
        is_encoder_decoder=False,
        use_cache=True,
        pad_token_id=0,
        eos_token_id=2,
        **kwargs
    ):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.embed_dim = embed_dim
        self.d_kv = d_kv
        self.d_ff = d_ff
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.relative_attention_num_buckets = relative_attention_num_buckets
        self.relative_attention_max_distance = relative_attention_max_distance
        self.dropout_rate = dropout_rate
        self.layer_norm_epsilon = layer_norm_epsilon
        self.initializer_factor = initializer_factor
        self.feed_forward_proj = feed_forward_proj
        self.use_cache = use_cache

        act_info = self.feed_forward_proj.split("-")
        self.dense_act_fn = act_info[-1]
        self.is_gated_act = act_info[0] == "gated"

        if len(act_info) > 1 and act_info[0] != "gated" or len(act_info) > 2:
            raise ValueError(
                f"`feed_forward_proj`: {feed_forward_proj} is not a valid activation function of the dense layer. "
                "Please make sure `feed_forward_proj` is of the format `gated-{ACT_FN}` or `{ACT_FN}`, e.g. "
                "'gated-gelu' or 'relu'"
            )

        # for backwards compatibility
        if feed_forward_proj == "gated-gelu":
            self.dense_act_fn = "gelu_new"

        super().__init__(
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            is_encoder_decoder=is_encoder_decoder,
            **kwargs,
        )
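The `feed_forward_proj` handling in this config accepts either a plain activation name (e.g. `"relu"`) or a `gated-` prefixed one, and maps `"gated-gelu"` to `"gelu_new"` for backwards compatibility. A standalone replica of that string logic (not the repository's code, just the same parsing, for illustration):

```python
def parse_feed_forward_proj(feed_forward_proj):
    # Mirrors the config's parsing: "gated-gelu" -> gated act, "relu" -> plain act.
    act_info = feed_forward_proj.split("-")
    dense_act_fn = act_info[-1]
    is_gated_act = act_info[0] == "gated"
    if len(act_info) > 1 and act_info[0] != "gated" or len(act_info) > 2:
        raise ValueError(f"invalid feed_forward_proj: {feed_forward_proj}")
    if feed_forward_proj == "gated-gelu":  # backwards compatibility
        dense_act_fn = "gelu_new"
    return dense_act_fn, is_gated_act

print(parse_feed_forward_proj("relu"))        # plain activation, not gated
print(parse_feed_forward_proj("gated-gelu"))  # gated, remapped to "gelu_new"
```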
codet5p/gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
codet5p/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
codet5p/modeling_codet5p_embedding.py ADDED
@@ -0,0 +1,53 @@
# coding=utf-8
# Copyright 2023 Salesforce authors, The EleutherAI, and HuggingFace Teams. All rights reserved.
""" PyTorch CodeT5+ embedding models.
The implementation is based on transformers.models.t5.modeling_t5, adding a projection layer on top of T5EncoderModel.
"""

from typing import Optional, Tuple, Union
import torch
from torch import nn
import torch.nn.functional as F
from transformers import T5EncoderModel
from transformers.modeling_outputs import (
    BaseModelOutput,
)
from .configuration_codet5p_embedding import CodeT5pEmbeddingConfig


class CodeT5pEmbeddingModel(T5EncoderModel):
    config_class = CodeT5pEmbeddingConfig

    authorized_missing_keys = [
        r"encoder.embed_tokens.weight",
    ]

    def __init__(self, config: CodeT5pEmbeddingConfig):
        super().__init__(config)
        self.proj = nn.Linear(config.d_model, config.embed_dim)

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.FloatTensor], BaseModelOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        encoder_outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # Pool the first token's hidden state, project to embed_dim, then L2-normalize.
        embedding = F.normalize(self.proj(encoder_outputs.last_hidden_state[:, 0, :]), dim=-1)

        return embedding
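The forward pass above pools only the first token's hidden state, applies a linear projection down to `embed_dim`, and L2-normalizes the result. A dependency-free sketch of that pooling-and-projection arithmetic on toy numbers (hypothetical sizes: hidden size 4, embedding size 2; the weights are made up):

```python
import math

def project_and_normalize(first_token_hidden, weight, bias):
    """Linear projection of the first-token hidden state followed by L2
    normalization, mirroring the model's
    F.normalize(self.proj(last_hidden_state[:, 0, :]), dim=-1) step."""
    projected = [
        sum(w * h for w, h in zip(row, first_token_hidden)) + b
        for row, b in zip(weight, bias)
    ]
    norm = math.sqrt(sum(x * x for x in projected))
    return [x / norm for x in projected]

# Toy values: hidden size 4 -> embedding size 2.
hidden = [0.5, -1.0, 0.25, 2.0]
weight = [[0.1, 0.0, 0.2, 0.0],   # one row per output dimension
          [0.0, 0.3, 0.0, -0.1]]
bias = [0.0, 0.1]

emb = project_and_normalize(hidden, weight, bias)
print(emb)  # a unit-length 2-d vector
```

This is why the README's example prints `norm=1.0`: whatever the projection produces, the final embedding always lies on the unit sphere.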
codet5p/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:097d1bd8c5df11eb82aa5b750d208eee17d570babf94c77158279dc992c6829b
size 439257889
codet5p/special_tokens_map.json ADDED
@@ -0,0 +1,56 @@
{
  "additional_special_tokens": [
    "[ENC]",
    "[TDEC]",
    "[CDEC]"
  ],
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
codet5p/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
codet5p/tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "pad_token": {
    "__type": "AddedToken",
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "tokenizer_class": "RobertaTokenizer",
  "trim_offsets": true,
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
codet5p/vocab.json ADDED
The diff for this file is too large to render. See raw diff