| # CPM | |
| ## Overview | |
| The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, | |
| Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, | |
| Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, and Maosong Sun. | |
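| The abstract from the paper is the following: | |
| *Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, | |
| with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even | |
| zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus | |
| of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the | |
| Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best | |
| of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained | |
| language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, | |
| cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many | |
| NLP tasks in the settings of few-shot (even zero-shot) learning.* | |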
| This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found | |
| here: https://github.com/TsinghuaAI/CPM-Generate | |
| CPM's architecture is the same as GPT-2, except for the tokenization method. Refer to the [GPT-2 documentation](openai-community/gpt2) for | |
| API reference information. | |
| ## CpmTokenizer[[transformers.CpmTokenizer]] | |
| #### transformers.CpmTokenizer[[transformers.CpmTokenizer]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm.py#L34) | |
| Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models. | |
| #### build_inputs_with_special_tokens[[transformers.CpmTokenizer.build_inputs_with_special_tokens]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm.py#L230) | |
| Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and | |
| adding special tokens. An XLNet sequence has the following format: | |
| - single sequence: `X <sep> <cls>` | |
| - pair of sequences: `A <sep> B <sep> <cls>` | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs to which the special tokens will be added. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| **Returns:** | |
| ``list[int]`` | |
| List of [input IDs](../glossary#input-ids) with the appropriate special tokens. | |
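The concatenation can be sketched in plain Python, assuming the XLNet convention of appending a separator token after each sequence and a classifier token at the very end. This is an illustration only, not the tokenizer's own code: the function name `build_inputs_sketch` and the IDs `SEP_ID` and `CLS_ID` are hypothetical placeholders, and a real tokenizer supplies its own special-token IDs.

```python
# Hypothetical special-token IDs, chosen only for illustration.
SEP_ID = 4
CLS_ID = 3


def build_inputs_sketch(token_ids_0, token_ids_1=None):
    """Append special tokens in the XLNet style: separator(s) first, classifier last."""
    sep, cls = [SEP_ID], [CLS_ID]
    if token_ids_1 is None:
        # Single sequence: X, separator, classifier
        return token_ids_0 + sep + cls
    # Pair of sequences: A, separator, B, separator, classifier
    return token_ids_0 + sep + token_ids_1 + sep + cls


single = build_inputs_sketch([10, 11, 12])
pair = build_inputs_sketch([10, 11], [20, 21])
print(single)
print(pair)
```

Note that the classifier token sits at the end of the sequence here, unlike BERT-style layouts where it comes first.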
| #### convert_tokens_to_string[[transformers.CpmTokenizer.convert_tokens_to_string]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm.py#L225) | |
| Converts a sequence of tokens (strings for sub-words) into a single string. | |
| #### create_token_type_ids_from_sequences[[transformers.CpmTokenizer.create_token_type_ids_from_sequences]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm.py#L283) | |
| Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet | |
| sequence pair mask has the following format: | |
| ``` | |
| 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | |
| | first sequence | second sequence | | |
| ``` | |
| If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| **Returns:** | |
| ``list[int]`` | |
| List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s). | |
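The mask layout in the diagram can be reproduced with a few lines of Python. This is a minimal sketch of the 0s-then-1s pattern only: the helper name `sequence_pair_mask_sketch` is hypothetical, and the segment lengths are assumed to already include whatever special tokens belong to each segment. The real tokenizer derives these lengths itself and may treat the trailing classifier token differently.

```python
def sequence_pair_mask_sketch(first_len, second_len=None):
    """Return 0s for the first segment, followed by 1s for the second if present."""
    mask = [0] * first_len
    if second_len is not None:
        # Only sequence pairs get the second portion of the mask.
        mask += [1] * second_len
    return mask


only_first = sequence_pair_mask_sketch(4)
both = sequence_pair_mask_sketch(4, 3)
print(only_first)
print(both)
```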
| #### get_special_tokens_mask[[transformers.CpmTokenizer.get_special_tokens_mask]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm.py#L255) | |
| Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding | |
| special tokens using the tokenizer `prepare_for_model` method. | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| already_has_special_tokens (`bool`, *optional*, defaults to `False`) : Whether or not the token list is already formatted with special tokens for the model. | |
| **Returns:** | |
| ``list[int]`` | |
| A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. | |
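As a rough illustration of the returned mask, the sketch below marks special-token positions for input that has no special tokens yet (the `already_has_special_tokens=False` case). It assumes the XLNet layout, in which a separator follows each sequence and a classifier token closes the input; the function name `special_tokens_mask_sketch` is hypothetical and the code is not taken from the library.

```python
def special_tokens_mask_sketch(token_ids_0, token_ids_1=None):
    """Return 1 at positions where a special token would be added, 0 elsewhere."""
    # First sequence, then its separator.
    mask = [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        # Second sequence, then its separator.
        mask += [0] * len(token_ids_1) + [1]
    # Trailing classifier token.
    return mask + [1]


single_mask = special_tokens_mask_sketch([10, 11, 12])
pair_mask = special_tokens_mask_sketch([10, 11], [20, 21])
print(single_mask)
print(pair_mask)
```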
| ## CpmTokenizerFast[[transformers.CpmTokenizerFast]] | |
| #### transformers.CpmTokenizerFast[[transformers.CpmTokenizerFast]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm_fast.py#L28) | |
| Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models. | |
| #### build_inputs_with_special_tokens[[transformers.CpmTokenizerFast.build_inputs_with_special_tokens]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm_fast.py#L145) | |
| Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and | |
| adding special tokens. An XLNet sequence has the following format: | |
| - single sequence: `X <sep> <cls>` | |
| - pair of sequences: `A <sep> B <sep> <cls>` | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs to which the special tokens will be added. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| **Returns:** | |
| ``list[int]`` | |
| List of [input IDs](../glossary#input-ids) with the appropriate special tokens. | |
| #### create_token_type_ids_from_sequences[[transformers.CpmTokenizerFast.create_token_type_ids_from_sequences]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cpm/tokenization_cpm_fast.py#L170) | |
| Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet | |
| sequence pair mask has the following format: | |
| ``` | |
| 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | |
| | first sequence | second sequence | | |
| ``` | |
| If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| **Returns:** | |
| ``list[int]`` | |
| List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s). | |