| # BARTpho [[bartpho]] | |
| ## Overview [[overview]] | |
| The BARTpho model was proposed in [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://huggingface.co/papers/2109.09701) by Nguyen Luong Tran, Duong Minh Le, and Dat Quoc Nguyen. | |
| The abstract of the paper is the following: | |
| *We present BARTpho in two versions, BARTpho_word and BARTpho_syllable. | |
| These are the first large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. | |
| Our BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, which makes it especially suitable for generative NLP tasks. | |
| Experiments on a downstream task of Vietnamese text summarization show that, | |
| in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state of the art. | |
| We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks.* | |
| This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho). | |
| ## Usage example [[usage-example]] | |
| ```python | |
| >>> import torch | |
| >>> from transformers import AutoModel, AutoTokenizer | |
| >>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable") | |
| >>> line = "Chúng tôi là những nghiên cứu viên." | |
| >>> input_ids = tokenizer(line, return_tensors="pt") | |
| >>> with torch.no_grad(): | |
| ...     features = bartpho(**input_ids)  # Model outputs are now tuples | |
| >>> # With TensorFlow 2.0+: | |
| >>> from transformers import TFAutoModel | |
| >>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable") | |
| >>> input_ids = tokenizer(line, return_tensors="tf") | |
| >>> features = bartpho(**input_ids) | |
| ``` | |
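| ## Usage tips [[usage-tips]] | |
| - Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of both the encoder and the decoder. | |
| Thus, to adapt the usage examples in the [BART documentation](bart) to BARTpho, | |
| replace the BART-specific classes with their mBART-specific counterparts. | |
| For example: | |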
| ```python | |
| >>> from transformers import MBartForConditionalGeneration | |
| >>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable") | |
| >>> TXT = "Chúng tôi là <mask> nghiên cứu viên." | |
| >>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"] | |
| >>> logits = bartpho(input_ids).logits | |
| >>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item() | |
| >>> probs = logits[0, masked_index].softmax(dim=0) | |
| >>> values, predictions = probs.topk(5) | |
| >>> print(tokenizer.decode(predictions).split()) | |
| ``` | |
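| - This implementation is only for tokenization: "monolingual_vocab_file" consists of | |
| Vietnamese-specific types extracted from the pre-trained SentencePiece model | |
| "vocab_file" that is available from the multilingual XLM-RoBERTa. | |
| Other languages can reuse BartphoTokenizer with their own language-specific "monolingual_vocab_file", provided they use this pre-trained multilingual SentencePiece model "vocab_file" for subword segmentation. | |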
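The relationship between the multilingual SentencePiece pieces and the monolingual vocabulary can be pictured with a small, self-contained sketch. It is not the library's implementation: the helper names `build_monolingual_ids` and `pieces_to_ids`, the toy piece list, and the exact ordering of the special tokens are assumptions made for illustration. The idea shown is that pieces produced by the shared multilingual segmenter are looked up in a much smaller language-specific table, and anything missing from that table falls back to the unknown token.

```python
# Hypothetical sketch: map multilingual subword pieces onto a small
# monolingual vocabulary. Names and token ordering are illustrative only.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]


def build_monolingual_ids(monolingual_types):
    """Assign consecutive IDs: special tokens, then language types, then <mask>."""
    token_to_id = {}
    for token in SPECIAL_TOKENS + list(monolingual_types) + ["<mask>"]:
        # setdefault keeps the first ID if a token appears twice
        token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def pieces_to_ids(pieces, token_to_id):
    """Look up each piece; pieces outside the monolingual vocabulary become <unk>."""
    unk_id = token_to_id["<unk>"]
    return [token_to_id.get(piece, unk_id) for piece in pieces]


# Toy "monolingual_vocab_file" content (assumed, not the real file)
vi_types = ["▁Chúng", "▁tôi", "▁là"]
token_to_id = build_monolingual_ids(vi_types)

# "▁bonjour" stands in for a piece the multilingual model knows
# but the monolingual vocabulary does not contain.
ids = pieces_to_ids(["▁Chúng", "▁tôi", "▁bonjour"], token_to_id)
print(ids)
```

Under this scheme the embedding table only needs one row per monolingual type plus the special tokens, which is the practical motivation for extracting a language-specific vocabulary.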
| ## BartphoTokenizer [[bartphotokenizer]][[transformers.BartphoTokenizer]] | |
| #### transformers.BartphoTokenizer[[transformers.BartphoTokenizer]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bartpho/tokenization_bartpho.py#L32) | |
| Adapted from `XLMRobertaTokenizer`. Based on [SentencePiece](https://github.com/google/sentencepiece). | |
| This tokenizer inherits from [PreTrainedTokenizer](/docs/transformers/main/ko/main_classes/tokenizer#transformers.PythonBackend) which contains most of the main methods. Users should refer to | |
| this superclass for more information regarding those methods. | |
| **Parameters:** | |
| vocab_file (`str`) : Path to the vocabulary file. This vocabulary is the pre-trained SentencePiece model available from the multilingual XLM-RoBERTa, also used in mBART, consisting of 250K types. | |
| monolingual_vocab_file (`str`) : Path to the monolingual vocabulary file. This monolingual vocabulary consists of Vietnamese-specialized types extracted from the multilingual vocabulary vocab_file of 250K types. | |
| bos_token (`str`, *optional*, defaults to `"<s>"`) : The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the `cls_token`. | |
| eos_token (`str`, *optional*, defaults to `"</s>"`) : The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`. | |
| sep_token (`str`, *optional*, defaults to `"</s>"`) : The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. | |
| cls_token (`str`, *optional*, defaults to `"<s>"`) : The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. | |
| unk_token (`str`, *optional*, defaults to `"<unk>"`) : The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. | |
| pad_token (`str`, *optional*, defaults to `"<pad>"`) : The token used for padding, for example when batching sequences of different lengths. | |
| mask_token (`str`, *optional*, defaults to `"<mask>"`) : The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. | |
| sp_model_kwargs (`dict`, *optional*) : Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set: - `enable_sampling`: Enable subword regularization. - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout. - `nbest_size = {0,1}`: No sampling is performed. - `nbest_size > 1`: samples from the nbest_size results. - `nbest_size < 0`: assuming that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm. - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. | |
| sp_model (`SentencePieceProcessor`) : The *SentencePiece* processor that is used for every conversion (string, tokens and IDs). | |
| #### build_inputs_with_special_tokens[[transformers.BartphoTokenizer.build_inputs_with_special_tokens]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bartpho/tokenization_bartpho.py#L160) | |
| Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and | |
| adding special tokens. A BARTPho sequence has the following format: | |
| - single sequence: `<s> X </s>` | |
| - pair of sequences: `<s> A </s></s> B </s>` | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs to which the special tokens will be added. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| **Returns:** | |
| ``list[int]`` | |
| List of [input IDs](../glossary#input-ids) with the appropriate special tokens. | |
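As a rough illustration of the layout produced by `build_inputs_with_special_tokens`, where a single sequence is wrapped as `<s> X </s>` and a pair as `<s> A </s></s> B </s>`, the sketch below concatenates plain Python lists. The function name `build_inputs_sketch` and the placeholder IDs `0` for `<s>` and `2` for `</s>` are assumptions for illustration; the real IDs come from the tokenizer's `cls_token_id` and `sep_token_id`.

```python
# Hypothetical sketch of the special-token layout; not the library code.
def build_inputs_sketch(token_ids_0, token_ids_1=None, cls_id=0, sep_id=2):
    """Wrap one sequence as <s> X </s>, or a pair as <s> A </s></s> B </s>."""
    if token_ids_1 is None:
        return [cls_id] + token_ids_0 + [sep_id]
    return [cls_id] + token_ids_0 + [sep_id, sep_id] + token_ids_1 + [sep_id]


single = build_inputs_sketch([11, 12, 13])
pair = build_inputs_sketch([11, 12], [21, 22, 23])
print(single)
print(pair)
```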
| #### create_token_type_ids_from_sequences[[transformers.BartphoTokenizer.create_token_type_ids_from_sequences]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bartpho/tokenization_bartpho.py#L214) | |
| Create a mask from the two sequences passed to be used in a sequence-pair classification task. BARTPho does not | |
| make use of token type ids, therefore a list of zeros is returned. | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| **Returns:** | |
| ``list[int]`` | |
| List of zeros. | |
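Because BARTPho ignores token type IDs, the returned mask carries no segment information. The sketch below shows what "a list of zeros" looks like, assuming (this is an assumption, not stated above) that its length matches the sequence after special tokens are added: two extra positions for a single sequence and four for a pair. The name `token_type_ids_sketch` is hypothetical.

```python
# Hypothetical sketch: an all-zero token type mask sized to the full input.
def token_type_ids_sketch(token_ids_0, token_ids_1=None):
    """Return zeros covering the sequence(s) plus the assumed special tokens."""
    if token_ids_1 is None:
        # <s> X </s> adds two special tokens
        return [0] * (len(token_ids_0) + 2)
    # <s> A </s></s> B </s> adds four special tokens
    return [0] * (len(token_ids_0) + len(token_ids_1) + 4)


single_types = token_type_ids_sketch([11, 12, 13])
pair_types = token_type_ids_sketch([11, 12], [21, 22, 23])
print(single_types)
print(pair_types)
```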
| #### get_special_tokens_mask[[transformers.BartphoTokenizer.get_special_tokens_mask]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bartpho/tokenization_bartpho.py#L186) | |
| Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding | |
| special tokens using the tokenizer `prepare_for_model` method. | |
| **Parameters:** | |
| token_ids_0 (`list[int]`) : List of IDs. | |
| token_ids_1 (`list[int]`, *optional*) : Optional second list of IDs for sequence pairs. | |
| already_has_special_tokens (`bool`, *optional*, defaults to `False`) : Whether or not the token list is already formatted with special tokens for the model. | |
| **Returns:** | |
| ``list[int]`` | |
| A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. | |
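The mask can be pictured as marking where special tokens would sit once they are added. The following sketch assumes the `<s> X </s>` and `<s> A </s></s> B </s>` layouts and covers only the case where the input does not yet contain special tokens; `special_tokens_mask_sketch` is a hypothetical name, not the library method.

```python
# Hypothetical sketch: 1 marks a special-token position, 0 a sequence token.
def special_tokens_mask_sketch(token_ids_0, token_ids_1=None):
    """Build the mask for inputs that have no special tokens added yet."""
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    return (
        [1]
        + [0] * len(token_ids_0)
        + [1, 1]
        + [0] * len(token_ids_1)
        + [1]
    )


single_mask = special_tokens_mask_sketch([11, 12, 13])
pair_mask = special_tokens_mask_sketch([11, 12], [21, 22, 23])
print(single_mask)
print(pair_mask)
```

A mask like this is typically used to skip special tokens when, for example, choosing positions to mask for language-model training.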
| #### get_vocab[[transformers.BartphoTokenizer.get_vocab]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bartpho/tokenization_bartpho.py#L244) | |
| Override to use fairseq vocabulary | |