Buckets:
| # Cohere[[cohere]] | |
| ## 개요[[overview]] | |
| The Cohere Command-R 모델은 Cohere팀이 [Command-R: 프로덕션 규모의 검색 증강 생성](https://txt.cohere.com/command-r/)라는 블로그 포스트에서 소개 되었습니다. | |
| 논문 초록: | |
| *Command-R은 기업의 프로덕션 규모 AI를 가능하게 하기 위해 RAG(검색 증강 생성)와 도구 사용을 목표로 하는 확장 가능한 생성 모델입니다. 오늘 우리는 대규모 프로덕션 워크로드를 목표로 하는 새로운 LLM인 Command-R을 소개합니다. Command-R은 높은 효율성과 강력한 정확성의 균형을 맞추는 '확장 가능한' 모델 카테고리를 대상으로 하여, 기업들이 개념 증명을 넘어 프로덕션 단계로 나아갈 수 있게 합니다.* | |
| *Command-R은 검색 증강 생성(RAG)이나 외부 API 및 도구 사용과 같은 긴 문맥 작업에 최적화된 생성 모델입니다. 이 모델은 RAG 애플리케이션을 위한 최고 수준의 통합을 제공하고 기업 사용 사례에서 뛰어난 성능을 발휘하기 위해 우리의 업계 선도적인 Embed 및 Rerank 모델과 조화롭게 작동하도록 설계되었습니다. 기업이 대규모로 구현할 수 있도록 만들어진 모델로서, Command-R은 다음과 같은 특징을 자랑합니다: | |
| - RAG 및 도구 사용에 대한 강력한 정확성 | |
| - 낮은 지연 시간과 높은 처리량 | |
| - 더 긴 128k 컨텍스트와 낮은 가격 | |
| - 10개의 주요 언어에 걸친 강력한 기능 | |
| - 연구 및 평가를 위해 HuggingFace에서 사용 가능한 모델 가중치 | |
| 모델 체크포인트는 [이곳](https://huggingface.co/CohereForAI/c4ai-command-r-v01)에서 확인하세요. | |
| 이 모델은 [Saurabh Dash](https://huggingface.co/saurabhdash)과 [Ahmet Üstün](https://huggingface.co/ahmetustun)에 의해 기여 되었습니다. Hugging Face에서 이 코드의 구현은 [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)에 기반하였습니다. | |
| ## 사용 팁[[usage-tips]] | |
| Hub에 업로드된 체크포인트들은 `dtype = 'float16'`을 사용합니다. | |
| 이는 `AutoModel` API가 체크포인트를 `torch.float32`에서 `torch.float16`으로 변환하는 데 사용됩니다. | |
| 온라인 가중치의 `dtype`은 `model = AutoModelForCausalLM.from_pretrained("path", dtype = "auto")`를 사용하여 모델을 초기화할 때 `dtype="auto"`를 사용하지 않는 한 대부분 무관합니다. 그 이유는 모델이 먼저 다운로드되고(온라인 체크포인트의 `dtype` 사용), 그 다음 `torch`의 기본 `dtype`으로 변환되며(이때 `torch.float32`가 됨), 마지막으로 config에 `dtype`이 제공된 경우 이를 사용하기 때문입니다. | |
| 모델을 `float16`으로 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모델은 `bfloat16`으로 훈련해야 합니다. | |
| 모델과 토크나이저는 다음과 같이 로드할 수 있습니다: | |
| ```python | |
| # pip install transformers | |
| from transformers import AutoTokenizer, AutoModelForCausalLM | |
| model_id = "CohereForAI/c4ai-command-r-v01" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForCausalLM.from_pretrained(model_id) | |
| # Format message with the command-r chat template | |
| messages = [{"role": "user", "content": "Hello, how are you?"}] | |
| input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") | |
| ## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|> | |
| gen_tokens = model.generate( | |
| input_ids, | |
| max_new_tokens=100, | |
| do_sample=True, | |
| temperature=0.3, | |
| ) | |
| gen_text = tokenizer.decode(gen_tokens[0]) | |
| print(gen_text) | |
| ``` | |
| - Flash Attention 2를 `attn_implementation="flash_attention_2"`를 통해 사용할 때는, `from_pretrained` 클래스 메서드에 `dtype`을 전달하지 말고 자동 혼합 정밀도 훈련(Automatic Mixed-Precision training)을 사용하세요. `Trainer`를 사용할 때는 단순히 `fp16` 또는 `bf16`을 `True`로 지정하면 됩니다. 그렇지 않은 경우에는 `torch.autocast`를 사용하고 있는지 확인하세요. 이는 Flash Attention이 `fp16`와 `bf16` 데이터 타입만 지원하기 때문에 필요합니다. | |
| ## 리소스[[resources]] | |
| Command-R을 시작하는 데 도움이 되는 Hugging Face와 community 자료 목록(🌎로 표시됨) 입니다. 여기에 포함될 자료를 제출하고 싶으시다면 PR(Pull Request)를 열어주세요. 리뷰 해드리겠습니다! 자료는 기존 자료를 복제하는 대신 새로운 내용을 담고 있어야 합니다. | |
| FP16 모델 로딩 | |
| ```python | |
| # pip install transformers | |
| from transformers import AutoTokenizer, AutoModelForCausalLM | |
| model_id = "CohereForAI/c4ai-command-r-v01" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForCausalLM.from_pretrained(model_id) | |
| # command-r 챗 템플릿으로 메세지 형식을 정하세요 | |
| messages = [{"role": "user", "content": "Hello, how are you?"}] | |
| input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") | |
| ## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|> | |
| gen_tokens = model.generate( | |
| input_ids, | |
| max_new_tokens=100, | |
| do_sample=True, | |
| temperature=0.3, | |
| ) | |
| gen_text = tokenizer.decode(gen_tokens[0]) | |
| print(gen_text) | |
| ``` | |
| bitsandbytes 라이브러리를 이용해서 4bit 양자화된 모델 로딩 | |
| ```python | |
| # pip install transformers bitsandbytes accelerate | |
| from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig | |
| bnb_config = BitsAndBytesConfig(load_in_4bit=True) | |
| model_id = "CohereForAI/c4ai-command-r-v01" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config) | |
| gen_tokens = model.generate( | |
| input_ids, | |
| max_new_tokens=100, | |
| do_sample=True, | |
| temperature=0.3, | |
| ) | |
| gen_text = tokenizer.decode(gen_tokens[0]) | |
| print(gen_text) | |
| ``` | |
| ## CohereConfig[[transformers.CohereConfig]][[transformers.CohereConfig]] | |
| #### transformers.CohereConfig[[transformers.CohereConfig]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/configuration_cohere.py#L30) | |
| This is the configuration class to store the configuration of a CohereModel. It is used to instantiate a Cohere | |
| model according to the specified arguments, defining the model architecture. Instantiating a configuration with the | |
| defaults will yield a similar configuration to that of the [CohereForAI/c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01) | |
| Configuration objects inherit from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the | |
| documentation from [PreTrainedConfig](/docs/transformers/main/ko/main_classes/configuration#transformers.PreTrainedConfig) for more information. | |
| ```python | |
| >>> from transformers import CohereModel, CohereConfig | |
| >>> # Initializing a Cohere model configuration | |
| >>> configuration = CohereConfig() | |
| >>> # Initializing a model from the Cohere configuration | |
| >>> model = CohereModel(configuration) | |
| >>> # Accessing the model configuration | |
| >>> configuration = model.config | |
| ``` | |
| **Parameters:** | |
| vocab_size (`int`, *optional*, defaults to `256000`) : Vocabulary size of the model. Defines the number of different tokens that can be represented by the `input_ids`. | |
| hidden_size (`int`, *optional*, defaults to `8192`) : Dimension of the hidden representations. | |
| intermediate_size (`int`, *optional*, defaults to `22528`) : Dimension of the MLP representations. | |
| logit_scale (`float`, *optional*, defaults to 0.0625) : The scaling factor for the output logits. | |
| num_hidden_layers (`int`, *optional*, defaults to `40`) : Number of hidden layers in the Transformer decoder. | |
| num_attention_heads (`int`, *optional*, defaults to `64`) : Number of attention heads for each attention layer in the Transformer decoder. | |
| num_key_value_heads (`int`, *optional*) : This is the number of key_value heads that should be used to implement Grouped Query Attention. If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out [this paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `num_attention_heads`. | |
| hidden_act (`str`, *optional*, defaults to `silu`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc. | |
| max_position_embeddings (`int`, *optional*, defaults to `8192`) : The maximum sequence length that this model might ever be used with. | |
| initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | |
| layer_norm_eps (`float`, *optional*, defaults to `1e-05`) : The epsilon used by the layer normalization layers. | |
| use_cache (`bool`, *optional*, defaults to `True`) : Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True` or when the model is a decoder-only generative model. | |
| pad_token_id (`int`, *optional*, defaults to `0`) : Token id used for padding in the vocabulary. | |
| bos_token_id (`int`, *optional*, defaults to `5`) : Token id used for beginning-of-stream in the vocabulary. | |
| eos_token_id (`Union[int, list[int]]`, *optional*, defaults to `255001`) : Token id used for end-of-stream in the vocabulary. | |
| tie_word_embeddings (`bool`, *optional*, defaults to `True`) : Whether to tie weight embeddings according to model's `tied_weights_keys` mapping. | |
| rope_parameters (`Union[~modeling_rope_utils.RopeParameters, dict]`, *optional*) : Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for `rope_theta` and optionally parameters used for scaling in case you want to use RoPE with longer `max_position_embeddings`. | |
| attention_bias (`bool`, *optional*, defaults to `False`) : Whether to use a bias in the query, key, value and output projection layers during self-attention. | |
| attention_dropout (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities. | |
| use_qk_norm (`bool`, *optional*, defaults to `False`) : Whether to use query-key normalization in the attention. | |
| ave_vocabulary | |
| ## CohereModel[[transformers.CohereModel]][[transformers.CohereModel]] | |
| #### transformers.CohereModel[[transformers.CohereModel]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/modeling_cohere.py#L381) | |
| The bare Cohere Model outputting raw hidden-states without any specific head on top. | |
| This model inherits from [PreTrainedModel](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads | |
| etc.) | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage | |
| and behavior. | |
| forwardtransformers.CohereModel.forwardhttps://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/modeling_cohere.py#L398[{"name": "input_ids", "val": ": torch.LongTensor | None = None"}, {"name": "attention_mask", "val": ": torch.Tensor | None = None"}, {"name": "position_ids", "val": ": torch.LongTensor | None = None"}, {"name": "past_key_values", "val": ": transformers.cache_utils.Cache | None = None"}, {"name": "inputs_embeds", "val": ": torch.FloatTensor | None = None"}, {"name": "use_cache", "val": ": bool | None = None"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. | |
| [What are position IDs?](../glossary#position-ids) | |
| - **past_key_values** (`~cache_utils.Cache`, *optional*) -- | |
| Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention | |
| blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` | |
| returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. | |
| Only [Cache](/docs/transformers/main/ko/internal/generation_utils#transformers.Cache) instance is allowed as input, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). | |
| If no `past_key_values` are passed, [DynamicCache](/docs/transformers/main/ko/internal/generation_utils#transformers.DynamicCache) will be initialized by default. | |
| The model will output the same cache format that is fed as input. | |
| If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't | |
| have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` | |
| of shape `(batch_size, sequence_length)`. | |
| - **inputs_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- | |
| Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This | |
| is useful if you want more control over how to convert `input_ids` indices into associated vectors than the | |
| model's internal embedding lookup matrix. | |
| - **use_cache** (`bool`, *optional*) -- | |
| If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see | |
| `past_key_values`).0[BaseModelOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) or `tuple(torch.FloatTensor)`A [BaseModelOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([CohereConfig](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereConfig)) and inputs. | |
| The [CohereModel](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereModel) forward method, overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| - **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model. | |
| If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, | |
| hidden_size)` is output. | |
| - **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) -- It is a [Cache](/docs/transformers/main/ko/internal/generation_utils#transformers.Cache) instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). | |
| Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if | |
| `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values` | |
| input) to speed up sequential decoding. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attentions weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| **Parameters:** | |
| config ([CohereConfig](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| **Returns:** | |
| `[BaseModelOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) or `tuple(torch.FloatTensor)`` | |
| A [BaseModelOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([CohereConfig](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereConfig)) and inputs. | |
| ## CohereForCausalLM[[transformers.CohereForCausalLM]][[transformers.CohereForCausalLM]] | |
| #### transformers.CohereForCausalLM[[transformers.CohereForCausalLM]] | |
| [Source](https://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/modeling_cohere.py#L455) | |
| The Cohere Model for causal language modeling. | |
| This model inherits from [PreTrainedModel](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the | |
| library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads | |
| etc.) | |
| This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. | |
| Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage | |
| and behavior. | |
| forwardtransformers.CohereForCausalLM.forwardhttps://github.com/huggingface/transformers/blob/main/src/transformers/models/cohere/modeling_cohere.py#L471[{"name": "input_ids", "val": ": torch.LongTensor | None = None"}, {"name": "attention_mask", "val": ": torch.Tensor | None = None"}, {"name": "position_ids", "val": ": torch.LongTensor | None = None"}, {"name": "past_key_values", "val": ": transformers.cache_utils.Cache | None = None"}, {"name": "inputs_embeds", "val": ": torch.FloatTensor | None = None"}, {"name": "labels", "val": ": torch.LongTensor | None = None"}, {"name": "use_cache", "val": ": bool | None = None"}, {"name": "logits_to_keep", "val": ": int | torch.Tensor = 0"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. | |
| Indices can be obtained using [AutoTokenizer](/docs/transformers/main/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and | |
| [PreTrainedTokenizer.__call__()](/docs/transformers/main/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details. | |
| [What are input IDs?](../glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](../glossary#attention-mask) | |
| - **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`. | |
| [What are position IDs?](../glossary#position-ids) | |
| - **past_key_values** (`~cache_utils.Cache`, *optional*) -- | |
| Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention | |
| blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` | |
| returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. | |
| Only [Cache](/docs/transformers/main/ko/internal/generation_utils#transformers.Cache) instance is allowed as input, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). | |
| If no `past_key_values` are passed, [DynamicCache](/docs/transformers/main/ko/internal/generation_utils#transformers.DynamicCache) will be initialized by default. | |
| The model will output the same cache format that is fed as input. | |
| If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't | |
| have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` | |
| of shape `(batch_size, sequence_length)`. | |
| - **inputs_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) -- | |
| Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This | |
| is useful if you want more control over how to convert `input_ids` indices into associated vectors than the | |
| model's internal embedding lookup matrix. | |
| - **labels** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., | |
| config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored | |
| (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. | |
| - **use_cache** (`bool`, *optional*) -- | |
| If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see | |
| `past_key_values`). | |
| - **logits_to_keep** (`Union[int, torch.Tensor]`, *optional*, defaults to `0`) -- | |
| If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all | |
| `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that | |
| token can save memory, which becomes pretty significant for long sequences or large vocabulary size. | |
| If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension. | |
| This is useful when using packed tensor format (single dimension for batch and sequence length).0[CausalLMOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or `tuple(torch.FloatTensor)`A [CausalLMOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([CohereConfig](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereConfig)) and inputs. | |
| The [CohereForCausalLM](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereForCausalLM) forward method, overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| - **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Language modeling loss (for next-token prediction). | |
| - **logits** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). | |
| - **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) -- It is a [Cache](/docs/transformers/main/ko/internal/generation_utils#transformers.Cache) instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache). | |
| Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see | |
| `past_key_values` input) to speed up sequential decoding. | |
| - **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + | |
| one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. | |
| Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. | |
| - **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, | |
| sequence_length)`. | |
| Attentions weights after the attention softmax, used to compute the weighted average in the self-attention | |
| heads. | |
| Example: | |
| ```python | |
| >> from transformers import AutoTokenizer, CohereForCausalLM | |
| >> model = CohereForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01") | |
| >> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01") | |
| >> prompt = "Hey, are you conscious? Can you talk to me?" | |
| >> inputs = tokenizer(prompt, return_tensors="pt") | |
| >> # Generate | |
| >> generate_ids = model.generate(inputs.input_ids, max_length=30) | |
| >> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] | |
| "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." | |
| ``` | |
| **Parameters:** | |
| config ([CohereForCausalLM](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereForCausalLM)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/main/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. | |
| **Returns:** | |
| `[CausalLMOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or `tuple(torch.FloatTensor)`` | |
| A [CausalLMOutputWithPast](/docs/transformers/main/ko/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or a tuple of | |
| `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various | |
| elements depending on the configuration ([CohereConfig](/docs/transformers/main/ko/model_doc/cohere#transformers.CohereConfig)) and inputs. | |
Xet Storage Details
- Size:
- 28.9 kB
- Xet hash:
- a3d00460877adc8b998602ade2018ac610a4d90eb7dfb06eefa4a87c30d5e83b
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.