<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# GGUF and interaction with Transformers

The GGUF file format is used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and other
libraries that depend on it, like the very popular [llama.cpp](https://github.com/ggerganov/llama.cpp) or
[whisper.cpp](https://github.com/ggerganov/whisper.cpp).

It is a file format [supported by the Hugging Face Hub](https://huggingface.co/docs/hub/en/gguf) with features
allowing for quick inspection of tensors and metadata within the file.
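
You can also inspect a GGUF file locally with the [`gguf`](https://pypi.org/project/gguf/) Python package (the same package required for dequantization support below). The snippet below is a minimal sketch; the file path is only an example and should point to a GGUF file on disk:

```py
from gguf import GGUFReader

# Any local GGUF file (example path, replace with your own)
reader = GGUFReader("tinyllama-1.1b-chat-v1.0.Q6_K.gguf")

# Metadata key/value fields stored in the file header
for key in list(reader.fields)[:5]:
    print(key)

# Tensors, with their shapes and quantization types
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```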

This file format is designed as a "single-file format", where a single file usually contains the configuration
attributes, the tokenizer vocabulary and other attributes, as well as all the tensors to be loaded into the model.
These files come in different formats according to the quantization type of the file. We briefly go over some of them
[here](https://huggingface.co/docs/hub/en/gguf#quantization-types).

## Support within Transformers

We have added the ability to load `gguf` files within `transformers` in order to offer further training/fine-tuning
capabilities to gguf models, before converting those models back to `gguf` to be used within the `ggml` ecosystem. When
loading a model, we first dequantize it to fp32 before loading the weights to be used in PyTorch.

> [!NOTE]
> The support is still very exploratory and we welcome contributions in order to solidify it across quantization types
> and model architectures.

For now, here are the supported model architectures and quantization types:

### Supported quantization types

The initially supported quantization types were chosen according to the popular quantized files that have been shared
on the Hub:

- F32
- F16
- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
- IQ1_S
- IQ1_M
- IQ2_XXS
- IQ2_XS
- IQ2_S
- IQ3_XXS
- IQ3_S
- IQ4_XS
- IQ4_NL
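
These quantization types typically correspond to different file variants within the same Hub repository. As an illustrative sketch (the repository id below is only an example), you can list the `.gguf` files a repository offers with `huggingface_hub`:

```py
from huggingface_hub import list_repo_files

# Example repository hosting several GGUF quantization variants
repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"

gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
print(gguf_files)
```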

> [!NOTE]
> Dequantizing gguf files requires `gguf>=0.10.0` to be installed.
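
For example, in a notebook, the dependency can be installed with:

```py
# Notebook-style command; run `pip install "gguf>=0.10.0"` directly in a shell otherwise
!pip install "gguf>=0.10.0"
```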

### Supported model architectures

For now, the supported model architectures are those that are very popular on the Hub, namely:

- LLaMa
- Mistral
- Qwen2
- Qwen2Moe
- Phi3
- Bloom
- Falcon
- StableLM
- GPT2
- Starcoder2
- T5
- Mamba
- Nemotron
- Gemma2

## Example usage

In order to load `gguf` files in `transformers`, you should specify the `gguf_file` argument to the `from_pretrained`
methods of both tokenizers and models. Here is how one would load a tokenizer and a model, both from the exact same
file:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```

Now you have access to the full, unquantized version of the model in the PyTorch ecosystem, where you can combine it
with a plethora of other tools.
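
As a quick sanity check (a short, illustrative example with an arbitrary prompt), the loaded model behaves like any other `transformers` model, and you can verify that its weights were dequantized to fp32:

```py
import torch

print(model.dtype)  # torch.float32, since the GGUF weights are dequantized at load time

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```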

In order to convert back to a `gguf` file, we recommend using the
[`convert_hf_to_gguf.py` script](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py) from llama.cpp.

Here's how you would complete the script above to save the model and export it back to `gguf`:

```py
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert_hf_to_gguf.py ${directory}
```
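
The conversion script exports an unquantized GGUF file. If you want a quantized variant again, you can re-quantize the result with llama.cpp's quantization tool. Treat the following as a sketch: the binary is named `llama-quantize` in recent llama.cpp builds and the input/output file names are placeholders, so check your build's documentation for the exact usage:

```py
# Hypothetical file names; adjust to the file produced by convert_hf_to_gguf.py
!${path_to_llama_cpp}/llama-quantize ${directory}/model.gguf ${directory}/model-Q4_K_M.gguf Q4_K_M
```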