Instructions to use Kquant03/Prokaryote-8x7B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Kquant03/Prokaryote-8x7B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Kquant03/Prokaryote-8x7B-GGUF",
	filename="ggml-model-q2_k.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use Kquant03/Prokaryote-8x7B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use Kquant03/Prokaryote-8x7B-GGUF with Ollama:
```
ollama run hf.co/Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M
```

Unsloth Studio new

How to use Kquant03/Prokaryote-8x7B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Kquant03/Prokaryote-8x7B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Kquant03/Prokaryote-8x7B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Kquant03/Prokaryote-8x7B-GGUF to start chatting

Docker Model Runner
How to use Kquant03/Prokaryote-8x7B-GGUF with Docker Model Runner:
```
docker model run hf.co/Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M
```

Lemonade

How to use Kquant03/Prokaryote-8x7B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Kquant03/Prokaryote-8x7B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Prokaryote-8x7B-GGUF-Q4_K_M

List all available models

lemonade list

"Nearly 2.5 billion years of prokaryotic cells and nothing else. Why did life remain at stage 1 for two-thirds of its history if complexity offers such benefits?"

An augmentation of the script I used for Eukaryote...Does far, far better on ARC, which was what I've been trying to accomplish. MMLU improvements will require fine-tuning, I'd be willing to bet.

alnrg2arg/test2_4 - base
andysalerno/openchat-nectar-0.3 - expert #1
alnrg2arg/test2_4 - expert #2
abideen/NexoNimbus-7B - expert #3
mlabonne/NeuralDaredevil-7B - expert #4
nfaheem/Marcoroni-7b-DPO-Merge - expert #5
alnrg2arg/test2_4 - expert #6
mlabonne/Beagle14-7B - expert #7
eren23/slerp-test-turdus-beagle - expert #8

Provided files

Name	Quant method	Bits	Size	Max RAM required	Use case
Q2_K Tiny	Q2_K	2	15.6 GB	17.6 GB	smallest, significant quality loss - not recommended for most purposes
Q3_K_M	Q3_K_M	3	20.4 GB	22.4 GB	very small, high quality loss
Q4_0	Q4_0	4	26.4 GB	28.4 GB	legacy; small, very high quality loss - prefer using Q3_K_M
Q4_K_M	Q4_K_M	4	~26.4 GB	~28.4 GB	medium, balanced quality - recommended
Q5_0	Q5_0	5	32.2 GB	34.2 GB	legacy; large, balanced quality
Q5_K_M	Q5_K_M	5	~32.2 GB	~34.2 GB	large, balanced quality - recommended
Q6 XL	Q6_K	6	38.4 GB	40.4 GB	very large, extremely minor degradation
Q8 XXL	Q8_0	8	49.6 GB	51.4 GB	very large, extremely minor degradation - not recommended

"What is a Mixture of Experts (MoE)?"

(from the MistralAI papers...click the quoted question above to navigate to it directly.)

The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

Mixture of Experts enable models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.

So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements:

Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 32 in my "frankenMoE"), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!

A gate network or router, that determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token "Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs - the router is composed of learned parameters and is pretrained at the same time as the rest of the network.

At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively.

Switch Layer MoE layer from the Switch Transformers paper

So, to recap, in MoEs we replace every FFN layer of the transformer model with an MoE layer, which is composed of a gate network and a certain number of experts.

Although MoEs provide benefits like efficient pretraining and faster inference compared to dense models, they also come with challenges:

Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.
Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high. For example, [given a MoE like Mixtral 8x7B](https://huggingface.co/blog/moe), we’ll need to have enough VRAM to hold a dense 47B parameter model. Why 47B parameters and not 8 x 7B = 56B? That’s because in MoE models, only the FFN layers are treated as individual experts, and the rest of the model parameters are shared. At the same time, assuming just two experts are being used per token, the inference speed (FLOPs) is like using a 12B model (as opposed to a 14B model), because it computes 2x7B matrix multiplications, but with some layers shared (more on this soon).

If all our tokens are sent to just a few popular experts, that will make training inefficient. In a normal MoE training, the gating network converges to mostly activate the same few experts. This self-reinforces as favored experts are trained quicker and hence selected more. To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples. The following sections will also explore the concept of expert capacity, which introduces a threshold of how many tokens can be processed by an expert. In transformers, the auxiliary loss is exposed via the aux_loss parameter.

"Wait...but you called this a frankenMoE?"

The difference between MoE and "frankenMoE" lies in the fact that the router layer in a model like the one on this repo is not trained simultaneously.

Downloads last month: 73

GGUF

Model size

47B params

Architecture

llama

Hardware compatibility

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for Kquant03/Prokaryote-8x7B-GGUF

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Paper • 2101.03961 • Published Jan 11, 2021 • 13