Unable to create exl2 quant for this model
Hi,
I'm trying to create an exl2 quant for this model but I ran into this error:
| Measured: model.layers.0 (Attention) |
| Duration: 8.47 seconds |
| Completed step: 1/67 |
| Avg time / step (rolling): 8.47 seconds |
| Estimated remaining time: 9min 18sec |
| Last checkpoint layer: None |
-- Layer: model.layers.0 (MoE MLP)
!! Warning: w2.2 has less than 10% calibration for 19/19 rows
!! Warning: w2.3 has less than 10% calibration for 19/19 rows
Traceback (most recent call last):
File "E:\ai\Exl2\exllamav2\convert.py", line 219, in <module>
status = measure_quant(job, save_job, model) # capturing the graceful exits
File "E:\ai\pinokio\bin\miniconda\envs\exl2\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "E:\ai\Exl2\exllamav2\conversion\measure.py", line 538, in measure_quant
m = measure_moe_mlp(module, hidden_states, target_states, quantizers, cache, attn_params)
File "E:\ai\Exl2\exllamav2\conversion\measure.py", line 273, in measure_moe_mlp
quantizers[f"w2.{i}"].prepare()
File "E:\ai\Exl2\exllamav2\conversion\adaptivegptq.py", line 225, in prepare
self.hessian /= self.num_batches
TypeError: unsupported operand type(s) for /=: 'NoneType' and 'int'
I was able to quantize Buttercup-4x7B-V2-laser and others, but not this one, and I'm not sure what I have to do to quantize it. I'm using the latest exllamav2, v0.0.18.
Thanks!
So, I asked the creator of exllama, and he says it's because some of my experts aren't activating at all during quantization. It might be better to use a different calibration dataset for the quantization.
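To make the "experts aren't activating" explanation concrete, here is a rough sketch of how one might count how many calibration tokens each expert receives under top-k routing. It is not exllamav2 code: the function names, the top-2 routing choice, and the toy router logits are all assumptions made for illustration. An expert that receives no tokens has nothing to accumulate statistics from, which would be consistent with the `NoneType` Hessian in the traceback and with the warnings naming w2.2 and w2.3.

```python
# Hypothetical helpers (not part of exllamav2) that count how many
# calibration tokens are routed to each expert in a top-k MoE layer.

def top_k_experts(router_logits, k=2):
    """Return the indices of the k largest logits (ties go to the lower index)."""
    order = sorted(range(len(router_logits)), key=lambda i: (-router_logits[i], i))
    return order[:k]


def expert_token_counts(logits_per_token, num_experts, k=2):
    """Count, per expert, how many tokens selected it among their top k."""
    counts = [0] * num_experts
    for logits in logits_per_token:
        for expert in top_k_experts(logits, k):
            counts[expert] += 1
    return counts


def unused_experts(counts):
    """Return the indices of experts that received no tokens at all."""
    return [i for i, c in enumerate(counts) if c == 0]


# Toy router logits for three tokens and four experts (made-up numbers):
# experts 2 and 3 never make the top 2.
toy_logits = [
    [2.0, 1.0, -5.0, -5.0],
    [1.0, 3.0, -4.0, -6.0],
    [0.5, 0.4, 0.1, -1.0],
]
counts = expert_token_counts(toy_logits, num_experts=4, k=2)
print(counts)                  # [3, 3, 0, 0]
print(unused_experts(counts))  # [2, 3]
```

If a check like this, run on real router outputs, showed some experts at zero for the chosen calibration set, that would support trying a calibration dataset closer to what those experts were trained on.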