What’s the prompt format?
Howdy! Would love to test this model for my roleplay. :) What prompt format does it use? Thank you in advance for your answer!
Alpaca works best, or in SillyTavern the Roleplay or Story templates. The LimaRP-specific format works too. Have fun :)
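For readers not using a frontend with built-in templates, here is a minimal sketch of the widely used Alpaca prompt layout. The helper name `build_alpaca_prompt` and the sample instruction are made up for illustration, and the thread does not confirm that the model expects this exact header wording.

```python
# Sketch of the common Alpaca prompt layout.
# build_alpaca_prompt is a hypothetical helper, not part of any library.
HEADER_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
HEADER_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request."
)


def build_alpaca_prompt(instruction: str, user_input: str = "") -> str:
    """Return a prompt that ends where the model should start writing."""
    if user_input:
        parts = [
            HEADER_WITH_INPUT,
            f"### Instruction:\n{instruction}",
            f"### Input:\n{user_input}",
        ]
    else:
        parts = [HEADER_NO_INPUT, f"### Instruction:\n{instruction}"]
    parts.append("### Response:\n")
    return "\n\n".join(parts)


# Illustrative instruction only
prompt = build_alpaca_prompt("Continue the roleplay as the innkeeper and greet the traveler.")
print(prompt)
```

SillyTavern's instruct presets assemble a similar layout automatically, so this is mainly useful when calling the model from a script.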
Hey, thank you so much! I'm jumping to testing then! :D
Oop, reopening since I did some tests and noticed some problems with this model. Perhaps the issue might be with my settings, though? Firstly, the characters seem to be misspelling my name and using "Marianne" or "Mariana" instead of "Marianna". Secondly, they sometimes produce weird and unexpected tokens. And lastly, the outputs seem to be weirdly short? Unless the model was trained for shorter roleplay formats, then it works as intended, ha ha.
For example, below is the answer produced by Nous-Capybara-LimaRPv3:
And this is an answer produced by Kyllene with the same context, etc.:
And an example of misspellings and weird token outputs:
I'm using the same settings for it as for my favorite Nous-Capybara-LimaRPv3 model right now, here they are.
Settings: https://files.catbox.moe/ate9pa.json
Story String: https://files.catbox.moe/g28n4f.json
Instruct: https://files.catbox.moe/7x3jod.json
Curious to learn how the model fares for others. Suspecting that perhaps my set-up may be at fault.
Thanks for the tests; yeah, I'm also curious. I once had the name Malice switched to Malcolm. By strange tokens, do you mean '[For contect pusrposes]' at the end? If so, I saw that too when a character card had lots of instructions in []; it seems the model follows patterns too well sometimes. As for the shorter output, no idea, but I'll check, probably next weekend.
Yea, that’s what I meant. Well, the thing with my instructions is that I don’t use the brackets [] for them, ha ha. The shorter outputs seem to be happening mostly when I test the model on my 45k context group chat conversation, even though the previous messages are all, well, long.
Confirmed: compared to Capybara, it produces output that is 10-30% shorter on average. I tested a few different scenarios on both models with shorter contexts (4k, 8k). The additional '[]' output also happened once in one conversation, so that's confirmed too. The next iteration of this model will be better ;). Thanks again for testing, and by the way, what hardware are you using for 45k context?
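One way to put a number on "shorter by 10-30%" is to compare mean word counts over replies generated from the same prompts. The sketch below assumes word count as the length measure; the thread does not say how length was measured, and the sample strings are placeholders rather than real model outputs.

```python
from statistics import mean


def relative_shortfall(baseline_outputs, candidate_outputs):
    """Fraction by which the candidate's mean word count falls below the baseline's."""
    base = mean(len(text.split()) for text in baseline_outputs)
    cand = mean(len(text.split()) for text in candidate_outputs)
    return (base - cand) / base


# Placeholder strings standing in for replies to the same prompts
baseline = ["word " * 200, "word " * 180]
candidate = ["word " * 160, "word " * 144]

print(f"candidate is {relative_shortfall(baseline, candidate):.0%} shorter")
```

Counting tokens with the model's tokenizer instead of whitespace-separated words would be closer to what the `genamt` limit actually constrains.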
Awesome, can't wait to test the next iteration then, thanks for letting me know! And I have a 3090 with 24GB of VRAM, can easily run up to 45k of context with 4.0bpw exl2 quants. :) I also recommend dropping the bagel model from the merge if you want it to be better suited for longer context roleplays, since bagel is known for its quality and context-processing drop-off at higher numbers. Recently, I've been using brucethemoose's newest RPMerge, and it has been absolutely wonderful, I highly recommend checking it out for some inspiration!
https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge
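A rough back-of-the-envelope check of the 24GB claim above, as a sketch. The parameter count and the attention layout (60 layers, 8 key/value heads, head dimension 128) are assumptions based on the Yi-34B base architecture, not figures from this thread, and the estimate ignores activations and framework overhead.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Size of the key/value cache in bytes (factor 2 = keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed values, not taken from the thread
N_PARAMS = 34e9
N_LAYERS, N_KV_HEADS, HEAD_DIM = 60, 8, 128

weights = weight_bytes(N_PARAMS, bits_per_weight=4.0)
for label, bytes_per_value in (("16-bit cache", 2), ("8-bit cache", 1)):
    cache = kv_cache_bytes(45_000, N_LAYERS, N_KV_HEADS, HEAD_DIM, bytes_per_value)
    total = weights + cache
    print(f"{label}: weights {weights / GIB:.1f} GiB, "
          f"cache {cache / GIB:.1f} GiB, total {total / GIB:.1f} GiB")
```

Under these assumptions the 16-bit cache case lands above 24 GiB and the 8-bit case below it, which suggests cache precision matters a lot at this context length. The thread does not say which cache setting was used.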
Is it possible to get the settings without catbox? It doesn't work in most places.
Oh dear, these settings are old and I don’t use them anymore, but here they are.
{
    "temp": 1,
    "temperature_last": false,
    "top_p": 1,
    "top_k": 0,
    "top_a": 0,
    "tfs": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "typical_p": 1,
    "min_p": 0.1,
    "rep_pen": 1.2,
    "rep_pen_range": 4096,
    "no_repeat_ngram_size": 0,
    "penalty_alpha": 0,
    "num_beams": 1,
    "length_penalty": 0,
    "min_length": 0,
    "encoder_rep_pen": 1,
    "freq_pen": 0,
    "presence_pen": 0,
    "do_sample": true,
    "early_stopping": false,
    "dynatemp": false,
    "min_temp": 1,
    "max_temp": 1.4,
    "dynatemp_exponent": 1,
    "smoothing_factor": 0,
    "add_bos_token": false,
    "truncation_length": 2048,
    "ban_eos_token": false,
    "skip_special_tokens": true,
    "streaming": true,
    "mirostat_mode": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "guidance_scale": 1,
    "negative_prompt": "",
    "grammar_string": "",
    "banned_tokens": "",
    "ignore_eos_token_aphrodite": false,
    "spaces_between_special_tokens_aphrodite": true,
    "sampler_order": [
        6,
        0,
        1,
        3,
        4,
        2,
        5
    ],
    "logit_bias": [],
    "n": 1,
    "rep_pen_size": 0,
    "genamt": 400,
    "max_length": 44032
}
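In this preset, `top_p`, `top_k`, `top_a`, `tfs`, and `typical_p` appear to sit at their neutral values, which leaves `min_p` at 0.1 as the only sampler trimming the token distribution (alongside the 1.2 repetition penalty). Below is a minimal sketch of how min-p filtering is commonly described: tokens are kept only if their probability is at least `min_p` times that of the most likely token. The function name and the toy distribution are invented for illustration, and real backends may differ in details.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * top probability, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Toy next-token distribution (invented numbers)
probs = {"a": 0.60, "b": 0.25, "c": 0.10, "d": 0.05}

filtered = min_p_filter(probs, min_p=0.1)
for token, p in filtered.items():
    print(f"{token}: {p:.3f}")
```

Here the cutoff is 0.1 times the top probability of 0.60, so the token at 0.05 is dropped and the remaining three are rescaled to sum to 1.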