Great work. I had an issue running this in Colab:
/usr/local/lib/python3.9/dist-packages/bitsandbytes/functional.py in transform(A, to_order, from_order, out, transpose, state, ld)
1696
1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=None):
-> 1698 prev_device = pre_call(A.device)
1699 if state is None: state = (A.shape, from_order)
1700 else: from_order = state[1]
AttributeError: 'NoneType' object has no attribute 'device'
Can you please check.
Thanks. I checked and got it working.
Hi!
I've been running this model for the past couple days, really nice model, tysm for open-sourcing it!
Anyway, currently having the same issue with the VRAM usage, any development on this?
If it's of any help, I don't see an increase on every call from the looks of it, just occasionally.
Messed around with it today, seems like adding a
torch.cuda.empty_cache()
import gc; gc.collect()
to the generate() function helped! :)
Sure! It's really just adding those calls into the function (idt the place you put them matters tbh, they're just garbage collector calls, added two of them just to make sure).
import gc

import torch
from transformers import GenerationConfig

# model, tokenizer and generate_prompt are the ones already defined in the notebook.

def generate(
    instruction,
    input=None,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    **kwargs,
):
    # Free cached GPU memory and unreachable Python objects before generating.
    torch.cuda.empty_cache()
    gc.collect()
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    # Same cleanup again once the output has been decoded.
    torch.cuda.empty_cache()
    gc.collect()
    return output.split("### Response:")[1].strip().split("Below")[0]
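If you'd rather not edit generate() itself, the same cleanup can be wrapped around it. This is only a rough sketch: release_cached_memory and with_memory_cleanup are names I made up, and running gc.collect() before empty_cache() is my own assumption (the idea being that tensors freed by the collector can then be released from the CUDA cache), not something tested on this model.

```python
import functools
import gc

try:
    import torch  # optional, so the sketch also runs on machines without PyTorch
except ImportError:
    torch = None


def release_cached_memory():
    """Collect unreachable Python objects, then release unused cached CUDA memory."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def with_memory_cleanup(fn, cleanup=release_cached_memory):
    """Hypothetical helper: run cleanup before fn and again after it, even on error."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        cleanup()
        try:
            return fn(*args, **kwargs)
        finally:
            cleanup()
    return wrapper
```

With the notebook's function in scope, `generate = with_memory_cleanup(generate)` should give the same effect as the inline calls, and the try/finally means the cleanup still runs if generation raises (for example on an out-of-memory error).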
What I've found is that this only seems to occur for large prompts. I'm not sure where the threshold that triggers it is, but from what I can tell the size of the prompt is what causes it.
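To narrow down where that threshold sits, something like the following could log how much reserved VRAM grows per call against the prompt size. It's a sketch under assumptions: vram_snapshot and measure_growth are hypothetical helpers, and the default size_of=len counts characters, not tokens.

```python
try:
    import torch  # optional, so the sketch also runs on machines without PyTorch
except ImportError:
    torch = None


def vram_snapshot():
    """Return (allocated, reserved) CUDA memory in MiB, or (0.0, 0.0) without a GPU."""
    if torch is None or not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


def measure_growth(run, prompts, size_of=len):
    """Call run(prompt) for each prompt and record the change in reserved memory."""
    rows = []
    for prompt in prompts:
        _, before = vram_snapshot()
        run(prompt)
        _, after = vram_snapshot()
        rows.append({"size": size_of(prompt), "reserved_growth_mib": after - before})
    return rows
```

In the notebook this could be called as `measure_growth(generate, prompts, size_of=lambda p: len(tokenizer(p)["input_ids"]))` with prompts of increasing length, then checked for the first size where reserved_growth_mib jumps.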


