Safetensors model file name
Many thanks for the merge.
Working on quantizing to GGUF; it seems that for single-file models only "model.safetensors" is currently supported: https://github.com/ggerganov/llama.cpp/blob/fbbc42827b2949b95bcde23ce47bb47d006c895d/convert-hf-to-gguf.py#L180
Commenting out lines 180-181 works for now, and the initial output looks good.
🙏thanks
ok I will rename it
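The rename mentioned above can also be done locally before running the converter, instead of commenting out lines in convert-hf-to-gguf.py. This is a minimal sketch, assuming the model directory holds exactly one `.safetensors` file; the function name `normalize_safetensors_name` and the example path are hypothetical, not part of llama.cpp or any library.

```python
from pathlib import Path

# File name the converter expects for single-file models, per the thread above.
EXPECTED_NAME = "model.safetensors"


def normalize_safetensors_name(model_dir: str) -> Path:
    """Rename a lone *.safetensors file in model_dir to model.safetensors.

    Hypothetical helper: raises if the directory does not contain exactly
    one .safetensors file, so sharded checkpoints are left untouched.
    """
    directory = Path(model_dir)
    candidates = sorted(directory.glob("*.safetensors"))
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one .safetensors file, found {len(candidates)}"
        )
    source = candidates[0]
    target = directory / EXPECTED_NAME
    if source.name != EXPECTED_NAME:
        source.rename(target)  # no-op case handled by the check above
    return target
```

For example, `normalize_safetensors_name("./Astrohermes-3B")` (an assumed local checkout path) would leave the weights at `./Astrohermes-3B/model.safetensors`.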
Hey, can you quantize this model? https://huggingface.co/NousResearch/Obsidian-3B-V0.5 I haven't seen any quants of it yet. The projector too, if possible.
have used https://huggingface.co/nisten/obsidian-3b-multimodal-q6-gguf before.
Ran into issues the last time I tried quanting it myself; will try with the latest llama.cpp commit tomorrow.
Quants up here https://huggingface.co/afrideva/Echo-3B-GGUF
Thank you. I'm trying to make a 5B StableLM model here: https://huggingface.co/Aryanne/testing-only I quantized it locally, and at inference I got an error about the graph. I don't know if it's a problem with the .json files, my quantization, or the model itself. Can you take a look?
this is the error
GGML_ASSERT: ggml.c:15158: cgraph->n_nodes < cgraph->size
Aborted
maybe 58 was too many layers😅
Apparently so, just tried it with latest llama.cpp and got the same error... Your Zephyr-3.43B looks good though, quanting now
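Since the layer count came up as the likely cause, one quick check before quantizing a merged model is to read it from the model's config. This is a minimal sketch, assuming a Hugging Face style `config.json` that carries a `num_hidden_layers` key; the function name `count_hidden_layers` is hypothetical, and the thread does not state at what layer count the assertion starts to fail.

```python
import json
from pathlib import Path


def count_hidden_layers(model_dir: str) -> int:
    """Return num_hidden_layers from model_dir/config.json.

    Hypothetical helper: assumes the config uses the common
    `num_hidden_layers` key; raises KeyError otherwise.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return int(config["num_hidden_layers"])
```

Comparing this number between a merge that converts cleanly and one that hits the assertion gives a rough idea of where the limit sits for a given llama.cpp commit.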