GGUF format
Thanks! I haven't done any quantization myself yet but I'll have a look into it.
Thank you very much. I am actually working on another, German-only quantization technique that boosts the model's German capabilities and replies. It works really well so far, I think, and has a lot of potential, but it is still a work in progress and will likely be updated next week with some additions.
https://huggingface.co/aari1995/germeo-7b-awq
Also, at the moment I sadly have trouble evaluating the model on the German benchmarks, as the evaluation setup does not really support AWQ. If you have an idea, let me know.
Open for feedback!
What exactly is the problem? The latest transformers version does support AWQ, right? Feel free to reach out to me. I am happy to help.
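For anyone following along, here is a minimal sketch of what loading the AWQ checkpoint through plain transformers could look like. It assumes a recent transformers release with AWQ support, the `autoawq` and `accelerate` packages, a CUDA GPU, and that the checkpoint ships its quantization config. The helper names `load_awq_model` and `generate` are made up for illustration and are not from either model card.

```python
# Sketch: load an AWQ-quantized checkpoint through the regular transformers API.
# Assumptions (not confirmed in this thread): recent transformers with AWQ
# support, `autoawq` and `accelerate` installed, and a CUDA GPU available.

MODEL_ID = "aari1995/germeo-7b-awq"  # checkpoint linked in this thread


def load_awq_model(model_id: str = MODEL_ID):
    """Return (tokenizer, model); the AWQ settings are read from the checkpoint."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" lets accelerate place the quantized weights on the GPU(s).
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy-decode a short continuation for `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage would be along the lines of `tok, mdl = load_awq_model()` followed by `print(generate(tok, mdl, "Was ist die Hauptstadt von Deutschland?"))`. If an evaluation harness loads models through the same `from_pretrained` path, it should pick up the quantized weights the same way, but that depends on the harness.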
Yes, I also figured that out and it works now, thank you very much!
At the moment I need to find time to run the MMLU evaluation, as it takes 26 hours on my 3090 Ti.
So far the benchmark results look good. They are slightly worse, but the model's output is guaranteed to be German:
ARC-DE: 0.514
Hellaswag-DE: 0.651
TruthfulQA-DE: 0.508
I'll keep you updated.
https://huggingface.co/aari1995/germeo-7b-awq
Evaluation done: MMLU is 0.522 (an improvement), resulting in an average of 0.563 (DE average). I think it is a good use case for knowledge transfer from English to German while "keeping the model German". It replies solely in German. @floleuerer created a benchmark for German response rates, and we are in contact to see whether there is an improvement.
Malte, would you be up for further experiments on knowledge transfer, or for a call? I am also experimenting with LASER and want to see whether a non-bilingual model can achieve improvements through quantization / pruning methods.