# My-Meta-Llama-3.1-8B-Instruct-Open-Router-23
A thin wrapper around meta-llama/Meta-Llama-3.1-8B-Instruct that is packaged as a Hugging Face Inference Endpoint with a custom handler. The weights are not stored in this repo; the handler downloads them from the gated base repo at startup using an `HF_TOKEN` secret.
## Architecture
- `handler.py` defines `EndpointHandler`. On `__init__` it loads the tokenizer and model from `meta-llama/Meta-Llama-3.1-8B-Instruct` in `bfloat16` with `device_map="auto"`. On every request, `__call__` applies the Llama 3.1 chat template and runs `model.generate`. A minimal sketch of the handler follows this list.
- `requirements.txt` pins `transformers==4.51.3` (the Hugging Face Inference Toolkit breaks against `transformers` 5.x because it imports helpers from `transformers.file_utils`) and adds `accelerate` + `huggingface_hub`.
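For orientation, here is a minimal sketch of what a handler matching the description above looks like. The actual `handler.py` in this repo may differ in details; the defaults mirror the request schema documented further down.

```python
# Minimal sketch of the custom handler described above (not a verbatim copy).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

class EndpointHandler:
    def __init__(self, path: str = ""):
        # Weights come from the gated base repo, not this repo; the endpoint's
        # HF_TOKEN secret must grant access or this download fails at startup.
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def __call__(self, data: dict) -> dict:
        # Accept both {"messages": [...]} and {"inputs": {"messages": [...]}}.
        inputs = data.get("inputs", data)
        messages = inputs["messages"]
        params = data.get("parameters", {})

        # Apply the Llama 3.1 chat template and build generation kwargs.
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        gen_kwargs = {
            "max_new_tokens": params.get("max_new_tokens", 256),
            "do_sample": params.get("do_sample", False),
        }
        if gen_kwargs["do_sample"]:
            # temperature/top_p only apply when sampling (see the schema below)
            gen_kwargs["temperature"] = params.get("temperature", 0.7)
            gen_kwargs["top_p"] = params.get("top_p", 0.9)

        output_ids = self.model.generate(input_ids, **gen_kwargs)
        # Return only the newly generated tokens, not the echoed prompt.
        completion = self.tokenizer.decode(
            output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
        return {"generated_text": completion}
```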
## Deploying this as an Inference Endpoint
### 1. Prerequisites
- A Hugging Face account that has accepted the Llama 3.1 Community License at meta-llama/Meta-Llama-3.1-8B-Instruct.
- A User Access Token (Settings → Access Tokens) with at least `read` permission on gated repos (a quick access check is sketched below).
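Before deploying, you can verify that your token can actually see the gated base repo. A minimal check with `huggingface_hub` (assuming it is installed locally):

```python
# Raises an error (e.g. GatedRepoError) if the token lacks access to the
# gated base repo; succeeds silently otherwise.
from huggingface_hub import HfApi

api = HfApi(token="hf_your_token_here")  # hypothetical token value
api.model_info("meta-llama/Meta-Llama-3.1-8B-Instruct")
```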
### 2. Create the endpoint
From the Hugging Face UI, choose Deploy → Inference Endpoints on this repo, then:
| Setting | Recommended value |
|---|---|
| Instance type | GPU, at least 24 GB VRAM (e.g. NVIDIA L4, A10G, or A100); Llama 3.1 8B in bfloat16 is ~16 GB plus KV cache |
| Container | Default (huggingface-inference-toolkit) |
| Task | Custom (the handler is auto-detected via handler.py) |
| Secret | HF_TOKEN = your user access token |
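If you prefer to script this instead of using the UI, `huggingface_hub` exposes `create_inference_endpoint`. The sketch below is hypothetical: the endpoint name is made up, and valid vendor/region/instance identifiers depend on your account, so check the Inference Endpoints UI for the exact values.

```python
# Hypothetical programmatic equivalent of the UI settings in the table above.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama31-8b-custom",                    # hypothetical endpoint name
    repository="ericaRC/test-Llama",        # this repo (handler.py is picked up)
    framework="pytorch",
    task="custom",
    accelerator="gpu",
    vendor="aws",                           # example vendor/region
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",            # >=24 GB VRAM per the table above
    secrets={"HF_TOKEN": "hf_your_token_here"},
)
endpoint.wait()  # block until the endpoint reports it is running
```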
### 3. Call it
```bash
curl https://<your-endpoint>.endpoints.huggingface.cloud \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "In one sentence, what is Hugging Face?"}
      ]
    },
    "parameters": {
      "max_new_tokens": 128,
      "do_sample": true,
      "temperature": 0.7,
      "top_p": 0.9
    }
  }'
```
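The same request from Python, assuming the endpoint URL placeholder is filled in and `HF_TOKEN` is set in the environment:

```python
# Python equivalent of the curl example above.
import os
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
    json={
        "inputs": {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "In one sentence, what is Hugging Face?"},
            ]
        },
        "parameters": {
            "max_new_tokens": 128,
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.9,
        },
    },
    timeout=120,  # first request after a cold start can be slow
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```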
Response shape:

```json
{ "generated_text": "Hugging Face is ..." }
```
## Request schema
The handler accepts both `{"messages": [...]}` and `{"inputs": {"messages": [...]}}` at the top level.
| Field | Type | Default | Notes |
|---|---|---|---|
| `messages` | `list[{role, content}]` | required | Standard chat format; role is one of system / user / assistant / tool |
| `parameters.max_new_tokens` | `int` | 256 | |
| `parameters.do_sample` | `bool` | false | When false, temperature/top_p are ignored |
| `parameters.temperature` | `float` | 0.7 | Only used when do_sample=true |
| `parameters.top_p` | `float` | 0.9 | Only used when do_sample=true |
## Limitations
- The handler returns only the assistant's continuation as `generated_text`; it does not stream tokens and does not implement the OpenAI chat-completion schema.
- Cold start downloads ~16 GB from the base model repo, which can take 1–3 minutes on first boot.
## License
Usage is governed by the Llama 3.1 Community License. You must accept it on the base model page before this endpoint will be able to download weights.