My-Meta-Llama-3.1-8B-Instruct-Open-Router-23

A thin wrapper around meta-llama/Meta-Llama-3.1-8B-Instruct that is packaged as a Hugging Face Inference Endpoint with a custom handler. The weights are not stored in this repo; the handler downloads them from the gated base repo at startup using an HF_TOKEN secret.

Architecture

  • handler.py: defines EndpointHandler. On __init__ it loads the tokenizer and model from meta-llama/Meta-Llama-3.1-8B-Instruct in bfloat16 with device_map="auto". On every request, __call__ applies the Llama 3.1 chat template and runs model.generate.
  • requirements.txt: pins transformers==4.51.3 (the Hugging Face Inference Toolkit breaks against transformers 5.x because it imports helpers from transformers.file_utils) and adds accelerate + huggingface_hub. Both files are sketched below.
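
A minimal sketch of what handler.py does, assuming the standard custom-handler contract of the Inference Toolkit (an EndpointHandler class constructed with the model directory, whose __call__ receives the parsed JSON payload). Variable names are illustrative; the defaults mirror the request schema documented below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

class EndpointHandler:
    def __init__(self, path: str = ""):
        # Weights come from the gated base repo, not from this repo.
        # from_pretrained picks up the HF_TOKEN secret from the environment.
        self.tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
        self.model = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def __call__(self, data: dict) -> dict:
        # Accept both {"messages": [...]} and {"inputs": {"messages": [...]}}.
        payload = data.get("inputs", data)
        params = data.get("parameters", {})

        # Render the Llama 3.1 chat template and append the assistant header
        # so generation continues as the assistant turn.
        input_ids = self.tokenizer.apply_chat_template(
            payload["messages"], add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)

        output = self.model.generate(
            input_ids,
            max_new_tokens=params.get("max_new_tokens", 256),
            do_sample=params.get("do_sample", False),
            temperature=params.get("temperature", 0.7),
            top_p=params.get("top_p", 0.9),
        )

        # Return only the newly generated tokens, not the echoed prompt.
        completion = self.tokenizer.decode(
            output[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
        return {"generated_text": completion}

requirements.txt is short; only the transformers pin is load-bearing:

transformers==4.51.3
accelerate
huggingface_hub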

Deploying this as an Inference Endpoint

1. Prerequisites

  • A Hugging Face account that has accepted the Llama 3.1 Community License at meta-llama/Meta-Llama-3.1-8B-Instruct.
  • A User Access Token (Settings → Access Tokens) with at least read permission on gated repos; you can verify access with the check below.
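
Before creating the endpoint, it is worth confirming that the token can actually see the gated base repo. A quick check with huggingface_hub (model_info raises an error, GatedRepoError in recent versions, if the license has not been accepted for that token):

from huggingface_hub import model_info

# Fails with GatedRepoError if this token cannot access the gated repo.
model_info("meta-llama/Meta-Llama-3.1-8B-Instruct", token="hf_...")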

2. Create the endpoint

From this repo's page in the Hugging Face UI, choose Deploy → Inference Endpoints, then:

| Setting | Recommended value |
| --- | --- |
| Instance type | GPU with at least 24 GB VRAM (e.g. NVIDIA L4, A10G, or A100); Llama 3.1 8B in bfloat16 is ~16 GB of weights plus KV cache |
| Container | Default (huggingface-inference-toolkit) |
| Task | Custom (the handler is auto-detected via handler.py) |
| Secret | HF_TOKEN = your user access token |
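
The same endpoint can also be created programmatically. A sketch using huggingface_hub's create_inference_endpoint, assuming a version recent enough to support the secrets argument; the endpoint name, vendor/region, and instance identifiers are illustrative and depend on what your account offers:

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama31-8b-custom-handler",       # illustrative name
    repository="ericaRC/test-Llama",   # this repo (where handler.py lives)
    framework="pytorch",
    task="custom",                     # use handler.py instead of a built-in pipeline
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",         # 24 GB VRAM, per the table above
    secrets={"HF_TOKEN": "hf_..."},    # lets the handler download the gated weights
)
endpoint.wait()   # block until the endpoint reports "running"
print(endpoint.url)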

3. Call it

curl https://<your-endpoint>.endpoints.huggingface.cloud \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "In one sentence, what is Hugging Face?"}
      ]
    },
    "parameters": {
      "max_new_tokens": 128,
      "do_sample": true,
      "temperature": 0.7,
      "top_p": 0.9
    }
  }'

Response shape:

{ "generated_text": "Hugging Face is ..." }

Request schema

The handler accepts both {"messages": [...]} and {"inputs": {"messages": [...]}} at the top level.
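
For example, these two payloads are treated identically:

{"messages": [{"role": "user", "content": "Hi"}]}
{"inputs": {"messages": [{"role": "user", "content": "Hi"}]}}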

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| messages | list[{role, content}] | required | Standard chat format; role ∈ system / user / assistant / tool |
| parameters.max_new_tokens | int | 256 | |
| parameters.do_sample | bool | false | When false, temperature/top_p are ignored |
| parameters.temperature | float | 0.7 | Only used when do_sample=true |
| parameters.top_p | float | 0.9 | Only used when do_sample=true |

Limitations

  • The handler returns only the assistant's continuation as generated_text; it does not stream tokens and does not implement the OpenAI chat-completion schema.
  • Cold start downloads ~16 GB from the base model repo, which can take 1–3 minutes on first boot.

License

Usage is governed by the Llama 3.1 Community License. You must accept it on the base model page before this endpoint will be able to download weights.
