My-Meta-Llama-3.1-8B-Instruct-Open-Router-23

A thin wrapper around meta-llama/Meta-Llama-3.1-8B-Instruct that is packaged as a Hugging Face Inference Endpoint with a custom handler. The weights are not stored in this repo; the handler downloads them from the gated base repo at startup using an HF_TOKEN secret.

Architecture

  • handler.py: defines EndpointHandler. On __init__ it loads the tokenizer and model from meta-llama/Meta-Llama-3.1-8B-Instruct in bfloat16 with device_map="auto". On every request, __call__ applies the Llama 3.1 chat template and runs model.generate.
  • requirements.txt: pins transformers==4.51.3 (the Hugging Face Inference Toolkit breaks against transformers 5.x because it imports helpers from transformers.file_utils) and adds accelerate + huggingface_hub. Both files are sketched below.
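
A minimal sketch of what handler.py does, assuming the standard custom-handler contract of the Inference Toolkit (an EndpointHandler class constructed with the model directory, whose __call__ receives the parsed JSON payload). Variable names are illustrative; the defaults mirror the request schema documented below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

class EndpointHandler:
    def __init__(self, path: str = ""):
        # Weights come from the gated base repo, not from this repo.
        # from_pretrained picks up the HF_TOKEN secret from the environment.
        self.tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
        self.model = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
        )

    def __call__(self, data: dict) -> dict:
        # Accept both {"messages": [...]} and {"inputs": {"messages": [...]}}.
        payload = data.get("inputs", data)
        params = data.get("parameters", {})

        # Render the Llama 3.1 chat template and append the assistant header
        # so generation continues as the assistant turn.
        input_ids = self.tokenizer.apply_chat_template(
            payload["messages"], add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)

        output = self.model.generate(
            input_ids,
            max_new_tokens=params.get("max_new_tokens", 256),
            do_sample=params.get("do_sample", False),
            temperature=params.get("temperature", 0.7),
            top_p=params.get("top_p", 0.9),
        )

        # Return only the newly generated tokens, not the echoed prompt.
        completion = self.tokenizer.decode(
            output[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
        return {"generated_text": completion}

requirements.txt is short; only the transformers pin is load-bearing:

transformers==4.51.3
accelerate
huggingface_hub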

Deploying this as an Inference Endpoint

1. Prerequisites

  • A Hugging Face account that has accepted the Llama 3.1 Community License at meta-llama/Meta-Llama-3.1-8B-Instruct.
  • A User Access Token (Settings → Access Tokens) with at least read permission on gated repos; you can verify access with the check below.
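
Before creating the endpoint, it is worth confirming that the token can actually see the gated base repo. A quick check with huggingface_hub (model_info raises an error, GatedRepoError in recent versions, if the license has not been accepted for that token):

from huggingface_hub import model_info

# Fails with GatedRepoError if this token cannot access the gated repo.
model_info("meta-llama/Meta-Llama-3.1-8B-Instruct", token="hf_...")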

2. Create the endpoint

From this repo's page in the Hugging Face UI, choose Deploy → Inference Endpoints, then:

| Setting | Recommended value |
| --- | --- |
| Instance type | GPU with at least 24 GB VRAM (e.g. NVIDIA L4, A10G, or A100); Llama 3.1 8B in bfloat16 is ~16 GB of weights plus KV cache |
| Container | Default (huggingface-inference-toolkit) |
| Task | Custom (the handler is auto-detected via handler.py) |
| Secret | HF_TOKEN = your user access token |
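
The same endpoint can also be created programmatically. A sketch using huggingface_hub's create_inference_endpoint, assuming a version recent enough to support the secrets argument; the endpoint name, vendor/region, and instance identifiers are illustrative and depend on what your account offers:

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama31-8b-custom-handler",       # illustrative name
    repository="ericaRC/test-Llama",   # this repo (where handler.py lives)
    framework="pytorch",
    task="custom",                     # use handler.py instead of a built-in pipeline
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",         # 24 GB VRAM, per the table above
    secrets={"HF_TOKEN": "hf_..."},    # lets the handler download the gated weights
)
endpoint.wait()   # block until the endpoint reports "running"
print(endpoint.url)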

3. Call it

curl https://<your-endpoint>.endpoints.huggingface.cloud \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "In one sentence, what is Hugging Face?"}
      ]
    },
    "parameters": {
      "max_new_tokens": 128,
      "do_sample": true,
      "temperature": 0.7,
      "top_p": 0.9
    }
  }'

Response shape:

{ "generated_text": "Hugging Face is ..." }

Request schema

The handler accepts both {"messages": [...]} and {"inputs": {"messages": [...]}} at the top level.
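
For example, these two payloads are treated identically:

{"messages": [{"role": "user", "content": "Hi"}]}
{"inputs": {"messages": [{"role": "user", "content": "Hi"}]}}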

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| messages | list[{role, content}] | required | Standard chat format; role ∈ system / user / assistant / tool |
| parameters.max_new_tokens | int | 256 | |
| parameters.do_sample | bool | false | When false, temperature/top_p are ignored |
| parameters.temperature | float | 0.7 | Only used when do_sample=true |
| parameters.top_p | float | 0.9 | Only used when do_sample=true |

Limitations

  • The handler returns only the assistant's continuation as generated_text; it does not stream tokens and does not implement the OpenAI chat-completion schema.
  • Cold start downloads ~16 GB from the base model repo, which can take 1–3 minutes on first boot.

License

Usage is governed by the Llama 3.1 Community License. You must accept it on the base model page before this endpoint will be able to download weights.
