allenai/OLMo-7B-0724-Instruct-hf
Text Generation • 7B
What's Different?
- TGI-backed API: Optimized for inference, not just a proof of concept.
- Hugging Face Transformers-Compatible: Works with any TGI-supported LLM.
- Auto-Optimizations: TGI's built-in optimizations, such as continuous batching and token streaming, come along for free.
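Go to Hugging Face Spaces and create a new Space with the Docker SDK.
Set app_port: 8080 in README.md.
Important: Unlike other Spaces, TGI requires app_port: 8080 (or your port of choice) for proper routing. Docker Spaces route traffic to port 7860 by default, so this value must match the port the TGI server listens on.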
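To make the README.md setting concrete, here is a minimal sketch that builds the YAML front matter a Docker Space reads. The helper name space_readme and the title value are hypothetical, and only the sdk and app_port keys matter for routing; treat it as an illustration rather than the exact file used for this Space.

```python
def space_readme(title: str, app_port: int = 8080) -> str:
    """Return README.md text whose YAML front matter configures a Docker Space.

    `title` is a placeholder; `app_port` should match the port TGI listens on.
    """
    lines = [
        "---",
        f"title: {title}",
        "sdk: docker",
        f"app_port: {app_port}",
        "---",
        "",
    ]
    return "\n".join(lines)


# Show the front matter for the default port used in this guide.
print(space_readme("tgi-docker-olmo"))
```

If you pick a different port, pass the same number here and to the --port flag in the Dockerfile so the two stay in sync.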
Dockerfile
TGI provides a pre-built inference server for Hugging Face models.
We just need to set up a Dockerfile that pulls the correct image and configures the model.
Dockerfile
# Use Hugging Face TGI as the base image
FROM ghcr.io/huggingface/text-generation-inference:3.0.2
# Set working directory
WORKDIR /app
# Create and set permissions for cache directories
RUN mkdir -p /data && chmod 777 /data
RUN mkdir -p /.cache && chmod 777 /.cache
RUN mkdir -p /.triton && chmod 777 /.triton
# Expose the model API on port 8080
EXPOSE 8080
# Set Hugging Face token for private models
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# Run the TGI server with OLMo-7B
CMD ["--model-id", "allenai/OLMo-7B-0724-Instruct-hf", "--port", "8080"]
[Placeholder for Screenshot: Space Deploying with TGI]
Once deployed, TGI's OpenAI-compatible chat API is automatically available at:
https://your-space-url.hf.space/v1/chat/completions
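The first boot can take a while because the model weights have to be downloaded, so it helps to wait until the server reports ready before sending requests. The sketch below polls TGI's /health route using only the standard library. The function names, the retry count, and the delay are assumptions chosen for illustration, not values from this deployment.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and non-2xx responses.
        return False


def wait_until_healthy(
    base_url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 10.0,
) -> bool:
    """Poll `<base_url>/health` until the probe succeeds or attempts run out."""
    health_url = base_url.rstrip("/") + "/health"
    for attempt in range(attempts):
        if probe(health_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the server time to finish loading
    return False
```

Calling wait_until_healthy("https://your-space-url.hf.space") would return True once the Space answers, or False after the attempts are used up. The probe argument exists so the loop can be exercised without a network connection.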
curl
curl https://arig23498-tgi-docker-olmo.hf.space/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
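With "stream": true, the response arrives as server-sent events: lines that start with data: followed by a JSON chunk. The sketch below shows one way to pull the text out of such lines. It assumes OpenAI-style chunks (choices[0].delta.content) and tolerates an optional [DONE] sentinel at the end; iter_stream_text is a hypothetical helper, and client libraries such as huggingface_hub do this parsing for you.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style streaming chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, if the server sends one
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # The delta may be missing or carry no content (e.g. the last chunk).
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

Joining the yielded fragments with "".join(...) reconstructs the full reply; printing them one by one gives the familiar token-by-token effect.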
python
from huggingface_hub import InferenceClient

# Point the client at the Space's OpenAI-compatible endpoint
client = InferenceClient(
    base_url="https://arig23498-tgi-docker-olmo.hf.space/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=1024,
)

# Print tokens as they arrive; a chunk may carry no content, so fall back to ""
for chunk in output:
    print(chunk.choices[0].delta.content or "", end="")
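If you would rather get the whole reply in one piece, the same call works without streaming. This is a minimal sketch under that assumption: ask is a hypothetical helper, the max_tokens default is arbitrary, and the client argument is expected to be an InferenceClient (or anything exposing the same chat.completions.create method).

```python
def ask(client, question: str, max_tokens: int = 256) -> str:
    """Send one question and return the complete reply as a string."""
    response = client.chat.completions.create(
        model="tgi",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        stream=False,  # wait for the full completion instead of chunks
        max_tokens=max_tokens,
    )
    # Non-streaming responses carry the text in message.content
    return response.choices[0].message.content or ""
```

With the client created as in the streaming example, ask(client, "What is deep learning?") would return the answer as a single string.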
More TGI Features: TGI v3.0.2 Release Notes
Try it out & scale your own LLM API today!