---
title: HF Inference API
emoji: πŸ€—
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---

# Hugging Face Inference API

REST API and Gradio interface for Hugging Face model inference.

## Features

- Two inference modes: HF Inference API (lightweight) or local model loading
- REST API: FastAPI with automatic OpenAPI documentation
- Gradio UI: web interface for interactive testing
- HF Spaces ready: deploy directly to Hugging Face Spaces

## Quick Start

### 1. Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# For local model inference (optional)
pip install transformers torch

# Copy and configure environment
cp .env.example .env
```

### 2. Configure

Edit `.env` with your settings:

```bash
# Use HF Inference API (recommended)
HF_USE_API=true
HF_API_TOKEN=hf_xxxxxxxxxxxxx

# Or load models locally
HF_USE_API=false
```

### 3. Run

```bash
# Option A: REST API (FastAPI)
python -m app.main

# Option B: Gradio interface
python app.py
```

## Running Options

### REST API (FastAPI)

```bash
python -m app.main
```

### Gradio Interface

```bash
python app.py
```

### Docker

```bash
# Build
docker build -t hf-inference-api .

# Run with HF API
docker run -p 8000:8000 \
  -e HF_USE_API=true \
  -e HF_API_TOKEN=hf_xxxxx \
  -e HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english \
  hf-inference-api

# Run with local model
docker run -p 8000:8000 \
  -e HF_USE_API=false \
  -e HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english \
  hf-inference-api
```

## Hugging Face Spaces

1. Create a new Space at https://huggingface.co/new-space
2. Select Gradio as the SDK
3. Push these files:
   - `app.py`
   - `requirements.txt`
   - the `app/` folder
4. Add `HF_API_TOKEN` under Space Settings > Secrets

## API Endpoints

### Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "ok",
  "model_loaded": true,
  "model_name": "distilbert-base-uncased-finetuned-sst-2-english"
}
```

### Inference

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "I love this product!"}'
```

Response:

```json
{
  "predictions": [[{"label": "POSITIVE", "score": 0.9998}]],
  "model_name": "distilbert-base-uncased-finetuned-sst-2-english"
}
```
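The `/predict` endpoint can also be called from Python. A minimal client sketch using only the standard library; the URL and payload shape follow the curl examples above, and the helper names (`build_payload`, `predict`) are illustrative, not part of the project's code:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/predict"

def build_payload(inputs, parameters=None):
    """Assemble the JSON body the /predict endpoint expects."""
    body = {"inputs": inputs}
    if parameters:
        body["parameters"] = parameters
    return body

def predict(inputs, parameters=None):
    """POST the payload to the running server and return the decoded JSON response."""
    data = json.dumps(build_payload(inputs, parameters)).encode("utf-8")
    req = request.Request(API_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

`inputs` may be a single string or a list of strings, matching the batch example below.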

### Batch Inference

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["I love this!", "This is terrible."]}'
```

### With Parameters

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "The capital of France is",
    "parameters": {"max_new_tokens": 50}
  }'
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_USE_API` | `true` | Use the HF Inference API (`true`) or a local model (`false`) |
| `HF_API_TOKEN` | None | HF API token (required if `HF_USE_API=true`) |
| `HF_MODEL_NAME` | `cardiffnlp/twitter-roberta-base-sentiment-latest` | Hugging Face model ID |
| `HF_TASK` | `text-classification` | Pipeline task type |
| `HF_HOST` | `0.0.0.0` | Server host |
| `HF_PORT` | `8000` | Server port |
| `HF_DEVICE` | `cpu` | Device for local inference (`cpu`, `cuda`, `cuda:0`) |
| `HF_MAX_BATCH_SIZE` | `32` | Maximum batch size for local inference |
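`HF_MAX_BATCH_SIZE` caps the batch size in local mode; a client sending many inputs can pre-chunk them the same way. A sketch (the `batch_inputs` helper is hypothetical, not part of the project's API):

```python
import os

def batch_inputs(inputs, max_batch_size=None):
    """Split an input list into chunks of at most HF_MAX_BATCH_SIZE items."""
    if max_batch_size is None:
        max_batch_size = int(os.environ.get("HF_MAX_BATCH_SIZE", "32"))
    return [inputs[i:i + max_batch_size]
            for i in range(0, len(inputs), max_batch_size)]
```

Each chunk can then be sent as one `/predict` request with a list-valued `inputs`.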

## Inference Modes

### HF Inference API (Recommended)

```bash
HF_USE_API=true
HF_API_TOKEN=hf_xxxxxxxxxxxxx
```

Pros:

- No model download required
- Lightweight (no torch/transformers)
- Fast startup
- Free tier available

Cons:

- Requires an internet connection
- Rate limits on the free tier
- API token required
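In API mode, requests are forwarded to Hugging Face's hosted endpoint at `https://api-inference.huggingface.co/models/<model>`, authenticated with a Bearer token. A sketch of building such a request with the standard library (which HTTP client this project uses internally is an assumption):

```python
import json
from urllib import request

def hf_api_request(model_name, inputs, token):
    """Build a request against the hosted HF Inference API for one model."""
    url = f"https://api-inference.huggingface.co/models/{model_name}"
    data = json.dumps({"inputs": inputs}).encode("utf-8")
    return request.Request(url, data=data, headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })

# Sending it is a plain urlopen call:
# with request.urlopen(hf_api_request("gpt2", "Hello", "hf_xxx")) as resp:
#     result = json.loads(resp.read())
```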

### Local Model

```bash
HF_USE_API=false
```

Requires additional dependencies:

```bash
pip install transformers torch
```

Pros:

- No internet required after download
- No rate limits
- Full control

Cons:

- Large dependencies (~2GB for torch)
- Model download on first run
- More RAM/CPU required
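For local mode, `transformers.pipeline()` accepts the device as an integer (`-1` for CPU, the GPU index otherwise), so the `HF_DEVICE` string needs translating. A sketch of how that mapping might look (the helper name is hypothetical; the actual logic in `app/config.py` may differ):

```python
def resolve_device(device_str: str) -> int:
    """Map an HF_DEVICE value (cpu, cuda, cuda:N) to the integer index
    accepted by transformers' pipeline(device=...): -1 for CPU, N for GPU N."""
    if device_str == "cpu":
        return -1
    if device_str == "cuda":
        return 0  # default to the first GPU
    if device_str.startswith("cuda:"):
        return int(device_str.split(":", 1)[1])
    raise ValueError(f"Unsupported HF_DEVICE value: {device_str!r}")
```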

## Supported Tasks

| Task | Description | Example Model |
|---|---|---|
| `text-classification` | Classify text into categories | `distilbert-base-uncased-finetuned-sst-2-english` |
| `sentiment-analysis` | Analyze sentiment (alias for `text-classification`) | `nlptown/bert-base-multilingual-uncased-sentiment` |
| `text-generation` | Generate text from a prompt | `gpt2`, `mistralai/Mistral-7B-v0.1` |
| `summarization` | Summarize long text | `facebook/bart-large-cnn` |
| `translation` | Translate text | `Helsinki-NLP/opus-mt-en-fr` |
| `fill-mask` | Fill in masked tokens | `bert-base-uncased` |
| `question-answering` | Answer questions given context | `deepset/roberta-base-squad2` |
| `feature-extraction` | Extract embeddings | `sentence-transformers/all-MiniLM-L6-v2` |

## Project Structure

```
hf-inference-api/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py        # Settings (pydantic-settings)
β”‚   β”œβ”€β”€ inference.py     # Inference engine (API + local)
β”‚   β”œβ”€β”€ main.py          # FastAPI application
β”‚   └── models.py        # Pydantic models
β”œβ”€β”€ app.py               # Gradio interface
β”œβ”€β”€ .env.example         # Environment template
β”œβ”€β”€ .gitignore
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ README.md
└── requirements.txt
```

## Examples

Each example sets the model and task in the server's environment (`.env`), then queries the running server. The `Content-Type` header is required; without it, curl sends form-encoded data and FastAPI rejects the request.

### Text Classification

```bash
# Server configuration (.env)
HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english
HF_TASK=text-classification

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "I love this movie!"}'
```

### Text Generation

```bash
# Server configuration (.env)
HF_MODEL_NAME=gpt2
HF_TASK=text-generation

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Once upon a time", "parameters": {"max_new_tokens": 50}}'
```

### Summarization

```bash
# Server configuration (.env)
HF_MODEL_NAME=facebook/bart-large-cnn
HF_TASK=summarization

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Long article text here..."}'
```

### Translation (EN -> FR)

```bash
# Server configuration (.env)
HF_MODEL_NAME=Helsinki-NLP/opus-mt-en-fr
HF_TASK=translation

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello, how are you?"}'
```