---
title: HF Inference API
emoji: 🤗
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---

# Hugging Face Inference API

REST API and Gradio interface for Hugging Face model inference.

## Features

- **Two inference modes**: HF Inference API (lightweight) or local model loading
- **REST API**: FastAPI with automatic OpenAPI documentation
- **Gradio UI**: Web interface for interactive testing
- **HF Spaces ready**: Deploy directly to Hugging Face Spaces

## Quick Start

### 1. Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# For local model inference (optional)
pip install transformers torch

# Copy and configure environment
cp .env.example .env
```

### 2. Configure

Edit `.env` with your settings:

```bash
# Use HF Inference API (recommended)
HF_USE_API=true
HF_API_TOKEN=hf_xxxxxxxxxxxxx

# Or load models locally
HF_USE_API=false
```

### 3. Run

```bash
# Option A: REST API (FastAPI)
python -m app.main

# Option B: Gradio interface
python app.py
```

## Running Options

### REST API (FastAPI)

```bash
python -m app.main
```

- URL: http://localhost:8000
- Swagger: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

### Gradio Interface

```bash
python app.py
```

- URL: http://localhost:7860

### Docker

```bash
# Build
docker build -t hf-inference-api .

# Run with HF API
docker run -p 8000:8000 \
  -e HF_USE_API=true \
  -e HF_API_TOKEN=hf_xxxxx \
  -e HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english \
  hf-inference-api

# Run with local model
docker run -p 8000:8000 \
  -e HF_USE_API=false \
  -e HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english \
  hf-inference-api
```

### Hugging Face Spaces

1. Create a new Space at https://huggingface.co/new-space
2. Select **Gradio** as the SDK
3. Push these files:
   - `app.py`
   - `requirements.txt`
   - the `app/` folder
4. Add `HF_API_TOKEN` in Space Settings > Secrets

## API Endpoints

### Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "ok",
  "model_loaded": true,
  "model_name": "distilbert-base-uncased-finetuned-sst-2-english"
}
```

### Inference

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "I love this product!"}'
```

Response:

```json
{
  "predictions": [[{"label": "POSITIVE", "score": 0.9998}]],
  "model_name": "distilbert-base-uncased-finetuned-sst-2-english"
}
```

### Batch Inference

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["I love this!", "This is terrible."]}'
```

### With Parameters

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "The capital of France is",
    "parameters": {"max_new_tokens": 50}
  }'
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HF_USE_API` | `true` | Use HF Inference API (`true`) or local model (`false`) |
| `HF_API_TOKEN` | `None` | HF API token (required if `HF_USE_API=true`) |
| `HF_MODEL_NAME` | `cardiffnlp/twitter-roberta-base-sentiment-latest` | Hugging Face model ID |
| `HF_TASK` | `text-classification` | Pipeline task type |
| `HF_HOST` | `0.0.0.0` | Server host |
| `HF_PORT` | `8000` | Server port |
| `HF_DEVICE` | `cpu` | Device for local inference (`cpu`, `cuda`, `cuda:0`) |
| `HF_MAX_BATCH_SIZE` | `32` | Maximum batch size for local inference |

### Inference Modes

#### HF Inference API (Recommended)

```bash
HF_USE_API=true
HF_API_TOKEN=hf_xxxxxxxxxxxxx
```

Pros:

- No model download required
- Lightweight (no torch/transformers dependency)
- Fast startup
- Free tier available

Cons:

- Requires internet connection
- Rate limits on free tier
- API token required

#### Local Model

```bash
HF_USE_API=false
```

Requires additional dependencies:

```bash
pip install transformers torch
```

Pros:

- No internet required after download
- No rate limits
- Full control

Cons:

- Large dependencies (~2 GB for torch)
- Model download on first run
- More RAM/CPU required

## Supported Tasks

| Task | Description | Example Model |
|------|-------------|---------------|
| `text-classification` | Classify text into categories | `distilbert-base-uncased-finetuned-sst-2-english` |
| `sentiment-analysis` | Analyze sentiment (alias for text-classification) | `nlptown/bert-base-multilingual-uncased-sentiment` |
| `text-generation` | Generate text from a prompt | `gpt2`, `mistralai/Mistral-7B-v0.1` |
| `summarization` | Summarize long text | `facebook/bart-large-cnn` |
| `translation` | Translate text | `Helsinki-NLP/opus-mt-en-fr` |
| `fill-mask` | Fill in masked tokens | `bert-base-uncased` |
| `question-answering` | Answer questions given context | `deepset/roberta-base-squad2` |
| `feature-extraction` | Extract embeddings | `sentence-transformers/all-MiniLM-L6-v2` |

## Project Structure

```
hf-inference-api/
├── app/
│   ├── __init__.py
│   ├── config.py        # Settings (pydantic-settings)
│   ├── inference.py     # Inference engine (API + local)
│   ├── main.py          # FastAPI application
│   └── models.py        # Pydantic models
├── app.py               # Gradio interface
├── .env.example         # Environment template
├── .gitignore
├── Dockerfile
├── README.md
└── requirements.txt
```

## Examples

### Text Classification

```bash
HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english
HF_TASK=text-classification
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "I love this movie!"}'
```

### Text Generation

```bash
HF_MODEL_NAME=gpt2
HF_TASK=text-generation
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Once upon a time", "parameters": {"max_new_tokens": 50}}'
```

### Summarization

```bash
HF_MODEL_NAME=facebook/bart-large-cnn
HF_TASK=summarization
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Long article text here..."}'
```

### Translation (EN -> FR)

```bash
HF_MODEL_NAME=Helsinki-NLP/opus-mt-en-fr
HF_TASK=translation
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello, how are you?"}'
```
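
### Python Client

The curl examples above can also be driven from Python. A minimal client sketch using only the standard library — the `/predict` path and the `inputs`/`parameters` payload schema come from the API Endpoints section; the helper names here are illustrative, not part of the project:

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # adjust to your deployment


def build_predict_request(inputs, parameters=None, base_url=API_URL):
    """Build (but do not send) a POST request for the /predict endpoint.

    `inputs` may be a single string or a list of strings, matching the
    batch-inference example above.
    """
    payload = {"inputs": inputs}
    if parameters is not None:
        payload["parameters"] = parameters
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(inputs, parameters=None):
    """Send the request and decode the JSON response body."""
    req = build_predict_request(inputs, parameters)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the server running (`python -m app.main`), `predict("I love this product!")` should return a JSON body shaped like the Inference response shown earlier.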
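For classification tasks, the response nests one list of `{"label", "score"}` candidates per input (see the `predictions` field in the Inference response above). A small helper to pull out the top label per input — this assumes that response shape and is not part of the project's code:

```python
def top_labels(predictions):
    """Return (label, score) of the highest-scoring candidate per input.

    Assumes the /predict response shape shown above: `predictions` is a
    list with one candidate list per input, each candidate being a dict
    with "label" and "score" keys.
    """
    return [
        max(candidates, key=lambda c: c["score"])
        for candidates in predictions
    ]
```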
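The environment-variable table above can be read as a simple defaults scheme. The sketch below is illustrative only — the real app loads these via pydantic-settings in `app/config.py`, and the key names in the returned dict are invented here — but the variable names and defaults mirror the table:

```python
import os


def load_settings(env=os.environ):
    """Read HF_* settings with the documented defaults (illustrative)."""
    return {
        "use_api": env.get("HF_USE_API", "true").lower() == "true",
        "api_token": env.get("HF_API_TOKEN"),  # None unless set
        "model_name": env.get(
            "HF_MODEL_NAME", "cardiffnlp/twitter-roberta-base-sentiment-latest"
        ),
        "task": env.get("HF_TASK", "text-classification"),
        "host": env.get("HF_HOST", "0.0.0.0"),
        "port": int(env.get("HF_PORT", "8000")),
        "device": env.get("HF_DEVICE", "cpu"),
        "max_batch_size": int(env.get("HF_MAX_BATCH_SIZE", "32")),
    }
```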