---
title: HF Inference API
emoji: 🤗
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
---

# Hugging Face Inference API

REST API and Gradio interface for Hugging Face model inference.

## Features

- **Two inference modes**: HF Inference API (lightweight) or local model loading
- **REST API**: FastAPI with automatic OpenAPI documentation
- **Gradio UI**: Web interface for interactive testing
- **HF Spaces ready**: Deploy directly to Hugging Face Spaces

## Quick Start

### 1. Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# For local model inference (optional)
pip install transformers torch

# Copy and configure environment
cp .env.example .env
```

### 2. Configure

Edit `.env` with your settings:

```bash
# Use HF Inference API (recommended)
HF_USE_API=true
HF_API_TOKEN=hf_xxxxxxxxxxxxx

# Or load models locally
HF_USE_API=false
```

### 3. Run

```bash
# Option A: REST API (FastAPI)
python -m app.main

# Option B: Gradio interface
python app.py
```

## Running Options

### REST API (FastAPI)

```bash
python -m app.main
```

- URL: http://localhost:8000
- Swagger: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

### Gradio Interface

```bash
python app.py
```

- URL: http://localhost:7860

### Docker

```bash
# Build
docker build -t hf-inference-api .

# Run with HF API
docker run -p 8000:8000 \
  -e HF_USE_API=true \
  -e HF_API_TOKEN=hf_xxxxx \
  -e HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english \
  hf-inference-api

# Run with local model
docker run -p 8000:8000 \
  -e HF_USE_API=false \
  -e HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english \
  hf-inference-api
```

### Hugging Face Spaces

1. Create a new Space at https://huggingface.co/new-space
2. Select **Gradio** as the SDK
3. Push these files:
   - `app.py`
   - `requirements.txt`
   - the `app/` folder
4. Add `HF_API_TOKEN` in Space Settings > Secrets

## API Endpoints

### Health Check

```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "ok",
  "model_loaded": true,
  "model_name": "distilbert-base-uncased-finetuned-sst-2-english"
}
```

### Inference

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "I love this product!"}'
```

Response:

```json
{
  "predictions": [[{"label": "POSITIVE", "score": 0.9998}]],
  "model_name": "distilbert-base-uncased-finetuned-sst-2-english"
}
```

### Batch Inference

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["I love this!", "This is terrible."]}'
```

### With Parameters

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "The capital of France is",
    "parameters": {"max_new_tokens": 50}
  }'
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HF_USE_API` | `true` | Use HF Inference API (`true`) or local model (`false`) |
| `HF_API_TOKEN` | `None` | HF API token (required if `HF_USE_API=true`) |
| `HF_MODEL_NAME` | `cardiffnlp/twitter-roberta-base-sentiment-latest` | Hugging Face model ID |
| `HF_TASK` | `text-classification` | Pipeline task type |
| `HF_HOST` | `0.0.0.0` | Server host |
| `HF_PORT` | `8000` | Server port |
| `HF_DEVICE` | `cpu` | Device for local inference (`cpu`, `cuda`, `cuda:0`) |
| `HF_MAX_BATCH_SIZE` | `32` | Maximum batch size for local inference |

### Inference Modes

#### HF Inference API (Recommended)

```bash
HF_USE_API=true
HF_API_TOKEN=hf_xxxxxxxxxxxxx
```

Pros:

- No model download required
- Lightweight (no torch/transformers dependency)
- Fast startup
- Free tier available

Cons:

- Requires internet connection
- Rate limits on free tier
- API token required

#### Local Model

```bash
HF_USE_API=false
```

Requires additional dependencies:

```bash
pip install transformers torch
```

Pros:

- No internet required after download
- No rate limits
- Full control

Cons:

- Large dependencies (~2 GB for torch)
- Model download on first run
- More RAM/CPU required

## Supported Tasks

| Task | Description | Example Model |
|------|-------------|---------------|
| `text-classification` | Classify text into categories | `distilbert-base-uncased-finetuned-sst-2-english` |
| `sentiment-analysis` | Analyze sentiment (alias for text-classification) | `nlptown/bert-base-multilingual-uncased-sentiment` |
| `text-generation` | Generate text from a prompt | `gpt2`, `mistralai/Mistral-7B-v0.1` |
| `summarization` | Summarize long text | `facebook/bart-large-cnn` |
| `translation` | Translate text | `Helsinki-NLP/opus-mt-en-fr` |
| `fill-mask` | Fill in masked tokens | `bert-base-uncased` |
| `question-answering` | Answer questions given context | `deepset/roberta-base-squad2` |
| `feature-extraction` | Extract embeddings | `sentence-transformers/all-MiniLM-L6-v2` |

## Project Structure

```
hf-inference-api/
├── app/
│   ├── __init__.py
│   ├── config.py        # Settings (pydantic-settings)
│   ├── inference.py     # Inference engine (API + local)
│   ├── main.py          # FastAPI application
│   └── models.py        # Pydantic models
├── app.py               # Gradio interface
├── .env.example         # Environment template
├── .gitignore
├── Dockerfile
├── README.md
└── requirements.txt
```

## Examples

### Text Classification

```bash
HF_MODEL_NAME=distilbert-base-uncased-finetuned-sst-2-english
HF_TASK=text-classification
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "I love this movie!"}'
```

### Text Generation

```bash
HF_MODEL_NAME=gpt2
HF_TASK=text-generation
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Once upon a time", "parameters": {"max_new_tokens": 50}}'
```

### Summarization

```bash
HF_MODEL_NAME=facebook/bart-large-cnn
HF_TASK=summarization
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Long article text here..."}'
```

### Translation (EN -> FR)

```bash
HF_MODEL_NAME=Helsinki-NLP/opus-mt-en-fr
HF_TASK=translation
```

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello, how are you?"}'
```
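
### Python Client

The curl examples above can also be driven from Python. A minimal client sketch using only the standard library — the `/predict` path and the `inputs`/`parameters` payload schema come from the API Endpoints section; the helper names here are illustrative, not part of the project:

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # adjust to your deployment


def build_predict_request(inputs, parameters=None, base_url=API_URL):
    """Build (but do not send) a POST request for the /predict endpoint.

    `inputs` may be a single string or a list of strings, matching the
    batch-inference example above.
    """
    payload = {"inputs": inputs}
    if parameters is not None:
        payload["parameters"] = parameters
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(inputs, parameters=None):
    """Send the request and decode the JSON response body."""
    req = build_predict_request(inputs, parameters)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the server running (`python -m app.main`), `predict("I love this product!")` should return a JSON body shaped like the Inference response shown earlier.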
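For classification tasks, the response nests one list of `{"label", "score"}` candidates per input (see the `predictions` field in the Inference response above). A small helper to pull out the top label per input — this assumes that response shape and is not part of the project's code:

```python
def top_labels(predictions):
    """Return (label, score) of the highest-scoring candidate per input.

    Assumes the /predict response shape shown above: `predictions` is a
    list with one candidate list per input, each candidate being a dict
    with "label" and "score" keys.
    """
    return [
        max(candidates, key=lambda c: c["score"])
        for candidates in predictions
    ]
```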
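The environment-variable table above can be read as a simple defaults scheme. The sketch below is illustrative only — the real app loads these via pydantic-settings in `app/config.py`, and the key names in the returned dict are invented here — but the variable names and defaults mirror the table:

```python
import os


def load_settings(env=os.environ):
    """Read HF_* settings with the documented defaults (illustrative)."""
    return {
        "use_api": env.get("HF_USE_API", "true").lower() == "true",
        "api_token": env.get("HF_API_TOKEN"),  # None unless set
        "model_name": env.get(
            "HF_MODEL_NAME", "cardiffnlp/twitter-roberta-base-sentiment-latest"
        ),
        "task": env.get("HF_TASK", "text-classification"),
        "host": env.get("HF_HOST", "0.0.0.0"),
        "port": int(env.get("HF_PORT", "8000")),
        "device": env.get("HF_DEVICE", "cpu"),
        "max_batch_size": int(env.get("HF_MAX_BATCH_SIZE", "32")),
    }
```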